Goodreads DB setup
Using LOAD, CSV files, and subqueries to create a larger database
Goodreads Bookdb, slightly larger dataset
First, We need to allow local infile loading, which is double disabled on MySQL 8.x
su[enter]and type in the LINUX ROOT password, as we need to edit a file- For AWS cloud just do
sudo suor putsudoin front of the vim command vim /etc/my.cnfifor INSERT mode- under the [mysqld] section put this line local_infile=1
-
go to the VERY BOTTOM of the file and type the next two lines
[mysql] local_infile=1
-
[esc]then:wqand[enter] systemctl restart mysqldand make sure it successfully restarts!
Now to get the files you need
- In the files section on canvas, download the GoodReads_Bookdb20200304.zip file
- unzip the GoodReads.... file on your local machine where you can find it
- Open the folder, should see a file called "goodreadscleaned.csv"
- use Filezilla (put image here) to upload "goodreadscleaned.csv" to your ec2-user directory
- inside of your terminal window (putty, ssh etc) where you are logged into ec2-user
lsshould show that file-
mysql -u username -puse theforce; create table goodreadstemp ( goodreadsid int unsigned, title varchar(250), firstname varchar(30), lastname varchar(30), isbn varchar(12), isbn13 varchar(15), publisher varchar(70), pages smallint unsigned, pubyear smallint unsigned, authorder tinyint unsigned ) engine = InnoDB Default Charset utf8mb4 collate=utf8mb4_unicode_ci; LOAD DATA LOCAL INFILE '/home/ec2-user/goodreadscleaned.csv' INTO TABLE theforce.goodreadstemp FIELDS TERMINATED by ',' OPTIONALLY ENCLOSED BY '"'; /* Additional Options for LOAD DATA */ # To ignore a certain number of lines, usually the first one that # has the column labels /* Students, DO NOT DO this version, do the one above */ LOAD DATA LOCAL INFILE '/path/to/file.csv' INTO TABLE theforce.goodreadstemp FIELDS TERMINATED by ',' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES;
ERD or Schema for our grbooks database
- There are still many areas where we could add fields and improve the design
Now for the permanent tables
create table grbooks ( book_id int unsigned PRIMARY KEY auto_increment, goodreads_id INT unsigned, UNIQUE KEY (goodreads_id), title varchar(250), isbn varchar(12), UNIQUE KEY (isbn), UNIQUE KEY (isbn13), isbn13 varchar(15), publisher_id INT unsigned, pages smallint unsigned, pubyear smallint unsigned ) engine = InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci; create table grauthors( author_id int unsigned PRIMARY KEY auto_increment, firstname varchar(30), lastname varchar(30), UNIQUE KEY (lastname,firstname) ) engine = InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci; create table grpublishers( publisher_id INT unsigned PRIMARY KEY auto_increment, pubname varchar(70), UNIQUE KEY (pubname) ) engine = InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci; create table grbooks_authors( book_id INT unsigned, author_id INT unsigned, UNIQUE KEY (book_id,author_id), authorder tinyint unsigned ) engine = InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci; ALTER TABLE grbooks add foreign key (publisher_id) references grpublishers (publisher_id); ALTER TABLE grbooks_authors add foreign key (book_id) references grbooks (book_id); ALTER TABLE grbooks_authors add foreign key (author_id) references grauthors (author_id);
Now to do the inserts from the temp table to the permanent ones
-
NOTE: the "UNIQUE KEY" we have on the tables allow
INSERT IGNOREto WORK, otherwise we would have to
use a complicated subquery to prevent duplicationINSERT IGNORE INTO grbooks (goodreads_id, title,isbn,isbn13,pages,pubyear) Select DISTINCT gt.goodreadsid,gt.title,gt.isbn,gt.isbn13,gt.pages,gt.pubyear From goodreadstemp gt; INSERT IGNORE INTO grauthors (firstname, lastname) Select DISTINCT gt.firstname,gt.lastname From goodreadstemp gt; INSERT IGNORE INTO grbooks_authors (author_id,book_id,authorder) select DISTINCT ga.author_id, gb.book_id,gt.authorder from grauthors ga, grbooks gb, goodreadstemp gt where gt.firstname = ga.firstname and gt.lastname = ga.lastname and gt.title = gb.title; INSERT IGNORE INTO grpublishers (pubname) select DISTINCT gt.publisher from goodreadstemp gt;
-
IF we had put the
INSERTfor grpublishers BEFORE grbooks, we could rewrite theINSERTfor grbooks so we do not have to do that bizarre update laterINSERT IGNORE INTO grbooks (goodreads_id, title,isbn,isbn13,pages,pubyear,publisher_id) Select DISTINCT gt.goodreadsid,gt.title,gt.isbn,gt.isbn13,gt.pages,gt.pubyear,gp.publisher_id From goodreadstemp gt,grpublishers gp WHERE gt.publisher = gp.pubname;
If we did not have the UNIQUE KEY field in each table
INSERT into grauthors (lastname,firstname) Select DISTINCT gt.lastname,gt.firstname From goodreadstemp gt WHERE (gt.lastname,gt.firstname) NOT IN ( Select DISTINCT gt.lastname, gt.firstname from goodreadstemp gt,grauthors ga where gt.lastname = ga.lastname AND gt.firstname = ga.firstname) order by gt.lastname; INSERT into grbooks (goodreads_id,title,isbn,isbn13,pages,pubyear) SELECT DISTINCT gt.goodreadsid, gt.title, gt.isbn,gt.isbn13,gt.pages, gt.pubyear FROM goodreadstemp gt Where (gt.title) NOT IN (select DISTINCT gt.title from goodreadstemp gt, grbooks gb where gt.title = gb.title) order by gt.title; INSERT INTO grbooks_authors (book_id,author_id,authorder) select DISTINCT gb.book_id, ga.author_id,gt.authorder from grauthors ga, grbooks gb, goodreadstemp gt where gt.firstname = ga.firstname and gt.lastname = ga.lastname and gt.title = gb.title and (gb.book_id,ga.author_id) NOT IN (select book_id,author_id from grbooks_authors);


