Goodreads DB setup

Using LOAD, CSV files, and subqueries to create a larger database

Goodreads Bookdb, slightly larger dataset

First, We need to allow local infile loading, which is double disabled on MySQL 8.x

Now to get the files you need

  1. In the files section on canvas, download the GoodReads_Bookdb20200304.zip file
  2. unzip the GoodReads.... file on your local machine where you can find it
  3. Open the folder, should see a file called "goodreadscleaned.csv"
  4. use Filezilla (put image here) to upload "goodreadscleaned.csv" to your ec2-user directory
  5. inside of your terminal window (putty, ssh etc) where you are logged into ec2-user
  6. ls should show that file
  7. mysql -u username -p

    use theforce;
    
    create table goodreadstemp (
    goodreadsid int unsigned,
    title varchar(250),
    firstname varchar(30),
    lastname varchar(30),
    isbn varchar(12),
    isbn13 varchar(15),
    publisher varchar(70),
    pages smallint unsigned,
    pubyear smallint unsigned,
    authorder tinyint unsigned
    ) engine = InnoDB Default Charset utf8mb4 collate=utf8mb4_unicode_ci;
    
    LOAD DATA LOCAL INFILE '/home/ec2-user/goodreadscleaned.csv' 
    INTO TABLE theforce.goodreadstemp
    FIELDS TERMINATED by ','
    OPTIONALLY ENCLOSED BY '"';
    
    /* Additional Options for LOAD DATA */
    
    # To ignore a certain number of lines, usually the first one that
    # has the column labels
    
    /* Students, DO NOT DO this version, do the one above */
    LOAD DATA LOCAL INFILE '/path/to/file.csv' 
    INTO TABLE theforce.goodreadstemp
    FIELDS TERMINATED by ',' 
    OPTIONALLY ENCLOSED BY '"'
    IGNORE 1 LINES;
    

ERD or Schema for our grbooks database

Now for the permanent tables

    create table grbooks (
    book_id int unsigned PRIMARY KEY auto_increment,
    goodreads_id INT unsigned,
    UNIQUE KEY (goodreads_id),
    title varchar(250),
    isbn varchar(12),
    UNIQUE KEY (isbn),
    UNIQUE KEY (isbn13),
    isbn13 varchar(15),
    publisher_id  INT unsigned,
    pages smallint unsigned,
    pubyear smallint unsigned
    ) engine = InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

    create table grauthors(
    author_id int unsigned PRIMARY KEY auto_increment,
    firstname varchar(30),
    lastname varchar(30),
    UNIQUE KEY (lastname,firstname)
    ) engine = InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

    create table grpublishers(
    publisher_id INT unsigned PRIMARY KEY auto_increment,
    pubname varchar(70),
    UNIQUE KEY (pubname)
    ) engine = InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

    create table grbooks_authors(
    book_id INT unsigned,
    author_id INT unsigned,
    UNIQUE KEY (book_id,author_id),
    authorder tinyint unsigned
    ) engine = InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

    ALTER TABLE grbooks add foreign key (publisher_id) references grpublishers (publisher_id);
    ALTER TABLE grbooks_authors add foreign key (book_id) references grbooks (book_id);
    ALTER TABLE grbooks_authors add foreign key (author_id) references grauthors (author_id);

Now to do the inserts from the temp table to the permanent ones

If we did not have the UNIQUE KEY field in each table

    INSERT into grauthors (lastname,firstname)
    Select DISTINCT gt.lastname,gt.firstname 
    From goodreadstemp gt
    WHERE (gt.lastname,gt.firstname) NOT IN (
    Select DISTINCT gt.lastname, gt.firstname 
    from goodreadstemp gt,grauthors ga
    where gt.lastname = ga.lastname 
    AND gt.firstname = ga.firstname)
    order by gt.lastname;

    INSERT into grbooks (goodreads_id,title,isbn,isbn13,pages,pubyear)
    SELECT DISTINCT gt.goodreadsid, gt.title, gt.isbn,gt.isbn13,gt.pages, gt.pubyear
    FROM goodreadstemp gt
    Where (gt.title) NOT IN
    (select DISTINCT gt.title 
    from goodreadstemp gt, grbooks gb
    where gt.title = gb.title)
    order by gt.title;

    INSERT INTO grbooks_authors (book_id,author_id,authorder)
    select DISTINCT gb.book_id, ga.author_id,gt.authorder
    from grauthors ga, grbooks gb, goodreadstemp gt   
    where gt.firstname = ga.firstname
    and   gt.lastname = ga.lastname
    and  gt.title = gb.title
    and (gb.book_id,ga.author_id) NOT IN 
    (select book_id,author_id
    from grbooks_authors);

Resources

Page last updated on 20231010

HTML 5 Validation CSS 3 Validation Web Accessibility EValuator