goodbooks-10k EXPANSION PACK

This is an extended version of the Goodbooks 10k dataset, originally scraped from the Goodreads API in September 2017 by Zygmunt Zając. Additional fields are included in the books_enriched.csv file. The biggest advantage of this new version is that it adds a text description field for 9943 of the 10 000 books. Please consult the original repository for more information on the original files. As a reminder, the dataset contains six million numerical ratings of the platform’s ten thousand most popular books, collected from 53 424 different users.

A detailed analysis as well as modelling strategies are presented here.

Additional fields have been added to the original books.csv file via two strategies:

Pulling attributes from cross-referenced titles in the Best Books Ever Dataset, collected from the Goodreads website in Fall 2020 by Lorena Casanova Lozano and Sergio Costa Planells. From what I gather, they used the Selenium package to parse book webpages.
For the 1833 books missing from the above dataset, extended fields were scraped with the Goodreads API in October 2021. Although this API has officially been retired, I was able to find a developer key online that still worked.
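The two strategies above can be pictured as a left join followed by a fallback pass over the unmatched rows. The following is a minimal sketch on toy data, not the actual pipeline used to build this dataset; the column names (`title`, `description`) and the choice of `title` as join key are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the original books.csv and the Best Books Ever dataset.
# Column names and the use of `title` as the join key are assumptions.
books = pd.DataFrame({
    "book_id": [1, 2, 3],
    "title": ["Dune", "Emma", "Ulysses"],
})
best_books = pd.DataFrame({
    "title": ["Dune", "Emma"],
    "description": ["Placeholder text A.", "Placeholder text B."],
})

# A left join keeps every original book; `indicator=True` adds a `_merge`
# column recording whether a match was found in the second dataset.
merged = books.merge(best_books, on="title", how="left", indicator=True)

# Books without a cross-referenced match would need a second source
# (in the text above, the Goodreads API).
missing = merged.loc[merged["_merge"] == "left_only", ["book_id", "title"]]
print(missing)
```

In practice, matching on titles alone is likely to be fragile (editions, subtitles, punctuation), so a real cross-reference would probably need a more robust key.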

The four new fields added to the original books.csv file are :

description : a free-text summary of the book's content. On average the description is 900 characters long, with 95% of book descriptions containing fewer than 1797 characters. This column is 99.43% complete.
pages : the total page count. This column is 99.27% complete.
publishDate : the publication date. This column is 99.92% complete.
genres : the genre tags taken from the top shelves users have assigned to a book. Only the main Goodreads genres have been retained. On average, there are 4.7 genres per title, with 75% of books having 6 genres or fewer. This column is 100% complete.
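Figures like the completeness percentages, the description length statistics and the genres-per-title average can be recomputed with a few pandas calls. The sketch below runs on a small made-up frame rather than the real file, so its numbers are illustrative only; the variable names are hypothetical.

```python
import pandas as pd

# Toy frame standing in for books_enriched.csv; values are illustrative only.
toy = pd.DataFrame({
    "description": ["A short blurb.", None, "A somewhat longer blurb here.", "Mid blurb."],
    "pages": [320.0, 412.0, None, 198.0],
    "genres": [["fiction"], ["fantasy", "fiction"], ["classics"], ["romance"]],
})

# Share of non-missing values per column, as a percentage.
completeness = toy.notna().mean().mul(100).round(2)

# Description length statistics, ignoring missing descriptions.
desc_len = toy["description"].dropna().str.len()
mean_len = desc_len.mean()
p95_len = desc_len.quantile(0.95)

# Number of genre tags per title.
genres_per_book = toy["genres"].apply(len)

print(completeness)
print(mean_len, p95_len, genres_per_book.mean())
```

The same expressions should apply to the full dataframe once it is loaded with the genres column parsed into lists.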

The two updated fields integrated in this version are :

authors : a newly scraped list of all book contributors. This can include illustrators and collaborators. This column is 100% complete.
language_code : abbreviated language tags for all books, computed by scanning the book titles with the langid package. This column is 100% complete.

Importing the file

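import pandas as pd
from ast import literal_eval

# The genres column is stored as a stringified list; literal_eval turns each cell back into a Python list.
books_df = pd.read_csv('https://raw.githubusercontent.com/malcolmosh/goodbooks-10k/master/books_enriched.csv', index_col=[0], converters={"genres": literal_eval})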
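To show what the `literal_eval` converter does without downloading the file, here is a small sketch that reads an in-memory CSV with the same assumed layout (an unnamed index column and a `genres` column holding a stringified list) and then counts genre frequencies. The toy rows and the `title` column are made up for the example.

```python
import io
from ast import literal_eval

import pandas as pd

# Tiny in-memory CSV mimicking the layout assumed here: an unnamed index
# column plus a `genres` column stored as a stringified Python list.
csv_text = (
    ",title,genres\n"
    "0,Book A,\"['fantasy', 'fiction']\"\n"
    "1,Book B,\"['fiction']\"\n"
)

# Without the converter, `genres` would load as plain strings.
df = pd.read_csv(io.StringIO(csv_text), index_col=[0],
                 converters={"genres": literal_eval})

# One row per (book, genre) pair, then a frequency count per genre.
genre_counts = df["genres"].explode().value_counts()
print(genre_counts)
```

Once the column holds real lists, operations such as `explode` or per-row `len` work directly, which is the main reason to pass the converter at load time.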

#machinelearning #dataset #books #recommendersystem #opendata

