I love to read, a lot. It is rare to find me without a novel or Kindle in my hand. In fact, I often can’t sit still and watch a show because I can’t keep my hands off my book. I was so excited to find this GoodReads dataset because for a while now I have been having all these questions about book correlations, marketing, and the influence of BookTok, but I was too lazy to pull the data myself. SO, thank you to the kind soul who made this dataset for us.

I am going to split this up into some straightforward EDA, and then just for funsies I want to play around with the Transformers library and see if we can come up with credible A Court of Thorns and Roses (ACOTAR) by Sarah J. Maas quotes. If you don’t know what ACOTAR is, get out and go read the books now. Sorry, I am obsessed with this series and I can’t stop.

Some early questions…

  • Is there a correlation between rating and release year?
  • Can different types of visualization help reveal cool stuff about what we read?
  • Can we generate fake book quotes?

Let’s get set up.

 

# Set up
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
import datetime

# create color palette
palette = sns.color_palette(['#da572e', '#8a5650', '#72302e','#b29e90', '#503e35', '#299691'])
sns.set_context("paper", font_scale=1.1)
sns.palplot(['#da572e', '#8a5650', '#72302e','#b29e90', '#503e35', '#299691'])

plt.title("GoodReads Color Palette ",loc='left',fontfamily='serif',fontsize=15,y=1.2)

I find that I waste an inordinate amount of time choosing colors, so this time, we’re getting that out of the way fast.

Then, let’s read in the data, inspect it, and do some quick cleanup. I also wanted to add a few extra variables to the dataset, namely year and month.

 

df = pd.read_csv("data/books.csv", on_bad_lines='skip')
df.head()

# check for missing data
msno.matrix(df)

# inspect data
# rows and columns
print(f"Number of Rows and Columns: {df.shape[0]}, {df.shape[1]}")

# inspect data types
print("\nData Types:")
print(df.dtypes)

# fix duplicate/variant author names with a single mapping
author_fixes = {
    'J.K. Rowling-Mary GrandPré': 'J.K. Rowling',
    'J.K. Rowling/Mary GrandPré': 'J.K. Rowling',
    'J.K. Rowling/Gemma Rovira Ortega': 'J.K. Rowling',
    'Neil Gaiman/Mike Dringenberg/Chris Bachalo/Michael Zulli/Kelly Jones/Charles Vess/Colleen Doran/Malcolm Jones III/Steve Parkhouse/Daniel Vozzo/Lee Loughridge/Steve Oliff/Todd Klein/Dave McKean/Sam Kieth': 'Neil Gaiman',
    'Neil Gaiman/Matt Wagner/George Pratt/Dick Giordano/Kelley Jones/P. Craig Russell/Mike Dringenberg/Malcolm Jones III/Todd Klein/Harlan Ellison': 'Neil Gaiman',
    'Neil Gaiman/Michael Zulli/Jon J. Muth/Charles Vess/Mikal Gilmore': 'Neil Gaiman',
    "John             Lewis/Michael D'Orso": 'John Lewis',
}
df.replace(author_fixes, inplace=True)


df['publication_date'] = pd.to_datetime(df['publication_date'], format='%m/%d/%Y', errors='coerce').dt.date
# note: '  num_pages' really does have leading spaces in this dataset's column name
df = df[df['  num_pages'] >= 100]


# feature engineering
df['year'] = pd.DatetimeIndex(df['publication_date']).year
df['month'] = pd.DatetimeIndex(df['publication_date']).month
df['month_name'] = pd.DatetimeIndex(df['publication_date']).month_name()

EDA

I always love to start out with a pairwise plot. I find that it gives a quick view of any possible linear relationships!

 

sub = df[['average_rating', '  num_pages', 'ratings_count', 'year', 'month']]
# palette= only takes effect with hue=, so pass the color to the underlying plots instead
sns.pairplot(sub, plot_kws={'color': '#b29e90'}, diag_kws={'color': '#b29e90'})

I think we can potentially see a vague relationship between average_rating and year. Interestingly, I was expecting to see some months perform better than others in terms of publishing volumes, but it seems to be pretty even throughout the year.
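To put a number on that rating-vs-year hunch, a quick Pearson correlation does the trick. Here’s a minimal sketch on made-up toy data (the real value would of course come from the dataset, e.g. `df['year'].corr(df['average_rating'])`):

```python
import pandas as pd

# toy stand-in for the books dataframe
toy = pd.DataFrame({
    'year': [1990, 1995, 2000, 2005, 2010],
    'average_rating': [3.80, 3.90, 4.00, 4.05, 4.10],
})

# Pearson correlation between publication year and average rating
r = toy['year'].corr(toy['average_rating'])
print(round(r, 2))  # close to 1 for this made-up, steadily rising series
```

A value near 0 would mean no linear relationship; anything with a decent magnitude would back up (or kill) the visual impression from the pairplot.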

I then wanted to look at the top 10 most published authors. Here’s a quick plotting recipe I saw the other day for a Netflix EDA and couldn’t wait to replicate.

 

# Most Published Authors
data = df.groupby('authors')['title'].count().sort_values(ascending=False)[:10]
fig, ax = plt.subplots(1,1, figsize=(12, 6))
color_map = [palette[3] for _ in range(10)]
color_map[0] = color_map[1] = color_map[2] = '#b20710'  # highlight the top three

ax.bar(data.index, data, width=0.5, 
       edgecolor='darkgray',
       linewidth=0.6,color=color_map)

#annotations
for i in data.index:
    ax.annotate(f"{data[i]}",
                xy=(i, data[i] + 2),  # vertical offset; roughly 5% of the tallest bar works well
                va='center', ha='center', fontweight='light', fontfamily='serif')

# Remove border from plot

for s in ['top', 'left', 'right']:
    ax.spines[s].set_visible(False)
    
# Tick labels
ax.set_xticklabels(data.index, fontfamily='serif', rotation=0)

# Title and sub-title
fig.text(0.09, 1, 'Top 10 Most Published Authors', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.09, 0.95, 'The three authors with the most publications are highlighted.', fontsize=12, fontweight='light', fontfamily='serif')

fig.text(1.01, 0.95, 'Insight', fontsize=15, fontweight='bold', fontfamily='serif')

fig.text(1.01, 0.60, 
'''

From this dataset, it
seems Wodehouse has published
the most. 

He's followed by
Stephen King and Takahashi.

All languages accounted for.
Only kept entries of 100+ pages.
'''
         , fontsize=12, fontweight='light', fontfamily='serif')

ax.grid(axis='y', linestyle='-', alpha=0.4)   

grid_y_ticks = np.arange(0, 50, 5) # y ticks, min, max, then step
ax.set_yticks(grid_y_ticks)
ax.set_axisbelow(True)
    
# thicken the bottom line 
plt.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .7)
ax.tick_params(axis='both', which='major', labelsize=12)

import matplotlib.lines as lines
l1 = lines.Line2D([1, 1], [0, 1], transform=fig.transFigure, figure=fig,color='black',lw=0.2)
fig.lines.extend([l1])
ax.tick_params(axis='both', which='both', length=0, labelrotation=45)

Eeeek, so excited with how this plot turned out! Unsurprisingly, Takahashi and King top the charts! Tbh, kinda happy JK Rowling wasn’t here, come at me.

I then wanted to look at a few different distributions, such as the highest-rated authors and books. I also wish the dataset were a bit more exhaustive, but alas… there’s just no way Sarah J. Maas doesn’t even come up. Or… this is probably my own bias.

 
# highest rated authors
high_rated_author = df[df['average_rating']>=4.3]
data = high_rated_author[high_rated_author['text_reviews_count'] >= 200].groupby(['authors'])['average_rating'].mean().sort_values(ascending=False)[:10]
data

# highest rated books
data = df[df['text_reviews_count'] >= 200].groupby(['title', 'authors'])['average_rating'].mean().sort_values(ascending=False)[:10]
data

I def cracked up at Fullmetal Alchemist, but according to my husband, who claims to be an anime connoisseur, this makes total sense. Lord of the Rings and The Sandman also made the cut, so no surprise there. No comment on Harry Potter 🙅🏽‍♀️.

Transformers, Text Generation

Unless you live under a rock, you know transformers have taken the NLP world by storm in recent years. It’s gotten to the point where you can’t even mention attempting to do something without transformers without it being frowned upon. And I agree to an extent: transformers truly have revolutionized the game, especially when it comes to text classification and text generation.

There are so many amazing tutorials out there that I would be doing you a disservice by rehashing the theory all over again. Instead, I want to spend a bit more time understanding the transformers package itself and how one can fine-tune a quick text generation model. Obviously, how fun would it be to try and generate ACOTAR quotes!!!

First things first: for our text generator to work, we need a starter sequence of text that our transformer model can use to try and imitate Sarah J. Maas. I picked this one! The model will then attempt to continue the story.

Let’s get the library loaded up.

 
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# initialize tokenizer and model from pretrained GPT2 model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

sequence = """
“Her magic sent him sprawling, and it then hurled into Rhysand again - so hard that his head cracked against the stones and the knife dropped from his splayed fingers. No one made a move to help him, and she struck him once more with her power. The red marble splintered where he hit it, spiderwebbing toward me. With wave after wave she hit him. Rhys groaned.
"Stop," I breathed, blood filling my mouth as I strained a hand to reach her feet. "Please."
Rhys's arms buckled as he fought to rise, and blood dripped from his nose, splattering on the marble. His eyes met mine.
"""

Next, let’s encode our text in a way that our model can understand. In this case, I am using PyTorch, hence return_tensors='pt'; if you are on TensorFlow, pass 'tf' instead.

 
inputs = tokenizer.encode(sequence, return_tensors='pt')

Then, to generate text, we simply do the following, and it is here where we can really start fine-tuning our model.

 
outputs = model.generate(inputs, max_length=200, do_sample=True)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
text

Let’s break this code up to understand what is going on. model.generate runs our model, and it is where most of the parameters we can change come in.

max_length, for example, lets us specify the maximum length of the generated sequence in tokens (not characters), prompt included.

do_sample tells the model to use a sampling approach when selecting which token to generate next. This means it will randomly choose the next token based on its conditional probability distribution, e.g. P(“the” | “he went to”). This also stops the model from just picking the single most likely token at every step (greedy decoding).
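To make that concrete, here’s a toy sketch (invented tokens and probabilities, not real GPT-2 outputs) contrasting greedy decoding with sampling:

```python
import numpy as np

rng = np.random.default_rng(42)

# a pretend next-token distribution, P(token | "he went to")
tokens = ['the', 'a', 'his', 'school']
probs = [0.5, 0.2, 0.2, 0.1]

greedy = tokens[int(np.argmax(probs))]  # do_sample=False: always take the top token
sampled = rng.choice(tokens, p=probs)   # do_sample=True: random draw from the distribution
print(greedy, sampled)
```

Run the sampling line a few times and you’ll get different continuations; greedy decoding gives the same one every time.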

With these two parameters alone, let’s see what we get. The red text is generated.

Not bad… Rhys DOES constantly struggle with being a “monster.” Let’s try to do better! We can also combine different approaches to generating the next token to try and improve our predictions.

 
outputs = model.generate(inputs, max_length=200, do_sample=True, top_p=0.92, top_k=20, temperature=5.0)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
text

Ok, we’ve added a few more parameters this time. Let’s break them up.

temperature controls the randomness of the prediction (the name is heavily inspired by thermodynamics): the logits are divided by the temperature before the softmax, so the higher the number, the flatter the distribution and the crazier it gets.

top_p (nucleus sampling) keeps only the smallest set of candidate tokens whose cumulative probability exceeds p, and samples from that set.

top_k restricts sampling to the k most likely next tokens, which controls the diversity of terms predicted.
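As a sanity check on the temperature intuition, here’s a toy sketch (made-up logits, not pulled from the model) showing how dividing logits by the temperature before the softmax sharpens or flattens the distribution:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # scale the logits, then apply a numerically stable softmax
    scaled = np.asarray(logits) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]
cool = softmax_with_temperature(logits, 0.5)  # T < 1: sharper, top token dominates
warm = softmax_with_temperature(logits, 5.0)  # T > 1: flatter, "crazier" generations
print(cool.round(3), warm.round(3))
```

With a high temperature like the 5.0 we passed above, even unlikely tokens get a real shot at being sampled, which is exactly where the weirdness comes from.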

I do think combining these parameters gives us a much more human-readable result! I mean, I could totally believe this was part of the book. The red text is generated.

There are many other things we could fine-tune, but this is a good starting point, and I had so much fun making these!

Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
