Everyone knows South Park. I’ll admit I haven’t watched an episode in at least five years, but once I saw this data set I had to dig in. It’s a prime sample for some quick-and-dirty linguistic pattern searching. It also provides us with a way to understand how to “measure” language.
The data lives here. Just FYI, this will be less of a coding tutorial and more of a linguistics-based tutorial, so the code I show here is not optimized. In addition, I wanted to get into making fancy plots, so I am back at trying different types of data visualization. Channeling Mona Chalabi… For the non-plot visualizations I used Canva.
Phrases
The first thing I wanted to look at was common phrases. There are multiple ways to do this, but a straightforward one is looking at n-grams. Put simply, n-grams are sequences of n words that commonly appear together. They provide context and meaning to a given word or combination of words. Good examples are “New York City” and “White Chocolate Mocha.” Here’s what I found from South Park.
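To make the idea concrete, here’s a tiny sketch of pulling n-grams out of a tokenized phrase by sliding a window over it (the sample phrase is just an illustration of mine):

```python
def ngrams(tokens, n):
    """All contiguous n-word sequences in a list of tokens."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i would like a white chocolate mocha".split()
print(ngrams(tokens, 2))  # ['i would', 'would like', ..., 'chocolate mocha']
print(ngrams(tokens, 3))  # ['i would like', ..., 'white chocolate mocha']
```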
Here are the top 2-grams to 7-grams and the quick recipe I used to get them.
```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Count word n-grams (2-grams up to 7-grams) across every line of dialogue
word_vectorizer = CountVectorizer(ngram_range=(2, 7), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['Line'])

# Sum the counts over all lines to get corpus-wide frequencies
frequencies = sum(sparse_matrix).toarray()[0]
ngram_freq = pd.DataFrame(frequencies,
                          index=word_vectorizer.get_feature_names_out(),
                          columns=['frequency'])
```
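To actually read off the top phrases, you can sort that frequency table; here’s a quick follow-up on the `ngram_freq` DataFrame from above (the length filter is just one way to look at a single n at a time):

```python
# Most frequent n-grams overall
top_ngrams = ngram_freq.sort_values('frequency', ascending=False)
print(top_ngrams.head(20))

# Top phrases of a specific length, e.g. 4-grams only
mask = top_ngrams.index.str.split().str.len() == 4
print(top_ngrams[mask].head(10))
```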
“Oh my God, they killed Kenny! You bastards!”
Parts of Speech
Looking at parts of speech on its own won’t solve any NLP problems, but it can help us understand the text we are working with. This is particularly important if we want to find any sort of linguistic pattern. To be honest, I am not expecting much out of the words that will come up here, since it’s South Park and the show is not known for its diverse lexicon, haha. The NLTK library has a lot of tools for flagging parts of speech. A really approachable one is pos_tag; I’ll show it in action right after the list. I encourage you to look over the documentation! Here are some of the parts of speech we can flag with this function:
- CC coordinating conjunction
- CD cardinal digit
- DT determiner
- EX existential there (like: “there is” … think of it like “there exists”)
- FW foreign word
- IN preposition/subordinating conjunction
- JJ adjective ‘big’
- JJR adjective, comparative ‘bigger’
- JJS adjective, superlative ‘biggest’
- LS list marker 1)
- MD modal could, will
- NN noun, singular ‘desk’
- NNS noun plural ‘desks’
- NNP proper noun, singular ‘Harrison’
- NNPS proper noun, plural ‘Americans’
- PDT predeterminer ‘all the kids’
- POS possessive ending parent’s
- PRP personal pronoun I, he, she
- PRP$ possessive pronoun my, his, hers
- RB adverb very, silently
- RBR adverb, comparative better
- RBS adverb, superlative best
- RP particle give up
- UH interjection
- VB verb, base form take
- VBD verb, past tense took
- VBG verb, gerund/present participle taking
- VBN verb, past participle taken
- VBP verb, non-3rd person singular present take
- VBZ verb, 3rd person sing. present takes
- WDT wh-determiner which
- WP wh-pronoun who, what
- WP$ possessive wh-pronoun whose
- WRB wh-adverb where, when
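Before tagging the whole script, it helps to see what pos_tag returns on a single line. A quick example (the sample sentence is mine, you may need to download the tokenizer and tagger models first, and the exact tags can vary slightly by NLTK version):

```python
import nltk
from nltk import word_tokenize, pos_tag

# One-time model downloads for the tokenizer and tagger
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sample = "They killed Kenny, you bastards!"
print(pos_tag(word_tokenize(sample)))
# [('They', 'PRP'), ('killed', 'VBD'), ('Kenny', 'NNP'), (',', ','),
#  ('you', 'PRP'), ('bastards', 'NNS'), ('!', '.')]
```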
```python
from nltk import word_tokenize
from nltk.tag import pos_tag

verbs = []
adjectives = []
nouns = []

# Tag every line of dialogue and bucket words by part of speech
for line in df['Line']:
    tagged = pos_tag(word_tokenize(line))
    for word, tag in tagged:
        if tag == 'JJ':
            adjectives.append(word)
        elif tag == 'VB':
            verbs.append(word)
        elif tag == 'NN':
            nouns.append(word)
```
Here are the results! P.S. I ignored some words that should have been stop words or were repetitive.
So, this seems right up there with the sort of words we hear in South Park conversations.
Lexical Diversity
Lexical density, or diversity, is a great way to measure how many unique words are in a given context. Here we measure it by dividing the number of unique words by the total number of words.
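In code, that ratio is a one-liner. A minimal sketch, using a naive split()-based tokenizer of my own (the real table below uses the full token counts):

```python
def lexical_diversity(text):
    """Ratio of unique words to total words in a text."""
    words = text.lower().split()
    return len(set(words)) / len(words)

print(lexical_diversity("the bear ate the fish and the fish ate the bear"))
# 5 unique words / 11 total ≈ 0.45
```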
Also, I took a look at the number of lines, unique words, and word count for the top characters. I defined top characters as those with the most lines spoken in the show. Here’s the table of results (a sketch of how it can be built follows the table)!
| Name | Number of Lines | Total Word Count | Avg. Words per Line | Unique Words | Lexical Diversity |
| --- | --- | --- | --- | --- | --- |
| Cartman | 9774 | 172682 | 17.67 | 9621 | 0.0557 |
| Stan | 7680 | 92493 | 12.04 | 5366 | 0.0580 |
| Kyle | 7099 | 85120 | 11.99 | 5329 | 0.0626 |
| Randy | 2467 | 41598 | 16.86 | 3891 | 0.0935 |
| Butters | 2602 | 39729 | 15.27 | 3912 | 0.0985 |
| Mr. Garrison | 1002 | 19795 | 19.76 | 2522 | 0.1274 |
| Chef | 917 | 14890 | 16.24 | 1967 | 0.1321 |
| Kenny | 881 | 7807 | 8.86 | 1034 | 0.1324 |
| Sharon | 862 | 11435 | 13.27 | 1530 | 0.1338 |
| Mr. Mackey | 633 | 14299 | 22.59 | 1969 | 0.1377 |
| Sheila | 566 | 7521 | 13.29 | 1195 | 0.1589 |
| Liane | 582 | 6987 | 12.01 | 1134 | 0.1623 |
| Gerald | 626 | 8961 | 14.31 | 1465 | 0.1635 |
| Jimmy | 597 | 11172 | 18.71 | 1841 | 0.1648 |
| Jimbo | 556 | 9202 | 16.55 | 1546 | 0.1680 |
| Wendy | 585 | 7906 | 13.51 | 1339 | 0.1694 |
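For reference, here is a rough sketch of how such a table can be assembled with pandas. I’m assuming the data set’s Character and Line columns and the same naive split() tokenizer as above, so the exact counts may differ slightly from mine:

```python
rows = []
for name, group in df.groupby('Character'):
    words = ' '.join(group['Line']).lower().split()
    rows.append({
        'Name': name,
        'Number of Lines': len(group),
        'Total Word Count': len(words),
        'Avg. Words per Line': len(words) / len(group),
        'Unique Words': len(set(words)),
        'Lexical Diversity': len(set(words)) / len(words),
    })

character_quotes_parameters_df = (
    pd.DataFrame(rows)
      .sort_values('Number of Lines', ascending=False)
      .head(16)  # the 16 characters with the most lines
      .reset_index(drop=True)
)
```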
LUL, I mean, I guess, sure, Kenny is the most articulate? Let’s visualize! Who is the most articulate out of the four boys?
Let’s now plot a few of the other metrics, namely the number of lines per character and the number of unique words per character.
So, it seems that Cartman has both the most unique words and the most lines in the show, followed by Stan and Kyle.
Finally, I wanted to see if there was any sort of correlation between these text features.
From here we see that lexical diversity is actually inversely related to the number of lines and total word count. Meanwhile, total word count and number of lines go hand in hand with the number of unique words spoken.
Here are the plotting recipes!
```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.set_style("whitegrid")

# Number of lines per character
plot_df = character_quotes_parameters_df.sort_values(
    'Number of Lines', ascending=False).reset_index(drop=True)
ax = sns.barplot(x='Name', y='Number of Lines', data=plot_df, palette="YlGnBu")
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.grid(False)
sns.despine()
plt.show()

# Unique words per character
plot_df = character_quotes_parameters_df.sort_values(
    'Unique Words').reset_index(drop=True)
ax = sns.barplot(x='Name', y='Unique Words', data=plot_df, palette="PuBu")
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.grid(False)
sns.despine()
plt.show()

# Correlation matrix of the numeric text features
corr = character_quotes_parameters_df.corr(numeric_only=True)

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(199, 250, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
```
And there you have it, folks! Hope it’s been fun!