Everyone knows South Park. I’ll admit I haven’t watched an episode in at least 5 years, but once I saw this data set I had to dig in. It’s a prime sample for some quick and dirty linguistic pattern searching. It also provides us with a way to understand how to “measure” language.

The data lives here. Just FYI, this will be less of a coding tutorial and more of a linguistics-based tutorial, so the code I show here is not optimized. In addition, I wanted to get into making fancy plots, so I am back at trying different types of data visualization. Channeling Mona Chalabi… For the non-plot visualizations I used Canva.


The first thing I wanted to look at was common phrases. There are multiple ways to do this, but a straightforward way is looking at n-grams. Put simply, n-grams are sequences of n consecutive words, and the frequent ones are the words that commonly appear together. They provide context and meaning to a given word or combination of words. Good examples are “New York City” and “White Chocolate Mocha.” Here’s what I found from South Park.

Here are the top 2-grams to 7-grams and the quick recipe I used to get them.

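import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# ngram_range=(2, 4) counts 2- to 4-grams; raise the upper bound (e.g. to 7) for longer phrases
word_vectorizer = CountVectorizer(ngram_range=(2, 4), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['Line'])
# Sum each column to get the total count of every n-gram across all lines
frequencies = np.asarray(sparse_matrix.sum(axis=0)).ravel()
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])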
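
For intuition about what the vectorizer is counting, here is a minimal pure-Python sketch of n-gram counting. The function name `ngram_counts`, the whitespace tokenization, and the toy lines are assumptions for illustration; CountVectorizer uses its own regex-based tokenizer, so its counts on real text may differ.

```python
from collections import Counter

def ngram_counts(lines, n_min=2, n_max=4):
    """Count every word n-gram with n_min <= n <= n_max across all lines."""
    counts = Counter()
    for line in lines:
        # Naive tokenization (an assumption): lowercase and split on whitespace
        words = line.lower().split()
        for n in range(n_min, n_max + 1):
            # Slide a window of n words across the line
            for start in range(len(words) - n + 1):
                counts[" ".join(words[start:start + n])] += 1
    return counts

# Toy lines, not taken from the data set
toy_lines = ["they killed kenny", "oh my god they killed kenny"]
counts = ngram_counts(toy_lines)
print(counts.most_common(3))
```

Phrases that show up in many lines float to the top of `most_common`, which is the same idea as sorting the frequency column from the recipe above.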


“Oh, my God, they killed Kenny! You, Bastards!”

Parts of Speech

Looking at parts of speech on its own won’t solve any NLP problems, but it can help us understand the text we are working with. This is particularly important if we want to find any sort of linguistic patterns.  To be honest, I am not expecting much out of the words that will come up here since it’s South Park and the show is not known for using a diverse lexicon, haha.  The NLTK library has a lot of tools to flag parts of speech. A really approachable one is pos_tag. I encourage you to look over the documentation! Here are some of the parts of speech we can flag with this function:

  • CC coordinating conjunction
  • CD cardinal number
  • DT determiner
  • EX existential there (like: “there is” … think of it like “there exists”)
  • FW foreign word
  • IN preposition/subordinating conjunction
  • JJ adjective ‘big’
  • JJR adjective, comparative ‘bigger’
  • JJS adjective, superlative ‘biggest’
  • LS list marker 1)
  • MD modal could, will
  • NN noun, singular ‘desk’
  • NNS noun plural ‘desks’
  • NNP proper noun, singular ‘Harrison’
  • NNPS proper noun, plural ‘Americans’
  • PDT predeterminer ‘all the kids’
  • POS possessive ending parent’s
  • PRP personal pronoun I, he, she
  • PRP$ possessive pronoun my, his, hers
  • RB adverb very, silently
  • RBR adverb, comparative better
  • RBS adverb, superlative best
  • RP particle give up
  • UH interjection
  • VB verb, base form take
  • VBD verb, past tense took
  • VBG verb, gerund/present participle taking
  • VBN verb, past participle taken
  • VBP verb, sing. present, non-3rd person take
  • VBZ verb, 3rd person sing. present takes
  • WDT wh-determiner which
  • WP wh-pronoun who, what
  • WP$ possessive wh-pronoun whose
  • WRB wh-adverb where, when

from nltk.tag import pos_tag
from nltk import word_tokenize

verbs = []
adjectives = []
nouns = []
for line in df['Line']:
    # Split the line into tokens and tag each token with its part of speech
    tagged = pos_tag(word_tokenize(line))
    for word, tag in tagged:
        # Collect only plain adjectives, base-form verbs, and singular nouns
        if tag == 'JJ':
            adjectives.append(word)
        elif tag == 'VB':
            verbs.append(word)
        elif tag == 'NN':
            nouns.append(word)

Here are the results! PS. I ignored some words that should have been stop-words or were repetitive.
[Figure: parts of speech]

So, this seems right up there with the sort of words we hear in South Park conversations.
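
The stop-word filtering mentioned above could look something like the sketch below. The stop-word set here is a tiny hypothetical one chosen for illustration (NLTK ships a fuller list in `nltk.corpus.stopwords`), and `top_words` is a name I made up, not the exact code behind the figure.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only
STOP_WORDS = {"be", "do", "have", "get", "go"}

def top_words(words, k=10, stop_words=STOP_WORDS):
    """Return the k most frequent words, ignoring case and stop-words."""
    cleaned = (w.lower() for w in words)
    counts = Counter(w for w in cleaned if w not in stop_words)
    return counts.most_common(k)

# Toy list standing in for the `verbs` list built earlier
toy_verbs = ["go", "kill", "Kill", "be", "know", "kill", "know"]
print(top_words(toy_verbs, k=2))  # [('kill', 3), ('know', 2)]
```

The same function works unchanged on the `adjectives` and `nouns` lists.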

Lexical Diversity

Lexical diversity is a great way to measure how many unique words are used in a given context. Here we measure it by dividing the number of unique words by the total number of words.
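
As a minimal sketch of that ratio (often called the type-token ratio), assuming a "word" is any lowercased run of letters or apostrophes; the function name and the regex are my own choices here, not necessarily what produced the numbers in this post:

```python
import re

def lexical_diversity(text):
    """Unique words divided by total words."""
    # Assumption: a word is a run of letters or apostrophes, lowercased
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# 8 words, 6 of them unique
print(lexical_diversity("Screw you guys, I'm going home. Screw you!"))  # 0.75
```

A value close to 1 means almost every word is new; a value close to 0 means a lot of repetition.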

Also, I took a look at the number of lines, unique words, and word count for the top characters. I defined top characters as those with the most lines spoken in the show. Here’s the table of results!

Name          Number of Lines  Total Word Count  Average Sentence Length  Unique Words  Lexical Diversity
Cartman       9774             172682            17.67                    9621          0.0557
Stan          7680             92493             12.04                    5366          0.0580
Kyle          7099             85120             11.99                    5329          0.0626
Randy         2467             41598             16.86                    3891          0.0935
Butters       2602             39729             15.27                    3912          0.0985
Mr. Garrison  1002             19795             19.76                    2522          0.1274
Chef          917              14890             16.24                    1967          0.1321
Kenny         881              7807              8.86                     1034          0.1324
Sharon        862              11435             13.27                    1530          0.1338
Mr. Mackey    633              14299             22.59                    1969          0.1377
Sheila        566              7521              13.29                    1195          0.1589
Liane         582              6987              12.01                    1134          0.1623
Gerald        626              8961              14.31                    1465          0.1635
Jimmy         597              11172             18.71                    1841          0.1648
Jimbo         556              9202              16.55                    1546          0.1680
Wendy         585              7906              13.51                    1339          0.1694
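
A table like this could be assembled roughly as follows. This is a sketch under assumptions: the input is a list of (character, line) pairs, words are split on whitespace, and `character_stats` is a hypothetical helper name. In the table, Average Sentence Length works out to total word count divided by number of lines, so that is what the sketch computes.

```python
from collections import defaultdict

def character_stats(rows):
    """Aggregate (character, line) pairs into per-character text metrics."""
    lines_by_character = defaultdict(list)
    for character, line in rows:
        lines_by_character[character].append(line)

    stats = {}
    for character, lines in lines_by_character.items():
        # Assumption: whitespace-separated words, compared case-insensitively
        words = [w.lower() for line in lines for w in line.split()]
        unique = set(words)
        stats[character] = {
            "Number of Lines": len(lines),
            "Total Word Count": len(words),
            "Average Sentence Length": len(words) / len(lines),
            "Unique Words": len(unique),
            "Lexical Diversity": len(unique) / len(words) if words else 0.0,
        }
    return stats

# Toy rows, not taken from the data set
toy_rows = [("Stan", "oh my god"), ("Kyle", "you bastards"), ("Stan", "oh no")]
stats = character_stats(toy_rows)
print(stats["Stan"])
```

With real data, the resulting dictionary can be handed to `pd.DataFrame.from_dict(stats, orient='index')` to get one row per character.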

LUL, I mean, I guess, sure, Kenny is the most articulate? Let’s visualize! Who is the most articulate out of the four boys?

[Figure: lexical diversity]

Let’s now plot a few of the other metrics, mainly the number of lines per character and the number of unique words per character.



So, it seems that Cartman has both the most unique words and the most lines in the show, followed by Stan and Kyle.

Finally, I wanted to see if there was any sort of correlation between these text features.


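From here we see that lexical diversity is actually negatively correlated with the number of lines and word count. That makes sense: the more a character talks, the more they repeat themselves, so unique words grow more slowly than total words. Meanwhile, total word count and number of lines go hand in hand with the number of unique terms spoken.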
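
To see where those signs come from, here is a small sketch that computes the Pearson correlation by hand for the four boys, using values copied (and rounded) from the table above. The `pearson` helper is my own illustration; the correlation matrix in the plot would normally come from pandas' `DataFrame.corr()`.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Cartman, Stan, Kyle, Kenny (from the results table, diversity rounded)
number_of_lines = [9774, 7680, 7099, 881]
unique_words = [9621, 5366, 5329, 1034]
diversity = [0.0557, 0.0580, 0.0626, 0.1324]

print(pearson(number_of_lines, diversity))     # negative
print(pearson(number_of_lines, unique_words))  # positive
```

Four characters is far too few to draw conclusions from, so treat this as a demonstration of the calculation rather than a result.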

Here are the plotting recipes!

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Number of Lines
character_quotes_parameters_df = character_quotes_parameters_df.sort_values(['Number of Lines'], ascending=False).reset_index(drop=True)

plt.figure()
ax = sns.barplot(x='Name', y='Number of Lines',
                 data=character_quotes_parameters_df, palette="YlGnBu")
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")

# Unique Words
character_quotes_parameters_df = character_quotes_parameters_df.sort_values(['Unique Words']).reset_index(drop=True)
plt.figure()
ax = sns.barplot(x='Name', y='Unique Words',
                 data=character_quotes_parameters_df, palette="PuBu")
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")

# Correlation Matrix
# Pairwise correlations between the numeric columns (assumed source of `corr`)
corr = character_quotes_parameters_df.select_dtypes('number').corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(199, 250, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

And there you have it, folks! Hope it’s been fun!


Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
