Often we find ourselves dealing with text, short or long, and we only want to pull out the important keywords from it. As with most things, there are several algorithms out there that you could use for this. A common one that you’ve likely even used before is TF-IDF, but there are others, like the RAKE algorithm, that can perform very well. The crucial step in any of these keyword algorithms is to spend time carefully cleaning and preparing your data so that you pull out the keywords that really matter to that text.

For this tutorial, we will use TF-IDF. Sure, there are many more complicated algorithms out there, but I always find that understanding and working through the simplest algorithm first gives us a wealth of knowledge before diving into more complicated adventures. TF-IDF will help us pull out terms that are distinctive to a document, a.k.a. the important keywords.

In a nutshell, these are the steps you’d need to take to perform keyword extraction (a minimal sketch of these steps follows the list):

  • Clean your text (strip HTML remnants, punctuation, and numbers)
  • Remove stop words
  • Remove overly common and overly rare words
  • Tokenize the text
  • Stem the tokens
  • Apply algorithm of choice
  • Rank terms
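
To make the cleaning and stemming steps concrete before we touch the real data, here is a minimal sketch on a made-up sentence (the sentence and the choice of PorterStemmer are purely illustrative):

import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords', quiet=True)  # fetch the stop word list if it is not already installed

text = "The Congress has proposed 12 amendments to the Constitution!"

# 1. Clean: lowercase, then drop punctuation and numbers
cleaned = re.sub(r'[^a-z\s]', '', text.lower())

# 2. Tokenize on whitespace and remove stop words
stop = set(stopwords.words('english'))
tokens = [t for t in cleaned.split() if t not in stop]

# 3. Stem the surviving tokens
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# e.g. ['congress', 'propos', 'amend', 'constitut']

The algorithm and ranking steps are what the rest of this post walks through on the full dataset.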

For this tutorial we will use the National Archives’ “Amending America” dataset, which contains roughly 11,000 proposed amendments to the US Constitution between 1788 and 2014. The data lives here. Let’s now look at these steps in more detail.

First, let’s try to understand what TF-IDF is and how it works. In short, tf-idf is the product of the term frequency and the inverse document frequency. The term frequency measures how frequently a word or term occurs in a given document. Because documents have different lengths, a term can appear many more times in one document than in another, which creates unneeded bias toward a word just because it appears 20 times in one document. To deal with this, we treat each document as a bag of words, where word order does not matter, and we bring in the inverse document frequency to penalize words that show up across many documents. Just because a term appears 30 times does not mean it is 30 times more relevant. If you take anything from this post, let it be that keyword relevance is not proportional to frequency.
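
To see the “frequency is not relevance” point in code, here is a quick toy example with scikit-learn's TfidfVectorizer; the three documents below are made up. "states" appears in every document, so its weight gets dampened, while rarer words like "suffrage" float to the top of the first document's weights.

from sklearn.feature_extraction.text import TfidfVectorizer

# A toy corpus: "states" shows up in every document, "suffrage" in only one
docs = [
    "the states shall respect equal suffrage",
    "the states shall respect the states",
    "the states shall balance the budget",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

# Weights for the first document: the rare words outrank the frequent ones
# (on scikit-learn < 1.0 use vec.get_feature_names() instead)
weights = dict(zip(vec.get_feature_names_out(), tfidf.toarray()[0]))
print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))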

Mathematically, for a term t in a document d drawn from a corpus D of N documents, tf-idf looks like this.

TF:

$$\mathrm{tf}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}$$

IDF:

$$\mathrm{idf}(t, D) = \log\left(\frac{N}{\text{number of documents in } D \text{ containing } t}\right)$$

Combining them together:

$$\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$

As you can see, tf-idf grows as a word becomes rarer across the corpus, so it rewards terms that are frequent within a document but uncommon overall. With that out of the way, let’s start implementing this baby! These are the libraries you will need:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
import os
import nltk
from nltk.util import ngrams
from nltk.corpus import stopwords
import datetime
from collections import Counter
import re

Now, let’s define a few functions that will come in handy in cleaning our data and finding the ranked terms later on.

#---User-defined functions
# Run nltk.download('stopwords') once if the stop word list is not already installed
stop = set(stopwords.words('english'))

def rank_words(terms, feature_matrix):
    """
    Rank terms by their summed tf-idf weight across all documents.
    """
    sums = feature_matrix.sum(axis=0)   # 1 x n_terms matrix of column sums
    data = []
    for col, term in enumerate(terms):
        data.append((term, sums[0, col]))
    ranked = pd.DataFrame(data, columns=['term','rank']).sort_values('rank', ascending=False)
    return ranked

def removeHTMLtags(text):
    """
    Removes HTML tags, i.e. anything between angle brackets.
    Takes in a single string.

    """
    cleanCondition = re.compile('<.*?>')  # non-greedy match of anything between < and >
    cleanText = re.sub(cleanCondition, '', text)
    return cleanText

def sanitize(text):
    """
    Removes leftover HTML entities and whitespace artifacts.
    Takes in a single string.

    """
    # Common leftovers after tag removal; adjust for your own data
    newText = text.replace("&nbsp;", "")
    newText = newText.replace("&amp;", "")
    newText = newText.replace("\xa0", "")
    newText = newText.replace("\t", "")
    return newText
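
As a quick sanity check, here is what these two helpers do to a made-up string (the string is mine, not from the dataset):

raw = "Proposing an <b>equal rights</b> amendment&nbsp;&amp; suffrage\t"

no_tags = removeHTMLtags(raw)   # strips the <b> and </b> tags
clean = sanitize(no_tags)       # strips the leftover entities and the tab

print(clean)  # Proposing an equal rights amendment suffrage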

Now, let’s read in the data and make sure our columns are of the right data type.

df = pd.read_csv("us-nara-amending-america-dataset-raw-2016-02-25.csv")
df.dtypes

#change to string
df['title_or_description_from_source'] = df['title_or_description_from_source'].astype(str)

Great!

Then, let’s start cleaning our data with the functions we defined above. Do note that this will DEPEND ON YOUR DATA. Don’t just blindly remove stop words or clean a certain way just because I am doing it here. You will need to decide what is best for your dataset.

df['cleanedText'] = df['title_or_description_from_source'].apply(removeHTMLtags)
df['cleanedText'] = df['cleanedText'].apply(sanitize)
df['cleanedText'] = df['cleanedText'].str.replace(r'[^\w\s]', '', regex=True)  #remove punctuation
df['cleanedText'] = df['cleanedText'].str.replace(r'\d+', '', regex=True)      #remove numbers
df['cleanedText'] = df['cleanedText'].str.lower()

#remove stopwords
df['cleanedText'] = df['cleanedText'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))
df['cleanedText'].head()
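
The step list above also mentions stemming. This walkthrough skips it, but if you want to fold it in, an optional PorterStemmer pass over the cleaned column would look roughly like this (whether it helps is, again, data dependent):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Written to a separate (hypothetical) column so the outputs below stay unchanged
df['stemmedText'] = df['cleanedText'].apply(
    lambda x: " ".join(stemmer.stem(w) for w in x.split())
)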

Now, we will inspect and decide whether we want to remove common and uncommon words.

#---Remove common/rare words
#word frequency
freq_com = pd.Series(' '.join(df['cleanedText']).split()).value_counts()[:10]
freq_com

president       2629
states          2346
united          1840
amendment       1484
congress        1418
constitution    1380
rights          1334
election        1279
equal           1183
vice            1175

Let’s do the same for uncommon terms.

#---Remove common/rare words
#word frequency
freq_uncom = pd.Series(' '.join(df['cleanedText']).split()).value_counts()[-10:]
freq_uncom
representationin    1
whoever             1
revoke              1
banks               1
profits             1
mines               1
intrastate          1
prosecutor          1
loyal               1
infringement        1

Cool, so by looking at this I feel pretty confident about removing these words from the corpus.

#drop the 10 most common words
freq_com = list(freq_com.index)
df['cleanedText'] = df['cleanedText'].apply(lambda x: " ".join(w for w in x.split() if w not in freq_com))

#drop the 10 rarest words
freq_uncom = list(freq_uncom.index)
df['cleanedText'] = df['cleanedText'].apply(lambda x: " ".join(w for w in x.split() if w not in freq_uncom))
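
As an aside, TfidfVectorizer can handle this kind of pruning for you via its min_df and max_df parameters, if you would rather not filter the corpus by hand (the thresholds below are just illustrative):

# Prune automatically instead: ignore terms that appear in fewer than 2 documents
# or in more than 85% of documents
vectorizer_pruned = TfidfVectorizer(analyzer='word', ngram_range=(2, 4),
                                    min_df=2, max_df=0.85)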

Now, we are ready for some keyword extraction! I will set the ngram range to 2-4, since unigrams rarely spit out anything useful here.

#---Set up TF-IDF for NMF and ranking words
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(2, 4))
tfidf = vectorizer.fit_transform(df['cleanedText'])

# On scikit-learn < 1.0, use vectorizer.get_feature_names() instead
ranked = rank_words(terms=vectorizer.get_feature_names_out(), feature_matrix=tfidf)
ranked[0:20]

Wut. Y’all, we did it! Let’s try to visualize this initial output.
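
A horizontal bar chart of the top 20 terms is enough for this; something along these lines (the styling is up to you):

top20 = ranked[0:20]

plt.figure(figsize=(10, 6))
sns.barplot(x='rank', y='term', data=top20, color='steelblue')
plt.xlabel('summed tf-idf weight')
plt.ylabel('')
plt.title('Top keywords across all proposed amendments')
plt.tight_layout()
plt.show()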

[Figure: top 20 keywords across the full corpus, ranked by summed tf-idf weight]

This is really interesting because we can clearly see important events in our history through these keywords!

Now that we have that done, for fun, let’s see what sort of themes we can find. I will use NMF (non-negative matrix factorization) for this. If you need a refresher on it, please refer to my previous posts. Here’s the quick recipe:

#---Set up NMF
n_topics = 12
nmf = NMF(n_components=n_topics, random_state=0)

topics = nmf.fit_transform(tfidf)
top_n_words = 5
# On scikit-learn < 1.0, use vectorizer.get_feature_names() instead
feature_names = vectorizer.get_feature_names_out()
t_words, word_strengths = {}, {}
for t_id, t in enumerate(nmf.components_):
    top_idx = t.argsort()[:-top_n_words - 1:-1]   # indices of the strongest words in this topic
    t_words[t_id] = [feature_names[i] for i in top_idx]
    word_strengths[t_id] = t[top_idx]
t_words

 

0 [u’men women’, u’relative men women’, u’relative men’, u’proposal men’, u’proposal men women’]
1 [u’regardless sex’, u’right regardless’, u’right regardless sex’, u’except military’, u’regardless sex except military’]
2 [u’balancing budget’, u’debt reduction’, u’balancing budget debt reduction’, u’balancing budget debt’, u’budget debt’]
3 [u’prayer public schools’, u’public schools’, u’prayer public’, u’schools institutions’, u’prayer public schools institutions’]
4 [u’right vote’, u’age older’, u’years age older’, u’right vote citizens’, u’vote citizens’]
5 [u’apportionment state’, u’apportionment state legislatures’, u’state legislatures’, u’appointment state’, u’appointment state legislatures’]
6 [u’right life’, u’abortion right life’, u’abortion right’, u’prohibit abortion right’, u’prohibit abortion right life’]
7 [u’prayer public buildings’, u’public buildings’, u’prayer public’, u’public buildings places’, u’buildings places’]
8 [u’popular senators’, u’providing popular senators’, u’providing popular’, u’popular senators popular’, u’popular senators popular senators’]
9 [u’upon god’, u’reliance upon’, u’reliance upon god’, u’upon god governmental’, u’reliance upon god governmental’]
10 [u’representation district columbia’, u’representation district’, u’district columbia’, u’people district columbia’, u’people district’]
11 [u’direct popular’, u’direct popular vote’, u’woman suffrage’, u’popular vote’, u’provide direct’]
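
If you want to tag each amendment with its dominant theme, taking the largest weight per row of the topics matrix does the trick; a quick sketch (dominant_topic is just an illustrative column name):

# Each row of topics holds one amendment's weights over the 12 themes
df['dominant_topic'] = topics.argmax(axis=1)
df['dominant_topic'].value_counts()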

From this we can see that woman suffrage, the popular vote, religion, abortion, and debt reduction are all big themes here, and it seems they have been for a long time.

To satisfy my curiosity, I applied the same logic as above, but split the data into 30-year buckets.
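
If you want to reproduce that, the gist is to cut the data into 30-year windows and re-run the ranking per window. A rough sketch, assuming your copy of the data has a numeric year column named 'year' (check df.columns for the exact name):

# Bucket amendments into 30-year windows and rank keywords within each window
df['year'] = pd.to_numeric(df['year'], errors='coerce')
df['bucket'] = (df['year'] // 30) * 30

for bucket, group in df.dropna(subset=['year']).groupby('bucket'):
    vec = TfidfVectorizer(analyzer='word', ngram_range=(2, 4))
    matrix = vec.fit_transform(group['cleanedText'])
    top = rank_words(terms=vec.get_feature_names_out(), feature_matrix=matrix)
    print(int(bucket), list(top['term'][:5]))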

[Figure: number of proposed amendments over time and the top keywords per 30-year bucket]

In the figure above, we can see the number of amendments throughout our history and the important keywords in those 30-year buckets. I’ve also highlighted the terms that appeared in more than one 30-year bucket. And there you have it – US history in keywords!

Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
