NOTE: THIS WAS DONE USING PYTHON 2.7. PLEASE REFER TO THE UPDATED POST HERE.

A lot of times we need to dig through hundreds of lines of text, whether that’s research journals or other large bodies of writing. A great way to speed up this process is a technique called text summarization. As the name suggests, text summarization condenses long text into bite-sized pieces based on the approach we choose. Needless to say, being able to implement even a simple text summarization algorithm is crucial if you work with text data.

In addition to the benefit of condensing large texts, this technique can even reduce human bias. Think about the last time you were asked to summarize a book or a story – there’s always a degree of subjectivity. Letting the computer decide based on mathematical features helps remove some of this bias. We could use text summarization to condense things like news articles, survey responses, and more.

There are two main approaches to text summarization: extractive and abstractive. With the extractive approach, we use text features, e.g. syntax, word frequencies, word weights, etc., to pick out the most important pieces of the original text. With the abstractive approach, we use more advanced modeling techniques, e.g. k-means or neural networks, to generate a summary. For the purposes of this post, we will focus on the extractive method.

A common model used in the extractive approach is TextRank. This algorithm works in much the same way Google ranks the top results for your query; in fact, it is built on the same PageRank algorithm we’ll call below. In a nutshell, this is how the algorithm works:

  1. Read text from source
  2. Split paragraphs of texts by sentence
  3. Clean text (remove numbers, punctuation, stop-words)
  4. Tokenize sentences
  5. Compute the rank of words or sentences
  6. Select top words or sentences

In this post, we’ll see a few different approaches – all based on TextRank. We will first implement it step by step using two different methods. I always like to implement these algorithms the long way because it gives us a chance to really understand what is going on and fine-tune as needed. Then, we will use the ready-made version from the gensim package. Literally two lines of code, but I find that it is always useful and sometimes convenient to implement these algorithms by hand.

Let’s import the libraries we will need to do this.

from gensim.summarization.summarizer import summarize
import re
import nltk
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
import string
from heapq import nlargest
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

Text sample we will use throughout this tutorial:

text = '''This introduction aims to tell the story of how we
put words into computers. It is part of the story of the field of
natural language processing (NLP), a branch of artificial
intelligence. It targets a wide audience with a basic
understanding of computer programming, but avoids a detailed
mathematical treatment, and it does not present any algorithms.
It also does not focus on any particular application of NLP such
as translation, question answering, or information extraction.
The ideas presented here were developed by many researchers over
many decades, so the citations are not exhaustive but rather direct
the reader to a handful of papers that are, in the author's view,
seminal. After reading this document, you should have a general
understanding of word vectors (also known as word embeddings): why
they exist, what problems they solve, where they come from, how
they have changed over time, and what some of the open questions
about them are. Readers already familiar with word vectors are
advised to skip to Section 5 for the discussion of the most recent
advance, contextual word vectors'''

PS. You should read this paper in full anyway: https://arxiv.org/pdf/1902.06006.pdf

Method 1: Word Embeddings

Let’s now implement this step by step. For this scenario we will use word embeddings. Word embeddings are vector representations of a given word. They help us capture the similarity between, say, sentences like “I am happy” vs. “I am glad”, and they come in very handy when we want to convert words into vectors. You can find plenty of pre-trained word embeddings out there, so use the one you’re most comfortable with; Wikipedia, Google, and others offer word embeddings ready to use.

As you might have guessed, the reason we want to use word embeddings for text summarization is so we can measure the similarities and differences among the sentences in the text and assign weights accordingly.
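To make this concrete, here is a minimal sketch of how embeddings capture similarity. The tiny 3-dimensional vectors are made-up numbers purely for illustration; real embeddings have hundreds of dimensions.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

#Toy 3-dimensional "embeddings": made-up numbers purely for illustration.
toy_embeddings = {
    'happy': np.array([0.9, 0.1, 0.3]),
    'glad':  np.array([0.85, 0.15, 0.35]),
    'sad':   np.array([-0.8, 0.2, 0.1]),
}

#Words with similar meanings point in similar directions, so their cosine similarity is high.
print(cosine_similarity([toy_embeddings['happy']], [toy_embeddings['glad']]))  #close to 1
print(cosine_similarity([toy_embeddings['happy']], [toy_embeddings['sad']]))   #much lower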

Let’s start by splitting the paragraph into sentences as described in the introduction.

#Split paragraph into sentences. We want to know how similar each sentence is with each other.
sentences = sent_tokenize(text)
sentences

[‘This introduction aims to tell the story of how we put words into computers.’,
‘It is part of the story of the field of natural language processing (NLP), a branch of artificial intelligence.’,
‘It targets a wide audience with a basic understanding of computer programming, but avoids a detailed mathematical treatment, and it does not present any algorithms.’,
‘It also does not focus on any particular application of NLP such as translation, question answering, or information extraction.’,
“The ideas presented here were developed by many researchers over many decades, so the citations are not exhaustive but rather direct the reader to a handful of papers that are, in the author’s view, seminal.”,
‘After reading this document, you should have a general understanding of word vectors (also known as word embeddings): why they exist, what problems they solve, where they come from, how they have changed over time, and what some of the open questions about them are.’,
‘Readers already familiar with word vectors are advised to skip to Section 5 for the discussion of the most recent advance, contextual word vectors’]

Woo!

Now, let’s start cleaning our sentences.
First, let’s load in the stopwords list from NLTK.

#load stopwords
stop = nltk.corpus.stopwords.words('english')

#Pre-process your text. Remove punctuation, special characters, numbers, etc., since the only
#thing we care about is the actual words in the text.
clean_sentences = [s.translate(None, string.punctuation) for s in sentences]
clean_sentences  = [s.translate(None, string.digits) for s in clean_sentences]
#lowercase
clean_sentences = [s.lower() for s in clean_sentences]

#remove stopwords; for text summarization purposes,
#these words add no value to word ranking.
def remove_stopwords(sentence):
    #expects a list of tokens, returns the filtered sentence as a single string
    filtered_sentence = " ".join([i for i in sentence if i not in stop])
    return filtered_sentence
clean_sentences = [remove_stopwords(s.split()) for s in clean_sentences]
clean_sentences

[‘introduction aims tell story put words computers’,
‘part story field natural language processing nlp branch artificial intelligence’,
‘targets wide audience basic understanding computer programming avoids detailed mathematical treatment present algorithms’,
‘also focus particular application nlp translation question answering information extraction’,
‘ideas presented developed many researchers many decades citations exhaustive rather direct reader handful papers authors view seminal’,
‘reading document general understanding word vectors also known word embeddings exist problems solve come changed time open questions’,
‘readers already familiar word vectors advised skip section discussion recent advance contextual word vectors’]

Good!
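One note before moving on: the two-argument form of str.translate used above only works in Python 2 (hence the note at the top of this post). If you’re on Python 3, a minimal equivalent of the same cleaning step, assuming the same sentences list and remove_stopwords function from above, would look like this:

import string

#Python 3: str.translate takes a table built with str.maketrans instead of (None, chars_to_delete)
remove_punct_digits = str.maketrans('', '', string.punctuation + string.digits)
clean_sentences = [s.translate(remove_punct_digits).lower() for s in sentences]
clean_sentences = [remove_stopwords(s.split()) for s in clean_sentences]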

Okay, now let’s load in the word embeddings.

word_embeddings = {}
file_ = open('word_embeddings.txt')
for line in file_:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float')
    word_embeddings[word] = coefs
file_.close()
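Note that I’m assuming a GloVe-style plain-text file here: one word per line followed by its vector components, separated by spaces. If your embedding file uses a different format, adjust the parsing accordingly. A tiny, hypothetical sample of what the loop above expects:

#Hypothetical sample of the assumed file format (made-up numbers).
sample_lines = [
    "happy 0.41 -0.12 0.33",
    "glad 0.39 -0.10 0.35",
]
sample_embeddings = {}
for line in sample_lines:
    values = line.split()
    sample_embeddings[values[0]] = np.asarray(values[1:], dtype='float')
print(sample_embeddings['happy'])  #[ 0.41 -0.12  0.33]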

Then, let’s compute a vector for each sentence by averaging the embeddings of its words.

sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        #average the word vectors of the sentence (assumes 200-dimensional embeddings)
        vector = sum([word_embeddings.get(w, np.zeros((200,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        vector = np.zeros((200,))
    sentence_vectors.append(vector)

Finally, let’s compute the similarities between sentences, rank the sentences based on those similarities, and show the text summary.
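Under the hood, this step builds a graph where every sentence is a node and the edge weights are the cosine similarities between sentence vectors, then runs PageRank on that graph: sentences that are similar to many other sentences accumulate higher scores. Here is a toy sketch with made-up numbers, using the same networkx calls we use below, just to show the idea:

#Toy 3x3 similarity matrix (made-up values): "sentence" 0 is similar to both others,
#so it should end up with the highest PageRank score.
toy_sim = np.array([[0.0, 0.8, 0.7],
                    [0.8, 0.0, 0.1],
                    [0.7, 0.1, 0.0]])
toy_graph = nx.from_numpy_matrix(toy_sim)
print(nx.pagerank(toy_graph))  #node 0 gets the highest score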

#Compute sentence similarity, initialize with zeros
similarity_matrix = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            similarity_matrix[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,200), sentence_vectors[j].reshape(1,200))[0,0]

#Build the similarity graph and rank sentences with PageRank
sim_graph = nx.from_numpy_matrix(similarity_matrix)
scores = nx.pagerank(sim_graph)

#Sentence Ranking
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
#Choose desired number of sentences
for i in range(3):
    print(ranked_sentences[i][1])
  • Summary: “After reading this document, you should have a general understanding of word vectors (also known as word embeddings): why they exist, what problems they solve, where they come from, how they have changed over time, and what some of the open questions about them are. It also does not focus on any particular application of NLP such as translation, question answering, or information extraction. It targets a wide audience with a basic understanding of computer programming, but avoids a detailed mathematical treatment, and it does not present any algorithms.”

Woo hoo!

Method 2: Weighted Frequencies

In this approach, we will compute weighted frequencies, similar to a weighted average. The weighting factor will be based on the frequency of the most common term in the text.
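For example (toy counts of my own): if the most frequent word in the text appears 5 times, a word appearing 5 times gets a weight of 1.0 and a word appearing 2 times gets 2/5 = 0.4. A quick sketch:

#Toy example of weighted frequencies (made-up counts for illustration).
word_counts = {'word': 5, 'vectors': 2, 'nlp': 1}
maximum_frequency = max(word_counts.values())  #5
weighted = {w: c / float(maximum_frequency) for w, c in word_counts.items()}
print(weighted)  #{'word': 1.0, 'vectors': 0.4, 'nlp': 0.2} (key order may vary)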

The beginning will be very similar to what we did in Method 1.

#Split paragraph into sentences. We want to know how similar each sentence is with each other.
sentences = sent_tokenize(text)

#load stopwords
stop = nltk.corpus.stopwords.words('english')

clean_sentences = [s.translate(None, string.punctuation) for s in sentences]
clean_sentences = [s.translate(None, string.digits) for s in clean_sentences]
#lowercase
clean_sentences = [s.lower() for s in clean_sentences]
clean_sentences = [remove_stopwords(s.split()) for s in clean_sentences]

Now, let’s compute the frequency of each word in our sentences.

word_frequencies = {}
for i in range(len(clean_sentences)):
    for word in nltk.word_tokenize(clean_sentences[i]):
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

Then, let’s find the maximum frequency and divide each word’s frequency by it to get the weighted frequency.

maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    #divide by a float so Python 2 integer division doesn't truncate the weights to 0
    word_frequencies[word] = word_frequencies[word]/float(maximum_frequency)

Finally, let’s apply the computed frequencies to our UNCLEANED sentences and then show our summary.

#Apply scores to each UNCLEANED SENTENCE
sentence_scores = {}
for sent in sentences:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            #only score sentences shorter than 30 words
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

#Choose number of sentences you want in your summary
summary_sentences = nlargest(4, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
  • Summary: “Readers already familiar with word vectors are advised to skip to Section 5 for the discussion of the most recent advance, contextual word vectors It also does not focus on any particular application of NLP such as translation, question answering, or information extraction. It targets a wide audience with a basic understanding of computer programming, but avoids a detailed mathematical treatment, and it does not present any algorithms. This introduction aims to tell the story of how we put words into computers.”

Very similar to method 1! Gucci!

Method 3: Gensim

print(summarize(text))
  • Summary: “After reading this document, you should have a general understanding of word vectors (also known as word embeddings): why they exist, what problems they solve, where they come from, how they have changed over time, and what some of the open questions about them are.”

Literally, that is it! If you do choose this method, PLEASE go over the documentation. There are some very important parameters you can change to fit your needs. I see this all the time: people skip the docs and don’t realize they can fine-tune these functions – don’t be that way! From the documentation:

Parameters:
  • text (str) – Given text.
  • ratio (float, optional) – Number between 0 and 1 that determines the proportion of the number of sentences of the original text to be chosen for the summary.
  • word_count (int or None, optional) – Determines how many words the output will contain. If both parameters are provided, the ratio will be ignored.
  • split (bool, optional) – If True, a list of sentences will be returned. Otherwise, joined strings will be returned.
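For instance, using just the parameters documented above, you could keep roughly a third of the sentences, cap the summary at a word count, or get the sentences back as a list:

print(summarize(text, ratio=0.3))       #keep roughly 30% of the sentences
print(summarize(text, word_count=50))   #cap the summary at about 50 words
print(summarize(text, split=True))      #return a list of sentences instead of one string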

As you can see all three methods gave us reasonable and similar results!

All the Code!

Here’s all the code in one snippet!

#Import libraries
from gensim.summarization.summarizer import summarize
import re
import nltk
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
import string
from heapq import nlargest
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

#Gensim
text = '''This introduction aims to tell the story of how we put words into computers. It is part of the story of the field of natural language processing (NLP), a branch of artificial intelligence. It targets a wide audience with a basic understanding of computer programming, but avoids a detailed mathematical treatment, and it does not present any algorithms. It also does not focus on any particular application of NLP such as translation, question answering, or information extraction. The ideas presented here were developed by many researchers over many decades, so the citations are not exhaustive but rather direct the reader to a handful of papers that are, in the author's view, seminal. After reading this document, you should have a general understanding of word vectors (also known as word embeddings): why they exist, what problems they solve, where they come from, how they have changed over time, and what some of the open questions about them are. Readers already familiar with word vectors are advised to skip to Section 5 for the discussion of the most recent advance, contextual word vectors'''
print(summarize(text))

#Step by Step
stop = nltk.corpus.stopwords.words('english')
#Split paragraph into sentences. We want to know how similar each sentence is with each other.
sentences = sent_tokenize(text)
sentences

#Pre-process your text. Remove punctuation, special characters, numbers, etc., since the only
#thing we care about is the actual words in the text.
clean_sentences = [s.translate(None, string.punctuation) for s in sentences]
clean_sentences  = [s.translate(None, string.digits) for s in clean_sentences]
#lowercase
clean_sentences = [s.lower() for s in clean_sentences]
clean_sentences

#remove stopwords as for text summarization purposes, these words add no value to word ranking.
def remove_stopwords(sentence):
    filtered_sentence = " ".join([i for i in sentence if i not in stop])
    return filtered_sentence
clean_sentences = [remove_stopwords(s.split()) for s in clean_sentences]
clean_sentences

#At this point we can use whatever method we fancy to rank words or sentences.

#Method 1: EMBEDDINGS | Sentence Focused (use word embeddings of your choice)
# Extract word vectors
word_embeddings = {}
file_ = open('word_embeddings.txt')
for line in file_:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
file_.close()

#Let's compute weights for each word based on the loaded embeddings.
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        vector = sum([word_embeddings.get(w, np.zeros((200,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        vector = np.zeros((200,))
    sentence_vectors.append(vector)

#Compute sentence similarity, initialize with zeros
similarity_matrix = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            similarity_matrix[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,200), sentence_vectors[j].reshape(1,200))[0,0]

#Build the similarity graph and rank sentences with PageRank
sim_graph = nx.from_numpy_matrix(similarity_matrix)
scores = nx.pagerank(sim_graph)

#Sentence Ranking
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
#Choose desired number of sentences
for i in range(3):
    print(ranked_sentences[i][1])

#Method 2: Weighted Word Frequencies | Word Focused
#Compute word frequencies for each sentence
word_frequencies = {}
for i in range(len(clean_sentences)):
    for word in nltk.word_tokenize(clean_sentences[i]):
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

#Find max frequency in text and compute the weighted frequency based on the maximum frequency.
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/float(maximum_frequency)

#Apply scores to each UNCLEANED SENTENCE
sentence_scores = {}
for sent in sentences:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

#Choose number of sentences you want in your summary
summary_sentences = nlargest(4, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
