Text summarization, as the name implies, is a technique for summarizing large pieces of text. Our goal is to generate summaries that make sense – remember summarizing papers and books in high school? Yes, we have to get a machine to do that. This is far from an easy task; it is actually an area of active research. However, there are some tools in our arsenal we can use to make it less daunting.

There are two main approaches:

  • We can use keyword extraction techniques, such as TextRank, to extract the main keywords of our text – think of it, maybe, as adding sticky notes to pages as you summarize a book. We’ll get into the details of this later.
  • We can use training data to teach a model to generate new sentences, e.g. via some sort of neural network.

Here, I will expand on the TextRank approach from my previous article, and we will discuss evaluation techniques, namely Recall-Oriented Understudy for Gisting Evaluation (ROUGE).

Click here to read articles discussing TextRank and ROUGE.

These are the libraries we will need for this demo.

from gensim.summarization.summarizer import summarize
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
import string
from heapq import nlargest
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

Let’s define some functions that will come in handy.

#Requires the NLTK data packages 'stopwords' and 'punkt':
#nltk.download('stopwords'); nltk.download('punkt')
stop_words = nltk.corpus.stopwords.words('english')
def remove_stopwords(tokens):
    """
    Takes a list of tokens and removes stopwords,
    returning the surviving tokens joined as a string.
    """
    filtered_sentence = " ".join([i for i in tokens if i not in stop_words])
    return filtered_sentence
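
As a quick sanity check (a made-up sentence, not from the article):

print(remove_stopwords("this is the best snail cream".split()))
#'best snail cream'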

def sanitize_text(sentence):
    """
    Takes in a string and cleans it up.
    """
    sentence = sentence.lower()
    #Replace all non-alphanumeric characters with spaces
    sentence = re.sub(r'[^a-zA-Z0-9\s]', ' ', sentence)
    return sentence

def generate_ngrams(sentence, n):
    """
    Takes in a string and the ngram order n
    (1 = unigrams, 2 = bigrams, and so on).
    """
    #Clean text
    sentence = sanitize_text(sentence)
    #Split sentence into tokens
    tokens = [token for token in word_tokenize(sentence) if token != ""]
    #Create ngrams by zipping n shifted copies of the token list
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

To test this out, I am using an article from Byrdie about why we should use snail mucin (literally, snail slime) on our faces. What can I say? I am really into skincare.

Before we keep going, I want to mention that there are a lot of packages out there with ready-to-go TextRank implementations.

gensim summary

Here’s how you would do it if you were using Gensim:

print(summarize(text))

“Snail mucin is a mega multi-tasker, with the ability to do everything from moisturize to boost the production of collagen, the protein responsible for strong, youthful skin. Stimulates collagen production: Because snail mucin is a stress-induced excretion, it’s comprised of ingredients meant to repair or protect from injury,” Lain explains. There aren’t any well-documented side effects of snail mucin, says Desai-Solomon, though both dermatologists point out that, as with any ingredient, people can be allergic to it. According to Desai-Solomon, many people like using snail mucin for moisturizing purposes, in which case she suggests opting for a night cream that contains it.”
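
One note: by default, summarize keeps roughly 20% of the sentences, and it also accepts ratio or word_count arguments if you want to control the summary length, e.g.:

print(summarize(text, word_count=50))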

Pretty cool, right? It’s that simple. To really understand how this works, we are going to implement this ourselves.

text rank implementation

Let’s first talk about how TextRank works. In the simplest terms, we will give each sentence in our text sample a score and then rank the sentences from most to least important. Imagine we have a set number of sentences; let’s call this number N. Let’s also assume there’s some sort of relationship between these sentences, a score of sorts.


In order to rank the sentences by importance, we need to compute said scores. To do this, let’s turn our sentences into vectors and create a matrix, where each element denotes the similarity between two sentence-vectors. A good way to compute similarity is cosine similarity, which is the cosine of the angle between two vectors: if the vectors are identical, the cosine is 1.0, and if the vectors are orthogonal, the cosine is 0.0. We will fill our matrix with these values and then let the ranking algorithm turn them into sentence scores.
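
To make that concrete, here is a toy check with two hand-picked vectors, using the cosine_similarity we imported earlier:

a = np.array([[1.0, 0.0]])
b = np.array([[0.0, 1.0]])
print(cosine_similarity(a, a)[0, 0])  #1.0, identical vectors
print(cosine_similarity(a, b)[0, 0])  #0.0, orthogonal vectors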


Now that we know how the algorithm works, let’s get our data ready. Step 1 is to tokenize the text by sentences, since it is whole sentences that we want to compare and rank.

article_tokenize = sent_tokenize(text)

For TextRank to be effective, we must have our data as clean as possible; removing stop words and punctuation will suffice. Remember that we don’t care about filler terms – all we care about are the key terms we’ll use to compute similarities and rank the sentences. Hence, step 2 is to clean the text data.

clean_article = [sanitize_text(i) for i in article_tokenize]

Next, let’s remove stop words.

clean_article = [remove_stopwords(s.split()) for s in clean_article]

In order to turn our sentences into vectors, we will use word embeddings. Recall that word embeddings are vector representations of a particular word. Do note that we could have used anything from word frequencies to TF-IDF to do this. You can also use the word embeddings of your choice.

Let’s now load our embeddings.

#Load GloVe-style embeddings: each line holds a word followed by its vector
word_embeddings = {}
with open('word_embeddings.txt') as file_:
    for line in file_:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
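
The code below assumes these vectors are 200-dimensional (note the reshape(1, 200) calls); if your embedding file uses a different dimension, adjust accordingly. A quick check, assuming the word 'the' appears in your embedding file:

print(word_embeddings['the'].shape)  #(200,)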

Next, we will build a vector for each sentence by averaging the embeddings of its words, initialize our similarity matrix with zeros, fill in the nonzero cosine similarities, build a graph from the matrix, and sort the sentences by their scores.

#Build sentence vectors by averaging the word embeddings of each sentence
sentence_vectors = []
for i in clean_article:
    if len(i) != 0:
        #Average the word vectors; the +0.001 avoids division by zero
        vector = sum([word_embeddings.get(w, np.zeros((200,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        vector = np.zeros((200,))
    sentence_vectors.append(vector)

#Cosine similarity matrix: initialize with zeros, then fill in pairwise values
similarity_matrix = np.zeros([len(article_tokenize), len(article_tokenize)])
for i in range(len(article_tokenize)):
    for j in range(len(article_tokenize)):
        if i != j:
            similarity_matrix[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,200), sentence_vectors[j].reshape(1,200))[0,0]

#TextRank graph: run PageRank over the similarity matrix
#(on networkx 3.0+, use nx.from_numpy_array instead of from_numpy_matrix)
sim_graph = nx.from_numpy_matrix(similarity_matrix)
scores = nx.pagerank(sim_graph)
#Sentence Ranking
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(article_tokenize)), reverse=True)

Let’s now print it out and see our summary!
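
The ranking above gives us (score, sentence) pairs, so we still need to pull out the winners. Here I keep the six highest-scoring sentences – the cutoff is an arbitrary choice, so tune it to taste:

#Keep the top six sentences as the extractive summary
summary = [s for _, s in ranked_sentences[:6]]
print(" ".join(summary))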

‘There aren’t any well-documented side effects of snail mucin, says Desai-Solomon, though both dermatologists point out that, as with any ingredient, people can be allergic to it. Moisturizes the skin: According to Lain, snail mucin contains moisturizing agents that work to repair the barrier function of the skin, both locking out irritants from the environment while also simultaneously locking in moisture. If you’re looking to use snail mucin as a multi-purpose anti-ager, seek it out in a serum, as these will have a higher concentration of the ingredient. Avoid allergic reactions by testing a small amount of any new product on the inside of your forearm before slathering it all over your face. According to Desai-Solomon, many people like using snail mucin for moisturizing purposes, in which case she suggests opting for a night cream that contains it. Snail mucin is a mega multi-tasker, with the ability to do everything from moisturize to boost the production of collagen, the protein responsible for strong, youthful skin.’

rouge score

A good question after carrying out this algorithm is how to evaluate it. Was it a good summary? We can use something called a ROUGE score. ROUGE scores compare the contents of the summary to the contents of the original text, much like computing recall and precision for non-text data sets. In the context of ROUGE, we compare n-grams between the summary and the original text. Recall is computed as the number of common ngrams divided by the total number of ngrams in the original text. Precision is computed as the number of common ngrams divided by the number of ngrams in the summary.
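
To make the arithmetic concrete, here is a toy example with made-up strings (not from the article):

orig = set(generate_ngrams("snails make great skincare", n=1))  #{'snails', 'make', 'great', 'skincare'}
summ = set(generate_ngrams("great skincare", n=1))              #{'great', 'skincare'}
matches = summ.intersection(orig)
print(len(matches)/len(orig), len(matches)/len(summ))           #recall 0.5, precision 1.0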

That is exactly how we will implement it for our real summary, using Python sets. Let’s start with unigrams.

summary = " ".join(summary)
unigrams_sum = generate_ngrams(summary, n=1)
unigrams_orig = generate_ngrams(text, n= 1)
unigrams_sum = set(unigrams_sum)
unigrams_orig = set(unigrams_orig)

matches = unigrams_sum.intersection(unigrams_orig)
#Recall
recall = float(len(matches)/len(unigrams_orig))
#Precision
precision = float(len(matches)/len(unigrams_sum))
print(recall,precision)

0.41198501872659177, 1.0

Let’s look at bigrams.

bigrams_sum = generate_ngrams(summary, n=2)
bigrams_orig = generate_ngrams(text, n=2)
bigrams_sum = set(bigrams_sum)
bigrams_orig = set(bigrams_orig)

matches = bigrams_sum.intersection(bigrams_orig)
#Recall: shared bigrams over bigrams in the original text
recall = len(matches)/len(bigrams_orig)
#Precision: shared bigrams over bigrams in the summary
precision = len(matches)/len(bigrams_sum)
print(recall, precision)

0.3192771084337349, 0.9695121951219512

I would personally pick the bigram score over the unigram score, mainly because bigrams carry slightly more context; they give a better measure of how much of the original text’s context made it into the summary.

There are other types of ROUGE scores described in the paper linked at the top of this post, but I will let you decide which method is most suitable for your purposes.

Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
