Instead of a lengthy tutorial, I hope this post serves as a quick reference for those who work with text data on almost a daily basis. Check back for updates; I will keep posting new recipes here as I develop them!

Strip Accents:

On many occasions, I have had to work with data in Spanish or Portuguese, and given the accent marks in these languages, further manipulation may be needed. Below is a quick function to remove accents from text. It decomposes each accented character into its base character plus a combining mark (Unicode NFD normalization) and then drops the combining marks.

#---Import libraries
import unicodedata

def remove_accents(row):
    # Decompose characters into base character + combining marks (NFD),
    # then keep everything that is not a combining mark (category "Mn")
    return "".join(c for c in unicodedata.normalize("NFD", row) if unicodedata.category(c) != "Mn")

#example
df['column'].apply(remove_accents)
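
Here is what it does to a single string (the input below is just an illustrative example):

#quick check
remove_accents("São Paulo")
#returns 'Sao Paulo'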


Remove HTML Tags:

I often have to pull online data and it is not rare, actually quite the opposite, that it comes with undesired characters, like HTML tags. We can easily remove these using regular expressions. This function takes a string and uses a regular expression to remove anything between angle brackets, i.e., the tags themselves.

#---Import libraries
import re

def removeHTMLtags(row):
    # Non-greedy match of anything between < and >
    encodings_to_remove = re.compile(r"<.*?>")
    clean_row = re.sub(encodings_to_remove, "", row)
    return clean_row

#example
df['column'].apply(removeHTMLtags)
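
For instance, on a made-up snippet of markup:

#quick check
removeHTMLtags("<p>Hello <b>world</b>!</p>")
#returns 'Hello world!'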


Sanitize:

Sometimes I have run into situations where removing HTML tags or punctuation, etc., won’t catch everything, and this is where this little function comes in. It simply finds these funky characters and substitutes them with blanks using plain string replacements. You can add more replacements here; these are just the common characters I’ve run into.

#---Import libraries
import re

def sanitize(row):
    # Strip leftover HTML entities and invisible whitespace characters
    sanitizedText = row.replace("&nbsp;", "")
    sanitizedText = sanitizedText.replace("&amp;", "")
    sanitizedText = sanitizedText.replace("\xa0", "")
    sanitizedText = sanitizedText.replace("\t", "")
    return sanitizedText

#example
df['column'].apply(sanitize)


Remove Stopwords:

You can easily use stopwords from NLTK, gensim, etc., but what is really crucial when cleaning your text data is to fit it to your needs; mainly, removing words that you know add nothing to your specific dataset context. For example, the word “computer” is not a stop word by definition, but perhaps it is common in YOUR dataset, carries a different meaning there, and it would make sense to remove it. Here’s a recipe to do that.

#---Import libraries
#Import stopwords with nltk (run nltk.download('stopwords') once if you don't have them yet).
from nltk.corpus import stopwords
stop = stopwords.words('english')

def stop_words(row):
    # Keep only the words that are not in the stopword list
    return " ".join(word for word in row.split() if word not in stop)

#example
df['no_stopwords'] = df['column'].apply(stop_words)
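
And, going back to the “computer” example, extending the list with your own words is a one-liner. A minimal sketch; the words added below are placeholders, not a recommendation:

#---add domain-specific stopwords (placeholder words; swap in your own)
stop = stop + ["computer", "data"]
df['no_stopwords'] = df['column'].apply(stop_words)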


Numbers and Punctuation: 

In order to prepare your text for most text modeling approaches, you’ll need to remove punctuation and numbers from your text data set. Check out the function below.

#---Import libraries
import re

def removePunctNum(row):
    # Strip punctuation (anything that's not a word character or whitespace)
    cleanedColumn = re.sub(r"[^\w\s]", "", row)
    # Strip digits
    cleanedColumn = re.sub(r"\d+", "", cleanedColumn)
    return cleanedColumn

#example
df['cleaned_text'] = df['column'].apply(removePunctNum)
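
A quick sanity check on a single string (input made up for illustration):

#quick check
removePunctNum("Hello, world! 123")
#returns 'Hello world ' (punctuation and digits gone; leftover spaces are harmless for most tokenizers)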


Happy Coding!
P.S. All these functions can be grouped into one big cleaning function.
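
Here is a minimal sketch of that, assuming the recipes above are already defined and your column holds raw strings (the order below is just one sensible choice):

#---one combined cleaning function
def clean_text(row):
    row = removeHTMLtags(row)
    row = sanitize(row)
    row = remove_accents(row)
    row = removePunctNum(row)
    row = stop_words(row)
    return row

#example
df['clean_text'] = df['column'].apply(clean_text)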

