It is not uncommon, or at least it hasn’t been in my years in data science, for a stakeholder to come to you with the most impossible, unrealistic data need, question, or idea. You go back to your desk, scratch your head, and think to yourself, “How the hell am I going to do this?”

Actual scene between data scientists and stakeholder.

One particular example I can remember was being asked to do some sort of sentiment analysis on some random text data: no labels, no nothing. Now, most machine learning folk will tell you sentiment analysis is actually quite complex. Semantic context is not an easy thing to solve for. Take “This was terrible” vs. “I am terribly sorry”: the root word “terrible” carries a very different meaning in each sentence. There are metrics like polarity that attempt to assign a score to positive or negative words, but the example I gave above would fail terribly – see what I did there 😂. I am sorry, I will stop now.
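
To make that failure concrete, here is a minimal sketch of a word-level polarity score. The lexicon and its scores are made up for illustration (they do not come from any real polarity library), and `polarity` is a hypothetical helper name.

```python
import re

# Toy lexicon: the words and scores are invented for illustration only.
TOY_LEXICON = {
    "terrible": -1.0,
    "terribly": -1.0,
    "awful": -1.0,
    "good": 0.5,
    "great": 1.0,
}

def polarity(text: str) -> float:
    """Average the lexicon scores of the known words in `text` (0.0 if none)."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

complaint = polarity("This was terrible")
apology = polarity("I am terribly sorry")
print(complaint, apology)
```

Both sentences end up with the same negative score, even though one is a complaint and the other is an apology. A word-level score has no way of seeing that difference.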

There are other, more complex and more accurate methods to compute a text’s sentiment, but they require labels. Any supervised machine learning method you use will require labels. So, now you’ve reached a conundrum. You have a few options. You can do a quick polarity score and call it a day. You can ask for time to create labels (this may take months). You can try saying no. Or, you can try to get creative. How can you measure a person’s intent or sentiment based on typed text? Keep in mind, this is no trivial task and it won’t be perfect, but it is possible and can get us a bit closer to our goal.

Words can carry a lot of meaning on their own; parts of speech even more so. In fact, we express or change the meaning of a sentence based on the verb, the voice, the positioning of words, etc. It’s the purpose of any language after all: to express what we feel, want or need. Now, in the age of the internet, even emojis carry a lot of meaning as well.

A few ideas that quickly come to mind are:

  • Emojis. We could scan for emojis and use them in combination with parts of speech to paint a picture of a user’s intent.
  • Capitalization. Words typed in all caps tend to signify anger or frustration.
  • Grammar and punctuation. Are there a lot of misspellings or exclamation marks? These could point to frustration, shock or surprise as well.
  • We could take a closer look at the passive voice in a sentence.
  • We could look into mood and modality.
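
The first three ideas are cheap to try before touching any grammar. Below is a minimal sketch using only the standard library. The emoji character ranges are a rough assumption and will miss some emojis, and `surface_features` is a hypothetical helper name.

```python
import re

# Rough emoji ranges only; this is an assumption and does not cover every emoji.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def surface_features(text: str) -> dict:
    """Count a few surface signals: emojis, exclamation marks, all-caps words."""
    words = re.findall(r"[A-Za-z']+", text)
    # Ignore one-letter words so that "I" or "A" does not count as shouting.
    shouted = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "emojis": len(EMOJI_RE.findall(text)),
        "exclamations": text.count("!"),
        "shouted_ratio": len(shouted) / len(words) if words else 0.0,
    }

# Made-up review text; \U0001F621 is the "pouting face" emoji.
features = surface_features("The food was COLD and the wait was AWFUL!!! \U0001F621")
print(features)
```

None of these counts means much on its own, but together with the grammatical rules they can help flag the reviews worth reading first.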

Let’s focus on those last two bullet points. For the sake of this exercise, let’s assume we are tasked with presenting a stakeholder with customers’ feelings towards restaurants in the area, e.g. whether they liked them, whether they would come back, what the problems were, etc. (Source here).

Taking a trip down memory lane, you’ll remember that when we use the passive voice the subject is acted upon by some other performer of the verb. Often, the passive voice does a better job of presenting an idea or giving us more detail about an event, e.g. “The code broke” vs. “The code was broken by Alex.” Knowing this, the passive voice can be a good start in trying to decipher someone’s intent based on text alone. Let’s apply this to our restaurant data. spaCy has a neat class called Matcher where you can specify any token-level grammatical rule you’d like to search for. The passive voice is shaped as such: [subject] + [a form of “to be”, e.g. is/was or have been] + [past participle] + [optional “by” + performer].

 
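#Passive Voice
import spacy
from spacy.matcher import Matcher

# Assumes an English pipeline with a tagger and lemmatizer is installed
# (e.g. en_core_web_sm) and that df is a pandas DataFrame with a 'Review' column.
nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
# is/was + optional adverbs + past participle, e.g. "was quickly served"
pass_is_was = [{"LOWER": {"IN": ["is", "was"]}}, {"POS": "ADV", "OP": "*"}, {"TAG": "VBN"}]
# has/have/had been + optional adverbs + past participle
pass_have_been = [{"LEMMA": "have"}, {"LOWER": "been"}, {"POS": "ADV", "OP": "*"}, {"TAG": "VBN"}]
# spaCy v3 takes a list of patterns (v2 used matcher.add(name, None, pattern))
matcher.add("pass_is_was", [pass_is_was])
matcher.add("pass_have_been", [pass_have_been])
for review in df['Review'].head(200):
    doc = nlp(review)
    for match_id, start, end in matcher(doc):
        # Get the string representation of the rule name
        string_id = nlp.vocab.strings[match_id]
        # The matched tokens plus up to 6 tokens of context on each side;
        # max() keeps a negative start from wrapping around to the end of the doc
        span = doc[max(start - 6, 0):end + 6]
        print(match_id, string_id, start, end, span.text)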
 

Already, by just looking at the rules of passive voice, we can start to paint a picture of these reviews. We could later on even try to match passive voice to the actual numerical rating, if available, and see if there are any possible correlations.
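
As a rough sketch of that last idea, suppose we have already recorded, per review, whether any passive rule fired, and that a numerical rating exists. Both lists below are made-up values for illustration, not taken from the restaurant data, and `mean_rating_by_flag` is a hypothetical helper.

```python
from statistics import mean

def mean_rating_by_flag(flags, ratings):
    """Return (mean rating where flag is True, mean rating where flag is False).

    A side is None when no review falls into that group.
    """
    with_flag = [r for f, r in zip(flags, ratings) if f]
    without_flag = [r for f, r in zip(flags, ratings) if not f]
    return (
        mean(with_flag) if with_flag else None,
        mean(without_flag) if without_flag else None,
    )

# Made-up values: one entry per review.
has_passive = [True, False, True, False, False]
ratings = [2, 5, 1, 4, 3]
print(mean_rating_by_flag(has_passive, ratings))
```

A gap between the two averages would only be a hint worth investigating, not proof that passive voice signals unhappy customers.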

Let’s say we now wanted to add another tool on top of passive voice. At the end of the day, looking at just the passive voice will be time consuming and still pretty rudimentary. We could attempt to look at mood and modality. If you recall your English class days, mood refers to the verb form we use to express a fact, a command, a question, a condition or a possibility. Modality, on the other hand, refers to the semantic side: the degree of possibility, necessity or obligation that a sentence expresses. So, by creating some quick rules we can get a good idea as to whether the reviewer’s intent was that of need, possibility, etc. I created a quick infographic with some examples.

Examples of modal sentences.

Let’s create some rules and apply this to our data.

#Modality
matcher = Matcher(nlp.vocab)
# Exact (case-insensitive) token match; the tagger must also label the token as a modal (MD)
modals = [{"LOWER": {"IN": ["can", "could", "would", "might", "may", "must", "will"]}, "TAG": "MD"}]
matcher.add("modals", [modals])  # spaCy v3 signature: a list of patterns
for review in df['Review']:
    doc = nlp(review)
    for match_id, start, end in matcher(doc):
        # Get the string representation of the rule name
        string_id = nlp.vocab.strings[match_id]
        # The matched modal plus up to 6 tokens of context on each side
        span = doc[max(start - 6, 0):end + 6]
        print(match_id, string_id, start, end, span.text)
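
Printing every match gets noisy fast, so one option is to tally the matched modals by the kind of meaning they usually signal. The mapping below is my own simplified assumption (real usage is fuzzier, and "can" or "will" can fall into more than one bucket), and `tally_modalities` is a hypothetical helper that takes the matched modal strings.

```python
from collections import Counter

# Assumed, simplified mapping from modal verb to the meaning it often signals.
MODAL_CATEGORY = {
    "can": "ability",
    "could": "possibility",
    "may": "possibility",
    "might": "possibility",
    "must": "necessity",
    "will": "intention",
    "would": "condition",
}

def tally_modalities(matched_modals):
    """Count modality categories for a list of matched modal strings."""
    lowered = (m.lower() for m in matched_modals)
    # Modals outside the mapping (e.g. "shall") are skipped.
    return Counter(MODAL_CATEGORY[m] for m in lowered if m in MODAL_CATEGORY)

# Made-up matches, as if collected from span.text of the modal rule.
counts = tally_modalities(["would", "Will", "could", "might", "shall"])
print(counts)
```

A review set dominated by "possibility" and "condition" reads very differently from one dominated by "necessity", which is the kind of summary a stakeholder can actually use.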

Lastly, let’s try to look at the future. The goal here is to try and see if we can get an idea of whether a person would want to come back. So, let’s create some rules with future tense words and adverbs of time.

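#Future
matcher = Matcher(nlp.vocab)
future_modal = [{"LOWER": "will", "TAG": "MD"}]
# Each dict matches exactly one token, so a multi-word time expression
# needs one dict per word, e.g. "next week" or "oncoming summer"
time_expr = [{"LOWER": {"IN": ["next", "oncoming"]}},
             {"LOWER": {"IN": ["week", "month", "year", "summer", "winter", "autumn", "fall"]}}]
day_after = [{"LOWER": "the"}, {"LOWER": "day"}, {"LOWER": "after"}, {"LOWER": "tomorrow"}]
matcher.add("future_modal", [future_modal])  # spaCy v3 signature: a list of patterns
matcher.add("time_expr", [time_expr, day_after])
for review in df['Review']:
    doc = nlp(review)
    for match_id, start, end in matcher(doc):
        # Get the string representation of the rule name
        string_id = nlp.vocab.strings[match_id]
        # The matched tokens plus up to 6 tokens of context on each side
        span = doc[max(start - 6, 0):end + 6]
        print(match_id, string_id, start, end, span.text)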

So, this did indeed give us an idea, or at least got us closer to knowing whether or not some of these people will be back!
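
A future-tense match alone does not tell “will be back” apart from “will not be back”, so the context window still needs a second look. Here is a minimal sketch of one way to do that with plain regular expressions on the printed context. The word lists and the three labels are my own assumptions, and `return_intent` is a hypothetical helper.

```python
import re

# Assumed cue words; both lists are deliberately small and easy to extend.
NEGATION_RE = re.compile(r"\b(?:not|never|cannot)\b|n['’]t\b", re.IGNORECASE)
RETURN_RE = re.compile(r"\b(?:back|return|again)\b", re.IGNORECASE)

def return_intent(context: str) -> str:
    """Label a context window as 'likely', 'unlikely' or 'unclear' to return."""
    if not RETURN_RE.search(context):
        return "unclear"  # no mention of coming back at all
    return "unlikely" if NEGATION_RE.search(context) else "likely"

# Made-up context windows, as if taken from span.text above.
for text in ["We will definitely be back next month",
             "I won't be going back",
             "They will need to fix the service"]:
    print(return_intent(text), "|", text)
```

This is still a blunt heuristic: it ignores sarcasm, and a negation anywhere in the window flips the label, so treat the output as a shortlist rather than an answer.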

Obviously, all the work we just did is still rudimentary, but when we are out of options, whether because of bad data or a lack of it, parts of speech can still offer us a wealth of knowledge: enough to paint a picture of the problem at hand.

Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
