A while ago I saw this article from The Pudding where they ranked rappers based on the size of their song vocabulary. I love rap and its Puerto Rican equivalent, reggaeton, and I wanted to recreate a similar analysis for the genre. Reggaeton has long been dismissed as a “low” quality music genre, with incorrect Spanish, vulgar lyrics and so on, mainly because of the socioeconomic realities of the people who sang and consumed this music. I think it is amazing how massive this genre has become throughout Latin America in recent years, against all odds.

I gathered the data from letras.com and you can download it here. I removed a handful of singers who were mislabeled as reggaetoneros by the website. I also removed artists with fewer than 5 songs and capped the sample at 5 songs per artist; this way new singers won’t be penalized relative to others who have been in the business longer. I ended up with 86 different reggaeton singers.
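As a rough sketch of that filtering step (not the code linked in this post), the function below assumes the scraped data is a list of (artist, lyrics) pairs. The name `select_songs` is hypothetical, and since the post does not say how the 5 songs per artist were picked, the sketch simply keeps the first 5.

```python
from collections import defaultdict

def select_songs(songs, min_songs=5, cap=5):
    """Drop artists with fewer than `min_songs` songs, then keep `cap` songs each.

    `songs` is an iterable of (artist, lyrics) pairs. Keeping the first `cap`
    songs is an assumption; the original selection rule is not stated.
    """
    by_artist = defaultdict(list)
    for artist, lyrics in songs:
        by_artist[artist].append(lyrics)
    return {
        artist: lyrics_list[:cap]
        for artist, lyrics_list in by_artist.items()
        if len(lyrics_list) >= min_songs
    }
```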

Then, I removed all punctuation and digits from the text and tokenized and lemmatized each song.  I added a few extra Puerto Rican terms to nltk’s standard Spanish stop words list. There’s definitely more we could do to this data, but these steps will standardize it enough for our analysis. The full Python code for this process can be found here.
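The cleaning steps can be approximated with the standard library alone. This is only a sketch: the stop-word set below is a tiny illustrative stand-in for nltk’s Spanish list plus the extra Puerto Rican terms, lemmatization is left out because it needs a language model, and `preprocess` is a hypothetical name.

```python
import re

# Tiny illustrative stop-word set (an assumption); the real analysis uses
# nltk's Spanish stop words extended with Puerto Rican terms.
STOP_WORDS = {"la", "el", "de", "que", "y"}

def preprocess(lyrics, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation and digits, tokenize, drop stop words."""
    text = lyrics.lower()
    # Replace punctuation, digits and underscores with spaces.
    # \w is Unicode-aware in Python 3, so accented letters survive.
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    tokens = text.split()
    return [token for token in tokens if token not in stop_words]
```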

In order to quantitatively measure the uniqueness of each artist’s vocabulary, I will use a measure called lexical diversity, which captures the variety of the vocabulary in a piece of text. There are several ways to compute lexical diversity, but in its simplest form it is a type-token ratio (TTR): the number of unique tokens divided by the total number of tokens. For example, the TTR for “Dame mas gasolina, papi dame mas gasolina, como te gusta la gasolina” would be 8/12, that is, 8 unique tokens out of 12 total words.
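To make the ratio concrete, here is a minimal sketch of a TTR function applied to that same line. It tokenizes on word characters and skips stop-word removal and lemmatization, so it mirrors the hand count above rather than the full pipeline; `type_token_ratio` is a hypothetical name.

```python
import re

def type_token_ratio(text):
    """Unique tokens divided by total tokens; 0.0 for empty text."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

line = "Dame mas gasolina, papi dame mas gasolina, como te gusta la gasolina"
print(type_token_ratio(line))  # 8 unique tokens / 12 total tokens
```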

Another way to quantitatively measure text is a readability score, whose formula is specific to each language. Readability measures how easy a piece of text is to read. It can include elements of complexity, familiarity, legibility and typography. Readability formulas usually look at factors like sentence length, syllable density and word familiarity as part of their calculations. There’s a plethora of scores and I computed 3 of them:

  • Fernández Huerta’s readability score
  • Gutiérrez de Polini’s score
  • Szigriszt-Pazos’ perspicuity score

The Huerta formula is usually presented as 206.84 – (0.60 * P) – (1.02 * F), where P = number of syllables per 100 words and F = number of sentences per 100 words. The Gutiérrez de Polini and Szigriszt-Pazos scores have similar shapes to Huerta’s.
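The Huerta formula can be sketched as a small function. It assumes P and F are expressed per 100 words, which is how the formula is commonly stated, and it takes the word, syllable and sentence counts as inputs, since counting Spanish syllables is a separate problem not shown here. The function name is hypothetical.

```python
def fernandez_huerta(n_words, n_syllables, n_sentences):
    """Fernández Huerta readability from raw counts.

    Assumes P = syllables per 100 words and F = sentences per 100 words.
    """
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    p = 100 * n_syllables / n_words
    f = 100 * n_sentences / n_words
    return 206.84 - 0.60 * p - 1.02 * f
```

For instance, a 100-word text with 200 syllables and 5 sentences gives 206.84 – 120 – 5.1 = 81.74.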

lexical diversity

In this sample, I found that Bad Bunny’s Desde el Corazon is the most lexically diverse song while Don Omar’s Luna is the least diverse.  Don Omar’s songs tend to be quite repetitive – mostly the same verse repeated 10 times.


I then ranked all the artists by their average lexical diversity score and Arcangel reigned supreme. For the full list, go here.
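A ranking like this can be sketched by averaging the per-song scores for each artist and sorting. The input shape, a list of (artist, score) pairs, and the name `rank_artists` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def rank_artists(song_scores):
    """Return (artist, average score) pairs, highest average first."""
    by_artist = defaultdict(list)
    for artist, score in song_scores:
        by_artist[artist].append(score)
    averages = {artist: mean(scores) for artist, scores in by_artist.items()}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)
```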


It’s interesting that many of the singers with the most lexically diverse songs are singers who were famous back in the 90s (e.g. Vico C, Tego Calderon, etc.), when the genre was very socio-political and not as pop-like as it is now. Like rap, and as Matt Daniels points out in his article, “the genre has evolved; it has moved away from complex lyricism toward elements traditionally associated with pop music: repetitive song structure and singing (Jon Caramanica recently wrote about this trend for the New York Times, arguing that it was led by Drake, who popularized the rapping-and-singing formula over the past decade).”


As far as readability goes, 47% of the reggaeton songs are very difficult to understand, while 22% are normal or fairly easy to understand. I attribute that 47% not to any grammatical structures or lack thereof, but to the fact that Puerto Rican Spanish is quite a unique dialect, very different from all other Spanish dialects. It has plenty of indigenous words and English-derived words, e.g. parkear (from parking), tripear (from tripping), etc.

Taking a look at readability from the singers’ perspective reveals that Ivy Queen has the most readable songs while Vico C has the least readable songs.


Vico C used to sing almost exclusively about socio-economic and political issues, which suggests that the words he used were likely more complex, whereas Ivy Queen mostly sang about discos and parties.

reggaeton vs. the world, de puerto rico para el mundo

I was interested to see how this genre compared to other Latin American music genres. I had to normalize the number of songs across all three genres, since the reggaeton data outnumbered the others.
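The post does not say exactly how the song counts were normalized. One plausible approach, shown here purely as an assumption, is to randomly downsample every genre to the size of the smallest one. The name `balance_genres` and the fixed seed are illustrative choices.

```python
import random

def balance_genres(songs_by_genre, seed=0):
    """Randomly downsample each genre to the size of the smallest genre.

    `songs_by_genre` maps a genre name to a list of songs. A fixed seed
    keeps the sample reproducible.
    """
    rng = random.Random(seed)
    target = min(len(songs) for songs in songs_by_genre.values())
    return {
        genre: rng.sample(songs, target)
        for genre, songs in songs_by_genre.items()
    }
```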


I ran through the same analysis for salsa and merengue, two big Latin American genres. Salsa is considered more upscale than reggaeton. In terms of readability, there isn’t really much difference between the three genres.


Surprisingly, salsa is the least lexically diverse while reggaeton is the most diverse of the three.

parts of speech

Lastly, I was interested in the parts of speech; mainly, the most commonly used nouns and verbs in reggaeton songs. Reggaeton songs can be quite the trip, haha.
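Once the lyrics have been run through a part-of-speech tagger (not shown here), counting the most common words per tag is straightforward. The sketch assumes the tagger output is a list of (word, tag) pairs using tags such as "NOUN" and "VERB"; that input shape and the name `top_by_pos` are hypothetical.

```python
from collections import Counter

def top_by_pos(tagged_tokens, pos, n=10):
    """Return the `n` most frequent words carrying the given POS tag."""
    counts = Counter(word for word, tag in tagged_tokens if tag == pos)
    return counts.most_common(n)
```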


Bed, ass, body, women, touch, feel – yup this is the core of reggaeton songs.

So, what does this tell us about reggaeton? It tells us that, contrary to popular opinion, it is a lexically rich and readable genre.


Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
