Okay, I have a confession to make… I don’t get why people like avocados so much. They’re ‘okay’ at best. I grew up in the Caribbean, where avocados were plentiful. In fact, we had an avocado tree in our backyard, with avocados twice the size of those I see in stores here, but… I was just never an avocado person. ALSO, why are they so expensive? Makes no sense. HOWEVER, given how much my generation likes them, when I saw this dataset I had to give it a go.

Also, I wanted to get better at making plots in Python. I tend to get lazy and use Illustrator for data visualizations, but I forget how powerful the seaborn and matplotlib libraries can be. In addition, we’ll be looking at time series to predict avocado trends!

This data came from Kaggle.com, where you can download it. Let’s get started!

Import the libraries we’ll need.

%matplotlib inline
#---Import libraries
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
from prophet import Prophet  # the package was named fbprophet before v1.0
import seaborn as sns
sns.set(style="whitegrid")

Notice the line sns.set(style="whitegrid"); it specifies the background I want in my plots. I tend to prefer a white grid, but you could also use the "darkgrid" option.

Let’s then read the data and give it a quick view.

#---Read Data
file_ = "avocado.csv"
df = pd.read_csv(file_)

#---Data Exploration
df.head()

#---Let's check what timeline we are working with
max(df['Date'])
'2018-03-25'

min(df['Date'])
'2015-01-04'
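
A side note on why min and max work here: Date is still a column of strings at this point, so the comparison is character by character. That agrees with calendar order only because the dates are zero-padded and written year first. A tiny sketch with made-up dates (not rows from the dataset) to illustrate:

```python
from datetime import date

# Made-up sample dates, for illustration only
iso_dates = ["2016-11-06", "2015-01-04", "2018-03-25", "2017-07-30"]

# Text comparison goes character by character ...
latest_as_text = max(iso_dates)
# ... while parsing first compares real calendar dates
latest_as_date = max(date.fromisoformat(d) for d in iso_dates)

print(latest_as_text, latest_as_date.isoformat())

# With a day-first format the shortcut breaks: text order no longer follows time
day_first = ["25-03-2018", "30-07-2017"]
print(max(day_first))
```

If your dates come in any other format, it is safer to parse them before taking the min or max.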

Okay, now that that’s done, let’s start diving into Seaborn and data visualization. Seaborn is a great Python library for visualization. If you’re an R person, it may remind you a bit of ggplot. Here’s the plotting gallery for it. I usually use seaborn in combination with matplotlib. I find that they work better together than either of them individually. I like the aesthetics of seaborn, but I like the freedom of matplotlib.

The first thing I wanted to look at was the volume of avocados per year. Like, how many avocados do we actually consume? Here’s a recipe to make a decent-looking barplot. I’ve commented each line so you can modify this recipe to your needs.
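
The recipe relies on a grouped_data frame with 'datetime' and 'Total Volume' columns, and the step that builds it is not shown. Here is a minimal sketch of one way it could be built, assuming one bar per calendar year and the mean of Total Volume within each year. The toy frame and its numbers are invented so the snippet runs on its own; with the real data you would start from the df loaded above.

```python
import pandas as pd

# Toy stand-in for avocado.csv (values are invented for illustration)
df = pd.DataFrame({
    "Date": ["2015-01-04", "2015-06-07", "2016-01-03", "2016-06-05"],
    "Total Volume": [100.0, 300.0, 400.0, 600.0],
})

# Parse the text dates so the year can be pulled out
df["datetime"] = pd.to_datetime(df["Date"])

# Assumption: one row per calendar year, holding the mean Total Volume
grouped_data = (
    df.groupby(df["datetime"].dt.year)["Total Volume"]
      .mean()
      .reset_index()
)
print(grouped_data)
```

After reset_index, the 'datetime' column holds the year as an integer, which is what ends up on the x-axis of the barplot.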

cfont = {'fontname':'Calibri Light'} #set font

plt.figure(figsize=(13, 14)) # set figure size
ax = plt.subplot(111)    # set up figure box
ax.spines["top"].set_visible(False)    # remove spines
ax.spines["bottom"].set_visible(False)    # remove spines
ax.spines["right"].set_visible(False)    # remove spines
ax.spines["left"].set_visible(False)  # remove spines
ax.get_xaxis().tick_bottom()    # keep x ticks on the bottom edge only
ax.get_yaxis().tick_left()    # keep y ticks on the left edge only
plt.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True) # hide tick marks, keep tick labels
plt.yticks(fontsize=14)  # set y-axis font size
plt.xticks(fontsize=14)  # set x-axis font size
ax.text(0.01, 1000000, 'Average Avocado Volume Per Year'+"\n", fontsize=16, fontweight="bold") # i hate titles, so let's add text inside our plot
# grouped_data is not built in this post; it needs 'datetime' and 'Total Volume' columns, one row per year
sns.barplot(x=grouped_data['datetime'], y=grouped_data['Total Volume'], palette="coolwarm") # this is our actual plot
ax.set(xlabel='', ylabel='') # i also don't like label names, but it's up to you
plt.savefig("avocado_volume_per_year.jpeg", bbox_inches="tight") #save figure
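
The spine and tick lines get repeated for every figure in this post, so one option is to wrap them in a small helper. This is only a sketch: clean_axes is a hypothetical name, not something from seaborn or matplotlib, and the plotted numbers are placeholders.

```python
import matplotlib.pyplot as plt

def clean_axes(figsize=(13, 14), fontsize=14):
    """Create a figure whose axes have no spines and no tick marks."""
    fig, ax = plt.subplots(figsize=figsize)
    for side in ("top", "bottom", "left", "right"):
        ax.spines[side].set_visible(False)  # remove the box around the plot
    # length=0 hides the tick marks but keeps the tick labels
    ax.tick_params(axis="both", which="both", length=0, labelsize=fontsize)
    return fig, ax

fig, ax = clean_axes(figsize=(6, 4))
ax.plot([0, 1, 2], [1.0, 1.4, 1.2])  # placeholder data
```

Seaborn functions such as sns.barplot and sns.boxplot accept an ax argument, so the returned ax can be passed straight to them.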

[Figure: average avocado volume per year]

Seaborn also lets you pick palettes; its documentation lists many more.
Woooow, can someone please explain to me why we eat so many avocados?

Next, I wanted to see how avocado prices have changed over time.
For this plot we have to do a few extra steps; mainly, we have to make sure our dates are in a datetime format and sort them.

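df['datetime'] = pd.to_datetime(df['Date']) # parse the date strings into real datetimes
df.sort_values('datetime', inplace=True) # sort values
df.set_index('datetime', inplace=True) # set index
#check datatype of index
df.index

byDate=df.groupby('Date').mean(numeric_only=True) #get the mean of the numeric columns by date so the plot is less noisy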

#---Avocado over Time
plt.figure(figsize=(15,10))
ax = plt.subplot(111)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()
plt.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True)
plt.yticks(fontsize=14)
plt.xticks(fontsize=14)
ax.text(5, 1.8, 'Average Avocado Price Over Time'+"\n", fontsize=16, fontweight="bold", color='#4b514a')
byDate['AveragePrice'].plot(linewidth=2.0, color='#5d7404')
ax.set(xlabel='', ylabel='')
plt.savefig("avocado_price_per_year.jpeg", bbox_inches="tight")

[Figure: average avocado price over time]

I can definitely see some seasonality in this plot, which makes sense since I assume produce changes in demand and price depending on the season. We can also see the avocado’s rise to glory in the past two years.

Next, let’s look at how prices have changed per region.

#---Price Per Region Per Year

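# factorplot was renamed catplot in seaborn 0.9, and its size argument became height
g = sns.catplot(x='AveragePrice', y='region', data=df,
                   hue='year',
                   kind='point',
                   height=13,
                   aspect=0.8,
                   palette='coolwarm',
                   join=False, # points only; recent seaborn releases prefer linestyle='none'
              )
g.set(xlabel='', ylabel='') # catplot draws on its own figure, so set the labels on the grid it returns
plt.savefig("price_per_year_per_region.jpeg", bbox_inches="tight");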

Okay, Portland, SF, yes normal, Hartford/Springfield? What?

[Figure: average price per region, per year]

Now, let’s look at the difference in price between organic and conventional avocados.

#---Comparison between Organic and Conventional
plt.figure(figsize=(20,10))
ax = plt.subplot(111)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()
plt.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True)
plt.yticks(fontsize=14)
plt.xticks(fontsize=14)
ax.text(2.1, 1.5, 'Organic vs. Conventional Avocado Prices'+"\n", fontsize=22, fontweight="bold", color='#4b514a')
sns.boxplot(y="type", x="AveragePrice", data=df, palette = 'BuGn_r')
ax.set(xlabel='', ylabel='')
plt.savefig("avocado_organic_vs_conventional.jpeg", bbox_inches="tight")

Holy moly. So, yes, organic avocados are twice the price of conventional avocados. No surprise here, though?
[Figure: organic vs. conventional avocado prices]
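
To put a number on the gap instead of reading it off the boxplot, one option is to compare the median price per type. A minimal sketch, using invented prices so it runs on its own; with the real data you would apply the same two lines to df:

```python
import pandas as pd

# Invented prices, only to show the calculation
df = pd.DataFrame({
    "type": ["conventional", "conventional", "organic", "organic"],
    "AveragePrice": [1.00, 1.20, 1.50, 1.70],
})

# Median price for each avocado type, then the organic-to-conventional ratio
medians = df.groupby("type")["AveragePrice"].median()
ratio = medians["organic"] / medians["conventional"]

print(medians)
print(round(ratio, 2))
```

The median is used here because it is the line drawn inside each box of the boxplot; swapping in mean() would be just as easy.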

Cool, cool. I think we’re now ready to create our time series. For the time series, I am using the Prophet library and I am not mad at it. The Prophet library was developed by Facebook with the aim of making time series forecasting straightforward to run. The only requirement is that you have a dataframe made up of two columns, and they have to be called ds and y. The ds (datestamp) column should be in a date format that pandas can parse. The y column must be numeric, and represents the measurement we wish to forecast. As far as running a model goes, this library offers the familiar fit and predict methods that you may have already seen if you use libraries like scikit-learn.

Mathematically speaking, the model looks like this,
y(t) = g(t) + s(t) + h(t) + εt

  • g(t): piecewise linear or logistic growth curve for modelling non-periodic changes in time series
  • s(t): periodic changes (seasonality)
  • h(t): effects of holidays with irregular schedules (you’d have to specify this)
  • εt: error term
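
Prophet adds these components together to get the value at time t. To make that concrete, here is a toy version in plain Python. It is not how Prophet fits anything: the slope, the amplitude, the 52-week period and the holiday week are all made-up numbers, chosen only to show how the pieces combine.

```python
import math

def g(t):
    """Trend: a straight line with a toy intercept and slope."""
    return 1.0 + 0.002 * t

def s(t, period=52.0):
    """Seasonality: one sine wave per 52-week year."""
    return 0.1 * math.sin(2 * math.pi * t / period)

def h(t, holiday_weeks=(5,)):
    """Holiday effect: a fixed bump on the listed weeks (week 5 is arbitrary)."""
    return 0.05 if t in holiday_weeks else 0.0

def y(t, noise=0.0):
    """Additive model: trend + seasonality + holidays + error."""
    return g(t) + s(t) + h(t) + noise

for week in (0, 5, 13):
    print(week, round(y(week), 3))
```

Prophet’s job is to estimate the shape of each component from the data; the sum at the end is the part this sketch illustrates.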

 
Okay, let’s set it up. First, let’s filter our region to ‘TotalUS’ to avoid duplicate data/noise.

#---US Trends
df2 = df[df['region'] == 'TotalUS']
keep = ['AveragePrice', 'Date']
df2 = df2[keep]

date_price = df2.rename(columns={'Date':'ds', 'AveragePrice':'y'})
m = Prophet()
m.fit(date_price)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
fig1 = m.plot(forecast)

#---SF Trends
df3 = df[df['region'] == 'SanFrancisco']
keep = ['AveragePrice', 'Date']
df3 = df3[keep]
date_price = df3.rename(columns={'Date':'ds', 'AveragePrice':'y'})
m = Prophet()
m.fit(date_price)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
fig2 = m.plot(forecast)

#---Midsouth Trends: Are we catching up?
df4 = df[df['region'] == 'Midsouth']
keep = ['AveragePrice', 'Date']
df4 = df4[keep]
date_price = df4.rename(columns={'Date':'ds', 'AveragePrice':'y'})
m = Prophet()
m.fit(date_price)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
fig3 = m.plot(forecast)
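
The three blocks above differ only in the region name, so the ds/y preparation could live in one helper. A sketch under a few assumptions: region_to_prophet_frame is a hypothetical name, the toy rows are invented, and the Prophet calls are left as comments so the snippet runs without Prophet installed.

```python
import pandas as pd

def region_to_prophet_frame(df, region):
    """Return a two-column frame (ds, y) for one region, ready for Prophet.fit."""
    subset = df.loc[df["region"] == region, ["Date", "AveragePrice"]]
    out = subset.rename(columns={"Date": "ds", "AveragePrice": "y"})
    out["ds"] = pd.to_datetime(out["ds"])  # make the datestamp a real datetime
    return out.sort_values("ds").reset_index(drop=True)

# Toy rows standing in for avocado.csv (invented values)
toy = pd.DataFrame({
    "Date": ["2015-01-11", "2015-01-04", "2015-01-04"],
    "AveragePrice": [1.10, 0.95, 1.40],
    "region": ["TotalUS", "TotalUS", "SanFrancisco"],
})

date_price = region_to_prophet_frame(toy, "TotalUS")
print(date_price)

# With Prophet installed, each region then reduces to:
#   m = Prophet(); m.fit(date_price)
#   forecast = m.predict(m.make_future_dataframe(periods=365))
```

Looping over ['TotalUS', 'SanFrancisco', 'Midsouth'] with this helper would reproduce the three forecasts with less copy and paste.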

In the top left you have the Total US trends, in the bottom left you have the SF trends and on the right you have the MidSouth US trends. Overall, avocado prices seem to be going up. Interestingly, SF prices seem to have normalized, while in the Mid-South prices seem to be rising. Are we catching up with the cool kids?

That’s all for now, happy coding!

Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
