As a woman in tech, I am no stranger to sexist remarks. I’ve been on the receiving end of them since my teenage years. It’s almost a rite of passage for girls out there, even in 2023. In my professional career, I have a whole collection of stories I could share. Aside from my career in tech, I am a lover of the arts. I love to cross stitch and I love to illustrate. Like tech, art is no stranger to sexism, racism, classism and any other “-isms” you can think of.

There have always been societal hierarchies that decide what art is considered good. Now, because women throughout history have been seen as inferior to men, it was then natural to assume that art made by women was also inferior. Moreover, “because art is considered a freedom of expression, women were not to be viewed as artists nor were they allowed to express themselves. Women artists were seen as a threat.” (Debbie Nuno, Sexism in Art: from the Fundamental to Art Critiques, 2017). If you stop to think of whom we remember today as “Masters of the Arts”, the vast majority will be men (white men). Renee Sandell makes an interesting point in her essay titled “Feminist Art Education: An Analysis of the Women’s Art Movement as an Educational Force”: women still found other artistic ways to express themselves via knitting, crochet, etc., but those were considered a minor form of art in comparison to the fine arts.

It is also worth noting that while women behind the canvas were not often recognized, women on canvas surely were. Throughout the history of art, women have been painted and idolized: just go take a stroll in any museum and count how many naked depictions of women there are compared to men’s portraits (source: Guerrilla Girls). According to Guerrilla Girls, only 4% of MoMa’s works are made by women, while 72% of the nudes in the gallery are women.

Given my similar experiences in tech and my passion for art, I was interested in doing a numbers deep dive into women in art history and how their output compares to the total number of works done by their male counterparts. MoMa has an extensive record of artworks from the last 150 years, and that is the data I will be using today. Scroll to the bottom of this post for the full code.

When we look at the number of works produced from the 1700s to today, we don’t really see works by women popping up until the mid to late 1800s.

Data Source: MoMa. Code used to generate plot is at the end of the post.

This is corroborated by PCD Art who state that “it was not till three centuries later in the 19-century women began to exist in the art sphere; however, they continued to be viewed as separate and unequal to their male counterparts. Yet, in 1876, a milestone was reached during the Philadelphia Centennial Exposition where women artists created around a tenth of the artworks presented. American portrait painter and engraver, Emily Sartain, became the first and only woman to receive a Centennial Gold Medal awarded for her a painting titled ‘The Reproof.'”

Moreover, while it is not the focus of this article, it is important to call out that the majority of artists have historically come from white, western nations, as seen in the numbers below from our MoMa dataset.

Data Source: MoMa. Code used to generate the data used for this graphic is at the end of the post. Graphic made with Canva.

Jumping over to the 20th century, we see another big change for women in art. Women become a big part of war propaganda during World War I and II (see Rosie the Riveter), and their role in society changes. According to PCD Art, “when determining how these changes impacted on the position of women in art, one of the most important things to consider is the changing role of women during World War 1 (1914-1918). Due to the lack of men in society due to their part in the war, women were suddenly allowed into the workforce to do “men’s jobs”. They were suddenly presented as strong independent individuals who were essential to the success of the country. Which dramatically and irrevocably changed women’s role in society as they now had to carry the burden of supporting the household both financially and emotionally. This translated into the art world with the emergence of female artists such as Kathe Kollwitz, a German artist.”

Unfortunately, once the 1950s hit, a new wave of male-dominated culture took hold, one that prioritized the sanctity of family life and made women responsible for the home and kids, and women’s art diminished.

Data Source: MoMa. Code used to generate plot is at the end of the post.

As can be seen from the plot above, there was a decrease in the 50s, and it wasn’t until the mid 1990s that things picked up again for women and seemed to stabilize from then on.

Overall, the acceptance of women’s work in the arts and elsewhere has improved significantly. I would argue that there is still more work to be done – change takes time.

Before closing the chapter, I wanted to run some resampling and bootstrap regressions to really cement the fact that the experiences for women and men in the arts in terms of their yearly output are statistically different.

What is resampling and why is it applicable here?

Resampling is when you take a sample of your data and then repeatedly draw new samples from that sample, with the goal of seeing how much variation there would have been if you had a more complete dataset. In theory, this new distribution is related to the patterns and uncertainties of the underlying population. Based on that definition, there are two reasons to resample our data in this case: (1) to understand the variation in our data and (2) to augment limited data.
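To make the idea concrete, here is a minimal sketch of bootstrap resampling for a sample mean. The helper name bootstrap_means and the yearly counts are hypothetical and made up for illustration; they are not taken from the MoMa data.

```python
import numpy as np

def bootstrap_means(values, n_replicas=1000, seed=0):
    """Resample `values` with replacement and return the mean of each replica."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each replica has the same size as the original sample.
    idx = rng.integers(0, len(values), size=(n_replicas, len(values)))
    return values[idx].mean(axis=1)

# Hypothetical yearly work counts (illustrative only).
yearly_counts = [12, 30, 7, 45, 22, 18, 60, 9, 27, 33]
means = bootstrap_means(yearly_counts)

# The spread of the replica means approximates the uncertainty of the sample mean.
print(means.mean(), means.std())
```

The spread of the replica means is what the histograms later in this post visualize for each group.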

There are many ways to resample data, each method with its own pros and cons. Some popular methods are bootstrap, Monte Carlo, jackknife, cross-validation, etc. Here we will use bootstrap first to understand the populations, and then we’ll build a straightforward linear regression with bootstrapping (a linear regression is not fully correct, but for a quick conclusion it’ll do just fine!).

I first split the data into two genders and look at the two distributions.

Male and Female Artists Yearly Works Distribution. Code used to generate this plot can be found at the end of this post.

As with our first scatterplot, we can see the differences in quantities between men and women, but let’s try to prove that these differences are statistically significant.

After bootstrapping our data, our distributions look like so.

Bootstrapped Populations. Code used to generate this plot can be found at the end of this post.

Now, because the two bootstrapped distributions of the mean don’t overlap, we can say that the difference between these two groups is indeed statistically significant.
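One way to check that kind of overlap numerically, rather than by eye, is to compare percentile intervals of the bootstrapped means. The sketch below uses synthetic numbers and hypothetical helper names (percentile_interval, intervals_overlap). Note that non-overlapping intervals are a conservative signal: intervals can overlap slightly and the difference can still be significant.

```python
import numpy as np

def percentile_interval(samples, p=5):
    """Central (100 - p)% percentile interval of bootstrapped statistics."""
    return np.percentile(samples, p / 2), np.percentile(samples, 100 - p / 2)

def intervals_overlap(a, b):
    """True if the closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

rng = np.random.default_rng(0)
# Synthetic stand-ins for two sets of bootstrapped means (illustrative only).
group_a = rng.normal(loc=500, scale=20, size=1000)
group_b = rng.normal(loc=60, scale=5, size=1000)

ci_a = percentile_interval(group_a)
ci_b = percentile_interval(group_b)
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```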

Bootstrapped Population. Code used to generate this plot can be found at the end of this post.

Moreover, combining both samples and resampling, we can see that the mean difference between the groups is far from zero, once again indicating a statistically significant difference between the two groups.
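A rough sketch of that check: bootstrap the difference in means and see whether its 95% percentile interval excludes zero. This variant resamples each group separately, whereas the code at the end of the post resamples the combined frame and then splits it by gender. The function name and the counts are hypothetical.

```python
import numpy as np

def bootstrap_mean_diff(a, b, n_replicas=1000, seed=0):
    """Bootstrap the difference mean(a) - mean(b), resampling each group separately."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diffs = np.empty(n_replicas)
    for i in range(n_replicas):
        # rng.choice samples with replacement by default.
        diffs[i] = rng.choice(a, size=len(a)).mean() - rng.choice(b, size=len(b)).mean()
    return diffs

# Hypothetical yearly counts for two groups (illustrative only).
group_a = [410, 520, 480, 610, 390, 450, 530, 575]
group_b = [40, 65, 52, 80, 30, 47, 58, 71]

diffs = bootstrap_mean_diff(group_a, group_b)
low, high = np.percentile(diffs, [2.5, 97.5])
# If the interval does not contain zero, the difference is unlikely to be noise.
print(low, high)
```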

Let’s try to fit a linear regression with bootstrap to our two populations. Here I am using sklearn’s Bagging Regressor.

Bagging Regressor Men Population. Code used to generate this plot can be seen at the end of this post.

Bagging Regressor Women Population. Code used to generate this plot can be seen at the end of this post.

The blue or yellow dots are our actual data points. The solid red line is the fit of the bagged model, and the grey lines are the individual bootstrap model estimators. Here we can see the benefit of resampling, as we were able to fit multiple models (grey lines) and get a better understanding of what’s going on with the relationships.
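The grey lines can also be summarized numerically. Below is a small sketch, on synthetic data rather than the MoMa counts, that fits a bagged linear regression and looks at the spread of the slopes across the bootstrapped estimators. The trend and noise level are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for yearly counts: a noisy upward trend (illustrative only).
years = np.arange(1900, 2001)
counts = 5.0 * (years - 1900) + rng.normal(0, 40, size=years.size)
X = years.reshape(-1, 1)

# Each of the 50 linear models is fit on a bootstrap sample of the rows.
model = BaggingRegressor(LinearRegression(), n_estimators=50, random_state=0)
model.fit(X, counts)

# One slope per bootstrapped model; their spread reflects uncertainty in the trend.
slopes = np.array([m.coef_[0] for m in model.estimators_])
print(slopes.mean(), np.percentile(slopes, [2.5, 97.5]))
```

A narrow band of slopes suggests the trend is stable under resampling, while a wide band suggests it depends heavily on a few years.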

Full code below:

#!/usr/bin/env python
# coding: utf-8

import seaborn as sns
import pandas as pd
import re
import missingno as msno
import matplotlib.pyplot as plt
from matplotlib import rcParams
import numpy as np
# figure size in inches
rcParams['figure.figsize'] = 8,6
get_ipython().run_line_magic('matplotlib', 'inline')
pd.set_option('display.max_columns', None)

import warnings

# ### UDFs

def get_yr(year):
    # Clean up year column
    temp = re.findall(r"(?<!\d)\d{4,7}(?!\d)", str(year))
    if temp:
        return list(map(int, temp))[0]

# ### Data Input

artists = pd.read_csv("Artists.csv")
artworks = pd.read_csv("Artworks.csv")

# Check data types

# Inspect data

# Check for missing data

# ### Feature Engineering

# Clean up data
artworks['Nationality'] = artworks['Nationality'].str.replace('(', '').str.replace(')','')
artworks['Gender'] = artworks['Gender'].str.replace('(', '').str.replace(')','')

# Feature engineering
artworks['Gender_Encode'] = artworks['Gender'].astype('category')
artworks['Nationality_Encode'] = artworks['Nationality'].astype('category')

# Check Date column and if not consistent clean up. 

# Clean up Date column
artworks['Date'] = artworks["Date"].apply(get_yr)

# Clean up Gender column
artworks['Gender'] = artworks['Gender'].str.strip(' ')
artworks['Gender'] = artworks['Gender'].str.replace(' ', '')
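# These patterns are regular expressions, so regex=True is required on pandas >= 2.0
artworks['Gender'] = artworks.Gender.str.replace(r'(^.*Male.*$)', 'Male', regex=True)
artworks['Gender'] = artworks.Gender.str.replace(r'(^.*Female.*$)', 'Female', regex=True)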
artworks['Gender'] = artworks.Gender.str.replace('Non-Binary', 'Non-binary')

# Feature engineering
artworks['Gender_Encode'] = artworks['Gender'].astype('category')
artworks['Nationality_Encode'] = artworks['Nationality'].astype('category')

# ### Data Analysis

# Artworks per gender
# Each row of artworks is one work, so these are counts of works, not of distinct artists
print("Total Number of Works by Male Artists: "+str(len(artworks['Gender'][(artworks['Gender'] == 'Male') | (artworks['Gender'] == 'male')])))
print("Total Number of Works by Female Artists: "+str(len(artworks['Gender'][(artworks['Gender'] == 'Female') | (artworks['Gender'] == 'female')])))
print("Total Number of Works by Non-Binary Artists: "+str(len(artworks['Gender'][(artworks['Gender'] == 'Non-binary')])))

# Clean up nationality
a = artworks['Nationality'][5]

nationality = artworks[artworks['Nationality'] != a].reset_index(drop=True)

# See top nationalities


# Gender by nationality
nationality[nationality['Nationality'] == 'American']['Gender'].value_counts()

# Early years

# When do we see female artists start to pop up? Around mid 1800s

# Modern times

# Create df for plotting: number of works per year and gender
gender_dist = artworks.groupby("Date")['Gender'].value_counts().reset_index(name="Count")

# When does non-binary start showing up?
gender_dist[gender_dist['Gender'] == 'Non-binary']

# All Time
sns.set_context("paper", font_scale=1, rc={"lines.linewidth": 1.5})
sns.lineplot(data=gender_dist[(gender_dist['Gender'] == "Male") | ((gender_dist['Gender'] == "Female"))], x="Date", y="Count", hue="Gender", palette=["#3a86ff", "#fca311"])
sns.despine(offset=10, trim=True)

# 1950s+
sns.set_context("paper", font_scale=1, rc={"lines.linewidth": 1.5})
sns.lineplot(data=gender_dist[((gender_dist['Gender'] == "Male") | (gender_dist['Gender'] == "Female")) & (gender_dist['Date'] >= 1950)], x="Date", y="Count", hue="Gender", palette=["#3a86ff", "#fca311"])
sns.despine(offset=10, trim=True);

# ### Populations and Resampling

# Look at population distributions
def plot_hist(x, bins='auto', p=5):
    # Plot the distribution and mark the mean
    sns.histplot(x, bins=bins, alpha=.5)
    plt.axvline(x.mean(), linewidth=3)
    # Central (100 - p)% interval; 95% for the default p=5
    plt.axvline(np.percentile(x, p/2.), color="#fca311", linewidth=3)
    plt.axvline(np.percentile(x, 100-p/2.), color="#fca311", linewidth=3)

def plot_dists(a, b, nbins, a_label='pop_A', b_label='pop_B', p=5):
    # Create a single sequence of bins to be shared across both
    # distribution plots for visualization consistency.
    combined = pd.concat([a, b])
    breaks = np.linspace(combined.min(), combined.max(), num=nbins+1)
    # Top panel: first population
    plt.subplot(2, 1, 1)
    plot_hist(a, bins=breaks, p=p)
    plt.title(a_label)
    # Bottom panel: second population
    plt.subplot(2, 1, 2)
    plot_hist(b, bins=breaks, p=p)
    plt.title(b_label)
    sns.despine(offset=10, trim=True);

male = gender_dist[gender_dist.Gender == 'Male']
female = gender_dist[gender_dist.Gender == 'Female']
plot_dists(male.Count, female.Count, 20, a_label='male', b_label='female')

n_replicas = 1000
female_bootstrap_means = pd.Series([
 female.sample(frac=1, replace=True).Count.mean()
 for i in range(n_replicas)])
male_bootstrap_means = pd.Series([
 male.sample(frac=1, replace=True).Count.mean()
 for i in range(n_replicas)])
plot_dists(male_bootstrap_means, female_bootstrap_means, 
 nbins=80, a_label='Male', b_label='Female')

# Bootstrap the difference in mean yearly counts between the two groups
diffs = []
for i in range(n_replicas):
    sample = gender_dist.sample(frac=1.0, replace=True)
    male_sample_mean = sample[sample.Gender == 'Male'].Count.mean()
    female_sample_mean = sample[sample.Gender == 'Female'].Count.mean()
    diffs.append(male_sample_mean - female_sample_mean)
diffs = pd.Series(diffs)
plot_hist(diffs)
sns.despine(offset=10, trim=True);

import statsmodels.api as sm
fig = sm.qqplot(diffs, line='s')
plt.title('Quantiles of standard Normal vs. bootstrapped mean')

# ### Regressions, Bagging Regressors

plt.figure(figsize=(12, 8))
sns.lmplot(x ='Date', y ='Count', data = gender_dist[(gender_dist['Gender'] == "Male") & (gender_dist['Date']<=2000)])
sns.despine(offset=10, trim=True);

sns.lmplot(x ='Date', y ='Count', data = gender_dist[(gender_dist['Gender'] == "Female") & (gender_dist['Date']<=2000)])
sns.despine(offset=10, trim=True);

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import BaggingRegressor
maleR = male[male['Date'] <=2000]
X = maleR[["Date"]]
y = maleR["Count"]

n_estimators = 50

# Initializing estimator
model = BaggingRegressor(LinearRegression(),
                         n_estimators=n_estimators,
                         bootstrap=True)

# Fitting 50 bootstrapped models
model.fit(X, y)

plt.figure(figsize=(12, 8))

# Plotting each model
for m in model.estimators_:
    # Sub-estimators are fit on plain arrays, so predict on the array form of X
    plt.plot(X["Date"], m.predict(X.to_numpy()), color="grey", alpha=0.2, zorder=1)

# Plotting data
sns.scatterplot(data=maleR, x="Date", y="Count", s=40)

# Bagged model prediction
plt.plot(X, model.predict(X), color="red")
sns.despine(offset=10, trim=True);

model.score(X, y)


Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
