from 2,000 screenplays, Broken Down by Gender and Age
Lately, Hollywood has been taking so much shit for rampant sexism and racism. The prevailing theme:
But it’s all rhetoric and no data, which gets us nowhere in terms of having an informed discussion. How many movies are actually about men? What changes by genre, era, or box-office revenue? What circumstances generate more diversity?
To begin answering these questions, we Googled our way to 8,000 screenplays and matched each character’s lines to an actor. From there, we compiled the number of lines for male and female characters across roughly 2,000 films, arguably the largest undertaking of script analysis, ever.
Let’s begin by breaking down dialogue, by gender, for just Disney films.
Only High-Grossing Films: Ranked in the Top 2,500 by US Box Office*
*Domestic gross over $45M, inflation-adjusted. Using IMDB box office, 2,500 films have hit this threshold.
In January 2016, researchers reported that men speak more often than women in Disney’s princess films. We validated this claim and doubled the sample size to 30 Disney films, including Pixar. The results: 22 of 30 Disney films have a male majority of dialogue. Even in films with female leads, such as Mulan, the dialogue swings male. Mushu, her protector dragon, has 50% more lines than Mulan herself.
This dataset isn’t perfect. As with Mulan, a plot can center around a character, even though the dialogue doesn’t reflect it. And all of our data is based on screenplays, not a perfect transcription of a film.
For each screenplay, we mapped characters with at least 100 words of dialogue to a person’s IMDB page (which identifies people as an actor or actress). We did this because minor characters are poorly labeled on IMDB pages. This has unintended consequences: Armageddon, for example, has women with lines, just not over this threshold, which means a more accurate result would be 99.5% male dialogue instead of our result of 100%. There are other problems with this approach as well: films change quite a bit from script to screen. Directors cut lines. They cut characters. They add characters. They change character names. They cast a different gender for a character. We believe the results are still directionally accurate, but individual films will definitely have errors.
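To see how the 100-word cutoff can turn 99.5% into 100%, here is a minimal sketch in Python. The record layout (`gender`, `words`), the function name, and the film's numbers are all invented for illustration; this is not the code or the data behind the article.

```python
from collections import defaultdict

MIN_WORDS = 100  # the cutoff described in the text


def dialogue_share(characters, min_words=MIN_WORDS):
    """Share of words by gender, counting only characters at or above min_words."""
    totals = defaultdict(int)
    for character in characters:
        if character["words"] >= min_words:
            totals[character["gender"]] += character["words"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {gender: words / grand_total for gender, words in totals.items()}


# Hypothetical film: two large male roles and one small female role.
film = [
    {"name": "A", "gender": "m", "words": 4000},
    {"name": "B", "gender": "m", "words": 5950},
    {"name": "C", "gender": "f", "words": 50},  # falls below the cutoff
]

with_cutoff = dialogue_share(film)                  # {"m": 1.0}
without_cutoff = dialogue_share(film, min_words=0)  # m: 0.995, f: 0.005
```

With the cutoff, the small female role disappears and the film reads as 100% male; without it, the split is 99.5% to 0.5%.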
Each screenplay has at least 90% of its lines categorized by gender. If you notice a missing character from the analysis, their lines may be in the remaining 10%. If a character was cut from the film but is present in the screenplay, we inferred his or her gender based on the script’s pronouns.
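The 90% check can be pictured as a simple coverage ratio. This is a sketch under assumptions: the `lines` and `gender` fields and the use of `None` for an unmatched character are hypothetical, and the example numbers are made up.

```python
COVERAGE_FLOOR = 0.90  # the minimum share of categorized lines stated in the text


def gender_coverage(characters):
    """Fraction of a screenplay's lines that belong to a gender-labeled character."""
    total = sum(c["lines"] for c in characters)
    if total == 0:
        return 0.0
    labeled = sum(c["lines"] for c in characters if c.get("gender") in ("m", "f"))
    return labeled / total


# Hypothetical screenplay: one character could not be matched to a gender.
screenplay = [
    {"name": "A", "gender": "m", "lines": 90},
    {"name": "B", "gender": "f", "lines": 5},
    {"name": "C", "gender": None, "lines": 5},
]

coverage = gender_coverage(screenplay)
passes = coverage >= COVERAGE_FLOOR
```

Here 95 of 100 lines are labeled, so the screenplay clears the floor, and character C sits in the uncategorized remainder.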
Across thousands of films in our dataset, it was hard to find a subset that didn’t over-index male. Even romantic comedies have dialogue that is, on average, 58% male. For example, Pretty Woman and 10 Things I Hate About You both have lead women (i.e., characters with the most lines). But the overall dialogue for both films is 52% male, due to the number of male supporting characters.
How many films have women as lead characters?
In 22% of our films, actresses had the most lines (i.e., they were the lead). Women are more likely to be in second place for number of lines, which occurs in 34% of films. The most abysmal stat: women occupy at least 2 of the top 3 roles in only 18% of our films. That same scenario for men occurs in about 82% of films.
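The lead and top-3 figures above come down to ranking each film's characters by line count. A rough sketch, assuming a hypothetical mapping of film title to character records; the two films and their numbers are invented for illustration.

```python
def top_genders(characters, n=3):
    """Genders of the n characters with the most lines, largest role first."""
    ranked = sorted(characters, key=lambda c: c["lines"], reverse=True)
    return [c["gender"] for c in ranked[:n]]


def lead_stats(films):
    """Share of films matching each scenario described in the text."""
    counts = {
        "female_lead": 0,
        "female_second": 0,
        "women_2_of_top_3": 0,
        "men_2_of_top_3": 0,
    }
    for characters in films.values():
        top = top_genders(characters)
        if len(top) > 0 and top[0] == "f":
            counts["female_lead"] += 1
        if len(top) > 1 and top[1] == "f":
            counts["female_second"] += 1
        if top.count("f") >= 2:
            counts["women_2_of_top_3"] += 1
        if top.count("m") >= 2:
            counts["men_2_of_top_3"] += 1
    n_films = len(films)
    if n_films == 0:
        return {key: 0.0 for key in counts}
    return {key: value / n_films for key, value in counts.items()}


# Two hypothetical films.
films = {
    "Film X": [
        {"gender": "m", "lines": 300},
        {"gender": "f", "lines": 200},
        {"gender": "m", "lines": 100},
    ],
    "Film Y": [
        {"gender": "f", "lines": 250},
        {"gender": "f", "lines": 150},
        {"gender": "m", "lines": 120},
    ],
}

stats = lead_stats(films)
```

When every film has exactly three ranked, gender-labeled roles, the two top-3 scenarios are complementary, which is why the 18% and 82% figures sum to 100%.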
For each film, we also determined the age of each cast member at the time of its release. This allowed us to quantify whether there is a bias toward younger women in Hollywood (or conversely, whether men enjoy a longer career).
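Age at release is a small date calculation. The sketch below assumes full birth and release dates are available (with only years, the result could be off by one), and the decade buckets are a hypothetical grouping, not necessarily the brackets used in the article's charts.

```python
from datetime import date


def age_at_release(birth: date, release: date) -> int:
    """Whole years between a cast member's birth date and the film's release date."""
    years = release.year - birth.year
    # Subtract one if the birthday had not yet occurred in the release year.
    if (release.month, release.day) < (birth.month, birth.day):
        years -= 1
    return years


def decade_bucket(age: int) -> str:
    """Label an age by decade, e.g. 39 -> '30s' (hypothetical grouping)."""
    return f"{(age // 10) * 10}s"


# Hypothetical actor born 15 June 1970, in a film released one day before a birthday.
age = age_at_release(date(1970, 6, 15), date(2010, 6, 14))
bucket = decade_bucket(age)
```

Summing each character's lines by gender and bucket then gives the kind of age breakdown discussed next.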
women versus men. Lines available to women who are over 40 years old decrease substantially. For men, it’s the exact opposite: there are more roles available to older actors.
Here’s another look at the same data, but for every age:
This project was born out of the less-than-stellar response to our analysis of films that fail the Bechdel Test. Commenters were quick to point out that the Bechdel Test is flawed and there are justifiable reasons for films to fail (e.g., they are historic). By measuring dialogue, we have a much more objective view of gender in film.
This new data made the discussion far more uncomfortable. Perhaps people will examine the numbers and find a reason that isn’t as depressing as “systemic boys’ club.”
And while this analysis does not address the context of the dialogue (e.g., tropes, gender themes), it does give a great deal of credibility to the unanswered requests by women for greater inclusivity.
Many of the findings are anecdotally obvious to women in the film industry. But nobody wanted to do the grunt work of gathering the data. We spent weeks just matching scripts to IMDB pages. It’s still not perfect, but we’re now in a much better place than “you know...women are never love-interests when they’re older than 40. ¯\_(ツ)_/¯”
All of our sources are available in this Google Doc and as much data as we can share (without getting sued) is available here on Github. Or if you don’t know how to code, here’s an easy way to comb through every film, genre, and year.