LGBTQ+ Film History: Building A Pride Month Data Flow with Ascend
Shifra
Shifra Isaacs
shifra@ascend.io
June is Pride Month – a time to celebrate, reflect, and honor the stories and struggles of the LGBTQ+ community. This year, our team set out to do something fun: we built a data pipeline to analyze the evolution of LGBTQ+ themes in film, spanning from the earliest days of cinema to the present.
Here’s how we did it, what we learned, and why it matters.
The Vision: Telling a Story Through Data
Film has always been a mirror to society, reflecting its values, struggles, and dreams. But how have LGBTQ+ themes in film changed over time? What stories were told in the shadows, and which ones broke through to the mainstream? When did LGBTQ+ cinema experience its most prolific periods of production?
We wanted to answer these questions with data by building a pipeline that could ingest, process, and summarize over a century of film history.
The data flow in Ascend. Read on for a step by step break down.
Building the Pipeline: From Raw Data to Insight
1. Data Collection & Ingestion
Our journey began with a rich dataset of films, including the release year, plot summary, and popularity score. We focused on films with LGBTQ+ representation, ensuring a diverse and global perspective.
Using Ascend’s platform, we created robust Read Components to pull in the raw film data. These components handled everything from CSV parsing to schema validation, ensuring our data was clean and ready for analysis.
2. Mapping Film Genres: From Raw IDs to Meaningful Genre Labels
Before analyzing film themes, we needed to make genre data meaningful. Our raw dataset stored each film’s genres as a string of numeric IDs (e.g., "[18, 35]"). To enable real analysis, we mapped these IDs to readable genre names like “Drama” or “Comedy.”
How we did it:
Cleaned and parsed the GENRE_IDS column, handling missing or malformed values and converting strings to lists of integers.
Joined these lists with a genre reference table to map each ID to its name.
Aggregated the results so each film record included a genre_name column—a list of genre labels, or an empty list if none were found.
This mapping step was essential: it let us ask questions like “How have LGBTQ+ comedies changed over time?” and made our downstream analytics and AI summaries much more interpretable.
Polars Genre Processing Python Code
Python
# Step 1: Clean and split genre_ids string into list of integers with proper edge case handling
films_df = films_df.with_columns(
pl.when(pl.col("GENRE_IDS").is_null() | (pl.col("GENRE_IDS") == ""))
.then(pl.lit(None, dtype=pl.List(pl.Int64))) # Handle null/empty as empty list
.otherwise(
pl.col("GENRE_IDS")
.str.replace_all(r"[^\d,]", "") # Remove everything except digits and commas
.str.split(",")
.list.eval(
pl.element()
.filter(pl.element() != "") # Filter out empty strings
.filter(pl.element().str.len_chars() > 0) # Additional check for non-empty
.cast(pl.Int64, strict=False) # Use non-strict casting to handle conversion errors
.filter(pl.element().is_not_null()) # Remove any nulls from failed conversions
)
)
.alias("genre_ids")
)
# Step 2: Normalize genre column names
genres_df = genres_df.rename({"ID": "genre_id", "NAME": "genre_name"})
genres_df = genres_df.with_columns(pl.col("genre_id").cast(pl.Int64))
# Step 3: Explode genre_ids and join with genre names
# Handle cases where genre_ids might be null or empty
exploded = (films_df
.select(["genre_ids"])
.with_row_count("row_nr")
.filter(pl.col("genre_ids").is_not_null()) # Filter out null genre_ids
.filter(pl.col("genre_ids").list.len() > 0) # Filter out empty lists
.explode("genre_ids"))
# Only proceed with join if we have data to join
if exploded.height > 0:
joined = exploded.join(genres_df, left_on="genre_ids", right_on="genre_id", how="left")
# Step 4: Group back by row and collect genre names
genre_names = joined.group_by("row_nr").agg(pl.col("genre_name")).sort("row_nr")
else:
# Create empty genre_names DataFrame with correct schema
genre_names = pl.DataFrame({
"row_nr": pl.Series([], dtype=pl.UInt32),
"genre_name": pl.Series([], dtype=pl.List(pl.Utf8))
})
# Step 5: Add genres back to original films_df
films_df = films_df.with_row_count("row_nr").join(genre_names, on="row_nr", how="left").drop("row_nr")
# Fill null genre lists with empty lists
films_df = films_df.with_columns(
pl.col("genre_name").fill_null(pl.lit([], dtype=pl.List(pl.Utf8)))
)
return films_df.to_pandas()
3. Thematic Analysis by Decade
To understand how themes evolved, we grouped films by decade. For each decade, we aggregated plot summaries and thematic tags, creating a “decade summary” that captured the essence of LGBTQ+ storytelling in that era.
Here’s a sample SQL transform that aggregates a sample of themes by decade:
SQL Film Themes Query
SQL
SELECT
decade,
LISTAGG(film_themes, '; ') WITHIN GROUP (ORDER BY TITLE) AS all_film_themes,
COUNT(*) AS num_films
FROM (
SELECT
decade,
TITLE,
film_themes
FROM film_themes
QUALIFY ROW_NUMBER() OVER (PARTITION BY decade ORDER BY RANDOM()) <= 100 -- Limit to 100 films per decade
)
GROUP BY decade
4. Summarization with AI
Here’s where things got exciting. We leveraged Snowflake Cortex’s powerful SUMMARIZE function to generate natural-language summaries of each decade’s themes. This AI-driven approach distilled thousands of words into concise, insightful overviews.
Below is the actual SQL transform from our pipeline:
Snowflake Cortex SQL Query
SQL
SELECT
decade,
SNOWFLAKE.CORTEX.SUMMARIZE(
'Given the following list of movie themes from the ' || decade || 's, summarize the most common themes of this decade: ' ||
LEFT(all_film_themes, 10000) ||
'Reference specific films only in parenthesis and follow this format: The 1950s saw a range of movie themes, including technology and rebellion ("The Mechanical Brain"), exploration of sexuality and identity ("The Trans Woman" and "Olivia"), housewifery and relationships ("The Inexperienced Housewife"), and moral dilemmas and crises of faith ("The Vicar of Bellington" and "A Double Life"). Other themes included communication between prisoners, cover-ups, and escapes.'
) AS decade_themes_summary,
num_films
FROM decade_theme_agg
ORDER BY decade
5. Evolution Over Time
Finally, we brought it all together. By feeding the decade summaries into Cortex again, we generated an overarching narrative: a summary of how LGBTQ+ film themes have evolved from the 1880s to today.
Here’s the SQL that does this final summarization:
Key Findings: The Shifting Landscape of LGBTQ+ Cinema
So, what did we discover? Here are some of the most compelling insights from our analysis:
LGBTQ Films: Temporal Analysis
LGBTQ Films: Temporal Analysis
Exploring trends in LGBTQ film production from 1880s-2022
Exponential Growth
2.7x
increase from 2000s to 2010s decade
Peak Year
2020
348 films - highest single year
Peak Release Month
January
Peak month (618 films total)
Yearly Film Production (1960-2022)
Films by Decade
Historical Phases of LGBTQ Cinema
Early Era
1882-1960
41
films
Civil Rights
1961-1980
287
films
AIDS Crisis
1981-1995
325
films
Modern Era
1996-2010
1,175
films
Digital Era
2011-2021
2,338
films
Seasonal Release Patterns
Genre Evolution by Decade
Top Genre
All Time
Drama
2,647 films
2010s Leader
Peak Decade
Drama
1,391 films
Romance Growth
2000s-2010s
2.2x
increase
Documentary Rise
Steady Growth
403
in 2010s
Language Distribution
Top 12 languages by film count
English Dominance
Primary Language
2,786
films (58%)
Spanish Films
Second Most
393
films (8%)
French Cinema
Third Place
363
films (7.6%)
Global Reach
Total Languages
64
languages
Most Popular Films by Century
20th Century Classics
1. American Beauty Suburban mid-life crisis drama
2. Basic Instinct Psychological thriller
3. Caligula Historical epic drama
4. Clueless Teen coming-of-age comedy
5. Interview with the Vampire Gothic romance
21st Century Hits
1. Crush Teen romantic comedy
2. Yes, No, or Maybe Half? Animated romance
3. Given Animated musical romance
4. Heart Shot Contemporary drama
5. Call Me by Your Name Coming-of-age romance
Key Findings
Historical Trends
1960s breakthrough: Transition from sporadic releases to consistent production
1980s-90s surge: Coinciding with AIDS activism and increased visibility
2000s explosion: Digital filmmaking democratized production
2010s peak: Mainstream acceptance and streaming platforms
2020-2022 decline: COVID-19 impact on film industry
Seasonal Patterns
January peak: Sundance Film Festival and awards season launches
October second: Fall festival circuit and awards preparation
June third: Pride Month and summer festival season
December low: Holiday season shows lowest activity
Festival alignment: Releases timed to major film events
Why This Matters
Data can do more than count and categorize—it can help us see the arc of history, the progress we’ve made, and the work still to be done. By analyzing LGBTQ+ film through the lens of data, we honor the storytellers who paved the way and inspire the next generation to keep telling their truth.