LGBTQ+ Film History: Building A Pride Month Data Flow with Ascend

Shifra

Shifra Isaacs

shifra@ascend.io

June is Pride Month – a time to celebrate, reflect, and honor the stories and struggles of the LGBTQ+ community. This year, our team set out to do something fun: we built a data pipeline to analyze the evolution of LGBTQ+ themes in film, spanning from the earliest days of cinema to the present.

‍

Here’s how we did it, what we learned, and why it matters.

‍

The Vision: Telling a Story Through Data

Film has always been a mirror to society, reflecting its values, struggles, and dreams. But how have LGBTQ+ themes in film changed over time? What stories were told in the shadows, and which ones broke through to the mainstream? When did LGBTQ+ cinema experience its most prolific periods of production?

‍

We wanted to answer these questions with data by building a pipeline that could ingest, process, and summarize over a century of film history.

The data flow in Ascend. Read on for a step by step break down.

‍

Building the Pipeline: From Raw Data to Insight

1. Data Collection & Ingestion

Our journey began with a rich dataset of films, including the release year, plot summary, and popularity score. We focused on films with LGBTQ+ representation, ensuring a diverse and global perspective.

‍

Using Ascend’s platform, we created robust Read Components to pull in the raw film data. These components handled everything from CSV parsing to schema validation, ensuring our data was clean and ready for analysis.

‍

2. Mapping Film Genres: From Raw IDs to Meaningful Genre Labels

Before analyzing film themes, we needed to make genre data meaningful. Our raw dataset stored each film’s genres as a string of numeric IDs (e.g., "[18, 35]"). To enable real analysis, we mapped these IDs to readable genre names like “Drama” or “Comedy.”

‍

How we did it:

Cleaned and parsed the GENRE_IDS column, handling missing or malformed values and converting strings to lists of integers.
‍
Joined these lists with a genre reference table to map each ID to its name.
‍
Aggregated the results so each film record included a genre_name column—a list of genre labels, or an empty list if none were found.
‍
This mapping step was essential: it let us ask questions like “How have LGBTQ+ comedies changed over time?” and made our downstream analytics and AI summaries much more interpretable.

Polars Genre Processing Python Code

Python

# Step 1: Clean and split genre_ids string into list of integers with proper edge case handling
films_df = films_df.with_columns(
    pl.when(pl.col("GENRE_IDS").is_null() | (pl.col("GENRE_IDS") == ""))
    .then(pl.lit(None, dtype=pl.List(pl.Int64)))  # Handle null/empty as empty list
    .otherwise(
        pl.col("GENRE_IDS")
        .str.replace_all(r"[^\d,]", "")  # Remove everything except digits and commas
        .str.split(",")
        .list.eval(
            pl.element()
            .filter(pl.element() != "")  # Filter out empty strings
            .filter(pl.element().str.len_chars() > 0)  # Additional check for non-empty
            .cast(pl.Int64, strict=False)  # Use non-strict casting to handle conversion errors
            .filter(pl.element().is_not_null())  # Remove any nulls from failed conversions
        )
    )
    .alias("genre_ids")
)

# Step 2: Normalize genre column names
genres_df = genres_df.rename({"ID": "genre_id", "NAME": "genre_name"})
genres_df = genres_df.with_columns(pl.col("genre_id").cast(pl.Int64))

# Step 3: Explode genre_ids and join with genre names
# Handle cases where genre_ids might be null or empty
exploded = (films_df
            .select(["genre_ids"])
            .with_row_count("row_nr")
            .filter(pl.col("genre_ids").is_not_null())  # Filter out null genre_ids
            .filter(pl.col("genre_ids").list.len() > 0)  # Filter out empty lists
            .explode("genre_ids"))

# Only proceed with join if we have data to join
if exploded.height > 0:
    joined = exploded.join(genres_df, left_on="genre_ids", right_on="genre_id", how="left")
    # Step 4: Group back by row and collect genre names
    genre_names = joined.group_by("row_nr").agg(pl.col("genre_name")).sort("row_nr")
else:
    # Create empty genre_names DataFrame with correct schema
    genre_names = pl.DataFrame({
        "row_nr": pl.Series([], dtype=pl.UInt32),
        "genre_name": pl.Series([], dtype=pl.List(pl.Utf8))
    })

# Step 5: Add genres back to original films_df
films_df = films_df.with_row_count("row_nr").join(genre_names, on="row_nr", how="left").drop("row_nr")

# Fill null genre lists with empty lists
films_df = films_df.with_columns(
    pl.col("genre_name").fill_null(pl.lit([], dtype=pl.List(pl.Utf8)))
)

return films_df.to_pandas()

‍

3. Thematic Analysis by Decade

To understand how themes evolved, we grouped films by decade. For each decade, we aggregated plot summaries and thematic tags, creating a “decade summary” that captured the essence of LGBTQ+ storytelling in that era.

‍

Here’s a sample SQL transform that aggregates a sample of themes by decade:

SQL Film Themes Query

SQL

SELECT
    decade,
    LISTAGG(film_themes, '; ') WITHIN GROUP (ORDER BY TITLE) AS all_film_themes,
    COUNT(*) AS num_films
FROM (
    SELECT
        decade,
        TITLE,
        film_themes
    FROM film_themes
    QUALIFY ROW_NUMBER() OVER (PARTITION BY decade ORDER BY RANDOM()) <= 100  -- Limit to 100 films per decade
)
GROUP BY decade

‍

4. Summarization with AI

Here’s where things got exciting. We leveraged Snowflake Cortex’s powerful SUMMARIZE function to generate natural-language summaries of each decade’s themes. This AI-driven approach distilled thousands of words into concise, insightful overviews.

‍

Below is the actual SQL transform from our pipeline:

Snowflake Cortex SQL Query

SQL

SELECT
    decade,
    SNOWFLAKE.CORTEX.SUMMARIZE(
        'Given the following list of movie themes from the ' || decade || 's, summarize the most common themes of this decade: ' ||
        LEFT(all_film_themes, 10000) || 
        'Reference specific films only in parenthesis and follow this format: The 1950s saw a range of movie themes, including technology and rebellion ("The Mechanical Brain"), exploration of sexuality and identity ("The Trans Woman" and "Olivia"), housewifery and relationships ("The Inexperienced Housewife"), and moral dilemmas and crises of faith ("The Vicar of Bellington" and "A Double Life"). Other themes included communication between prisoners, cover-ups, and escapes.'
    ) AS decade_themes_summary,
    num_films
FROM decade_theme_agg
ORDER BY decade

‍

5. Evolution Over Time

Finally, we brought it all together. By feeding the decade summaries into Cortex again, we generated an overarching narrative: a summary of how LGBTQ+ film themes have evolved from the 1880s to today.

‍

Here’s the SQL that does this final summarization:

‍

Key Findings: The Shifting Landscape of LGBTQ+ Cinema

So, what did we discover? Here are some of the most compelling insights from our analysis:

LGBTQ Films: Temporal Analysis

Exponential Growth

2.7x

increase from 2000s to 2010s decade

Peak Year

2020

348 films - highest single year

Peak Release Month

January

Peak month (618 films total)

Yearly Film Production (1960-2022)

Films by Decade

Historical Phases of LGBTQ Cinema

Early Era

1882-1960

films

Civil Rights

1961-1980

287

films

AIDS Crisis

1981-1995

325

films

Modern Era

1996-2010

1,175

films

Digital Era

2011-2021

2,338

films

Seasonal Release Patterns

Genre Evolution by Decade

Top Genre

All Time

Drama

2,647 films

2010s Leader

Peak Decade

Drama

1,391 films

Romance Growth

2000s-2010s

2.2x

increase

Documentary Rise

Steady Growth

403

in 2010s

Language Distribution

Top 12 languages by film count

English Dominance

Primary Language

2,786

films (58%)

Spanish Films

Second Most

393

films (8%)

French Cinema

Third Place

363

films (7.6%)

Global Reach

Total Languages

languages

Key Findings

Historical Trends

1960s breakthrough: Transition from sporadic releases to consistent production
1980s-90s surge: Coinciding with AIDS activism and increased visibility
2000s explosion: Digital filmmaking democratized production
2010s peak: Mainstream acceptance and streaming platforms
2020-2022 decline: COVID-19 impact on film industry

Seasonal Patterns

January peak: Sundance Film Festival and awards season launches
October second: Fall festival circuit and awards preparation
June third: Pride Month and summer festival season
December low: Holiday season shows lowest activity
Festival alignment: Releases timed to major film events

‍

Why This Matters

Data can do more than count and categorize—it can help us see the arc of history, the progress we’ve made, and the work still to be done. By analyzing LGBTQ+ film through the lens of data, we honor the storytellers who paved the way and inspire the next generation to keep telling their truth.

‍

Happy Pride! 🏳️‍🌈

‍

Interested in the technical details or want to contribute? Check out the Ascend pride repo!

‍