Announced in May by the makers of DuckDB, DuckLake turns the lakehouse table format into a fast, simple experience that anyone can try in a matter of minutes. After years of wrestling with the complexity of Iceberg and Delta Lake—the manifest files, the compaction nightmares, the catalog coordination—it feels like a real step forward.
The DuckDB team's approach is elegantly simple: store all metadata in a standard SQL database while keeping data in Parquet files. Instead of traversing file hierarchies to understand your tables, you just query a database. No more scattered manifest files, no complex catalog services to coordinate.
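To give a feel for how little ceremony is involved, here is a minimal sketch using recent DuckDB with the ducklake extension (the catalog path, data path, and table are illustrative):

```sql
-- Install and load the DuckLake extension
INSTALL ducklake;
LOAD ducklake;

-- Attach a DuckLake catalog: metadata lives in a SQL database
-- (here a local DuckDB file), data lives in Parquet under DATA_PATH
ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'data/');
USE my_lake;

-- From here it's just SQL: inserts become Parquet files,
-- metadata updates become rows in the catalog database
CREATE TABLE events (id INTEGER, payload VARCHAR);
INSERT INTO events VALUES (1, 'hello');
SELECT * FROM events;
```

Swapping the DuckDB file for Postgres or MySQL changes only the ATTACH string, not the workflow.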
What makes this particularly interesting is how well DuckLake's approach aligns with how we built Ascend's data automation platform. I wanted to walk through what we're seeing when teams combine the two.
What we all actually want
Iceberg and Delta Lake promised us database-like features on cheap storage, but they delivered it with a side of operational complexity that's been eating our lunch.
Every query requires multiple sequential HTTP requests to piece together metadata from scattered files. Small changes create cascading file rewrites. Concurrent operations conflict because the critical path for commits is so long.
The DuckLake manifesto puts it perfectly: Iceberg and Delta Lake were designed to avoid using databases, but then had to add databases anyway for consistency. They never revisited their file-centric architecture after making that fundamental compromise.
What data teams actually want is simple: fast pipelines that don't break, cheap storage that scales, and metadata management that doesn't require a PhD in distributed systems.
Why this works so well with Ascend
Here's what's interesting: we already had most of the pieces needed to support DuckLake well.
We've had DuckDB running in-process in Ascend flows for a while—it's incredibly fast for analytical workloads. We already support connecting to Postgres, MySQL, and other SQL databases natively. Our cloud storage connectors work seamlessly with S3, Azure Blob, and GCS. But until DuckLake, we didn't have an elegant way to persist and manage that DuckDB data in a production lakehouse setup.
DuckLake changes everything because it gives us that persistence layer while embracing what databases do best: managing metadata efficiently.
Think about the architectural elegance here. Instead of encoding metadata in a maze of JSON and Avro files, DuckLake just uses SQL tables in a database you already know how to operate. When you want to query a table, instead of multiple round trips to object storage to reconstruct state, you send one SQL query to get exactly the files you need to read.
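Concretely, that "one SQL query" looks something like the sketch below. The table and column names follow the published DuckLake metadata spec as I understand it, but treat them as illustrative rather than authoritative:

```sql
-- The catalog is ordinary SQL tables; a single query answers
-- "which Parquet files make up this table right now?"
SELECT df.path, df.record_count
FROM ducklake_data_file AS df
JOIN ducklake_table AS t ON t.table_id = df.table_id
WHERE t.table_name = 'events'
  AND df.end_snapshot IS NULL;  -- files still live in the current snapshot
```

Compare that to reconstructing the same answer from a chain of JSON and Avro manifest files fetched over HTTP.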
This isn't just theoretically cleaner—it's practically transformational when combined with Ascend's agentic intelligence.
How Ascend works with DuckLake
Our Intelligence Core already tracks metadata and optimizes pipeline execution using intelligent fingerprinting. Our embedded AI agents are fully context-aware: they understand your schemas, your data patterns, and your performance trends based on that metadata. When schema changes happen, when new data arrives, or when optimization opportunities emerge, our agents can respond intelligently because they're working with the same metadata you use.
Our Smart Tables also pair naturally with DuckLake. They partition queries over massive datasets into many independent queries, which lets us scale DuckDB to workloads that weren't previously possible or performant by launching many DuckDB workers in parallel. DuckLake's efficient SQL-based metadata operations make this parallelism even more effective: each worker can quickly determine exactly which data it needs to process, without the file traversal overhead that slows down traditional table formats.
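As a rough sketch of the partitioning idea (the table, slice count, and hashing scheme are hypothetical, not Ascend's actual implementation): each worker runs the same query with its own slice predicate, and DuckLake's metadata query prunes that worker's file list before any Parquet is read.

```sql
-- Worker 3 of 8 scans only its slice of the hypothetical events table;
-- DuckLake resolves the matching files with one metadata query,
-- not a walk over manifest files in object storage
SELECT date_trunc('day', event_time) AS day,
       count(*) AS n
FROM events
WHERE hash(id) % 8 = 3   -- this worker's slice
GROUP BY day;
```

Running eight of these concurrently against the same catalog is cheap precisely because commit-path coordination happens in the database, not across files.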
What this looks like in practice
Let me break down the practical benefits:
Faster query planning: Query planning goes from multiple HTTP requests to a single database query. As the DuckDB team explains, DuckLake eliminates the file I/O bottlenecks that create latency floors in traditional formats. With Ascend's optimization layer, pipeline execution is noticeably faster.
Less operational complexity: There are no Avro manifest files to debug, no complex compaction schedules to maintain, no separate catalog services to babysit. It's just SQL databases (which you already know how to operate) plus Parquet files (which are portable and fast). Plus, with Ascend, agents help you build, deploy, and monitor your data pipelines.
Deployment flexibility: This is where the practical options get interesting:
- Bring Your Own: Already have Postgres or MySQL? Connect it to Ascend, point to your object storage, and you've got a production lakehouse running in minutes. Your data stays exactly where you want it.
- Ascend Managed: Want the full experience without infrastructure overhead? We'll run the database and manage the storage for you, giving you lakehouse functionality with zero operational burden.
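For the bring-your-own path, the setup is roughly one ATTACH statement. The connection string and bucket below are illustrative, and the exact Postgres parameters may differ in your environment:

```sql
-- Bring-your-own catalog: Postgres holds the metadata,
-- object storage holds the Parquet data files
ATTACH 'ducklake:postgres:dbname=lake_catalog host=pg.internal' AS lake
    (DATA_PATH 's3://my-bucket/lake/');
USE lake;
```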
The best part? Your data never gets locked in. DuckLake writes standard Parquet files that any tool can read, and the metadata schema is intentionally simple SQL that you can export anywhere.
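Because the data files are plain Parquet, any engine can read them directly, with or without DuckLake in the loop. For example (the path is illustrative):

```sql
-- No lock-in: point any Parquet reader at the data path
SELECT count(*) FROM read_parquet('data/**/*.parquet');
```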
How this affects development
I've been working with teams using this setup, and the feedback has been consistent: development feels more straightforward.
Pipeline development is faster because there's no complex catalog setup—just attach to your database and start building. The cost benefits are meaningful too. You're storing data in cheap object storage while running metadata operations against efficient database systems that are much less expensive to operate than traditional data warehouses.
What I find most valuable is how this changes where engineering time gets spent. Instead of managing lakehouse complexity, teams can focus more on building the datasets their businesses are demanding from them. And the agents handle optimization, monitoring, and maintenance tasks that usually consume significant engineering bandwidth.
Trying this out
DuckLake support is available in Ascend now. Whether you want to bring your own database infrastructure or have us manage it for you, you can try this combination today.
If you want to understand more of the technical details, I'd recommend the DuckLake manifesto and this technical deep dive that explains the SQL-first approach. The DuckDB team also has videos explaining the design decisions behind DuckLake.
If you're curious about how this works with Ascend specifically, book a demo and we can show you the setup process and performance characteristics firsthand.