Ascend’s Architecture

Frameworks vs. Control Planes

The status quo for building data pipelines is painful and cumbersome. Sure, building the first couple isn’t too bad…it’s the many that follow. And how they impact each other. And how they become harder and harder to maintain as the dependencies grow and the codebase becomes increasingly unruly. For us, this was a terminal path for our happiness as engineers.

To solve this, there are two options: Frameworks and Control Planes. Most other companies went the framework route. Frameworks are great at making it easier to build at a static point in time, but they lack a feedback loop, which is critical for dynamic systems.

Dynamic systems, like data pipelines, are where control planes shine. Instead of the “when X happens, do Y” framework mentality, a control plane takes the approach of “no matter what happens, make the system look like Z.” As engineers, we wanted the latter. So we built the Dataflow Control Plane.
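
To make the contrast concrete, here’s a minimal sketch of the reconciliation loop at the heart of any control plane. The function names are illustrative placeholders, not our actual internals:

```python
import time

# Illustrative placeholders, not our actual internals: the shape of every
# control plane is a loop that converges actual state toward desired state.
def reconcile(desired, observe, plan, apply_action, interval=5):
    while True:
        actual = observe()                    # look at the world as it is right now
        for action in plan(actual, desired):  # compute what must change
            apply_action(action)              # e.g. launch a job, resize a cluster
        time.sleep(interval)                  # new data, failures, and code changes
                                              # are simply picked up on the next pass
```

Note there’s no “when X happens, do Y” branch anywhere. Whatever happened since the last pass, the loop just drives the system back toward Z.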

In architecting this, we solved for three key areas:

1) User-defined “blueprints” of pipelines

2) Translating the “blueprints” into jobs and infrastructure

3) Persisting bidirectional feedback so the system always converges on that end state

Architecting the Dataflow Control Plane

Defining Blueprints

In Ascend, you build out declarative DAGs to define these blueprints. Just tell us the inputs, the transforms you want to happen at each stage, and the outputs, and Ascend creates the blueprint from there. SQL was the first language supported, since it let us infer a lot of patterns from the code provided, making it a great training ground for the Control Plane. We now also support PySpark, with more options on the way.
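
To make that concrete, here’s an illustrative sketch in Python, with hypothetical class names and connector paths (this is not Ascend’s actual authoring API), of what a blueprint captures: inputs, per-stage transforms, and outputs, and nothing about how or when to run them:

```python
# Illustrative only: these classes and source paths are hypothetical, not
# Ascend's actual API. A blueprint declares the *what* of a pipeline and
# leaves the *how* entirely to the Control Plane.
from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    inputs: list   # upstream stage names or external sources
    sql: str       # the transform itself, declared rather than scheduled

@dataclass
class Blueprint:
    stages: list = field(default_factory=list)

    def stage(self, name, inputs, sql):
        self.stages.append(Stage(name, inputs, sql))
        return self

pipeline = (
    Blueprint()
    .stage("raw_events", inputs=["s3://bucket/events/"], sql="SELECT * FROM source")
    .stage("sessions", inputs=["raw_events"],
           sql="SELECT user_id, sessionize(ts) AS session FROM raw_events")
    .stage("daily_stats", inputs=["sessions"],
           sql="SELECT session_date, COUNT(*) FROM sessions GROUP BY 1")
)
# The control plane, not the author, decides when and where each stage runs.
```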

Translating the Blueprints

With the blueprint in hand, the Control Plane understands the expected end-state of the pipelines and automates the Elastic Data Fabric to deliver on it. The current fabric utilizes Spark clusters deployed on Kubernetes, combined with the selected cloud provider’s persistent object store. By managing this underlying infrastructure as a service, the Control Plane has complete control to auto-scale and auto-orchestrate with a high degree of precision based on the data, code, concurrency, and SLAs.
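
The sketch below, reusing the hypothetical Blueprint from above, is a guess at the general shape of that translation step, not our actual scheduler: walk the blueprint in dependency order, turn each stage into a job spec for the fabric, and size the cluster to the runnable width of the DAG. The four-executors-per-stage heuristic is purely an assumption for illustration:

```python
# A rough sketch of blueprint-to-infrastructure translation. The job-spec
# fields and sizing heuristic are assumptions, not Ascend's actual logic.
from graphlib import TopologicalSorter  # Python 3.9+

def translate(blueprint, max_executors=50):
    # External sources (e.g. object-store paths) are not DAG dependencies.
    deps = {s.name: [i for i in s.inputs if not i.startswith("s3://")]
            for s in blueprint.stages}
    order = list(TopologicalSorter(deps).static_order())  # dependency order

    jobs = [{"stage": name,
             "engine": "spark-on-k8s",
             "sql": next(s.sql for s in blueprint.stages if s.name == name)}
            for name in order]

    # Scale the fabric to the runnable width of the DAG: stages with no
    # unfinished upstreams can run concurrently.
    ready_now = sum(1 for d in deps.values() if not d)
    executors = min(max_executors, ready_now * 4)  # 4 executors/stage: assumed
    return jobs, {"spark_executors": executors}
```

As stages complete, re-running the same translation re-plans the cluster, so capacity tracks the DAG’s actual concurrency rather than a static schedule.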

Persisting Bidirectional Feedback

With data pipelines, change is the only constant, which is why we built the Metadata Service into the Control Plane. This Service persists the lineage and history of every job that has run, down to partition-level details on the input files, transforms, and outputs. With this information in hand, the Control Plane continuously monitors the data, logic, and Elastic Data Fabric layers for change or failure and automatically determines the optimal path forward. For example, if new data arrives, the Control Plane can see which subset of partitions is affected and will recompute only what’s required.
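
Here’s a minimal sketch of that partition-level decision. The fingerprint-keyed metadata shape is an assumption for illustration, not the Metadata Service’s actual schema:

```python
# Minimal sketch of partition-level incremental recompute. The metadata
# shape (content fingerprints keyed by partition) is assumed for illustration.
def partitions_to_recompute(recorded, observed):
    """recorded/observed: {partition_key: content_fingerprint}.

    Returns the partitions whose inputs are new or changed since the
    lineage stored from the last run was written.
    """
    return {key for key, fp in observed.items() if recorded.get(key) != fp}

# Example: one new file lands and one existing file is overwritten.
recorded = {"2023-01-01": "a1f3", "2023-01-02": "9bc0"}
observed = {"2023-01-01": "a1f3", "2023-01-02": "77de", "2023-01-03": "c4e2"}
print(sorted(partitions_to_recompute(recorded, observed)))
# -> ['2023-01-02', '2023-01-03']; everything else is left untouched
```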