Data orchestration is on a lot of minds these days. With productivity at a premium, there is plenty of talk about what orchestration means and its potential for accelerating the pace of innovation with data. In this post, we demystify data orchestration, break down its steps and phases, and discuss why it’s crucial for companies to have a platform that intelligently orchestrates their data and automates their workflows.
What is the meaning of data orchestration, anyway?
At its simplest, data orchestration is the set of systems and methodologies that automatically move and process data, optimally and efficiently, across our vast data ecosystem. Other frequently used terms are big data orchestration and ETL or ELT orchestration. All essentially refer to the process of coordinating, executing, and monitoring one or many data pipelines toward one ultimate goal: extracting maximum value from data assets.
The steps in the data orchestration process sound easy enough: ingest and transform the data to be used by business intelligence (BI) tools, artificial intelligence (AI), or machine learning (ML) platforms. A closer look reveals more complexity. There’s the question of whether processes should run in real time, near real time, hourly, daily, or be kicked off by specified events. There are also many interdependencies that need to be coordinated. A robust data orchestration process should, at the very least, be able to execute tasks in the desired order, detect errors and recover from them, and generate alerts and logs. The most powerful platforms can do much more.
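As a rough illustration, those baseline requirements (ordered execution, error detection, alerting via logs) can be sketched in a few lines of Python. The `ingest` and `transform` steps here are hypothetical stand-ins, not a real pipeline:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orchestrator")

def ingest():
    """Hypothetical ingestion step: pull raw rows from a source."""
    return [{"visitor": "a", "browser": "chrome"}]

def transform(rows):
    """Hypothetical transform step: normalize the browser field."""
    return [{**r, "browser": r["browser"].title()} for r in rows]

def run_pipeline(steps):
    """Execute steps in order, feeding each step's output to the next;
    on failure, log an alert and re-raise so the run is marked failed."""
    data = None
    for step in steps:
        try:
            data = step() if data is None else step(data)
        except Exception:
            log.exception("step %s failed; alerting on-call", step.__name__)
            raise
    return data

result = run_pipeline([ingest, transform])
```

A real orchestrator layers retries, scheduling, and dependency tracking on top of this core loop, but the loop itself is the heart of it.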
More data, more complexity = more data orchestration
To be clear, there are times when data orchestration may not be required. Teams working with relatively static data sets can often use raw data to achieve their business goals without pipelines or orchestration, and that approach is sufficient so long as cost and performance remain acceptable. What typically happens, however, is that it becomes expensive or slow, particularly as teams introduce new data into the process. Enter data orchestration.
At the outset, data orchestration was done using a timer-based model, running jobs at regular intervals based on a simple workflow (e.g., once a night, count the number of website visitors by browser and geography and write it into a database table that powers a BI tool for the marketing team). This type of timer-based data orchestration is still common, useful, and relatively efficient, especially when companies receive big blocks of data at once or when they have simple questions that need answering.
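A timer-based job like the nightly example above can be as simple as a function that a scheduler (cron, say, with `0 2 * * *`) invokes once a night. The visit log and table name below are made up for illustration:

```python
import sqlite3
from collections import Counter

# Hypothetical raw visit log; in practice this would come from web analytics.
visits = [
    {"browser": "Chrome", "geo": "US"},
    {"browser": "Chrome", "geo": "US"},
    {"browser": "Safari", "geo": "DE"},
]

def nightly_rollup(conn):
    """Count visitors by browser and geography and write the result
    into a table that a BI tool can query. A timer calls this nightly."""
    counts = Counter((v["browser"], v["geo"]) for v in visits)
    conn.execute("CREATE TABLE IF NOT EXISTS visitors_by_segment "
                 "(browser TEXT, geo TEXT, n INTEGER)")
    conn.execute("DELETE FROM visitors_by_segment")  # full refresh each run
    conn.executemany("INSERT INTO visitors_by_segment VALUES (?, ?, ?)",
                     [(b, g, n) for (b, g), n in counts.items()])
    conn.commit()

conn = sqlite3.connect(":memory:")
nightly_rollup(conn)
```

Note the full-refresh pattern: timer-based jobs often recompute the whole answer each run, which is exactly why they suit big blocks of data and simple questions.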
Over time, data orchestration inevitably gets more complex and sophisticated, as do business needs. The number of data sources grows, more groups start realizing that curated data helps them make far better business moves, and everybody wants the right data faster. This has led to approaches such as trigger-based data orchestration, which kicks off workflows when an external event occurs such as the availability of an incremental new piece of data. Other models have also emerged, such as more advanced rules and dependency-based orchestration (e.g., if my input data is all there, the data-quality checks are complete, and the compute resources are available, then trigger the workflow).
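A dependency-based trigger like the one in the parenthetical above reduces to checking a set of preconditions before firing the workflow. The input names and the state here are hypothetical:

```python
def ready_to_run(inputs_arrived, checks_passed, slots_free):
    """Rule-based trigger: fire the workflow only when every input has
    landed, quality checks have passed, and compute is available."""
    return all(inputs_arrived.values()) and checks_passed and slots_free > 0

# Hypothetical state for a nightly sales workflow.
inputs = {"orders": True, "customers": True}
fire = ready_to_run(inputs, checks_passed=True, slots_free=2)
```

In practice an orchestrator re-evaluates rules like this whenever an external event arrives, rather than on a fixed timer.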
Today, workflows can get ultra-complex and still be successfully orchestrated. Companies might have data coming from hundreds if not thousands of different sources, ranging from cloud APIs and cloud warehouses to on-prem databases and data lakes, and use the power of orchestration to create derived datasets and data models previously unimaginable. A lot of exciting things are made possible through successful data orchestration, and that’s enabling more people to work on refining and using data and adding tremendous business value, without exploding infrastructure costs or, worse, demands on data teams’ time.
When doing data ingestion, it’s important to take the approach that plays to each system’s strengths. For example, queue-based systems are awesome for real-time, streaming data. However, queues quickly hit the limits of what they’re designed for if you ask them to enrich records as they stream through (especially as the window of records that need to be joined widens). A better option in that case might be a data orchestration or big data tool. When it comes to data ingestion, it’s vital to apply the right tool to the right use case.
When workflows get crazy complex: DAGs come into play
When things get more complex, people typically start to introduce DAGs, or directed acyclic graphs (graphs that do not loop back on themselves). In their simplest form, DAGs allow developers to specify sequences of steps that can branch out, converge, and more, waiting to perform any particular step until its upstream steps are complete. With DAGs, pipelines can become increasingly powerful by following a series of sophisticated dependency chains to automate workflows.
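Python’s standard-library `graphlib` makes the core DAG behavior easy to sketch: each step runs only after all of its upstream steps are done. The step names below are illustrative only:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: two ingest steps converge into a join, then a report.
# Each key's value is the set of steps it depends on.
dag = {
    "join":   {"ingest_orders", "ingest_users"},
    "report": {"join"},
}

def run_dag(dag, tasks):
    """Run every step in a dependency-respecting order, returning the
    order in which steps executed. `tasks` maps step name -> callable."""
    order = []
    for step in TopologicalSorter(dag).static_order():
        tasks.get(step, lambda: None)()  # no-op for steps without a task
        order.append(step)
    return order

ran = run_dag(dag, {})
```

`TopologicalSorter` also supports marking nodes done incrementally, which is how a real orchestrator would interleave independent branches in parallel.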
A common pattern emerges as a wider variety of teams embrace DAGs: they want to share their data and findings with each other, and subsequently build their own ETL (or ELT) logic that depends on the other teams’ data and logic. This spurs the need for more sophisticated data pipelines and automated workflows. The business value of DAGs then begins to grow exponentially.
Only a data orchestration platform that supports concurrent DAG orchestration and “DAG of DAGs” can handle this level of complexity. It can create an output of cleaned, transformed data in one pipeline and turn around to use that as an input into another. Any change made to the transformation upstream will automatically flow through to the downstream pipeline.
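One way to picture a “DAG of DAGs” is a meta-graph whose nodes are whole pipelines: a downstream pipeline runs only after the pipeline whose output it consumes has finished. This toy sketch (pipeline and step names invented for the example) applies a topological ordering over pipelines, then runs each pipeline’s steps in sequence:

```python
from graphlib import TopologicalSorter

# Hypothetical "DAG of DAGs": each value is itself a mini-pipeline.
pipelines = {
    "clean": ["extract", "dedupe"],      # produces a cleaned dataset
    "model": ["join", "aggregate"],      # consumes clean's output
}
meta_dag = {"model": {"clean"}}          # model depends on clean

def run_all(meta_dag, pipelines):
    """Run pipelines in dependency order, each pipeline's steps in order."""
    ran = []
    for name in TopologicalSorter(meta_dag).static_order():
        for step in pipelines[name]:
            ran.append(f"{name}.{step}")
    return ran

ran = run_all(meta_dag, pipelines)
```

Because `model` is downstream of `clean` in the meta-graph, a change to `clean`’s transformation automatically precedes, and flows into, every `model` run.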
By now, you might be asking yourself: doesn’t orchestrating multiple DAGs at once lead to potential resource contention for your underlying infrastructure? Yes, it can. As volume increases, multiple DAGs may end up competing for the same processing power of your underlying infrastructure. That’s why any data orchestration platform should support an increasing number of pipelines by aligning to a few key principles:
- Allocate resources to workflows based on their assigned priorities, including inferring priorities of workloads by traversing DAGs.
- Parallelize workloads whenever possible. If you have spare capacity, run lower-priority workloads too (such as when high-priority workloads are waiting for data to become available). This reduces the likelihood of resource contention later on as the DAGs are traversed.
- Allow for resource isolation. At times you may still want dedicated resources for different workloads and should be able to provide reserved and/or dedicated resources for separate workloads.
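The first two principles can be caricatured as a greedy scheduler: filter to workloads whose inputs are ready, sort by priority, and backfill any remaining slots with lower-priority work while high-priority work is blocked. The workload names and priority scale below are invented for the example:

```python
def schedule(workloads, slots):
    """Greedy sketch: give slots to the highest-priority runnable work,
    opportunistically backfilling with lower-priority workloads while
    blocked high-priority workloads wait on their inputs."""
    runnable = [w for w in workloads if w["inputs_ready"]]
    runnable.sort(key=lambda w: w["priority"])  # lower number = higher priority
    return [w["name"] for w in runnable[:slots]]

workloads = [
    {"name": "exec_dashboard", "priority": 0, "inputs_ready": False},  # blocked
    {"name": "daily_rollup",   "priority": 1, "inputs_ready": True},
    {"name": "ml_backfill",    "priority": 2, "inputs_ready": True},
]
```

Note that `exec_dashboard`, despite being highest priority, yields its slot while blocked; a real platform would also infer priorities by traversing the DAGs rather than taking them as given.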
Why invest in orchestration
In addition to adding more business value and optimizing resources, the right data orchestration platform can help teams avoid repeatedly starting jobs manually. Teams can build new things and add more business value without needing a separate orchestration tool to schedule jobs. The more data teams scale, the more they need automation to streamline their day-to-day tasks, from ingestion and transformation to delivery, maintenance, and governance.
It would stand to reason, then, that automated data orchestration processes and platforms would be the standard; surprisingly, for a lot of teams they aren’t yet. For one, data automation, while increasingly prevalent, is still something that many teams are just starting to experiment with and see the benefits of. So data engineers and data teams rely on more manual ways of orchestrating pipelines, creating a set of instructions: read data from x source, run y code on it, and push the results to z.
However, once a pipeline is codified, it quickly becomes brittle. Everything is forever changing: the data, the workloads and resources, and the assumptions. This can lead to what no data engineer wants to hear: alerts going off at 3 a.m. because the pipeline broke and part of the crucial data infrastructure went down.
Ascend: the solution to common orchestration problems
Current workflow automation systems are imperative by design, putting the burden on developers to design the tasks, connect them together via dependency relations, and pass them to a compute and infrastructure team/layer that manages the execution. This results in inflexible pipelines that break easily and are expensive to run and maintain.
Ascend, on the other hand, is declarative by design. The Ascend platform leverages a sophisticated Control Plane that imperative systems lack. It takes care of the mundane tasks, so the only thing developers need to do is create and curate a high-level spec. There’s no need to worry about how the dataflow automation system manages the legion of tasks required to implement that spec.
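To make the contrast concrete, here is a toy declarative spec and a “control plane” that diffs the desired state against what has already been materialized, emitting only the tasks needed to converge. This illustrates the declarative idea in general; it is not Ascend’s actual API, and all names in it are invented:

```python
# Hypothetical declarative spec: each dataset declares what it is derived
# from and how, not the tasks needed to build it.
spec = {
    "events_clean": {"from": "raw_events", "transform": "dedupe"},
    "daily_report": {"from": "events_clean", "transform": "aggregate"},
}

def plan(spec, materialized):
    """Toy control plane: compare the declared spec with what already
    exists and return only the datasets that still need building."""
    return [name for name in spec if name not in materialized]

# events_clean already exists, so only daily_report needs work.
tasks = plan(spec, materialized={"events_clean"})
```

The developer’s artifact is the `spec`; deciding which tasks to run, and re-deciding whenever data or code changes, is the control plane’s job.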
Why go declarative with Ascend’s Data Automation Cloud?
Using this declarative model, orchestration requires less code and less maintenance; it accelerates development cycles and makes pipelines more adaptive to change. Taking that one step further, Ascend’s orchestration platform is:
- DataAware: Ascend’s Data Automation Cloud tracks all the data, for all time, including its profile, where it came from, how it got there, where it went to, how it was used, and why.
- Autonomous: Ascend continuously watches for changes to data and code and dynamically generates tasks and parameters to ensure the right data arrives, always.
- Scalable: Ascend can track trillions of records and billions of data partitions and react to changes within seconds—with single- or multi-DAG orchestration.
- Self-Optimizing: Ascend leverages data collected to drive greater efficiencies than can reasonably be done manually, such as avoiding duplicate and unnecessary computations with mid-pipeline checkpointing, automated backfills, and automated rollbacks.
The result? Data teams can focus more on doing things better and faster, and less on the monotony and mechanics of how they get done.