Data teams worldwide are building data pipelines with the point solutions that make up the “modern data stack.” But this approach is quite limited, and does not actually provide any true automation. Pipelines built with this approach are slow and require constant manual reprogramming and updates. They drain engineering teams of their efficiency while robbing stakeholders of the just-in-time data they need to inform and accelerate their growth.
Automation, however, enables next-generation, intelligent pipelines that can help engineering teams leapfrog their companies forward with guaranteed data integrity, smart change propagation, and powerful cost controls, all while empowering human intervention. Intelligent data pipelines anticipate and ease engineers’ tasks, so that engineers can drive business change rather than be consumed by busywork and keeping the lights on.
Read on to learn:
- What constitutes intelligent data pipelines
- What challenges continue to badger engineering teams with a “modern data stack” approach
- What benefits lie ahead when adopting intelligent data pipelines
What do we mean by data pipeline automation?
Before we get into the specifics of intelligent data pipelines, let’s explain what we mean by data pipeline automation.
Data pipeline automation is NOT:
- A collection of lots of tools patched together
- A mix of heuristics engineered from scratch with code
- Libraries of code for individual use cases
Data pipeline automation replaces data stacks assembled from multiple tools and platforms to boost efficiency and make order-of-magnitude improvements to productivity and capabilities, at scale. Such results arise from the use of a single unified platform — one with a data control plane that (1) utilizes an immutable metadata model of every aspect of the pipelines and (2) automates every operation with it.
With such a comprehensive, end-to-end approach to automation, engineers can rapidly build new pipelines to meet business needs, and instantly identify and respond to data problems across even the most complex and interwoven data pipelines. For a deep dive into this level of automation, take a look at our What Is Data Pipeline Automation paper.
What are the challenges with data pipelines built without automation?
Traditional methods of building data pipelines can’t deliver results at the rate and accuracy today’s business leaders expect. And while some data engineering teams attempt to give their data pipelines some “intelligence” with the clever use of orchestration, they are often stuck with the characteristics of non-automated pipelines — what we refer to as “legacy data pipelines.”
Legacy data pipelines have thousands of potential points of error. Any one problem can cause major disruption to the business. Companies with an understaffed engineering team suffer from lengthy outages during debug and break-fix cycles.
A Black Box
Let’s face it, probably only the most highly experienced, longest-tenured engineers at your company really know how your data pipelines work. When pipelines aren’t automated, they grow into increasingly complicated webs of code, making it difficult to drive efficiency, diagnose problems, and trust the accuracy of the resulting data. Your new hires are helpless with existing pipelines — they get the new projects to build more black boxes that no one else understands.
Because legacy data pipelines don’t break down the sources of runtime costs, it’s challenging to pinpoint the pipeline inefficiencies and redundancies that cost the company time and money. When finance asks engineering teams to justify skyrocketing compute costs, those teams are hard-pressed to reconcile the expense with value.
Lack of Data Integrity
In legacy pipelines, data quality issues can go undetected for months, which means business users and customers often use outdated or incorrect data to make game-time decisions that affect the trajectory of the company. Engineering teams are only notified about the problem after someone in the business discovers the issue. The resulting whack-a-mole remedies burn time and expensive engineering resources that should be building new solutions instead.
Absence of Dedicated Tooling
To build and run legacy data pipelines, engineering teams have to integrate and customize several general-purpose tools from the market to fit a square peg into a round hole. They often patch together many brittle integration points with custom code, making their systems difficult to understand and troubleshoot.
Legacy data pipelines involve manual coding of complex heuristics around the simple tools of the modern data stack, and yet these heuristics still only solve one use case at a time. This compounds the bottlenecks of those tools and code libraries, and embeds lots of unnecessary reprocessing. The absence of a unifying platform means that integrations are inefficient and act as barriers to data flows. In the end, these impediments can hamstring the data engineering team and stall business growth.
Defining Intelligence in Data Pipelines
Intelligent data pipelines are the result of decades of experience building data pipelines, encapsulated in a new automation paradigm that considers several dimensions:
- The processes of building and engineering data pipelines
- The processes of running data pipelines
- The processes of maintaining data pipelines
When you adopt this paradigm of designing, building, operating, and maintaining data pipelines, several benefits arise. You can deliver pipelines more quickly. Reuse becomes more efficient. Pinpointing and resolving issues is faster. And, most importantly, business results improve with game-changing velocity and efficiency.
So what do these intelligent data pipelines actually do differently from legacy data pipelines?
10 Characteristics of Intelligent Data Pipelines
Intelligent data pipelines exhibit several key differences from legacy data pipelines that have a compounding effect on the value data engineering teams can deliver. Let’s walk through the ten characteristics that constitute intelligence. Each of these unique traits is remarkable on its own, yet when woven together, they create a tapestry of innovation and impact.
1. Guarantee data integrity from source to sink.
Business stakeholders need utmost confidence in the data they rely on. Intelligent data pipelines provide data quality and the critical integrity needed to instill that confidence. For a deep dive, read our article How to Ensure Data Integrity at Scale by Harnessing Data Pipelines.
2. Apply actionable data quality.
Central to any data integrity strategy is the ability to apply controls in the form of actionable data quality rules. Intelligent data pipelines enforce acceptable ranges for any number of metrics at every point in every dataflow, and provide configurable actions to take when a rule is activated — notify and continue, alert and stop, etc. These rules detect data problems immediately as data arrives, and eliminate the costs of processing bad data.
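As a rough illustration of what an actionable rule could look like, here is a minimal Python sketch. The names `RangeRule`, `Action`, `PipelineStopped`, and `apply_rules` are hypothetical and do not come from any particular product; the sketch only assumes that each rule pairs an acceptable range for a metric with one of the two actions mentioned above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NOTIFY_AND_CONTINUE = "notify_and_continue"
    ALERT_AND_STOP = "alert_and_stop"


@dataclass(frozen=True)
class RangeRule:
    """Hypothetical rule: a metric must fall inside [low, high]."""
    metric: str
    low: float
    high: float
    action: Action


class PipelineStopped(Exception):
    """Raised when a violated rule is configured to halt processing."""


def apply_rules(metrics: dict[str, float], rules: list[RangeRule]) -> list[str]:
    """Check batch metrics against every rule.

    Returns warnings for notify-and-continue violations and raises
    PipelineStopped on the first alert-and-stop violation.
    """
    warnings = []
    for rule in rules:
        value = metrics[rule.metric]
        if rule.low <= value <= rule.high:
            continue  # metric is within its acceptable range
        message = f"{rule.metric}={value} outside [{rule.low}, {rule.high}]"
        if rule.action is Action.ALERT_AND_STOP:
            raise PipelineStopped(message)  # stop before bad data is processed
        warnings.append(message)
    return warnings
```

Running the check as each batch arrives, before any transformation, is what keeps bad data from incurring downstream processing costs.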
3. Link data pipelines end-to-end to propagate change.
In the world of data pipelines, a single change can have immediate ripple effects on multiple downstream pipelines. Intelligent data pipelines are designed to address this challenge. They automatically and incrementally propagate processing from the point of change, seamlessly traversing the entire data lineage and codebase, even across multiple data clouds. And as they do so, they keep engineers informed of any issues.
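To make incremental propagation concrete, the following sketch assumes the lineage is available as a simple mapping from each dataset to the datasets it reads from. The function name `affected_in_run_order` and the dataset names are illustrative only. Given a changed dataset, it finds just the downstream datasets and orders them so that each one is reprocessed after its inputs.

```python
from graphlib import TopologicalSorter


def affected_in_run_order(changed: str, upstream: dict[str, set[str]]) -> list[str]:
    """Return the datasets downstream of `changed`, inputs before outputs.

    `upstream` maps each dataset to the set of datasets it reads from.
    """
    # Invert the lineage so we can walk from a dataset to its consumers.
    downstream: dict[str, set[str]] = {}
    for node, parents in upstream.items():
        for parent in parents:
            downstream.setdefault(parent, set()).add(node)

    # Collect everything reachable from the point of change.
    affected: set[str] = set()
    stack = [changed]
    while stack:
        node = stack.pop()
        for child in downstream.get(node, ()):
            if child not in affected:
                affected.add(child)
                stack.append(child)

    # Order only the affected part of the graph; untouched datasets are skipped.
    subgraph = {node: upstream[node] & affected for node in affected}
    return list(TopologicalSorter(subgraph).static_order())


lineage = {
    "clean_orders": {"raw_orders"},
    "clean_users": {"raw_users"},
    "revenue": {"clean_orders", "clean_users"},
    "dashboard": {"revenue"},
}
run_order = affected_in_run_order("raw_orders", lineage)
```

In this example a change to `raw_orders` touches `clean_orders`, `revenue`, and `dashboard`, while `clean_users` is left alone.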
4. Restart pipelines from point of failure.
From time to time, your engineers are alerted to situations and failures that stump data pipeline automation. Intelligent data pipelines support the engineers by precisely pinpointing the origins of data integrity failures, so that they can intervene and resolve them quickly. Once the issues are resolved, intelligent data pipelines take over and restart from the precise point where the failure occurred, and continue to run without reprocessing any data and with zero manual housekeeping required.
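One simple way to picture restart-from-failure is a checkpoint per step. The sketch below is an assumption about how such a mechanism might work, not a description of any specific platform: `run_with_checkpoints` and the in-memory `checkpoints` dictionary are hypothetical, and a real system would persist the checkpoints durably.

```python
from typing import Any, Callable

Step = tuple[str, Callable[[Any], Any]]


def run_with_checkpoints(steps: list[Step], source: Any, checkpoints: dict[str, Any]) -> Any:
    """Run named steps in order, reusing saved outputs of completed steps."""
    value = source
    for name, func in steps:
        if name in checkpoints:
            value = checkpoints[name]  # already done: reuse, no reprocessing
            continue
        value = func(value)            # may raise; earlier checkpoints stay intact
        checkpoints[name] = value      # record success before moving on
    return value
```

If a step raises, the outputs of the earlier steps remain in `checkpoints`. After the engineer fixes the problem, calling the function again with the same `checkpoints` resumes at the failed step rather than at the beginning.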
5. Instantly roll back entire pipelines with zero reprocessing.
Using the modern data stack to build legacy pipelines still results in costly and time-consuming code fixes and constant re-runs. When a problem is detected, engineers have to manually infer which branches of the pipeline are affected, unwind bad datasets, reset checkpoints, and program any new dependencies. On the other hand, intelligent data pipelines streamline this process by performing these operations automatically in the background. Engineers can effortlessly roll back pipelines with a single operation, eliminating any need for reprocessing.
6. Pause, inspect, and resume at any point in the pipeline.
Since data pipelines are subject to constant change, your engineers are often called to update the logic or verify some aspect of the operations. Engineers are also constantly looking for ways to optimize and use resources more efficiently. Intelligent data pipelines encourage a harmonious collaboration between humans and machines by automating much of the housekeeping around manual interventions. They provide a level of control that allows engineers to pause at specific points in the pipelines without affecting any other pipelines. Engineers can inspect interim datasets, resolve issues or make changes, and instantly release the data pipelines to resume running, and automatically cascade any needed changes throughout the pipeline network.
7. Automatically retry cloud, data cloud, and application failures.
Intelligent pipelines will do whatever they can to keep running without compromising data integrity. They automatically retry failures in infrastructure, networks, data clouds, and more, all while maintaining consistency throughout the downstream processing steps. If at any of these steps the data remains unchanged, intelligent pipelines astutely determine that reprocessing is unnecessary, halting the pipeline sequence. This approach can result in substantial cost savings, potentially amounting to hundreds of thousands of dollars over the pipelines’ lifetime.
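The two behaviors described here, retrying transient failures and skipping work when data has not changed, can be sketched in a few lines of Python. Everything below is illustrative: the helper names, the choice of which exceptions count as transient, the exponential backoff, and the use of a SHA-256 fingerprint to detect unchanged data are all assumptions.

```python
import hashlib
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Assumption: these built-in exceptions stand in for transient infrastructure failures.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def with_retries(operation: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Call `operation`, retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to an engineer
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def fingerprint(payload: bytes) -> str:
    """Content hash used to tell whether a step's data has changed."""
    return hashlib.sha256(payload).hexdigest()


def should_reprocess(payload: bytes, previous: Optional[str]) -> bool:
    """Reprocess only if there is no prior fingerprint or the content differs."""
    return previous is None or fingerprint(payload) != previous
```

A step would store the fingerprint of its last successful output; when `should_reprocess` returns `False`, the downstream steps can be skipped.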
8. Delete or archive orphaned datasets.
Orphaned datasets are a recipe for confusion and, potentially, disaster in the context of GDPR, HIPAA, and PHI/PII policies. Stakeholders may unknowingly use outdated datasets, resulting in inaccurate reports and analytics. For instance, data scientists could be training machine learning models on inaccurate datasets. Or data products could inadvertently end up with private information in their datasets. Intelligent data pipelines avoid these common scenarios by automatically cleaning up interim and final datasets behind the scenes, minimizing the chances of using outdated data for decision-making.
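Detecting orphans is straightforward once the platform knows which pipeline produced each dataset. The sketch below assumes a hypothetical catalog that maps dataset names to the pipeline that wrote them; `find_orphans` and the example names are illustrative only.

```python
def find_orphans(produced_by: dict[str, str], active_pipelines: set[str]) -> list[str]:
    """Return datasets whose producing pipeline no longer exists, sorted by name."""
    return sorted(
        dataset
        for dataset, pipeline in produced_by.items()
        if pipeline not in active_pipelines
    )


catalog = {
    "orders_daily": "orders_pipeline",
    "orders_tmp_2021": "orders_backfill",  # its pipeline has since been retired
    "users_clean": "users_pipeline",
}
orphans = find_orphans(catalog, {"orders_pipeline", "users_pipeline"})
```

The resulting list is what a cleanup job would delete or archive according to the organization's retention policy.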
9. Optimize the costs of your data products.
Intelligent data pipelines equip engineering teams with detailed cost breakdowns of delivering data products, including individual operations like transforms, joins, and merges. Measuring these costs allows for the swift detection of inefficiencies, so you can optimize, refactor, and adapt the business logic to reduce costs. Intelligent data pipelines ensure you do not overprocess your data, to reduce your overall data infrastructure costs. They allow you to aggregate costs end-to-end, and validate the ROI of individual data products against the value they provide to the business.
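To show how per-operation costs roll up to a data product, here is a small sketch. The `CostRecord` shape is an assumption (costs are tracked in integer cents to avoid floating-point rounding), and the function names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class CostRecord(NamedTuple):
    data_product: str
    operation: str  # e.g. "transform", "join", "merge"
    cents: int


def total_by_product(records: list[CostRecord]) -> dict[str, int]:
    """Aggregate cost end-to-end for each data product."""
    totals: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.data_product] += record.cents
    return dict(totals)


def costliest_operation(records: list[CostRecord], data_product: str) -> str:
    """Name the operation that contributes most to one product's cost.

    Raises ValueError if the product has no cost records.
    """
    per_operation: dict[str, int] = defaultdict(int)
    for record in records:
        if record.data_product == data_product:
            per_operation[record.operation] += record.cents
    return max(per_operation, key=per_operation.get)
```

Comparing each product's total against the value it delivers gives a basic ROI check, and the costliest operation is the natural first candidate for refactoring.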
10. Provide detailed operational visibility in real-time, anytime.
Intelligent data pipelines generate consistent operational metadata that your engineers can view and monitor through a single pane of glass. With this detailed map of your pipelines and all their real-time status at their fingertips, your engineers can onboard quickly and troubleshoot any pipeline with little previous knowledge required. Such operational visibility also gives downstream stakeholders confidence in the data they’re using in their day-to-day work.
3 Key Advantages of Intelligent Data Pipelines
But why embrace intelligent data pipelines, exactly? Intelligent data pipelines lend several distinct advantages, but they center around these three key themes:
- Running intelligence at scale: Data engineers can rely on intelligence to autonomously run data pipelines 24/7, without ever reprocessing any datasets. Intelligence also guarantees data integrity and propagates change throughout thousands of interwoven pipelines spanning multiple data clouds without compromising data lineage. Business users can publish, subscribe, and link data products seamlessly within and across data clouds, which is also a key feature of a data mesh. Users can quickly identify whether an existing pipeline already delivers the data products being requested, and avoid reinventing the wheel and incurring unnecessary costs.
- Observing intelligence at scale: Intelligent data pipelines provide enhanced operational visibility to reduce downtimes, accelerate time to resolution, and reveal the costs of running pipelines. They generate their own operational metadata for full transparency across the data stack. By monitoring their own operations and alerting engineers with smart notifications, intelligent data pipelines help engineers maintain perspective on what’s happening where and when, and quickly pinpoint problems. They also recognize the impact of fixes throughout the pipeline lineage and propagate necessary changes in a cost-conscious manner.
- Building intelligence at scale: Automation provides the intelligence that relieves your engineers from the minutiae of building and operating data pipelines, so they can design and deliver them quickly and accurately. Intelligent data pipelines handle the full lifecycle of data end-to-end, spanning ingestion, transformation, orchestration, and sharing of data. They deliver these functions consistently across any data cloud, enhancing the native capabilities of the different data infrastructure providers. With this extreme flexibility and inherent transparency, intelligent data pipelines empower you to quickly scale your service to stakeholders who rely on data to drive the business.
Join the Data Pipeline Automation Revolution
To knock down the barriers to delivering business value from data, organizations need to envision a new type of intelligence in their data pipelines. Decision-makers should avoid getting caught up in the legacy pipeline mindset that focuses on the tools and code that constitute the disjointed “modern data stack.”
Instead, begin by charting a blueprint of your data, and describing requirements for your data products. This will help steer your data teams to embrace a new data pipeline automation paradigm. When they do, they’ll finally propel your business forward with pipelines that:
- Run cheaper than typical orchestration
- Grow faster than the cost to run them
- Handle compute and storage intelligently
- Facilitate faster engineer (and stakeholder) ramp times
- Are fixed in minutes, not days
- Have greater data accessibility to a much broader audience
Additional Reading and Resources