Learn how intelligent data pipelines produce business results at game-changing velocity via a unified tech stack and scaled automation.
Data teams worldwide are building data pipelines with the point solutions that make up the "modern data stack." But this approach is quite limited, and does not actually provide any true automation. Pipelines built with this approach are slow and require constant manual reprogramming and updates. They drain engineering teams of their efficiency while robbing stakeholders of the just-in-time data they need to inform and accelerate their growth.
Automation, however, enables next-generation, intelligent pipelines that can help engineering teams leapfrog their companies forward with guaranteed data integrity, smart change propagation, and powerful cost controls — all while still supporting human intervention where it's needed. Intelligent data pipelines anticipate and ease the tasks through which engineers drive business change, so teams are no longer consumed by busywork and keeping the lights on.

Read on to learn:
Before we get into the specifics of intelligent data pipelines, let's explain what we mean by data pipeline automation.

Data pipeline automation is NOT:
Data pipeline automation replaces data stacks assembled from multiple tools and platforms, boosting efficiency and delivering order-of-magnitude improvements in productivity and capability at scale. Such results arise from the use of a single unified platform — one with a data control plane that (1) maintains an immutable metadata model of every aspect of the pipelines and (2) automates every operation with it.

With such a comprehensive, end-to-end approach to automation, engineers can rapidly build new pipelines to meet business needs, and instantly identify and respond to data problems across even the most complex and interwoven data pipelines. For a deep dive into this level of automation, take a look at our What Is Data Pipeline Automation paper.
Traditional methods of building data pipelines can't deliver results with the speed and accuracy today's business leaders expect. And while some data engineering teams attempt to give their data pipelines some "intelligence" with the clever use of orchestration, they are often stuck with the characteristics of non-automated pipelines — what we refer to as "legacy data pipelines."
Legacy data pipelines have thousands of potential points of error. Any one problem can cause major disruption to the business. Companies with an understaffed engineering team suffer from lengthy outages during debug and break-fix cycles.
Let's face it: probably only the most highly experienced, longest-tenured engineers at your company really know how your data pipelines work. When pipelines aren't automated, they grow into increasingly complicated webs of code, making it difficult to drive efficiency, diagnose problems, and trust the accuracy of the resulting data. New hires can't help much with existing pipelines, so they're handed the new projects, where they build more black boxes that no one else understands.
Because legacy data pipelines don't break down the sources of runtime costs, it's challenging to pinpoint the pipeline inefficiencies and redundancies that cost the company time and money. When finance asks engineering teams to justify skyrocketing compute costs, they're hard-pressed to reconcile the expense with value.
In legacy pipelines, data quality issues can go undetected for months, which means business users and customers often use outdated or incorrect data to make game-time decisions that affect the trajectory of the company. Engineering teams are only notified about the problem after someone in the business discovers the issue. The resulting whack-a-mole remedies burn time and expensive engineering resources that should be building new solutions instead.
To build and run legacy data pipelines, engineering teams have to integrate and customize several general-purpose tools from the market to fit a square peg into a round hole. They often patch together many brittle integration points with custom code, making their systems difficult to understand and troubleshoot.
Legacy data pipelines involve manual coding of complex heuristics around the simple tools of the modern data stack, and yet these heuristics still only solve one use case at a time. This compounds the bottlenecks of those tools and code libraries, and embeds lots of unnecessary reprocessing. The absence of a unifying platform means that integrations are inefficient and act as barriers to data flows. In the end, these impediments can hamstring the data engineering team and stall business growth.
Intelligent data pipelines are the result of decades of experience building data pipelines, encapsulated in a new automation paradigm that considers several dimensions:
When you adopt this paradigm of designing, building, operating, and maintaining data pipelines, several benefits arise. You can deliver pipelines more quickly. Reuse becomes more efficient. Pinpointing and resolving issues is faster. And, most importantly, business results improve with game-changing velocity and efficiency.

So what do these intelligent data pipelines actually do differently from legacy data pipelines?
Intelligent data pipelines exhibit several key differences from legacy data pipelines that have a compounding effect on the value data engineering teams can deliver. Let's walk through the top nine characteristics that constitute intelligence. Each of these unique traits is remarkable on its own, yet when woven together, they create a tapestry of innovation and impact.
Business stakeholders need utmost confidence in the data they rely on. Intelligent data pipelines provide data quality and the critical integrity needed to instill that confidence. For a deep dive, read our article How to Ensure Data Integrity at Scale by Harnessing Data Pipelines.
Central to any data integrity strategy is the ability to apply controls in the form of actionable data quality rules. Intelligent data pipelines enforce acceptable ranges for any number of metrics at every point in every dataflow, and provide configurable actions to take when a rule is activated — notify and continue, alert and stop, etc. These rules detect data problems immediately as data arrives, and eliminate the costs of processing bad data.
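As a rough illustration, such a rule amounts to an allowed range for a metric plus a configurable action when the rule fires. The sketch below is hypothetical: the QualityRule class and its "notify"/"halt" actions are illustrative, not any particular product's API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical data quality rule: an acceptable range for a metric plus a
# configurable action when the rule fires ("notify" = log and continue,
# "halt" = alert and stop the flow).
@dataclass
class QualityRule:
    metric: str                # e.g. "null_ratio", "row_count"
    min_value: float
    max_value: float
    on_fail: str = "notify"

    def check(self, observed: float, notify: Callable[[str], None]) -> bool:
        """Return True if downstream processing should continue."""
        if self.min_value <= observed <= self.max_value:
            return True
        notify(f"{self.metric}={observed} outside [{self.min_value}, {self.max_value}]")
        return self.on_fail == "notify"    # continue only if the rule is non-blocking

# Example: stop the flow if more than 5% of values arriving at this step are null.
rule = QualityRule(metric="null_ratio", min_value=0.0, max_value=0.05, on_fail="halt")
proceed = rule.check(observed=0.12, notify=print)   # alerts immediately, returns False
```

Because the check runs the moment data arrives at a step, bad records are caught before any compute is spent processing them.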
In the world of data pipelines, a single change can have immediate ripple effects on multiple downstream pipelines. Intelligent data pipelines are designed to address this challenge. They automatically and incrementally propagate processing from the point of change, seamlessly traversing the entire data lineage and codebase, even across multiple data clouds. And as they do so, they keep engineers informed of any issues.
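Conceptually, this behaves like a breadth-first walk of the lineage graph that starts at the changed dataset and recomputes only what sits downstream of it. The following is a minimal sketch under that assumption; the lineage dictionary and rebuild callback are made up for illustration.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to the datasets built from it,
# potentially spanning more than one data cloud.
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["warehouse.orders_daily", "lake.order_features"],
    "warehouse.orders_daily": ["reporting.revenue"],
    "lake.order_features": [],
    "reporting.revenue": [],
}

def propagate_change(changed: str, rebuild) -> list[str]:
    """Incrementally recompute only the datasets downstream of `changed`."""
    visited, order = set(), []
    queue = deque(lineage.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        rebuild(node)                       # recompute just this dataset
        order.append(node)
        queue.extend(lineage.get(node, []))
    return order

# A change to staging.orders triggers only its three downstream datasets.
print(propagate_change("staging.orders", rebuild=lambda name: None))
```

Everything upstream of the change, and every unrelated pipeline, is left untouched.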
From time to time, your engineers are alerted to situations and failures that stump data pipeline automation. Intelligent data pipelines support the engineers by precisely pinpointing the origins of data integrity failures, so that they can intervene and resolve them quickly. Once the issues are resolved, intelligent data pipelines take over and restart from the precise point where the failure occurred, and continue to run without reprocessing any data and with zero manual housekeeping required.
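One way to picture this restart behavior is checkpointed execution: each completed step records its success, so a rerun resumes at the first incomplete step instead of starting over. This is a simplified sketch, not any specific platform's implementation.

```python
# Hypothetical checkpointed run: completed steps are recorded, so a restart
# resumes at the failed step rather than reprocessing the whole pipeline.
steps = ["extract", "clean", "join", "aggregate", "publish"]
completed: set[str] = set()

def run_pipeline(run_step):
    for step in steps:
        if step in completed:       # finished in a previous attempt: skip entirely
            continue
        run_step(step)              # raises if the step fails
        completed.add(step)         # checkpoint on success

def flaky(step):
    if step == "join" and not flaky.fixed:
        raise RuntimeError("data integrity failure pinpointed in 'join'")

flaky.fixed = False
try:
    run_pipeline(flaky)
except RuntimeError as err:
    print(f"Engineer intervenes: {err}")
    flaky.fixed = True              # issue resolved by the engineer
    run_pipeline(flaky)             # resumes at 'join'; 'extract' and 'clean' never rerun
```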
Using the modern data stack to build legacy pipelines still results in costly and time-consuming code fixes and constant re-runs. When a problem is detected, engineers have to manually infer which branches of the pipeline are affected, unwind bad datasets, reset checkpoints, and program any new dependencies. On the other hand, intelligent data pipelines streamline this process by performing these operations automatically in the background. Engineers can effortlessly roll back pipelines with a single operation, eliminating any need for reprocessing.
Since data pipelines are subject to constant change, your engineers are often called to update the logic or verify some aspect of the operations. Engineers are also constantly looking for ways to optimize and use resources more efficiently. Intelligent data pipelines encourage a harmonious collaboration between humans and machines by automating much of the housekeeping around manual interventions. They provide a level of control that allows engineers to pause at specific points in the pipelines without affecting any other pipelines. Engineers can inspect interim datasets, resolve issues or make changes, and instantly release the data pipelines to resume running, and automatically cascade any needed changes throughout the pipeline network.
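The interaction pattern can be imagined as setting a pause point on one dataset, inspecting its interim output, and then releasing it while everything unrelated keeps running. The tiny sketch below is purely illustrative; the PipelineControl class and its methods are assumptions, not a real interface.

```python
# Hypothetical control surface: pause a single dataset, inspect its interim
# output, then release it so downstream processing resumes and changes cascade.
class PipelineControl:
    def __init__(self):
        self.paused: set[str] = set()

    def pause(self, dataset: str) -> None:
        self.paused.add(dataset)            # only steps downstream of `dataset` wait

    def inspect(self, dataset: str) -> None:
        print(f"Reviewing interim records of {dataset} ...")

    def release(self, dataset: str) -> None:
        self.paused.discard(dataset)        # resume and cascade any edits downstream
        print(f"{dataset} released; downstream steps resume automatically")

ctl = PipelineControl()
ctl.pause("staging.orders")                 # update logic or verify behavior safely
ctl.inspect("staging.orders")
ctl.release("staging.orders")
```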
Intelligent pipelines will do whatever they can to keep running without compromising data integrity. They automatically retry failures in infrastructure, networks, data clouds, and more, all while maintaining consistency throughout the downstream processing steps. If at any of these steps the data remains unchanged, intelligent pipelines astutely determine that reprocessing is unnecessary, halting the pipeline sequence. This approach can result in substantial cost savings, potentially amounting to hundreds of thousands of dollars over the pipelines' lifetime.
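One common way to implement the "unchanged data means no reprocessing" decision is to fingerprint each step's input and compare it against the fingerprint recorded for the last successful run. Here is a minimal sketch under that assumption.

```python
import hashlib

# Fingerprints recorded for the last successful run of each step (hypothetical store).
last_fingerprints: dict[str, str] = {}

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def maybe_process(step: str, input_data: bytes, process) -> bool:
    """Run `process` only if this step's input changed since the last run."""
    fp = fingerprint(input_data)
    if last_fingerprints.get(step) == fp:
        return False                        # unchanged input: skip, halting downstream work
    process(input_data)
    last_fingerprints[step] = fp
    return True

maybe_process("aggregate", b"order,amount\n1,10\n", process=lambda d: None)   # runs
maybe_process("aggregate", b"order,amount\n1,10\n", process=lambda d: None)   # skipped
```

Each skipped run is compute the business never pays for, which is where the long-term savings come from.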
Orphaned datasets are a recipe for confusion and, potentially, disaster in the context of GDPR, HIPAA, and PMI/PII policies. Stakeholders may unknowingly use outdated datasets, resulting in inaccurate reports and analytics. For instance, data scientists could be training machine learning models on inaccurate datasets. Or data products could inadvertently end up with private information in their datasets. Intelligent data pipelines avoid these common scenarios by automatically cleaning up interim and final datasets behind the scenes, minimizing the chances of using outdated data for decision-making.
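A simple mental model for this cleanup is garbage collection: any interim or final dataset that the current lineage no longer references gets dropped automatically. The catalog and lineage sets below are made up for illustration.

```python
# Hypothetical cleanup pass: drop datasets the current lineage no longer
# references, so stale or PII-bearing copies can't quietly be reused.
catalog = {"staging.orders_v1", "staging.orders_v2", "tmp.join_scratch", "reporting.revenue"}
referenced_by_lineage = {"staging.orders_v2", "reporting.revenue"}

def collect_orphans(all_datasets: set[str], referenced: set[str]) -> set[str]:
    orphans = all_datasets - referenced
    for dataset in sorted(orphans):
        print(f"dropping orphaned dataset: {dataset}")
    return orphans

collect_orphans(catalog, referenced_by_lineage)   # drops staging.orders_v1 and tmp.join_scratch
```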
Intelligent data pipelines equip engineering teams with detailed cost breakdowns of delivering data products, including individual operations like transforms, joins, and merges. Measuring these costs allows for the swift detection of inefficiencies, so you can optimize, refactor, and adapt the business logic to reduce costs. Intelligent data pipelines ensure you do not overprocess your data, to reduce your overall data infrastructure costs. They allow you to aggregate costs end-to-end, and validate the ROI of individual data products against the value they provide to the business.
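In practice, this kind of reporting reduces to attributing a cost to every operation and rolling the figures up per data product. The records and rates in this aggregation sketch are invented for illustration.

```python
from collections import defaultdict

# Hypothetical per-operation cost records emitted by the pipeline runtime.
operations = [
    {"product": "revenue_report", "op": "transform", "compute_seconds": 120, "usd_per_second": 0.02},
    {"product": "revenue_report", "op": "join",      "compute_seconds": 300, "usd_per_second": 0.02},
    {"product": "churn_model",    "op": "merge",     "compute_seconds": 800, "usd_per_second": 0.02},
]

def cost_breakdown(records):
    """Aggregate compute cost per data product and per individual operation."""
    by_product, by_operation = defaultdict(float), defaultdict(float)
    for r in records:
        cost = r["compute_seconds"] * r["usd_per_second"]
        by_product[r["product"]] += cost
        by_operation[(r["product"], r["op"])] += cost
    return dict(by_product), dict(by_operation)

per_product, per_operation = cost_breakdown(operations)
print(per_product)   # compare each product's cost against the value it delivers to gauge ROI
```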
Intelligent data pipelines generate consistent operational metadata that your engineers can view and monitor through a single pane of glass. With this detailed map of your pipelines and all their real-time status at their fingertips, your engineers can onboard quickly and troubleshoot any pipeline with little previous knowledge required. Such operational visibility also gives downstream stakeholders confidence in the data they're using in their day-to-day work.
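That single pane of glass is possible because every pipeline emits operational metadata in one consistent shape. As a rough illustration, each run might be summarized by a record like the one below; the field names are assumptions, not a defined schema.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical operational-metadata record: one consistent shape for every
# pipeline, so a single view can show status, freshness, and cost together.
@dataclass
class PipelineStatus:
    pipeline: str
    last_run_at: datetime
    state: str                      # e.g. "running", "succeeded", "failed", "paused"
    records_processed: int
    data_freshness_minutes: int
    run_cost_usd: float

statuses = [
    PipelineStatus("orders_ingest", datetime(2024, 1, 15, 6, 0), "succeeded", 1_200_000, 12, 4.80),
    PipelineStatus("revenue_report", datetime(2024, 1, 15, 6, 30), "failed", 0, 95, 0.35),
]

# The "single pane" is then just a query over these uniform records.
needs_attention = [s.pipeline for s in statuses
                   if s.state == "failed" or s.data_freshness_minutes > 60]
print(needs_attention)   # -> ['revenue_report']
```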
But why embrace intelligent data pipelines, exactly? They offer several distinct advantages, which center around three key themes:
To knock down the barriers to delivering business value from data, organizations need to envision a new type of intelligence in their data pipelines. Decision-makers must avoid getting caught up in the legacy pipeline mindset that focuses on the tools and code that constitute the disjointed "modern data stack."
Instead, begin by charting a blueprint of your data and describing requirements for your data products. This will help steer your data teams to embrace a new data pipeline automation paradigm. When they do, they'll finally propel your business forward with pipelines that:
Additional Reading and Resources: