Data automation is an integral part of the data engineering lifecycle, yet despite its rapidly growing popularity, it is often misunderstood. It is frequently equated with data orchestration (though we’ll highlight some of the differences), and other times referenced as a piece of analytics, data mesh, data fabric, or DataOps initiatives. The confusion is understandable: data automation is a far-reaching concept that encapsulates everything we do as data engineers, from ETL (extract, transform, and load) data pipelines, to data orchestration, to data observability. In short, data automation is the very piece that binds these otherwise loosely coupled practices, giving data engineering and analytics engineering teams profoundly greater capabilities to leverage in their day-to-day work.
The best way to think about data automation is perhaps in contrast to, and as a complement to, traditional data orchestration. Data orchestration primarily focuses on the execution of specific tasks with fixed parameters, initiated by timers, triggers, or upstream dependencies in a DAG (directed acyclic graph). Data automation, in contrast, is at its core a metadata-based model. Data automation systems collect volumes of statistics about the data under their management: they build fine-grained and historical models of the data, monitor how often new data arrives and changes, track the lineage of where that data came from and where it went, and use all of this metadata to inform the actions of the rest of the system, optimizing the data engineering process. Its benefits include time savings, scalability, enhanced performance, improved cost-efficiency, and increased data quality. In addition, data automation allows data teams to spend far more time on high-value tasks.
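To make the contrast concrete, here is a minimal sketch of the kind of per-dataset metadata record an automation layer might maintain. The field names and values are purely illustrative assumptions, not any vendor’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    """Illustrative statistics a data automation system could track
    per dataset (hypothetical fields, not a real product's schema)."""
    row_count: int
    last_changed: str                                # ISO timestamp of last observed change
    upstream: list = field(default_factory=list)     # lineage: where the data came from
    downstream: list = field(default_factory=list)   # lineage: where the data went

meta = DatasetMetadata(
    row_count=1_204_332,
    last_changed="2024-05-01T06:00:00Z",
    upstream=["s3://raw/orders"],
    downstream=["warehouse.analytics.orders"],
)

# A fixed-parameter orchestrator runs its task on schedule regardless;
# an automation layer can consult a record like `meta` to decide whether
# a run is needed at all, and how to size it.
```

The point of the sketch is the shape of the information, not the fields themselves: profile, freshness, and lineage together are what let the system act on the data's behalf.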
The steps to data automation
Still sounding a bit confusing? Don’t worry, data automation becomes clearer when put into the context of ETL (or ELT, ETLT, or the new kid on the block, reverse ETL). Throughout the process, metadata is the common denominator enabling automation at each step.
Step 1: Extract
Advanced data automation models maintain the history of what’s been extracted before, understand the profile of the data and how often it changes, and stay aware of the load parameters of the system from which the data is being extracted. Why is all of this metadata so important? It helps the automation system answer a myriad of critical questions. For instance, should you read data from a limited number of threads because the source is a small database you do not want to overload? Or is it an object store where you can parallelize reads across hundreds, if not thousands, of workers at the same time? Has this data already been extracted before? If so, has it changed, and does it need to be re-extracted? Automation can do the work of understanding the nature of the other parts of our data ecosystem and how it all works together.
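The “has this changed, does it need re-extracting?” question can be sketched with content fingerprints. This is a minimal illustration, assuming the automation layer keeps a hash per partition in its metadata store (the dictionary here is a stand-in for that store):

```python
import hashlib

def needs_extraction(partition_id: str, raw_bytes: bytes, fingerprints: dict) -> bool:
    """Return True if this partition is new or has changed since the last extract.

    `fingerprints` maps partition ids to content hashes recorded on
    previous runs -- a stand-in for the automation layer's metadata store.
    """
    digest = hashlib.sha256(raw_bytes).hexdigest()
    if fingerprints.get(partition_id) == digest:
        return False                      # already extracted, unchanged: skip
    fingerprints[partition_id] = digest   # record the new fingerprint
    return True

seen = {}
# First sight of a partition: extract it.
assert needs_extraction("orders/2024-01", b"row1,row2", seen) is True
# Same bytes again: nothing to do.
assert needs_extraction("orders/2024-01", b"row1,row2", seen) is False
# Data changed upstream: re-extract.
assert needs_extraction("orders/2024-01", b"row1,row3", seen) is True
```

A real system would fingerprint cheaper signals (modification times, watermarks, source change logs) rather than hashing full contents, but the decision logic is the same.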
Step 2: Transform
When it comes to data transformation, automation goes beyond the notion of orchestration to further optimize the movement and processing of data. Data automation understands the historical resources used to process data, the fine-grained lineage of data as it moves through systems, and the access patterns of resulting data sets. With this context, data automation is able to ensure jobs are more efficient and less prone to error by automatically answering questions such as: Do we even need to run a new data processing job? If so, how many resources should it have? Does it run better on one type of engine than another? What depends on this job and others that are competing for resources, and based on those dependencies, is one higher priority than the others?
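Two of those questions can be sketched directly: skipping a job whose inputs have not changed, and prioritizing among jobs competing for the same resources. The heuristic below (rank by number of waiting downstream consumers) is an assumption for illustration, not a claim about any particular scheduler:

```python
def should_run(job_inputs: dict, last_run_inputs: dict) -> bool:
    """Skip the transform entirely when none of its input fingerprints changed."""
    return job_inputs != last_run_inputs

def pick_next(jobs: list) -> dict:
    """Among jobs competing for resources, run the one the most downstream
    consumers are waiting on -- a simple, illustrative priority heuristic."""
    return max(jobs, key=lambda j: j["downstream_consumers"])

jobs = [
    {"name": "daily_rollup",   "downstream_consumers": 2},
    {"name": "exec_dashboard", "downstream_consumers": 7},
]
assert should_run({"orders": "v2"}, {"orders": "v1"}) is True   # inputs changed: run
assert should_run({"orders": "v2"}, {"orders": "v2"}) is False  # unchanged: skip
assert pick_next(jobs)["name"] == "exec_dashboard"              # most waiters wins
```

In practice the priority function would weigh lineage depth, SLAs, and historical run cost, all of which come from the same metadata store.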
Step 3: Load
Until recently, few companies automated this step, but that’s starting to change and the introduction of “Reverse ETL” as a category should help accelerate innovation. Today, companies are using automation to answer questions such as “What if data already exists where I want to load new data?” “Is the data correct and does it have the right schema?” and “What do I do when the schemas don’t match?” Data automation, fueled by the volumes of metadata collected, enables us to efficiently and accurately answer these questions, reducing load on those downstream systems and optimizing the time it takes to deliver data to them.
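The schema questions above lend themselves to a small decision function. This is a minimal sketch, assuming schemas are represented as column-to-type mappings; the three outcomes (append, evolve, fail) are illustrative policy choices, not a standard:

```python
def load_plan(target_schema: dict, incoming_schema: dict) -> str:
    """Decide what to do before loading: append, evolve the target, or fail.

    Schemas are plain {column_name: type_name} dicts for illustration.
    """
    if incoming_schema == target_schema:
        return "append"
    changed = {c for c in target_schema
               if c in incoming_schema and incoming_schema[c] != target_schema[c]}
    if changed:
        return "fail: type mismatch"        # same column, incompatible type
    added = set(incoming_schema) - set(target_schema)
    if added:
        return "evolve: add columns"        # widen the target, then load
    return "append"                         # incoming is a subset; rest loads as NULL

assert load_plan({"id": "int"}, {"id": "int"}) == "append"
assert load_plan({"id": "int"}, {"id": "str"}) == "fail: type mismatch"
assert load_plan({"id": "int"}, {"id": "int", "region": "str"}) == "evolve: add columns"
```

An automation layer makes this call per load using the target schema it already tracks as metadata, rather than failing mid-load and paging an engineer.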
Developing a data automation strategy
If you’re like most companies, you’re coming from a data orchestration-based model and are still early in your data automation journey. These are going to be exciting times for you!
When starting with data automation, most companies follow patterns similar to those in artificial intelligence: they start with heuristic-based models, expand over time into statistical models, and eventually aim for full-fledged machine learning.
As data automation becomes more sophisticated, it requires more metadata, fewer hard-coded rules, and more flexible, dynamic adaptations that allow the system to operate on behalf of developers. Instead of having rigid rules that fit one specific use case, data automation should be able to increasingly adapt as parameters, environments, and data streams morph.
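The heuristic-to-statistical progression can be illustrated with a data-quality check. The first function is a hard-coded rule; the second learns a per-pipeline baseline from historical row counts in the metadata store. Both are simplified sketches (the fixed threshold and the 3-sigma band are assumed values, not recommendations):

```python
import statistics

def heuristic_alert(row_count: int) -> bool:
    """Stage 1: a hard-coded rule that fits one specific use case."""
    return row_count < 1000

def statistical_alert(row_count: int, history: list) -> bool:
    """Stage 2: flag runs outside 3 standard deviations of this
    pipeline's own history, so the rule adapts as the data morphs."""
    mean = statistics.mean(history)
    sd = statistics.pstdev(history)
    return abs(row_count - mean) > 3 * sd

history = [100, 110, 90, 105, 95]          # prior run row counts (metadata)
assert statistical_alert(130, history) is True    # well outside the baseline
assert statistical_alert(105, history) is False   # normal variation: no alert
```

The statistical version needs no re-tuning when a pipeline grows from thousands to millions of rows; the heuristic one does, which is exactly the rigidity the text describes.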
The majority of data pipeline solutions today support only rigid, manual data workflows that cannot adapt to these constant changes. These solutions take an imperative approach to data pipeline development and advocate a linear process: classify data, outline transformations, develop and test the ETL process, and schedule data for updates, for example. Other step-by-step approaches involve identifying problems, classifying data, prioritizing operations, outlining transformations, executing operations, and scheduling data for updates. These linear, manual approaches to data pipelines are brittle to operate, and that inherent weakness encourages waterfall development and release models to ensure stability, further extending development time.
Instead of putting the burden on developers to automate processes, Ascend takes a declarative approach. Declarative Pipeline Workflows from Ascend enable DataOps initiatives to catch up to the agility and predictability of DevOps. With Ascend, data engineers can rapidly branch, edit, deploy, and automate data pipelines with existing developer workflows and tools in a fraction of the time. By leaning on the strengths of full data automation, engineers can move faster and rely on smarter capabilities like data integrity checking, mid-pipeline checkpointing, smart rollbacks, and more.
Using a declarative approach is essential to data automation as it frees the data automation itself to determine the best course of action based on the huge swaths of metadata collected (profiles of the data, how it’s being accessed and by what jobs, what resources are required, etc.). Gathering this metadata can be hard at times, but is well worth the effort. Many are even calling metadata the “new big data.”
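To illustrate the declarative idea in miniature: you name the datasets you want and their inputs, and the system derives the execution plan. The spec format and `plan` function below are hypothetical, invented for this sketch; they are not Ascend’s actual API:

```python
# Hypothetical declarative spec: each entry names a desired dataset and
# the datasets it is built from. Nothing here says *when* or *how* to run.
spec = {
    "raw_orders":   {"inputs": []},
    "clean_orders": {"inputs": ["raw_orders"]},
    "revenue":      {"inputs": ["clean_orders"]},
}

def plan(spec: dict) -> list:
    """Topologically sort the spec so every dataset builds after its inputs.
    In a real system this is where metadata would also decide what to skip."""
    ordered, done = [], set()
    while len(ordered) < len(spec):
        for name, node in spec.items():
            if name not in done and all(i in done for i in node["inputs"]):
                ordered.append(name)
                done.add(name)
    return ordered

assert plan(spec) == ["raw_orders", "clean_orders", "revenue"]
```

Because the developer states outcomes rather than steps, the automation layer is free to reorder, parallelize, or skip work based on the metadata it has collected.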
The advantages of data automation
Make no mistake: we realize that building automated systems is hard. It takes a lot of time and effort. When faced with the hard work and sheer engineering resources required to build advanced data automation capabilities, it can seem easier to do things manually. After all, teams are already overloaded. Our studies show that 96% of them are at or over capacity. Data teams all have their goals and KPIs that must be achieved on a regular basis. More strategic initiatives such as data automation with long-term benefit often, understandably, get sacrificed at the altar of short-term goals.
It’s a Catch-22: data teams are overtaxed, yet if they had the time and resources to automate their data processes, then they would have more time to meet KPIs, build even more automation, and spend more time on innovation and work that adds business value. Despite the challenges, data automation will come. It’s inevitable. Demand is so high for competent data engineering talent (which is in short supply) that the only way to leverage their expertise and time is through automation.
The key is to forget about taking an “all or nothing” approach. Instead of trying to automate everything at once, data teams can benefit from starting with incremental steps. Even if only portions of the data workload are automated, the rewards are tremendous, including:
- Time savings
- Better performance
- Cost efficiency
- Better use of time and talent
- Improved data quality
Getting started with data automation
You’re ready for data automation if you (among other things):
- Constantly have to go back to the drawing board, rebuilding data pipelines and re-automating processes every time something changes (e.g., new data, new parameters).
- Avoid modifying existing pipelines for fear of breaking something else.
- Struggle to scale to meet data requests and requirements.
- Are constantly asked to build another new data pipeline when the whole team is completely tapped out.
- Know there’s no way to automate data pipelines and meet business-driven performance indicators at the same time.
If this sounds like you, get to know Ascend. Data engineering and DataOps are still many years behind software engineering and DevOps, but data automation can help data engineers, analysts, and scientists catch up. By leveraging metadata and automation to move away from manual processes, data teams can stay ahead of day-to-day demands more easily, while moving on to more strategic work.
Automation may seem daunting, but with the Ascend Data Automation Cloud, teams can dramatically increase their productivity by unifying and automating data and analytics workloads. Ascend gives data teams access to the full spectrum of data and analytics engineering automation with unified data ingestion, transformation, delivery, orchestration, and observability—and enables them to build and automate their workloads on any data, anywhere, 10X faster than ever before.