Data Engineering
We've entered a new era for Data Engineering
The Evolution of Data Engineering
Data engineering has evolved from loading and storing data in lakes built on Hadoop, to computing and extracting with Spark and Hive, to orchestrating with Oozie, Luigi, and Airflow. In each generation, a new layer of software was first built bespoke for each project, then standardized and automated by a new wave of tools. And in each generation, maturing infrastructure such as Kubernetes rendered the previous wave of tools obsolete. Today, data engineering has converged on the clean, error-free processing of data through pipelines, opening the next wave of automation.
Data Engineering is by no means "new": it has a long history rooted in ETL (Extract, Transform, and Load), supported by a variety of technology providers. The ETL pattern has existed for decades, assisting in the movement of high-value data across databases, warehouses, and APIs. Given the long history of this space, it is not surprising that many of these tools have evolved to require little to no code while maintaining a high degree of flexibility. So what ushered in the next era? Scale & Flexibility.
In the early 2000s, the dramatic rise in consumers coming online, and in companies connecting with them, drove a profound increase in the volume and variety of data being produced. This ushered in the era of the data lake and ELT, offering far more flexibility than ever before: store incredible volumes of data now, and process it later. Google published its work on the Google File System in 2003 and MapReduce in 2004, inspiring much of Apache Hadoop. While many early adopters continued applying the ETL paradigm with MapReduce, a new paradigm emerged: ELT (Extract, Load, Transform), in which data was transformed on its way out of the lake. Query interfaces such as Apache Hive and Spark SQL quickly gained popularity. So with all of this incredible scale and flexibility, why would we need anything else? Performance & Efficiency.
While the introduction of data lakes revolutionized the way we processed data, data engineering teams quickly realized that repeatedly querying raw data was both expensive and inefficient. Modern data pipelines were born to "pre-process" raw data into compact, efficient, and higher quality data assets. Not only would these pipelines clean, normalize, and enrich raw data, but they would also perform valuable aggregations, compressing by multiple orders of magnitude the volume of data that ad-hoc queries needed to scan. Data engineers quickly began to see their world not just as ETL, ELT, or even ETLT; rather, it was becoming (ETL)+ with pipelines feeding countless other pipelines. So what did greater performance & efficiency make room for? Simplicity & Maintainability.
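The kind of pre-aggregation described above can be sketched in a few lines. This is purely illustrative; the event shape and field names are hypothetical, not any particular product's format:

```python
from collections import Counter

# Illustrative sketch: a pipeline stage rolls raw clickstream events up
# into per-page daily counts. Ad-hoc queries then scan the small
# aggregate instead of the raw events; at real volumes the compression
# is multiple orders of magnitude.

raw_events = [
    {"day": "2024-01-01", "page": "/home"},
    {"day": "2024-01-01", "page": "/home"},
    {"day": "2024-01-01", "page": "/pricing"},
    {"day": "2024-01-02", "page": "/home"},
]

def daily_page_counts(events):
    """Aggregate raw events into (day, page) -> count rows."""
    return Counter((e["day"], e["page"]) for e in events)

agg = daily_page_counts(raw_events)
print(agg[("2024-01-01", "/home")])  # 2
```

Here four raw events collapse into three aggregate rows; with billions of events per day, the same rollup shrinks the data ad-hoc queries must touch dramatically.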
Ask a Data Scientist what their greatest pain point is, and they'll tell you it's waiting 2-4 weeks for new or modified data sets. Ask a Data Engineer, and they'll tell you it's maintaining all of the pipelines they already have. Big data systems have lost their excitement, and the growing demand for more data pipelines has left many data engineers feeling like old-fashioned switchboard operators simply trying to keep everything connected. Fortunately, the shift from imperative programming to declarative is happening everywhere, from frontend engineering with React and Redux to infrastructure engineering with Kubernetes and Terraform. This shift introduces higher layers of abstraction: we program "the what" as opposed to "the how," leaning on underlying context-aware systems to figure out the latter. Declarative pipelines carry a radically lower maintenance burden, enable a faster development cycle, and let us as data engineers spend less time plumbing and more time architecting.
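The imperative-to-declarative shift can be made concrete with a toy example. This is not any real framework's API, just a sketch of the idea: imperatively, we sequence every step ourselves; declaratively, we name stages and their inputs, and a small context-aware engine figures out execution order:

```python
# Imperative: "the how" -- we hand-sequence each step.
def run_imperative(raw):
    cleaned = [r.strip().lower() for r in raw]
    return sorted(set(cleaned))

# Declarative: "the what" -- stages declare their inputs; the engine
# resolves the dependency order (hypothetical spec, not a real product).
PIPELINE = {
    "cleaned": {"inputs": ["raw"],
                "fn": lambda raw: [r.strip().lower() for r in raw]},
    "deduped": {"inputs": ["cleaned"],
                "fn": lambda cleaned: sorted(set(cleaned))},
}

def run_declarative(pipeline, sources, target):
    """Materialize `target` by recursively resolving stage inputs."""
    def materialize(name):
        if name in sources:
            return sources[name]
        stage = pipeline[name]
        return stage["fn"](*[materialize(i) for i in stage["inputs"]])
    return materialize(target)

raw = [" Widget ", "widget", " Gadget "]
print(run_declarative(PIPELINE, {"raw": raw}, "deduped"))  # ['gadget', 'widget']
```

The payoff is maintainability: adding or reordering a stage means editing the declaration, not rewriting the control flow that runs it.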
Data Professionals are Overloaded
While data teams are generally growing, business demand for information and insights continues to surge much faster.
of data teams are at or over work capacity, with just 3% reporting extra capacity for new projects
of data scientists, data engineers, and enterprise architects currently use, or plan to adopt automation, low-code, or no-code technologies
of data professionals cite automation as a career advancement opportunity
All Roads Lead to Data Engineering
Data engineering is the key to unlocking all other data-driven workflows.
Everyone is Feeling the Pain
An Interesting Pattern Emerges
Research Shows There’s a Better Way
A new wave of automation tools is reducing code and raising productivity.
Modern data engineering solutions provide the intelligence to automatically optimize and orchestrate data pipelines, allowing data engineers to focus on the data itself and truly accelerate digital transformation.
At Ascend we see all sorts of different pipelines. One pattern we see quite often is change data capture ("CDC") from databases and data warehouses, followed by data set reconstitution. This reconstitution usually requires a full reduction: a transform in which you iterate over all records to find those representative of the latest state. This can become inefficient over time, however, as a greater and greater percentage of any given data set becomes "stale".
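A full reduction over CDC records can be sketched as follows. The record shape (key, sequence number, operation, payload) is a hypothetical simplification for illustration, not Ascend's or any connector's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRecord:
    key: str            # primary key of the source row
    seq: int            # monotonically increasing change sequence
    op: str             # "upsert" or "delete"
    payload: dict = field(default_factory=dict)  # row image for upserts

def reconstitute(records):
    """Full reduction: iterate over every change record in sequence
    order and keep only the latest state per primary key."""
    latest = {}
    for rec in sorted(records, key=lambda r: r.seq):
        if rec.op == "delete":
            latest.pop(rec.key, None)
        else:
            latest[rec.key] = rec.payload
    return latest

changes = [
    ChangeRecord("a", 1, "upsert", {"id": "a", "qty": 1}),
    ChangeRecord("b", 2, "upsert", {"id": "b", "qty": 5}),
    ChangeRecord("a", 3, "upsert", {"id": "a", "qty": 7}),
    ChangeRecord("b", 4, "delete"),
]

print(reconstitute(changes))  # {'a': {'id': 'a', 'qty': 7}}
```

Note the cost profile the paragraph describes: every reconstitution re-scans all change records, including the ever-growing share that are superseded ("stale"), which is what makes the naive full reduction increasingly inefficient over time.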