The Evolution of Data Engineering

We've entered a new era of Data Engineering

Data engineering has evolved from focusing on loading and storing data in lakes like Hadoop, to computing and extracting with Spark and Hive, to orchestrating with Oozie, Luigi, and Airflow. In each generation, a new layer of software was first built bespoke for each project, then standardized and automated by a new wave of data engineering tools. And in each generation, the growing maturity of underlying infrastructure such as Kubernetes rendered the previous tools obsolete. Today, data engineering has converged on the clean, error-free processing of data through pipelines, opening the next wave of automation.

ETL

Data Engineering is by no means “new” — it has a long history rooted in ETL (Extract, Transform, and Load), supported by a variety of technology providers. The Extract, Transform, Load pattern has existed for decades, moving high-value data across databases, warehouses, and APIs. Given the long history of this space, it is not surprising that many of these data engineering tools have evolved to require little to no code while maintaining a high degree of flexibility. So what ushered in the next era? Scale & Flexibility.

ELT

In the early 2000s, the dramatic rise in consumers coming online, and in companies connecting with them, drove a profound increase in the volume and variety of data being produced. This ushered in the era of the data lake and ELT, offering far more flexibility than ever before to store incredible volumes of data and process it later. In 2003 and 2004, Google published its papers on the Google File System and MapReduce, inspiring much of Apache Hadoop. While many early adopters continued applying the ETL paradigm on top of MapReduce, a new paradigm emerged: ELT (Extract, Load, Transform), where data is transformed on its way out of the lake. Query interfaces such as Apache Hive and Spark SQL quickly gained popularity. So with all of this incredible scale and flexibility, why would we need anything else? Performance & Efficiency.
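
As a minimal sketch of the ELT pattern in PySpark (the lake path, column names, and view name below are hypothetical), raw events are loaded into the lake as-is and only reshaped on the way out, at query time:

# Minimal ELT sketch with PySpark; paths, columns, and view name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract + Load: raw events land in the lake exactly as produced, with no upfront modeling.
spark.read.json("s3://example-lake/raw/events/").createOrReplaceTempView("raw_events")

# Transform: the data is reshaped on the way out, at query time.
daily_signups = spark.sql("""
    SELECT date(event_time) AS day, count(*) AS signups
    FROM raw_events
    WHERE event_type = 'signup'
    GROUP BY date(event_time)
""")
daily_signups.show()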

(ETL)+

While the introduction of data lakes revolutionized the way we process data, data engineering teams quickly realized that repeatedly querying raw data was both expensive and inefficient. Modern data pipelines were born to “pre-process” raw data into compact, efficient, higher-quality data assets. Not only do these pipelines clean, normalize, and enrich raw data, they also perform valuable aggregations that compress, by multiple orders of magnitude, the volume of data that ad-hoc queries need to scan. Data engineers quickly began to see their world not just as ETL, ELT, or even ETLT — rather, it was becoming (ETL)+, with pipelines feeding countless other pipelines. And what did this greater performance & efficiency call for next? Simplicity & Maintainability.
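
A rough sketch of the (ETL)+ idea, again in PySpark and again with hypothetical paths and columns: one pipeline cleans raw events into a curated asset, and a second pipeline rolls that asset up into a compact daily summary that ad-hoc queries can hit instead of the raw data.

# Minimal (ETL)+ sketch: one pipeline's output becomes the next pipeline's input.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-plus-sketch").getOrCreate()

# Pipeline 1: clean and normalize raw events into a curated asset.
raw = spark.read.json("s3://example-lake/raw/events/")
curated = (raw
    .dropDuplicates(["event_id"])
    .withColumn("event_time", F.to_timestamp("event_time"))
    .filter(F.col("event_type").isNotNull()))
curated.write.mode("overwrite").parquet("s3://example-lake/curated/events/")

# Pipeline 2: aggregate the curated asset into a compact rollup for ad-hoc queries.
daily = (spark.read.parquet("s3://example-lake/curated/events/")
    .groupBy(F.to_date("event_time").alias("day"), "event_type")
    .count())
daily.write.mode("overwrite").parquet("s3://example-lake/marts/daily_event_counts/")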

Declarative

Ask a Data Scientist what their greatest pain point is, and they’ll tell you it’s having to wait 2-4 weeks for new or modified data sets. Ask a Data Engineer, and they’ll tell you it’s maintaining all of the pipelines they already have. Big data systems have lost their excitement, and growing demand for more data pipelines has left many data engineers feeling like old-fashioned switchboard operators simply trying to keep everything connected. Fortunately, the shift from imperative programming to declarative is happening everywhere, from frontend engineering with React and Redux to infrastructure engineering with Kubernetes and Terraform. This shift introduces higher layers of abstraction: we program “the what”, as opposed to “the how”, leaning on underlying context-aware systems to figure out the latter. Declarative pipelines radically lower the maintenance burden, shorten the development cycle, and let us as data engineers spend less time plumbing and more time architecting.
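
To make the contrast concrete, here is a toy declarative sketch in plain Python. The asset/materialize framework below is invented for illustration and is not any particular product's API: each step declares what it produces and which assets it depends on (the what), and a small runner resolves the execution order (the how).

# Toy declarative pipeline sketch; the framework names here are invented for illustration.
from graphlib import TopologicalSorter

REGISTRY = {}  # asset name -> (builder function, upstream asset names)

def asset(name, deps=()):
    """Declare a data asset by name and its dependencies."""
    def register(fn):
        REGISTRY[name] = (fn, tuple(deps))
        return fn
    return register

@asset("raw_events")
def raw_events():
    # Stand-in for reading raw data from the lake.
    return [{"user": "a", "type": "signup"}, {"user": "b", "type": "click"}]

@asset("signups", deps=["raw_events"])
def signups(raw_events):
    return [e for e in raw_events if e["type"] == "signup"]

@asset("signup_count", deps=["signups"])
def signup_count(signups):
    return len(signups)

def materialize(target):
    """The 'how': resolve dependency order, build each declared asset once, return the target."""
    graph = {name: set(deps) for name, (_, deps) in REGISTRY.items()}
    built = {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps = REGISTRY[name]
        built[name] = fn(*(built[d] for d in deps))
    return built[target]

print(materialize("signup_count"))  # -> 1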

Data Professionals are Overloaded

While data teams are generally growing, business demand for information and insights continues to surge much faster.

…% of data teams are at or over work capacity, with just 3% citing they have extra capacity for new projects.

…% of data scientists, data engineers, and enterprise architects currently use or plan to adopt automation, low-code, or no-code technologies.

…% of data professionals cite automation as a career advancement opportunity.

All Roads Lead to Data Engineering

Data Engineering is the key to unlocking all other data-driven workflows.

Everyone is Feeling the Pain

An Interesting Pattern Emerges

Data execs (VPs and Directors) are 2x more likely to indicate that their teams are overloaded than the team leads and individual contributors themselves, signaling a huge backlog of demand from the business.

When asked which team was the most backlogged, respondents were 3.5x more likely to name their own team than another team.

When asked what impeded their work most, nearly half (48%) of data scientists cited access to data and systems.

Meanwhile, more than half (54%) of data engineers cited maintenance of existing systems—the very ones providing data access to the rest of the organization.

The World Economic Forum (WEF) 2020 Jobs of Tomorrow report identified the role of data engineer as one of the top three emerging data- and AI-related jobs for the upcoming year.
World Economic Forum

Research Shows There's a Better Way

A new wave of automation tools is reducing code and raising productivity.

Survey respondents indicated that they are 1.5x more likely to invest in automation technology than to hire more staff to address bandwidth issues.

Modern data engineering solutions provide the intelligence to automatically optimize and orchestrate data pipelines, allowing data engineers to focus on the data itself and truly accelerate digital transformation.

Mike Leone, Senior Analyst @ ESG