The Evolution of Data Engineering

We've entered a new era of Data Engineering

Data engineering has evolved from focusing on just loading and storing data with data lakes like Hadoop, to computing and extracting with Spark and Hive, to orchestrating with Oozie, Luigi, and Airflow. In each generation, a new layer of software was initially built bespoke for each project, then standardized and automated by a new wave of data engineering tools. And in each generation, the growing maturity of the underlying infrastructure, most recently Kubernetes, rendered much of the previous generation's bespoke tooling obsolete. Today, data engineering has converged on the clean, reliable processing of data through pipelines, opening the next wave of automation.

ETL

Data engineering is by no means “new”: it has a long history rooted in ETL (Extract, Transform, and Load), supported by a variety of technology providers. The ETL pattern has existed for decades, assisting in the movement of high-value data across databases, warehouses, and APIs. Given the long history of this space, it is not surprising that many of these tools have evolved to require little to no code while maintaining a high degree of flexibility; a minimal sketch of the pattern itself follows below. So what ushered in the next era? Scale & Flexibility.
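As a rough illustration of classic ETL, the sketch below extracts records from an operational source, transforms them in memory, and loads the cleaned result into a warehouse. The table names, columns, and cleaning rules are hypothetical, and SQLite stands in for both the source system and the warehouse.

```python
# Minimal ETL sketch: extract from an operational source, transform in memory,
# load into a warehouse. SQLite stands in for both systems, and the table
# names, columns, and cleaning rules are hypothetical placeholders.
import sqlite3

def extract(source):
    # Extract: pull raw order rows out of the source system.
    return source.execute("SELECT id, amount, currency FROM orders").fetchall()

def transform(rows):
    # Transform: drop malformed rows and normalize values before loading.
    return [(order_id, round(amount, 2), currency.upper())
            for order_id, amount, currency in rows
            if amount is not None and currency]

def load(warehouse, rows):
    # Load: write the cleaned rows into the warehouse table.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS orders_clean (id INTEGER, amount REAL, currency TEXT)")
    warehouse.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", rows)
    warehouse.commit()

if __name__ == "__main__":
    source = sqlite3.connect("source.db")        # stand-in for an operational database
    warehouse = sqlite3.connect("warehouse.db")  # stand-in for a data warehouse
    load(warehouse, transform(extract(source)))
```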

ELT

In the early 2000s, the dramatic rise in consumers coming online, and in companies connecting with them, introduced a profound increase in the volume and variety of data being produced. This ushered in the era of the data lake and ELT, offering far more flexibility than ever before to store incredible volumes of data and process it later. In 2003 and 2004, Google published its work on the Google File System and MapReduce, inspiring much of Apache Hadoop. While many early adopters continued applying the ETL paradigm with MapReduce, a new paradigm emerged: ELT (Extract, Load, Transform), where data is transformed on the way out of the lake. Query interfaces such as Apache Hive and Spark SQL quickly gained popularity; a minimal Spark SQL sketch follows below. So with all of this incredible scale and flexibility, why would we need anything else? Performance & Efficiency.
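For contrast with the ETL sketch above, here is what ELT looks like in miniature with PySpark: the raw data lands in the lake untouched, and the transformation happens only on the way out, at query time. The lake path, event fields, and query are hypothetical.

```python
# ELT sketch with PySpark: extract and load the raw events into the lake
# untouched, then transform at query time with Spark SQL. The S3 path and
# field names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Load: read the raw, untransformed events straight out of the lake.
raw_events = spark.read.json("s3://my-lake/raw/events/")
raw_events.createOrReplaceTempView("raw_events")

# Transform on the way out: shape the data only when it is queried.
daily_signups = spark.sql("""
    SELECT date(event_time) AS day, COUNT(*) AS signups
    FROM raw_events
    WHERE event_type = 'signup'
    GROUP BY date(event_time)
""")
daily_signups.show()
```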

(ETL)+

While the introduction of data lakes revolutionized the way we processed data, data engineering teams quickly realized that repeatedly querying raw data was both expensive and inefficient. Modern data pipelines were born to “pre-process” raw data into compact, efficient, higher-quality data assets. Not only would these pipelines clean, normalize, and enrich raw data, they would also perform valuable aggregations that compressed, by multiple orders of magnitude, the volume of data that ad-hoc queries needed to process. Data engineers quickly began to see their world not just as ETL, ELT, or even ETLT; rather, it was becoming (ETL)+, with pipelines feeding countless other pipelines (sketched below). And what comes after greater performance & efficiency? Simplicity & Maintainability.
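Sketching that out with the same hypothetical lake layout as above: an upstream pipeline pre-aggregates raw events into a compact curated table, and downstream pipelines and ad-hoc queries start from that aggregate rather than the raw data.

```python
# "(ETL)+" sketch: one pipeline cleans and pre-aggregates raw events into a
# compact daily summary, and a downstream pipeline consumes that summary
# instead of re-scanning the raw data. Paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-plus-sketch").getOrCreate()

# Upstream pipeline: clean, enrich, and aggregate the raw events by day and country.
raw = spark.read.json("s3://my-lake/raw/events/")
daily = (
    raw.filter(F.col("event_time").isNotNull())
       .withColumn("day", F.to_date("event_time"))
       .groupBy("day", "country")
       .agg(F.count("*").alias("events"),
            F.countDistinct("user_id").alias("users"))
)
daily.write.mode("overwrite").parquet("s3://my-lake/curated/daily_activity/")

# Downstream pipeline: start from the compact aggregate, which is orders of
# magnitude smaller than the raw events it summarizes.
activity = spark.read.parquet("s3://my-lake/curated/daily_activity/")
activity.groupBy("country").agg(F.sum("events").alias("total_events")).show()
```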

Declarative

Ask a Data Scientist what their greatest pain point is, and they’ll tell you it is having to wait 2-4 weeks for new or modified data sets. Ask a Data Engineer, and they’ll tell you it is maintaining all of the pipelines they already have. Big data systems have lost their excitement, and growing demand for more data pipelines has left many data engineers feeling like old-fashioned switchboard operators simply trying to keep everything connected. Fortunately, the shift from imperative programming to declarative is happening everywhere, from frontend engineering with React and Redux to infrastructure engineering with Kubernetes and Terraform. This shift introduces higher layers of abstraction: we program “the what”, as opposed to “the how”, leaning on underlying context-aware systems to figure out the latter. Declarative pipelines provide a radically lower maintenance burden and a faster development cycle, and enable us as data engineers to spend less time plumbing and more time architecting.
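To make the contrast concrete, here is a purely illustrative sketch of a declarative pipeline definition. It is not any particular product's API: each entry simply declares what a dataset should be, and a context-aware engine infers the dependency graph and works out how and when to build and refresh it.

```python
# Hypothetical declarative pipeline spec: each entry states *what* a dataset
# should be; the engine is left to figure out *how*: ordering, scheduling,
# incremental recomputation, and retries. This is an illustration, not any
# specific product's API.
PIPELINE_SPEC = {
    "events_clean": {
        "inputs": ["raw_events"],
        "transform": "SELECT * FROM raw_events WHERE event_time IS NOT NULL",
    },
    "daily_activity": {
        "inputs": ["events_clean"],
        "transform": ("SELECT date(event_time) AS day, COUNT(*) AS events "
                      "FROM events_clean GROUP BY date(event_time)"),
    },
}

def build_order(spec):
    # A context-aware engine would derive execution order from the declared
    # inputs; this toy resolver simply topologically sorts the datasets.
    order, seen = [], set()
    def visit(name):
        for dep in spec.get(name, {}).get("inputs", []):
            if dep in spec and dep not in seen:
                visit(dep)
        if name not in seen:
            seen.add(name)
            order.append(name)
    for name in spec:
        visit(name)
    return order

print(build_order(PIPELINE_SPEC))  # ['events_clean', 'daily_activity']
```

Because the spec describes end states rather than steps, adding or changing a dataset does not require rewiring the rest of the pipeline; the engine only has to recompute what the change affects.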

Data Professionals are Overloaded

While data teams are generally growing, business demand for information and insights continues to surge much faster.

96% of data teams are at or over work capacity, with just 4% citing they have extra capacity for new projects.
Teams surveyed also report that infrastructure is no longer the scale problem they need to solve in order to meet the demand for data projects, and that they plan to implement automation tools as a way to alleviate the burden of new data projects.
The World Economic Forum (WEF) 2020 Jobs of Tomorrow report identified the role of data engineer as one of the top three emerging data and AI-related jobs for the upcoming year.

Research Shows There's a Better Way

No-code and low-code are making way for Flex-Code

Only 4% of data professionals prefer a no-code user interface. However, that number jumps to 73% when solutions offer the flexibility of using both low- and no-code user interfaces in conjunction with higher-code options, signaling a tremendous surge in interest in flexible coding (flex-code) approaches.

Modern data engineering solutions provide the intelligence to automatically optimize and orchestrate data pipelines, allowing data engineers to focus on the data itself and truly accelerate digital transformation.

Mike Leone, Senior Analyst @ ESG
