Data Engineering

We've entered a new era for Data Engineering

The Evolution of Data Engineering

Data engineering has evolved from loading and storing data with data lakes like Hadoop, to computing and extracting with Spark and Hive, to orchestrating with Oozie, Luigi, and Airflow. In each generation, a new layer of software was first built bespoke for each project, then standardized and automated by a new wave of tools. And in each generation, maturing infrastructure (most recently Kubernetes) rendered the previous generation's tools obsolete. Today, data engineering has converged on the clean, error-free processing of data through pipelines, opening the door to the next wave of automation.

ETL

Data Engineering is by no means “new”; it has a long history rooted in ETL (Extract, Transform, and Load), supported by a variety of technology providers. The Extract, Transform, Load pattern has existed for decades, moving high-value data across databases, warehouses, and APIs. Given the long history of this space, it is not surprising that many of these tools have evolved to require little to no code while maintaining a high degree of flexibility. So what ushered in the next era? Scale & Flexibility.
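For reference, here is the classic pattern in miniature: a minimal ETL sketch in plain Python, where the transform happens in flight, before anything is loaded into the warehouse. The file, table, and column names are illustrative assumptions, not references to any particular tool.

```python
# Classic ETL sketch: extract from a source file, transform in flight,
# and load only the cleaned rows into a warehouse table.
# File, table, and column names are hypothetical.
import csv
import sqlite3

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount_usd REAL)")

# Extract: read raw records from the operational export.
with open("orders.csv", newline="") as f:
    reader = csv.DictReader(f)
    # Transform: drop incomplete records and convert cents to dollars
    # *before* anything touches the warehouse.
    rows = [
        (int(r["id"]), round(float(r["amount_cents"]) / 100, 2))
        for r in reader
        if r["id"] and r["amount_cents"]
    ]

# Load: only clean, conformed data lands in the warehouse.
con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
con.commit()
con.close()
```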

ELT

In the early 2000s, the dramatic rise in consumers coming online, and in companies connecting with them, produced a profound increase in the volume and variety of data. This ushered in the era of the data lake and ELT, offering far more flexibility than ever before: store incredible volumes of data now, and process it later. In 2003 and 2004, Google published its work on the Google File System and MapReduce, inspiring much of Apache Hadoop. While many early adopters continued applying the ETL paradigm with MapReduce, a new paradigm emerged: ELT (Extract, Load, Transform), where data is transformed on the way out of the lake. Query interfaces such as Apache Hive and Spark SQL quickly gained popularity. So with all of this incredible scale and flexibility, why would we need anything else? Performance & Efficiency.
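To ground the paradigm, here is a minimal ELT sketch, assuming PySpark and a raw JSON feed: data is loaded into the lake as-is, and the transform happens only at query time. The path, schema, and query are hypothetical.

```python
# Minimal ELT sketch with PySpark: load raw JSON into the lake untouched,
# then transform at query time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract + Load: raw events land as-is, schema inferred on read.
raw = spark.read.json("lake/raw_events.json")
raw.createOrReplaceTempView("raw_events")

# Transform: the data is shaped only when a question is asked of it.
daily_counts = spark.sql("""
    SELECT date(event_time) AS day, event_type, count(*) AS events
    FROM raw_events
    GROUP BY date(event_time), event_type
""")
daily_counts.show()
```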

(ETL)+

While the introduction of data lakes revolutionized the way we process data, data engineering teams quickly realized that repeatedly querying raw data was both expensive and inefficient. Modern data pipelines were born to “pre-process” raw data into compact, efficient, higher-quality data assets. Not only would these pipelines clean, normalize, and enrich raw data; they would also perform valuable aggregations that compressed, by multiple orders of magnitude, the volume of data ad-hoc queries needed to scan. Data engineers quickly began to see their world not just as ETL, ELT, or even ETLT, but as (ETL)+, with pipelines feeding countless other pipelines. And once we had performance and efficiency, what came next? Simplicity & Maintainability.
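The sketch below, again assuming PySpark with hypothetical paths and columns, shows the (ETL)+ shape: one pipeline's cleaned output is persisted as an asset and becomes the next pipeline's input, with the final aggregate far smaller than the raw feed.

```python
# Sketch of the (ETL)+ idea: one pipeline's output is the next pipeline's
# input, and the aggregated asset is orders of magnitude smaller than
# the raw feed. Paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-plus-sketch").getOrCreate()

# Pipeline 1: clean and normalize raw events, persist a reusable asset.
raw = spark.read.json("lake/raw_events.json")
cleaned = (
    raw.dropna(subset=["user_id", "event_time"])
       .withColumn("event_time", F.to_timestamp("event_time"))
)
cleaned.write.mode("overwrite").parquet("lake/cleaned_events")

# Pipeline 2: aggregate the cleaned asset so ad-hoc queries scan a
# compact summary instead of the full raw history.
daily = (
    spark.read.parquet("lake/cleaned_events")
         .groupBy(F.to_date("event_time").alias("day"), "event_type")
         .agg(F.count(F.lit(1)).alias("events"))
)
daily.write.mode("overwrite").parquet("lake/daily_event_counts")
```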

Declarative

Ask a Data Scientist what their greatest pain point is, and they’ll tell you it’s waiting 2-4 weeks for new or modified data sets. Ask a Data Engineer, and they’ll tell you it’s maintaining all of the pipelines they already have. Big data systems have lost their novelty, and the growing demand for more data pipelines has left many data engineers feeling like old-fashioned switchboard operators, simply trying to keep everything connected. Fortunately, the shift from imperative to declarative programming is happening everywhere, from frontend engineering with React and Redux to infrastructure engineering with Kubernetes and Terraform. This shift introduces higher layers of abstraction: we program “the what” rather than “the how,” leaning on underlying context-aware systems to figure out the latter. Declarative pipelines carry a radically lower maintenance burden, enable a faster development cycle, and let us as data engineers spend less time plumbing and more time architecting.
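As a toy illustration of that shift (standard-library Python, not any particular product's API), we can declare only what each dataset depends on and let a resolver derive the execution order, i.e., the how:

```python
# Toy illustration of the imperative-to-declarative shift. We declare
# *what* each dataset is built from; the resolver derives *how* (the
# execution order). Dataset names are hypothetical.
from graphlib import TopologicalSorter

# Declarative spec: dataset -> the upstream datasets it depends on.
pipeline = {
    "cleaned_events": {"raw_events"},
    "daily_counts": {"cleaned_events"},
    "weekly_report": {"daily_counts", "cleaned_events"},
}

# The "context-aware system" in miniature: a topological sort turns the
# spec into a valid build order, with no hand-written orchestration.
build_order = list(TopologicalSorter(pipeline).static_order())
print(build_order)
# e.g. ['raw_events', 'cleaned_events', 'daily_counts', 'weekly_report']
```

A real declarative system layers scheduling, incremental execution, and change detection on top of this same idea, which is precisely the maintenance burden it lifts off the engineer.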

Data Professionals are Overloaded

While data teams are generally growing, business demand for information and insights continues to surge much faster.

of data teams are at or over work capacity, with just 3% saying they have extra capacity for new projects

of data scientists, data engineers, and enterprise architects currently use, or plan to adopt, automation, low-code, or no-code technologies

of data professionals cite automation as a career advancement opportunity

All Roads Lead to Data Engineering

Data engineering is the key to unlocking all other data-driven workflows.

Everyone is Feeling the Pain

An Interesting Pattern Emerges

The World Economic Forum (WEF) 2020 Jobs of Tomorrow report identified the data engineer as one of the top three emerging data and AI-related jobs.

Research Shows There’s a Better Way

A new wave of automation tools is reducing code and raising productivity.

Modern data engineering solutions provide the intelligence to automatically optimize and orchestrate data pipelines, allowing data engineers to focus on the data itself and truly accelerate digital transformation.

Mike Leone

Senior Analyst @ ESG

Learn More

TDWI Best Practices Report | Faster Insights from Faster Data

This TDWI Best Practices Report examines experiences, practices, and technology trends that focus on identifying bottlenecks and latencies in the data’s life cycle, from sourcing and collection to delivery to users, applications, and AI programs for analysis,...

Data Engineering Podcast | Replatforming Your Dataflows

Building a reliable data platform is a never-ending task. Even if you have a process that works for you and your business, there can be unexpected events that require a change in your platform architecture. In this episode, the head of data for Mayvenn...
