Moving Past ETL and ELT: Understanding the EtLT Approach

Explore the evolution from ETL and ELT to the dynamic EtLT approach, highlighting the shift towards continuous data refinement in modern data operations.

Michael

Michael Leppitsch

michael@ascend.io

While rapid innovation drives huge opportunities for data teams, many professionals are still fixated on traditional patterns of data warehousing and ETL, even while their organizations are migrating to the cloud and adopting cloud-native data services. But the capabilities of the clouds have eclipsed traditional data architectures, and have upset the roles of data acquisition ("Extract"), logical processing ("Transform"), and populating a schema ("Load").

Central to this transformation are two shifts. First, the unmatched scalability of cloud databases which maintains classic table structures while separating storage and compute. Secondly, the rise of data lakes that catalyzed the transition from ETL to ELT and paved the way for niche paradigms such as Reverse ETL and Zero-ETL. Still, these methods have been overshadowed by EtLT — the predominant approach reshaping today's data landscape.

In this article, we assess:

The role of the data warehouse on one hand, and the data lake on the other;
The features of ETL and ELT in these two architectures;
The evolution to EtLT;
The emerging role of data pipelines.

Understanding this evolution equips data teams with the tools to navigate the ever-shifting data space, amplifying their impact on business value. Let's take a closer look.

Modern Cloud Data Platforms

The native capabilities of the cloud providers have been joined by third-party services to offload that data into separate less costly systems that are optimized for analysis of that data. These services fall into two broad categories: those designed to serve as data warehouses and those designed like data lakes. Let's investigate these two traditionally contrasting modes of operation.

The Data Warehouse Pattern

The heart of a data warehouse lies in its schema, capturing intricate details of business operations. This unchanging schema forms the foundation for all queries and business intelligence. Modern platforms like Redshift, Snowflake, and BigQuery have elevated the data warehouse model. By marrying the tried-and-true practices of traditional warehousing with modern scalability, these platforms eliminate the cumbersome tasks associated with on-prem management while enabling the execution of analysis, reporting, and ad-hoc inquiries.

The Data Lake Pattern

Emerging in contrast to the structured world of warehousing, data lakes cater to the dynamic and diverse nature of modern internet-based applications. For example, unlike traditional platforms with set schemas, data lakes adapt to frequently changing data structures at points where the data is loaded, accessed, and used.

These fluid conditions require unstructured data environments that natively operate with constantly changing formats, data structures, and data semantics. Services like AWS Glue, Databricks, and Dataproc have powerful data lake capabilities, where code-heavy processes and agile workflows can transform data into many different forms.

This pattern requires new methodical, traceable, and repeatable methods like Directed Acyclic Graphs (DAGs) that give the business confidence in the results.

Source: A stream of sensor data represented as a directed acyclic graph.

ETL for Traditional Data Warehouse Pattern

As discussed, data warehouses are structured repositories designed for predefined schemas that are optimized for analytical querying and reporting of data. The ETL pattern was developed specifically to load data into these rigorous schemas, with the following characteristics:

Extract from Sources: In traditional data warehousing, the first step is always to extract the data from the business' operational applications. More recently, data is also acquired from third-party source systems such as business partners, as well as public and subscription sources that enhance business analytics, reports, and insights.

Transform before Load: The ETL process conforms data to the static, predefined warehouse schema. This means that data from various source systems is first extracted, then transformed (cleaned, normalized, and integrated) to match the defined schema, and finally loaded into the tables that make up the schema.

Consistency and Quality: Since the data entering the warehouse schema must already be of high quality and ready to be used by the business, any transformations to the data occur before it is actually loaded (the "T" comes before the "L" in ETL). Most data warehouse services include built-in validations that ensure that the data is coherent and consistent with other data in the warehouse, or else it is rejected.

Performance: Because the data is transformed and normalized before it is loaded, data warehouse engines can leverage the predefined schema structure to tune the use of compute resources with sophisticated indexing functions, and quickly respond to complex analytical queries from business analysts and reports.

These requirements are typically met by ETL tools, like Informatica, that include their own transform engines to "do the work" of cleaning, normalizing, and integrating the data as it is loaded into the data warehouse schema. If you encounter a tool that claims to perform ETL but uses the data warehouse itself to do this work, you can consider it to be ELT — the transformation ("T") occurs after the data is loaded ("L").

ELT for the Data Lake Pattern

As discussed earlier, data lakes are highly flexible repositories that can store vast volumes of raw data with very little preprocessing. They can accommodate any type of data, from structured to semi-structured to unstructured, and do not need a predefined schema. The ELT pattern emerged as a more suitable approach to process data in data lakes, due to these reasons:

Extract from Sources: Similar to data warehousing, data lakes also need data to be extracted from the business' operational applications and third-party sources. There are a range of tools dedicated to just the extraction ("E") function to land data in any type of data warehouse or data lake.

Load before Transform: Data lakes store all the extracted data directly in a storage system like Amazon S3, Azure Blob Store, or Google Cloud Storage, in its original structure (the "L" comes before the "T" in ELT). Once in place, any transformations on the data are performed directly in the data lake on demand as different analytical tasks come up. The idea is to postpone the processing cost of transformation until the analytics that use it are actually needed.

Flexibility and Scalability: Since there is no predefined schema to restrict transformations in data lakes after loading, ELT provides greater flexibility to accomplish analytical tasks. This makes data lakes more adaptable and scalable to handle diverse data types and large volumes of data. Instead of a monolithic ETL application, transformations are typically performed by a much less costly compute engine like Spark, in sequences organized in Directed Acyclic Graphs (DAGs).

Data Exploration and Discovery: Since raw data is maintained and a variety of computation capabilities can be brought to bear, data lakes are excellent environments for data exploration, machine learning, and advanced analytics where unknown patterns need to be discovered from raw data. On the other hand, analysts need additional data engineering skills to harness the data lake processing capabilities, and transform and structure the data to meet their objectives.

Read More: Learn More About ETL vs. ELT

The ETL to ELT to EtLT Evolution

For many years, data warehouses with ETL and data lakes with ELT have evolved in parallel worlds. But recently, ELT processing has become common in organizations that augment their legacy ETL tools with newer, cheaper ELT processes to populate their data warehouse schemas — laying the foundation for the EtLT methodology.

This change can play out over three phases. Can you tell which is in use at your organization today?

Phase 1: Sequence Data Lakes and Data Warehouses

A common approach to harnessing the flexibility and low cost of data lakes is to replace a costly legacy ETL tool with a data lake, and program transformations inside the data lake. However, to reduce the impact on the business, a data warehouse remains in use. This results in a combination of the ETL and ELT patterns:

Extract from Sources: This step is the same as the ELT for data lakes described above, but different from extraction performed by a traditional ETL in that there is no monolithic ETL engine. Shifting between these two patterns usually involves migration to new tooling to acquire data from sources.

Transform in the Data Lake: Transformations in data lakes typically involve deep data engineering skills, using languages like Python, pySpark, and SQL, and data engines like Spark.

Loading the Data Warehouse: The final result of this processing is then loaded into the data warehouse schema, where analysts and business users recognize their data structures and use them as before. The pattern remains the same even if the warehouse schema is migrated from an on-prem system to a cloud-based data warehouse.

Evolution from ETL and ELT to EtLT: Sequenced

This combination of patterns is commonplace in many enterprises today. Unfortunately, it has become the source of complexity and inefficiency across data groups and sprawl across infrastructures.

Phase 2: Consolidate ETL and ELT

The costs of cloud data warehouses have dropped sufficiently to where the maintenance of a separate data lake makes less economic sense. In addition, some cloud data warehouses like Snowflake are expanding their features to match the diverse and flexible data processing methodologies of data lakes.

As a result, transformation processing inside data warehouses is becoming more accessible, and maintaining a separate data lake is no longer needed. This is leading many enterprises to consolidate their data operations onto the newest generation of cloud-based data warehouses.

ETL ELT evolution to EtLT: Consolidation phase

‍

Meanwhile, some data lakes like Databricks are adding structures that match the ease of use of data warehouses, reducing the benefits of loading data into a separate data warehouse. In some organizations, these changes are causing a tug-of-war on whether to retire data warehouses altogether.

Recognizing this convergence of data lakes and data warehouses, several vendors are now positioning their data platforms as "lakehouses," all-in-one solutions capable of storing and processing any type of data.

The disintegration of the differences between the two patterns has led to a new balkanization of tools along the value chain instead:

Extraction tools like Fivetran that only move data from source systems into the data warehouse or data lake;

Transformation tools like dbt and Matillion that help analysts build their transformation logic in the form of easy-to-program SQL logic, and then run that logic inside the data warehouse or data lake;

So-called "reverse ETL" tools that selectively reintegrate data analytics back into applications.

Orchestration tools like Airflow are required to manage the flow across tools.

As a result, distinct tooling for ETL and ELT is disappearing, and the concept of loading (the "L" in ETL and ELT) has diminished altogether. Instead, the growing scale of data operations has given rise to a new set of concerns for which the balkanized tools are ill-suited:

Metadata Generation: The extraction process acquires data from various sources, but it is also the perfect opportunity to capture rich metadata about it and make it available to validate and drive subsequent transformation logic.

Stepwise Transformation: Structuring data transformation in sequential steps provides clarity and control over sophisticated data operations such as business validation, data normalization, and analytics functions. By processing it with transparent stepwise transformations, data can achieve data warehouse integrity without leaving the environment.

These processes can implement techniques familiar to traditional data warehouse ETL, including normalization, validation for nonrepudiation, referential integrity enforcement, and idempotency enforcement, but without the cost of monolithic ETL tools. This approach can be strengthened by the use of Directed Acyclic Graphs (DAG), and monitoring that supports timely developer interventions.

Data Validation: At any point of extracting and transforming data, validation techniques can be applied to maintain data quality and improve confidence when data is accessed by analysts and business users.

Loading a Stable Schema: The final transformation steps can simply terminate in tables structured like a classic data warehouse model, where data analysts can run queries just as they did in their previous data warehouse. There may also be a need for selective propagation to other applications, similar to the reverse ETL features discussed before.

Phase 3: The EtLT Approach

While ETL has its roots in the rigidity of early data warehouses, ELT was born from the flexibility of data lakes. However, with the immense growth in data generation and the complexities involved in handling it, even ELT is seeing an evolution.

A new approach, EtLT —and even beyond, in its ongoing iterations of EtLTLTLT— is surfacing in the data landscape. At its core, EtLT encapsulates "Extract, transform, Load, Transform," though we've also seen it called "Extract, tweak, Load, Transform." But, this is more than just a quirky sequence of letters. It's indicative of a deeper shift where data, once prepared, isn't static but is continuously refined and republished in a data catalog to respond to diverse and evolving needs. This dynamic and ever-evolving nature of the EtLT approach, with its emphasis on continuous refinement, is a pivotal reason why the concept of the data mesh has gained such traction and popularity.

The EtL Phase: In this foundational stage, data engineers focus on extracting relevant information from varied and often unstructured sources. Their task is straightforward: take the raw data and transform it into a structured, coherent format. They prioritize timely data delivery and accuracy in this transformation, setting the stage for more in-depth processing.

The Indefinite TL Phase: Here, the process dives deeper. Data analysts, SQL developers, and AI engineers take over, translating intricate enterprise business rules into SQL statements. Their primary objectives are to perform in-depth analyses and ensure the resultant data aligns with business requirements and can power AI applications. In this continuous phase, data is always on the move. It's refined, republished, and reutilized, staying ever-relevant to evolving business needs.

Harnessing the Power of Data Pipelines

The EtLT approach stands out as the standard and prevailing demand in today's data space. However, to implement this approach effectively, one needs the right technology in place. As James Densmore highlights in the "Data Pipelines Pocket Reference 2021," data pipelines are integral for a successful EtLT deployment, to the extent that the efficient execution and management of the EtLT model are now synonymous with the concept of data pipelines.

While data pipelines have long been tools in the arsenal of data engineers, they are now re-emerging as the most adaptable and potent data processing architecture.

Pattern Consolidation: Data pipelines converge the patterns of both data lakes and data warehouses, while addressing many new concerns that arise from the sheer scale of modern data operations. Data engineers and analysts can apply data pipeline methodologies to deliver the fluidity of data lakes along with the structured discipline of data warehouses.

Data Pipeline Platform: Data pipelines open the door to many different data processing objectives, and replace the need for narrow ETL and ELT processes and tools. Data engineers can organize extraction, transformation, and loading into any number of sequential operations, apply powerful new methodologies, and achieve unprecedented control and flexibility to meet emerging data analytics opportunities.�

Accelerating Transparency and Velocity: With these new powers come new responsibilities, among which is the need for unprecedented clarity and transparency of what is happening at every step in every pipeline at any time. Data teams may consider adopting a unified data pipeline platform with automation to unlock these capabilities with operational tested frameworks, and achieve unprecedented velocity in delivering business insights.

Enterprises have an opportunity to undergo a metamorphosis. By rethinking traditional ETL as a mainstay of data handling, their data practice can evolve to accommodate data pipelines, and power up their use of both data lakes and data warehouses to tackle the complexities of their modern data landscapes.

Read More: What Is Data Pipeline Automation?