As data became the backbone of most businesses, data integration emerged as one of the most significant challenges. Today, a good part of the job of a data engineer is to move data from one place to another by creating pipelines that can be either ETL vs. ELT.
ETL has been the traditional way to manage pipelines for decades. However, with the advent of cloud-based infrastructure, ETL is changing towards ELT. Despite their similarities, there are significant differences between the two. In this article, we explore what those differences are, why ETL is shifting towards ELT, and what can we expect in the future.
How ETL Became Outdated
The ETL process (extract, transform, and load) is a data consolidation technique in which data is extracted from one source, transformed, and then loaded into a target destination. We won’t dive into the origins of ETL, but it’s important to understand its surroundings.
In the 1980s, companies started to amass big amounts of transactional data. In this context, ETL arose to address the issue of providing organizations with ready-to-analyze data. The process allows data teams to remove irrelevant data and transform the rest before loading the data into the target database. For example, combining user events with geographical data to analyze consumer behavior. But the industry realized that making transformations while on the journey can cause problems:
Expensive to Implement and Maintain
ETL is challenging. A small-sized company needs at least a few months to implement the process. After, there will be constant modifications, as data evolves with the business. In addition, analysts frequently lack the technical skills to construct an ETL-based data pipeline, which makes engineering needed to extract and transform the data. For this reason, businesses employ armies of talent, experts, and consultants to create and maintain ETL processes.
ETL processes are inherently structured and rigid. On one hand, the transformations take place in a staging area that sits between the data source and the target destination. While this might be perfect for use cases where it’s important to guarantee data types and format, it slows down the processing of big volumes of data. On the other hand, it requires data analysts to plan out how they will use the data and what every report will look like. If they didn’t, the impact on data consumers downstream of the initial extraction can be costly. For example, you need to know data formats in advance to avoid running into the problem of uploading data in a different format than you needed.
Slow to Provide Fresh Data
Before loading any data into the destination, you must complete the data transformation steps. ETL is unable to expand to keep and handle the volume of data we generate and collect today—at least not quickly or cost-effectively. In the hopes of resolving this issue, ETL tasks that update hundreds or millions of data warehouse tables frequently take place at night. But in a world that favors the here and now, ETL processes lack in the area of providing analysts with new, fresh data.
Low in Visibility
End-users won’t be able to access all the data in the final destination, only the data that was transformed and loaded. This causes two issues. First, every transformation performed on the data pushes you further from the raw data and obscures some of the underlying information. Second, during transformations, data gets reshaped into some specific form. This is risky, as conclusions might be drawn based on data that hasn’t been properly sliced.
Traditional ETL is effective if you are still on-premises and your data is small and predictable. ETL was an advantage when we weren’t able to work with the size and complexity of raw data. However, that is less and less the case. With the advent of cloud computing, storing unstructured data quickly without having to worry about storage or format is faster and cheaper. This is when ELT came in.
ETL vs. ELT: When Do You Run Your Transformations?
What happens when we swap “L” and “T”? ELT (extract, load, and transform) is a data consolidation process in which data is extracted from one source, loaded into the target destination, and then transformed as necessary. With the new, fast data cloud warehouses, data professionals are able to perform their transformations in the warehouse itself. But the differences between ETL and ELT go beyond the order in the transformation step.
Software engineers deploy multiple times per day—praising continuous delivery. The same principle guides data transformations in the ELT process. When the input data or the metric definition change, you can easily implement these transformations in hours, not weeks or months. The velocity of ELT is especially valuable for startups, where data professionals need to be flexible and agile to keep up with business requirements and product growth.
The high cost of on-premises storage and processing made ETL imperative. Today, there is less of a need to do transformations before loading data to the final destination thanks to the cloud. The fast expansion of cloud-based data warehouses has contributed to the falling costs of compute and storage. Thus, making transformations the last step in the process enables scalability—infinitely and on demand.
Optimized for Decision-Making
Modern warehouses are columnar and designed for storing and analyzing big datasets. Now that analysts are able to load data before transforming it, they don’t need to decide beforehand what reports or schema they need to generate. Instead, a data warehouse that serves as a “single source of truth” receives a direct replication of the source data. Then, analysts can perform their transformations—always being able to access the original data without worrying about transformations that might have jeopardized its integrity.
ELT vs. ETL Comparison Chart
The goal of ETL and ELT is the same: move your data from one system to a storage location for analysis. While the order of the three steps is the most noticeable difference between ETL and ELT, there are many underlying implications. Here is a summary of the main differences between ELT and ETL:
How to Decide When to Use ETL vs. ELT
Ultimately, ETL and ELT have the same goal: to move your data from one location to another and make it usable for analysis. That being said, there are tips that might help you determine which of these two processes would serve your project best:
- Large amounts of data can move through the process faster with ELT.
- If you need raw data accessible, ELT moves all of your raw data to the new location, so you would have it available.
- You’re doing a one-time migration from legacy systems, ETL will help you transform the data to make it fit the new system’s structure.
Final Thoughts on ETL vs. ELT and Next Steps
Overall, ETL and ELT share the same goal, but the infrastructure underlying each process is vastly different. Generally, ELT will be the process that best fits modern companies’ data needs. However, there are specific use cases where ETL might be the right choice.
Understanding the difference between ETL and ELT is only the beginning of truly understanding what you can do with data in an ever-changing environment of technology and tools. EL(T) is the newest paradigm to explore. Contrary to ELT, EL totally decouples the extract and load steps from any potential optional transformation. Each operational use case is distinct in how it makes use of incoming data. Some could perform sophisticated transformations, while others might not even need to transform at all. As data warehouses advance more and more, we’re sure we will continue to evolve the processes and techniques to extract true value from our data.