The ever-growing volume of data empowers companies to improve decisions and accelerate outcomes. Data is an advantage, but only if the business can use it. So how do you structure data and make it accessible so that stakeholders can draw insights from it? The answer is data transformation.
Raw data is not valuable until we do the hard work of transforming it to a state where the business can use it. Data teams must first arrange and format data so that they can create dashboards, reports, and predictive models. In this article, we’ll cover the basics of data transformation to set the required foundation for delivering business value.
What Is Data Transformation?
Data transformation is the process of converting data from one format, structure, or set of values to another by joining, filtering, appending, or otherwise performing some computation on the data. The data transformation process is managed within a data pipeline. More specifically, the T in ETL and ELT pipelines stands for transformation.
While data transformation is a relatively simple concept, in practice it can be quite nuanced. If companies have ingested their data, can’t they use that data to create business analytics and dashboards? Why would they need to change it? Simply put, that data is very rarely in the right format and structure to be useful or usable to the right parties.
First off, when data is ingested from an API, blob storage, data warehouse, or another source, you have no control over the format. Most often, the data will not be in the right format for your destination. Beyond standardizing the format, many steps are required to get data to a state where you can work with it, apply it to your use cases, and derive its full benefit. For example, you may need to filter out bad data, perform data quality checks, and aggregate data for downstream use.
That’s what data transformation is: the process of making your data valuable, usable, and reusable. The goal is to keep data organized and make it more compatible and easier for humans and computers to use.
Most Common Data Transformation Functions
A data team can apply virtually any computation to the data. However, the design of each transformation layer needs to satisfy the ultimate business requirement. The most common data transformation functions include:
Extraction and Parsing
In the early phases, you’re reformatting data, extracting certain fields, parsing, and looking for specific values. A data pipeline starts by ingesting data from a source and then copies it to a destination. This transformation technique concentrates on modifying the format and structure of the data. The objective is to ensure that the data is compatible with the target system.
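As a minimal sketch of extraction and parsing, the snippet below pulls a few fields out of a log line that carries a JSON payload. The log layout, the field names (`user_id`, `action`), and the `parse_log_line` helper are assumptions made for illustration; they do not come from any particular system.

```python
import json
import re
from datetime import datetime

# Hypothetical log layout: "<ISO timestamp> <LEVEL> <JSON payload>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<payload>\{.*\})$")


def parse_log_line(line):
    """Extract the timestamp, level, and selected payload fields from one line.

    Returns None when the line does not match the assumed layout.
    """
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    try:
        payload = json.loads(match.group("payload"))
        timestamp = datetime.fromisoformat(match.group("ts"))
    except ValueError:  # covers malformed JSON and malformed timestamps
        return None
    return {
        "timestamp": timestamp,
        "level": match.group("level"),
        "user_id": payload.get("user_id"),
        "action": payload.get("action"),
    }


raw_lines = [
    '2024-01-15T10:30:00 INFO {"user_id": 42, "action": "login"}',
    "not a structured line",
]
# Keep only the lines that could be parsed into the target structure.
parsed = [row for row in map(parse_log_line, raw_lines) if row is not None]
```

Returning `None` for unparseable lines keeps the decision about what to do with bad input (drop, log, or quarantine) in the caller’s hands.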
Filtering and Mapping
Afterwards, you’re refining your data by filtering or mapping fields and values. For example, you may want to display only low-activity users in a customer-facing application, which means filtering out everyone else. Or the state field in your source may show New York as “New York,” while the destination stores it as “NY.”
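The two examples above can be sketched in a few lines of Python. The record shape, the `max_logins` threshold, and the `STATE_CODES` lookup are hypothetical and chosen only to mirror the text.

```python
# Illustrative subset of a state-name-to-code mapping.
STATE_CODES = {"New York": "NY", "California": "CA"}


def filter_and_map(users, max_logins=5):
    """Keep low-activity users and map full state names to short codes."""
    result = []
    for user in users:
        # Filtering: drop users above the (assumed) activity threshold.
        if user["logins"] > max_logins:
            continue
        # Mapping: translate the state value; unknown states pass through.
        mapped = dict(user)
        mapped["state"] = STATE_CODES.get(user["state"], user["state"])
        result.append(mapped)
    return result


users = [
    {"name": "ana", "logins": 2, "state": "New York"},
    {"name": "ben", "logins": 50, "state": "California"},
]
low_activity = filter_and_map(users)
```

Copying each record before changing it leaves the input list untouched, which makes the step easier to rerun and test.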
Data Enrichment
This type of transformation involves bringing in data from another source and adding it to your data set. For instance, you may want to add user metadata to build a more detailed view of specific users. In this phase, enriching the data can often turn into its own form of data ingestion. This step highlights just how sophisticated data transformation can get.
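A minimal sketch of enrichment, assuming the events and the user metadata share a `user_id` key. The field names and the `enrich_events` helper are hypothetical.

```python
def enrich_events(events, user_metadata):
    """Attach per-user metadata to each event (a left join on user_id).

    Events without matching metadata are kept unchanged. When both sides
    define the same field, the value from the event wins.
    """
    enriched = []
    for event in events:
        extra = user_metadata.get(event["user_id"], {})
        enriched.append({**extra, **event})
    return enriched


events = [
    {"user_id": 1, "action": "create"},
    {"user_id": 2, "action": "view"},
]
# Metadata that, in practice, would be ingested from a second source.
user_metadata = {1: {"plan": "pro", "country": "US"}}
enriched_events = enrich_events(events, user_metadata)
```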
Cross-Record Correlation
This type of data transformation involves analytical-style operations, such as “count how many users did x, y, or z during a particular time.” It also covers the correlation of events. You may want to determine whether activities belong to distinct user sessions by correlating one user’s activity with the previous or following activity and looking at the duration of the time gap between them. The transformation that happens, in this case, is the ordering and clustering of events.
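One way to sketch this ordering and clustering is a simple sessionization pass: order each user’s events by time and start a new session whenever the gap to the previous event exceeds a threshold. The 30-minute gap and the field names are assumptions for illustration.

```python
from datetime import datetime, timedelta


def sessionize(events, gap=timedelta(minutes=30)):
    """Number each user's sessions based on the time gap between events.

    A new session starts when the gap between two consecutive events of
    the same user is larger than `gap`.
    """
    # Ordering: group each user's events together, oldest first.
    ordered = sorted(events, key=lambda e: (e["user_id"], e["timestamp"]))
    last_seen = {}
    session_no = {}
    result = []
    for event in ordered:
        uid = event["user_id"]
        previous = last_seen.get(uid)
        if previous is None:
            session_no[uid] = 1
        elif event["timestamp"] - previous > gap:
            session_no[uid] += 1
        last_seen[uid] = event["timestamp"]
        # Clustering: tag the event with the session it belongs to.
        result.append({**event, "session": session_no[uid]})
    return result


events = [
    {"user_id": 1, "timestamp": datetime(2024, 1, 15, 11, 0)},
    {"user_id": 1, "timestamp": datetime(2024, 1, 15, 10, 0)},
    {"user_id": 1, "timestamp": datetime(2024, 1, 15, 10, 10)},
    {"user_id": 2, "timestamp": datetime(2024, 1, 15, 10, 5)},
]
sessions = sessionize(events)
```

In this sample, the first user’s events at 10:00 and 10:10 fall into one session, and the event at 11:00 starts a second one because the gap is 50 minutes.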
Let’s dive into an example. Your data is being ingested in a different format than the one you generally like to work with: say, a log format with some JSON-structured objects thrown in. In this case, it’s mostly semi-structured text data, as is often the case when data comes from a back-end system that logs user activity. Before you can run analytical-style operations on the data, you need to take it from compressed JSON files to columnar structures. That involves decompressing the files, parsing the JSON, and putting the fields into a columnar format.
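The steps in this example can be sketched with the standard library alone. The snippet assumes gzip compression and one JSON object per line, neither of which is specified above, and it uses plain in-memory lists to stand in for a real columnar format.

```python
import gzip
import json


def json_lines_to_columns(compressed_bytes, columns):
    """Decompress gzip'd JSON lines and pivot them into column lists."""
    text = gzip.decompress(compressed_bytes).decode("utf-8")
    table = {name: [] for name in columns}
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        for name in columns:
            # Missing fields become None so every column has equal length.
            table[name].append(record.get(name))
    return table


sample = gzip.compress(
    b'{"user_id": 1, "action": "create"}\n'
    b'{"user_id": 2, "action": "view"}\n'
)
columns = json_lines_to_columns(sample, ["user_id", "action"])
```

Here `columns` holds one list per field, so an analytical operation such as counting actions only needs to scan the `action` list.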
Another example is filtering out the data you’re not interested in. This filter isn’t simply based on individual users, but also on the larger groups of people using the data. Let’s say you’re looking for people who perform create, update, and delete operations, and you are less interested in other types of events. Filtering the data down to those groups is a type of transformation that hones and refines the data set to make it more useful and accurate for the downstream workload.
How to Transform Your Data
Conceptually, think of data transformation as a bidirectional search, or as finding the shortest path between two points in a graph. You need to map your raw data to your business needs. Then, figure out how to efficiently traverse from both sides towards the middle.
Often, business teams toss requirements to the data team with a list of demands. Other times, data engineering teams look at their data and figure out what to do with it—unrelated to business goals. The real value lies in skillfully blending the two and understanding the context in which the data will be used. Why are people looking for this data set? What are they trying to extract from understanding it? What is the next natural follow-on question they might ask?
Understand Both the Business Needs and the Data
Only once you understand the goals the business needs to achieve can you take stock of what data you need to work with. Planning transformations has traditionally taken a waterfall-style approach involving meetings, whiteboards, and diagrams. This can lead to a lot of expensive, complex work. Instead, teams need to make iteration cheap, easy, and streamlined.
Pipelines should be built in minutes to incrementally move forward to meet new business use cases. That includes mapping out the fields, prototyping a query, sending it off to the processing cluster, running the transformations, and validating. Data teams need to understand contextually why the data matters, as much as how to transform it and work with it.
Be Aware of the Physical Limitations of Data Pipelines
As you start querying the data, it’s not uncommon to simply start to transform it as you go without a specific plan. However, we recommend starting by breaking the process down into bite-sized transformations. This makes it easier to maintain the data pipeline as user needs and business logic inevitably change. Make sure the pipeline is simple and understandable enough for stakeholders to come in and make changes, if necessary.
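A minimal sketch of bite-sized transformations: each step is a small function with a clear input and output, and the pipeline is just an ordered list of steps. The step names and record fields are hypothetical.

```python
from functools import reduce


def drop_incomplete(rows):
    """Remove records that have no user_id."""
    return [r for r in rows if r.get("user_id") is not None]


def normalize_action(rows):
    """Trim and lowercase the action field."""
    return [{**r, "action": r.get("action", "").strip().lower()} for r in rows]


def keep_write_operations(rows):
    """Keep only create, update, and delete events."""
    return [r for r in rows if r["action"] in {"create", "update", "delete"}]


def run_pipeline(rows, steps):
    """Apply each step in order, feeding its output to the next step."""
    return reduce(lambda data, step: step(data), steps, rows)


raw = [
    {"user_id": 1, "action": " Create "},
    {"user_id": None, "action": "delete"},
    {"user_id": 2, "action": "VIEW"},
]
steps = [drop_incomplete, normalize_action, keep_write_operations]
clean = run_pipeline(raw, steps)
```

Because every step can be run and tested on its own, changing the business logic usually means editing or reordering one small function rather than reworking the whole pipeline.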
In addition, it is important to understand how the infrastructure that supports your data pipelines needs to scale. As you build your transformations, consider how efficient your logic is, so you don’t run into unexpected errors such as “Out of Memory” errors. This becomes important when you go from processing 100k records in your staging pipelines to millions in production.
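One common way to keep memory use flat as record counts grow is to stream records through the transformation instead of loading them all at once. The sketch below assumes a simple `user_id,action` line format, which is an illustration and not something the text prescribes.

```python
import io
from collections import Counter


def iter_records(line_source):
    """Yield (user_id, action) pairs one at a time.

    Assumes every non-empty line has the form "user_id,action".
    """
    for line in line_source:
        line = line.strip()
        if not line:
            continue
        user_id, action = line.split(",", 1)
        yield user_id, action


def count_actions(line_source):
    """Aggregate action counts while holding only one record in memory."""
    counts = Counter()
    for _user_id, action in iter_records(line_source):
        counts[action] += 1
    return counts


# io.StringIO stands in for a large file opened with open(path).
sample_file = io.StringIO("1,create\n2,view\n\n1,create\n")
action_counts = count_actions(sample_file)
```

Memory use here depends on the number of distinct actions, not on the number of records, so the same logic behaves similarly on 100k rows and on millions.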
Avoid Prematurely Optimizing Your Transformation Logic
Frequently, teams have optimized their transformation logic, but it’s not very maintainable. For instance, avoid winding up with 1,000-line SQL queries with complex, nested sub-queries. This may optimize processing, but not maintenance and engineering efforts. Break down queries into small components and understand the input and output for easier debugging and alteration.
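As a sketch of breaking a query into small components, the example below uses common table expressions (CTEs) so that each stage has a name and can be inspected on its own. It runs against an in-memory SQLite database through Python’s `sqlite3` module; the `events` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "purchase", 10.0),
        (1, "purchase", 5.5),
        (2, "view", None),
        (2, "purchase", 20.0),
        (3, "purchase", None),
    ],
)

# Each CTE is one small, named step instead of a nested sub-query.
QUERY = """
WITH valid_events AS (
    SELECT user_id, action, amount
    FROM events
    WHERE amount IS NOT NULL
),
purchases AS (
    SELECT user_id, amount
    FROM valid_events
    WHERE action = 'purchase'
),
totals AS (
    SELECT user_id, SUM(amount) AS total
    FROM purchases
    GROUP BY user_id
)
SELECT user_id, total FROM totals ORDER BY user_id
"""
rows = conn.execute(QUERY).fetchall()
```

To debug a stage, you can change the final `SELECT` to read from `valid_events` or `purchases` and check its output before moving on.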
At the same time, take care not to over-optimize, especially if you are working with a small data set. Once you have larger data sets and a better understanding of them, you can incorporate sophisticated transformations, such as incremental data propagation or compound nested transforms. Only optimize for performance once it becomes necessary.
Benefits and Challenges of Data Transformation
There are challenges to transforming data:
- Data transformation can become expensive, depending on the software and resources.
- Data transformation processes can eat up resources, whether on-premises or cloud-based.
- Lack of expertise can introduce problems during transformation. Data analysts, engineers, and anyone else dealing with data transformation need subject-matter expertise to curate data accurately.
- Enterprises sometimes perform unnecessary transformations. Once those changes are made, data teams might have to reverse them to make the data usable again.
Even so, transforming data also yields several benefits:
- Once data is transformed, it is organized and easier—sometimes only now possible—for both humans and computers to use.
- Properly formatted and validated data improves data quality and ensures that applications run properly without encountering pitfalls such as incompatible formats, duplicates, or incomplete values.
- Data transformation streamlines interoperability among applications, systems, and types of data.
Final Thoughts About Data Transformation and Next Steps
Data transformation can be a tricky and nuanced step, but with the right tools and process, your data pipelines can become much more valuable, faster. You’ll be able to streamline data pipelines, ensure data integrity, and organize and interpret data in a meaningful way for engineers and analysts alike across data teams.
With Ascend, you can make data transformation fast and efficient. You can design your pipelines with declarative definitions that require 95% less code and result in far less maintenance. You can also specify inputs, outputs, and data logic in multiple languages: SQL, Python, Scala, and Java.
Ascend’s full-featured SDK lets you programmatically create and interact with Ascend components, integrate with code repositories such as GitHub, and build reusable components. All this helps teams avoid work that’s not essential to deriving business value from data.
With queryable pipelines, you can treat any stage of any data pipeline as a queryable table. You can quickly prototype new pipeline stages or run ad-hoc queries against existing pipeline stages, all in a matter of seconds. When underlying data has changed, you’re immediately notified.