Now that you’ve learned to extract data with Ascend, this tutorial will give you an overview of the “T” in ETL, namely, how to start transforming your data before you load it into its final destination. We will use SQL in this example, but Ascend also supports Python/PySpark and Scala/Java transformations.

1. We are going to start off with the “Green Cab” dataset we already pulled in from S3 and parsed as CSV. In Ascend, every component in every graph is interactive, meaning you can immediately inspect both its schema and its records. This lets us take a look at the information and get an idea of what we want to transform.

2. Scrolling through these fields, we can see that payment type happens to be an integer (1 or 2), with 1 meaning cash and 2 meaning credit. We see a lot of raw data like this, and fields like these are good candidates to transform to make the data easier to work with.

3. To prototype our transform, we can start running interactive queries against the data. Ascend’s hinted SQL editor helps us find the right column names as we type. Once our interactive query returns, we can start checking out the data that we have.
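As a minimal sketch, a first prototype query might look like the following. The table name `green_cab` and the `LIMIT` value are illustrative assumptions, not the exact names used in the tutorial environment:

```sql
-- Preview a small sample of the parsed CSV data.
-- "green_cab" is a placeholder for the upstream component's name.
SELECT *
FROM green_cab
LIMIT 100  -- keep the interactive query fast while prototyping
```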

4. Notice there is a column on the right called `payment_form`. It is not transformed yet, so let’s fix that, then run the query again to make sure we are on the right track. Once that’s complete, the column will update with “credit” and “cash”.
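The mapping described above can be sketched as a `CASE` expression. Again, the table name and the `payment_form` alias are assumptions for illustration:

```sql
-- Translate the raw integer payment code into a readable label.
SELECT
  *,
  CASE payment_type
    WHEN 1 THEN 'cash'
    WHEN 2 THEN 'credit'
    ELSE 'unknown'      -- defensive default for any unexpected codes
  END AS payment_form
FROM green_cab
LIMIT 100               -- limit kept while prototyping; removed before saving
```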

5. Let’s remove the limit, and then we can save this as a transform by hitting “Create”, and Ascend starts running the job to transform the whole dataset. This is immediately productionized: whether it’s a small or large number of records, an incremental update or a historical backfill, the code stays exactly the same, and those details are handled by Ascend. Ascend’s intelligent partitioning will automatically scale up and down to handle the workload.