How-to: Snowflake Data Ingest & ETL with Ascend.io

Back

How-to: Snowflake Data Ingest & ETL with Ascend.io

This How-to will provide you with an overview of how to ingest data into Snowflake by building a production pipeline that automatically keeps data up to date, retries failures, and notifies upon any irrecoverable issues.

Ascend.io

data-eng@ascend.io

EXTRACT

Ascend provides pre-built read connectors for common blob stores (such as AWS S3, Azure Blob Store, and Google Cloud Storage), data warehouses (such as Redshift, BigQuery, and Snowflake), databases (such as Postgres, MySQL, and SQL Server), and APIs (such as Twitter, Facebook Ads, Google Analytics, and Zendesk). In the off-chance we don't have something you're looking for, we provide the ability to write a custom connector
as well.

In this How-to, we are going to connect to data in an S3 bucket.

To get started, we create a new S3 Read Connector, name it, and provide some bucket details
Once that is set, we can test our connection to see that we can find files, and ensure that everything is working as it should be.
After the connection is tested, we will need to tell Ascend how to parse the data. All the common formats are already handled. If for some reason your data is not in a standard format, you're not stuck in Ascend; you can always pick “Parser Function”, and write the Python code necessary to deal with the data. (In the case of this how-to we select parquet data).
After making your selection, Ascend automatically generates the schema.
The last step to extract your data, is selecting the refresh schedule. In Ascend, a refresh schedule is simply when to check for new data. On the schedule, Ascend will automatically look to see if any files are new, updated, or deleted, and only propagate those detected changes through the pipeline. (This how-to selects an hourly refresh schedule, that refreshes five minutes after the hour).
Hit “Create”, and we have created our read connector.

TRANSFORM

Ascend has automatically brought in the data. Once we start exploring, we can see the individual records, but we can also query the entirety of the dataset. This allows us to bring the full power of SQL to explore the data as we're ingesting it. Let's see what the first hundred records look like in this how-to:

It's looking pretty good, but one thing that's missing is a Ride_ID —
something to uniquely identify each ride. Let's generate one:

We can start by adding a table alias, and concatting together the Pickup_latitude, Pickup_longitude, and the datetime of the pickup.
We can also shaw this to make it look a bit more like an ID.
Finally, name it, and give that a run!
Now that ride_id has shown up and we are happy with our transformation, we can remove the limit, and hit “To Transform”
Name it, and hit create

Ascend automatically starts running this transform component, but is not trying to redo the pipeline from the beginning.

Although we can run our transformations on Snowflake, Ascend transformations run on Spark. You get the full capabilities of also being able to code in Python or Scala, allowing for more advanced use cases, like machine learning.

LOAD DATA INTO SNOWFLAKE

The final step of this process is creating a write connector to load our data into Snowflake.

Go ahead and click the component, create a new write connector, and choose Snowflake as the write connector location from the list of options.
Fill in your account details and choose your credentials.
We'll also select some intermediate store information-this is where the files get staged before they get copied to Snowflake-and provide some credential for that.
The last piece is how we handle schema mismatches-if the table is Snowflake does not quite match the schema we have in Ascend. We could stop and display the error, but Ascend also gives us the option to adapt the schema changes, including even type mismatches, which allows us to very quickly go to our transforms and add columns, change column types, and have it immediately rematerialized in Snowflake without even having to hit run.
Hit create, and Ascend has automatically started to write that data into Snowflake.

Ascend automatically loads the data into Snowflake incrementally by default, but Ascend will switch automatically to a full reload based on business logic changes.

How-to: Snowflake Data Ingest & ETL with Ascend.io

EXTRACT

TRANSFORM

LOAD DATA INTO SNOWFLAKE

Try it out. Your future self will thank you :)