Welcome to Ascend! This tutorial, the first in a three-part series on Data Lake ETL (Extract, Transform, Load), will give you a brief overview of how quickly and easily you can accomplish the “E” of ETL: ingesting data from any data source with Ascend.
Ascend provides pre-built read connectors for all the common blob stores (such as AWS S3, Azure Blob Store, and Google Cloud Storage), data warehouses (such as Redshift, BigQuery, and Snowflake), databases (such as Postgres, MySQL, and SQL Server), and APIs (such as Twitter, Facebook Ads, Google Analytics, and Zendesk). On the off chance we don’t have something you’re looking for, we also provide the ability to write a custom connector.
In this tutorial, we are going to connect to data in an S3 bucket.
1. To get started, we create a new S3 Read Connector and call it “Green Cab” since we will be pulling green cab data.
2. Enter your bucket name, how you want to match the file name pattern, and then the pattern itself. You’ll need to enter credentials to access your data, and then you can test the connection (the boto3 sketch after this list shows a quick way to verify the bucket and pattern outside of Ascend).
3. Now that that’s set up, you’ll need to give it a schema to start parsing this data. The data in this tutorial happens to be CSV, so we choose that, and then the magic of Ascend kicks in to figure out the right columns and data types for us.
4. Ascend gives us a couple of extra parser options for common issues that occur with raw data. For example, on a string field we might want to trim the whitespace, or on an integer field we might need to handle different styles of thousands separators in the data (a pandas preview of both options follows this list).
5. The last part of setting up a read connector is setting up a refresh schedule. In Ascend, a refresh schedule is simply when to check for new data. If we set our refresh schedule to every five minutes, the pipeline won’t actually run every five minutes; it will simply check whether anything has been inserted, updated, or deleted. Only then will Ascend process the downstream effects, and only for the data that has actually changed (a simplified sketch of this change-detection idea follows the list).
6. Go ahead and hit “Create”, and Ascend will automatically start asking S3 which files are available and parsing them.
7. Once our data is parsed and up to date, even though this is only part of a pipeline, we can already start inspecting the records. At this point, we can attach more components to transform the data, query it, or even start writing it back out to our data lake (or another destination, such as a warehouse or database).
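Before configuring the connector in step 2, it can be handy to sanity-check that your credentials and file name pattern actually match the objects you expect. The sketch below does this with boto3, entirely outside of Ascend; the bucket name, prefix, and glob pattern are placeholders you would swap for your own.

```python
# Sanity-check an S3 bucket and file-name pattern before wiring up the read connector.
# Assumes AWS credentials are already available via the environment or ~/.aws/credentials.
# The bucket, prefix, and pattern below are placeholders, not values from the tutorial.
import fnmatch
import boto3

BUCKET = "my-data-lake"            # placeholder bucket name
PREFIX = "nyc-taxi/green/"         # placeholder key prefix
PATTERN = "green_tripdata_*.csv"   # placeholder glob-style file-name pattern

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

matches = []
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Compare only the file-name portion of the key against the pattern.
        if fnmatch.fnmatch(obj["Key"].rsplit("/", 1)[-1], PATTERN):
            matches.append(obj["Key"])

print(f"{len(matches)} objects match the pattern")
for key in matches[:5]:
    print(key)
```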
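The schema inference and parser options in steps 3 and 4 happen inside Ascend’s UI, but if you want to preview how a sample file will parse, a quick pandas check works well. This is only an illustrative sketch; the file path and the choice of thousands separator are assumptions, not part of Ascend’s configuration.

```python
# Preview how a sample CSV will parse: infer column types, trim stray whitespace,
# and handle thousands separators in numeric fields.
# The file path and separator choices here are assumptions for illustration only.
import pandas as pd

df = pd.read_csv(
    "green_tripdata_sample.csv",  # placeholder local sample of one source file
    skipinitialspace=True,        # trim whitespace that follows the delimiter
    thousands=",",                # treat "1,234" as the integer 1234
)

# Strip leading/trailing whitespace from any remaining string columns.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()

# The inferred schema: column names and data types.
print(df.dtypes)
```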
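To make the refresh-schedule behavior in step 5 concrete, here is a simplified illustration of the underlying idea: on each check, compare the current file listing against what was seen last time and return only what changed. This is not Ascend’s implementation, just a sketch of the incremental change-detection concept.

```python
# Simplified illustration of a refresh check: diff the current file listing
# against the previously seen state and report only inserts, updates, and deletes.
# This is a conceptual sketch, not how Ascend implements it internally.

def diff_listing(previous: dict, current: dict) -> dict:
    """Each dict maps an object key to a fingerprint (e.g. an ETag or last-modified time)."""
    inserted = [k for k in current if k not in previous]
    deleted = [k for k in previous if k not in current]
    updated = [k for k in current if k in previous and current[k] != previous[k]]
    return {"inserted": inserted, "updated": updated, "deleted": deleted}

# Example: one new file and one changed file, nothing removed.
previous = {"green_2019-01.csv": "etag-a", "green_2019-02.csv": "etag-b"}
current = {"green_2019-01.csv": "etag-a", "green_2019-02.csv": "etag-c", "green_2019-03.csv": "etag-d"}
print(diff_listing(previous, current))
# Only the inserted and updated files would need downstream reprocessing.
```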