Data ingestion is the process of transporting data from one or more sources to a storage location where it can be analyzed and used by an organization. In this post, we’ll cover the basics of data ingestion, the different types, the potential pitfalls, and the benefits of getting it right.
First, it’s important to note the difference between data ingestion and ETL (extract, transform, load). Data ingestion is the broader term for the process of collecting raw data from various source systems and loading it into a target system. ETL, by contrast, focuses on transforming data into well-defined structures optimized for analytics.
Data ingestion: from asymmetry to symmetry and how to get there
Essentially, data ingestion takes you from data asymmetry (disparate data located in different data stores) to symmetry (standardized data, typically in a single data store or a limited number of them). Writing connectors to enable data ingestion may not be the most fun or exciting task, but it’s essential: it moves you to a standardized environment where you have all the tools at hand to work with larger aggregated data sets.
Sounds simple, but data ingestion gets complicated quickly. There are several different types of data ingestion (real-time, batch-based, and so on), each with different nuances when it comes to replicating the data and reconstituting it into a workable model. For instance, data from an API has different characteristics than data coming from a data lake.
Streams and batches and hybrids, oh my!
The intermediate part of data ingestion (getting to symmetric data) varies widely as well. Some data engineers take a stream approach to data ingestion, hitting an API, pulling individual records, and sending them into a queue. Others prefer a batch-based approach, grabbing records, writing them to a file, and uploading them to a data lake. Then there are hybrid approaches involving streams or queues combined with historical or archival data. How data ingestion is done often depends on what business users need from the data, whether it’s live charts or analytical data.
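The stream and batch patterns above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the in-memory list standing in for a queue, the record shapes, and the two-record batch size are all assumptions for demonstration.

```python
import json

def stream_ingest(records, queue):
    # Stream approach: push individual records onto a queue as they arrive.
    for record in records:
        queue.append(json.dumps(record))

def batch_ingest(records, batch_size=2):
    # Batch approach: group records into newline-delimited JSON "files"
    # destined for a data lake (here, just strings in a list).
    batches = []
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        batches.append("\n".join(json.dumps(r) for r in batch))
    return batches

records = [{"id": 1}, {"id": 2}, {"id": 3}]
queue = []
stream_ingest(records, queue)
files = batch_ingest(records)
```

A hybrid approach would typically combine the two: recent records flow through the queue for live consumers, while the same records are periodically batched into files for historical or archival use.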
It’s important when doing data ingestion to take the approach that optimizes each system’s strengths. For example, queue-based systems are awesome for real-time, streaming data. However, queues quickly hit the limits of what they’re designed for if you try to have them enrich records as they are streaming through (especially as the window of what records need to be joined widens). A better option in this instance might be a data orchestration or big data tool. When it comes to data ingestion, it’s vital to apply the right tool for the right use case.
Start by looking for data ingestion patterns
One way to simplify data ingestion is by considering the types of data you’re ingesting and where you’re ingesting it from, and looking for patterns. In our experience, people generally ingest data from APIs, databases, warehouses, and data lakes. Regardless of the data source, the first question to answer is whether the records are immutable or not. If the records don’t change and are only appended (customer purchases, for instance), then you get different patterns than if they do change (like customer addresses).
You may only need access to a CDC stream, a binlog, or a transaction log. Or you may need to look at auto-incrementing IDs or last-modified times. The key is to set things up so an external system can take a snapshot of the data and continually watch for changes and new data.
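The last-modified approach can be sketched as a simple watermark loop: remember the newest timestamp you’ve seen, and on each sync pull only rows modified after it. The `fetch_changed_rows` function here is a hypothetical stand-in for a real source query (e.g. `SELECT * FROM t WHERE last_modified > :watermark`), and the row shapes are illustrative.

```python
from datetime import datetime, timezone

def fetch_changed_rows(rows, watermark):
    # Stand-in for a real incremental query against the source system.
    return [r for r in rows if r["last_modified"] > watermark]

def incremental_sync(rows, state):
    # Pull only rows newer than the watermark, then advance the watermark.
    changed = fetch_changed_rows(rows, state["watermark"])
    if changed:
        state["watermark"] = max(r["last_modified"] for r in changed)
    return changed, state

rows = [
    {"id": 1, "last_modified": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "last_modified": datetime(2023, 2, 1, tzinfo=timezone.utc)},
]
state = {"watermark": datetime(2023, 1, 15, tzinfo=timezone.utc)}
changed, state = incremental_sync(rows, state)
```

The same skeleton works with auto-incrementing IDs: swap the timestamp comparison for an `id > last_seen_id` check.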
Learn to become a data hoarder
Having to go back and backfill historical data from an API can be challenging, if not impossible, due to rate limits, the granularity of the data, or the inability to access old data at all. Because storage is cheap, it can pay to store more than you currently need so the data is there for recomputation later.
Acknowledge that you won’t always have full fidelity or all the historical information you need when doing data ingestion. Keep the data you retrieved previously, even if you later need to enrich it with a different API request. Grab more than you need, because it can be painful to go back and get it later. It’s never fun when your roadmap hits a speed bump because you need a complex historical data pull.
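One concrete way to hoard: archive the raw API payload verbatim before extracting the narrow fields your current pipeline needs, so any field you ignored today can be recomputed tomorrow. This is a sketch under assumptions: the response shape, the `store_dir` layout, and the field names are all hypothetical.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def ingest(response_body, store_dir):
    # 1. Archive the raw payload verbatim, keyed by fetch time.
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    raw_path = os.path.join(store_dir, f"raw_{ts}.json")
    with open(raw_path, "w") as f:
        f.write(response_body)
    # 2. Extract only the fields the current pipeline needs.
    parsed = json.loads(response_body)
    return {"id": parsed["id"], "amount": parsed["amount"]}, raw_path

store = tempfile.mkdtemp()
record, raw_path = ingest('{"id": 7, "amount": 9.5, "extra": "keep me"}', store)
```

The `extra` field never reaches the downstream record, but it survives in the archive, so a future backfill is a re-read of local files rather than a rate-limited crawl of the API.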
Data ingestion steps
All this sounds complicated, but there are four basic data ingestion steps to consider when you’re getting started.
- Ask yourself if you have the code to connect to the external system and be sure you can test and validate connections. Start with the basics: Can you look up the entity on DNS? Can you create a TCP connection to log in and authenticate to the system? These checks are important in an increasingly cloud-based world and can save later debugging time.
- Before you get more complex, just go grab all the data. Or, if it’s an incredibly large amount, grab a subset, but make sure you can pull in data and start to see the behavioral characteristics of the systems you’re connecting to. For instance, some systems cap out at 100 records a second; others may give you thousands.
- Write the data somewhere and make sure it’s getting properly ingested. Preserve the structure of the data and check that you aren’t dropping columns or records. Map and match the data from the originating system to the new system. After that, you can start looking at how to make things faster, more efficient, and parallelized and how to monitor everything.
- Don’t move too fast, and always keep your timestamps, data formats, business logic (everything that makes your system tick) intact as you move upstream data in. Go lightweight at first: bring in simple data, view it, audit it, and transform it later.
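The first step above, validating connectivity before writing any ingestion logic, can be sketched with the standard library alone. The host and port below are examples; point it at your actual source system.

```python
import socket

def check_connectivity(host, port, timeout=5.0):
    # DNS check: can we resolve the hostname at all?
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return {"dns": False, "tcp": False}
    # TCP check: can we open a connection to the service port?
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return {"dns": True, "tcp": True}
    except OSError:
        return {"dns": True, "tcp": False}

# Example: localhost resolves, but (almost certainly) nothing listens on port 1.
result = check_connectivity("localhost", 1)
```

Running this once at pipeline startup turns a vague “the connector hangs” into a specific “DNS resolves but the port is unreachable,” which is exactly the debugging time the step is meant to save.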
Connector flexibility is paramount
If you don’t want to write connectors yourself for data ingestion (and who does, most of the time?), then take a hard look at the flexibility of the out-of-the-box connectors. Most vendors provide common connectors between S3 and BigQuery, and Salesforce and HubSpot, but do they cover more complex cases?
For databases, can the connectors support CDC streams or multiple replication strategies? Can they do things such as parallel reads and writes to optimize for speed and performance? You’ll inevitably need these types of advanced capabilities. For warehouses and data lakes, be sure the connectors can detect changes, fingerprint data, and monitor for new partitions of data.
Ask yourself, too, whether the connectors can support a wide variety of object types and fields, because custom objects, fields, or tables may be critical for your business use cases. Check that no data is lost when it’s translated from the source to the destination system, and make sure you have the data granularity you need to support your business use cases.
The connector should optimize for the system you’re writing to. For instance, if you’re writing into a lake that you plan on running Spark jobs on, remember that Spark doesn’t like large numbers of small files. And, as you’re ingesting data, be sure the system can start building profiles on the data as it’s moving from one system to the other. Metadata is a powerful way to inform downstream systems. If all else fails and you’re faced with a long-tail use case, be sure you can still fall back and write your own code.
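One common way to avoid the small-files problem is to buffer records and flush to the lake only when a batch reaches a target size. This is a toy sketch: the three-record threshold stands in for a real size target (often on the order of 128 MB Parquet files), and the list of strings stands in for objects written to storage.

```python
import json

class RollingWriter:
    def __init__(self, target_records=3):
        self.target = target_records
        self.buffer = []
        self.files = []  # stand-in for objects written to the lake

    def write(self, record):
        # Accumulate records; flush when the batch hits the target size.
        self.buffer.append(record)
        if len(self.buffer) >= self.target:
            self.flush()

    def flush(self):
        # Write the buffered batch out as one newline-delimited JSON "file".
        if self.buffer:
            self.files.append("\n".join(json.dumps(r) for r in self.buffer))
            self.buffer = []

w = RollingWriter()
for i in range(7):
    w.write({"id": i})
w.flush()  # don't forget the final partial batch
```

Seven records land in three files instead of seven, which is the kind of write pattern a Spark-backed lake handles gracefully.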
Get the ultimate in data ingestion flexibility with Ascend
With Ascend, you can ingest data from any source in any format. To make data ingestion more effective, you can replicate data from cloud to cloud with a simple data pipeline. You can build connectors once and use them forever, and automatically profile every piece of data. Instead of incurring big infrastructure costs from processing unchanged data, Ascend lets you process only the data that has changed, plus the data related to it. You also get automatic data detection and formatting. It’s the ultimate data ingestion platform.