In this episode, Sean and I chat with Sheel Choksi and go back to the foundations of data engineering and data pipelines with a deep dive into data ingest! We take a look at everything from where to start to how to not back yourself into a data pipeline corner and what different requirements different sources may need. Learn more—including where Sean and Sheel’s favorite place to ingest data from is—in the newest episode.
Leslie Denson: Today, we’re going back to the basics and back to pretty much step one of the data engineer’s pipeline, talking about Data Ingest. I’m joined once again by Sean, and this time also by Sheel Choksi, to talk the hows of Data Ingest, and the hidden and not so hidden intricacies of Data Ingest in this episode of DataAware, a podcast about all things data engineering. Hey everybody. Welcome back to another episode of the DataAware Podcast. I am joined once again by the illustrious Sean Knapp. Sean, how’s it going?
Sean Knapp: It’s going fantastic. How are you, Leslie?
LD: I am doing very, very well on this, for us, I don’t know when you guys will listen to it, but for us it’s a Monday evening. So doing well, can’t complain. Especially because we have… Well, I would call him one of my favorite Ascenders, but I really love everybody that we work with. But I do enjoy talking with our special guest today, who would be Sheel Choksi. So, hello Sheel. Welcome back to the podcast.
Sheel Choksi: Oh, thanks. Hey Leslie, hey Sean.
LD: How are you Sheel? If you guys don’t remember Sheel, Sheel was on a couple of our episodes last season. Sheel was actually a customer who decided to jump to the dark side once he [chuckle] used Ascend and joined us on our field team, and now he works hand-in-hand with a lot of our customers, helping them be successful. So, love Sheel. Sheel always has great stories and great insights, and so I always enjoy talking to him. So yeah, Sheel, if you wanna give any kind of more insight on yourself there and introduce yourself to the people, please feel free.
SC: Oh sure. No, I thought that was pretty great, covered all the basics there. But as Leslie mentioned, the fun part of being able to now see it from the customer side, is just see a lot more patterns very quickly. So instead of just, “Here’s one thing that we did at my previous company, and here’s how we decided to solve those problems,” now we would just be able to take those patterns and scale them out of, “Okay, what works across five or 10 customers? And what are the best practices as determined by actual success and failure rates out in the field?” So that gives me great joy to be able to see that, learn from that, and then share them with others.
LD: Woo-hoo. I love it. That’s why we like talking to Sheel. And that is why Sheel and Sean are gonna be phenomenal for this episode, because if you all haven’t kind of figured it out yet, now that we’re a few episodes into season two, we’re starting season two with a little bit of a back-to-the-basics around data engineering, and talking a little bit about some of the essentials, if you will, around data engineering. We’ve talked a little bit, at this point, about orchestration, we’ve talked a little bit about data automation, and today we’re gonna dive into what… One of these very basics around data engineering, one of the very first things that you’re gonna have to do, which is really data ingest. And so hearing from both Sean and Sheel is gonna be really fantastic around that. So I’m gonna ask what is gonna come across like a really dumb question, but I want both of your perspectives on this one. It’s gonna sound like a dumb question, but it’s not, ’cause I think it’s probably a little bit more complex than some people may think it is, and that’s why I want both of your perspectives. But what exactly is data ingestion? ‘Cause it sounds really simple, but I think it could also… There’s a lot of I think nuance to some different pieces of it as well. Sean, you’re up first. Make you go first.
SK: Tee me up first. What is data ingest? It might be only because it’s… I don’t know, it’s getting closer to dinnertime for us all as we record this, but I was thinking, I was like, data ingest is sort of like having to eat your veggies. It’s kind of like you can’t really get to the rest of the cool engineering things until you get through that piece, and probably doesn’t make a ton of sense, but I think the… Maybe I’m just hungry…
SK: Is all that amounts to. I think when we think about data ingest in the simplest form, it is the thing usually between your idea and you getting to applying your idea. And more often than not, it boils down to getting data from some other system, whether it’s an API, a queue, a lake, a warehouse, you name it, into your system where you can then really go to work on that data. And I think the reason why I describe it as “eating your veggies” is oftentimes it’s not really the most fun and exciting thing. I don’t think most engineers wake up in the morning and are like, “You know what I really wanna do today? I wanna write another connector to connect into another system to get access to some more data.” It’s just not the same level of excitement, but it is the thing that I think you really do have to solve for to get to the very good stuff.
LD: Yeah, makes sense. Sheel? What veggie…
SC: No, I was just thinking while Sean was talking about that, I was like, sometimes it makes sense to go first, to make sure that not everything’s covered already, [chuckle] but…
SK: Maybe it’s also Sheel’s a vegetarian, that I even went to the veggie…
SC: That’s the other thing, I was like…
LD: Sheel likes veggies.
SC: So it’s like eating your vegetables and I’m like, “That is dinner.” But…
SC: But I like to focus specifically on the system aspect that you mentioned, and using a generic term like that where data is spread out across all these different systems. As you mentioned, it might be APIs, which typically, to me, means SaaS services, maybe your Salesforce data, or maybe your email marketing data. It’s sitting in those systems, sitting in your own custom-built systems, which might be backed by databases and warehouses, and it’s really just moving it to another system. And the real fact of that is that the other system is just the one that you feel the most comfortable with for doing your other data processing work. That might be your warehouse, that might be your database, that might even be your own queue with your own custom-built app, it’s just that you want everything standardized and you want it in your system that you’ve picked. And that’s really how I look at data ingestion: arbitrary systems to just one particular system, and then that’s where you get to standardize and work with it in more of a symmetric manner, because you’ve done the work of handling asymmetric systems and now you have all the tools at hand.
LD: Makes sense. And it goes into I think what both of you are saying to some degree, and I guess where I’m thinking of the nuance of it as well is data ingest isn’t a one and done, to your point. It’s not a, I ingest… Well, I guess it could be, depending on what your systems look like and what your architecture is and what you’re trying to do, but for the vast majority of things, for the vast majority of people, it’s not a one and done. And there are different approaches for different things. You’re gonna have people who have real-time streaming systems and their ingest is gonna look a whole lot different than you have for batch-based systems. And even with batch systems, you’re gonna have things like incremental data and incremental processing. And you’re gonna have different things that you have to take care of in that regard. And I feel like ingest, and what you do with ingest for that, looks different than it does in maybe other systems. Talk to me a little bit about that.
SK: Sheel, you wanna go first this time?
SC: I was just thinking, sometimes it makes sense to go second so that there’s some precedence set.
SK: I see what you’re doing here.
SC: Yeah. [chuckle]
SK: Yeah I do think the, as Sheel highlights, a lot of this really comes down to you’re trying to pull data from some arbitrary system into another system that I would contend is just slightly less arbitrary as another system that you yourself happen to be comfortable with, and usually want to work with larger aggregated data sets together. And I think the thing is, every category of data system that you are going to connect to will exhibit different behavioral characteristics, and then fundamentally has very different ways as to how you’re replicating that data and how you’re actually reconstituting that data into a workable model on the other side. For example, APIs have very different and interesting characteristics than a lake. If I’m reading data from a lake, almost always a really large distributed system, I can hit it with as many parallel access layers as I want, I can stream through incredible volumes of data all in parallel. Whereas if I’m hitting an API, the limits and the restrictions are very different and they may vary depending on whether it’s my Facebook API or my Omniture API, or even an internal proprietary API.
SK: And so each one has different access patterns where I may have to put in different control models in place to make sure that I can get that data to the other side. I know Sheel, you’ve actually been, over the course of the last number of years, have seen many of our different customers having to work their way around a bunch of these different patterns tied to these, as well as when we think about warehouses and databases and CDC. What’s your take on this?
SC: Yeah, absolutely, I think… And that’s why I love just the data ingestion part is dealing with the asymmetry, and then once you’re in your system, then you’ve got it symmetric. And so, just as an example that you mentioned of pulling from APIs into the lake, you know the lake’s gonna have that great throughput, you know the APIs aren’t, and so your job is to really do that data ingestion gently for the API, pull the data that you need, and then just get it into that lake format so that you can do the rest of the work downstream of data ingestion, the transform, where else that needs to go, things like that, using your lake semantics. I think how you do that intermediate part though, from, in this case, API to lake, is fascinating. We also see a huge number of patterns there where some people might do it through more of a stream style, so on a clock, they may be hit the API, but then pull those individual records and send them into a queue. Others we see maybe do a batch style in there of grab those records, immediately write them to a file and upload them into the lake, so now you’ve got batches and you’re dealing with that more of that kind of architecture.
SC: We’ll see folks multiplex across these. Perhaps there’s some real-time use cases for some of the data, and so maybe like a queue or a stream might be better, as well as maybe there’s historical and archive and batch methods for it. And so I think it’s not only the subtleties of the ingest systems, but also how you’re going to do the ingest, and then ultimately, I think all of this is for both your system of record where you’re doing the downstream work as well as the business use cases, past that. What does the business need from this data? Do they need to see it real-time? Do they need to see live charts because it’s part of operational datasets that are giving live feedback to people? Does it actually just provide analytical value, so it needs to be accurate, but it needs to be on a much slower cadence? And there’s a lot more forgiving natures there of what can be done. So I think you stack all of that up together, and then all of a sudden that data ingest that we just defined so simply, I think becomes a bit of a combinatorial problem of how to solve it.
LD: More on that front, I always like asking… Well, maybe not this specific question because it’s the first time we’ve talked about ingest like this but I like asking especially when we have Sheel, you, or somebody like you on the line, which is, what is the biggest corner that you’ve either… ‘Cause you’ve done this. It’s something that you did before you came to work here. And so what is the biggest corner either you’ve painted yourself into or you’ve seen a customer paint themselves into when it comes to ingest? And that goes to basically what roadblocks can people expect or what things should people be thinking about because it may be like eating your veggies, it’s something that you have to do, and I don’t think people think it’s simple, but it’s certainly something that can cause issues.
SC: We’ll leave out names to protect the innocent, but… [chuckle]
LD: Yes, please.
SC: Essentially, the biggest theme that I see is when using these services that all have trade-offs in ways that are typically not optimizing for what their strengths are and instead using them for their weaknesses. So for example, in a queue-based system, queues are awesome for more real-time nature, streaming things through individual records. And then actually maybe trying to use a queue as part of control logic, so instead of the records actually flowing through the queue, trying to send through like, “Hey, this file over here in S3 is ready.” Okay, read that off of a queue and then process that file and then send it to another queue. And you end up hitting the limitations of really what a queue was designed for, of expected times for you to ack the message versus when the message gets pushed back and things like that. Whereas maybe that was a better fit for something like an orchestration tool with some more big data tools… Because the data was already in the files, or vice-versa. Sometimes we’ll see that people want fast record streaming because they do have real-time use cases, maybe to the sub-milliseconds and things like this. S3 is fantastically capable at doing very low latency operations.
SC: And so sometimes we’ll see folks just use very, very small files into S3 because they’re looking for that real-time latency. So instead of using like a queue which is, for example, a great fit for this kind of streaming use case, they’ll instead just be like, “Well, why don’t we just every 50 milliseconds just upload the latest records into a new file in S3, and then somebody else could read that in?” Yeah, and all of a sudden, S3’s strengths are being able to very quickly send back bytes, get input operations, things like this, but now all of a sudden we’re relying on S3 to essentially do the number of metadata operations as their actual data operations, because the size of the files are too small. We’ve broken conventions of big data processing because we’ve created so many files, a lot of big data systems assume that listing files is gonna be relatively cheap compared to the actual data processing of those files. But again, if all of a sudden the files mainly just contain a couple of records, then again, creating metadata operations the same size as your actual data operations. So we’ll just see a lot of these patterns where these tools can be so compelling up to a point, but fundamentally, there’s always these trade-offs, and so we definitely wanna analyze the… Are we using the right tool for the right use cases and the right trade-offs that we’re looking for?
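The small-files anti-pattern Sheel describes is usually avoided with a simple record buffer that flushes on a count or age threshold, so each object written to the lake carries many records. A minimal sketch of that idea follows; `upload_fn` is a stand-in for a real uploader (e.g. an S3 `put_object` call), not any specific AWS API:

```python
import time

class BatchingUploader:
    """Buffer individual records and flush them as one larger object,
    instead of writing one tiny file per record."""

    def __init__(self, upload_fn, max_records=500, max_age_seconds=60):
        self.upload_fn = upload_fn          # hypothetical: writes one batch to the lake
        self.max_records = max_records
        self.max_age = max_age_seconds
        self.buffer = []
        self.first_record_at = None

    def add(self, record):
        if not self.buffer:
            self.first_record_at = time.monotonic()
        self.buffer.append(record)
        if self._should_flush():
            self.flush()

    def _should_flush(self):
        too_many = len(self.buffer) >= self.max_records
        too_old = (time.monotonic() - self.first_record_at) >= self.max_age
        return too_many or too_old

    def flush(self):
        if self.buffer:
            self.upload_fn(self.buffer)     # one object per batch, not per record
            self.buffer = []
```

The age threshold bounds latency while the count threshold bounds metadata operations, which is exactly the trade-off between "real-time" and "too many files" discussed above.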
LD: Yeah. So with that, we’ve talked a lot about, to this point, obviously ingest, that’s what we came here to talk about today, and we’re talking a lot about the ways that you’re ingesting and where you’re ingesting from, but for those who maybe are looking at this as kind of an introductory of data engineering or aren’t maybe as well-versed in some of these things, talk… Let’s back up a step, we’re taking one step higher level, and talk to us about some of the classifications of data, where you’re ingesting from, can records be modified, can they be attended? What is the data? How do you ingest different data different ways? That sort of thing. And then maybe we dive in a little bit further on each one of those. ‘Cause I think we’re having a really great conversation about some of this stuff, but maybe we skipped over some of the 101 levels to get down to the two and 301 levels.
SK: Yeah, that’s certainly something that can be helpful for folks is, let’s boil it down to some of the most common patterns that we see too…
LD: Much better way of putting it, yeah.
SK: Which are, if I had to guess from what I’ve seen, the three or four most common places that people are generally ingesting data from are APIs, databases, warehouses and lakes, not necessarily in that order. But when we start to think about how each one of those behaves, it’s probably helpful to go through those. And it’s interesting against that backdrop, so much of I think the behavior of how you ingest data from those systems to replicate the data model in your internal system, I think so much of that can come from what really boils down to a basic question of: are the records immutable or not? If the records never change, if the data doesn’t change, but it’s only ever appended to, you end up then employing pretty different patterns for how you try and ingest data versus how you would if those records are actually mutated and modified over time. And there’s a handful, I think, of other ways of slicing this too, but oftentimes we find as we help customers first go through those ingestions, that’s the first thing: if I take a database, for example, what’s the nature of the records in the table? Are they append only or are they actually mutated? For example, if they’re customer address records, they’re going to be mutated and modified. If they are purchases, hopefully they’re not modified. Still actually can be and probably are in most actual systems, but…
SC: Yeah, often are, depending on how people modelled it. Yeah.
SK: The notion behind that is, if I’m looking at a database I know, for example, a database generally can’t be hammered super hard by a distributed big data system. It can definitely be hit pretty hard, but not as hard as a warehouse or a lake, for example. When I’m looking at a table inside of the database, if I need to replicate those records, there’s a couple of different models that work for ingestion. You’re looking for either: do I have access to a CDC stream? In a database that’s oftentimes a binlog or some other transaction log that you can read from and replicate all changes off of. Or oftentimes, you’re just looking for something that’s a little bit more simple, which is, “Are there auto-incrementing IDs?” And last modified times. I would say for anybody designing a database table today, if you’re using a typical database, at least add auto-incrementing IDs and last modified times and probably created-at times and a handful of other things too. But that would help tremendously any simple layer that’s trying to replicate your data, especially if you don’t want to go to the next layer of complexity, which is generally a CDC stream, that’s all.
SK: So I think we tend to see, using databases as an example, that’s a pretty common turn to set the patterns of… For an external system connecting into it, frankly, how does it find not just a snapshot of the data, but then how does it continually watch and listen for new data that is becoming available or data that’s changing?
SC: Yeah. And I think that touches on a big piece of data ingestion, which we haven’t actually talked about yet, which is creating incremental versions of the data ingest. ‘Cause certainly with all of these methods, or all of these sources, queues, APIs, databases, you can pull a lot of records back. Queues, maybe the records are more temporal, but a lot of times databases, they’ll have the full set of data there. So yes, you could do data ingestion by just pulling the full data set. But most of the time what people are looking to then work on and build is finding that incremental nature of it. So I think with databases, you covered pretty much all of the standard patterns that we see there. Queues tend to be a little bit more straightforward since you keep a cursor normally, or a marker of your last acknowledged and last streamed-in message, and then just pull all the new messages since then. And it ties back to your point about mutability. Queues have frameworks for how you mutate records, for example, here’s a record restated with the same primary key. But most distributed queues like Kafka, Kinesis, things like this, are append only. So the ingest and the incrementality of it is quite a bit simpler there as well, where you’re just pulling those in. Some complexities with Kafka, like tombstoning records and if you’re using higher-level abstractions there, but for the most part… I think APIs are where it gets a little bit fuzzy though. Yeah, I see Sean smiling.
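The cursor pattern Sheel describes for append-only queues can be shown with a toy log partition; the `AppendOnlyLog` class here is a simplified stand-in for a Kafka or Kinesis partition, not a real client API:

```python
class AppendOnlyLog:
    """Toy stand-in for an append-only queue partition (Kafka/Kinesis style)."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read_from(self, offset):
        return self.messages[offset:]

def consume(log, checkpoint):
    """Pull everything after the last acknowledged offset, then advance it.
    Persisting `checkpoint` between runs is what makes the ingest incremental."""
    new = log.read_from(checkpoint)
    return new, checkpoint + len(new)
```

Because the log is append only, incrementality is just "remember one integer per partition," which is why queue ingest is simpler than database or API ingest.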
SK: This is where it gets fun. So with APIs, we inject a handful of other elements here. And we see this from a lot of customers who are like, “Hey, APIs are really interesting. But, for example, I oftentimes am rate limited, or limited in the number of queries I can make per second and per day.” And/or, “Hey, at some point in time, I lose granularity of my data. Maybe I can get ultra-granular data for the last 30 days, but beyond that window, I only have access to summary information from the API. So I can’t even access that old data.” How do you see people solving for that, Sheel? ‘Cause that does introduce some behaviors kind of similar to CDC streams oftentimes, kind of similar to queues at other times, but unique because it’s an API.
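Handling the rate limits Sean mentions usually means backing off and retrying rather than failing the ingest. A minimal sketch, assuming a hypothetical `fetch_page` callable that returns an HTTP-style `(status, payload)` pair:

```python
import time

def fetch_with_backoff(fetch_page, max_retries=5, base_delay=1.0):
    """Retry a rate-limited API call with exponential backoff.
    `fetch_page` is a hypothetical callable returning (status_code, payload)."""
    for attempt in range(max_retries):
        status, payload = fetch_page()
        if status == 429:                       # rate limited: wait, then retry
            time.sleep(base_delay * (2 ** attempt))
            continue
        return payload
    raise RuntimeError(f"still rate limited after {max_retries} attempts")
```

Real connectors layer on more (honoring `Retry-After` headers, daily quota budgeting), but the exponential wait is the core of being "gentle" with an API source.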
SC: I think we may have talked about this the last time I was here as well, but I am a data hoarder, and I encourage people to become data hoarders as well. And it’s for exactly those reasons that you just mentioned, where, for a lot of these sources, you either pull it and have it, or you lose it to the sands of time. APIs are a fantastic example of this, where even if an API allows you to fetch the current state of everything, very few APIs are even gonna let you fetch how things were before. Alright? And you have to keep that. Or the version that they let you fetch maybe doesn’t have all of the fields that you need and maybe you needed to enrich it against a different API request. And those types of things are essentially lost if you don’t hoard it. And so, strategy number one there is absolutely figure out what you need and how to get it. And granted, if you’re just starting this practice, then strategy number two is acknowledge that you won’t necessarily have full fidelity. It’s very hard to pull off a project that is going to have full fidelity of all historical information. And so just kind of embracing that. So for your historical, maybe you are gonna be able to fetch it from one other endpoint and it’s going to have 80% of the data that you need.
SC: And then going forward, now that you’re ingesting and you’re hanging on to it, you can employ strategy number one there. So that’s the first part, given that these APIs are changing. The second part is that it’s often worth grabbing more than you need, because it is very painful to go back and get it later. That’s not true of necessarily a warehouse or a lake or a database even, but that is very true of an API. So for example, I always suggest, “Hey, just pull in the full response body and let’s go ahead and take the full set of records and fields that we’ve got and save them.” Assuming compliance and everything, we’re okay there. Because then if we ever actually needed another field downstream for some use case, we’re just going back to our system where we’ve done the ingest, back to our familiar place. As opposed to creating a much harder problem of, “Now we need to go back and fetch this data historically.” Which might be a big period, and then all those rate limits and throughput problems that we just mentioned. So it kinda still ties into the data hoarder strategy, but has to do with really that tail part of data ingest, which is, as you’re putting it into your canonical system, into your system that you control, how much are you bringing in and how much are you letting go?
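The "hoard the full response body" strategy can be sketched in a few lines: land the raw payload untouched with a little metadata, and project fields out of it later. The in-memory `raw_store` list here is an illustrative stand-in for a real landing zone in a lake:

```python
import json
import time

raw_store = []   # stand-in for your lake / raw landing zone

def ingest_response(body_text, endpoint):
    """Keep the full response body, not just the fields needed today,
    so a new downstream field never requires a historical re-fetch."""
    raw_store.append({
        "fetched_at": time.time(),
        "endpoint": endpoint,
        "raw": body_text,          # untouched payload, hoarded as-is
    })

def project(field):
    """Later, pull any field out of the hoarded payloads on demand."""
    return [json.loads(r["raw"]).get(field) for r in raw_store]
```

When a new use case needs a field that was never modeled, `project` answers it from data you already own, instead of re-fetching through the API's rate limits and retention windows.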
LD: I think we’ve talked through this piece of it a little bit in different aspects of what you guys have both talked about. But to just break it down for somebody who is literally just trying to build their first pipeline out there. And it’s… To your point though, it’s probably also different depending on where you’re trying to pull data from, but are there some generic steps that people should think about when it comes to ingest? And I guess maybe a checklist of things to consider and think about when it comes to ingest of, we need to just make sure that these five things are checked off the list before we move forward? And there may not be. But…
SK: I think it’d be… Probably some of the simplest and straightforward ones are, when we think of, I wanna go build a system that’s going to do some ingest. The most basic ones are, “Do I have the code that can connect, obviously, to this external system?” And something that I’d literally just run through with one of our customers is, we put in a bunch of tooling around, not just the connector code, but the ability to test and validate connections. Everything from, “Do you have the viable credentials?” to “Can we look up the entity on DNS?” to “Can you create a TCP connection?” to “Can you actually log in and authenticate to the system? Whichever system it is.” But generally, it’d be, “Can you connect, find and connect, the system you want to read from?” And I think that’s important because, even though we’re increasingly in a cloud world, there are also a lot of networking safeguards in place, with most companies, that can block access, even at lower networking levels to individual systems.
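The connection checks Sean lists (DNS lookup, TCP connection, then authentication) can be layered as small standalone probes; a minimal sketch of the first two, using only the standard library:

```python
import socket

def check_dns(host):
    """Can we even resolve the system we want to read from?"""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

def check_tcp(host, port, timeout=3.0):
    """Can we open a TCP connection to it (firewalls permitting)?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running these in order tells you which networking layer is blocking access before you ever get to credential or authentication errors, which is usually where debugging a new connector starts.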
SK: So a lot of it just comes down to access. Can you connect to the thing you’re trying to get access to? I think the next big one that we usually recommend for folks is, before you get fancy, just go grab all of the data, or if it is an incredibly large amount of data, grab a subset sampling of that data. But then it usually boils down to, make sure you can just pull in some data, ’cause you will start to see different behavioral characteristics of the systems you’re connecting to. For example, some systems will just fundamentally cap out at how much and how fast the data you can read from that system. Some may cap out at 100 records a second, some may give you 1000, some may give you hundreds of thousands of records a second. It just depends on how they store the records, how they can stream it back.
SK: And then, obviously, once you’re capturing that data, you do need to write it somewhere, and so that would be the last part: just make sure you can get it properly ingested and, as Sheel highlighted, preserve the structure of the data. Don’t drop columns or records, make sure you can actually map from that one system to the new system, that the data types match, that the column types match, etcetera. And that’s really the basics first, then you can start to get into a lot of the fanciness of, how do you make it more efficient? How do you read faster? How can you parallelize? Etcetera, but that’s really where I would say it first starts. Sheel, what are your thoughts?
SC: I think so. And, maybe as part of that second level, would be how to do it incrementally, how to monitor it, all the day-two stuff after that. But absolutely, and I think what you touched on there is… It’s funny, it’s so philosophically embedded in how we think about it, but how you’re writing it to that second system, your canonical system, is so critical. The more you do in between when you’re pulling the records and before you insert them is all temporal. So the business logic there, how you decide to save it, how you decide to parse timestamps, all of that, essentially, is in-flight. And I almost think of it as at-risk. For example, if you’re pulling it from an API, and you decide that the values look like proper ISO 8601 time formats, and so you decide to parse them and store them as timestamps, but they actually weren’t consistently that, so a bunch of them don’t parse, and now you have a bunch of nulls in that column. How much you trust that upstream data set, and how much you do in that data ingestion step, is really going to affect your abilities after that.
SC: So I even recommend, as Sean is saying, keep it simple, go really lightweight. If you’re not sure, maybe all the types are actually just strings for now, ’cause then at least that’s your safe format and at least you can bring it in, view it, audit it, play with it, transform it later, whereas you don’t lock yourself into a corner like we were talking about earlier with Leslie. So just keep it simple, get it going in, it’s really the beginnings of everything.
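Sheel's timestamp example, and his "keep it as strings" fallback, can be sketched as a parse that never silently nulls a value out: keep the raw string alongside the parse result so failures stay auditable. A minimal illustration:

```python
from datetime import datetime

def safe_parse_timestamp(value):
    """Try to parse an ISO 8601 timestamp; on failure keep the raw
    string instead of silently replacing the value with a null."""
    try:
        return datetime.fromisoformat(value), None
    except (TypeError, ValueError):
        return None, value    # preserve the original for auditing later

# The second value only looks date-like; it isn't ISO 8601.
rows = ["2024-05-01T12:00:00", "05/01/2024"]
parsed = [safe_parse_timestamp(v) for v in rows]
```

Because the unparseable original survives ingest, you can inspect it, fix the parsing logic, and re-derive the typed column later, rather than discovering a column of mystery nulls downstream.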
LD: So there are a lot of… Maybe not a lot, but there are multiple vendors out there, including us, that offer connectors, that will help you more easily ingest, get from Point A to Point B. What are some things that you would recommend people consider when they’re looking at somebody to help them do that so that they’re not… It’s not somebody who’s coding the connector themselves, they’re not writing the connector themselves, they’re not having to do this, everything, themselves. They’re going to somebody and saying, “Let me use this connector to get from S3 to Snowflake,” or whatever it might be. What would you recommend? What are, maybe, the three or five things that you would recommend people look for? Or are there any red flags that you would recommend people look out for when they are looking at a vendor that offers connectors like that?
SK: I’d say probably a couple. One is certainly the flexibility that you’re afforded with their set of connectors. Oftentimes, there’s going to be a ton of really common ones that you’re gonna wanna tackle with out-of-the-box, pre-configured, ready-to-go connectors. I wanna read from S3 or BigQuery. I think that most people will be able to support patterns like that. But then you start to get into, I think, some of the next layers of complexity, that do very much matter. And if you start to peel back a few layers. For example, for databases, can the connectors actually support CDC streams? Do they support multiple replication strategies via, for example, last updated at columns? Or can they do things, like parallel reads and writes, to optimize for speed and performance? And similarly, can they do snapshots of individual databases, just so you have a full, complete snapshot at a particular cadence if you can’t get a full CDC stream? So, you want to sort of bunch these just like that. And then probably for… I would say for…
SK: For both warehouses and even lakes, how do they detect changes? Are they fingerprinting data as it’s coming through to see if data has been changed? Or are they monitoring new partitions of data? So, I think that part becomes really important and is something I would recommend looking at. And then probably the last piece that I usually see with folks is, and we get asked this all the time, which is, “Yeah, but what happens when I do need to just write some code?” And I think that becomes really important, which is, can you still fall back and cut your own code, or are you just stuck? And I do believe that that’s something where a lot of people can get burned, is you find a tool that works for most things, but the long tail is just simply not possible.
SC: Totally, I think those are some great use cases. The only thing I would add on is, a lot of times with these out-of-the-box ones, they’re awesome, and as you said, they’re very common ingest sources and 80 to 90% of people have the same use cases for them. So for example, “Oh, I wanna grab my Salesforce data” or “My HubSpot data”. Again, 80 to 90% of the things that people wanna do with them are all the same. However, it is worth vetting if it works for all the use cases that you have in mind. So for example, I think with HubSpot specifically, a lot of the pre-built connectors worked on their main objects, but then they started adding more object types and those weren’t necessarily yet supported by the common class of out-of-the-box connectors. And so the API had them, but the out-of-the-box connector didn’t, and those custom objects might actually be critical for your business in your business use case. And so it is worth just vetting that and checking that, not to say that you can’t come up with a solution, but just so that you know what your options are and what’s gonna be required. So I’d say that one’s pretty common: missed fields, missed tables, things like that are all pretty common, like, “Oh, we pulled the standard object, but not custom fields” or things like this, so that’s one.
SC: Two, a lot of times there are going to be choices made mapping between the systems. So between how a SaaS vendor has the data represented and how it’s going to be represented in a warehouse or a lake, there are gonna be some decisions made about how that translation occurs. And most of the time, folks will be careful to make sure that there’s no data loss in that translation, but it’s worth checking again: did the right fields and pieces of data that you need make it through that translation layer? And then the third thing, which kinda relates to that translation layer as well, is: does it have the granularity of data that you need? So as Sean was talking about databases and CDC, a very common pattern is to use CDC to recreate the table as it is. That’s the 80 to 90% use case for this. And so because of that, a lot of these out-of-the-box vendors, for something like a Postgres, might just use the CDC stream and then deliver the ready-made tables as they stand into, for example, a lake or a warehouse. However, certain businesses and certain use cases have to do analytics on the actual records as they were changing, for example, how often did a user update their address?
SC: And so if you fit into the common pattern and you just need the table as it stands, fantastic, use that. If you need the snapshots like Sean was mentioning, because you need to know what the users looked like each day, or you need the CDC stream itself, that’s again a translation layer happening in the out-of-the-box connector, and you just have to decide if it’s at the right level of abstraction to make sure that you’re still able to complete your business use cases.
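[Editor’s note for the show notes: Sheel’s granularity point can be made concrete with a small Python sketch. The change-log shape and field names below are hypothetical, just standing in for a raw CDC stream; the point is that collapsing to the latest state (what many out-of-the-box connectors deliver) discards the history needed for questions like “how often did a user update their address?”]

```python
from collections import defaultdict

# Hypothetical CDC events: (user_id, field, new_value, timestamp)
change_log = [
    (1, "address", "12 Oak St", "2023-01-01"),
    (1, "address", "98 Elm St", "2023-03-15"),
    (2, "address", "7 Pine Ave", "2023-02-02"),
    (1, "email", "a@example.com", "2023-04-01"),
]

def latest_state(events):
    """Collapse the stream to the current table: last value wins per (id, field)."""
    state = defaultdict(dict)
    for uid, field, value, _ts in sorted(events, key=lambda e: e[3]):
        state[uid][field] = value
    return dict(state)

def change_counts(events, field):
    """Keep full granularity: count how many times each user changed `field`."""
    counts = defaultdict(int)
    for uid, f, _value, _ts in events:
        if f == field:
            counts[uid] += 1
    return dict(counts)

print(latest_state(change_log)[1]["address"])  # -> 98 Elm St
print(change_counts(change_log, "address"))    # -> {1: 2, 2: 1}
```

Both views derive from the same stream; if the connector only hands you the collapsed table, the second question is no longer answerable.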
SK: And I would add probably two more things, as we get more and more advanced on the data ingestion side. One category is, can it also optimize for the system that you are writing to? For example, if you’re writing into a warehouse, this specific thing wouldn’t matter, but if you’re reading really large numbers of small files and writing them into a lake that you plan on running Spark jobs on, Spark doesn’t like large numbers of small files. And so can your ingestion system also help optimize the data in a lossless way, to prepare it to be better processed downstream? That’s one that I would suggest. Two is, as you’re ingesting data, can the system also start to build profiles on the data as it moves through? Because that metadata becomes a really powerful thing to inform downstream systems with too. Is it categorizing? Is it classifying the ranges and the cardinality of different data sets as they come through, and profiling that, so you can feed those downstream systems later?
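[Editor’s note for the show notes: Sean’s two points, compacting small files losslessly and profiling data in flight, can be sketched in one pass. This is a stdlib-only illustration using JSON-lines files; the function name, directory layout, and the assumption that values are scalars are all hypothetical, and a real lake would more likely use Parquet and a richer profiler.]

```python
import json
import os

def compact_and_profile(in_dir, out_dir, target_lines=10_000):
    """Coalesce many small JSON-lines files into fewer large ones (lossless),
    and record simple column profiles (distinct-value cardinality) on the way."""
    os.makedirs(out_dir, exist_ok=True)
    buffer, part, profiles = [], 0, {}

    def flush():
        nonlocal buffer, part
        if not buffer:
            return
        path = os.path.join(out_dir, f"part-{part:05d}.jsonl")
        with open(path, "w") as f:
            f.writelines(buffer)
        part += 1
        buffer = []

    for name in sorted(os.listdir(in_dir)):
        with open(os.path.join(in_dir, name)) as f:
            for line in f:
                record = json.loads(line)
                # Profile as data moves through; assumes scalar values.
                for key, value in record.items():
                    profiles.setdefault(key, set()).add(value)
                buffer.append(line)
                if len(buffer) >= target_lines:
                    flush()
    flush()
    # Cardinality per column: metadata a downstream system could use.
    return {key: len(values) for key, values in profiles.items()}
```

The output files carry the same records as the input (nothing dropped or rewritten), while the returned profile is exactly the kind of metadata that can feed downstream planners or classifiers.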
SC: Yeah, it doesn’t end after data ingest, does it?
SK: No. Like I said…
LD: It never does.
SK: That’s where you get to the good stuff.
LD: It never does. Alright, so to wrap this up, I have a final fun/mostly just totally nonsensical question that both of you have to answer. What’s your favorite place to ingest data from? Totally nonsensical, it makes no sense. I know that, but I just wanna know.
LD: They’re both staring at me like I’m crazy.
LD: Just for those of you who… There’s no video for y’all, so I just need y’all to know…
SK: We’re just thinking. You know, I feel like…
LD: They’re looking at me like, “Did she really ask that question?” And I ask this because I have been on multiple calls where Sean’s done a demo where he always gravitates to the Google Sheets read connector for some reason. And he may not say that today, and that’s totally fine; he may just really love our particular Google Sheets read connector for what it can do, and that’s totally fine. But I just feel like I wanna know: where do you guys really enjoy ingesting data from in particular?
SK: I was gonna give Sheel the first shot of this, but I’m taking it. I love Google Sheets.
SC: That’s where you’re going with this?
LD: I thought that might be where you would go.
SK: Yeah, absolutely. So I like it. And I was thinking back through, ’cause I’ve been spending a bunch of time with customers as of late, watching them pull in data from all sorts of different systems, and there are all sorts of great other anecdotes that I think Sheel can share. But here’s the reason why I like Google Sheets: most teams have been able to pull together something reasonable… Even if they’re not happy with it, they can access the data some way, somehow. Maybe they can’t get it fully into their pipelines or into their warehouses and lakes, but they’ve seen the data before. The reason why Google Sheets is fun for me is that sheer glee moment of them being like, “Oh my God, there’s other data that we’ve never before unified with our ETL pipelines,” ’cause most of them have never done that. “This unlocks all sorts of other really cool, fun use cases for us.” And so you see the light bulb moment, not of “Aha, I can accomplish this,” but of, “Oh, all of these things we can now go accomplish because we have access to this type of data.” That’s one of the things I really enjoy seeing.
LD: I get that.
SC: You went in a very different direction in this than I was thinking. I was thinking like, “Okay, let’s pick the one that’s least temperamental, that just makes life easy. It sounds like [laughter] like no APIs.” I’m like Blob, or maybe Queues, or warehouse. And then the reason I didn’t respond, I was like, I feel like this is a trap. ‘Cause all of a sudden with Blob…
LD: It’s not a trap.
SC: Sometimes everyone’s like, “Oh, they’re just CSVs.” I’m like, “Oh, they’re just CSVs?” Half the time they’re not escaped correctly, or there’s just always stuff to deal with. So I was like, “No, wait, no, not Blobs.” Blobs will behave from a throughput perspective, but not necessarily in the sense that anyone can put anything in them. So then I started…
SK: Just winging it to go with Parquet?
SC: Yeah. I was like, I gotta tighten up my requirements here: well-formed Parquet files in blob storage. Like, “Wow, wow, what a data source.” Now, [chuckle] that’s probably a little too boring. I’m gonna go with Queues. I really like ingesting from Queues, ’cause I do think they have semantics at the level of ingest, similar to Blobs or a warehouse; there are semantics as to how you’re supposed to get the data, and the throughput that they support is well-known. And the reason I like it is, certainly, having some standards is always nice, something to build against. But it’s funny, sometimes folks get so caught up on Queues being a real-time system, when actually you also have to do some batch work off of them. And so having a reliable system to ingest from Queues and make batches out of it, so that analytical workloads that are more batch-driven can happen from that, is kind of an exciting thing. Sometimes it just helps people bridge the gap: these worlds don’t have to be isolated.
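[Editor’s note for the show notes: the stream-to-batch bridge Sheel describes is commonly implemented by draining a queue until either a batch-size or a time-window limit is hit. A minimal sketch using Python’s in-process `queue.Queue` as a stand-in for a real system like Kafka or SQS; the parameter names and event shape are hypothetical.]

```python
import queue
import time

def batch_from_queue(q, max_batch=100, max_wait_s=1.0):
    """Drain a queue into one batch, flushing when the batch is full
    or the time window elapses -- a simple stream-to-batch bridge."""
    batch, deadline = [], time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

# Hypothetical usage: 250 preloaded events become three analytics-friendly batches.
q = queue.Queue()
for i in range(250):
    q.put({"event_id": i})

batches = []
while True:
    b = batch_from_queue(q, max_batch=100, max_wait_s=0.1)
    if not b:
        break
    batches.append(b)

print([len(b) for b in batches])  # -> [100, 100, 50]
```

The same loop serves both worlds: a real-time consumer could act per message, while batch-driven analytical workloads pick up whole, bounded batches.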
LD: I like it. I can appreciate that. Well, thank you both so much for taking some time this evening to talk about Data Ingest with me. I hope you all enjoyed this conversation as much as I did. And Sheel, you have once again passed the test, so I’m sure we will invite you back. So yeah, I know. You just never know with you. We’re just never totally sure with you, but you did it again, barely, barely, barely. It was that last question where you answered Queues and not Parquet files in Blob. [chuckle] So that was where it was gonna tip one way or the other. So alright, thanks Sheel.
LD: Honestly, I would love to hear where you all enjoy ingesting data from. So feel free to reach out over Twitter, either at @ascend_io, or to me personally at @LeslieD. That’s L-E-S-L-I-E-D. You never really know what you’re gonna hear, and it’s super interesting to hear why. Whether it’s because of the challenges you have to overcome, or the fun ways that you can then work with and mutate the data once you get it into the pipeline, everyone’s experience is a little different, and it’s always a super cool journey to hear. Also, as always, if you have any questions about Data Ingest or anything else that you’re hearing on the podcast, you can certainly reach out about that as well. Welcome to a new era of data engineering.