Getting to the Heart of ETL with Data Transformation

Ascend.io

data-eng@ascend.io

In this episode, Sean and I chat with one of Ascend's Field Data Engineers, Shayaan Saiyed, about one thing data engineers can't function without: data transformations. Shayaan and Sean dive into best practices for iterating and troubleshooting and also discuss how data teams can be goal-oriented when starting to think about their ETL pipeline in the latest episode of the DataAware podcast.

Episode Transcript

Leslie Denson: At the heart of every ETL pipeline is, well, the T. So today, Sean and I are joined by Shayaan, one of Ascend's field data engineers, to chat through data transformations, the how's, the what's, and how many times transformation equals translations in this episode of DataAware, a podcast about all things data engineering.

LD: Hey, everybody. And welcome back to another episode of the DataAware podcast. I am joined once again by the person who... I don't know if you guys have noticed, but I am kind of sort of bribing him and making him join me for every episode this season, and that would be Sean Knapp. So Sean, welcome. Welcome to the podcast, I'm glad that you are accepting my bribery and joining us for everything.

Sean Knapp: Thank you for having me again, and no bribes needed. This is definitely one of the highlights of my day and week. So I always enjoy coming on and chatting.

LD: Well, good. Well, we appreciate it. We enjoy it as well. It's always very fun. I say we... I enjoy it and our guests usually... Other guest usually enjoys it, and I hope the listeners enjoy it, they keep coming back. So that's always good. And today, we have another guest. We have another Ascender for everybody who we're super excited to talk to. He's a part of our field data engineering team. So some of you listeners may have actually chatted with him before. So we have Shayaan, like I said, from our field data engineering team on the line. So Shayaan, welcome. Hello.

Shayaan Saiyed: Hello. Thank you so much for having me. It's good to be here.

LD: How's it going? Good.

SS: It's going great. This is my first time ever on a podcast, so I'm super, super stoked about it.

LD: Woohoo. Well, we promise to be kind and we promise to not surprise you with too many hard questions. So we'll give it a shot. I can't promise anything for Sean. He likes just throwing zingers in there somewhere, so...

SS: Let's see how this goes.

LD: Yeah. I'm not totally sure what's gonna happen with Sean, you never know. He just likes being crazy.

SS: Sean's a little suspicious, so I'm bracing myself.

LD: Yeah. A little bit shady over there. Never know what he's gonna do...

SS: Be prepared.

LD: Yeah, when it comes to the questions. But we... I am actually... I say this with every guest, and I know I need to stop saying it. I have a really bad habit of doing this, but it is always because I am so thrilled with the guests that we have on, with the particular topics that we have set forth. I'm super thrilled to have the two of you on to talk about this particular topic, which again, if folks have been listening over the course of the last several episodes, you will know that for the most part, we've been going back to the basics and really looking at some of the foundations of data engineering. And you can't do a whole lot of data engineering without talking about some data transformations. And so that is what we're gonna dive into a little bit today, which is why we brought Shayaan, because he is one of the ones that's working a lot with our customers and helping them do things with the Ascend platform, and so he's working a lot with them on making sure that those transformations are working, so he's got a lot of really killer insights. So with that, we're gonna dive into the topic with, as we were joking earlier, what seems like it should be a really easy question that maybe is, maybe isn't... For those of us out there who maybe are newer to the space, what is a data transformation, you guys? What is the 101 level of a data transformation? Who wants to go first?

SS: I guess, I can go first on that...

LD: I like it.

SS: When you mentioned this question, the first thing I thought about is what I kind of explain to family members or other people about what I do and what Ascend does. I kind of explain it such as like, you pick up data from one place, do stuff to it, and then put it somewhere else. So when I think of data transformation, it's the do stuff to the data...

LD: I like it.

SS: That usually works in helping them understand. They don't ask a lot of follow-up questions after that. They're just like, "Okay, cool. That makes sense, thanks. You want some dinner?" But...

SS: If I think about what data transformation is from what I've seen, it's really just... You have your data coming in, in whatever form it is, you've ingested it, you've eaten your vegetables, if... It's a reference to the last podcast which I listened to...

LD: Yes.

SS: But once you... Then you kind of have an end goal for this data or you have some sort of use case for this data, and it's really taking data on that journey of point A to point B, of turning it from this... Not necessarily unusable, but like this format that you don't really need it in or you... That doesn't really work for you and transforming it into something that does, which is, it's in the name, data transformation. That's a very high level explanation, I think.

LD: Works for me. Sean?

SK: Yeah. I mean, I think... I love the simplicity of...

LD: Think that?

SK: Probably not, but I can definitely throw words with the best of them. I think the simplicity of that is really quite important, which transformation is the all-encompassing, whether it's ETL or ELT, it is the act of doing something with your data. And there's the incremental step towards making it more viable and more usable. Towards what end goal obviously depends based on your particular business use cases, but it is very much the... I would contend where we apply our intellect as humans, where we take the goal of what we want to accomplish, and work back towards what we have with the data, and start to incrementally move it closer to that end goal.

LD: So let's dive a little further down into maybe the 201 level, which is... And I think you both touched on this a little bit. Why is data transformation important in the grand scheme of things? So if companies have data coming in, just to ask the... Just to be really blunt about the question, if companies have data coming in, why aren't they only ingesting the data that they want? Why would they need to transform it? If they have the data coming in, surely that's the data that they want. So why isn't that just the data that they're using? Why would they need to transform it? Let's... Why aren't they just using their data? This question's for my parents, who don't understand what we do.

SS: I think I'll go first again on that, but I think more often than not, anyone ingesting data is gonna be ingesting it from a place where they have no control over how it's formatted. Whether you're pulling data from an API or you're getting data from a different source, or even just another source within your company that's formatted in a specific way, sitting in a blob store somewhere, more often than not, it's not gonna be in the format you need it to be. So using that data right out the gate oftentimes isn't possible and you do need to do some transformation on it, whether that's just filtering out bad data, doing any data quality checks, doing some aggregation on the data downstream, but it's more so just... Like we mentioned before, it's getting that data to a place where you can work... Where it works for your end goals, and data sources are really varied in how they work.

SS: Let's say you're working with an API and you're getting JSON data, if you wanna pipe that downstream into like a warehouse, you're gonna need it in some sort of a CSV format, something more columnar. It's really about making sure that the data you're getting works for your needs and understanding why this data is important to you, and also making sure it works, and is really just actually useful for your use cases and businesses.
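[Editor's note, not part of the conversation: a minimal sketch of the kind of JSON-to-tabular reshaping Shayaan describes, using only the Python standard library. The sample response, the field names, and the dotted column naming are assumptions made for illustration, not anything taken from the episode or from Ascend.]

```python
import csv
import io
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into a single level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def json_to_csv(json_text):
    """Convert a JSON array of objects into CSV text, one column per field."""
    rows = [flatten(r) for r in json.loads(json_text)]
    # Use the union of all keys so records with missing fields still fit.
    columns = sorted({key for row in rows for key in row})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing fields are written as empty cells
    return out.getvalue()


# Hypothetical API response, invented for illustration.
api_response = (
    '[{"id": 1, "user": {"name": "ada", "plan": "pro"}},'
    ' {"id": 2, "user": {"name": "lin"}}]'
)
print(json_to_csv(api_response))
```

Taking the union of keys is one possible design choice; a real pipeline might instead enforce a fixed schema and reject records that do not match it.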

SK: Yeah, before the podcast, we were all even talking about some things we were doing earlier today, and I can certainly share... And I think a topical example: this morning, I found myself waking up very early and building data pipelines, obviously in Ascend, but I think there's a couple of very interesting examples of data transformation even in just what I would consider to be the fairly straightforward and simple style of data pipelines that touch on a lot of what Shayaan is highlighting too.

SK: For example, I'm ingesting a bunch of data to create a pipeline around the user activity in the Ascend product. And first, this data is coming from log streams, from different deployments of Ascend, gosh, across all three clouds, all sorts of continents and different regions and zones, and it's collecting all this data. But the data actually comes in in a different format than we generally like to work with in big data systems. The data is coming in a log format style. It has some JSON-structured objects inside of what is a larger log format, but it's largely just text data that's coming in and it is semi-ish structured, as is often the case when you are coming from what would look like more of a traditional backend system that's just logging user activity.

SK: As Shayaan highlights, different systems speak different languages or have different structures. So a backend system is generally just adding in log line after log line after log line. But once we start to do what are more analytical style and warehouse style operations on that data, we want to take it from the JSON, gzip compressed files into more columnar structures. We want it to be in Parquet files where we can do really rapid analysis on that data in a structured format with big data systems. So as we ingest that data, we then transform that data. We don't even necessarily modify a lot of it. I definitely didn't in my first step, but instead, I'm simply saying, "Take a bunch of this JSON data and convert it over here. Decompress it, obviously, convert it over here and then put it in a columnar format so I can do really efficient queries on that data based off of specific columns." That's really the first step.
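[Editor's note, not part of the conversation: a rough sketch of the "decompress, parse, pivot to columns" step Sean describes. It stays within the Python standard library and stops at an in-memory dictionary of columns; actually writing Parquet files would need a separate library such as pyarrow, which is not shown. The log content and field names are invented.]

```python
import gzip
import json


def log_lines_to_columns(compressed_bytes):
    """Decompress gzip'd JSON lines and pivot the records into columns."""
    text = gzip.decompress(compressed_bytes).decode("utf-8")
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    field_names = sorted({name for rec in records for name in rec})
    # One list per field; missing values become None so columns stay aligned.
    return {name: [rec.get(name) for rec in records] for name in field_names}


# Hypothetical log content, invented for illustration.
raw = (
    b'{"user": "ada", "event": "create"}\n'
    b'{"user": "lin", "event": "view", "ms": 12}\n'
)
columns = log_lines_to_columns(gzip.compress(raw))
print(columns["event"])
```

Holding each field as its own list is what lets a query touch only the columns it needs, which is the property Sean is after with a columnar format.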

SK: Another example, I'll give two, that we do is in those incremental improvements. This morning, one of the things I was looking at was not just individual activity, but specific builder activity, users who are taking specific types of actions to create, update different data sets. And so another really common example of a transformation may be to filter out data that you're not interested in. So not only did I first transform that data and get it into my system into the right format, but then I can say, "Hey, I'm specifically looking for people who did create, update, delete style operations, and all the other styles of events I'm not as interested in." So that's another kind of data transformation that hones and refines the data set that I'm working on. Those are a couple of, I think, fairly straightforward examples and two that are probably common across a lot of data teams.
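[Editor's note, not part of the conversation: the filtering step Sean mentions could look something like the sketch below. The event shape and the action names are assumptions chosen for illustration.]

```python
BUILDER_ACTIONS = {"create", "update", "delete"}  # assumed event names


def builder_events(events):
    """Keep only events whose action is a create, update, or delete."""
    return [e for e in events if e.get("action") in BUILDER_ACTIONS]


# Hypothetical activity records, invented for illustration.
events = [
    {"user": "ada", "action": "create"},
    {"user": "lin", "action": "view"},
    {"user": "ada", "action": "delete"},
]
print(builder_events(events))
```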

LD: So riddle me this. Are there different kinds of transformations? So we've talked about the different, the reasons why. So data can come in in different ways and you gotta get it into a different format, which goes back to, as Shayaan referenced, you gotta eat your vegetables, which means for those of you who haven't listened, go listen to our Ingest podcast, 'cause that's ingesting or eating your vegetables and you gotta ingest your data. So you gotta do that and it all comes in in different ways. You gotta transform it so that you can get it in a structure that you need it. Totally makes sense.

LD: Are there different kinds of transformations? And I can imagine that maybe different kinds can mean different programming languages. So you're writing maybe a transformation in SQL or you're writing a transformation in Python, or you're writing a transformation in, I don't know, Java or maybe there's some other different way of thinking about transformations that I'm not thinking of. But is there just one set kind of transformation that everybody focuses on, or are there different ways people need to be thinking about data transformation? Sean, you go first this time.

SS: Oh. [chuckle]

LD: Oh well, Shayaan can go first. He keeps jumping in and I appreciate that.

[laughter]

SK: I love the enthusiasm.

LD: 'Cause nobody else does, they all make Sean go first. So now, I'm gonna make Sean answer first, but no, Shayaan, you certainly can go first.

SS: It's okay, I'll pass it back to Sean.

[laughter]

SK: So there's a number of different kinds of data transformation that happen, and to put a more specific list or set of kinds of transformations that we see, you see, oftentimes, actually as part of a data pipeline, you'll flow through a handful of different pipes. For example, your early data transformations are going to be data extraction and parsing of data. In the previous example, I was mentioning you're reformatting it, you're extracting certain fields out of the JSON blobs that you find interesting. For example, you may be parsing and looking for a specific value as part of a larger blob of data. After you go through those early stages, to get a little bit more structure to your data, you oftentimes then get into a handful of refinements of that data.

SK: You may end up filtering out different pieces of data, you may end up mapping different fields and values. For example, there may be a really wide array of different activities or scores a user could have, but really all you're looking for is, "I want to bucket my users into one of three categories: low, medium, high activity, for example," and so you're mapping fields into specific values and these are those incremental refinements that you'll see. You'll also see enrichment of data as well, where you may say, "Hey, I have a bunch of my user activity, but I don't know very much about the user in that activity stream, but I have a bunch of user metadata sitting over in a different data set, let me enrich this data by looking up that user data as part of that other data set to build a more detailed view of that user." So that's actually adding more and combining data together.
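[Editor's note, not part of the conversation: a small sketch of the two refinements Sean just described, mapping a score into low, medium, or high and enriching activity rows with user metadata from a second data set. The thresholds, field names, and sample data are all assumptions.]

```python
def activity_bucket(score, low_max=10, high_min=50):
    """Map a numeric activity score to low, medium, or high (assumed thresholds)."""
    if score <= low_max:
        return "low"
    if score >= high_min:
        return "high"
    return "medium"


def enrich(activity_rows, user_metadata):
    """Attach user metadata and an activity bucket to each activity row."""
    enriched = []
    for row in activity_rows:
        # Look the user up in the other data set; unknown users get no extras.
        extra = user_metadata.get(row["user_id"], {})
        enriched.append({**row, **extra, "bucket": activity_bucket(row["score"])})
    return enriched


# Hypothetical data sets, invented for illustration.
activity = [{"user_id": 1, "score": 3}, {"user_id": 2, "score": 75}]
metadata = {1: {"plan": "free"}, 2: {"plan": "pro"}}
print(enrich(activity, metadata))
```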

SK: The next stage that we start to see a lot of users go through is what I would call the cross-record correlation, sometimes, it's aggregations, analytical style operations like count how many users at the particular time did X, Y or Z. We'll also see correlation of events. For example, this morning I was sessionizing user activity. I want to know, "Are these distinct user sessions or are they the same user session?" And so you need to correlate one record, one activity of a user compared to the next, the previous or the following activity to see how long of a time gap there was in between those two. So, the transformation that happens there is actually in ordering, so that you can get a cluster of events nearby each other. So that's where you'll see, the further down in the transformation logic you get, the more transformations you provide. You kind of go down this path from just basic restructuring and reformatting to enriching and expanding the data, and eventually then to either correlation of that data or for summarization or aggregation of that data to provide some structure.
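[Editor's note, not part of the conversation: one way to sketch the sessionization Sean describes is to order each user's events by time and start a new session whenever the gap to the previous event is too long. The 30-minute gap and the event fields are assumptions, not values from the episode.]

```python
from datetime import datetime, timedelta


def sessionize(events, gap=timedelta(minutes=30)):
    """Number each user's sessions; a new session starts after a long gap."""
    # Ordering is the key transformation: it puts related events side by side.
    ordered = sorted(events, key=lambda e: (e["user"], e["ts"]))
    result = []
    last_seen = {}   # user -> timestamp of that user's previous event
    session_no = {}  # user -> current session counter
    for event in ordered:
        user, ts = event["user"], event["ts"]
        if user not in last_seen or ts - last_seen[user] > gap:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        result.append({**event, "session": session_no[user]})
    return result


# Hypothetical events, invented for illustration.
events = [
    {"user": "ada", "ts": datetime(2021, 1, 1, 9, 0)},
    {"user": "ada", "ts": datetime(2021, 1, 1, 9, 10)},
    {"user": "ada", "ts": datetime(2021, 1, 1, 11, 0)},
]
print([e["session"] for e in sessionize(events)])
```

In SQL the same idea is usually expressed with window functions over a per-user ordering; the plain-Python loop is used here only to keep the sketch self-contained.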

SS: When I look at the use cases that customers have and what jumps out to me in terms of transformation, it's a lot of what Sean talked about, it's those two very distinct steps of first filtering, enriching all your data and then actually doing some sort of aggregation-type correlations. And something interesting I've noticed, and that just came to mind, was when the enriching also turns into its own form of ingest, which I've noticed a couple of our customers do, which I didn't realize was a big use case, but it's really that while you're in the process of transforming data, you're actually ingesting more data, AKA calling an API for that specific record or batch of records to ingest it. We'll have customers take some data.

SS: And in order to enrich it, they'll batch it, send it through an API and then get some sort of... Either that API is, has some... Throwing some value back or it's actually a machine learning endpoint that is running analysis on that data. So for me, that's kind of a really interesting use case that I had never hit before coming to Ascend which was... It's transformation and it's enrichment of data, but you're actually also... It's a form of ingesting data. So I think when it comes to data transformations, there is sort of like a breadth of things you can do, it's a very, very wide topic of what you can actually accomplish and what you can actually do with your data.
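[Editor's note, not part of the conversation: a minimal sketch of the batch-and-call-an-API enrichment pattern Shayaan describes. The remote call is replaced by a local stand-in function so the example runs on its own; `fake_scoring_api`, the batch size, and the record fields are hypothetical.]

```python
def batches(records, size):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def fake_scoring_api(batch):
    """Stand-in for a remote endpoint; returns one score per record."""
    return [len(rec["text"]) for rec in batch]


def enrich_with_api(records, call_api, batch_size=2):
    """Send records to the API in batches and attach each returned value."""
    enriched = []
    for batch in batches(records, batch_size):
        scores = call_api(batch)  # one call per batch, not per record
        for rec, score in zip(batch, scores):
            enriched.append({**rec, "score": score})
    return enriched


# Hypothetical records, invented for illustration.
records = [{"text": "a"}, {"text": "bb"}, {"text": "ccc"}]
print(enrich_with_api(records, fake_scoring_api))
```

Passing the API call in as a parameter keeps the batching logic testable without a network, and lets the same code front a lookup service or a machine learning endpoint.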

LD: So, is there a standard process that folks should think about when it comes to transformations or a standard process they should think about not using? Full disclosure, part of some of the pipelines Sean was working on this morning were things that I had asked for, for marketing stuff, or at least I think that's what... I hoped that's what they were and...

SK: We'll see soon enough.

LD: We'll see soon enough right, exactly. And I think that it's not unusual, it's not unusual that marketing might need data. Hopefully, marketing needs data, data should be data, marketing should be data-driven. So I think what typically happens is marketing says, "Hey, I need this data," like I did, and probably what happens because, if you have somebody within marketing who is a marketing analytics person who can actually write the transformations themselves, then, well more power to you, but most marketing teams probably don't. So what probably happens in my brain, at least, I think after having worked here for a while, worked in this industry for a while, is probably marketing goes to data eng and says, "Hey, I need this data, I need something...

LD: "Get me this data in this format and I need it in this BI tool. I need it from this place in this BI tool." And data eng goes, "Okay," and then they just do their thing and they write the transformations and then marketing wants that. And I'm like, "Nyeah, it's not exactly what I need, I need to do it this way." And there's some massaging of it and there's some different works and they have to kinda go back and forth and maybe the transformations have to change a little bit. So in my mind, maybe that's a little bit of how it works. So is there a kind of best practice for streamlining that process to ensure that you're not kind of iterating constantly on what those transformations are, or is that just kind of by nature what it's gonna be?

SS: I can start here, too.

SK: Go for it, Shayaan.

LD: Yeah, take it.

SS: I think a lot of what you've said is what I've seen a lot of, which is that you have either externally or the data eng team has its own goals, and it's kind of starting with understanding what that is and what do you need your data to look like regardless of where it's coming from. So it's really about that and understanding what you want. And then from that, it kind of goes into, okay, you know what you want the data to look like. Now you start with, "Okay, what am I dealing with? What do I have?" So really taking that time to understand the source and what does the source data look like. Once you have those two kind of understood in the back of your mind, it's like, "Okay. Here's my point A and here's my point B." That's when you can kind of start to understand what that... The actual transformation itself looks like.

SS: So in the previous step of understanding your data, when you're playing around with the data, either through like querying it or just trying to understand what the data looks like, you do kind of start to transform it as you're going. Querying, kind of, is just another form of transformation if you are just running SQL against your source data. So I think with that next step, it's really just breaking that down into bite-sized pieces of transformations, like... First, I'm gonna do the enrich stuff 'cause I know I need the data to filter out these certain values. Let's say, if you wanna start with filtering out null values and then you wanna go ahead... Like Sean had mentioned previously, you wanna go ahead next and maybe do some enrichment with your data.

SS: So I think breaking down... From what I've seen, breaking down your transformations into these kind of logical components makes sense, because as you get to the point where you need to maintain these sort of transformation pipelines, which usually you do, 'cause use cases change and people need to find, "Okay, where in that pipeline do I need things to change?" It makes it easier to kind of pinpoint, "Okay, this is where the logic breaks down," or "this is where the logic needs to change." Once you have those steps kind of threaded together and you have the transform built, you do need to start thinking about how iterative is this. Like does this... If someone else were to come in and needs to make a change to this pipeline, will they be able to? And that kind of applies with any sort of code that you write. Is there a streamlined process for me to make changes to this pipeline? To make... If someone else were to come in and do the transformations, could they do it? And could they understand what I'm trying to do with this data?
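[Editor's note, not part of the conversation: a sketch of what "breaking transformations into logical components" might look like in code, with each step as a small function that can be read, tested, or swapped on its own. The step names, the lookup table, and the sample rows are invented for illustration.]

```python
def drop_nulls(rows):
    """Step 1: remove rows where the amount is missing."""
    return [r for r in rows if r.get("amount") is not None]


def add_region(rows):
    """Step 2: enrich each row with a region from a small lookup table."""
    regions = {"US": "amer", "DE": "emea"}  # hypothetical lookup
    return [{**r, "region": regions.get(r["country"], "other")} for r in rows]


def total_by_region(rows):
    """Step 3: aggregate amounts per region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals


def run_pipeline(rows, steps):
    """Apply each step in order, feeding one step's output to the next."""
    for step in steps:
        rows = step(rows)
    return rows


# Hypothetical input rows, invented for illustration.
rows = [
    {"country": "US", "amount": 5},
    {"country": "DE", "amount": None},
    {"country": "US", "amount": 7},
]
print(run_pipeline(rows, [drop_nulls, add_region, total_by_region]))
```

When a use case changes, the person maintaining this only needs to find the one step whose logic is affected, which is the maintainability point Shayaan is making.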

SS: And then I think finally, after that, the biggest thing is, if you are trying to handle the infrastructure side of things of transformations, understanding the efficiency of what you're doing with... Whether it's a SQL query you're running or a Python script that you're running, are you making smart choices with how you're querying your data? If you're doing some joins, are you... Does the join make sense for your use case or are you taking like billions of records and joining against another billion records along a single cluster and you're getting "out of memory" errors? I think that's the final kind of step of that is, is this scalable? Is what I'm doing... Does it make sense and is it scalable?
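[Editor's note, not part of the conversation: one illustration of a memory-conscious join, offered as a sketch rather than a recommendation from the speakers. It assumes one side of the join is small enough to hold in memory and that its keys are unique; the large side is streamed one row at a time instead of being loaded whole. Distributed engines do something comparable with broadcast joins, but nothing here is specific to any engine.]

```python
def streaming_hash_join(big_rows, small_rows, key):
    """Inner-join a large iterable against a small table without loading the large side."""
    # Only the small side is held in memory (assumes unique keys on that side).
    lookup = {row[key]: row for row in small_rows}
    # The large side is consumed lazily, one row at a time.
    for row in big_rows:
        match = lookup.get(row[key])
        if match is not None:
            yield {**match, **row}


# Hypothetical data, invented for illustration.
small = [{"user_id": 1, "plan": "pro"}]
big = ({"user_id": i % 2, "n": i} for i in range(4))  # a generator, never fully in memory
joined = list(streaming_hash_join(big, small, "user_id"))
print(joined)
```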

SK: Yeah, I'd agree. I think one of the things that you said, Shayaan, that, it reminded me... Gosh, this is probably all the way back, I don't know, maybe sophomore year in college, which I think at this point is about 20 years ago now, so I'm just excited that I remembered something from that long ago. But I look at planning of the data transformation process and the actual execution of it, not the technical processing itself, but the execution as a data engineer of that process, as a bidirectional search. For those who don't remember that algorithms class or didn't take it or just didn't pay attention in that class, bidirectional search is an efficient algorithm as to how to find the shortest path between two points in a graph. Not a processing graph, but a conceptual graph of... On the left, you have a bunch of raw data, and on the right you have business needs and use cases.
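[Editor's note, not part of the conversation: for readers who want a reminder of the algorithm Sean is using as a metaphor, here is a minimal sketch of bidirectional breadth-first search on an unweighted, undirected graph. It grows a frontier from each end, one level at a time, and stops when the two meet. The graph and its node names are invented to echo the raw-data-to-use-case picture; Sean's point is about planning, not about running this code on a pipeline.]

```python
def bidirectional_search(graph, start, goal):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if start == goal:
        return 0
    dist_a, dist_b = {start: 0}, {goal: 0}
    frontier_a, frontier_b = [start], [goal]
    while frontier_a and frontier_b:
        # Always grow the smaller frontier, one full level at a time.
        if len(frontier_a) > len(frontier_b):
            frontier_a, frontier_b = frontier_b, frontier_a
            dist_a, dist_b = dist_b, dist_a
        next_frontier = []
        for node in frontier_a:
            for neighbor in graph.get(node, ()):
                if neighbor in dist_b:  # the two searches have met
                    return dist_a[node] + 1 + dist_b[neighbor]
                if neighbor not in dist_a:
                    dist_a[neighbor] = dist_a[node] + 1
                    next_frontier.append(neighbor)
        frontier_a = next_frontier
    return None


# Hypothetical conceptual graph: raw data on one side, a report on the other.
graph = {
    "raw": ["parsed", "users"],
    "parsed": ["raw", "filtered"],
    "filtered": ["parsed", "sessions"],
    "sessions": ["filtered", "report"],
    "users": ["raw"],
    "report": ["sessions"],
}
print(bidirectional_search(graph, "raw", "report"))
```

Each side only has to explore roughly half the depth of the path, which is why searching from both ends tends to visit far fewer nodes than searching from one.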

SK: And the reason why I described this bidirectional search path, it's... How do you actually traverse from both sides towards the middle efficiently? And the reason why this matters so much is, I think historically, we have seen teams operate from just one perspective or the other. One is the business team tossing over requirements to the data engineering team, saying, "Hey, just get me this stuff," as you were jokingly describing, "Just get me this stuff in a BI tool of this format and we'll be good." And then the data engineer can go in and build it. Or conversely, the data engineering team just sort of looking at their data being like, "Well, we got a bunch of stuff. Maybe we can join some of these things and refine it a little bit and create a new, 'more valuable' data set and maybe somebody would use it." And in that case, you end up with a little bit more of a Field of Dreams strategy, except where nobody actually comes in that case. And so the really valuable part, and Shayaan had touched on this too, is as you begin to implement and roll out how you're processing and transforming that data, continuing to validate and understand the use cases: why are people looking for this particular data set?

SK: What are they trying to extract? Understanding the importance of the various elements of the data, so that you can then start to extrapolate how they may want to be able to use it. Once you answer this particular question, what is the next natural follow-on question, and if so, how do you make sure that you have the appropriate data sets prepared such that you can build on top of that very efficiently? I think that's really the next level on that panacea of data engineering is to be able to connect the raw data to the actual business use case, but to understand the full context. To that end, when I think about the process of planning out how you transform data, it has historically been much more of a waterfall style process, where you get some business use cases, and then you get on a really big whiteboard, or in these days, a virtual whiteboard and you start to do this huge mapping process. I have all this data and these fields conceptually do X, Y, or Z and you end up with these huge diagrams of basically field mappings and a lot of conceptual.

SK: And this is something that we are really passionate about. I know I started to look at how do we break that waterfall pattern and paradigm and instead make change, and more importantly, even iteration, very cheap and easy and streamlined such that as you go and build new data pipelines, you can rapidly, on the order of minutes, crawl through exploring the data first, mapping out the fields, prototyping a query, shoving it off to the processing cluster to actually go and run the transformations itself, and then validating that data and incrementally move forward, just each hop closer to that business use case as quickly as possible. So I think that's something that we will see happen more and more as ETL embraces more modern engineering-style paradigms, which is getting away from that waterfall plan process around your data transformations and just rapid iteration that helps you explore that space faster and get to the target data sets sooner.

LD: So what you're telling me is, it's not always that "if you build it, they will come" scenario, is what I'm hearing. It's not the Field of Dreams. It's... Got it.

[laughter]

SK: Yeah, that's the... We do see with a lot of these teams, we work with tons of companies, and the classic consternation has always been from the engineering IT side, the, "Hey, we gave everybody all this access to data. They just... They don't understand that creating this piece of data set, it's hard, but we give them all these other things." And then on the business side, people are saying, "But the data they gave me isn't the one that's valuable, all I want is this other column," and it ends up being this impedance mismatch between the two. And that's really where that value of being able to understand both the data and how it applies to the business and bringing your teams to be able to iterate so quickly on that transformation logic is really, I think, the key.

LD: So the other thing you're telling me is that teams and job postings for data engineers need to start adding mind reader to one of the data requirements bullets. That's what I'm hearing.

SK: I do think it is a very valuable skill set to understand the problem, not just the data or the technology that moves the data, but why the data matters. I think that is going to be the increasingly valuable skill set in our industry.

LD: Makes sense.

SS: I've seen that a lot, where the people I've seen who are successful in any sort of data role have really good contextual knowledge of what they're actually working with. So, you hear software engineers who started off in a different field and then came into... And this is more of a meta conversation, so not really [chuckle] not transformation, so feel free to reel me in. But they already know the... They already know the industry and the specifics of that industry, let's say it's healthcare, and then they're like, "Oh, I wanna do more of software engineering." So they become a self-taught software engineer, but they already have that really great contextual knowledge about what kind of data you're dealing with in healthcare.

SS: So when it comes to building some sort of ETL pipeline for healthcare specifically, that contextual knowledge that they have really helps them on both ends. It helps them on the understanding what that initial data looks like 'cause that's stuff they've dealt with. And then when you know that industry, you also know what... You have a better understanding of what your business use cases are. So I think having... Even if you're not part of that industry, having contextual knowledge, and that's something I've worked on trying to... With all of our different customers in different industries, is really understanding contextually the bigger picture of, what is this industry? What are they trying to do? And I think that helps a lot with any sort of data engineering project.

LD: Yeah. I think you're exactly right. I was... My next thing was gonna be to make some joke about the fact that Sean was able to build the pipelines that I asked for this morning, probably without asking me a lot of questions, because he's at this point has the ability to read my mind, but in reality, it's probably because he's... We talk so much and he's so very much involved in the project in which we are building the pipelines for, that when I... When we talked about what we needed, he immediately understood what it was we were going for, and it's like, "Okay, I got it. I know what we're doing. Cool. Got it. Done." And it's not like it's somebody who's five steps removed from what marketing is doing. It's somebody who's zero steps removed from what we're doing and therefore it's easy to do. I can only imagine how difficult it is if you're looking there and to use the demo data that we use, which is taxi cab data and weather data just to think about that, if you, for instance, are trying to say, I'm trying to figure out how few people...

LD: If what you were trying to do is go run a special on cab fares to make them cheaper on days where it's sunny, because you know that people aren't taking cabs when it's sunny. But what the transformation is giving you is how many people rode cabs on days when it was raining, because that's what the transformation gave you because you were so... Data engineering is so far removed from what, again, marketing is doing, just using that because it's where I am, it's because there's a communication problem there,
