Diving Into Data Engineering with Sheel Choksi [Podcast]

In the latest episode of the DataAware podcast, Sean and I chatted with our own Sheel Choksi about his path to data engineering and what he’s learned along the way. Learn more about the fine art of data hoarding as well as what to avoid when building out data orchestration systems—and how to avoid it—in this episode of DataAware.

Transcript

Leslie Denson: If there’s one thing that always seems to be a little different, but also fairly similar, it’s how data engineers become, well, data engineers. I had the chance to chat with Sean and Sheel Choksi about Sheel’s path to data engineering, the fine art of data hoarding, and how that impacts the interplay between different members of data teams, in this episode of DataAware, a podcast about all things data engineering.

LD: Hey everybody, and welcome to another episode of the DataAware podcast. I am joined once again by Sean Knapp. Hey Sean, how’s it going?

Sean Knapp: It’s going great. Hey, everybody.

LD: Good, good, good. We are super excited, which you will hear me say a lot because I actually get very excited about all of these episodes, but today’s very exciting because we have our own Sheel Choksi, who’s one of our solutions architects here at Ascend, on the line to talk with us a little bit about his quite interesting history in data engineering and, actually, as a user of Ascend, as somebody who has been doing the work that we’re talking about. So we have Sheel on the line to chat with us today. Hey, Sheel, how’s it going with you?

Sheel Choksi: Hey, Leslie. Yeah, it’s going great. I’m getting ready for the holidays, and hey, Sean. Good to see you.

SK: Hey, Sheel.

LD: Awesome. I just gave a very high level and not the most in-depth or probably, descriptive overview of kind of your history and your background. We always love chatting with Sheel because Sheel actually came to us from being a customer. So Sheel, why don’t you tell us a little bit about your background with Maven, kind of what got you into the data engineering space and just what your history is. That’s not a broad question at all, but we’ll just go with it.

SC: No, I think I can make it work. I’ll try not to go back to childhood memories, but let’s see. As far as my history goes, I started in the software engineering space, and actually, I started at a company called Pivotal Labs, which is a software consultancy working with a lot of startups as well as enterprises. It was just a great opportunity to work with awesome people on anything from front-end to back-end systems. We had a product, Cloud Foundry, so working on… This is even before some of the Docker days, working on containerization to run people’s code on Cloud Foundry, and so I really got a chance to work on a gamut of things over the years. And that’s actually how I got introduced to Maven, which was a startup. And I kinda wanted to take the plunge for myself and see what it was like to work in a startup, and as I’m sure you all have heard before, a startup is always a variety of experiences, often not even related to technology, where you’re still also trying to figure out like, “Hey, how do I get Internet into this office for everybody to use?” It’s like, “Well, that’s not really gonna be software engineering, but someone’s gotta figure it out and you’re just gonna get it done.”

LD: You wear the many hats.

SC: Wear many hats, yes, exactly. And so in going through the startup experience of joining early and building out teams, some of the teams that I built out were first software engineering, then product, and then finally data analytics. That’s how I started getting closer to the data engineering space, not through an intentional job change of, “I want to be a data engineer,” but more, okay, software teams have a lot of data that they’re working with and generating, clickstream-type tracking, relational stores of orders and users and things like this, and the data and analytics teams are trying to use that data and unlock insights, and so it turns out the bridge there is data engineering.

LD: I don’t know that you fell into it, but you kinda fell into it a little bit, is what it sounds like. What made you decide, “I wanna continue doing this,” instead of running away screaming with your hair on fire?

SC: Yeah. It might partly come from… At Maven, we were primarily working in this programming language called Clojure, which is a little bit more of a niche language, but one of the things that Clojure really talks about is, all your code is data, all your data is code. Really, the point of that sentiment is that the more we can bring down into the data layer, even in a programming language sense, the more flexibility we have, the more structure there is, the more isomorphic, let’s say, everything can become. And when you start thinking about it that way, a lot of things start looking like data problems and rules engines, and how can we abstract away what we’re really writing in this imperative thing into more of a code base/data engine. And so when you start doing that for, let’s say, five or six years, everything does really start looking like a data problem. You’re like, “Oh, we need to sync CRM profiles to this email system?” That’s basically just a data problem. Or, “Hey, we need to compute order totals on orders?”

SC: Yes, it’s like a bunch of imperative code to tally this all up, and add tax and shipping and figure that all out, but it is also kind of just a data problem of what we’re working on and how these all get applied on each other. After so long doing that, it was kind of like, well, the data layer is so foundational here and there’s so much you can do with it. Working with tools that are optimized for data, not just the small amounts you might be dealing with in code, figuring out an order total, but working with these bigger sets of data, like in the big data space and in data engineering, and working with tools that are optimized for this, like Spark, and Athena to query it, or whatever your stack might look like, it just felt like a place that I wanted to spend my time on and focus.
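To make the "order totals are just a data problem" idea concrete, here is a minimal Python sketch in which both the order and the pricing rules are plain data, and one small function applies the rules. The field names, rule kinds, and the 10% tax rate are assumptions made up for illustration; they are not Maven's actual model, and the original system was written in Clojure rather than Python.

```python
from decimal import Decimal

# Hypothetical order, expressed purely as data.
order = {
    "items": [
        {"sku": "A-1", "unit_price": Decimal("20.00"), "qty": 2},
        {"sku": "B-7", "unit_price": Decimal("5.00"), "qty": 1},
    ],
}

# Pricing rules are data too: (label, kind, value). All values are invented.
RULES = [
    ("tax", "percent_of_subtotal", Decimal("0.10")),
    ("shipping", "flat", Decimal("4.50")),
]


def order_total(order, rules):
    """Apply each rule to the order and return a breakdown of the total."""
    subtotal = sum(item["unit_price"] * item["qty"] for item in order["items"])
    breakdown = {"subtotal": subtotal}
    for label, kind, value in rules:
        if kind == "percent_of_subtotal":
            breakdown[label] = (subtotal * value).quantize(Decimal("0.01"))
        elif kind == "flat":
            breakdown[label] = value
        else:
            raise ValueError(f"unknown rule kind: {kind}")
    # Sum the components collected so far, then record the grand total.
    breakdown["total"] = sum(breakdown.values())
    return breakdown


result = order_total(order, RULES)
```

Adding a new charge in this sketch means appending a row to `RULES` rather than editing the function, which is the flexibility the "code is data" sentiment points at.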

LD: It’s interesting that you say everything is a data problem, ’cause I think it’s something that I’ve seen a lot in my history. I’m certain Sean’s seen a lot of it in his history, considering what we’re all doing these days. Talk to me a little bit about… Again, you kind of came into it, I think the way that probably a lot of data engineers have come into it, which is from some sort of the software engineering side and then had to fall into it because everything kind of, is a data problem at some point. So how have you seen the industry shift in the last few years or what have been some of the more interesting shifts in the industry that you have seen that have either made you scratch your head or made you go, “Okay, yeah, that makes a lot of sense, I can see why we’re moving in that direction?”

SC: Yeah. So, I credit a lot to the experience back in the earlier days of cloud migration, really. When we were first standing up EC2 instances, there wasn’t even much to be said about VPCs; the most you would do is an IP whitelist, let’s say. But what was starting to become very clear at that point was that even just with S3 and being able to store things, it got so much easier to just hang on to things. And I look at that and look at 10 years past that, where even companies of relatively moderate size are now sitting on just troves of things that they’ve stored, ’cause it’s been so easy. And I like to think of it as, we’ve all become kinda data hoarders, where we’re just like, “Oh, just put it in S3,” which is fantastic. Or I say S3, but similarly Azure Blob Storage, Google Cloud Storage, etcetera. It’s so scalable and so easy to just hang on to it.

SC: And so I’ve seen that trend definitely increase, where you even look at some start-ups these days and you meet them and they’re like, “Oh yeah, we collect 10 terabytes of data a day,” and you’re like, “Oh, okay. Between three engineers. Cool, that’s pretty awesome.” And that part has become really easy, but then when you look at these industry trends of, well, then what? What do you do with that? Are you processing it, or are you trying to load it to a warehouse? Are you trying to derive value from it? Is it part of your product? That’s where all these challenges still are, and we haven’t really seen that part become easy. When you talk to this three-person start-up who’s collecting 10 terabytes a day, you’re like, “Then what?” And they’re like, “Yeah, then what? We try to query it with Athena, the response times are kind of slow. We try to wrangle it, schema change is hard. We’re trying to hire data engineers, turns out there’s not a lot of data engineers compared to more general software engineers.” Maybe they’re messing around with Airflow and Spark and it’s just not like… It’s just not nearly as easy as it was to actually collect all of that data.

SK: It’s actually really interesting, ’cause one of the things that you’re touching on, Sheel, is something we’ve talked about in the past, too, which is, a lot of the foundational problems have been or are currently being solved. So, for example, how do you store insane amounts of data? And in many ways, we’re also now seeing a lot of people who are solving how to process insane amounts of data, whether it’s with Spark or Kafka or you name it. But then it becomes this question of, what next? And what do you do with it? And one of the things you touched on was that notion that a lot of people have software engineers, but they may not have a bunch of data engineers. There are still a lot more software engineers out there than there are data engineers. And we oftentimes find people coming into the data engineering space from a few different directions.

SK: Some coming in from the software engineering space, some coming in from the infrastructure space, and some actually coming in from the data science or data analytics space. How would you describe the differences in perspective, the world view that those different folks have? And what are the strengths that they can bring to it? And even against both the backdrop that you had at Maven, where you guys, I think, had a really strong software engineering DNA, and perhaps some of the other groups, I don’t think, interacted as much there at those levels, but then also what you’re seeing from all the customers that you work with at Ascend. Where are those centers of gravity, and what are the strengths that people can really play to in order to make the most of their data engineering investments?

SC: Yeah, absolutely. So, I think starting with maybe some of the personas that you described, if you will. So, folks who have strong software engineering backgrounds and are helping to build these pipelines, are working now as data engineers, versus data engineers who’ve now specialized in this area; however they’ve done that, versus also data analysts and data scientists who are getting into this. So, starting with the software engineers, which of course is most key to my heart and my background. The skill set there is absolutely learning new frameworks, paradigms, languages, all of the stuff comes fairly easy, I think. You work between different languages and different frameworks all the time, with some examples, some good documentation, you’re gonna figure out what you need to figure out and build it and move on.

SC: And what I’ve noticed with these groups is that for a lot of businesses, software engineers are often ported around into different projects or areas depending on what makes sense. And so a lot of times when it comes to these data engineering pipelines, software engineers are called in when the complexity has increased past what the current skill set of the team is. Maybe it’s a lot of data analysts working on a warehouse or something like this. It’s like, “Okay, we need more pipelines, we need to do more, maybe the software engineers can help with this, they’re fairly chameleon-like, they can jump into any situation.”

SC: So, they come with that really strong skill set and mostly wanna get the job done and done right. So, they wanna build something robust that will run on its own, using whatever is the right language for the job, the right frameworks for the job. And then ideally keep it low maintenance, fairly automated, and a fairly great pipeline. So, I think that’s what we see as a lot of the priorities of those folks. Sean, you work with a lot of software engineers as well. Does that sound kind of like what you’ve been seeing as well?

SK: Yeah, I think so, and I think the… We definitely see with a lot of the software engineering teams, oftentimes they’re dropping in and trying to create something that is super low maintenance, super high automation factor. Actually, Sheel, you and I were even on a call yesterday with somebody who was talking about a pattern that we oftentimes see, which is, you have a data science team that’s sort of thinking about everything from the tail end of the data life cycle, working their way back upstream. And a data engineering team or a software engineering team may be working their way from the top end of the data life cycle, kind of falling down, and part of the challenge was getting them to meet in the middle. And I think we see this a bunch, and it’s probably worthwhile even digging in a bit on that. There are a lot of software engineering teams who are trying to build product that is low maintenance burden, high leverage.

SK: But one of the big shifts that we also have seen is they’re trying to do that in a way where the data science and data analytics teams can actually self-serve on top of that, and trying to meet all the way in the middle becomes hard, right? Because you’re not necessarily gonna get a bunch of those other teams, the data science team, to go write really complex jobs with a ton of code on top of a platform that the engineering team builds, ’cause some of those tools are still pretty raw. And so what did you see? What did you all try at Maven? What do you see a lot of our customers, and frankly just other people out in the industry, doing as they’re trying to make this stuff work, and how does that play out?

SC: Yeah, definitely. And I think what you’re highlighting there is the hand-off points, almost, where one team is trying to create an interface that the other team can pick up on. You want that interface to be fairly open so that the other team can do a good amount of self-service without having to keep coming back to the prior team, but tight enough that they can deliver something consistent without having to constantly keep changing it. And so that hand-off point was always critical for us. Even thinking back to the Maven period, to put it in some kind of architecture terms, almost everything in the systems that software engineering built was a series of microservices coordinated by Kafka, as we see a lot of today. But that wasn’t necessarily where the data science and data analyst teams wanted to pick up. They wanted to pick up with files that were already persisted, so they didn’t have to worry about Kafka and reading in time before the data in the Kafka stream expired.

SC: They wanted to have enough flexibility to say how they wanted to build their data model, and not have to keep going back to the engineering team every time they wanted some tweak of the data model. And so that hand-off point is really what crafted this interface for us, where we wanted the data to be persisted, ready to be picked up in S3, or call it a data lake if you’d like, but with a relatively low level of transformation. That way, if they wanted a convenience field for the user type, for example, which is an amalgamation of three different columns combined together, they had a platform where the data analysts and data science team could create that definition from the raw data that was handed off. And so that was how we designed that interface. And I think, as we’re talking with other folks at Ascend who are interested in the platform, they have a similar sentiment, where they’re at various stages there. Maybe they’re still trying to define the interface and the hand-off point, what team is responsible for what part of it and what team is responsible for the other part.
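As a rough illustration of the "convenience field" hand-off described here, the sketch below derives a `user_type` label downstream from three raw columns, leaving the raw records themselves untouched. The column names (`is_employee`, `has_subscription`, `order_count`) and the four labels are hypothetical; the transcript does not say which three columns Maven actually combined.

```python
def derive_user_type(row):
    """Collapse three raw columns into one convenience label (hypothetical rules)."""
    if row["is_employee"]:
        return "internal"
    if row["has_subscription"]:
        return "subscriber"
    if row["order_count"] > 0:
        return "customer"
    return "prospect"


# Raw records as they might be handed off with minimal transformation.
raw_users = [
    {"user_id": 1, "is_employee": True, "has_subscription": False, "order_count": 0},
    {"user_id": 2, "is_employee": False, "has_subscription": True, "order_count": 3},
    {"user_id": 3, "is_employee": False, "has_subscription": False, "order_count": 2},
    {"user_id": 4, "is_employee": False, "has_subscription": False, "order_count": 0},
]

# The analytics side owns this definition and can change it without
# asking the upstream team to re-ship the raw data.
enriched = [{**row, "user_type": derive_user_type(row)} for row in raw_users]
```

Because the definition lives on the consuming side of the interface, changing what counts as a "customer" is a one-function change for the analysts rather than a request to the engineering team.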

SC: Maybe they’re trying to design it, or they have an interface that they’re trying to move a little bit. Maybe it’s a little too restrictive, ’cause maybe everything comes back to the data engineering team or the software engineering team, and they’re like, “Okay, maybe we need to move this interface a little bit upstream, make it a little bit more flexible.” And so I think depending on where folks are, it’s either defining or moving that interface. But I don’t think we’ve found a lot of folks who are like, “Yes, this is it. This is the perfect one.” And maybe there never will be, but I think folks are just calibrating and finding it so that their teams hit the ultimate objective: move faster and then do stuff with this data. It’s not just about moving it, it’s about actually either doing data science on it or business logic and coming up with analytics, to actually try to get to that end goal.

LD: So it’s interesting, I’ll jump in real quick. It’s interesting talking about Maven being mostly Kafka data, and the whole idea that the data scientists don’t necessarily care about streaming. In a previous life, as I’ve said before on the podcast, a company I worked with worked with streaming data. And that was great, and we had a lot of data analysts and data scientists who came to us and they would say, “We want access to all the data that we have. I don’t really care where it comes from. I don’t really care what it is. I don’t really care… I don’t know whether we have Kafka data or if we have data sitting in Spark or if we have… If it’s in a ware… I don’t know where the data sits. I don’t care where it sits. I just want access to it, I need access to it, the way that I need access to it.”

LD: And then on the flip side, we talked to the data engineering teams, and they’d go, “I can’t even get my Spark pipelines out, so I can’t… I don’t know what you want me… There’s no way I’m looking at KStreams or Flink or any of that stuff right now, because I can’t even get the Spark-based pipelines out to the people that need them, and that data out to the people that need it, in a timely fashion.” There are companies out there that are built on streaming data, and they probably have data engineering teams of 30, 40-plus people. These are huge companies with massive data engineering teams. But I think for the average company out there, no matter what kind of data you have coming in, it can be coming in in a streaming fashion, it can be coming in on a daily basis, on a weekly basis, on a whatever basis, it’s still the age-old question of, how do you connect the dots, how do you connect the data to the people who need it?

SK: Yeah, I would totally agree. We see this a lot where people love to, I think, talk about batch versus streaming, and is there one going to… Is there one to rule them all? Just like we see with ETL versus ELT, right? And the reality is that the world is complex. Companies are complex; their needs, their people, the products they’re trying to build are complex, which is great because the complexity generally carries with it a fantastic amount of job security for all of us in this industry, which I’m sure we’re all very excited about and appreciative of. But we oftentimes see different things really come and catch a ton of momentum, and the pendulum sort of swings, and Gartner has their fantastic hype cycle that’s associated with this, where you see everybody sort of moving forward on like, “We’re all in on streaming. We’re all in on ETL, or ELT.” And I feel like we find over time, there’s just this balance. We even did this back at the last company I was the founder and CTO of. We were really, really early with… Even before Spark and Spark Streaming, with Storm, which is an open source technology that came out of Twitter.

SK: And so we started doing real-time analytics dashboards for our customers. We powered online video distribution. And so the demo was, you pop up a player and can literally start to see, within a second, the incremental tick marks of how many people are watching and viewing your content. It was like the coolest, sexiest thing ever, where you’d go into demos and people were really excited about it. And we went through this huge product initiative of, what if we moved our entire analytics product, which is a huge, big, powerful, complex analytics product, all to real-time? And what would that take? And what would our customers do? So we did this huge amount of research with customers, and basically it was, what questions are you trying to answer with the data? What things, what activities, are you trying to automate with data? And what we found was there was still a very small component of what they were trying to do that actually necessitated real-time data. Many of the things they were going to do had a human in the loop, right? It was, “Oh, I am going to go publish a different piece of content or invest more in this strategy.” As a result, the latency was so high that the need for real-time data was actually less than the desire for deep levels of historical data for more complex analysis.

LD: Yeah.

SK: And that doesn’t take away from the need for streaming, it just means you kinda need both. And so as we get excited about one pattern or the other, I think you need both. I mean, we even see this… Sheel, you talked about this, where a lot of people right now will grab all of their data and collect it in that data lake and try to run Athena on top of it. Or we see a lot of the ELT, where they’ll be like, “I’m gonna collect a bunch of that data and put it in my data warehouse, and then I’ll do ad-hoc queries and even create reports on top of it.” And a lot of this is because folks are trying to get away from that ETL paradigm and the big data paradigm, ’cause the data engineering part is hard. Right? At some point, we always see folks who are trying to optimize. They’re like, “Ah, this is really slow,” or, “It’s really expensive,” or, “Hey, the infrastructure team is asking us if we can stop red-lining our Redshift or Snowflake cluster so much, because we’re running the same queries over and over and over again,” etcetera.

SK: And so, you find over time people try and find that balance, and so I guess one thing that… Sheel, I’m kind of curious, ’cause you spend so much time with so many teams on this is, what becomes those tipping points? As we see folks going from ETL to ELT to kind of realizing those balances. If you’re somebody listening to the podcast, what are those patterns you should start to be aware of where you’re like, “Oh, pain is around the corner, we need to actually figure out how we find a hybrid strategy to make sure that we don’t lose our efficiency as a team, and as a company?”

SC: And maybe even more specifically around patterns of ETL versus ELT, or team patterns or I guess…

SK: Yeah, both. Like, What do you see as we start talking to teams where like… I mean, to be really frank, the vast majority of people we talk to is, they generally have come to us because they are feeling a pain, right? They’ve felt the pain, and they went to the Google and they started asking the Google like, how do I do X? Or how do I alleviate pain Y? So what are those common patterns that you see that they’re already experiencing? And what’s sort of their set up? And how do they get… Obviously, one way they help get out of this, of course, is with us, ’cause they’re talking to us, but how have people found their way of navigating that?

SC: Yeah, it’s interesting, and I would say there’s maybe a couple of different ways that I can try to bucket it, and one is folks that have created home-grown pipelines, and so this might be Python scripts, it might be Spark Jobs orchestrated with Airflow, it’s something that they’ve put together internally to solve some exact pain points as you were talking about, Sean, maybe we do want to publish a piece of content based on this analytics. Okay, great, let’s at least automate the report generation and the analytics, that doesn’t seem so bad, you know, maybe we run the SQL query to unload it to this bucket, to then maybe run some Python scripts on it, to maybe analyze it a little bit and publish it over here, so that’s kind of the home-grown area that I’ll call it. Then the next one is where we see folks who have fully embraced the ELT, and that’s the…

SC: “I’m gonna try to just transform it as lightweight as possible in order to get it into the warehouse in a way that I can work with it in the warehouse.” It turns out there’s some nuance in those words I just said, “as lightweight as possible,” and in how it fits into that warehouse. For example, if you’re working with a warehouse that doesn’t necessarily have the best JSON support, then maybe you do a little bit of upfront translation to try to break that apart, and your ELT has become a little bit of an E, little mini let’s-not-talk-about-it hidden T, then LT. And in that kind of warehouse-centric world of SQL queries, you typically need something to help you start transforming that data, ’cause it does come in very, very raw in that case. So for example, making page views into sessions, making an order fact table from the raw orders, things like this. And so then you might be doing some SQL, and some tools to automate the SQL processing and materialization, whether it’s something, again, home-grown, or maybe using something like dbt.
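For readers unfamiliar with the "page views into sessions" transformation mentioned here, the following is a minimal Python sketch of one common way to do it: sort each user's page views by time and start a new session whenever the gap since the previous view exceeds a threshold. The 30-minute gap and the field names (`user_id`, `ts`) are assumptions for illustration; in the warehouse-centric setup being described, this logic would more typically be written in SQL.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold


def sessionize(page_views, gap=SESSION_GAP):
    """Group raw page views into per-user sessions split on inactivity gaps."""
    sessions = []
    ordered = sorted(page_views, key=itemgetter("user_id", "ts"))
    for user_id, views in groupby(ordered, key=itemgetter("user_id")):
        current = None
        for view in views:
            # Start a new session on the first view or after a long gap.
            if current is None or view["ts"] - current["end"] > gap:
                current = {"user_id": user_id, "start": view["ts"],
                           "end": view["ts"], "page_views": 0}
                sessions.append(current)
            current["end"] = view["ts"]
            current["page_views"] += 1
    return sessions


# Hypothetical raw page views.
page_views = [
    {"user_id": "u1", "ts": datetime(2021, 1, 1, 9, 0)},
    {"user_id": "u1", "ts": datetime(2021, 1, 1, 9, 10)},
    {"user_id": "u1", "ts": datetime(2021, 1, 1, 10, 0)},
    {"user_id": "u2", "ts": datetime(2021, 1, 1, 9, 5)},
]
sessions = sessionize(page_views)
```

In this sample, user `u1` ends up with two sessions because the third view arrives 50 minutes after the second, and `u2` gets a single one-view session.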

SC: Then that’s the second world that I would say we see a lot of folks in, and then the third world is some hybrid approach of the first two that I just mentioned, with some amount of partners or vendors mixed in there to help you facilitate those parts. So maybe it’s still a home-grown scheduling system, but maybe you’re trying to use dbt to run the actual queries, or maybe you’re trying to use a bigger partner or a vendor to automate even more of it, but then there are still some home-grown side things that just do a couple of webhooks on the side or something like it. And I’d say the tipping points that we typically see from these folks is that on the home-grown scripting side, oftentimes we see a lot of folks with just plain Python scripts or pandas or stuff like this, and they just outgrow it due to scale and/or complexity. So either the amount of data that’s being piped through that pipeline is too much and the scripts need to be upgraded to something like Spark, and now you’re really wading into the deep technical end of the data engineering space, or, the one that people often forget about, the types of data grow in complexity. So it’s one thing to say, page view data goes from 1,000 page views a day to now 10 million page views a day. It’s another thing to say, we have 250 different front-end events being tracked, how do we wrangle this and push it through?

SC: Each one might only be 10 events a day, so it’s not the volume of events that’s crushing you, it’s the variety of events that’s crushing you, so that’s typically what we see start breaking down in those kinds of pipelines. And I shouldn’t say breaking down, but dangers that lie ahead, that would require further investment, I should say. And then on the other side of it, when we start getting into this ELT world, we typically see two breaking points there as well. Ah, I said breaking points again. We typically see two lurking issues there. One is that the level of customization that’s required in the EL starts becoming more complex. So, for example, the out-of-the-box Salesforce adapter that you tried to use can’t handle more than 50 custom fields and your business happens to have 150 of them. So now, you’re like, “Okay, wait, hang on, I need to write my own EL.” And then you realize at some point, “Wait, hold on, I’ve just done ETL.” And so that can be one breaking point there. And then the other that we start to see there is more complex business use cases, so pushing into data science. Leslie mentioned this earlier, like, “I just wanna work with that data, I don’t care where it came from, but I need it in the way that I need it.” And that’s not necessarily in a warehouse, if you’re trying to run a bunch of machine learning algorithms on it.

SC: You actually probably want it more in a data lake and stuff like that. So, now you end up with this paradigm where you’ve used maybe an out-of-the-box adapter, you’ve put it into a warehouse, maybe you’re writing a bunch of SQL queries to transform it into a format that you want, but then you wanna do even more with it. You wanna run machine learning algorithms on it, and you’re like, “Ah, okay, well, now maybe we need to get it out of the warehouse.” Or maybe you try to use some warehouse solution that lets you run machine learning algorithms on it, like trying to run Redshift through something like SageMaker, or you end up actually wanting to convert it into more open formats and get it into a data lake, so that you can then run clusters against it and things like this. So I’d say that’s the one that we often… Or the two different areas that we often see in that paradigm as well. And then of course, the third paradigm of mixing partners or vendors, interspersed in all that, has the same sort of trade-offs, just maybe a little bit more out of your hands.

SK: As you encounter a bunch of different implementations, maybe a fun question, hopefully not too controversial, well, maybe a little controversial: what is one of the patterns that you see people having adopted that they regret the most? Like the thing that seemed like a really great idea at first, and then they’re like, “Wow, we have to get out of this, stat.”

SC: Yeah, that’s interesting. On the one hand, we’ve got ones that have become quite proprietary. So, they’ve invested quite a bit into a proprietary ETL system, and in order for them to do anything, it requires this huge upfront migration cost, so that they can break out the code into just more basic Python or Spark code or something like this. And now that they’ve invested years into it, it’s like, “Okay, well, who’s gonna spend years unwinding this and getting us out of this vendor?” So, that’s definitely one that we see quite often. I don’t know if it’s the one that we see the most, but we definitely see the most sadness around it, where it’s just like, “Oh no, this is a really tough situation to be in, and we don’t really know what we’re going to do about it.” Yeah.

SK: Yeah, well, I feel like those are always the interesting ones, ’cause you can then end up in these cycles where you lose that release-early-and-often sort of benefit. You find these teams, right? And they’re like, “Oh my God, this is so painful, we can never morph this thing, we have to just build a new one from scratch.” And with that new one from scratch, oftentimes they’re trying to do the next gen of it, but that can then take a year or two, and by the time that thing is live, “Oh my gosh, it’s already dated too.” And you just end up in this never-ending cycle that I’ve seen teams in. Versus, I’m sure, the way to go is to just chop it up and figure out how to incrementally catch it up, which is probably less exciting, but actually, from a project perspective, far more derisked, and it creates a lot more pain alleviation, faster.

SC: And I’d say the same is true, it’s the same practices from software engineering just applied to data pipelines, ’cause this is exactly what we would go through in the consulting world of it. Like, yes, you’ve got some legacy system, maybe the programming language is from 20 years ago, and now we’re having difficulty finding people who can work in it. The temptation is always, “Let’s build it from fresh.” Everybody loves greenfield, everybody loves working on the latest and greatest, but yet those projects are always very, very difficult, tend to go out of scope very quickly, tend to have a lot of, let’s say, internal people, systems, that have to be solved around it. So, more of a people problem sometimes than a systems thing. And a lot of times it makes a lot of sense to iterate your way out of these: “Can we replace one chunk of what this is doing? Can we start here? Can we see how that’s working? And can we make decisions that allow us to, in the future, since everything is going to change forever, like this we know, allow us to make future type changes more malleable and easier to keep up with?”

SC: So, I think this is why we see such excitement around microservices, call it a buzzword or call it the wrong choice for a lot of companies, if you want. But the reason there’s a lot of genesis around it is, because it’s solving a lot of these… Or let’s say, it’s solving or promises to solve a lot of these things, where one of these services is getting pretty out of date, it’s having a lot of trouble keeping up, let’s replace that one service without scrapping the whole system.

SK: Yeah.

LD: Yeah.

SK: Of course. It’s funny ’cause we used to always have the saying where like it’s… I think it’s a fool’s errand to try and build product and build technology that has zero debt, ’cause you get diminishing returns on trying to make sure that your system never has debt. And so oftentimes when we think about the benefits of microservices, we used to have the saying that, “Debt compounds linearly across microservices and exponentially within a service.” And so the bigger your service gets, the more complex it is, whereas you can keep that debt and that sort of risk of future pain far more compartmentalized if you have well-designed interfaces and a microservices model, where you can blow up that service and, as long as you can adhere to that same contract, you can implement it differently, and it makes you more flexible as a team and it derisks a lot. And it’s funny, because when we think about this from a data engineering and data pipeline perspective, we’re a solid five to 10 years behind as a field of engineering, where things are very much on the rise, things are far more malleable in the sense of, “I can write part of one microservice in Python, another microservice in Java, and they can still talk to each other, right, over gRPC or Thrift.” Today, data pipelines are far more the data equivalent of the monolithic binary, the Frankenstein which has constantly evolved and been tacked on to over the course of the last five years.
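
To make the contract idea above concrete, here is a minimal sketch in Python. Everything in it is hypothetical and chosen only for illustration (the `EnrichmentService` interface, the two implementations, and `run_pipeline` do not come from the conversation). It shows a caller that depends only on a method signature, so one implementation can be swapped for another without the caller changing.

```python
from typing import Protocol


class EnrichmentService(Protocol):
    """Hypothetical contract: callers depend only on this method signature."""

    def enrich(self, record: dict) -> dict: ...


class LegacyEnricher:
    """Older implementation; whatever debt it carries stays behind the contract."""

    def enrich(self, record: dict) -> dict:
        out = dict(record)
        out["country"] = record.get("country", "unknown").upper()
        return out


class RewrittenEnricher:
    """Replacement implementation; internals differ, the contract does not."""

    _DEFAULT = "UNKNOWN"

    def enrich(self, record: dict) -> dict:
        country = record.get("country")
        return {**record, "country": country.upper() if country else self._DEFAULT}


def run_pipeline(service: EnrichmentService, records: list[dict]) -> list[dict]:
    # The caller only knows about the contract, not which implementation it got.
    return [service.enrich(r) for r in records]


records = [{"id": 1, "country": "us"}, {"id": 2}]
old_output = run_pipeline(LegacyEnricher(), records)
new_output = run_pipeline(RewrittenEnricher(), records)
```

For these sample records, both implementations return the same output, which is the property that lets a team replace one service without scrapping the whole system.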

SC: Yeah. And to your point, the interfaces are very low level. A lot of times, the best we’ve done is, “I will write it to this S3 bucket on this path at this time every day and you can start your pipeline over there.” And well, as we all know, there’s a lot left to be desired there from an interface perspective on doing that hand-off point. What I am excited about, though, is this progression of taking these software engineering principles that a lot of the community has rallied around over the past 10 years, and bringing that to data pipelines. So, starting with earlier teams where it’s like, “Oh yeah, the only test we run is in the production pipeline, that checks the validity of the data and if not, it aborts everything.” To now, we’re starting to see unit-testing frameworks and ways to evaluate this code and CI/CD and a lot of these similar principles. And of course, including what you were talking about, modularization, onto these pipelines.
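
As a rough illustration of the shift described above, from validating data only in the production pipeline to unit testing the pipeline code itself, here is a small sketch using Python’s standard `unittest` module. The `dedupe_latest` transform and its field names (`user_id`, `updated_at`, `plan`) are assumptions made up for this example, not something named in the conversation.

```python
import unittest


def dedupe_latest(rows: list[dict]) -> list[dict]:
    """Hypothetical transform: keep the newest row per user_id, ordered by user_id."""
    latest: dict = {}
    for row in rows:
        current = latest.get(row["user_id"])
        # ISO-formatted date strings compare correctly as plain strings.
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["user_id"]] = row
    return [latest[key] for key in sorted(latest)]


class DedupeLatestTest(unittest.TestCase):
    def test_keeps_newest_row_per_user(self):
        rows = [
            {"user_id": 1, "updated_at": "2020-01-01", "plan": "free"},
            {"user_id": 1, "updated_at": "2020-02-01", "plan": "pro"},
            {"user_id": 2, "updated_at": "2020-01-15", "plan": "free"},
        ]
        result = dedupe_latest(rows)
        self.assertEqual([r["plan"] for r in result], ["pro", "free"])

    def test_empty_input(self):
        self.assertEqual(dedupe_latest([]), [])
```

Saved as a module, tests like these can be run with `python -m unittest` on every change, for instance as a CI step, so a logic error is caught before it ever reaches the production run.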

LD: Well, I’m gonna cut us off there, ’cause we have to have something for the next time I force Sheel and Sean to join me on the podcast. [chuckle] So, thank you both. I appreciate your time and you will be hearing from them again, because they will definitely be doing this again. We’ll dive into some other topics. I wanna hear… Sean and I talked about it a little bit, but I wanna hear from both Sean and Sheel and hear them bounce ideas off of each other about what the future looks like. So, that’ll be the next one that we have with these guys. So, thank you both.

SK: Thank you for having us.

LD: There you have it, folks. Between his history actually building data teams and being a data engineer, and now his work with a variety of companies to help with their data engineering efforts, I always, always, always learn something when I chat with Sheel, and I hope you did as well. As always, you can visit us at ascend.io if you would like any other information or wanna reach out with feedback, questions you want us to talk through on the podcast, or guest suggestions. We’re open to all of that. We would love to hear from you. Welcome to a new era of data engineering.

Ep 4 – Diving Into Data Engineering with Sheel Choksi