Ep 23 – The Three Eras Of Data Engineering

About this Episode

Sean and Paul talk through the three eras that data engineering teams move through as they get more mature with data processing. We unpack the kinds of metadata required at each stage, and how realistic it is to build a system that processes data in incremental packets instead of full reductions.

Transcript

Paul Lacey: All right, everybody. Welcome back to the program. This is the DataAware podcast by Ascend, the podcast that talks about all things related to Data Engineering and automating your Data pipelines. I’m Paul Lacey, your host and the head of product marketing here at Ascend, and I’m joined by Sean Knapp, the CEO and founder of Ascend. Sean, welcome.

Sean Knapp: Hey, everybody.

Paul Lacey: Hey, Sean. Great to have you back. It feels almost like Groundhog Day on this show, but it’s a good kind of Groundhog Day. It’s like, oh man, we always get to get together on Thursdays, talk about some cool stuff, and publish it in the afternoon. It’s great.

Sean Knapp: This really is one of the highlights of my week. We get to take a lot of the learnings from the previous week and oftentimes we don’t get enough opportunity to really just sit down and chat about the learnings and digest all the new things happening, so this is great. I’m really enjoying these.

Paul Lacey: Me too. I think, as I listen to these during the editing phases as well, the amount of stuff that is coming out is so great, and it’s giving people the blueprint, I think, that they need to think differently about how they’re doing their Data Engineering work and how they’re building out their Data pipelines today, and that’s our goal. If we can get somebody to take a look at what they’re doing today and think, “Hey, can I 10x this somehow? Or is there a different process that I should be employing?” That’s our goal with this show, and so yeah, let’s dive into it, Sean. I think we’ve been talking a lot about metadata so far. The past couple of episodes we’ve talked about what it means to automate your Data pipelines and what a Data pipeline automation stack looks like.

We talked last time about the intelligent controller. We talked about it being a high-anxiety controller that’s constantly looking for things that have gone wrong or things that need to happen within your data system and then going out and executing those tasks. I think the thing most people understand is that behind that you need good metadata, good information about your data sets and where they came from, what’s happened to them, what needs to happen to them in the future, that kind of thing, which makes sense in concept. But when we go a level down and say, “Okay, great. Now if someone was going to start to build a system like this, what kinds of things would they want to track? What kinds of attributes do they need to know about their data?” How would you as an architect, Sean, start to unpack this challenge of what are the things that I don’t know about my data today that I should?

Sean Knapp: Yeah, I think the short answer would be everything. We want to know everything about our data. I think we mentioned this last week too, which is that metadata really is the new big data, which sounds super cute and cheeky and I’m sure half the people listening are going to eye roll all at the same time, but it’s true and it should be true. When we think about data and movement and processing and access, there’s so much to analyze and to track about the data itself. I have new data sets coming in. When did they come in? What is the distribution of that data? What systems are accessing them? What code are they processing that data with as they move it somewhere else and create new versions of data? How many resources were consumed by all of that? All of this is part of a really rich layer of metadata that we can and should be tracking about the data itself that flows through our systems.

I think even more so too, as we get closer and closer out to the end consumption side: who’s accessing data, what are they accessing it for? What are they doing with that data? Because as we start to track things like that, we can then figure out, well, are the things that they’re doing with that data similar to what other people are doing with that data, and can we further optimize it, or can we provide suggestions around other data sets that might actually be interesting to folks? There’s this whole wealth of data that we really can and should be collecting about the actual data itself. That’s not even getting to the semantic layer pieces, which obviously I know we’ll talk about more. This is just the baseline level of what a system itself should be tracking so that we can really intelligently and efficiently run the underlying infrastructure and movement and processing of data.
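To make that concrete, here is a minimal sketch of the kind of per-dataset operational metadata Sean describes tracking. The field names here are hypothetical illustrations, not Ascend’s actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

# A sketch of the operational metadata a pipeline controller might record for
# each processed chunk of data. All names are illustrative assumptions.
@dataclass
class DatasetRunMetadata:
    dataset_name: str            # which dataset or component produced this output
    arrived_at: datetime         # when the new data landed
    row_count: int               # basic volume statistics
    column_profiles: dict        # e.g. {"amount": {"min": 0, "max": 981, "nulls": 3}}
    code_fingerprint: str        # hash of the transform code that produced it
    upstream_inputs: list = field(default_factory=list)  # lineage: inputs consumed
    cpu_seconds: float = 0.0     # resources consumed by the job
    accessed_by: list = field(default_factory=list)      # downstream readers / consumers
```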

Paul Lacey: Makes sense. I’ve definitely heard and read some pretty compelling thought leadership in the industry from some of the analyst agencies that are out there, and others who think really deeply about these sorts of things, that talks a lot about the importance of the downstream metadata you mentioned there, Sean. In terms of who’s accessing it, how frequently are they accessing it? Are they accessing it in the most efficient way, or do we need to trigger some things to happen, because this seems to be a very popular dataset in our organization and everybody’s doing this particular lookup on the dataset or they’re accessing this particular materialization? Maybe we need to actually go and move that into a more performant data store, for example. Maybe we need to take this virtualized dataset and actually materialize it somewhere. Things like that are super interesting to think about. I guess, on the other hand, it feels a little bit like sci-fi. If you’re somebody who’s doing Data pipeline work today, you’re sitting there thinking, “Yeah, of course it would be awesome if my system was autonomous and it knew that somebody’s accessing this dataset a lot and it’s very expensive the way they’re accessing it, so I’m just going to move it over here and then it’s going to be better.” But Sean, is that really the case, or do you feel we are pretty close to having that start happening for folks?

Sean Knapp: I think we are pretty close. We’ve seen companies like Databricks talk about some of their, I’m not going to use their exact terms, but basically dynamic repartitioning of data based on access models to optimize the access and query patterns. I think we’re getting closer and closer. Whether that happens in 2023 or 2024, TBD, and whether it happens en masse or in early prototypes, again, TBD. But I would say when we think about this from the scientific perspective, there’s an abundance of prior art that says these kinds of things are actually quite possible. We’ve had Database engines and Query planners and Query optimizers forever. When we think about analyzing access patterns, it is just a bit of the other side of the same coin. A Query planner and optimizer says, given your Query, let me better understand how best to pull the data off disk as it currently resides, tapping into the indices and the partitions and so on, and then optimize how I process and Query that data.

But in many ways, if you have all the ways that people have been accessing and processing data, wouldn’t you then pretty easily think, “Well, how could I better lay this out?” It’s a different structure, a different approach, but I would say from a scientific perspective it’s not an order of increased complexity or challenge, it’s just a different approach to solving similar categories of problems. Now, how we get there, I would say, is probably through a sequence of smaller steps. We’ll start with basic Analytics off of the metadata, and we already do this in the Ascend product today, which is to give you really deep visibility not just at the level of, “Hey, here’s how your Clusters are working, here’s how much your Snowflake or Databricks usage is, or here’s how much of the Ascend compute layer you’re using.” We’ll actually go deeper.

We’ll say, “Hey, for this particular job, here’s how many CPU seconds this particular job is taking on this particular dataset or component.” The reason I think that’s so valuable is it starts to surface all of these hotspots. One of my favorite visualizations in the product is the sunburst visualization, which is those concentric circles where you start with, “Hey, which data plane am I currently working with? All right, which data domain and data products?” As you drill out, it lets you really hone in on, “Well, who are the worst offenders?” Not the people per se, but the parts of a Data pipeline that are consuming a ton of resources become really fast and easy to hone in on. Oftentimes, we see 2% of your data sets or your components or your assets are actually contributing to 50% of your usage of an underlying platform. It lets you hone in really quickly on where you should spend your time to go optimize. Today, this is, I’d say, automation-assisted development, but I think we’re going to continue to see, we certainly believe this, a move from automation-assisted, to automation-suggested, to eventually automation-driven and approved optimizations, and I think we keep incrementally marching closer and closer in a way that still builds trust and confidence with the developers that are responsible for delivering these platforms.
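A toy sketch of that hotspot analysis, assuming job-level resource records like the ones described above (the record fields are hypothetical): aggregate consumption per component and surface the few components that account for most of the usage.

```python
# Find the small set of components responsible for a given share of total
# CPU seconds -- the "2% of assets drive 50% of usage" pattern Sean describes.
def top_offenders(job_records, threshold=0.5):
    usage = {}
    for rec in job_records:                      # rec: {"component": ..., "cpu_seconds": ...}
        usage[rec["component"]] = usage.get(rec["component"], 0.0) + rec["cpu_seconds"]

    total = sum(usage.values())
    if not total:
        return []

    offenders, running = [], 0.0
    # Walk components from most to least expensive until we explain
    # `threshold` of total usage -- those are the ones worth optimizing first.
    for component, cpu in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
        offenders.append((component, cpu))
        running += cpu
        if running / total >= threshold:
            break
    return offenders
```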

Paul Lacey: The pattern is the more you know about your data and your data sets, the more automation you can drive. If I were to sum that up, does that sound right, Sean? If we’re talking about access patterns, the more we know about those, the more we can automate the behaviors the system takes around them. The more we know about compute patterns, the more we can automate. Is that right?

Sean Knapp: Yeah. As an eighties kid, right? What was it? The GI Joe, knowing is half the battle, something along those lines.

Paul Lacey: Yeah.

Sean Knapp: There’s somebody else who had a campaign of “the more you know,” but I actually can’t remember the words that came after that, so we’re going to go with GI Joe instead, but yeah, totally. I think the more information we have about how a system’s operating, the more we can optimize it, and the more that highly automated and intelligent systems can actually suggest new things too. I think this is one of the reasons why I think we’re just in the earliest of innings and we’re about to see this whole new era of really intelligent data systems. Whether or not we’re fully building end-to-end Data pipelines with AI or heavy automation, I don’t think I’m going to take that position just yet because I think it’s far enough out on the horizon. I think it’s an eventuality, but I don’t think it’s right around the corner just yet. But I do think we’re now in the era of rapidly improving automation and AI technologies based on that metadata for enhancing developer experience and productivity. I think we’re very clearly entering this era.

Paul Lacey: It is interesting, when you bring up the example of the Databricks Query optimization stuff, and I know that Snowflake has a ton of IP under the hood in a similar vein, that you could just dump a giant blob of data into Snowflake and it will figure out the partitioning strategies and things like that to make the queries super efficient and performant. I know we’ve talked about that on the show before. There’s a parallel to this, I think, in the Data pipeline world, isn’t there? One of the things that we talk about a lot at Ascend is the partitioning of data with regards to a Data pipeline. Can you say a little bit more about what that looks like and what folks should be thinking about?

Sean Knapp: Yeah. Well, I think dumping all your data, whether it’s in a lake or a warehouse, doing a bunch of queries on it, and then incrementally optimizing how you get the data out is very, very, very related to, quite specifically, how Data pipelines work. This is that classic pattern. I’ve watched this, and it feels like a very meta version of Groundhog Day spread out over the course of a couple of decades now: just about every team follows the same pattern and we all kind of do it over and over again, which is, “Oh my gosh, I have all these amazing cloud warehouses and they have these amazing capabilities. Now let me just get all of my data into the warehouse,” and it usually follows a data replication strategy: get it into the warehouse, do ad hoc queries against it, and usually then introduce some sort of a semantic layer.

Oftentimes, even the most primitive implementations of those would be non-materialized Views that people then Query through but that still hit all of the raw data. As you do that, you go through this era of empowerment of the rest of the organization. People have access to all of the data, it’s instantly updating, and they can hit it through these semantic layers that make more sense and are easier, and they start to put a lot of dashboards on top of it and you get a lot more usage of the data. Eventually, what happens is you start to shift into this next wave, which is, “Wow, we’re eating a lot of warehouse dollars trying to Query a bunch of raw data that’s been sitting in our lake or our warehouse, and now our dashboards are running slow, our costs are going up, or we’re hitting these other challenges,” and that usually then starts to pull in the new era of Data Engineering.

The old era was, “I just had to ETL some stuff to get it into my warehouse,” but this new era of Data Engineering that we see really starts to emerge when you hit this ceiling around cost, performance, and scalability, which is: I shouldn’t be rerunning the same Query on all the same data multiple times a day, or multiple times an hour or a minute. I need to actually optimize, because the general philosophy is, if you’re going to run the same Query on the same piece of data, you probably should be pre-materializing that dataset so that it’s much smaller, much more compact, and you’re touching one hundredth or one thousandth of the data volume when you go and Query. That is really where the world of Data Engineering starts to come in, and it can be Data Engineering, it could be Analytics engineering.

This is the second era, which is, “A-ha, I have all this raw data. Let me go build a bunch of these cascading ELT transformations.” In many ways, the oversimplified notion is cascading Materialized Views. Every hour I’m going to refresh my data and cascade all of those through, and that second era then says, “Hey everybody, stop Querying all the raw data. You can do it if you really have to, but let’s get you onto the optimized models. Your dashboards are going to load faster, you’re going to spend less money, it’s going to be great, and you’re going to get higher-level models that are just going to make more sense and be easier.” Fantastic. That’s usually the second wave. That starts to introduce some friction between the data consumers and the data producers, because when you need new data sets, organizations will oftentimes start to wall off the raw data sets and not give you access to them anymore, which I think is a mistake, but to each their own. You’ll start to see some of that separation.
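A minimal sketch of that second-era pattern, using plain Python dictionaries to stand in for warehouse tables (all names are illustrative): every refresh cycle rebuilds each downstream model from all of the upstream rows, every time.

```python
# Second era: cascading "full reductions" -- each refresh re-reads ALL rows
# for ALL time and rebuilds every downstream materialization from scratch.
def refresh_full(tables, models):
    """tables: {name: list_of_rows}; models: [(target, transform_fn, [sources])] in dependency order."""
    for target, transform, sources in models:
        all_rows = []
        for source in sources:
            all_rows.extend(tables[source])      # re-read the entire history every cycle
        tables[target] = transform(all_rows)     # rebuild the full materialization
    return tables

# Cheaper than high-velocity ad hoc queries on raw data, but the cost of every
# refresh grows with total data volume, which is what pushes teams to era three.
```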

Then the third wave starts to come in, and this is really where I think we’re starting to see a ton of teams needing not just these cascading Materialized Views, but actual, legitimate, optimized pipelines built around incremental data propagation. Because in the second era, what we see a lot of organizations and teams doing is rerunning the entire materialization, reprocessing all data for all time, every refresh cycle, which is still better than Querying all the raw data ad hoc if you have a high Query velocity, but you’re still having to process a bunch, and that will only get you enough mileage for a while. But in this third era, and this is where the biggest, highest-scale, highest-volume Data Engineering teams have really shined, let’s assume that your data volume is so large, you actually don’t want to reprocess it every hour.

It’s prohibitively expensive or just prohibitively slow; those tend to go hand in hand. Instead, the nature of your data is such that you can do incremental propagation of that data and start to compress it down without having to retouch every record and every bit and every byte. That gets harder as you construct these pipelines, because you have to figure out how to incrementally propagate your data and the correlation of data sets based on the code that’s running, but this is usually where you get an order or two of magnitude of cost improvement while introducing some additional complexity, because now you’re not touching all of your data. This is the third era or wave of how I think about Data pipelines and data access, and we’re starting to see a lot of companies have to dive into it. I think Data Engineering teams have been dealing with this for quite some time, but a lot of the Analytics engineering teams are just starting to do more of it, and this is why we see things like incrementally updating Tables in dbt, or dynamic Tables in Snowflake, and other optimizations that you can do.

You’ve been able to do other stuff in Databricks similarly. That’s why I think we’re starting to see a lot more of these kinds of capabilities, as teams are realizing they need to, basically, optimize updates to their pipelines and their data models themselves.
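By contrast, a minimal sketch of third-era incremental propagation, again with plain Python structures standing in for storage (names are illustrative): only partitions that have not yet been processed flow through the transform and get appended downstream.

```python
# Third era: incremental propagation -- only new chunks of data are processed
# and appended to the target; already-materialized partitions are left alone.
def refresh_incremental(source_partitions, target_rows, transform, processed):
    """source_partitions: {partition_key: rows}; processed: set of keys already materialized."""
    new_keys = []
    for key, rows in source_partitions.items():
        if key in processed:
            continue                              # skip chunks we've already handled
        target_rows.extend(transform(rows))       # touch only the new chunk, append downstream
        processed.add(key)
        new_keys.append(key)
    return new_keys                               # bookkeeping for the next refresh cycle
```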

Paul Lacey: Yep, and I can certainly envision, by the way, that’s an awesome kind of three stage maturity model there, Sean, so thanks for outlining that. I can-

Sean Knapp: You like that?

Paul Lacey: Definitely. I love it.

Sean Knapp: I just made it up, but it seemed to make sense.

Paul Lacey: This is why we do these shows because the stuff that we come up with is pretty awesome. You heard it here first folks. Yeah, this is not scripted, but it makes a ton of sense to think about processing data incrementally, but I guess where my head goes to next is, awesome, as long as everything is steady state and you just get another set of data in, let’s say it’s from the last hour, from the last day, from last week, however frequently you want to refresh your Materialized View, you pass it through the transformation logic and the steps and stuff, and then you wind up with data that fits into the Table and it goes into the Table. Cool. What happens if you change the logic that defines the view? Then do you wind up in this kind of race condition where you have newer data that’s been processed by newer code going into the same Table as data that’s older that has not been processed by the same code? How does that work?

Sean Knapp: Yeah, that gets pretty tricky. This is one of those things, it’s a problem you only encounter when you get to this third stage, because in that first era, you can change your code, but you’re rerunning whatever the present-state code is on a hundred percent of the data at Query time. You can change your code all you want, and maybe if you had a bunch of nested queries or views Querying other views, at some point you broke something in that Query tree, but you’ll figure that out. That’s the point. Basically, in that world, all the data is being run through the same version of code, so you don’t encounter that problem there. Similarly, in that second era, in Data Engineering terms, if you’re rematerializing every view, every Table, every single time along your graph of cascading data transformations, what Data Engineering calls a full reduction, you’re reprocessing all data into a single component every time.

Same thing. You’re running all the data through the same version of the code, and it happens to be a new version of the code, but you’re running all your data through it, so you can actually punt on that. This is why I think it’s fairly interesting for a lot of teams: you start to realize how important this becomes in that second era, especially if you want something to be really reactive to developer change. You’re like, “Oh, I want to change this. I want to immediately see the ripple effect.” You start to see that benefit, but where you really desperately need the intelligence around changes to code is in this third era, because the whole third era is predicated on this belief that you don’t want to reprocess all of your data all the time, as it’s too slow and too prohibitively expensive.

You want to reserve actually processing all of your data for when it’s going to materially matter, which is generally when something materially changes: “Hey, I have all these blocks of code that define all of these data models and all these downstream things that depend on them. Let me actually only reprocess this when I know I need to,” which is generally when the code itself changes, and if the code didn’t change, you can pretty safely say, “I don’t need to rerun it.” That, I think, is where this third era of increasingly advanced, optimized pipelines that I see a lot of the world moving towards makes the dependency on change detection matter. Because if you don’t know when you need to reprocess a bunch of data, you end up with a bunch of your data running through old versions of code and a bunch of your data running through new versions of your code.

There are sometimes where you may actually want that, but more often than not you don’t, and you do want to go in and do what’s called a backfill and auto-propagate the change, because, again, change can be both code or data. I think this is where it becomes so incredibly powerful, and this is where we get to lean on things like metadata to automatically detect, “Ah, you changed something. Let me go rerun that data through the new version of the code and make sure that all those downstream dependencies can rely on the right version of code.” Now, as a follow-on if you want, there are all these amazing things that we get to do, again, backed by tons of metadata and all these other smarts, where you say, “Hey, you did change the code, but it actually wasn’t a material change, so we’re not going to go light up a bunch of reprocessing.” There are a lot of other really nifty things that we can do that help us avoid things like that too.

Paul Lacey: Yeah. When I’m thinking about this, I’m a very visual person, as you might’ve caught onto, so I’m always like, “What’s the diagram in my head that’s showing what’s going on here?” It really feels like you need a system that, A, tracks the relationship between a slice of the data and the code that operated on that slice of data. Coming back to the partitioning story, we call it a partition, but a partition is just another word for it. It’s a packet of data. It’s a data set. It’s an incremental data set that came into the pipeline. It came in at such and such date, it was operated on by such and such version of the code, and it wound up in this place, and I think in a lot of ways that sounds like a lineage kind of a thing. It’s just like, “Hey, we know where it came from. We know what happened to it, we know where it went,” but it’s not just that a human knows that, it’s that the controller itself needs to know it, so that when you update code, it can go back through its entire catalog of not just the Tables but the slices of the data in that Table and say, “A-ha, there’s three partitions in this Table.

Two of them are now up-to-date. One of them is now obsolete based on this new code change. I need to go then run this new code on this one partition of the data in this Table so that everything’s up-to-date.” Or as you mentioned before, it could say, “Ah, you know what? It doesn’t matter actually because it’s a windowing function. It doesn’t apply to anything that’s more than 30 days old, so I’m just not going to rerun it.”
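A minimal sketch of that bookkeeping, assuming a hypothetical partition catalog that records the fingerprint of the code each partition was built with: when the transform changes, only partitions built with an older fingerprint, and still inside the relevant window, get queued for backfill. The catalog structure and the `still_relevant` rule are illustrative assumptions.

```python
import hashlib

def code_fingerprint(code_text: str) -> str:
    normalized = " ".join(code_text.split())          # ignore cosmetic whitespace-only edits
    return hashlib.sha256(normalized.encode()).hexdigest()

def stale_partitions(catalog, table, current_code, still_relevant=lambda p: True):
    """catalog: {table: {partition_key: fingerprint_it_was_built_with}}."""
    current = code_fingerprint(current_code)
    stale = []
    for partition, built_with in catalog[table].items():
        if built_with != current and still_relevant(partition):
            stale.append(partition)                   # needs backfill with the new code
    return stale

# e.g. a 30-day windowed model can skip partitions outside the window entirely:
# stale = stale_partitions(catalog, "daily_revenue", new_sql,
#                          still_relevant=lambda p: p >= "2024-06-01")
```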

Sean Knapp: Yep, exactly. Partitioning, I think, can tend to be hard for folks who aren’t familiar with big data systems to really get comfortable and familiar with over the course of time. Once you wrap your head around the concept that it’s really just a chunk of data, as you put it, it exists because in big data systems the data volumes are so large that it’s generally prohibitive to create indices and do record-based updates. We create partitions, which are our sort of coarse-grained notion of an index, and we do it by grouping data that’s very similar together, on the assumption that if we’re going to reprocess some of this data, we’re probably going to reprocess all of it because it’s so related to each other. As we wrap our heads around that, it’s that notion of, “Oh, well, yeah, I don’t have to actually traverse lineage all the way down to a record level, but I can trace lineage at this partition level because all this data’s heavily correlated, and as a result, now I can do all these automated things that give me these really neat behaviors. When my data was smaller and fit inside of a Database and was backed by indices, I had that kind of experience; I want that same kind of experience, but I need to be able to do it when my data volumes are thousands of times larger.”

Paul Lacey: Yeah, yeah. That’s great. Professor Sean in action. Thanks for taking us back, but that kind of spawns another question, Sean. Obviously with an incremental system, it would make sense for one of the partitioning keys, or the default partitioning key, to be the timestamp of the data when it came in, but are there other partitioning strategies that you might use? Would you repartition the data in the midst of a pipeline, for example?

Sean Knapp: Yeah, we definitely see customers doing that, and there’s a variety of different ways to do it. Partitioning strategy is this amazing art-meets-science part of Data Engineering. The two most common partition setups that we generally see are both on time. One is time of event, when something actually occurred. The second is time of collection, or time of retrieval or receipt, as those can be really, really different. The reason that matters so much is, generally, this classic problem of late-arriving data. Something may have happened 15 days ago, but it was on your mobile device and it just showed up today, or maybe it was patient health records and you’re getting a data drop from a healthcare provider, but they were just super late, and so now all of a sudden you have a bunch of stuff from last month, not this month, but you’re trying to do a bunch of Analytics and statistical analysis on it.

We see, first and foremost, a lot of partitioning around time, and oftentimes even the desire to do things like two-dimensional partitioning. It’s like, “Well, when did I get the data? And when did the event actually occur?” Then oftentimes you want to do some reshaping of that over the course of time. Usually, what we’ll see happen in a lot of pipelines is, think of the upstream. I know this is a podcast and I’m literally gesturing as if anybody can tell what I’m doing here.

In the upstream, the early life cycle of a Data pipeline, you’ll oftentimes see data partitioned not even by the event time but by the receipt time: when did things get collected? Oftentimes, you’re doing simpler things there, like, “Hey, I’m going to clean up my data. I’m going to run a bunch of data integrity checks. I’m going to filter out stuff that’s just noise. I’m going to figure out how to do a bunch of restructuring of that data.” You’ll find that those are what are called map operations, where you’re not trying to do heavy correlation of data yet. It’s just kind of the get-your-data-ready stage, and that you can generally do. If you’re not correlating records, you say, “Hey, I’ve got a block of new data. Let me just produce a block of processed data out of this,” and do that a bunch of times.

Then where you start to do what’s usually called reshaping is, “Hey, I have a bunch of this now. I’ve cleaned up my data. I have a bunch of statistics on it, and now I actually want to reshape it, and if I had some data that came in super late, I want to get that data in with the data that I already have. I have to reprocess that all collectively, because I’m doing correlation of that dataset. Cool, I’ll do that now.” That’s usually where you’ll see the reshaping and the move more towards event time versus collection time of data. Then oftentimes, what we’ll see in the latter life cycle, it starts to hit in that mid-stage of pipelines but usually shows up towards the end, is some need and desire for, in rarer cases, partitioning off of totally different dimensions. The reason I say it’s rare is that only a small percentage of the overall data sets in the world that are really, really, really large are ones that don’t have time components to them.

Generally, time’s one of the only things that continues to march on, and so data sets that are time-affiliated are the handful of data sets that are just guaranteed to always grow and have been growing for decades. Those tend to be the larger data sets, but there are fundamental reasons why you see other dimensions you really want to reshape your data around that aren’t even time-based. Oftentimes, it may be that you’re doing heavy correlated analysis on your data, but you’re an enterprise company and you’re building models for your customers. We actually have a bunch of our own customers that are enterprise companies that build models for their customers, and for each one of their customers, they’re not building unified, aggregated data models, they’re doing it just for that customer on that customer’s data. After you do a bunch of the same standard processing and cleaning of data, most of your code is really going to be operating on just one customer’s data set at a time, and you actually want to reshape that data and say, “Hey, just let me partition both by customer and timestamp,” for example, usually because you’re used to wanting to do timestamp stuff.

We’ll see that, and that tends to get injected mid-pipeline, and mostly down at the tail end of those pipelines you’ll see it a lot, because that’s where you start to get individual access on those data sets, and if the access patterns are only ever accessing a single customer at a time, you want to make sure that you partition that data so that when one customer is Querying that data, you don’t even have to look at all the other partitions. They can hone in on just the dataset that they need.
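A toy sketch of that mid-pipeline reshape, with hypothetical record fields: data that landed partitioned by receipt date gets regrouped by (customer, event date), so per-customer access only ever touches that customer’s partitions.

```python
from collections import defaultdict

# Repartition cleaned records from a receipt-time layout to a
# (customer_id, event_date) layout for per-customer, event-time access.
def repartition_by_customer(records):
    reshaped = defaultdict(list)
    for rec in records:                  # rec: {"customer_id", "event_date", "received_date", ...}
        key = (rec["customer_id"], rec["event_date"])
        reshaped[key].append(rec)
    return reshaped

# Upstream stages (cleaning, integrity checks) can stay keyed by received_date,
# since those are map-style operations; the reshape happens once correlation
# by customer and event time actually matters.
```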

Paul Lacey: Sure, and I think it’s worth calling out too, in the logical construct that we’ve got going here, that we’re talking about partitions with regards to the Data pipeline. This is not necessarily creating another set of data within Snowflake or Databricks or whatever kind of underlying platform you might be operating on. It’s really that the pipeline needs to know these things about the data and sort of catalog the data in a way that it knows where to go to perform these kinds of updates and things like that. Is that right?

Sean Knapp: Yeah, exactly, and this is where, as I mentioned earlier, we’re in this really cool stage of just starting to get more automation and AI-assisted pipeline development, as this historically has been all of the things that Data pipeline engineers in this third era of large-scale, highly optimized, highly efficient Data pipelines have always had to answer. There’s this series of questions I always like to pose: what do you do when new data shows up? What do you do when your code changes? What’s the correlation of the data? How do you incrementally process? These are all the questions we have to think about as Data Engineers, and then we write all the code to make sure that the data is correct and accurate as it reaches its end destinations. As we collect more and more metadata, as we create better statistical profiles on data, as we track better lineage, the systems themselves can get more and more intelligent around how to optimize the data through those incremental stages to really make those pipelines more and more efficient, and eventually it does go all the way out to end consumers accessing it as well.

But this is why, oftentimes, you think of pipelines as so different from Query behaviors, but to me, it’s two sides of the same coin. In many ways a pipeline itself, especially an intelligent pipeline, is just a proactive Query optimizer. It’s looking at the entire DAG and all the things you’re trying to do, and figuring out the most efficient way of optimizing that pipeline.
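A minimal sketch of that idea, treating the pipeline as a tiny DAG scheduler (the graph and node names are hypothetical): walk the graph in dependency order and recompute only the nodes whose code or inputs changed, letting staleness ripple downstream.

```python
# Proactive "query optimizer" view of a pipeline: given a DAG and a set of
# changed nodes, plan the smallest set of recomputations that keeps
# everything downstream consistent.
def plan_refresh(dag, dirty):
    """dag: {node: [upstream nodes]}, listed in topologically sorted order.
    dirty: set of nodes whose code or source data changed."""
    to_run = []
    for node, upstreams in dag.items():
        if node in dirty or any(u in dirty for u in upstreams):
            to_run.append(node)       # recompute this node...
            dirty.add(node)           # ...and mark it so downstream nodes see the change
    return to_run

# Example: only `clean_events` changed, so its descendants rerun; untouched
# branches do not.
# plan_refresh({"raw": [], "clean_events": ["raw"], "daily_agg": ["clean_events"],
#               "customers": ["raw"]}, dirty={"clean_events"})
# -> ["clean_events", "daily_agg"]
```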

Paul Lacey: That’s a really unique way of looking at it. I like it. We’ve got two things that came out of this, Sean, that I think are going to blow people’s minds a little bit, and we can certainly unpack more in future conversations that we have around this, so love it. Yeah, let’s go ahead and call it there, Sean. I know that we could continue talking about this for hours and we will. That’s the entire point of these conversations, but I think for right now the eras of data and Data Engineering, I’m going to really think about that over the next week or so and come up with some good conversation for you about that. Let’s double click a little bit more into what this new era looks like and we can go from there.

Sean Knapp: Yeah, let’s do it.

Paul Lacey: Super. All right. Well, thanks Sean, and thanks everybody for tuning in. This has been the DataAware podcast with Ascend, and we will see you next time.