Ep 25 – The Benefits Of Data Pipeline Automation

About this Episode

Join Sean and Paul as they talk about how an intelligent data pipeline controller brings traditional developer techniques to the world of data engineering. Learn about Sean’s framework for deciding how frequently you should refresh your pipelines, and hear whether streaming and batch data processing are converging. Plus, what is the definition of real-time? All this and more in this week’s episode!

Transcript

Paul Lacey: All right everybody, welcome back to the program. This is the DataAware podcast with Ascend. The podcast about all things related to data engineering and data pipeline automation. My name is Paul Lacey and I’m your host and I’m joined by Sean Knapp, who is the founder and CEO of Ascend. Sean, welcome.

Sean Knapp: All right. Thanks for having me again.

Paul Lacey: Yeah, it’s my pleasure, Sean. We would never not have you on this program, so you can feel assured that you’re going to be welcome back every week unless your schedule doesn’t allow for it. But we certainly appreciate you making the time. Yeah, and Sean, we had a great conversation last week about how you automate data pipelines and what kind of software stack you would need for that. We talked about the eras of data engineering, which was really cool. We kind of mixed it live on the show. We talked about how there’s the first era, which is where you just dump all your data in the warehouse and you query against the raw data. The second era is you go, oh my God, that’s too expensive. Let’s materialize some of these views so that we’re hitting the views and not the raw data. And then you go, oh my God, that’s expensive to completely rebuild the views all the time.

So then how do we do this in an incremental fashion, so when new data arrives, we can propagate it through and put it in the right places? And understanding the complexity there of now we’ve got a bunch of different slices of data that have been run through different versions of pipelines, if the pipelines ever change, if the logic changes. So how do you keep all that stuff straight? So just to kind of recap for listeners that maybe haven’t been along the journey with us the whole way, that’s kind of where we ended last week. And so this week, Sean… We talked last week about, man, there are so many cool things that you can do with a data pipeline once you have this intelligent controller in place that actually is aware of the data that is flowing through your pipelines, it’s tracking… It’s understanding the relationships between the data sets and the code that’s operating on the data sets and that kind of stuff. So maybe we can pick up there, Sean, and just talk through what are some of the unique things you can do once you have all this infrastructure in place?

Sean Knapp: Yeah, absolutely. And I think a lot of it comes down to this whole notion that when it comes to data pipelines and the data products that we produce with those, very little is actually static. There’s always things that are changing. We’re changing our code, the data is obviously changing. There’s new data coming in on a continuous basis. Otherwise, you wouldn’t need a data pipeline. You would just run a few scripts and call it a day. And so when we think about the really magical things we can do with greater levels of intelligence around this controller, it happens when it comes to things like, well, what happens if I did have new data come in? Where does that data need to go? I need to run it through different pieces of code that obviously is going to run as part of certain pipelines, but you start to get really interesting things that an intelligent controller can do.

For example, let’s say I have a bunch of new data that came in and I’m running it through a pipeline, but at a certain stage of the pipeline, the end result of the data for some reason didn’t actually change. Maybe because it was filtering things out and all of the new data that came in got filtered out. So the end result of that stage didn’t differ from before. A classic pipeline will just continue to process all of the data, keep pushing things downstream, issue a bunch more jobs, and reprocess a bunch more data as it goes.

An intelligent controller can say, “Hey, by the way, you ran a bunch of code on that data but it produced the same outcome. So I can rationally assume that if I run all of the other code on the data, and it’s the same code running on the same data as before, it’s probably going to produce the same output. So let’s just skip that whole thing, not incur a bunch of costs, not make the pipeline take really long. And we can just halt here and say that the pipeline’s done processing.” And so that awareness from a control plane lets you do things like that. It also lets you do things where let’s say I’m aggregating a bunch of data, a classic analytics report, say how many users clicked button X per day, by geography for example. Then I have a bunch of new data come in. There are things like the classic late-arriving data problem. A standard incremental pipeline that’s moving data through will grab a bunch of data, aggregate it by day, produce that result, put it in some optimized table that I can then go and run BI reports on or even feed other parts of my system.

But what happens, though, if I have late-arriving data? For example, I get a new partition of data, a new file shows up from a log server, and 97% of the data was from yesterday, but 2% of the data was from the day before, and 1% of the data was from the day before that. Sounds kind of crazy, but it happens far more than you would expect. And when you’re trying to run a pipeline on that, a classic approach would be, ah, just ignore anything that didn’t show up on the day where we’re processing it, or, just kind of cheat and count it in the day where we’re processing it instead of the day where it actually occurred. You kind of get this, ah, statistically it’s not that material. So maybe if we just kind of… We squint hard enough and it seems reasonable. An intelligent controller can actually do really smart stuff that says, “Hey, I know the code you’re running is operating on the data as if it were already oriented around the day when the event happened. So let me not just process and propagate that data through for yesterday, but because there’s new data that applies to the day before that, let me actually reprocess that day’s data too.”

And even the day before that, because that had data change too, but I don’t have to do it for all the other days, because I know intelligently that there’s only new data that applied to three days. And so a really smart control plane that keeps tabs on all this code and all this data can do these amazing things around data propagation, because it’s not doing anything magical that we couldn’t write code for. It’s just that it would be a lot of code. We would make similar decisions if we were inspecting every stage of every pipeline, but codifying those decisions, writing our own code to build a system like that, and doing it thousands of times over at every stage of every pipeline, that’s the prohibitive part. The control plane can apply the same logic that we as reasonable humans, if we were manually moving data through pipelines, would obviously look at and say, of course that makes sense.

That’s the right approach. So those are, I think, two really good examples where an intelligent control plane can do things that are prohibitively hard to do as a software engineer, as a data engineer, especially at scale, which we find data engineering teams just don’t do. It’s too hard, it’s too expensive from a time perspective to invest in.
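
To make the idea concrete, here is a minimal sketch of the kind of partition-level, fingerprint-driven reprocessing Sean describes, assuming a hypothetical cache keyed by event day; the function names and data shapes are illustrative, not Ascend’s actual API:

```python
import hashlib

def fingerprint(*parts: str) -> str:
    # Combine the code text and the input-data hashes into one stable key.
    return hashlib.sha256("||".join(parts).encode()).hexdigest()

def run_stage(stage_code: str, input_partitions: dict, cache: dict) -> dict:
    """Recompute only the day partitions whose (code, input) fingerprint changed.

    input_partitions: {event_day: hash of that day's input data}
    cache:            {event_day: (fingerprint, materialized_output)} from prior runs
    """
    outputs = {}
    for day, input_hash in input_partitions.items():
        fp = fingerprint(stage_code, input_hash)
        cached = cache.get(day)
        if cached and cached[0] == fp:
            # Same code, same data: reuse the prior result and skip the job,
            # the "output didn't change, so halt here" case.
            outputs[day] = cached[1]
            continue
        # Late-arriving data changed this day's input (or the code changed),
        # so only this partition gets reprocessed.
        result = f"aggregate({day})"  # stand-in for the real transform
        cache[day] = (fp, result)
        outputs[day] = result
    return outputs
```

With late-arriving data touching three days, only those three fingerprints change, so only those three partitions are rebuilt.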

Paul Lacey: And it’s also difficult to get human-level intelligence into any piece of software today. So there’s got to be some proxy for being able to do this. But as you mentioned, Sean, it can’t just be rules-based, right? Because if it was just simple logic-tree, rules-based type processing, it would be way too much code and way too brittle to be able to do all this kind of stuff. So it’s where we come back to that whole fundamental element of the fingerprinting of the data sets. So partitioning for sure, and I think that’s one of the things that is oftentimes overlooked nowadays, because everyone has for the most part forgotten that partitioning used to be a thing. Or maybe not forgotten, but they said good riddance, so happy we don’t have to worry about that anymore.

But from a pipeline perspective, you do kind of have to worry because you need to have these discrete packets of data and then you need to be able to fingerprint those packets of data and then fingerprint the code and then understand the relationship between those two fingerprints in order to be able to have the controller have a higher level of logic around what needs to be processed in the system, right? And I know we talked about that last time, but that’s kind of the foundation.

Sean Knapp: Yeah. It really boils down to the… In many ways, and not to get too geeky, I’m sure I’ll pull myself out of this in a bit, but if you think about it, there tend to be two different strategies. In many ways pipelines are about proactive caching of data. When you materialize a data set, you’re really proactively caching it. You’re saying, “Hey, I’m going to use this later.” And so there are two approaches. One is, as you’re moving data through a pipeline, you’re essentially taking a more imperative approach classically, which is, well, I’m going to mark the existing data as dirty, if you will. If we think of a classic cache-based model, I say, “Ah, well, I have new data that would apply to this thing because I analyzed it, so I’m going to mark it dirty so that it gets invalidated in the cache, and I’m going to go replace it later.”

And that becomes really hard and really, really brittle and very hard to coordinate over the course of time. And it’s one of the reasons why the approach we took at Ascend was a different approach, which is actually a hashing model. The fingerprinting, as you described, which is… In an oversimplified form, think of it as a really fancy tag for partitions in a pipeline. It allows us to say, “Hey, we can fingerprint a ton of stuff, and we just ask the system, do you already have all this stuff? Tell us the things you don’t have and we’ll generate those for you.” And so it allows the system to… Rather than constantly trying to check and mark things as dirty for replacement later, it’s a more reactive model: here’s the blueprint, based on the fingerprints, of the world that is supposed to exist from a data perspective, now go figure that out. That becomes a much more durable and scalable approach over time.

Paul Lacey: Yeah. And it makes sense too from a feasibility perspective as well. You can compute SHA hashes very quickly. You can compare them. They change if the contents of something change, right? So if you change your code or if you change your data, the hash is no longer going to match what was previously fingerprinted and stored in the metadata. So it allows you to build this kind of huge-scale system without having to build the huge scale of software around it, right?

Sean Knapp: Yeah. Yeah. Totally.
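
A rough sketch of the “tell me what you don’t have” model they’re describing, with hypothetical helper names: the planner declares the fingerprints that should exist and diffs them against what is already materialized, rather than imperatively marking things dirty.

```python
import hashlib

def sha(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def plan_missing_work(code: str, input_hashes: list[str], existing: set[str]) -> list[str]:
    """Declare the set of fingerprints that *should* exist for this stage,
    then return only the ones the system doesn't already have."""
    desired = [sha(code + ":" + h) for h in input_hashes]
    return [fp for fp in desired if fp not in existing]

# Usage (illustrative): only the partitions whose code or data actually
# changed come back as work to do.
# to_build = plan_missing_work(transform_sql, partition_hashes, existing_fingerprints)
```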

Paul Lacey: Yeah, so that’s a great innovation in and of itself to think about, and so that’s great. So getting back to some of the stuff that we were talking about around the benefits. So a couple of benefits that I heard were, you can auto-reprocess and repartition your datasets on the fly without touching the entire dataset. Super helpful. I don’t think there are many, if any, data pipeline controllers on the market today that can do that at scale without a lot of manual intervention, as you mentioned, right? For a lot of folks, if you don’t know what’s in the dataset, you kind of just have to reprocess the whole dataset, and you wouldn’t even know that, “Hey, I just reprocessed 99% of data that was the same at the end of the run versus the beginning of the run.” You just know that there’s a trigger: I got a new data set, I’ve got to re-execute these pipelines, that kind of stuff. That makes a ton of sense. I think the other thing you might’ve mentioned or kind of touched on, Sean, was the ability to stop and restart a pipeline in the middle of a pipeline run. That’s pretty unheard of for a lot of folks, right?

Sean Knapp: Yeah, totally. And you’re right. That’s what I’d say is a third great example of this intelligent controller model, which is, as your system is using these fingerprints to check that you have the right data in the right place, you can actually quite reasonably stop a pipeline anywhere and resume it, because you can just pick up where you left off before and go back to your same checkpoints with your same integrity checks. And this is something that in a classic pipeline world just has never existed, and people haven’t been able to solve for this. And the classic approach to it has generally been, I don’t really know what was in there. The pipeline partially ran, so some of my data is probably sitting where it needs to be. Some of it’s not. It introduces this whole world of, well, do I have to delete that data? If I try and write it, am I going to end up with duplicate records?

All sorts of very stressful things to have to worry about. Whereas with a control plane, the system has the ability to say, “Oh yeah, well, we stopped halfway, but those things actually passed their integrity checks and that stuff was committed.” Literally, I could tear down and shut down every single server, reboot them all back up, and they can inspect the state and just pick up exactly where they left off. That’s really neat. That’s one of the ones where, from a developer perspective, when you hit a break, you don’t have to worry about, well, how do I go troubleshoot and triage my whole system, spending 80% of my time worrying about the data integrity piece versus what broke and how do I get it rolling again? Instead, you can just focus on, ah, let me go fix the data logic or fix the data that broke the pipeline, and neatly pick up where you left off.

Paul Lacey: And that’s one of those software engineering concepts that is very ubiquitous for anybody that’s used to writing code that operates on other types of systems, right? But it’s something that’s super hard to do in the data world because of the inertia, or maybe the gravity, of data and just the sheer massive volume of these data sets. So if you’re a developer working in Java, JavaScript, something like that, inserting a breakpoint in your code is super cheap, right? You can stop the system instantly, you can inspect variables, you can do whatever you want, then you can restart it, because all this processing and memory and the data sets that it’s processing on are very small in general. But with a big data mindset, it’s like, oh my god, it’s so expensive to stop a pipeline in the middle and say, “What does the data look like at this point?”

And then there’s the ability to checkpoint the successful steps that have run and persist those data sets somewhere in the data plane underneath, so that you can then go back and say, “Okay, great. Now let’s restart from this point forward, because we know we persisted and materialized all the results of each prior transformation, and so now we’re just picking up from this point onwards, and now we’re going to start rolling this through, and we know exactly which partitions are being affected because of the fingerprinting, blah, blah, blah.” We can do all this stuff, which brings regular software development techniques back into the repertoire of a data engineer, and you don’t have to develop on really scaled-down data sets anymore, right? You might even be able to develop on production-sized data sets.
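
As a hedged illustration of that stop-and-resume behavior, here is a small sketch; the stage list, checkpoint store, and verify callback are hypothetical stand-ins for whatever the real control plane persists:

```python
def resume_pipeline(stages, checkpoints, verify):
    """Re-run a pipeline after an interruption, skipping every stage whose
    output was already committed and still passes its integrity check.

    stages:      ordered list of (stage_name, run_fn) pairs
    checkpoints: {stage_name: committed_output} persisted in the data plane
    verify:      callable(output) -> bool, e.g. recompute and compare a fingerprint
    """
    upstream = None
    for name, run_fn in stages:
        committed = checkpoints.get(name)
        if committed is not None and verify(committed):
            upstream = committed           # already done, pick up from here
            continue
        upstream = run_fn(upstream)        # recompute only from the break onward
        checkpoints[name] = upstream       # commit the new checkpoint
    return upstream
```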

Sean Knapp: Yeah, totally. One of the things we were talking about in one of our internal meetings yesterday was, man, the ability to even do smart things like reuse data across multiple data pipelines. Once you have an intelligent control plane in place, it can say, “Hey, I know technically you’re operating on totally different pipelines that may look entirely different,” but when you go, I want to branch my production pipeline and I want to tap into these data sets, the system can say, “You know what? I’m actually watching all your data. I’m fingerprinting the code and I’m fingerprinting the data, and I understand the correlation between these. And you just branched the pipeline that Doug in the building next to you is working on too, and 98% of the code in your pipeline is the same as Doug’s code. And you know what? We should just be able to reuse that.”

You shouldn’t have to reprocess that data, you shouldn’t have to duplicate your storage. You should actually just be able to run that whole pipeline and tap into the same originating data. Same thing goes for… Some companies will even reuse dev, staging, and prod data, and we have a number of customers that do that, while others who want further isolation will at least reuse staging and production data and be able to actually point and reference them to the same originating data sets, but safely, so that if staging fails for some reason it doesn’t inhibit or corrupt any of the data in production. And so you can do all of these really cool things around reuse of data, which, as you highlight, is really quite cool and magical.
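
A simple sketch of that reuse idea, under the assumption that stage outputs are stored content-addressed by their fingerprints; the class and method names here are made up for illustration:

```python
class SharedMaterializationStore:
    """Content-addressed cache shared across pipelines and environments.

    If a branched pipeline, or a staging environment, asks for a stage whose
    (code, inputs) fingerprint already exists, it reads the existing result
    instead of recomputing it or duplicating storage. Because writes are keyed
    by fingerprint, a failed staging run never overwrites production's data.
    """

    def __init__(self):
        self._by_fingerprint = {}

    def get_or_compute(self, fingerprint: str, compute):
        if fingerprint in self._by_fingerprint:
            return self._by_fingerprint[fingerprint]  # reuse the existing result
        result = compute()                            # only runs on a true miss
        self._by_fingerprint[fingerprint] = result
        return result
```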

Paul Lacey: I can’t imagine any data engineer listening right now not getting a little bit excited about the potential here, Sean, to make their lives a lot easier and also to unlock new levels of experimentation on their pipelines as well, right? That’s the fun part about being an engineer: run an experiment, see what happens, get feedback quickly, go through that loop several times, find some novel way of solving an intractable problem, then push it into production and watch the ripple effects of it going across the organization. And that’s kind of why almost anybody gets into the profession in the first place, but then being able to actually do that at scale is game-changing, I think, in this particular industry.

Sean Knapp: Totally agreed. And that’s why I think from a software perspective, we all oftentimes think about the… Well, I’m used to being able to branch code, and I can operate on my own code and nobody’s stepping on my toes because it’s my branch. I’ve got a little local copy of it and it’s great. But how you get that in a data world is so hard, because, you used the appropriate term, there’s gravity. Code can move and it runs on things, but you can easily move it around. Whereas with data, if all of a sudden you want to replicate somebody’s pipeline, more often than not you have to go actually copy all of the originating data and reprocess all of it, or some subset of it if you’re in a dev setup.

And that takes time and it’s expensive and it’s slow. And when you’re trying to scale that to one, to two, to 10, to a hundred, to 500, to a thousand developers, that can get really prohibitive. But part of the magic of how an intelligent control plane works is to say, “Hey, actually we don’t have to rerun all of that,” because you didn’t actually change the code that ran on the data, so we don’t have to rerun it. And so it gives the same benefit that we see from software and Git and branching to data, which nobody’s been able to figure out how to really crack wide open yet. And that’s one of the things I get really excited about.

Paul Lacey: Yeah. Would you go so far, Sean, as to say that this is about the separation of code and data? Is that a concept… We saw the cloud data players make a big deal about separation of storage and compute because they were two closely coupled things in all the previous big data infrastructure. Would you say the same is happening with data pipelines?

Sean Knapp: I see what you’re doing and I like it, and yes, I do. So I think the short answer to your question is yes. And I think one of the really magical things about this is that in many ways, it’s making the two aware of each other but able to actually operate independently. And the fundamental challenge that we’ve seen historically was you either had your data sitting somewhere and you’re just querying it, and you end up with a jumbled mess of code reading against that data, or you came from more of a pipeline world and you were very code-centric and task-centric and pipeline-run-centric, and the data was a complete afterthought. You just ran it and you produced some data, and there was an assumption that the data existed somewhere and that it was perfectly fine or valid, or you wrote code around it.

But this is where we get to actually say, “Hey, these two can operate and there’s an intelligence layer that knows how to actually separate those two.” And say, “Hey, you can change the code and we’re going to inspect the code and understand its relation to the data,” completely independently. And vice versa, we can look at the data and understand its relation to the code, completely independently. And the separation of those two from a logical perspective I think makes a lot of sense.

Paul Lacey: It opens up the ability for people to decide when and where they want to reconcile those two things, right? Right now, we’re kind of stuck. We have to force a manual reconciliation on a global scale by rerunning these pipelines and orchestrating these really sophisticated interactions between different pipelines and whatnot, and dependencies and stuff like that. Versus, if you want the system to automatically reconcile all the time, you can. Your data sets are automatically all the way up to date. If you want to pause specific pipelines so that you can do a lot of work and then see what’s happening up until a certain point and then let it reconcile for the rest of the system, you really get that flexibility, right? In terms of how and where you want that to happen.

Sean Knapp: Yeah, absolutely. Absolutely.

Paul Lacey: That’s amazing. It sounds like some great benefits to building data pipeline automation controllers, right, Sean?

Sean Knapp: Good thing. So clearly.

Paul Lacey: Who wouldn’t want to do this?

Sean Knapp: I think we’re increasingly entering, obviously, this era of intrigue and excitement around not just automation, but artificial intelligence and the things machines can do for us. And I think even in our little pocket of the world around data pipelines, there’s this sheer excitement now around, I’ve got to go build a lot more pipelines and I’ve got to go feed more things than ever before with these, and the systems are getting so smart that the monotonous things, I really can now trust them to take over for me. And not in full autopilot fashion yet, right? There’s no FSD, Full Self-Driving, for data pipelines yet, but we’re getting increasingly more advanced. Maybe we’re in intelligent cruise control, auto-steering territory, and I think we’re going to get more and more sophisticated as we really start to blow those doors open around how much metadata we can collect and the things we can do with it.

Paul Lacey: Yes, absolutely. It’s a great time to be a data engineer and it’s really a great time to be working at companies like Ascend that are cracking this problem wide open, right? And offering some of these solutions to folks so that they can go and do what they want to do, right?

Sean Knapp: Absolutely.

Paul Lacey: One other thing I wanted to touch on, Sean, was the… So after our conversation last week, which was highly instructive in terms of a new mental model for thinking about how you want to work with data, we talked about the three eras and we ended with the incremental era, which is very new for a lot of folks. The whole industry is trying to wrap their heads around it. A lot of people are thinking about wanting to do this but don’t know where to start, which is great that we’re able to have shows like this where we basically lay out the framework for how you do things like this. But one of the things that I was noodling on afterwards, Sean, is that it seems somewhat related to what we used to have as completely separate things, which is stream data processing and batch data processing. And there were wildly different technology sets, because one of them had to operate in [inaudible 00:25:09] real time, or whatever your “definition of real time” is. That’s a really fun thought exercise over beers: get a bunch of engineers together and ask, “What’s the definition of real time?”

Because at some point, it’s just a micro-batch, it’s just a millisecond-duration micro-batch versus a subsecond micro-batch. But a lot of what people have to do in the streaming world is keep track of the data, the packets of data, as they’re flowing through the system, keep track of their status in relationship to the various stages of processing that are happening on them, understand, if code is changing, which ones have been operated on by the previous code versus which ones have been operated on by the current code base, that kind of stuff. Then ultimately find a place to land them at rest, all those little packets of data. It starts to seem like the batch world is moving towards that kind of a paradigm as well with the incremental data processing. So in your mind, is there a difference anymore between streaming and batch data processing, or are those two things converging?

Sean Knapp: I think they’re converging. I don’t think they’ll ever actually fully converge. I have seen… I think we went through the hype cycle of streaming. And honestly, it was my second experience doing that as the last company I was a founder and CTO of… We built up a huge analytics advantage over the course of eight years and had a huge data engineering team and we’re tracking, if I recall correctly, it was something like 4 billion new data points per day coming into the system. Something ridiculous like that, if I recall correctly. And we had this really neat real time dashboard of somebody in some random part of the world pops on, and within a couple of seconds you get to see their blip on the radar of live viewers on your dashboard. And it was really, really neat. And when we decided then to… All right, we’re going to go really hard on real time, because our customers are super excited about this too.

Some of the product team went and did a bunch of research with customers. They’re like, “All right, let’s go through all the things that you do.” And if you ask everybody, “Hey, do you want this visualization, this report, this visibility in real time?” the answer is always, “Well, of course, yes.” The harder part is when we started to go through and dig deeper and deeper, a lot of the questions we’d start to ask would be, well, how often are you going to look at it? Is a machine going to respond to this input or is a human going to respond to it? And what’s the response time for the human to do something? And so what we started to notice was that a really large percentage of the work was, hey, somebody’s going to look at this once a day or twice a day.

And it’s a really deep-dive analysis into something, mostly in the analytics space. And yeah, well, real time would be really neat because it’s novel and super nifty. Maybe we’re going to get to the point where somebody’s going to look at it a few times a day, something like that. And then what we found was that the real-time streaming things were, hey, it’s a newspaper that’s putting video content up on their website and they really want to know as soon as something starts to trend so they can put more behind it, but there’s an automated system, like a machine learning engine that’s recommending things, and it will react much faster. And the reason why I think that’s interesting and very educational is, when we think about classic batch versus classic streaming, one of the challenges is, most use cases today, I think folks have found, don’t require subsecond streaming.

Most don’t really require sub-minute streaming. There are obviously use cases in banking and finance and so on that do, and they make a ton of sense. But really deeply analytical style workloads, data correlation workloads, generally don’t. And so the reason why I think that matters a bit is, when we think from a use case perspective, everybody would love everything real time, because it’s cool and it’s neat and why not? And the challenge then becomes, well, how much absolutely has to be real time? There are very clear use cases for real time, but the majority of use cases, I would contend, don’t have to be real time. So then the question becomes, well, why not make it real time if it’s neat and cool and beneficial? What are the costs of doing something real time? And it usually comes down to the really basic question of what are you doing with your data?

If it’s literally, I have a little piece of data about something somebody’s doing and I’m going to look at it in full isolation from all the other little bits of data that I have, sure, make it real time, because it’s really, really easy from a physics perspective. It can propagate through a couple of machines. Those machines don’t have to ask other machines for other pieces of data to correlate it. They can just deal with it in isolation and keep moving it through. Perfect example. What tends to happen is, the more the thing you’re doing with your data requires correlation with other bits of data, that’s where we start to get pulled more and more into a batch world. Because at some point, if the domain of data that you’re trying to pull in to correlate with every new bit of data that’s coming through is large, and you’re requesting all sorts of other data and you want to deal with it as a group, that starts to get really, really hard to do in streaming, because the laws of physics basically say, well, shoot, you need a ton of servers to store all this stuff in memory, or you need a ton of servers to be really close.

So you can ask for all of these pieces of information, but the workload you’re creating across all these machines is much larger, and you’re still trying to propagate something in subsecond, or sub-five- or ten-second, time. And the amount of work required to fulfill that real time while pulling in all these adjacent pieces of data, that gets really hard and expensive. And so as a result, that tends to be a lot of the reason why not to use streaming. It’s just hard, and it doesn’t mean it’s not solvable, but it’s generally solvable with heavily provisioned scale and infrastructure that is then very expensive. So that’s why we see… And expensive from an infrastructure resources perspective as well as a developer resources perspective. And so why I think that becomes really interesting is… What we’ve started to see, the convergence piece that I think is really interesting and mostly happening in the batch world, is, well, all right, I don’t really need this in sub-ten-seconds or even sub-minute.

It’d be really neat if I could get this every 30 minutes, or every 15 minutes, or every five minutes. And so that’s introduced this world of micro-batch, where the batch systems are getting faster and faster and doing really smart things like, well, let’s partition data so we can get data that’s heavily correlated with each other, and then move it through a pipeline all together, because we can efficiently batch process correlated data. And so we’re going to move them through in these little mini blocks rather than one record at a time. We’re going to move them through in a hundred thousand records or a million records at a time, and we’re going to propagate them through faster and more efficiently. And the things that we might likely correlate that data with, they’re all going to already be together. So we’re moving them in a big block across the machines.

So that’s where I think what we’re seeing is that incremental propagation for large-scale efficiency is also now paving the way for, well, shoot, I can get a system that’s pretty darn efficient at incrementally propagating small blocks of data, so let’s do it faster and let’s get more throughput through. And that actually seems to be, I think, the best harmonization of these. I don’t think it’s going to take over streaming per se. I think the streaming use cases are very well protected, and at least for the next five to 10 years we’re probably not going to see those radically change. But I think we’re going to find the… Well, if you really want a dashboard that’s going to refresh every 5, 10, 15 minutes, we’re going to find really intelligent automated pipelines that know how to do this incremental piece really efficiently. And I think that’s going to see a huge surge.
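
As a rough sketch of that incremental, interval-driven refresh, assuming a hypothetical partition object that carries its own fingerprint:

```python
import time

def microbatch_loop(list_partitions, process_partition, refresh_minutes=15):
    """Every N minutes, propagate only the partitions that are new or changed,
    rather than streaming record by record or rebuilding everything."""
    seen = set()
    while True:
        for partition in list_partitions():
            if partition.fingerprint not in seen:     # new or changed block
                process_partition(partition)          # incremental propagation
                seen.add(partition.fingerprint)
        time.sleep(refresh_minutes * 60)
```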

Paul Lacey: Totally. So taking away the performance penalty of frequent data processing, but as you said so eloquently there, Sean, also not doing it at the kind of hyper-scale, hyper-frequency of streaming, because there would be a performance penalty for doing that, right? There’s got to be some sort of a bell curve here, even with incremental processing, of the frequency of processing versus the expense of doing that on a regular basis. And maybe there’s a sweet spot somewhere in an organization to say, “Yeah, doing this every night is definitely not functional anymore for the business, because we have people who are operating during normal business hours who need to make decisions on this data as it’s coming in,” and possibly even tweak things within an hour SLA or so of something happening. But at the same time, we can’t afford to, nor should we, pay for that person to know what happened five seconds ago out there in the world, because they’re just going to be making a decision every 30 minutes when they refresh their dashboard.

Sean Knapp: It would be a very stressful job to basically be the intern who’s just staring at the screen waiting for a light to go green or red. I mean, there are stressful jobs like that. They happen on oil rigs, where it’s, say, we need a real-time streaming system because all of a sudden some IoT sensor is going to trigger, and if we don’t stop this, there’s millions of dollars of damage to the rig, or we risk human safety. And those people generally have very stressful jobs and they’re very, very good at that. I’d say for most of us just sitting there in an office looking at a dashboard periodically, where it’s not a machine responding, the response times, I think… We’re not quite as strict.

Paul Lacey: Makes sense. Yep. And I’m glad I don’t have one of those jobs, Sean.

Sean Knapp: True to that.

Paul Lacey: Yeah, it makes me feel better about sitting here in the safety of a home office. Just looking at how many people are reading about data pipeline automation every day does not necessarily need to be a sub-millisecond type of operation. So thanks for clearing that up.

Sean Knapp: Yeah. Absolutely.

Paul Lacey: Very instructive. Awesome. Well, thanks again for all this, Sean. This has been great context and I do feel like we are really starting to wrap our arms around what a data pipeline automation system is, which is great. So thanks for that. I think there’s definitely more ground to cover and definitely a lot more territory in terms of some of the newer technologies that are being folded in. Things like semantic modeling and whatnot are very interesting threads that we can pull on there and some very interesting threads when it comes to how this now integrates with some of the AI movement that we’re seeing take off in the last 12 months. So I’m looking forward to diving in on those kinds of concepts and those topics in the next couple of episodes here, Sean. But thanks for setting that foundation. I think this has been really great for our listeners. I appreciate your insight there.

Sean Knapp: Awesome.

Paul Lacey: Super. All right, and thanks everybody for joining. We’re going to go ahead and sign off, and this has been the DataAware podcast by Ascend. Please subscribe on your favorite podcast platform so you don’t miss an episode. And we will see you next week.

Cheers, everybody.