Ep 12 – Orchestrating Data for Success with Automation

About this Episode

We’re back with a new season of the DataAware podcast! In the first episode, Sean and I had the chance to chat about one of the foundational principles of data engineering: data orchestration. What is it, who needs it (spoiler alert: everyone), and how do data engineering organizations, and data teams overall, need data orchestration to evolve over the next few years?

Transcript

Leslie Denson: Data orchestration, the often unsung hero of data engineering and data pipelines. Sean Knapp and I are back with a new season, and first up is a deeper discussion on data orchestration and why this often less-desired portion of the data engineer’s day is simply so critical, in this episode of “DataAware”, a podcast about all things data engineering. Hey everybody, we are back with another episode of the “DataAware” podcast. We know it’s been a while. We hope you guys have missed us as much as we have missed you. I am here today with Sean Knapp. Hello, Sean. 

Sean Knapp: Hello. 

LD: Hello, I have missed just having these chats and these conversations.

SK: I have really missed these, too. I feel like we’re doing these Zoom calls all day long, but these are actually a way more fun way of doing it.

LD: I know, and we don’t. Again, we have them, we just don’t record them. Which maybe we should, and we’ll start doing this again.

SK: I think that is a good idea. 

LD: Yeah, I think so, too. So super, super excited to kick off what we are calling season two of the “DataAware” podcast. So stoked, you guys will be hearing some really great content, we have some awesome stuff planned for the coming weeks. What we are starting with is a topic that is near and dear to us here at Ascend and something that, as we were talking about “season two,” we realized we hadn’t really talked about much during the first series of episodes, and that’s data orchestration. Which is really, A, a core piece of what we do here at Ascend, and, B, a core piece of data engineering overall. So we’re gonna dive into that and get into the nitty gritty of data orchestration today. So Sean, I’m gonna turn it over to you to explain what likely most people who are listening to this podcast already know, but for those who may not, like my parents who listen to this podcast ’cause they like hearing me talk: what is data orchestration?

SK: Happy to. And similarly, my parents also listen to this podcast, so, hi, Mom. 

LD: Aww.

SK: I know, right? I think there’s a lot of literature out there and a lot of people are talking about what data orchestration is or can be and so on. And to demystify it a little bit, I do believe it is helpful to boil it down to some of the basics, which is: look, in the data ecosystem, we have incredibly large volumes of data across a lot of different tools and systems, and we have a lot of these engines that can process data and create these new derived insights and valuable, refined data sets for us. And orchestration, at a pretty simple entry point, is the thing that tells those systems how to process the data.

LD: Right.

SK: We hear a lot about ETL and ELT. We hear about reverse ETL these days. We hear about data replication, ingestion and so on. As we have these systems that store and process data, that move data around, the orchestration methodologies and the orchestration systems are the things that help us automatically tell those systems what to do, ideally as optimally and efficiently as possible. 

LD: That makes sense to me at least, and I’m sure it makes sense to everybody out there. Are there different steps in the orchestration process? I’m going to assume, based on what I know about data engineering now after a little over a year here at Ascend: you ingest data in, you transform it, and it goes out to somewhere, to be used in BI tools, or in AI/ML platforms, or wherever you want to use data. But the idea, to your point, behind orchestration is: do you want this run daily? Do you want it run hourly? There are all these different kinds of things that you can be doing to orchestrate how all of this is happening. Do you want this run before this? However that is. Are there steps that we should be aware of? Is there a process behind this that people should be aware of? Are there just different things that people should think about as they are thinking about the orchestration process? 

SK: I think definitely yes. And trying to at least demystify some of this at the start: I do believe that it crosses whether you’re doing ETL or ELT, and any time you’re, frankly, going to do something with your data in some way, you’re going to run some code on your data, you’re going to do orchestration. And I think there are various levels of sophistication that people work their way through, which is sort of helpful, and you graduate from one level to the next as, frankly, your needs continue to climb. In really basic steps, oftentimes people are generally starting to do orchestration with what we would call a timer-based model, or even, for those of us who long for those simple old days, a cron. [chuckle] Where you just set a cron job and it’s every day at 2:00 AM, do X. 

SK: And X may literally be: read this data out of my warehouse or my data lake, and for example, let’s say it was your website visitors. And it is: count up the number of website visitors by browser or by geography, sum those up and write it into a warehouse or a database table that powers a BI tool. And just do that once a day, so that when everybody comes in the next morning to see the website traffic, that visualization is ready in their BI tools. This is the joke of… Oftentimes I tell folks, big data oftentimes starts as just really large pivot tables, and we are counting things up and slicing and dicing in different ways, but conceptually, not too different. And that’s really, by and large, what basic orchestration in a pipeline is: a timer-based one. 
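To make that timer-based pattern concrete, here is a minimal sketch of the kind of job a cron entry such as `0 2 * * *` might kick off each night. The table and column names are hypothetical, and the SQLite connection simply stands in for whatever warehouse or database the pipeline actually writes to.

```python
# Minimal sketch of a timer-based orchestration step: a script that a cron
# entry like "0 2 * * *" could run every night at 2:00 AM.
# Table and column names are hypothetical placeholders.
import sqlite3
from datetime import date, timedelta

def aggregate_daily_visitors(db_path: str = "analytics.db") -> None:
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    conn = sqlite3.connect(db_path)
    try:
        # Count yesterday's website visitors by browser and write the result
        # into the reporting table the BI tool reads from.
        conn.execute(
            """
            INSERT INTO daily_visitors_by_browser (visit_date, browser, visitors)
            SELECT visit_date, browser, COUNT(*) AS visitors
            FROM raw_website_visits
            WHERE visit_date = ?
            GROUP BY visit_date, browser
            """,
            (yesterday,),
        )
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    aggregate_daily_visitors()
```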

LD: Yeah. It’s funny, I think a lot of people in different job functions maybe don’t realize how much data engineering work and orchestration goes on behind the scenes of what they just natively see on a day-to-day basis, even just in tools that they’re normally using. You say this and you’re talking website visitors, and I think back to something that used to drive me batty. We do a lead gen meeting every Monday. And one of the reports that I used to present, that we’ve since swapped for another report, was, you will remember, a report coming out of Google Search Console, and it used to drive me nutsy, because I would pull it every Monday morning and it would always say, “Last updated four hours ago.” And I’m like, “But I want it updated now. Can I have it updated now?” And it’s like, no, no, the orchestration behind it had it updated at, I think at that point, 2:00 AM Pacific Time. Whatever time it was updated, it was updated at that point. And that was when the massive orchestration tool behind Google Search Console had things updated. And that is just the marketer’s day-to-day life cycle of pulling their reports. In every tool, there is something like that being done, so, yeah, it’s interesting, you don’t always think about it, but that’s the work that’s being done behind the scenes. 

SK: Well, I think that’s a really cool segue to some of the other things, too, that we see in orchestration, which is that oftentimes most people, certainly even the monolithically large, incredibly successful tech companies, will still do things that largely follow cron or timer-based trigger models. Oftentimes, because you have big drops of data at a particular time, all of your data is up and correlated properly, so you can just do a big block batch of it for the previous day, and it tends to be fairly efficient. Then what we start to see over time is people… You start to get two new levels of complexity tied to orchestration. One is trigger-based and the other is related to that, which is dependency-based. This is the notion of, well, what if I want to orchestrate my data faster, or what if I have multiple things I’m trying to do as steps along the way? 

SK: So trigger-based models are usually the “when some other external event happens, start my orchestration process” kind, and it’s usually… Think of something that starts at the far left of a graph, which we’ll talk about in a second. The second thing that we see, as I mentioned, is the dependency-based model, and this is really where we start to see more advancement. I would call it the 200-level model of data orchestration, which is really the: what happens when I have this chain of dependencies that I need to manage together? For example, this first thing happens, but then I need to have these other three things happen afterwards and they’re interdependent. And so this is what’s often called a DAG, or a directed acyclic graph. Which is basically a bunch of steps that don’t loop back on themselves. So it always goes… It’s usually drawn left to right or top to bottom, and it doesn’t loop. And the idea behind this 200-level notion of orchestration is: follow the chain of dependencies, and so long as a previous step is successful, we can continue on to those downstream, subsequent steps. This is where we start to get some really interesting pipelines being built, as they can follow these dependency chains of a directed acyclic graph. 
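For readers who like to see the idea in code, here is a minimal, hedged sketch of that dependency-based pattern: each step declares what it depends on, and a tiny runner executes the steps in topological order so nothing runs before its upstream steps have finished. The step names and bodies are hypothetical placeholders; a real orchestrator would add retries, failure handling, and scheduling on top of this.

```python
# Minimal sketch of dependency-based (DAG) orchestration: each step runs only
# after all of its upstream dependencies have completed.
# Step names and bodies are hypothetical placeholders.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def ingest():     print("ingest raw events")
def clean():      print("clean and deduplicate")
def aggregate():  print("aggregate by browser and geography")
def publish():    print("publish to the BI-facing table")

steps = {"ingest": ingest, "clean": clean, "aggregate": aggregate, "publish": publish}

# Map each step to the steps it depends on. Because the edges never loop back
# on themselves, this graph is a DAG.
dependencies = {
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "publish": {"aggregate"},
}

# Run steps upstream-first; a real orchestrator would stop or retry on failure.
for name in TopologicalSorter(dependencies).static_order():
    steps[name]()
```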

LD: Is there a world in which a company would do this but not have orchestration? Not orchestrate their pipelines, or not orchestrate a subset of their pipelines? And just say, “Nope, we’re gonna run this once and then we’ll come back and run it manually as we want to.” I’m just trying to think of what the use cases might be on kind of the “anti-orchestration” side and what the benefit of that would be, and then what the pain points might be that would make those people move from doing that to going, “Okay, never mind, we actually do need to orchestrate this stuff.” 

SK: Sure, yeah. I mean oftentimes, you will see teams start with, “Hey, we’re just gonna take all of our data and we’re just gonna put it into a data warehouse.” That’s usually where you’ll see teams start. And usually it’s tied to two use cases or patterns. One would be, “Hey, we’re a data science team, our data itself doesn’t change very often. For example, it’s patient health records, we’re doing a bunch of research, we just got a huge drop of data for last month, and we’re just looking for patterns.” And as you go and try to productionize your insights and whatever models you start to develop, then you might want to orchestrate, so it’s a continuously updating system. 

SK: But at the start, you’re probably okay with static data. The other pattern we see oftentimes comes from the analytics side of use cases. Which is, “Hey, I have a bunch of data. I’m literally just gonna drop it straight into Snowflake or BigQuery or my data warehouse. And I’m gonna dump all of that raw data in, and I’m gonna run my BI tools directly off of that raw data.” And you’ll see that happen a fair bit, and honestly, it keeps your world fairly simple, and so long as you’re not hitting cost or performance issues, it’s probably fine. And that’s what we actually see happen a lot: people just start… They’re dropping that data in and they’re running their BI tools, and honestly, what generally starts to happen is they either get expensive or they start to get slow because the data volumes increase, or they want more canonical models that they can build on top of. 

SK: And that’s really the point where you start to see that introduction of data pipelines: teams are looking at, “Hey, I’m doing the analysis of how many users with this language in their browser came to our website every day, and gosh, it feels like we’re reprocessing the same data over and over and over and over again. Shouldn’t we be able to just compact that down and only take the small incremental new data and analyze that, and get a more efficient system?” So that’s actually the second, usually the analytics-style use case. And then I would say the third that we see… The third driver of why people build pipelines is, look, you have a bunch of different teams. They all have their own data systems, and this starts to get a little bit more into that world of data fabrics and data meshes. Which is… Look, our data science team is rocking it over here on Databricks, and our analytics team’s over here, and they’re rocking it on Snowflake. And they need to go share their data, and we just need data pipelines to actually replicate that data across, because they don’t know which systems they want to use, or they want to fundamentally use different systems, and so we need to make sure that we can at least connect the data so that they can share it across those teams. 
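That “only take the small incremental new data” idea is worth pinning down with a rough sketch. One common way to do it, shown below with hypothetical table and column names, is to keep a high-water mark of what has already been processed and only aggregate rows that arrived after it, rather than rescanning the full history on every run.

```python
# Minimal sketch of incremental processing with a high-water mark: each run
# only aggregates rows that arrived since the previous run, instead of
# reprocessing the entire history. Table and column names are hypothetical.
import sqlite3

def process_new_visits(db_path: str = "analytics.db") -> None:
    conn = sqlite3.connect(db_path)
    try:
        # The latest point we have already processed (the high-water mark).
        row = conn.execute(
            "SELECT MAX(processed_through) FROM pipeline_state"
        ).fetchone()
        watermark = row[0] or "1970-01-01T00:00:00"

        # Aggregate only the new slice of data and append it to the summary.
        conn.execute(
            """
            INSERT INTO visits_by_language (visit_date, browser_language, visitors)
            SELECT visit_date, browser_language, COUNT(*)
            FROM raw_website_visits
            WHERE loaded_at > ?
            GROUP BY visit_date, browser_language
            """,
            (watermark,),
        )
        # Advance the high-water mark so the next run skips what we just did.
        conn.execute(
            "INSERT INTO pipeline_state (processed_through) VALUES (datetime('now'))"
        )
        conn.commit()
    finally:
        conn.close()
```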

LD: That makes sense. And I can also imagine that, to some degree, if you have five pipelines, your need for orchestration is very different than if you have 5000. 

SK: Yeah. We ran a survey last year, and I’m sure we’re gonna find similar results this year, which was, when we ran our DataAware pulse survey, 96% of data teams were at or over capacity. And one of the big things that data teams require and gravitate towards is automation, as they don’t wanna have to keep clicking the button and manually running the thing over and over and over again. They wanna automate it so they can move on to building the new thing. There’s an overwhelming demand for us all to go build the new thing. And so we do see that the more teams scale, the more complex their worlds get, the more they need to lean on automation to really streamline their own day-to-day. 

LD: So orchestration, especially on the automation side of it, but orchestration in general, touches all pieces of the pipeline: it touches ingestion, it touches transformation, it touches the delivery out. But it also touches the maintenance, governance, and just the overall optimization of the pipeline, which is kind of the plumbing of the plumbing, for lack of a better way of putting it off the top of my head, if you will. Which is the nitty-gritty that really nobody wants to do. Am I correct in saying that? 

SK: Yeah, I agree. I do not think any engineer wakes up in the morning and is like, “You know what I really wanna do? I wanna go orchestrate some pipelines.” I’m just not sure that’s on the top hit list for anybody. I’m sure there’s a handful of glorious souls, who we wish we could replicate, who are willing to do that, but I think the building of data pipelines and creating that sort of powerful impact is really compelling. But I think the hard part that oftentimes happens with orchestration is, you’re creating these constructs and these instructions: literally, read this data from S3, perform this code on it, and then push that result to wherever. 

SK: And that’s cool the first few times you do it, but then usually what happens in orchestration is, the things you codified into that pipeline might run on this cluster with this many resources at this time. The thing about orchestration is we may codify that once, but the data is forever changing. The data may be larger, it may be… We may have just gone through Black Friday, and so now our traffic volume is 2, 3X what it was before. We just got a big drop of new data from a partner at the turn of the month, and so all of a sudden those assumptions we made no longer fit, and the thing breaks. And that’s the classic pain point that we see as engineers, even when we go and write orchestration code: I got paged at 3:00 or 4:00 in the morning because that pipeline broke, because of something. There was a new column, or the driver OOMed and ran out of memory, a litany of different, very common ways that things fail. 

LD: Everybody loves 3:00 AM pages. 

SK: Yeah, the 3:00 AM OOM page, it’s fantastic. We’ve all received them. 

LD: It gets the heart rate going. [laughter] Great cardio. It’s definitely the kind of cardio that you want. I’m a marketing person. If I get a 3:00 AM page, like, the website’s down. That’s not good. I can only imagine, and I only get those… I can think of, on one hand, the number of times I’ve gotten that kind of thing in my life, so I do not envy you guys at all. 

SK: Well, hopefully we’re all doing a good job of getting people out of that pain, too, but for a lot of people in the data ecosystem, it’s still an all-too-common pain. 

LD: That’s why you need automated orchestration. So, what are some of the other cool things it can do? Okay, so it helps with some of the maintenance. And we talked about some of the other things, but talk to me about just… And we’ve touched on it a little bit, I think, in different ways, but talk to me about some of the cool things that it can do when you’re just talking about the transformations themselves. When you’re talking about the code itself, that is the whole reason we’re running data pipelines: the transformations and the code. So other than just saying, I want to run this at this time, this number of times, or I want to run this at midnight every day, what does orchestration do for you? What are some of the cool things? Or let’s put it this way, since I’m just rambling, I’ll ramble on a little bit more, and maybe a really fun question will come out of this. What are some of the really, really fun things either you’ve done or you’ve seen a team do with orchestration? I know that sounds bizarre, but what’s something where you’ve seen a team orchestrate a pipeline and you’re like, “Oh dang. That was smart. That’s a cool way of doing that.” 

SK: Yeah, definitely. As we get to the sort of really advanced parts… As we kind of work our way up, the basic, 100-level orchestration is the “just get something to run” level: get some piece of code to run at a particular time, on a particular piece of data. And usually I think that’s the baseline level of, I just wanna go do something; usually as an individual, you’re trying to help optimize your job a little bit. 

LD: Right. 

SK: The 200 levels start to be the: I’m gonna run a pipeline that’s moving data around and doing a little bit of multi-step transformation logic, which gets to that higher level of value of connecting across data silos and data teams and data systems. And I’d say the sort of 300 levels start to be: I’m creating these derived, new, valuable datasets. And this is something we start to see a bunch of our customers do, too, which is, I have data coming in from 15, 20 different systems, ranging from cloud APIs to cloud warehouses to somebody’s on-prem Oracle database, and I’m taking all of this data in, and I am creating new derived datasets, refined models of data, that nobody else has access to. And then I’m actually taking that data and continuously publishing it. 

LD: Yeah. 

SK: We have talked about this before: as we go and publish datasets, some people call these data feeds, and the data mesh and data fabric world has slightly different terminology for them, but the net of what these are is they are the data pipeline equivalent of what an API is in the microservices world. They are a refined, published, new data product that the entire rest of the world can build on top of, or at least the entire rest of your team or company. And so I’d say from a business value perspective, the ability to continue to up-level people into working with these higher levels of refinement of data is really cool and exciting. And then, I’d say, probably maybe even the 400 level we start to see people tap into is at really large scale and really large volume. How do you run these things efficiently? How do you get them to run faster? How do you only reprocess data when you need to? How do you keep things running more tightly and more compactly, so that especially as you build more and more of these, your cost doesn’t explode on you, and your need for infrastructure doesn’t explode on you either, and you can still run a really efficient ship that fuels the rest of the business? 

LD: Makes sense. If you could change anything about the way you see orchestration happening, or the way people orchestrate their pipelines, and/or the way current orchestration tools, including ours, handle orchestration in the next six months… You could snap your fingers, and orchestration changes. What would you do? What do you think is the problem with orchestration right now, and how would you fix it? If you don’t think we need to fix it, you can say we don’t need to. 

SK: I definitely think we need to fix it. I’m pretty clear in mind and strongly opinionated on this one. 

LD: I love that. I do. 

SK: Yes, I’m sure. Look, I think the way most data pipelines in the data ecosystem are orchestrated today, to me, I’d say, is for the most part still fairly primitive, and I believe that is a by-product of the fact that orchestration isn’t a cool or sexy problem to go solve. I think it is a critical problem to solve. 

LD: Yeah. 

SK: But it’s not necessarily the coolest thing to go solve, and as a result, we haven’t solved it en masse in the industry. But when we think about orchestration, you’re moving data through a pipeline. In many ways it is the inverse of what a query planner and a query optimizer do for a database. 

LD: Right. 

SK: And what I see most pipelines as being today is largely manually constructed query plans. When I say manually constructed, they are based off of highly codified rules with conditional logic that is based off of assumptions of how that pipeline will run: based off of assumptions about the cluster, its capacity, its workloads at the point in time when that code was written, and based off of the nature of that data, the volume, velocity, variety, et cetera of that data, all at the point in time when that piece of code, that pipeline, was envisioned and built. 

LD: Right. 

SK: And the problem then is, even though the data itself is very much in motion, the definition of that pipeline is very static, and as a result, that is quite problematic, as it is prone to break a lot. And that’s what we see, I think, across a lot of the ecosystem, and that is one of the reasons why, when we think about this notion of imperative versus declarative pipelines, we see in most maturing software engineering domains this move from imperative to declarative. In an imperative world, we’re describing the how; in a declarative world, we describe the what. Especially for pipelines, when they’re in continuous motion, you have that ability, in a declarative world, to automate far more and provide a technology, usually in the form of a control plane, a domain-specific, context-aware control plane, which is a lot of fancy words for basically just a badass orchestration system that is always running and understands everything that’s going on. And in the creation of doing so, we now actually can introduce new technologies that can automate and adapt far better than statically crafted, manually crafted pipelines. 

SK: I think we’re still very much in the early days of maturity around orchestration and adoption of declarative technologies, similar to what we’ve seen in other spaces like Terraform for infrastructure, Kubernetes for container orchestration, and React for front-end engineering. As data engineering and data orchestration mature, I think we too will become much more declarative in nature.
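To make the imperative-versus-declarative distinction a little more tangible, here is a hedged, self-contained sketch. None of this is any particular product’s API; the “declarative” half just shows the shape of the idea: the pipeline is described as data (what derives from what), and a small runner, standing in for a control plane, works out the how.

```python
# Hedged sketch of imperative vs. declarative pipeline definitions.
# All names are illustrative; this is not any specific orchestrator's API.
from graphlib import TopologicalSorter

# Imperative style: the author spells out the *how*. The order of operations
# (and, in real pipelines, cluster sizes and schedules) is frozen in code.
def imperative_pipeline(raw_events: list[dict]) -> dict:
    cleaned = [e for e in raw_events if e.get("browser")]            # step 1
    counts: dict[str, int] = {}
    for e in cleaned:                                                 # step 2
        counts[e["browser"]] = counts.get(e["browser"], 0) + 1
    return counts                                                     # step 3: "publish"

# Declarative style: the author describes the *what*. Each dataset declares
# what it derives from; a small runner (a stand-in for a control plane)
# figures out the order, and could also decide what to recompute and when.
transforms = {
    "cleaned": ("raw", lambda raw: [e for e in raw if e.get("browser")]),
    "summary": ("cleaned", lambda cleaned: {
        b: sum(1 for e in cleaned if e["browser"] == b)
        for b in {e["browser"] for e in cleaned}
    }),
}

def declarative_run(raw_events: list[dict]) -> dict:
    data = {"raw": raw_events}
    order = TopologicalSorter({k: {v[0]} for k, v in transforms.items()}).static_order()
    for name in order:
        if name in transforms:
            upstream, fn = transforms[name]
            data[name] = fn(data[upstream])
    return data["summary"]

events = [{"browser": "Firefox"}, {"browser": "Chrome"}, {"browser": "Chrome"}]
assert imperative_pipeline(events) == declarative_run(events)
```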

LD: Run through for me what that means for the industry, like I’m not a data engineer, because I think, hopefully by now that we’re in season two, everybody’s picked up on the fact that I’m not a data engineer. I’m assuming, based on what I… I mean, based on what I do know and based on what I’ve heard you say. And I’m gonna throw out the high-level business benefit, because that’s what I do as a marketer, and then I’m gonna let you dive in deep on what this actually means for the technical users. I’m assuming that running this down means self-service data pipelines, aka more people can do more things with data. You get data products to market, or out to the business, faster, and you can just overall accelerate how quickly you get data from source to business user and whatever they’re doing, whether it’s analytics, AI, ML, whatever that might be. Just faster and easier, because more people can work with it, because when you say declarative versus imperative, you’re talking about being able to use something like SQL, which even I can sometimes do. Am I on the right track? 

SK: Yeah, absolutely. I think it’s very much about broadening the reach and accessibility of a particular domain, in this case data pipelines. And it is about being able to go faster and spend less time on maintenance, which gets to the tremendous amount of business value, too. I do also think, and oftentimes we find this with those of us on the engineering side, we’re like, “Ah, but that means we’re making it easy, and we solve really interesting, hard problems.” And I think that’s one of the things that bears repeating many, many times over the course of this podcast, which is that even moving to a declarative model doesn’t necessarily just simplify down the data pipelines. It removes you out of the muck, as Bezos used to say. 

LD: Yeah sure. 

SK: Or as he said around the introduction of AWS. A lot of the things that we have to write today, you don’t need to spend that time writing code for anymore. And more importantly, we look out across our customers, for example, and they’re building incredibly sophisticated data pipelines, incredibly sophisticated. And they’re doing a ton of them, some in SQL, some in Python, some in Scala. They’re solving problems everywhere from the supply chain for food, to warehousing equipment, to pool equipment. 

LD: Yeah. 

SK: To containers in ports. And so they’re all solving these really interesting, really hard problems. And I would contend the reason that they’re able to solve those problems is they’re not mucking around in the orchestration part. 

LD: Sure. 

SK: And instead they’re spending all their time applying their intellectual horsepower to the data itself. And I think that’s where we get this profound impact, is by being able to focus on the core problem of data that we’re trying to solve, not the movement of it. 

LD: That’s a super valid point, being able to actually work on the business logic. Being able to create the business logic that like makes the thing do the thing. Well, I think that’s a fantastic place to leave this off, unless you have anything else that you wanna add about orchestration. 

SK: I think this is a great first section on orchestration. 

LD: Okay, well Sean, thank you very much, and we will… Like I said, I am very much looking forward to continuing to do these more and more, which we will, because like I said, we have some really exciting ones coming up. So, looking forward to it. Thanks, Sean. 

SK: Thanks, Leslie. 

LD: Well folks, there you have it. The hows and whys of data orchestration, and just what Sean… Well, really both of us, think the future of data orchestration should bring. If you’re interested in hearing more about how automation can help you with your orchestration efforts, you are in luck. We recently announced the inaugural Data Automation Summit. Head on over to Ascend.io to learn more and register for this free virtual conference. Welcome to a new era of data engineering.