In this episode, Sean and I have the chance to chat a little more about a foundational aspect of data engineering workloads—automation. We look at the confusion that frequently surrounds data automation and how some data teams can avoid the major pitfall of going too far too fast, so they can realize all the benefits automation brings.
Leslie Denson: Data automation. It may not be what makes the world go ’round, but it’s pretty close when it comes to today’s data workloads. Sean and I talk through the endgame benefits of a strong data automation implementation, along with the reality of what it takes to get there, in this episode of Data Aware, a podcast about all things data engineering.
LD: Hey everybody, and welcome back to another episode of the Data Aware podcast. Today, Sean and I are back again. Hello Sean.
Sean Knapp: Hey Leslie.
LD: How’s it going today?
SK: It’s going great. How are you?
LD: I’m doing well. You know why I’m doing well?
SK: Say more.
LD: Say more. Because we are talking about one of our favorite topics, data automation.
LD: And we like data automation a lot, and we should like it a lot. We’re actually hosting the first annual, our inaugural as we are calling it, Data Automation Summit here, coming up in April. So I’ll give you more details on the outro of this on how you can sign up for that. But we love data automation so much, we’re hosting an entire two day summit about it. So this is clearly something we enjoy talking about and really just, kind of love spreading the word around. So super excited. Super cool. What do you think?
SK: I’m really excited. We already have a great line-up of speakers coming together, some pretty incredible tracks. It’s gonna be a phenomenal couple of days.
LD: Yeah. Super cool. Super exciting. Really fun. We are seeing some really, really, really fun things happen around data automation. And if you’ve heard us talk in any of our previous podcasts, you will have heard data automation, and the concept of it be sprinkled throughout all of those, because we really think it is such an integral part of the full life cycle of data engineering. But we’re gonna talk specifically about it today. So… I’m gonna start with the 101 question, which is the… If you are like me when I first started at Ascend, and/or when I first started in the data space, and you know nothing about anything having to do with data, what is data automation? ‘Cause it sounds really simple. Is it as simple as it really sounds?
SK: Conceptually, yes. In practice, no. I think that’s the interesting and fun part about data automation is, historically, we used to oftentimes use the notions of data automation and data orchestration synonymously. And I think the really important part about data automation, and when we think about the differences between these two, is data automation is a broader concept. It includes and encapsulates things like data orchestration, but it involves the broader pieces of extract, transform and load, delivery of that data to other systems. It incorporates and involves the registering of metadata to track how data’s being used throughout the organization. It includes the broader portions of, what would classically be the data engineering and analytics engineering principles, for data pipelines and data platforms.
LD: That makes sense. And there’s also probably, to be fair to everybody in the data space, there’s probably pretty broad swaths of data automation as well. There’s probably automation that’s specific to analytics. There’s probably automation that’s more specific to maybe source or ingestion. There’s probably automation, excuse me, that’s more specific to the actual data flow, data pipeline. There’s probably automation that’s more specific to the application side of things, there’s probably different types of automation. So maybe… Is there… When we talk data automation… Is there a difference between all of those? What types of automation are… When we talk automation, are we talking about? Maybe kind of give a little bit of context around that.
SK: Yeah, absolutely. I think in many ways, we oftentimes think of automation and even orchestration as if they pair together, even mention them as the same. But there’s also a huge difference between cruise control, and autopilot, and self-driving cars. So for example, if we think about automation, even when we’re extracting data from different sources, one of the frameworks that I would give us to really think through is the difference between a classic scheduling-based model (or even, as we talked about in our last podcast, timers and triggers from an orchestration model) and something that is actually more adaptive and more responsive, automatically. Not to use the word in its definition…
SK: But for example, as we go through more advanced layers of automation, we usually find systems that build more sophistication off of the things that they’re doing. They’re collecting statistics around the data that they’re moving around, they’re profiling the data, they’re building historical models as to how often that data tends to change and in what ways, so that they can optimize how things move through. And I think that’s a really important first piece, and when we think about extracting, it’s the difference between just reloading all of the data every day versus hey, automatically checking, and watching for how often new data changes, pulling that in, unifying it with the previous data sets. And then even actually most importantly too, are registering that data with the pieces that have accessed it before, the systems that accessed it before, so that they know how to continue to move that new data coming through.
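The adaptive extraction Sean describes could be sketched roughly like this in Python. This is a minimal, illustrative model, not any particular product’s behavior: a hypothetical `SourceProfile` keeps a high-water mark plus a history of when the source actually changed, and derives a polling cadence from the observed change gaps (the half-of-median-gap heuristic is an assumption made up for the example).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class SourceProfile:
    """Rolling history of when a source actually produced new data."""
    change_times: list = field(default_factory=list)
    high_water_mark: datetime = datetime.min

    def record_extract(self, newest_row_ts: datetime) -> bool:
        """Record an extract attempt; return True if new data was found."""
        if newest_row_ts > self.high_water_mark:
            self.high_water_mark = newest_row_ts
            self.change_times.append(newest_row_ts)
            return True
        return False

    def next_check_interval(self, default=timedelta(hours=1)) -> timedelta:
        """Adapt polling cadence to the observed gaps between real changes."""
        if len(self.change_times) < 2:
            return default  # not enough history yet, fall back to a schedule
        gaps = sorted(b - a for a, b in zip(self.change_times, self.change_times[1:]))
        # Heuristic: poll at half the median change gap so we rarely miss data
        # but stop hammering sources that seldom change.
        return gaps[len(gaps) // 2] / 2
```

The point of the sketch is the shift Sean mentions: instead of a hard-coded timer, the system builds a small historical model of the source and lets that model drive when (and whether) to pull again.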
LD: Cool, that makes sense. Let’s break it down even further. We’ve talked a little bit about automation from the orchestration piece of it, which makes a lot of sense. And again, a small plug for the other podcast, if you haven’t listened to it, we talked a whole lot more about it in our previous episode of the season that talks about data orchestration. But data engineering, you can’t talk about data engineering without talking about ETL. So when you’re talking about ETL, extract transformation load, or ELT, or ETLT, or reverse ETL, or any other way that you wanna put those letters in together, however we wanna order those letters or put around those letters, talk to me about automation in each one of those, or what the different steps per se, or how it impacts, or just what folks should know, I guess, or… And look, I may not be able to keep up with that conversation, but feel free to dive into the 201, 301, 401 levels or graduate studies level if you want to because I can just sit over here and be quiet while you’re doing that.
SK: Yeah, happy to. See, when we think about that sort of classic ETL model, oftentimes when we’re doing extract and transform and load, we’re also orchestrating pipelines, and usually, we’re also registering that data somewhere with some sort of a catalogue or other external system. And on top of all of that, we’re usually, or at least, hopefully, increasingly trying to adhere to some sort of DataOps principles as well, to find some best practices there around how we move faster and more nimbly with data. And as we’re building these systems and as we’re running these ETL pipelines, or ELT if you will (at this point the lines are blurring very quickly, and as you said earlier, it’s much more of an ETLT or TLT these days), automation really weaves through all of these, and I think the connecting thread is oftentimes metadata.
SK: So as I mentioned, when extracting data, it’s the more advanced models that actually maintain a full history of everything that’s been extracted before, understand the profile of the data that’s been extracted before, and understand the load parameters of the system they’re allowed to access. For example, can you read data with only one thread because it’s a really small database and you don’t want to overwhelm it, or can you parallelize it and have a thousand different workers or pods or nodes all pulling data from the same place because it’s an object store? And so that automation factor, in the extract notion, is really tied to the actual nature of the system and understanding how it works.
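That one-thread-versus-a-thousand-workers decision can be pictured as a tiny policy function. This is purely an illustrative sketch; the source kinds and worker counts below are invented for the example, not real limits of any system.

```python
def extract_parallelism(source_kind: str, max_workers: int = 1000) -> int:
    """Pick a worker count from what the extractor knows about the source.

    Hypothetical policy table: an object store tolerates massive fan-out,
    a warehouse tolerates moderate concurrency, and a small transactional
    database should be read with a single thread to avoid overwhelming it.
    """
    policy = {
        "object_store": max_workers,
        "warehouse": 32,
        "small_oltp_db": 1,
    }
    # Unknown sources get a conservative default until they've been profiled.
    return policy.get(source_kind, 4)
```

In a real automated system this table wouldn’t be hard-coded; it would be learned and refined from the metadata the platform collects about each source over time.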
SK: What then ends up happening is, as we extract data, as we build profiles on that data to automate and fuel the other parts, that metadata starts to feed through and weave through all of these elements. When we get into transformation, automated systems go beyond that notion of orchestration: they understand the historical resources required, they understand the lineage of individual components, even down to a columnar level. In our last podcast, we talked about how you orchestrate data as it flows through a DAG, a directed acyclic graph. Well, understanding the nature of data as it moves through not only traces the lineage of the operations but understands dependencies down to a partition level, so that if I have new fragments of data that become available, automation can actually weave that data through the transform level and run more optimized jobs that don’t have to reprocess as much data, that are faster and cheaper and more efficient and less prone to error.
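The partition-level idea can be sketched as a small graph walk. Assuming, for simplicity, that partitions map one-to-one through each transform (real systems track far richer mappings), only the downstream partitions matching the changed input fragments need recomputing; everything else stays untouched. The `dag` shape and date-keyed partitions here are illustrative.

```python
from collections import deque

def affected_partitions(dag, changed):
    """Walk downstream from changed (node, partition) pairs.

    dag: {node: [downstream nodes]} adjacency list of the pipeline.
    changed: set of (node, partition) pairs with new data.
    Returns every (node, partition) that must be recomputed, assuming
    each transform maps partitions one-to-one.
    """
    out = set(changed)
    queue = deque(changed)
    while queue:
        node, part = queue.popleft()
        for child in dag.get(node, []):
            key = (child, part)
            if key not in out:
                out.add(key)
                queue.append(key)
    return out
```

So if only the “2024-01-02” fragment of the raw data changed, only that partition of each downstream transform gets rerun, which is exactly the faster-and-cheaper effect described above.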
SK: And then honestly, when we get to the load side, this is actually one where there hasn’t been as much work until recently with automation. But when we’re transforming data, we have to put it somewhere. And when we bring in automation, we need things that can do operations on that resulting data and can answer questions like, “What if there’s already data in there? Is the data correct? Is it even of the right schema? What do I do when the schemas don’t match? Do I update it, do I delete it, do I raise an error ’cause I need a human to come help me?”
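Those load-time questions can be sketched as a tiny decision function. This is an illustrative policy, not any specific loader’s API: on a type conflict it halts so a human can intervene, on new columns it evolves the target schema if that’s allowed, and otherwise it appends.

```python
def plan_load(target_schema: dict, incoming_schema: dict,
              allow_new_columns: bool = True) -> str:
    """Decide what to do when loading `incoming` into `target`.

    Schemas are simple {column: type_name} dicts for the sketch.
    Returns 'append' (schemas line up), 'evolve' (add new columns),
    or 'halt' (incompatible types: raise for human review).
    """
    for col, typ in incoming_schema.items():
        if col in target_schema and target_schema[col] != typ:
            return "halt"  # same column, different type: escalate
    new_cols = set(incoming_schema) - set(target_schema)
    if new_cols:
        return "evolve" if allow_new_columns else "halt"
    return "append"
```

The interesting part is less the rules themselves than that the system has an explicit, automatable answer for every mismatch case, instead of silently failing at 3:00 in the morning.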
SK: And I think that’s the… As we trace this metadata line through and follow it through with automation, that really becomes how we start to see this incredible benefit of going beyond just orchestration into full-fledged automation of ETLT or TLT pipelines: the ability to have a system that is far more data-aware, that flows everything through. Sorry, I didn’t intend on doing that, but that actually worked out quite nice.
LD: That’s what you say. What y’all can’t see is I gave him a look when he did that. I don’t hate it, but yeah, no, it makes a lot of sense. I feel like you talked about this in there and so I’m gonna jump around a little bit from where I kind of had my questions going, ’cause I do feel like you started to go in this direction and so I wanna pull that thread a little bit of, say I am a data engineer, or say I am a data team lead or whatever, and… Everybody out there can laugh, it’s all good, it’s okay, I’m laughing with you at the thought of that, and I’m looking at this and I’m going, “Yes, automation, God help me. We’ve got way more work than we can get done and we have to figure out some way to automate this so that I can do the things I need to do and my team can actually handle more stuff ’cause we’ve got more stuff than we can do.” What is the handbook? What is the guide? What is the steps… What are, rather, to be correct in English, what are the steps that you need to go down to sort of create that strategy?
LD: Like I said, I think with some of what you were talking about just a second ago, you started to walk that path, but are there some sort of… Obviously, it’s different for every company depending on what they have. I don’t think that there’s a one path fits all, but are there some general guidelines, I guess, that people can think about, that we’ve seen based on the work that, A, you’ve done in the past in your career, and then B, also the work that we see with our customers?
SK: Yeah, definitely. I’ll think about it conceptually at first, and then you’re probably gonna pull me into more pragmatic advice. When we start conceptually, a lot of automation follows very similar patterns to what we’ve seen in some of the tiers of artificial intelligence, where first you start with heuristic-based models, then you move towards more statistical-based models, and eventually you can go into far more advanced models like true machine learning with neural nets and so on. And the reason why I spend the time on this is, as you try and build more advanced systems, it requires more metadata, and it requires far fewer hard-coded rules and far more flexible and dynamic adaptations, where you allow the system to operate on your behalf as the developer.
SK: And the reason why I mention this is, one, oftentimes when engineers are creating their data pipelines, their ETL or ELT, when they’re trying to automate their data pipelines, they oftentimes start with really hard and fast rules designed to the specific use case, and part of that challenge then becomes when the environment changes, when some new parameter changes… So, a huge lump of data comes in, or it’s a leap year and you didn’t sort of factor that in.
SK: All of these assumptions that you bake into your code all of a sudden make the system brittle, and those assumptions and that conditional logic you put in may no longer hold. And so as we get more and more advanced with automation, it turns into much more of a move up the stack. Remember we talked before around that shift towards declarative models as opposed to imperative?
SK: And then the second piece is, all of that requires a ton more metadata. One of our favorite sayings here is “metadata is the new big data,” and I really do believe that: to achieve high levels of automation, we really do need to collect far more metadata, not for us but for the systems to automate those data pipelines, and that becomes a really important theme for most teams. To give everybody something pretty tactical and tangible: you should be collecting incredible amounts of metadata about your data — profiles of the data, how it’s being accessed, by what jobs, with what resources, where it’s going to, why it is going there, and where it goes from there. And the reason why I think that becomes so important is, to move beyond the sheer baseline of orchestration and into far more advanced levels of automation, you can’t do that without tremendous volumes of metadata on which your automated systems can base their decisions.
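As a tangible sketch of that kind of collection, here’s a minimal per-column profiler for a batch of rows. A real platform would capture far more (access patterns, job resources, lineage), and the dict-of-dicts row format here is just an assumption for the example, but the shape of the idea is the same: every batch that moves through leaves behind structured metadata.

```python
def profile_batch(rows):
    """Collect lightweight per-column metadata for a batch of dict rows.

    Returns {column: {"count": n, "nulls": n, "types": [type names]}},
    the kind of profile an automated pipeline can accumulate per run.
    """
    cols = {}
    for row in rows:
        for key, value in row.items():
            col = cols.setdefault(key, {"count": 0, "nulls": 0, "types": set()})
            col["count"] += 1
            if value is None:
                col["nulls"] += 1
            else:
                col["types"].add(type(value).__name__)
    # Sort type sets into lists so profiles are stable and serializable.
    return {k: {**v, "types": sorted(v["types"])} for k, v in cols.items()}
```

Run after every extract or transform, profiles like these become the raw material the automation layer uses to detect drift, size jobs, and decide what actually needs reprocessing.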
LD: Are there unintended consequences? I hear sheer volumes of metadata and I think, “Oh God, now you have to track that and store that and know where it came from and know where it went and understand that metadata.” Are there… Is there… But that may not be a problem. Are there unintended consequences that come with having an advanced automation strategy as part of your data engineering strategy?
SK: Well, I would say definitely one consequence is you can’t rely on your production pipelines to wake you up at 3:00 in the morning anymore. You’re gonna have to set an alarm clock for that.
LD: Very true.
LD: I mean, there could be absolutely… We’re gonna… I mean, I think everybody knows the benefits. We’re absolutely gonna talk about the benefits too, but I think there may also be some unintended… Consequences may be a strong word, but just things you need to think about.
SK: So, one, it’s hard to build highly automated systems. It is harder to get up and over that hump than it is to build manually orchestrated systems. You can skip a lot of steps in those early days by not trying to go full-fledged automation and just kinda throwing some elbow grease at it and making it work. I think that’s okay. It definitely takes time to build a system that is that automated. I think we’re going to be entering an era over the next couple of years where a lot more automated systems get introduced, not just orchestration, but going beyond that to full-fledged automation, as we’re seeing in a lot of other software domains as well. So I think that’s certainly something we can count on and rely on: this isn’t a problem that most people are going to have to solve for themselves, but instead something they can just build on top of and take advantage of. So I think that’s certainly one.
SK: As far as the concern around the sheer volume of metadata, the good news is… And this is where we get to fall back on our strengths in the big data world… Metadata is still an order of magnitude or two less than the actual data that we work with…
SK: And that actually becomes one of the fun aspects of what we get to do: metadata itself is no longer a “put it in your Postgres database, run queries, and keep tabs that way,” but a “get your metadata into the same data lake or data warehouse systems that you use for all the rest of your big data processing,” and you can do the same magical things on your metadata that you’re doing on your actual data itself.
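A toy version of that shift, using SQLite purely as a stand-in for a real lake or warehouse: once job-run metadata lands in an ordinary queryable table, the same analytical SQL you’d run on data can, say, flag jobs whose runtimes are ballooning. The table and column names below are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse metadata table
conn.execute("""CREATE TABLE job_runs (
    job TEXT, ran_at TEXT, rows_read INTEGER, seconds REAL)""")
conn.executemany(
    "INSERT INTO job_runs VALUES (?, ?, ?, ?)",
    [("extract_orders", "2024-01-01", 1000, 12.0),
     ("extract_orders", "2024-01-02", 250000, 340.0)],
)
# The same analytical queries you run on data work on metadata:
# here, how much each job's worst runtime has grown over its best.
row = conn.execute("""
    SELECT job, MAX(seconds) / MIN(seconds) AS runtime_growth
    FROM job_runs GROUP BY job""").fetchone()
```

The design point is simply colocation: metadata stored next to the data, in the same engine, inherits all of that engine’s scale and query power for free.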
LD: Yeah. No, that makes sense. And I guess I just… It sounds like all utopia and roses and candy, oh-my, and so, what are the… What if somebody just dives head first, what are… Before they hit the first roadblock and go, “Wait, I didn’t know this is hard,” like what are the things to think about?
SK: I do think that the sobering one is, it’s hard work to build this. Just to be… Ultra-transparent, we’ve been at this for many years at Ascend as a team. Building high levels of automation definitely takes a lot of effort to do. We spend time talking about domain-specific control planes, and…
SK: High levels of automation where you actually have to be context-aware. In our specific context, data-aware, at that. [chuckle] The thing is, it takes a tremendous amount of engineering resources to build a system that is that sophisticated. And many companies may not have the stomach for it. Or they may start it, and they may not understand the deep levels of complexity required, and the sheer level of engineering effort it takes to get to the other side of that vision.
LD: I think about the normal roadblocks that, to be fair, any team within an organization faces when they start any sort of new project. Which is: I have my goals, or my OKRs, or whatever your company uses, that I need to hit for this quarter, or this six months, or this year. And it’s awesome that I have this project that I wanna start that I think, in six months or a year, is going to make a tremendous amount of difference, but… If it means that I can’t hit my short-term stuff, it falls by the wayside. If it means I’m spending too much time trying to create this automation platform, and therefore I cannot do the other 20 things that are on my plate to make the business run, it falls by the wayside. And so they probably are having to choose between what makes things better a year from now, and what just gets the job done.
LD: And that’s tough.
SK: Totally. Well, and this is a bit of a tangent, but I think it’s a really important one, which is… We see a ton of teams, and to be honest, this isn’t even a data thing, this is just more of an engineering thing, where more often than not teams are pretty pegged with workloads already these days. We’re all working in some pretty exciting, transformative, and high-growth areas if you’re in data and tech work today. And what we see is, oftentimes we undergo these pretty massive and exciting projects, but the engine still has to keep running, the wheels still have to keep turning. And when we put that against the backdrop of data teams: as we found for the last two years in a row, data teams are at or over capacity. It was 96 and 97% in our last two years of surveys.
LD: Soon to be for the third. Our third one’s going in the field here soon.
SK: I’m hoping at least the trend continues and maybe it’ll only be 95% this year.
LD: Yeah, me too.
SK: But when you put that against the backdrop of how overloaded teams already are: you take any project, and optimistically it may be a six or 12-month project; you put it on a loaded team, you double that time frame; and by the time that team even has an alpha version of this product, you’re already a couple of years further along in the industry. The technology landscape has completely changed…
SK: And you’re just kind of in this stuck spot. And I do believe that this is the challenge. It’s a catch-22 where, if teams had higher levels of automation, they would have a lot more free time to invest in the next gens. But they don’t, because they don’t have automation. And as a result, they’re somewhat trapped in these cycles and…
SK: I think… When we see the sheer growth of demand for data engineering teams, and analytics engineering teams, we see that there’s so much more demand. And that’s where we really do see that demand for talent outpacing supply. We’re just not creating enough data engineers, and analytics engineers fast enough…
SK: As an industry. The only way you ultimately solve that is, we need more leverage for the really talented folks we have that are now in the industry. And you get leverage through automation.
LD: Yeah. And to be fair… With that, I will say, speaking as, again, somebody coming from a totally different field… It is an example that is similar, but different, in that I have personally spent a lot of free time, weekends, nights over the last two weeks, just going into our marketing automation system and putting a bunch of frameworks in place that I have not had over the last year and a half. Now that we have a year and a half of learning, going into the next year saying, “This is how we wanna scale our marketing team. This is how we wanna scale how we’re doing things.” So knowing that, let me put these frameworks in place so that I have a better understanding of some of the metrics we wanna track. So that I have a better understanding of this, this and this. And so let me put these frameworks in place, and let me get these reports out, and let me get this, this and this done.
LD: I was actually showing one of my team members earlier today, some of the reports that I just have, now, automatically updating, and some of the views that I have automatically updating. And it is such a relief, and it is such a weight off my shoulders. And it is one of those things where I am glad to have spent free time, because I don’t have time during the day, ’cause we’re busy. Lovely thing for our company is that we’re very busy, I don’t have time during the day to do it. So I’ve spent free time doing it because it just… We have an automation system to do it, I just honestly haven’t had the time to set it up. So now I’m spending the time to set it up, and it is so wonderful because A, it’s going to save me time moving forward, B, it’s going to lead to better performance, C, it’s going to help us scale.
LD: D, it just means it’s gonna be a better use of all of our time, and all of our abilities, and all of our things. It’s better data. It’s all of these things, because I have filters in place, and I have workflows that automate… Okay, so this person came in this way, so they get routed this way, and they do this, that and the other. And it’s amazing. And speaking just as, again, a marketer who’s setting up her marketing automation system, probably the way it should have been set up six months ago, but finally having the time to do it ’cause we have a little bit more line of sight into how we wanna do a few of these things: it’s amazing. So, I can only imagine for somebody who is spending infinitely more time in their day dealing with data than I do. I know how much better I feel; I can only imagine how much better they would feel if they had that kind of automation set up.
SK: Totally. When I think… You touched on something too, which is… Maybe this is just the inner geek in all of us. I definitely get that delight when you see something running, and it’s just perfectly automated, and you’re like, “Oh, that used to take me so many clicks…
SK: And now, it just works.” And I think… Something you said too kinda reminds me: oftentimes, when we undertake these automation efforts, usually we see them go like, “Oh, this thing’s creating so much pain. We’re going to start from scratch, or we’re going to throw the baby out with the bathwater and rebuild this whole thing.” And it’s funny ’cause, as we talk through these, and as I touched on a little bit earlier, ain’t nobody got time for that. We’re busy. The wheels are still turning. The engine’s still running. It’s that old saying of, how do you change the tire while you’re still flying down the racetrack…
SK: Right? And having seen us upgrade and invest a lot in our marketing automation and so on, these things can be done in smaller and incremental steps, right? We see a bunch of teams, we do it internally, we see a bunch of our customers do this, which is, “Hey, we’re just going to incrementally migrate things over. We’re not gonna re-platform everything all at once.” I was Slacking with one of the heads of data from one of our customers the other day, and he said something really great that made me very happy: “Hey, we love Ascend. We still have systems that are running on the older stuff. And we just have a policy, which is don’t touch it, but once it breaks, we just move it over to Ascend. And then, it’s now fully automated.”
LD: There you go.
SK: And I think that’s a really good learning that everybody should adhere to. Ideally, it would be “use Ascend,” but even if you’re not using Ascend, the other… The broader takeaway…
SK: Is also to find ways of doing this incrementally.
LD: For sure.
SK: Don’t try and conquer Everest all at once. Find the thing that is creating the most pain today. Something that is just eating away at most of your time, and find a way to automate that piece, then find a way to automate the next one and systematically move your way through.
LD: ‘Cause you’re never gonna know… And this is what I’ve found. I laugh also. I created… Forgive me, I’ll talk about another thing that I created. I created this wonderful wine spreadsheet ’cause I like wine. I collect… I collect wine. And that makes it sound like I’m a lot more of a wine connoisseur than I am. I’m not, but I just, I like wine. I have a lot of wine in my house.
SK: I do think those are two wine refrigerators I can see in the background.
LD: I do have a big wine refrigerator here. I have one here, one here. I have a bunch of stuff in closets. I have a lot of wine.
LD: And my friends laugh at me, but I have this big one… I have an app on my phone. I found it was too difficult for me to use, just ’cause it’s whatever, but I have a big wine spreadsheet. And I created it a couple of weeks ago when I was going through and taking inventory. Yes, I have to take inventory. And over the last couple of weeks, as I’ve been using it, as I’ve been adding and drinking wine… And yes, I know that’s not automated, for sure. But I’ve realized (and the same thing happened with the marketing automation system, we just aren’t quite as far into the process, which is why the wine thing came up) that the way I first iterated on it and created it doesn’t work for me. And so, it has gone through four iterations. I’ve changed things three or four times, fairly significantly, to make it easier to use for how I actually use it in day-to-day life. And I think one of the things I was talking to a team member about, with these changes to the marketing automation system and some of the reporting that we’re doing, is: it’s going to change.
LD: As we move forward with this, this is probably not the end result. This is not the final result of what this will look like. We will learn. This will be what it is, and then we will learn how we’re using it and how we need it to adjust to fit us. And so to your point, don’t rip and replace. Don’t try and conquer Everest on the first batch. Do it in small increments, because I guarantee, if you conquer Everest on the first batch, you’re gonna hate it. You need to use it before you really know how it’s gonna work for you, and if you try and conquer Everest on the first batch, it’s probably not gonna work for you. So do it in increments, learn how you use it, learn what works for you, take that learning into the next one, and figure it out from there.
SK: Yeah, that totally reminds me of one of the things we’ve heard from another customer a while back, which is you wanna make change cheap.
SK: And the idea behind this is embrace the fact that we are in a very fast-moving industry. And embrace the fact that we don’t all know where the technology landscape is going to go. We know it’s gonna be exciting. We know it’s gonna move very quickly, but we also know it’s gonna change a lot.
SK: And I think… Unfortunately, and probably for valid reasons, though there are also very valid reasons why we should break away from this pattern, the data ecosystem, from an engineering perspective, in how we build, is, I would argue, a solid decade behind the rest of software engineering principles. Today, we still build very much like older, traditional Waterfall-based development models. We take on these massive architectural projects, it takes quarters until anything sees the light of day, and we just really hope we got it right. But it’s a moving target, right? And so it’s really hard to know. And the longer a project takes, the more that target can move, and the higher your risk is, because of the higher variability as to where you’re gonna end up.
SK: And what we see a lot is, change is expensive, because these are long projects and you’re hoping you get your direction correct. And you…
SK: Hope you actually are intercepting that curve properly. And I think there is a really important thing for us in this automated ecosystem to embrace which is, we need to be able to move faster. We need to move more towards proper Agile-style development, whether it’s Scrum or even Kanban style of development for data pipelines. And maybe this is a good… We should talk about this at some point, maybe in the next cycle… The notion of DataOps versus DevOps, and the…
SK: Notion of how do we move faster. ‘Cause in many ways, DevOps was born because we wanted to enable more software engineers to write more software faster yet safely.
SK: If we look at most of the DataOps principles today, they are about how to enable more data people (data engineers, analytics engineers, data scientists, etcetera) to work more with data, faster, yet safely. And I think that becomes really important as we look into this new era and figure out how we move faster…
SK: And, most important of all, to bring it full circle: DataOps requires automation. It requires that leveraging of metadata, because you can’t move fast enough manually. You can only move fast enough with high levels of automation, and that’s what DevOps brought to software development: high levels of automation in the process of how you build. Same thing for DataOps and data.
LD: Well, I feel like if people wanna hear more on this topic, then maybe what they should do is check out the summit that’s happening April 13th and 14th, possibly. I mean, they can also contact us any time or just reach out to us. We’re always happy to talk about data automation, but also the summit. Just maybe.
SK: Yeah, I think it’s a good idea. I like what you did there.
LD: Yeah, maybe, maybe, maybe. And I promise at the summit you won’t have to listen to me pontificate, just other people who are way smarter about this topic than I am. That’s a beautiful thing about the summit as well. So Sean, thank you, I appreciate it.
SK: Thank you.
LD: Always enjoy these chats.
SK: Me too.
LD: Honestly, what I really do appreciate about so many of these topics is exactly how applicable they are to folks outside of traditional data engineering, like, well, me. Which is part of the reason why I, in particular, am so stoked for the Data Automation Summit, which is a free virtual event on April 13th and 14th, where you get to hear conversations, case studies, Q&As and more from real-world practitioners that are implementing data automation strategies in their data stacks. Head on over to ascend.io to learn more. Welcome to the new era of data engineering.