Ep 21 – What Is Data Pipeline Automation?

About this Episode

Sean and Paul talk about whether spreadsheets will become self-aware now that Python functions are available natively in Excel, and what the definition of Data Pipeline Automation is given that so many people say they want it.


Paul Lacey: All right. Well, welcome, everybody. Welcome on the program. This is DataAware Podcast by Ascend. My name is Paul Lacey. I am the head of product marketing for Ascend, and I’m joined by Sean Knapp, who is the CEO of Ascend. Welcome, Sean.

Sean Knapp: Hey, hey. Thanks for having me again.

Paul Lacey: Great to have you back. We’re getting into a bit of a groove with these things, which is great. Sean, today I wanted to talk about a couple of different topics. I think the one that we’re going to get into that’s the meatiest one is going to be around data pipeline automation, which is something that I know is very near and dear to your heart, and central to your founding thesis for Ascend and what we’re building out, all the technology and stuff that we’re doing here at Ascend. So, it’d be great for our listeners to unpack that a little bit, and what does that mean to automate data pipelines?

But I promised that we would do hot takes on this program. And there is a hot take that just came out. This morning, I noticed a press release saying that you can now use Python inside of Excel. Microsoft announced a partnership with Anaconda. And now, using the beta version of their cloud Excel product, you can actually start writing Python formulas alongside your traditional Excel formulas, and you can execute those, which is really interesting. So Sean, I’d love to get your take on what that means.

Traditionally, growing up, I guess, most of the data analytics that we would have done would have been in a spreadsheet type of a format, right? And as the cloud has democratized more data warehouse type technologies, more and more of us have started to gravitate towards doing some of these larger-scale analytics in the cloud. But this almost feels like a pendulum swing back towards spreadsheets in a way, now that you can run some of these advanced Python operations inside your spreadsheet. So what do you think that means, and what does that portend for the analytics world going forward?

Sean Knapp: I mean, so many things. Clearly, the lines are blurring. And I’ll get into that in a minute too. I have to, at least, first highlight the fact that my parents, who have spent a fair bit of time in their careers in Excel and spreadsheets and numbers … My mom is a tax attorney and accountant, and oftentimes would ask, “Hey, your last company you started, we understood. Media content online. Got it. Check. We understand that.” But she would ask me, “I don’t understand, what exactly is big data? What are you guys doing?” At least, back when I started Ascend. And my general explanation was, “For the most part, just really large pivot tables, just with tens, if not hundreds of billions of records, and thousands of columns.”
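Sean’s “really large pivot table” framing translates almost directly into code. As a rough sketch (the dataset and column names here are invented for illustration), the same add-up, filter, and merge operations look like this in pandas:

```python
import pandas as pd

# A toy stand-in for "tens of billions of records": same operations, tiny scale.
events = pd.DataFrame({
    "region": ["us", "us", "eu", "eu", "eu"],
    "product": ["a", "b", "a", "a", "b"],
    "revenue": [100, 250, 80, 120, 300],
})

# The spreadsheet pivot table, expressed as an aggregation:
pivot = events.pivot_table(index="region", columns="product",
                           values="revenue", aggfunc="sum", fill_value=0)

# "Filter out some other stuff": the other staple operation.
big_sales = events[events["revenue"] > 100]
print(pivot)
```

At billions of rows the same logic runs on a distributed engine instead of in one process, but the mental model really is the pivot table.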

Well, in the big data realm and with data pipelines, we do a lot more. A lot of it boils down to … Look, I just got to add some stuff up and filter out some other stuff and merge some other things. And what has happened over the course of time as we hit these entirely new waves of scale, it’s just harder. The technologies are more powerful, but they’re just rawer, and so they take a little bit more oomph and hands-on grit to make them work. Oftentimes, the things we’re solving for may not be the exact same. They’re similar.

And so what’s happened over the course of time is we are seeing not just a convergence, but the boundaries really start to blur. And I think we’re seeing that for a couple of different reasons. We’ve already seen Google Sheets, for example, can tap directly into BigQuery, and vice versa. So we’re seeing these boundaries really start to blur, as we’re looking for people to have access to more familiar interfaces on top of larger, big data infrastructure, but it’s also going the other direction. And we’re seeing that the finance team, the data analysts, also are looking for the ability to tap into more, really interesting capabilities.

And I think the announcement specifically around the Anaconda integration, this is really neat, because it’s the same thing that Snowflake has done with Python integration into Snowpark, which is not a free-for-all in the Python ecosystem, because there’s a lot of security concerns that you still have to work out. But in working with access to vetted, really bolted-down packages and libraries with Anaconda, you can have a controlled environment, a trusted environment, but access to all of these new libraries and capabilities that really extend the functionality of the platform that you’re on.
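For a flavor of what those vetted libraries unlock, here is the kind of one-cell analysis a spreadsheet user gains. In Excel itself this would live inside a `=PY(...)` cell and read ranges from the sheet; the version below is plain Python against a stand-in series so it runs anywhere, and the numbers are made up:

```python
import pandas as pd

# Stand-in for a spreadsheet range that a real =PY() cell would read in.
monthly_sales = pd.Series([120, 135, 150, 149, 170, 184])

# Analysis that would take several classic Excel formulas, done in one pass:
summary = {
    "mean": monthly_sales.mean(),
    "pct_growth": (monthly_sales.iloc[-1] / monthly_sales.iloc[0] - 1) * 100,
    "rolling_3mo": monthly_sales.rolling(3).mean().iloc[-1],
}
print(summary)
```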

And I think what we’re going to continue to see … And we’re just in the really early days of this. We’re going to see a huge amount of push around this, as not only do the data experts continue to have interest in tapping into more interesting things, especially around data science and a lot of the new things coming out, but we’re also seeing the potential for GenAI to make this more accessible for people who don’t have software engineering backgrounds, who can actually get GenAI to help them write and run code that taps into these really magical libraries.

And that, I think, we’re going to see at an increasingly large cadence and pace over the next couple of years, which is great. This makes me so excited, because all the lines, all the boundaries, all the … You’re this role or you’re this type and this is your skillset and you stay in your box. Not only are we breaking down the classic data silos, but I think we’re breaking down those skillset silos, and anybody can be anything they want all over again, which is fantastic. And so we’re going to see this really cool era of people reaching out of their classic, quote, unquote, “boundaries,” and tapping into new skills, which is super cool and exciting.

Paul Lacey: It is. Yeah. And it does feel like our spreadsheets have taken one step closer to becoming self-aware with this, and the ability to bring, like you said, some of those libraries and stuff to bear. And it’s really interesting to see the fusion of these two technologies get closer and closer. I loved your analogy there, Sean, about doing the pivot tables and stuff at a large scale.

That’s how I used to explain what I did, when I first got into big data technologies too. I’m like, “Just imagine a really, really big spreadsheet and you want to do some simple operations on it, like aggregations or filtering or whatnot.” It’s millions of rows long, and so if you try to do that in Excel, it would crash or it would take four or five hours running on your desktop. You can push that up to the cloud, and you can get that done pretty quickly.

But the spreadsheet still remains kind of like a universally accessible framework for a lot of people, right? We understand. From an early age, it’s drilled into us. You can go in, you can have … Okay, here’s a column, here’s a row, here’s a cell. This cell plus this cell equals that cell. And you start to build up more and more complex logic around that.

Even now, as a marketing leader and working with a lot of sales leaders, a lot of us still do most of our work out of spreadsheets. I’m maybe slightly ashamed to say that in a public forum, but at the same time too, so much of the one-off type analytics and stuff, it just lends itself so well to just drop a spreadsheet and start playing with it, and see what it looks like, do a little manipulation, that kind of stuff. And then you get your answer, and you can kind of move on.

So the fusion of bringing a Python into that mix is going to be quite interesting. I mean, what do you think that means, in terms of … Are people going to start doing more sophisticated prediction type stuff in spreadsheets and machine learning type stuff?

Sean Knapp: I think so, because oftentimes we see heavy productization of certain kinds of capabilities around machine learning or forecasting and so on. But what we find too is there’s sort of an 80/20 rule, where 80% of the things you want to do from a modeling or forecasting perspective you could probably productize, because they’re fairly standard. Then there’s the other 20%, where we’re going to want really interesting kinds of capabilities. In the maturation cycle, the appropriate thing to do is to not actually try and productize yet, because we probably don’t even know what most people are going to be able to do, or want to do, on top of spreadsheets with these new capabilities as they get access to all these data sciencey libraries in Excel.

So the right thing to do, I think, is to give people access to a lot of these raw capabilities and see what they go create, and see where you start to find this normalization and standardization of behaviors and exploration. So I do think it’s going to be really cool to see how people really harness these capabilities. And then, over time, the most common patterns probably will get productized, so you don’t even have to write some Python. You can instead just click a couple of buttons. And I think this is the forever cycle that we see of customization to raw capabilities, eventually to better productization.

I think one of the interesting byproducts of this … You didn’t ask me, but I do think it’s interesting, because it’s related to our conversation from last week. We talked a bit about how a number of CIOs feel that they’ve overempowered their organizations, as now they’re seeing the cost skyrocket. I do think that we’re going to continue to compound that issue. Oftentimes we see this with users who want to run all these amazing capabilities on top of Excel, with all these data sciencey capabilities in Python libraries. And the natural fit is to then say, “Well, let’s take Excel as that middleware,” almost, but then push it down to, in a Microsoft world, a Synapse, or to a Snowflake or a Databricks.

But what we oftentimes find in that sort of technology stack is that the actual workload that goes all the way down to the underlying engine, which is really where most of the cost is, doesn’t end up being super efficient or optimized. And so I do think what we’ll see is, over the course of time, really great adoption that will then probably drive increasing cost, which will then probably drive a need for … Hey, we need better optimization around how we’re accessing data, how we’re querying that data, how we’re creating more refined datasets.

It probably actually takes us a bit to our conversation today around automation in general, but I think this is that forever cycle that we see around data, which is open up amazing new capabilities to tap into what previously was very hard to access and tap into. Sets of data, volume, size, and scale, with modeling and other fancy things that we can do drives huge excitement, which drives huge adoption, which drives huge cost. And then the cycle goes into a … All right, now we’ve done a lot of exploration, now let’s actually make this thing more efficient.

And I think we see that cycle play out over and over again. And that’s where I get excited about this, because we’re giving all these people all these amazing new capabilities, and eventually it will go from not just sheer empowerment, but to empowerment with efficiency, optimization, and automation to make it economically viable as well.

Paul Lacey: Yeah. And that is a great point too. What comes to mind, to me, is spreadsheets are oftentimes … Not oftentimes. They are, by definition, a snapshot of the data, right? Spreadsheets are not live in the way that data warehouses can be live or other data repositories can have a constant stream of fresh information, fresh data coming into them. By definition, spreadsheets are just a point-in-time snapshot, and then you can run some analytics that sits backwards-facing.

And you get some really interesting things out of that, but it’s not being backed by live data pipelines that are bringing data in to your models on a regular basis so that you can continually refresh those, which I guess takes us, like you said, to our next part of the conversation, which is … Yeah, it’s great that all these data science type capabilities are coming down stack and now they’re in the hands of people who write spreadsheets, that are very comfortable in spreadsheet world, but then how do we take that back up the stack and say, “Okay, great. Now some analyst or some marketer has hacked together a spreadsheet that creates a really interesting view of our data and all of the things that we’re doing across the company. How do we get that updated on a daily basis?”

Now you’ve got to go back into … Okay, yeah, let’s build the data pipeline, and let’s get this data ingested on a regular basis, and then let’s run all these things. Let’s take all these Python stuff and all these Excel formulas that you’ve done and put them into SQL or put them into Python scripts that run at scale. So yeah. And we got to do that. We got to do that day in and day out, and we got to maintain it when it breaks and all that kind of stuff too. So yeah.

I guess, let’s switch gears and talk a little bit about that problem, the data pipelines. So when we say data pipeline automation, Sean, that’s a very broad term, and I can imagine it’s very generic for a lot of people, because there’s a lot of words in there that people understand. Data pipelines, get it. Automation, get it. But what does it mean to put those two things together? In your vision, what does it mean to actually automate your data pipelines?

Sean Knapp: Yeah. So I think automation is … When we do the communication training inside of Ascend, we have this notion of blur words. You’re supposed to avoid blur words, because they can be very nebulous and hard to define. And in many ways, automation itself is actually a blur word, which is … What does that word mean? And for the Princess Bride fans, oftentimes you sit with folks and you hear them describe automation, and you’re like, “I don’t think that word means what you think it means.” I’m probably dating myself too.

But the whole notion behind automation is it is this slightly nebulous thing, I think, for most people, which boils down to … There’s lots of things I’m doing on a daily basis that I have to worry about that feel like I just shouldn’t have to worry about. We talk about automation for self-driving cars. We talk about automation for infrastructure and container orchestration, like with Kubernetes.

And I’m first reminded of … And then I’ll actually get to answering your question, but first I’m reminded of this survey that we do on an annual basis. It’s a blind survey, not affiliated with Ascend. It goes out through an independent research group, asking hundreds and hundreds of data experts, from CDOs to architects to data engineers, et cetera. And one of the questions is always: are you investing in automation or not? Do you plan on investing in automation?

And I won’t do any spoilers for this year’s data, because I think we’re going to be talking about that in a week or two. So no spoilers on that, but I can tell you last year, it was roughly 3.5% of people said they actually had automation in place, and something like 89% of people said they planned on doing something about it in the next 12 months.

And why I find that so interesting, first from an automation perspective is, man, there is a lot of people who want something that very few people have. And these are the people we’re talking to in the data ecosystem, and so clearly they already have other tools or systems in place. Hopefully, they have platforms like Ascend, but there’s a big market out there. They probably have classic open-source tools around, like Airflow or others, to help schedule and orchestrate. They might have traditional platforms, like Informatica or Talend.

And the interesting part for me, first, around automation was, well, if only 3.5% of people say they have automation, clearly the collective definition of automation is not what the market currently provides, otherwise that number would be much higher, because the market penetration of the combined sets of schedulers, traditional ETL tools, et cetera, is greater than 3.5%. And so clearly, there’s something there that people are drawn to and want that just doesn’t exist today.

When I think about automation, and where I think there’s huge potential, and, frankly, why I started Ascend, it was watching data engineering teams just perform these continually repetitive tasks. And it wasn’t about what I consider to be a very primitive notion of automation, one that I think most of the market has already rejected, which is that automation is just orchestration and scheduling. And that is really the very basic version of … At this time, on this trigger, run this series of steps.

And the reason why I don’t think that qualifies as automation is it’s not intelligent enough to take things off of the list you have to worry about as a developer. So when I think about some of the things around pipeline automation and the need for automation, it actually goes back to: what are the key questions that you, as a pipeline developer, have to answer at a really high level before you start writing code?

And so those questions are generally things like how often is my data changing upstream? When my data changes, what do I do with it? Am I propagating it through? What’s the correlation of a new block of data with previous blocks of data? Do I have to reprocess all data for all time once the new data comes in and changes, or can I just incrementally process data? If something fails along the way, what do I do? Can I safely restart? Do I need to alert somebody? All the way up to … What happens when I’m running code on data? What happens when that code changes? Do I need to run the new version of the code on all my data, or do I just run the new version of code on all the new data, or is there a certain point in time where I need to do it?

There’s all these questions that, when we think about the pipeline itself that we have to answer, it actually requires a lot of domain expertise, thinking about, well, what is the nature of the data? How is it being used downstream, as well as how is it coming in? Do I have late-arriving data, for example? Am I building pipelines with mobile data, where I could actually get people phoning home 30 days after an event occurred?

So all of these kinds of things we have to write code for when we build pipelines. If it’s just dumped straight into a warehouse and I can query it, yeah, sure, just dump more data in and do an ad hoc query, but that becomes expensive and unoptimized, per the earlier part of our conversation. And so when I think about automation itself, it is designed to, or should, solve the problem of: how do I not have to think so much about all these questions and their answers? Can I instead say, “I don’t know, you’re the system. You figure out the relationship between the code and the data, and you just go do the smart stuff. I’m just going to tell you what I think of the data, and how it should be optimized, or I want to tell you how it should be processed and shaped. And you go figure out how to optimize it as the system itself.”

And I think that’s the ultimate goal and panacea of pipelines, is be able to treat the whole world just like you’re querying warehouse tables. Let it figure out how to optimize the propagation of data. And better yet, not even have to think in the same sort of granular level that we query today, but just create semantic models of the data and let the system be smart enough based off of all the metadata it tracks, based off of all the heuristics that it can monitor around the data and the pipeline, and let it figure out how the data moves through, how it gets processed, how it gets optimized. I think that’s the real, ultimate goal.
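One way to picture the “relationship between the code and the data” that Sean wants the system to own: treat every combination of code and data partition as a fingerprint, and recompute only what changed. The sketch below is a minimal illustration of that idea, not Ascend’s actual implementation; every name in it is invented:

```python
import hashlib

def fingerprint(code: str, partition_data: bytes) -> str:
    """Hash of the transformation code plus one partition's input data."""
    return hashlib.sha256(code.encode() + partition_data).hexdigest()

def partitions_to_run(code, partitions, previous):
    """Return only the partitions whose code+data fingerprint changed.

    `previous` maps partition name -> fingerprint from the last run. A code
    change invalidates every partition; new or changed data invalidates
    only the affected partitions.
    """
    stale = {}
    for name, data in partitions.items():
        fp = fingerprint(code, data)
        if previous.get(name) != fp:
            stale[name] = fp
    return stale

# A new day of data arrives; only that partition needs work.
code = "SELECT region, SUM(revenue) FROM events GROUP BY region"
day1, day2 = b"day-1 rows", b"day-2 rows"
prev = {"2024-01-01": fingerprint(code, day1)}  # state after yesterday's run
stale = partitions_to_run(code, {"2024-01-01": day1, "2024-01-02": day2}, prev)
print(sorted(stale))  # only the new partition is stale
```

Change the SQL string and every partition’s fingerprint changes, which is exactly the “does new code mean reprocessing all my data?” question answered by the system instead of the developer.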

Paul Lacey: That’s interesting. So I guess, to summarize what I heard there, it’s about the orchestration, and being able to graduate from very simple trigger-based or time-based orchestration to saying, “Hey, I want the system to figure out when it should run pipelines.” I don’t want to be in the business of telling it, “Hey, if it fails, then … If this, then do that,” all these logic branches and stuff like that. So in your mind, do you think orchestration tools have an expiration date on them, when we think about the future of automation?

Sean Knapp: I think they do. I think classic orchestration just has to evolve. I think classic orchestration and scheduling tools shove the entirety of the cognitive burden onto the developer. I think that’s very natural for the early stages of evolution. We see all of the classic model is … Timer and trigger, execute DAG. And we can keep adding more and more convenience layers around how to create these, but if you think of cognitive load required, you still are forcing whoever’s building that thing to think through all the side effects of every step, and everything that could go right, hopefully, more importantly, everything that could go wrong, what happens with the side effects of those steps.

And the more sophisticated the technology gets, the more it’s capable of offloading that. And we’ve seen, across other domains… We’ve talked about this before. We’ve seen across infrastructure, to front end, to databases, the move towards declarative models and systems, and even declarative control planes, because of the sheer power of that automation.

And so I do think, over the course of time, the value proposition of just traditional orchestration tools goes down pretty dramatically. The reason why I think it goes down slower than I would argue it could is we already have a lot of systems built around these models, which are very imperative models. And oftentimes, it takes retraining of our brains to think in a declarative model.

And I’ve seen this from front-end engineering to infrastructure to data pipelines now. So across three drastically different domains. And it requires a retraining of the brain, which I think is a really interesting conversation in and of itself. Once we’re used to imperative tools, moving to declarative ones, like going from jQuery to React, makes us start to think differently. And probably infrastructure with Terraform or Kubernetes is an even better example. In a declarative mindset, we actually get to stop at the first few steps, which is: what is the outcome we want to drive? What is the state of the system that we actually want in an ideal state? And we describe that.

With imperative tools, we always have to go way into the weeds and think of, well, how are we going to get to that state and how are we going to ensure that we’re always in this state? And that’s the cool part about retraining our brains to be more declarative is the human brain starts at the declarative model. We’ve just been hammered into this new, imperative structure of, oh, I’m going to quickly get through what I want to achieve and get into all the minutia and micro details that shouldn’t matter about how I’m going to get there.

And so we have to actually retrain our brains to stop earlier in that process and think back up to the higher-level constructs, which takes time, but when we get there, it’s really cool, because you can just get through so much more. You can think faster, you can build faster, and you free yourself from the grunt, grinding part of implementing.
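The imperative-to-declarative retraining Sean describes boils down to something like a toy reconciler, in the spirit of Kubernetes or Terraform (this is an illustrative sketch, not any particular tool’s API): you declare the end state, and a loop computes the diff.

```python
# Declarative: describe the tables you want to exist, not the steps to build them.
desired = {"raw_events", "clean_events", "daily_summary"}

def reconcile(desired: set, actual: set) -> dict:
    """Compute the actions that move `actual` toward `desired`."""
    return {"create": sorted(desired - actual), "drop": sorted(actual - desired)}

# The imperative version would hard-code "create X, then Y, then Z" and break
# the moment reality drifts; the declarative loop just re-diffs and converges.
actual = {"raw_events", "stale_table"}
plan = reconcile(desired, actual)
print(plan)
```

The developer stops at `desired`; the “how,” including every failure and drift case, belongs to the engine.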

Paul Lacey: So would it be fair to say that in this world that we’re imagining, which actually is closer to a reality than I think a lot of people think, once they spend a little bit of time with the Ascend platform, because we’ve been doing a lot of work on the Ascend side to make some of this stuff possible. But I guess the ideal state would be you define the transformation logic that creates the models, and then you just stop, and you let the system figure out when and where to run the pipelines in relation to existing data and new data that might come in on an interval basis, or even on a streaming basis if you’re doing CDC or things like that, micro-batch.

Let the system figure out how and when to run the infrastructure to create those models, and then if and when you do want to change those models, you just change the definition of the transformation that creates the model, and let the system figure out where does that need to go. Does that need to rerun the entire dataset? Does it only apply to a window? Is it a windowing function? Does it only apply to the last 30 days of data, if it’s telemetry data, for example. So I’m only going to run it on a partition of the dataset. It’s not the entire dataset. Is that kind of like the ultimate vision?
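Paul’s windowing example is straightforward to sketch: when the logic changes on windowed telemetry data, only the partitions inside the lookback window need to be recomputed (the dates and helper names below are illustrative):

```python
from datetime import date, timedelta

def partitions_to_reprocess(all_partitions, today, window_days=30):
    """When transformation logic changes on windowed telemetry data,
    only partitions inside the lookback window need recomputing."""
    cutoff = today - timedelta(days=window_days)
    return [p for p in all_partitions if p > cutoff]

today = date(2024, 3, 31)
partitions = [today - timedelta(days=n) for n in range(90)]  # 90 daily partitions
rerun = partitions_to_reprocess(partitions, today)
print(len(rerun))  # the 30 most recent partitions, not all 90
```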

Sean Knapp: Yeah, exactly. And I’ll be the first to admit too, I don’t think we’re all the way to the ultimate vision yet. I mean, I think the ultimate vision is you just blink and the system will build the pipeline for you. So we’ll wait for Elon and Neuralink to figure out those last pieces. But I think we’re getting closer and closer, and at a pretty rapid pace, from a technological capabilities perspective, to just simply being able to understand the outcome that the user wants.

And this is actually the cool part about declarative platforms is even the product feature cycle usually evolves into a … Well, afford the user the ability to articulate their outcome, such as, “Hey, I have a data propagation SLA. I need my data, from when it shows up, to be all the way downstream into the final materialized models within X number of minutes.” And that’s the outcome that they want. So I have a freshness guarantee, or I have a gate on propagation until these data quality checks are met.

And affording those users ways to basically communicate the actual higher-level constructs, versus forcing them to have to build all these imperative little constructs, that’s the cool part about the product iteration cycle: once you have a powerful engine built around this declarative model, it just becomes similar to what we’ve seen with Kubernetes, the perpetual cycle of creating further affordances for the user to articulate what they want to achieve. And that’s a really cool cycle.

So that’s why I think we’re over the hump from an “Is this technologically possible?” perspective. I think we’re already over the hump on that. I think we’re now in the rapid expansion of capabilities for users to continue to deliver increasingly complex and nuanced sets of outcomes with their pipelines.
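As a hypothetical sketch of what “articulating the outcome” might look like (the field names are invented, not Ascend’s actual configuration), a freshness SLA and a quality gate reduce to a small declarative spec that the engine, rather than the developer, is responsible for enforcing:

```python
from datetime import datetime, timedelta

# What the user declares: outcomes, not steps.
spec = {
    "freshness_sla_minutes": 60,       # data must reach final models within 1h
    "quality_gates": ["no_null_ids"],  # propagation blocked until checks pass
}

def may_publish(arrived_at, now, passed_checks):
    """The engine's job: enforce the declared outcome, however it schedules work."""
    within_sla = now - arrived_at <= timedelta(minutes=spec["freshness_sla_minutes"])
    gates_ok = set(spec["quality_gates"]) <= set(passed_checks)
    return within_sla and gates_ok

arrived = datetime(2024, 1, 1, 0, 30)
print(may_publish(arrived, datetime(2024, 1, 1, 1, 0), ["no_null_ids"]))
```

How the data actually gets there on time, and what to rerun when it doesn’t, is the engine’s problem, which is the whole point of the declarative model.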

Paul Lacey: That’s awesome. And it’s great to hear, as everybody that’s tuning into the show, I imagine, is involved in data pipeline creation or maintenance at some level, to understand that there is going to be light at the end of the tunnel, where you’re not constantly having to maintain these data pipelines and respond to errors that happen in the middle of the night just because something that you wrote is no longer compatible with the dataset that’s flowing through the pipeline today, and it’s something that you didn’t foresee happening when you wrote the pipeline maybe six months ago, a year ago, a couple years ago. And now you’ve got to get sucked back into maintaining this code, and trying to update it.

If you’ve got a system that’s actually dynamically responding and figuring out what the new data model is and how that relates to the code and the codebase and the data, incredible. You unlock some incredible potential.

Sean Knapp: Yeah. I think we end up … If we and the industry do really well at delivering on this promise, I think that the future state is we’re accelerating data engineers, analytics engineers, you name it, into a data architect role, where they get to think far higher level around the movement of data, the modeling of data, the value of data, and get to propel them into that stage of their careers even faster, which I get really excited about because I think it’s a higher-impact stage as well.

Paul Lacey: That’s amazing. And that is the application of automation at the end of the day, right? It’s getting rid of some of the rote work so that you, as an individual, can progress in your career and start to work on more and more strategic, higher-level things, which is great.

And Sean, I hate that it always feels like a cliffhanger ending on this show, because we’re always like, “Oh, and we’re out of time, but there’s like five more things we could talk about with regard to this, so stay tuned.” I promise this is not like an M. Night Shyamalan-style series that opens up more questions than it answers on every single episode. But I do want to actually mention that there’s a lot we can talk about under the hood, about how we’re actually getting this done and what the technology requirements are, for the people that are thinking about this, and especially the folks that tune into this show that really like to grok how things work. Not just what they do, but how they do what they do.

So maybe that’s something we can get into next week, Sean: peeling back the onion or popping the hood, whatever analogy you want to use, and looking at how the system actually does what it does. What are some of the hard problems you have to solve in order to get this to be a reality? So let’s go ahead and leave them with that cliffhanger, Sean. But yeah. Thanks, everybody, for tuning in. And I can’t wait to continue down this journey with you on next week’s episode. So this has been the DataAware Podcast with Sean Knapp and Paul Lacey. And we will see you next time.

Sean Knapp: Awesome.