Ep 19 – The Evolution Of Data Stacks

About this Episode

The one where Paul and Sean talk about why Sean founded Ascend, how automation has disrupted so many parts of the developer stack in the past, and why it’s coming for data pipelines next!

Transcript

Paul Lacey: All right. Well, welcome everybody and thanks for hopping on the program today. I’m Paul Lacey, I’m the head of product marketing for Ascend.io and I’m joined by Sean Knapp, who’s the CEO of Ascend.io. So welcome, Sean.

Sean Knapp: Yeah. Thanks, Paul.

Paul Lacey: Great to have you on. So for those that have been following avidly, I’m really pleased to announce that we’re restarting what used to be called the Data Aware Podcast. And Sean, today we can figure out if that’s the name we want to stick with, and maybe even talk a little bit about what that name means, so that folks who tune in understand why we call it the Data Aware Podcast versus other options. We’re going to restart this program as a weekly show, which is great, and we’ll be talking about all kinds of things related to data, data engineering in particular. I can imagine we’re going to hear a lot about generative AI in data engineering going forward, because I know, Sean, that’s a passion of yours, and since we can expect you as a regular on this show, that’s amazing.

So we’ll dive into that. We’ll listen to what you’re hearing from your friends in the field, your contacts, VCs and whatnot, which is super interesting, as well as what we’re doing at Ascend, right? The improvements that we’re baking in, what we’re hearing from the field, that kind of stuff. I figure we’ll also touch on some hot takes: if there’s anything that comes up during the week that’s interesting and buzzing in the community, we can take a look at it and understand what it means.

And then anything else that comes up related to data pipeline automation, which I know is something that’s very near and dear to your heart, Sean, and to a lot of folks who are current Ascend customers, and folks who are future Ascend customers, all kind of thinking, “Hey, how do we get out of this rat race of building data pipelines by hand, maintaining them at scale, and fixing them every time they break? How can we put some of that stuff on autopilot?” So that’s the broad prospectus for the show. Is there anything else you think we’d bring on here, Sean?

Sean Knapp: Yeah. I think you hit on the key things, which is obviously we’re going to be really excited to talk about generative AI and even, I think, its little cousin, automation. And then we’re in a very fast-moving space. There’s a lot happening every single week, and I’m really excited to dive into those trends.

Paul Lacey: Awesome. Yeah. Well, let’s do it. So I guess maybe we’ll start at the top, Sean, Data Aware Podcast was what we used to call this. That has something to do with the Ascend platform, does it not?

Sean Knapp: It does. I love the duality of that description. Part of it is that there’s so much happening inside of the data ecosystem that we as listeners and practitioners inside of the industry want to be highly aware of, and we want to see the trends as they’re moving. The second piece that I really loved was this notion of data awareness itself, from a technical standpoint. Obviously Ascend is an automation platform, and we invest quite heavily in the smarts of the system to not just run some code for you and move some things around, but to fundamentally be cognizant and aware of your data and its linkage to the code that’s operating on it. And so I really love the combination of those two, both from the human perspective for us as practitioners, as well as from the technical perspective of how do we build an incredibly intelligent system that drives these intelligent pipelines?

Paul Lacey: Love it. Yeah, and the idea of the linkage between the two is pretty impressive as well, right? When you have a data aware operating system, when you have a data aware control plane essentially, then you as a consumer can actually be more data aware, because you’re less in the weeds of trying to figure out what’s breaking and how to fix it. You’re actually upleveling into, “Hey, I’m actually working with data. I’m writing high-level transformation logic that’s munging the data and getting it into a good state to produce analytics and other kinds of products with.” So you as a user can actually become more aware of the data, as opposed to the infrastructure underneath.

Sean Knapp: Yeah. Totally. I think that’s a cool way of describing how we think about what have classically been imperative versus declarative systems. And I won’t get too geeky too early on the podcast, but oftentimes we find that the more primitive systems, early on in the technical evolution of a space, kind of pull you into the weeds. And so as a practitioner, you’re pulled deep into the weeds of how the systems are working and how you’re moving things around. And as practitioners, we really want to pop higher and higher up the stack, so that we can spend more time thinking about the data, the data as an actual product, what it means for the business, and how we drive value from that data. And it has always felt like this constant struggle, where the technology is pulling you into the weeds and we as practitioners want to elevate and get more impact-oriented.

And when you have a technology like Ascend, or, let’s leave room for other very awesome technologies out there too, though of course we’re biased and think Ascend is the awesomest of those, when you have this really powerful technology that is aware of the data, the code, and everything that’s happening to your data, it actually frees you to rise higher up the stack and orient your efforts and your thinking more towards the impact of the data, versus the bits and bytes and movement and processing of the data. And that’s a really cool benefit of more intelligent and automated systems.
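To make the imperative-versus-declarative distinction Sean describes concrete, here’s a minimal sketch in Python. This is hypothetical illustration code, not Ascend’s actual API: the imperative version spells out the “how,” while the declarative version states the “what” and leaves the running to a data-aware engine.

```python
# Hypothetical illustration of imperative vs. declarative pipelines.
# None of these names come from Ascend's actual API.

def extract(path: str) -> list[dict]:
    # Stub standing in for reading raw events from storage.
    return [{"user": "a", "valid": True}, {"user": "b", "valid": False}]

def transform(rows: list[dict]) -> list[dict]:
    # Keep only valid rows.
    return [r for r in rows if r["valid"]]

# Imperative: the engineer owns the "how" -- ordering, scheduling,
# retries, and "did upstream change?" checks all live in this code.
def run_pipeline():
    raw = extract("s3://bucket/events/")
    clean = transform(raw)
    print(f"loaded {len(clean)} rows")  # stand-in for a warehouse load

# Declarative: you state the "what" -- the datasets and how they
# relate -- and a data-aware engine fingerprints both the data and
# the code, recomputing only what is stale when either one changes.
pipeline_spec = {
    "raw_events": {"source": "s3://bucket/events/"},
    "clean_events": {"input": "raw_events", "fn": transform},
}

if __name__ == "__main__":
    run_pipeline()
```

The design point is the second half: once the system, rather than the engineer, tracks the linkage between data and code, the practitioner is freed from the orchestration weeds.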

Paul Lacey: Absolutely. Yeah. And, it remains to be seen for sure, but it feels almost like there’s a new persona emerging in some of our customers, certainly, where they’re thinking more about the automation and the impact of their data. I know you’ve called it impact engineering, Sean, in some of your thought leadership work and the talks you’ve given, but it’s thinking about how can I take what I’m doing and supercharge it, basically copy myself, clone myself, so that I don’t have to do as much of the rote work and can step up to that next level. And so it will be very interesting to see if an actual title emerges out of that. We saw analytics engineering emerge out of the fusion of analytics and data engineering, and also the shift right, I guess, of these technologies, and the paradigm of allowing analysts to work with data more because the technologies are becoming more accessible.

You don’t have to be a Hadoop expert, and you don’t have to know how to optimize a Presto cluster to run queries on large volumes of data anymore, because these platforms now exist that kind of abstract all of that away from you. And so we have people who are closer to the business who are able to work with data more directly. Now we’re starting to see the folks who know how to code gravitating a little bit more into the engineering world, and we should see some of the engineers gravitating right, gravitating towards the business a little bit more, so that they get more involved with the end products they’re producing and a little less with the machinery.

Sean Knapp: Totally, totally. I think this is a trend we’ve seen across spaces, not just the data ecosystem obviously, but across other engineering landscapes over time, which is this constant shift right: where can we deploy resources? Where can people have the greatest impact, or even drive the greatest outcome? I’ve oftentimes heard this called outcome-oriented engineering or development as well. And I think the really cool thing happening in the space today is that we give people the opportunity to shift their focus right, or as I like to describe it, to move up the stack, where you get better visibility and can see further along the horizon, if you will. Because we free them from [inaudible 00:08:25] that we used to have to. And for me, now 20 years into engineering and even data engineering, since I wrote my first MapReduce job back in 2004, it’s just been this consistent march right and up, if you will: constantly freeing yourself from the weeds you used to have to drudge your way through, to continually focus more and more on the impact.

And it’s just really amazing: the things a single person can achieve now used to require armies of people even 10 years ago. And that’s one of the things that gets me so excited: what will a single person be able to achieve and accomplish five years from now that even today requires so many people? And as I think about that, what gets me so excited is the sheer rate of innovation we’re going to start to see as more and more companies really adopt and embrace this new approach to how we build data products.

Paul Lacey: Yeah, it’s going to be exciting, and seeing the impact that data is going to have in just about all areas of the economy is going to be phenomenal to watch as well. But you hit on something, Sean, that I’d like to circle back on. You mentioned that you wrote your first MapReduce job 20 years ago. For those that are just tuning in, just getting to know you a little bit, how did you get into data and data technology, and come to found a company based on it?

Sean Knapp: Yeah, well, I’ll use a saying from my parents, which is I’ll try and not make the long story boring. All through college, I did my undergrad and master’s in computer science, and it was actually a really cool experience to get to start at Google in 2004. All of my master’s research was not in data; it was actually in human-computer interaction. So CogSci meets computers. Back at Google in 2004, most of the R&D effort had been deeper in the stack. There wasn’t even a role called a frontend engineer; I was hired as a Java developer because I did the fuzzy stuff that was higher up in the stack, which probably helps date me a little bit. But the really incredible part about how I got into data was that, as somebody really interested in human-computer interaction and cognitive science, we ran all of these experiments on web search, just pushing pixels: “Hey, what happens if I modify the user interface this way or that way, or I realign things this direction, or we re-rank things this other way?”

And that very quickly turned into a lot of experimentation. I was fortunate in that Google already had an experimentation framework in place, and one of the teams there had actually done an amazing rev of a new experimentation framework back in 2005, I believe. And what was really awesome about this was that, as a kid fresh out of college, you could actually, on a Tuesday, go push around some pixels, and on a Thursday write a MapReduce job and see what happened. And you start to scale that, and you start to push more pixels around and run more experiments and see more and more results. That was a really cool experience, and a very impactful one in those formative years: the sheer empowerment you feel when you get all of the data and the feedback loop around all the things you’re trying, as you’re pushing to drive a better user experience.
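For readers who have never written one, here’s a toy, in-memory MapReduce in the spirit of that Tuesday-to-Thursday loop: computing click-through rate per experiment arm. The log records and field names are invented for illustration; Google’s actual framework distributed this same map-shuffle-reduce pattern across many machines.

```python
# Toy MapReduce: click-through rate (CTR) per experiment arm.
# The log format here is hypothetical, purely for illustration.
from itertools import groupby
from operator import itemgetter

logs = [
    {"experiment": "blue_links", "clicked": 1},
    {"experiment": "blue_links", "clicked": 0},
    {"experiment": "control",    "clicked": 0},
    {"experiment": "control",    "clicked": 1},
    {"experiment": "blue_links", "clicked": 1},
]

def mapper(record):
    # Emit (experiment_arm, (clicks, impressions)) for each log line.
    yield record["experiment"], (record["clicked"], 1)

def reducer(key, values):
    clicks = sum(v[0] for v in values)
    impressions = sum(v[1] for v in values)
    return key, clicks / impressions

# "Shuffle": group intermediate pairs by key, as the framework would.
pairs = sorted(kv for rec in logs for kv in mapper(rec))
for key, group in groupby(pairs, key=itemgetter(0)):
    arm, ctr = reducer(key, [v for _, v in group])
    print(f"{arm}: CTR={ctr:.2f}")
```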

So that was the initial experience that got me into data. But when I went with two other coworkers to found my first company back in 2007, we were focused on the media space, a totally different space than anything we’d ever interacted with. And the first thing we did was build out a ton of data and analytics capabilities, because, A, how could we get anybody to pay us for a product if we couldn’t demonstrate its efficacy? And B, we could provide a level of visibility into that consumption that others had just never had before. So we had this deeply rooted data culture. Over the course of eight years, we built out this amazing data engineering team and really innovative technologies, and got to work with all the cool new cutting-edge technologies that were coming out over the course of that almost-decade.

And ultimately, as we grew, we were acquired, and had a really fantastic journey and outcome. But back in 2015, I was starting to think through, “Hey, what’s going to be the next big shift in the space?” Let’s zoom out and assume that the new cloud data platforms and warehouses, the Snowflakes, the Databricks, the Google BigQuerys of the world, et cetera, have been ubiquitously adopted, and that we have taken off the table the challenge most companies at that point in time were trying to solve for, which was: how do I efficiently, and with reasonable performance, store and process incredibly large volumes of data? Let’s assume that problem’s solved. What happens next? What ultimately got me really excited to start Ascend was that the next problem is a scale problem, but a fundamentally different scale problem.

It is not a bits-and-bytes, speeds-and-feeds problem. It is a people and complexity problem. Once you take these amazing technologies and open them up to 10 to 20 times as many viable users, and you take the ever-expanding number of data products we need to produce and the ever-expanding number of data sources and repositories that hold viable data, you get this exponential expansion in complexity, and scaling for that is a different kind of problem. And solving for that is really interesting. So that’s what ultimately got me excited about this new wave of automation: when you have to solve for scale in a human-complexity and system-complexity problem, history has shown time and time again that we solve for it through automation. And that’s why, over the course of a 20-year journey, it’s gone from just writing your own MapReduce pipelines, to building these amazingly large-scale, high-volume, high-velocity systems, to now seeing that the next big macro wave of innovation in this market is going to be around productivity, automation, and human efficiency.

Paul Lacey: Yeah, yeah, absolutely. And I guess we see that happening in other markets as well. You mentioned Hadoop and MapReduce: my first big data technology was also MapReduce, in a managed service for Hadoop. Even back then, in the 2015 timeframe, people were still fighting it a little bit, saying, “But I’d like to tune my clusters. I know what these parameters mean and how I can optimize, and all this kind of stuff.” But now we have Snowflake and Databricks and amazing platforms like [inaudible 00:15:58], BigQuery as well, where you just drop the data in and say, “Run a query for me.” And it will figure that out under the hood. It will figure out the partitioning strategies and sharding and whatnot that just make it perform lightning, lightning fast. And of course, if you want to throw more money at it, you can, and it goes faster.

Sean Knapp: Just like cars, you throw more money at them and they go faster.

Paul Lacey: That’s right. Yeah, exactly. Drop into a Ferrari and you can do whatever you want. Probably don’t want to take the kids to school in a Ferrari though; that would get expensive. But yeah, we’ve seen that progression already happen in the data plane world, where people used to really geek out over this stuff. They used to have their hands in the gearbox, so to speak, pretty intimately controlling and fine-tuning these machines. And now they just say, “You know what? I’m just going to send a check off to Snowflake or Databricks every month and they’ll take care of it for me.” So we’re already seeing that.

Sean Knapp: Yeah. And I think it’s a really cool shift, and we’ve seen it progressing over the course of time. You tend to have some of the early adopters in these waves who are like, “Oh my gosh, thank God I don’t have to deal with this anymore,” and they drive a bunch of the adoption. And you see some of the laggards who still hold on and say, “But I could still optimize this job a little bit better myself.” Where usually the counterargument is, “Is it worth it? Does that drive a greater actual outcome, or could you take your superpowers and apply them elsewhere?”

And I think what we’re seeing over the course of time is that organizations now pretty uniformly are realizing, “Hey, there’s so much benefit to be gained from taking all of our intellectual horsepower and deploying it to the higher-order challenges where we can really move the needle for the business, and leaning on these amazing technologies that are available to us now to continue to lift us up.” It reminds me very much of way back when AWS launched, on the premise of this whole, “We want to get you out of the muck.” And I think that’s the whole big premise today: we don’t want people in the muck. We can lift you out of all of the monotony and keep you focused on much higher-order, higher-impact challenges.

Paul Lacey: Totally. So you mentioned that you managed a data engineering team. How large was that team and did you start to feel some of these problems while you were there?

Sean Knapp: Yeah, we definitely did. I think we peaked around 50 or 60 folks across the data team, and that’s factoring in the whole function, from infrastructure to data engineering to product management. And the biggest challenge we ended up seeing wasn’t whether we could build a really large-scale system. We had a few billion new data points per day coming through the system, doing near real-time processing and really powerful analytics around consumption of different pieces of content, and we had all these really amazing data products that we produced out of it. The challenge we faced was hard: when we wanted to create a new report or a new piece of functionality, that was hard and expensive. We had built this system that was amazingly powerful, but if we wanted it to do anything different than what we had originally envisioned, it was hard, and things that conceptually we would think, “Hey, this would be awesome, maybe this should be a two-to-four-week project,”

really turned out to be nine-to-12-month projects. Once you start to get that timeline extension, it’s really not viable as a business. And I think the reason that really compelled me down this path to Ascend was that we had some of the absolute best and brightest, hardworking, talented folks, and we had built this amazing system, and what I ultimately started to realize was, “Wow, in these rapidly changing domains and ecosystems, the next challenge is really going to be this complexity around complex systems, and how fast an organization can move and adapt.” Oftentimes, as we build these huge systems to operate at this incredible scale, trying to squeeze and juice out all of that performance, at the same time we lose all of our adaptability and flexibility from a technical architecture perspective. And we lose the ability to say, “Oh my gosh, we need to go launch a new data product that does this new thing, that opens up this new market for the business, and we need to go do that in three weeks.”

And if you can’t get that out of your architecture, you’re going to be fundamentally limited in the kinds of things that you as a team and as a business can go after and pursue. And so one of the big sayings we have here at Ascend is that the measure of great systems, especially in fast-changing markets, is how effective they are at making change cheap and making change easy. You do that through high levels of automation, high levels of data integrity checking, safeguards, and so on. The easier you make change and adaptability for folks to pursue, the more dynamic and innovative their teams and cultures can be. So that was really a core tenet of the design behind Ascend: we want our customers to be able to move faster and be more efficient, and we want to deliver that by giving them a product that is highly automated and easy to use.

Paul Lacey: Right. Yeah, that makes a ton of sense. So essentially taking the benefits, but also the drawbacks, of a traditional data engineering system, and flipping it on its head, saying, “What if the system is able to handle 70, 80, maybe pushing 90% of the traditional rote data engineering work that people are doing just to keep the lights on?” Which, as you mentioned, you had optimized at scale. You had built the crème de la crème of systems for doing one thing and one thing extremely well, but at the expense of being able to then do something completely different.

If we were to project forward, let’s say 10 years out, maybe even five years out, what’s the vision? What’s the ultimate goal? Is it that you’re able to just say, “Hey, I got this data from BigQuery and I got this data from Google Analytics, can you sessionize it and then associate it with the user profiles I’ve got in my CRM data?” Is it that level of automation, where the system just does it by itself? Or in your mind, is it more, “Hey, we’re still writing code, but it’s assisted code writing and autocompletes and that kind of stuff,” when you think about how automation actually impacts data engineering?

Sean Knapp: Yeah, so I think we’re absolutely moving in that direction. For those who have spent a lot of time with me, certainly the folks inside of Ascend, know I’m incredibly bullish on generative AI. And so I think there’s this amazing future where we’re going to be able to utilize gen AI in incredible ways to articulate higher-level concepts and lean heavily on the technology to deal with a lot of the monotony. I think we’re in this awkward state right now where… What was it, about a quarter ago, when we had a kickoff and I showed this superhero meme from Shazam, where the kid all of a sudden becomes a superhero and he’s like, “I have all these superpowers, but I can’t even go pee in this suit.” The utility is kind of awkward at the same time.

And the reason I thought that was so appropriate is that we have all these amazing new superpowers and they work like 70% of the time, but we’re in this awkward growth stage too, where we’re trying to figure out, “Hey, what can we use these technologies for? What are they really good at? Can we trust them a hundred percent?” The answer there being no, not yet at least. So how do we find our way through this? When I start to think and plot out a course of where I really think things go, one is, as spaces mature, we see, not just from a gen AI perspective but even more basically from a data engineering perspective, that despite there being an incredibly large number of geographies and industries and verticals and use cases, over time all the best practices really start to emerge. And so we start to see a lot of standardization and normalization around the things people do to achieve highly diverse outcomes.

The things they do tend to coalesce around a handful of patterns. And the reason I think that first wave is really important is that the cardinality, the space of things that generative AI or automated systems have to tackle, then becomes more manageable, and that’s what makes it ripe first for automation and then for even more generative AI. So what we’re seeing right now in the current landscape is that it doesn’t really matter whether you’re doing sessionization of user activity or healthcare forecasting: how you build the data pipelines to do that, at a lower level, is becoming more and more common and normalized, and that is fantastic. Because that means we can productize a lot more of that layer, and we can automate a lot more of that layer. So what I think is going to happen over the course of time is we’re going to keep seeing this standardization and normalization, which then opens it up for better automation, which lets us move our world higher and higher up.

Then I think what happens is we get, in the short term, the lift from generative AI, because I think gen AI is going to do a lot of the things that frankly we as engineers just don’t want to do, like test our code and write documentation. And so we’re going to lean on gen AI to do that, and for a lot of ideation and curation of some baseline things. I do think, as we get more and more metadata consolidation and more centralized systems, gen AI can do far more interesting things on top of metadata repositories. And so I do think we’re going to ultimately get to the place where you can describe, “Hey, what I’m looking for is X.” Over the course of the next year, we’re going to start to see some of those things emerge, and it’s still going to translate down into some form of code that somebody’s going to have to look at and audit first.

And I think it’ll start to work its way up to, not code, but visual navigation of data sets, and we’ll work our way from there, and eventually it’ll work up into more and more polished no-code style interfaces. But I think it’s going to take a while to get there, and the really bespoke custom-building, again, is much further out on the horizon. But that’s great if we can make it faster and easier for everybody to do all the really common things. Nobody likes writing sessionization functions for user activity. This stuff has been done time and time and time again. It would be fantastic if we could literally just tell a system to do it, and it’s smart enough to go do that, and then we can go do the really fancy things, like forecasting healthcare outcomes and other nifty things that really impact the world.
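Sessionization is a good example of how standardized this logic has become. Here’s a minimal sketch of the kind of function Sean says nobody wants to keep rewriting: split each user’s clickstream into sessions wherever the gap between consecutive events exceeds 30 minutes. The event shape and the 30-minute threshold are illustrative assumptions, not any particular product’s convention.

```python
# Minimal sessionization sketch: a new session starts whenever a
# user is idle for more than SESSION_GAP_SECONDS.
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed threshold, commonly 30 min

# Hypothetical clickstream events (timestamps in epoch seconds).
events = [
    {"user": "u1", "ts": 1000}, {"user": "u1", "ts": 1500},
    {"user": "u1", "ts": 5000}, {"user": "u2", "ts": 1200},
]

def sessionize(events, gap=SESSION_GAP_SECONDS):
    # Group each user's events in timestamp order.
    by_user = defaultdict(list)
    for e in sorted(events, key=lambda e: (e["user"], e["ts"])):
        by_user[e["user"]].append(e)

    sessions = []
    for user, stream in by_user.items():
        current = [stream[0]]
        for prev, nxt in zip(stream, stream[1:]):
            if nxt["ts"] - prev["ts"] > gap:
                sessions.append((user, current))  # close the session
                current = []
            current.append(nxt)
        sessions.append((user, current))
    return sessions

for user, session in sessionize(events):
    print(user, [e["ts"] for e in session])
# u1 gets two sessions ([1000, 1500] and [5000]); u2 gets one.
```

The point of Sean’s argument is that logic this common is exactly what a sufficiently smart system should generate or productize, so engineers can spend their time on the bespoke work instead.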

Paul Lacey: Save people a lot of cutting and pasting from Stack Overflow and customizing.

Sean Knapp: Yes. Amen.

Paul Lacey: Yeah, the age-old development paradigm. Maybe it’ll finally come to an end. But you’re totally right. And one of the things that kind of struck me, so I’m one foot in the technology world, one foot in the marketing, creative world, and when you say that a lot of the things that produce varied outcomes follow similar processes, I think about creativity as a process. There are two big movies out right now, the Barbie movie and Oppenheimer: two very different stories, very different feels, very different experiences. They both went through the same process. They both kind of went through the, “Okay, first we’re going to write the story down in a script, and then we’re going to write a storyboard and pitch it, and then it’s going to get funding, and then we’re going to go hire a studio, and we need these kinds of folks to do this and these kinds of folks to do that, blah, blah, blah.”

And at the end you get a movie, and it could be a very different movie, but it still goes through that same process. Those are the kinds of things that machines are good at doing, at automating. Because when it comes to the point where it’s like, “Okay, we always do these five steps for processing a data set,” sessionizing user activity, for example, and there are just subtle tweaks in the specifics of whether you’re looking at Google Analytics data, health analytics, clickstream data, or log data from other kinds of sources, and the machines can figure out what those tweaks are, or we can actually program those tweaks, then it becomes, “Okay, here’s another one of these things: click run and you get the outcome you’re looking for.”

Sean Knapp: Totally.

Paul Lacey: Makes a ton of sense. Super. All right. Well, Sean, I know that we’re going to try to keep these to around 30 minutes or so, so I think we’re getting close to our time. We could continue this conversation forever, which is great; that’s what we’re essentially going to be doing on these weekly sessions. I can’t wait to talk a little bit more, maybe next week, about semantic models and how you see those impacting this. Because when we think about making templates that machines can then interact with, it certainly seems like there’s something out there, and a lot of folks in the graph database world are gravitating towards some of these things, SPARQL and RDF and that kind of stuff. So lots to get into in the future when we talk and think about automating data pipelines, but I think we’ll go ahead and sign off for now. So thanks, Sean, for joining. Really appreciate having you on, and can’t wait to get to the next one next week.

Sean Knapp: Likewise. Me too. Thanks, Paul.

Paul Lacey: All right, we’ll see you then. Cheers.

Sean Knapp: Cheers.