We are thrilled to launch our new DataAware podcast! With so many fantastic conversations happening with customers, partners, peers, and more in the data engineering realm, we decided we couldn’t keep all the insight and knowledge to ourselves any longer.
In this podcast, you’ll hear from a wide variety of folks who are building, managing, and using data pipelines talking about trends, best practices, pitfalls, and the overall fun of data engineering.
In this inaugural episode, I chatted with our own Sean Knapp about the trends we’ve been seeing in the data engineering realm, as well as the shift the data engineering function has taken over the last several years. You can check out the first episode of the DataAware podcast below, or jump over to SoundCloud to subscribe, with other platforms coming soon!
Episode 01 Transcript: An Intro to DataAware and What Data Engineering Means Today
Leslie Denson: Are you ready for a fun and fantastic journey with the new DataAware by Ascend podcast? We are. And to get this new series kicked off right, I, your host Leslie Denson, am joined by Ascend’s founder and CEO Sean Knapp to talk about the latest and greatest trends and best practices that all have to do with data engineering and the vast world that surrounds it. Join us for this, the very first episode of DataAware, a podcast about all things data engineering.
LD: Hey everybody, welcome to the inaugural episode of DataAware with Ascend. My name’s Leslie Denson, and I’m here with Ascend’s founder and CEO, Sean Knapp.
Sean Knapp: Hey, everybody.
LD: To talk about all things data engineering in and around the space. We’re gonna have a really fun time on this podcast and in the episodes that we have coming up for you guys, talking with folks around Ascend, our customers, our partners, and just other people in the industry about what they’re finding with trends with data engineering, some fun use cases that they have, some best practices. What’s always fun is worst practices that you can learn from, and even more. So we are really excited to get started with this today. It’s been something that we’ve talked about internally for a while and are just stoked to get going so that we can chat with you guys more about data engineering and chat with you guys more about what you’re seeing around that space. So, super stoked. How about you, Sean?
SK: I’m super, super stoked.
LD: Awesome. Well, with that, why don’t we talk a little bit about exactly what we’re gonna be talking about through the life of the podcast, which is some of the recent trends and changes we’re seeing with data engineering. There’s a reason why we have a company. There’s a reason why there is a massively growing industry around the idea of data engineering ’cause there is a lot of really cool stuff that’s happening. So, let’s talk a little bit about what are some of the more exciting things that you’re seeing happen right now, Sean? What’s got you super excited about this space?
SK: Yeah, I think we see a couple of really interesting trends, and I think a lot of the roads are pointing back to sort of a shared set of goals across various types of teams. One of the things we see is from the earlier adopting data engineering teams. The people who were the early adopters of Spark and built a lot of other systems, even on the “0.x” releases of those technologies, and have been moving really fast with a lot of the streaming technologies and so on as well. The trend that we’re now seeing with a lot of these folks is they’ve met a lot of their goals and their needs around, can we store and process and move it around fast enough? Sort of volumes of data, right? Sort of volume and velocity goals and needs have been met.
SK: And so now what they’re actually trying to do, we see a lot of teams circling back, is trying to alleviate their maintenance burden, trying to automate and robustify a lot of their technologies. So, we see that with a lot of the data engineering teams today. What we also see is a ton of these other teams, the people who haven’t been in the data engineering game yet, but who really depend on data engineering, the data scientists, the data analysts, the ML teams, etcetera, all who sit downstream, right, who get data feeds in essence from data engineering teams. They have a really strong desire to self-serve and to actually work their way further upstream, so to speak, earlier on in that data life cycle. So that they can offload from their data engineering teams and can actually move independently as they build out their own products. And so we see this combination of these two teams coming together, really investing heavily in democratizing and creating this new wave of data engineering where, whether you like to operate deeper in the technology stack at lower levels, and sort of deeper in languages and infrastructure, or all the way up at the higher end of the stack, and self-serve, and create some of your downstream feeds, across the board.
SK: Everybody’s starting to align around this notion of self-serve and democratization, and that obviously has a lot of requirements that go with it.
LD: Yeah, it’s interesting. So in past lives, places that I’ve been before, and we’ve talked about this before. We would go to Strata and we would talk to the same sort of audiences. It was a little bit before they were even… There were data engineers, but the term wasn’t quite as around, and then the data scientists. And you would hear the data engineers just go, “I’ve got so much on my plate. Like there is so much that’s happening. I can’t like… I’m working 12-hour days to try and keep up with the pace of business.” And then, to that point, you’d talk to the data analysts and the data scientists, who were like, “I can’t even begin to get access to the kind of data I need in the amount of time that I need it to get questions answered.”
LD: It’s that push pull that I feel like is kind of ages old, has been around for a while, that’s becoming even more and more pronounced with the cliche of data getting faster and data coming in more, and there just being exponentially growing amounts of data that’s disparate and all over the place. When you add in the number of technologies that people are trying to use with it, it’s just a problem that seems to be compounding itself over and over again.
SK: I totally agree. I think we’ve focused for the longest time in the data engineering world around this notion of scale, right? And so we focus heavily on, How do I scale to more bits and bytes and records and to the data equivalent of the old speeds and feeds? Yet, the scale challenge we have next is actually that of dependency and complexities. How do we build ten times more systems, ten times more products, 100 times more products with ten times as many people? And the intricate dependency graph, if you will, of all these people doing more things is actually the next scale challenge. And that is why we see things like the notion of data ops emerging, which becomes a really fancy way of basically saying, How do we get more people doing more things with data faster and safely? Same thing as we saw from DevOps a number of years ago. And that’s really where I think some of the most exciting things… And where some of the most exciting innovations and things are going to happen, are in the progress forward of making it so that more people can accomplish more goals, and build more complex systems that don’t drag with it this linear or even exponential increase in complexity and maintenance burden. And that’s I think this new era that we get to look forward to, hopefully in the data world.
LD: No, and I think that makes a lot of sense. And it’s one of those things where when you look at it a little bit more cynically, everybody… You wanna be able to push off some of that work onto the people who are actually doing it, and by push off, I mean make it easier. So you wanna make it easier and make it accessible for the line of business users to be able to access and work with that data, which goes into a topic that I also wanna talk about today, which is the idea of the shift from your traditional ETL to this idea of ETLT where there’s… Transformation happens maybe in more than one spot, so that the business users can really interact with it and get what they need from it, while the data engineering team can focus on… I hate to call it the harder problem, ’cause that’s not necessarily what it is, but they can focus on other issues without having to worry about some of this.
SK: Yeah, the… I would describe it as oftentimes the more persnickety problems.
LD: Yes, that is a good way to put it.
SK: I think everybody’s, frankly, this day and age, we’re all tackling hard problems, and a lot of it comes down to how do we get the right people with the right skill sets to tackle the appropriate problem. When we look at the evolution of data engineering, we have seen these two camps start to emerge. The first camp was the modern cloud, data lake, data pipeline, engineering cohort. It was the next wave of ETL. You go and launch your big data lake originally on HDFS and your Hadoop cluster and then move it to cloud, and you use Hadoop and then Spark and so on to go and process large volumes of data. And the challenges five, 10 years ago were heavily around, I have to go tune that system, I have to deal with the replication factors and synchronization of data, tweaking JVM and Spark executor parameters and all of these other things. And these were persnickety problems. And as a result, that ETL world really stayed deep in the lower level technology stack.
SK: In many ways, it’s like watching people writing their own… Self-optimizing their code as opposed to using self-optimizing compilers, ’cause they didn’t exist yet, so you had to go really deep in the tech stack. Those were hard problems and they required deep understanding of how those systems work. Well, what happened is, well, one, that’s hard, and it’s slow, and there’s not a lot of people who can go that deep in the tech stack, and so as a result, the business ends up having to still solve certain needs, so what we saw in response to this is the ELT world really emerging. Cloud data warehouses, whether it is BigQuery and then Redshift and then Snowflake, and everybody really started to move towards this ELT world: replicate that data, get it into your warehouse, and transform it on the way out, or even do cascading transforms on that warehouse. And that works for many use cases, but one of the sayings I heard recently was, that works fantastically when you have a pretty homogeneous ecosystem, everybody has standard data sets, they have standard data logic they want to apply to it, but the world’s also messy, and the messier the world gets, the more you need the ET versus the LT.
SK: And what we are really seeing is most companies and teams at some point can’t get away from the fact that they need both, and what they really actually want is they want teams to be able to self-serve on both, and they need this ETLT notion, where some things you’re gonna transform upfront because they’re highly repeatable, well-known, well-formed patterns, and then there’s other things you wanna transform on the way out as it’s ad hoc, it’s interactive, and you don’t know the repeatable pattern yet. But that hybrid model is something that we see everybody needs today, and the business side has for very long periods of time been dependent on the IT engineering side to create those ETL pipelines for them, and frankly, what we see time and time again is the business side, the data analysts, the data scientists, the ML engineers and so on, also wants to create their own systems, and they want to self-serve, and moreover on the infrastructure IT engineering side, depending on your company’s terminology, we’re seeing increasingly everybody wants them to do the same thing too, ’cause nobody wants to be the bottleneck, nobody wants to be buried behind a queue of tickets. Most of these teams are already so underwater, they would love to actually enable people to self-serve so that they can actually start to chip away at the next big wave of technology challenges for their business.
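The ETLT pattern described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Ascend’s implementation: the sample records, the `purchases` table, and every function name are hypothetical, and an in-memory SQLite database stands in for a cloud data warehouse. The first transform is the repeatable, well-known cleanup applied before loading; the second is an ad hoc query run on the way out.

```python
import sqlite3

# Hypothetical raw events; the field names are illustrative only.
RAW_EVENTS = [
    {"customer": " Alice ", "amount": "10.50", "country": "us"},
    {"customer": "bob", "amount": "4.25", "country": "US"},
    {"customer": "Carol", "amount": "n/a", "country": "de"},
]


def extract():
    """E: pull raw records from the (hypothetical) source."""
    return list(RAW_EVENTS)


def transform_upfront(rows):
    """First T: repeatable, well-understood cleanup applied before loading."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        cleaned.append(
            (row["customer"].strip().lower(), amount, row["country"].upper())
        )
    return cleaned


def load(conn, rows):
    """L: land the cleaned rows in the stand-in warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(customer TEXT, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()


def transform_on_the_way_out(conn, sql):
    """Second T: ad hoc, interactive SQL run against the loaded data."""
    return conn.execute(sql).fetchall()


def run_etlt():
    conn = sqlite3.connect(":memory:")
    load(conn, transform_upfront(extract()))
    return transform_on_the_way_out(
        conn,
        "SELECT country, SUM(amount) FROM purchases "
        "GROUP BY country ORDER BY country",
    )
```

Calling `run_etlt()` returns the per-country totals for the rows that survived the upfront cleanup. The point of the split is that the first transform can be owned and hardened once, while the second stays open for whoever needs to ask a new question of the loaded data.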
LD: It goes back to this, and we, again, in past lives, we almost used it, we used to call it failing fast with your analytics, but then it really is more of the iteration process too that this enables, which is, sometimes you don’t know what you don’t know or you don’t know how you need to interact with that data. You don’t know what you need that data stream to look like, you don’t know what the analytics needs to look like. And honestly, until you get… You get it piped into your BI tool and you get some graph out of it, sometimes you don’t even know the answer that you’re trying to get or the question you’re trying to ask, and so being able to make it such that the business can iterate on these data pipelines and iterate on that analytics faster, just helps everybody.
LD: It’s one of those things where, okay, so maybe you’re using… Maybe your analytics team is building their pipelines in development, but it still has to go through data engineering for production pipelines, ’cause that’s something that we see a lot, having the analytics team have the ability to self-serve and do that and do that iteration faster such that you only have to give the engineering team one thing to do just makes everybody’s life so much easier.
SK: I totally agree. You touched on something too that… I’ll pull on that thread a little bit, which is a really interesting one to pull on too is, one of the patterns we’ve seen in the data ecosystem, that’s very different than the software engineering ecosystem, and I know that there’s a ton of overlap, of course, is most data systems are still treated, and how we build and execute projects, how we actually iterate or don’t, and even how we deploy our code, we tend to follow far more of the traditional software engineering practices when it comes to data pipelines and data infrastructure because the data is big and it’s heavy and it’s weighty, it’s hard to move around, so what happens as a result then, is we end up doing really crazy things where… Like, we end up with the equivalent of a monolithic binary. We end up with, like, a very waterfall-y development process as opposed to a much more agile process. One of the ways that we designed the Ascend product, and really just because, heck, this is how we were used to working as software engineers and what we missed when we moved to the data world that we wanted to get back was, nobody was refactoring things, we didn’t have more modular systems, like how we are used to building into a microservice that sits on top of other microservices, so that we can have nice modularization of code complexity.
SK: We were finding we couldn’t get that, and so, we started to do things where we said, “Well, we should actually be able to break things up into smaller components, and let the system figure out how to optimize them, what to persist and what not, and how to move data through. But give us incremental building blocks.” For us, I may write a microservice, and even in the Ascend stack, we have everything from Node, React, Redux to Go-based microservices, to a bunch of Scala stuff, yet they all talk to each other over standard, well-formed APIs, and we can mix and match languages and complexities. We should be able to do stuff like that in the data world while still actually popping up much higher up, and thinking about these as micro-pipelines in many ways, and micro-datasets. That gives us a lot of acceleration factor from an engineering perspective.
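One way to picture the micro-pipeline idea is as small stages that share a single, well-defined interface and compose freely. The Python sketch below is an assumption-laden illustration rather than a description of how Ascend works: the `Stage` type, the `compose` helper, and the three example stages are all hypothetical names chosen for this example.

```python
from functools import reduce
from typing import Callable, Iterable

Record = dict
# Every stage has the same shape: records in, records out.
Stage = Callable[[Iterable[Record]], Iterable[Record]]


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right into one larger stage."""
    def pipeline(records: Iterable[Record]) -> Iterable[Record]:
        return reduce(lambda data, stage: stage(data), stages, records)
    return pipeline


def drop_incomplete(records):
    """Keep only records that carry an amount."""
    return (r for r in records if r.get("amount") is not None)


def add_cents(records):
    """Derive an integer `cents` field from `amount`."""
    return ({**r, "cents": round(r["amount"] * 100)} for r in records)


def only_large(records):
    """Keep purchases of at least 500 cents."""
    return (r for r in records if r["cents"] >= 500)


# Small pipelines are themselves stages, so they can be reused and stacked.
clean = compose(drop_incomplete, add_cents)
large_purchases = compose(clean, only_large)
```

Because `clean` is itself a `Stage`, it can be refactored or swapped out without touching `large_purchases`, which is the kind of modularity the microservice comparison is pointing at.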
LD: It does, and I don’t think anybody would disagree with the fact that the way they think about data and the way they think about their analytics, and the way that they’re thinking about all of that as a whole has obviously shifted pretty significantly over the last 5-10 years, and I think… And even in the last two years. I think it’s gonna continue to shift even more. But what you’re seeing is that everybody pops up with their preferred tool or their preferred programming language, or their preferred… Insert some other thing here. Having technologies that can work across… We talk about flex code and being able to code in the way that you want to. It’s also important to us that people be able to interact with data from different places and going to different places the same way across the board, and make it really dead simple for folks to actually use it. I think it’s gonna be incredibly important to not make everything the same, but to give people an easy way to interact with their data so they can actually solve the hard problems. Do all of this, interact with it the way they want to, do what their business needs to do with it, but not spend hours and hours and hours of time trying to get that done. Does that make sense?
SK: Yeah, it totally does, and it reminds me of… I was chatting with a customer a long time ago, around just sort of the philosophies, as a provider of technology to others. How do we think about what gets you really excited to buy, what might give you aversion and concern? And it’s great commentary, which is, “Look, you may automate away 95% of what I have to deal with on a day, and I’m gonna love you for it, because God, that’s awesome. But if you make the other 5% of what I have to do on a daily basis impossible, I still can’t use you.”
SK: Because you just took 95% to zero, but you took the other 5% to infinity, and as a result [chuckle] it doesn’t balance out, right?
SK: And so, this is why I do think there’s a lot of value in this flex code model, which is, the… Oh gosh, let’s go figure out how to automate what we can and to ease and alleviate the complexities and the pains, but always give layers of additional peeling back of the onion, where you can go deeper and deeper in, where, yeah, maybe you like to use the UI a bunch, but really, you need to be able to plug into a CI/CD system, so okay, cool, here’s all the APIs to do it. Or maybe you really like writing a lot of stuff in SQL, but 5% of what you do, you actually do need to go write some Python for, or you need a really custom JAR that you wanna go plug in there, or even further still, you need custom libraries, you’re gonna go extend the Docker container it runs on. And that may only be half a percent of the time, but if we can make that possible, we free you up to go automate or operate at the higher levels that give you more leverage and impact as a developer, as an analyst, as a scientist. And so, I think that’s the… If you do that, God, you create so much more impact for folks, and you won’t trap them down at the lower levels for 100% of their stuff, just so they can get their job done, which the potential to impact folks becomes so much higher as a result.
LD: For sure. I think everybody… I say everybody loves SQL. That’s not a true statement. Most people love SQL. SQL is always gonna…
SK: I love SQL.
LD: The running joke with me is, the only SQL I know is select * from airplanes, ’cause the company I worked for, we had an airplane demo. [chuckle] It’s the only SQL I know. [chuckle] Not true. I know a little bit more than that. But most people love SQL, and you look at different companies and they’ve built their entire business, as they should have, on bringing SQL into the unstructured data world. But to that point, and something that you said, as great as SQL is right now, it doesn’t solve everything. Sometimes, you just have to get in there. Or, sometimes you just wanna get in there. The thing that I love about folks in the data-sphere is that they’re always trying to push something forward, and they’re trying to do more with what they’ve got. And some of that is trying out things in another language or trying to do something that maybe SQL just doesn’t give you the option to do. And to your point, it’s probably the right thing for 95% of the use cases, but sometimes you just need that little bit more flexibility and the ability to dig in. And nobody wants to lose the ability to do that with the idea of ease of use, because exactly what you just said. If it’s easy to use 95% of the time, but impossible 5% of the time, you’ve ruined any ease of use argument that you have. You’ve just ruined it, it’s just not there. And when you’re trying to bring together…
LD: Between data engineers, data scientists, data analysts, or other people who are trying to use data, when you’re looking at the developer world, different things, there has to be some sort of options and flexibility there, ’cause people have their preferred language or they have their preferred thing, or their… To a point, their preferred library. And so, it’s interesting to me to see what’s out there right now. But then, on the flip side, you have a lot of places that have just said no to SQL all-in, and just can’t use it at all, and that’s obviously not gonna work long-term.
SK: Yeah, and I feel like that’s why there’s so much value in really focusing on, especially in the data world, just saying, like the core entity to focus on is data and data sets and marrying different access models around it. Whether you want to talk to it in SQL or Python, whether you wanna tap into it via a JDBC or an S3 or an API interface, whether you want to query it or you want it to be a persistent materialized dataset through a pipeline. Making it so that you can hybridize these, because there are so many different toolsets because we’re involving so many different skills to work with our data, that trying to anchor on a technology decision or a tool decision only leaves out a significant percentage of the people that you would benefit from having contribute to your data strategy. Whereas if we can focus on the data first, the actual notion of the data itself, and then later on the access layers and the access models and the development models on top of that, that gives us a whole different level of flexibility that allows us to actually take advantage of a pretty diverse set of skills across the teams.
LD: It’s interesting, as we’re talking about this, I go back to a conversation that we’ve had a little bit internally, which is: is the office of the CDO a cost center or profit center? And I’ve been… We’ve talked about that a little bit, and I’ve been sort of parsing that out in my head in different ways to think about it, and I think one of the things that you’re hitting on, which is where I sort of netted out with this, in my own brain, is we have been really focused on the technologies, and I think the CDOs and the offices of the CDOs and therefore the teams that they have in their org have been really focused on the technologies and they’ve been really focused on, “How do I build the coolest new tech stack?” But we have to start moving towards it being more of a profit center where it is driving the business decisions. I mean, there are a ton of companies out there that are only around because of fast access to data. And nobody would disagree with the fact that it makes so many companies more competitive if they can access, use, and do different things with their data in a better way. And so the idea of that makes them more of a profit center than cost center, but that does mean that they have to think about the data stack in a more strategic way. We’re in a good spot of starting to get there and starting to think that way. What would you say to that?
SK: Yeah, I think the… A lot of this is following that natural trajectory of…
SK: There’s this whole new wave of potential as we start to tap into data and the technology that allows us to marshal it at a remarkable scale. And so you start super deep in the tech stack. And a lot of these investments are cost centers up front because they’re big long-term investments for teams and for companies.
LD: For sure.
SK: And the problems to solve are deep in that technology stack. And we see this time and time again, and this is a very good natural pattern to see emerge, which is, over time the problems that become the most important ones to solve climb up the stack. And so they get out of the, how do I run a distributed system, or how do I tune the JVM and tune my Spark jobs, etcetera. As more and more companies enter into that, the patterns emerge and technology companies will even help actually solve that and elevate everybody to higher levels in the technology stack. And so in doing so, we start to get into broader democratization and enabling of more people. And that’s where we can actually start to get closer to really enabling the business. And so that’s why I think the… Is the CDO a cost center or a profit center? And frankly, I think one of the best strategic and career moves a CDO can make is to get the organization, the company, the teams higher and higher up the stack, because you’re enabling more people to drive more impact faster for the business. And that’s what allows you to actually be far more strategic and to really impact the overall outcome for the business.
LD: And to your point, doing that means enabling more people within the business to be able to just get their hands dirty with the data and play with it, and not being bottlenecked on teams that are over-tasked with a zillion different things that they’ve gotta get done. And it’s interesting, I’ve seen, on more than one occasion, organizations that have also figured that out and have, whether they’re termed data engineer or not, some sort of data engineer type role that actually sits in the analytics org to help with this, but that’s just not feasible for every company out there, it’s not something that every company can do, nor is it something that every company should do. So making data engineers out of everybody who needs to use that data by making it easy to use is critical.
SK: Totally agree.
LD: Right. I think that was a really good first foray into our podcast and talking a little bit about data engineering and what we’re seeing. Anything else that comes to mind with all of that that you wanna chat about?
SK: I agree, I think it was a really good introductory overview, I can’t wait to see which of these threads we get to pull on.
LD: I know.
SK: With our future guests, ’cause there’s a whole lot there to really start to unpack, so I’m super excited to see that start to play out.
LD: Yeah, there’s a whole host of things and as we go through more and more, we’ll find more and more threads to pull, and we obviously wanna hear from listeners about what is interesting to them as well. So I will say it here, you’ll hear me say it in every episode, reach out. Let us know what you wanna hear about, let us know if there’s a topic that’s of interest, let us know if there’s something that you want us to dive in on a little bit further, ’cause I guarantee we have somebody in our sphere or are happy to be introduced to people who can help us discuss that. Also, always happy to talk to people who maybe don’t share the same opinion that we do, it’s always good to have some healthy discussion going on like that, so I really wanna be able to hear from the listeners on that. Well, Sean, thanks for your time today, much appreciated.
SK: My pleasure. This is a lot of fun.
LD: I agree, and I’m glad you found it fun because you will be doing this quite a bit.
SK: I cannot wait.
LD: Alright. Thanks everybody.
LD: There you have it, folks. And hopefully, this gave you a good intro into the types of conversations that we’ll be having on this new DataAware podcast. And as I mentioned, we wanna hear from you. We have a great lineup of guests for the show that come from all different walks of data engineering, but we wanna know what topics you’re interested in hearing about or if there’s someone you think we should have as a guest. And as always, if you wanna learn more about Ascend and how we help data teams build and automate pipelines with less code, well, you can visit us at ascend.io. Welcome to the new era of data engineering.