Ep 14 – Autopiloting Snowflake ETL with the Data Automation Cloud

About this Episode

In this episode, Sean and I do things a bit differently and chat with one of Ascend’s lead engineers, Nandan Tammineedi, about his work helping develop the Ascend for Snowflake platform. Find out more behind-the-scenes information on the development process, including what surprised him and the team, what he’s enjoyed seeing since it went into production, and more, in this episode of DataAware.

Transcript

Leslie Denson: Today, Sean and I are doing things a little differently and going more in-depth at Ascend, specifically with one of our esteemed engineering leads, Nandan, to learn a little more about the imagining, developing, and building out of the Ascend Data Automation Cloud on Snowflake, in this episode of DataAware, a podcast about all things data engineering.

LD: Hey everybody, welcome back to another episode of the DataAware podcast. I am once again joined by the illustrious Sean Knapp. Sean, how’s it going today?

Sean Knapp: It’s going fantastic. How are you? 

LD: I’m doing very well, and I am doing very well because we have a guest that I’m gonna let you introduce here in just a second, who’s gonna come on to help us talk about something that we really love here and we’re super excited to talk about, and I think it’s gonna be a whole lot of fun. And it has something to do with a little something called Snowflake, and it’s just a super cool topic. So, I am just stoked to hear the two of you riff off each other and talk about this ’cause it’s a lot of fun when I hear it in my day-to-day world. So this is gonna be a fun one, don’t you think?

SK: I do think.

LD: Awesome.

SK: And yeah, I would love to introduce someone very special from our team, Nandan, who has been heading up a ton of our architectural efforts. Nandan, I think we can call you an Ascend old timer at this point.

SK: You’ve been around many blocks inside of the company, and as we really started this new initiative, Nandan helped lead a ton of our biggest efforts. Nandan was the brains behind the architectural design and the execution of our data plane and our Snowflake strategy. So, I’m really excited to have Nandan on board with us talking about Snowflake today.

LD: Yeah. Welcome Nandan.

Nandan Tammineedi: Hey folks, I think you played me up too much, Sean, but thank you for that.

SK: Just wait, you’re gonna get hammered by recruiters as soon as this podcast goes live.

LD: I know, right. I don’t think he played you up too much.

NT: Well, so yeah, I’m an engineer here at Ascend. I’ve been an engineer and lead for over 15 years. Before Ascend, I used to work at a company called Netskope, another startup, where I also worked on the data infrastructure team. You could say that I’ve been involved in data engineering in some form, on and off, throughout my career; if anything, my career so far has been bookended by data engineering roles, although I have done a lot of other things. My current focus here at Ascend is building the systems and services that manage the data plane that we’re gonna talk about today.

SK: Well, let’s dig in a little bit here then. Now, we’ve already mentioned this notion of data planes, and I’m sure a lot of folks are already starting to form their own mental model for what that is, but let’s catch them before they go too far down a particular path. What is a data plane?

NT: Well, I’m still trying to figure this out myself. [chuckle] The way we like to think about a data plane is that it’s a storage engine — some kind of storage back-end, at least in the context of data engineering — plus one or more compute engines on top of that. So if you think about what our product does, we help pull data from various sources, we help you easily transform the data — to join it, to annotate it, to roll it up and aggregate it — and then write it to the destination of your choice, right?

NT: And in the process, you need some kind of backing storage. So the data plane is like a proxy for some backing storage engine, plus a choice of one or more compute engines on top of that. An example of a storage engine would be a blob store: if you go to Amazon, you’ve got S3; if you go to Google, you’ve got GCS. For compute engines, you can think of Spark, or you can think of warehouses like Snowflake and BigQuery.

NT: The interesting thing is, the idea of a storage engine is a little bit flexible, because you can think of a warehouse like Snowflake or BigQuery as also being a storage engine — it’s not just the compute. So this is where the definitions are a little flexible, but what we allow in our product is to actually process data using different compute engines — we give you that flexibility — while continuing to standardize on a particular storage technology.
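
To make that pairing concrete, here’s a minimal sketch in Python of the data plane concept as Nandan describes it — a storage back-end plus one or more compute engines. The class and values are purely illustrative, not Ascend’s actual internals:

```python
from dataclasses import dataclass, field

# Hypothetical model of a "data plane": one storage back-end plus one or
# more compute engines layered on top of it.
@dataclass
class DataPlane:
    storage_backend: str                    # e.g. "s3", "gcs", "snowflake"
    compute_engines: list = field(default_factory=list)

# A blob store with Spark on top...
lake_plane = DataPlane(storage_backend="s3", compute_engines=["spark"])

# ...or a warehouse like Snowflake acting as both storage and compute.
warehouse_plane = DataPlane(storage_backend="snowflake",
                            compute_engines=["snowflake"])
```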

SK: Awesome, super helpful. And when we think about how Ascend historically has run, we’ve predominantly been a blob store plus Spark as our processing layer, and the blob store has been, as you mentioned, S3 or GCS, or Azure Blob store. Obviously, Nandan and I have had lots of conversations around this and have been talking about data planes for quite some time. It’s really interesting how we had our notion of the data plane before: just pick your blob and pick your Spark. Sometimes it was Spark on Kubernetes, sometimes it was Spark with Databricks, sometimes it was Spark SQL. But that was really the set of options.

SK: And when we started to look at Snowflake, we had a lot of customers who were reading from and writing to Snowflake, but then we also started to think about how we could process data inside of Snowflake as well. Tell us a little bit about that journey as we started to explore not just Spark, but Snowflake itself as a processing engine. What was the thought process there?

NT: Yeah. Historically in our product, we’ve supported different languages — different query languages for transformation. For Spark, the obvious two choices are Spark SQL, Spark’s flavor of SQL, and PySpark; those are the two examples you can think of, and we also support Scala transformations. But there’s such a large community out there that understands the standard SQL dialect and has been using warehouses and tools like Informatica in the past, and the transition for them to something like Snowflake is much smoother. So I think the choice of supporting Snowflake was an obvious next step. What’s interesting is, we’ve known for a while that our control plane could easily be extended to support other data planes.
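
As a quick illustration of those two flavors, here is the same aggregation written in Spark SQL and in PySpark’s DataFrame API — a generic example, not Ascend’s transform syntax (the table and path are made up):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()
events = spark.read.parquet("s3://example-bucket/events/")  # hypothetical path
events.createOrReplaceTempView("events")

# Spark SQL: the flavor most familiar to folks coming from warehouses.
by_user_sql = spark.sql(
    "SELECT user_id, COUNT(*) AS event_count FROM events GROUP BY user_id"
)

# PySpark: the same transform expressed through the DataFrame API.
by_user_df = events.groupBy("user_id").agg(F.count("*").alias("event_count"))
```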

NT: To support other compute infrastructures, that is. There’s nothing in our data model that necessarily restricts us to just Spark. We have a data model that consists of components. We think about artifacts that we produce that have partitions — they’re processed over time — each component has a schema, and then we have a notion of a table. Even the types of transformations we do: do we preserve the structure of partitions from the inputs, or do we reshape those partitions as we transform them? These notions were language agnostic. So we knew that we had something already; we just had to change the way we architected our data plane infrastructure, which previously was more Spark-focused, if you will. The question for us was, first of all, do all of our existing features fit well with Snowflake? As in, are there any features in Snowflake that don’t fit with our existing model for whatever reason? And there were one or two, I think, that we can talk about. But beyond that, it didn’t significantly change the model of how we thought about the way data gets processed in our system.

SK: So you’re saying that this was one of the benefits of a microservices architecture that we were able to leverage: in many ways we swapped out some of the infrastructure communications layer, but not actually the scheduling layer itself.

NT: That’s correct, yeah. And I think that’s the beauty of the blueprint of the product, and some of the architecture that was laid down before. Granted, we had previously indexed a little too much on Spark, because we thought that that was where the industry would be heading, but we always knew that we could pivot and change. And so I think this effort is all about opening up the space to even more processing engines and warehouses.

LD: So that actually leads into a question that I’ve kind of always had as we were going through this process, which I’ve heard a little bit about. One of the things that you just brought up was that, as you go into all of these different processing engines, there are a few differences in features and functionality — Snowflake has a few things that work differently than they do in Spark. And the beauty was that we could make things work without too much difficulty. What were some of the things where we looked at it and said, oh, we really need to make X work in Snowflake for our customers? And was it incredibly — I don’t wanna say difficult, because you made it work — but was it incredibly difficult? What were some of the tweaks or changes that we needed to make? Did that help inform anything that maybe we would’ve changed or done differently on the Spark side of things, or anything that we would change moving forward for anything else that we do, based on that tweaking for Snowflake?

NT: Sure. Yeah, I think there are at least a couple of things I can think of, and I’ll give you an example of one of them. Snowflake offers the MERGE command. This is a special kind of operation that allows a source table to be merged into a target table. What it’s allowing you to do is, as you ingest new records, merge them in using some logic that you specify — essentially updating a table in place. It’s an operation that Snowflake offers that they claim is highly efficient, but it doesn’t necessarily fit with traditional data processing patterns, where when you produce an output, you expect it to be immutable and you don’t wanna change it.

NT: So that was one place where we had to look at our model a little more closely and ask, how do we support the MERGE operation? Because this is a huge value-add for Snowflake, and it shouldn’t be that if someone decides to use our product to run queries on Snowflake, this is a feature that’s left out or that they cannot use. We just had to figure out how to make it work with our model. It gives users a little bit more leverage, a little bit more control over how their tables are updated. So that was an area where we had to figure out how we would support it, and we have done it.

NT: When we took up this initiative was, how do people consume artifacts from each of these components. In the past, we would have… So roughly, the three types of components and in our data pipelines, you have connectors, which pull data from various sources, APIs, blob stores, databases, data warehouses. And then we have transformed components. Transform components to… This is where the computer engine comes into play, right? SQL, PySpark, and various SQL dials. This is where you’re doing the joining, the aggregation what have you. But each of these nodes in the graph, they write outputs somewhere. And so how is it that users consume this outputs? These artifacts? And previously, when we had blob store, we wrote to unique locations in the blob store, but we have to provide some kind of layer to access that data.

NT: And we had a few different ways of doing it. We had a JDBC interface that allowed them to access those tables. We gave them a few ways to do it via our SDK and our API, where they could preview records. And we also provided something called file-based access, which is basically an S3-compatible interface on top of our blob store. What this did was provide a layer of virtualization where you could access components by their names, as opposed to the unique identifiers that we’d given them in our blob store. So this is a change in our model that we’re calling semantic storage, as opposed to, let’s say, content-addressable storage, which is what we had before. I can go into more detail about this, but I think this is probably the most significant shift in our thinking for data planes.
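
A toy example of that distinction, with entirely made-up naming schemes: content-addressable storage keys an artifact by a hash of the code and inputs that produced it, while semantic storage exposes it under a name the user recognizes, such as a Snowflake table:

```python
import hashlib

def content_addressed_key(component_code: str, input_ids: list) -> str:
    # Content-addressable: location derived from the code and inputs that
    # produced the artifact -- stable and unique, but opaque to users.
    digest = hashlib.sha256(
        (component_code + "|" + "|".join(input_ids)).encode()
    ).hexdigest()
    return f"s3://artifact-bucket/{digest[:16]}/part-0.parquet"

def semantic_name(dataflow: str, component: str) -> str:
    # Semantic: a human-recognizable name, e.g. a queryable Snowflake table.
    return f"ANALYTICS.{dataflow.upper()}.{component.upper()}"

print(content_addressed_key("SELECT ...", ["ingest_v3"]))  # opaque hashed path
print(semantic_name("sales", "daily_revenue"))             # ANALYTICS.SALES.DAILY_REVENUE
```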

NT: Previously, you had to go through Ascend to consume these intermediate artifacts, the outputs of these transformations. And for the most part, if you wanted to look at data on its own in a table, in SQL, you had to create a write connector that would pipe this data out and write it to an actual table. Whereas now, with Snowflake, you can look at the outputs in a Snowflake table directly, and you have a lot more visibility. You can access the underlying storage warehouse yourself, and you get insight into the output of these components directly. You get to see the records, you get to access them through your choice of technology, and Ascend just acts as the orchestration layer in this case.

SK: Yeah, I like that. One of the things you touched on there, which I think is a really core distinction, is that with this new data plane architecture, it’s less about Ascend fully wrapping the underlying storage layer itself. Because of that content-addressable storage paradigm, Ascend needed to be in the data access flow itself, and there were historical reasons for that — for example, on a basic blob store, we were able to provide so much more advanced capability by doing that wrapping and by supporting content-addressable storage. But with this new shift to data planes, we’re actually able to take advantage of more of the modern capabilities of those underlying planes: merge operations at the compute layer with Snowflake, but also other advantages in terms of the storage layers themselves these days, like ultra-efficient data copying that didn’t exist before. As a result, that lets us sit on top as opposed to wrapping, which we like too, because it makes our life a little bit easier and means we can have a smaller footprint, if you will.

NT: Exactly, yeah. We’ve always had a lot of really cool design around how we store artifacts uniquely, based on the code that processes them and the inputs. But I think we had to bridge the gap between that and people being able to see their artifacts by a name that they can easily recognize. So yeah, like you said, that was a big part of the design process when we were building the new data plane infrastructure.

SK: So I have a question around this too. Hopefully this doesn’t get too techy. Leslie, you can pull us back outta the weeds if you want.

LD: Oh, I will. Don’t worry.

SK: What’s the API of a data plane? When we talk about our microservices and so on, we have this notion of a data plane manager, right? It exposes a bunch of APIs for systems to connect to it and tell it to do magical stuff with data planes like Snowflake, and it does the magical stuff. What’s its “API”, its interface to the rest of the world? What does it do?

NT: Well, there are obviously functions to create a data plane, right? You have basic data plane creation functions, where you’re provisioning a data plane. In the case of Snowflake, you’re actually creating a warehouse, for example, or you’re setting up the credentials for connecting to an existing Snowflake instance. But if you think about the interface between our control plane and our data plane within Ascend, it’s the task processing interface. We have a scheduling system that’s constantly evaluating our graph — constantly looking at the state of the graph and the state of the nodes to see if they’re up to date or not. And if they’re not, it generates tasks that need to be executed by the data plane. So that’s the interface we have between our scheduling system and our data plane: the interface to actually run tasks. And even within this, there’s a differentiation. There are tasks that ingest data — tasks that are very connector-specific, that pull from various types of sources. And even within those, you have tasks that list and identify…

NT: … what objects you have in your input data source, and then ones that actually read and parse them and pull them into your storage back-end. And then you have the transformation tasks. A rough way to describe a transform task is: you have some code, written in some language that’s data plane specific — understood only by the data plane — but the task that the scheduling system generates has a certain shape. Part of the task describes whether it’s preserving the structure of the input, or aggregating the input and reshaping it in some way; it describes, for example, if you’re unioning with another input, which other partitions to union with. That’s part of the interface. It also tells you where you’re writing to — what location, or what unique partition ID in the output to write to. And this is where the data plane takes over and decides, “Okay, this is my storage engine; I know how to translate the writing part.” So there’s a compute phase, and then there’s a write-the-output-to-storage phase.
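
Pieced together from that description, the task-processing side of a data plane’s interface might be sketched as an abstract class like the one below. The method names and signatures are our own guesses for illustration, not Ascend’s real API:

```python
from abc import ABC, abstractmethod

class DataPlane(ABC):
    """Hypothetical sketch of a control-plane/data-plane task contract."""

    @abstractmethod
    def list_objects(self, connector_config: dict) -> list:
        """Ingest, phase 1: enumerate the objects at the input source."""

    @abstractmethod
    def read_objects(self, object_ids: list) -> None:
        """Ingest, phase 2: read and parse objects into the storage back-end."""

    @abstractmethod
    def run_transform(self, code: str, input_partitions: list,
                      output_partition_id: str,
                      preserves_partitioning: bool) -> None:
        """Run plane-specific code (SQL, PySpark, ...) over the given input
        partitions, then write the result to the named output partition --
        the compute phase followed by the write-to-storage phase."""
```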

NT: The other part is schema inference. Think about when you change the configuration of an existing component: you need to know what the output schema is gonna look like. So that’s part of the data plane interface as well, which is, “I’m gonna be writing some new output — has the schema changed, because the code has changed, or does it stay the same?” So there’s a schema inference part of it. And this is again an interesting part of data planes: we have to create a bridge between the schema that we present in Ascend and, let’s say, the data types that the storage engine presents. This is another interesting challenge, because different data planes have different native support for data types.
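
That bridging might look something like a type-mapping table — again, with invented values, purely to illustrate the idea of translating one plane-agnostic schema vocabulary into each engine’s native types:

```python
# Hypothetical mapping from a canonical schema vocabulary to native types;
# real mappings have many more edge cases (timestamps, nested types,
# precision limits, and so on).
CANONICAL_TO_NATIVE = {
    "string":  {"snowflake": "VARCHAR",      "spark": "StringType"},
    "long":    {"snowflake": "NUMBER(38,0)", "spark": "LongType"},
    "double":  {"snowflake": "FLOAT",        "spark": "DoubleType"},
    "boolean": {"snowflake": "BOOLEAN",      "spark": "BooleanType"},
}

def to_native(canonical_type: str, plane: str) -> str:
    return CANONICAL_TO_NATIVE[canonical_type][plane]

assert to_native("long", "snowflake") == "NUMBER(38,0)"
```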

NT: Yeah, and we have records preview, so our users can at any time pull a sample and look at a specific portion of the records a component has processed — a certain range of records, for example. We also have a query interface. The bread and butter of Ascend is processing inputs, performing a computation, and writing it out; the query interface, by contrast, is a more interactive interface that just queries existing components. This is where a user can, through an API or through the UI, essentially type in a SQL query and say, “I wanna query these components using standard SQL, and I wanna sample the records to look at them” — without the intention of writing the results out anywhere. This is just part of the exploration of the components that you have in your system.
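
In spirit, that preview path is just parameterized, read-only sampling — something like this generic helper (not Ascend’s SDK; identifier handling is simplified):

```python
def preview_records(conn, table: str, limit: int = 100, offset: int = 0):
    # Interactive, read-only sampling of a component's output table;
    # nothing is written back. A real implementation would validate and
    # quote the table identifier rather than interpolating it directly.
    cur = conn.cursor()
    cur.execute(f"SELECT * FROM {table} LIMIT {limit} OFFSET {offset}")
    return cur.fetchall()
```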

NT: So broadly, I think that’s the interface. This is just a very coarse-grained slicing of it, and one way to slice it. Another way to look at it is: how do you think about the resources in your compute engine? You have to model resources and capacity, and that is very data plane specific as well. For example, in the case of our Spark offering, we have a certain maximum capacity in terms of cores and memory, and that’s the way we model it, whereas in the case of Snowflake, it’s different. So I hope that roughly answers it. I’m trying to think about whether there were other APIs… I think that covers it. You’re on mute.

SK: I do remember, too, we had a really interesting one where — obviously, for our system to measure how much is being used, as we have a usage-based metering model for our product, we have to measure even how much usage is being pushed down to Snowflake. With Spark clusters, we just monitor how big the Spark cluster is. With other systems, oftentimes they give you feedback, but this is an interesting one where we were puzzled for a little bit, and then I think you found a really creative way of seeing how big the system itself was. Tell us about that.

NT: Yeah, sure. So in the case of Snowflake, the warehouse is the unit — you can think of the warehouse as a cluster for processing, for running your queries. So really, the way you wanna meter customers or determine usage is based on the size of the warehouse. But then Snowflake, depending on load, will also spin clusters up or down. So we actually actively query Snowflake’s API and gather this information: how many running clusters of a particular size do you have? And that feeds into our metering, because based on the size, you know how many nodes there are and what the size of the nodes is, and that gives us a way to bill customers. It’s a fairly generic usage metrics collection system, but yeah, this is part of the functionality.
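
One way such polling could work, assuming Snowflake’s Python connector and the SHOW WAREHOUSES command; the column names and the credits-per-hour figures below are approximations to sanity-check against Snowflake’s own documentation, not a drop-in billing implementation:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Approximate credits-per-hour by warehouse size, per Snowflake's public
# pricing docs; verify against your account/edition before metering off it.
CREDITS_PER_HOUR = {"X-Small": 1, "Small": 2, "Medium": 4, "Large": 8,
                    "X-Large": 16}

def sample_credit_burn(conn) -> float:
    """Estimate the instantaneous credit burn rate across all warehouses."""
    cur = conn.cursor()
    cur.execute("SHOW WAREHOUSES")
    cols = [c[0].lower() for c in cur.description]
    total = 0.0
    for row in cur.fetchall():
        wh = dict(zip(cols, row))
        # Multi-cluster warehouses report how many clusters are currently
        # spun up; the exact column name may vary by Snowflake version.
        clusters = int(wh.get("started_clusters") or 0)
        total += clusters * CREDITS_PER_HOUR.get(wh.get("size"), 0)
    return total
```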

SK: What surprised you the most as we’ve gone through this process? Leslie’s giving us a look.

LD: I’m giving you a look ’cause you stole my question, you know I always ask that question.

SK: Ahhhh. Sorry.

LD: It’s fine, it’s a great question.

SK: You usually ask this to me.

LD: I know. That’s why you stole my question.

SK: Leslie, what has surprised you the most about Snowflake?

LD: How amazingly… Honestly, and I literally am not… Everybody out there listening is gonna be like, “Well, of course she’s gonna say that about her own company,” but no, it’s really honestly how amazingly quickly the team, led by Nandan, was able to get this up and running for customers and get it going. It was… And how much the customers have enjoyed using it and how quickly they’re bringing workloads onto it, which is the high level answer. And I actually wanna hear what Nandan has to say about it, and honestly, Sean, I wanna hear what you have to say about it as well. This was just one of those things where it was like, “Oh, okay, data plane.” [chuckle] “New compute engine. Yeah, got it. Let’s go. Customers can use it. Woohoo!” And it was like, “No big deal. Got it.” And it was like, “Wait, wait, huge deal, very big deal. Massively big deal. That’s a very big deal.” So it’s super cool.

LD: I think that’s true from everybody’s perspective, but again, from a marketing, very non-technical perspective looking at it, it was really cool to see how elegantly it all came together.

SK: I like that word elegantly. Nandan, what surprised you?

NT: Yeah, so I’d say it just surprised me how quickly we could get from a prototype to a product that customers can actually use — and it’s fairly full-featured. In the process, we realized that we could change our development practices a little bit — how we design those services — to consolidate more, and I think that’s made the speed of development faster. It’s also been a bit of a revelation to watch customers actually use it, and to see how fast you can build components. These cloud technologies are fast to run on their own, but the challenge is always: how do you iterate quickly, how do you build a component graph? You need a UI that enables this, you need to get feedback as you’re building these components, you need a developer experience that enables this. So when you combine the speed and responsiveness of some of these warehouses with the responsiveness of our UI, it connects the two well and makes bigger things happen. It’s just interesting seeing people build out use cases as quickly as they do. But to turn it around: what surprised you about this, Sean?

SK: Now you all took all the good ones. Let’s see, I’m gonna have to get really creative here on surprises, ’cause I wholeheartedly give a big thumbs-up to the things that you guys are saying. I think the only piece that would probably be additive is: it’s surprising how quickly things are converging. And when I say things, I mean the notions of ETL vs. ELT, the idea of data engineering and analytics engineering, the notion of a lake architecture and a warehouse architecture. We’ve oftentimes joked that ETL vs. ELT, in reality, is just always ETLTLTL, as it just continues to go on — nothing is ever transformed once and dropped somewhere else; it’s extracted and loaded, and then the transform cycle starts. These are largely semantics these days: when we look at the work that people are doing, it’s all largely the same, whether they choose to do it more Python-centric versus SQL-centric versus Scala-centric, and it can be seen increasingly converging. And when we look at the capabilities and the directions — and we’ve been doing a lot of this with the data plane — we see the lake world looking increasingly like warehouses, and we see the warehouse world, with folks like Snowflake, increasingly looking like Spark-style processing engines with their own Python interfaces.

SK: It’s really starting to look pretty darn similar, which is great, because we see these worlds converging, and I think that helps simplify things for people as they come into the industry: you can be successful on either, so find the one that fits your needs the most. And there continues to be such incredible investment from the underlying data infrastructure providers that we’re gonna continue to see a flurry of really compelling features and capabilities that allow us to tackle more use cases and solve even more interesting problems, regardless of whether you have a lake or a warehouse architecture.

LD: So without giving too much away, what do you think this means for the future of — I hate to call it the future of Ascend, but that’s kind of exactly what it is — the future of the Ascend platform? I’ll be the marketing person for a second: we’ve talked, really awesomely — which is why I was super excited to have the two of you just go to town — in more depth about Snowflake and data planes and what we did. But to put a super fine point on it, what we did was bring the power of awesome data automation to Snowflake for data engineers. So, go us. And what does that mean? How does this impact the future of automation for data engineers? What does this mean we can do in the future? Again, whoever wants to go first?

NT: I think we’re going to see continued innovation at the data infrastructure level. We’ve contended that there are — and I’m sure people will quibble with the exact number — somewhere between four and six monolithic organizations driving tremendous levels of innovation at the data infrastructure level, from the cloud vendors to the stand-alones, stand-alone meaning, obviously, not the three big clouds. We see tremendous innovation happening here, and I think that’s very exciting. I think the ability to bring Ascend’s level of automation to that level of innovation happening at the lower levels of the stack is very compelling, as it helps teams take advantage of those capabilities and make the most of them faster and more easily. And so we are very interested in, and very excited about, the continued innovation happening at that level of the ecosystem.

SK: Yeah, mine is a more engineering take on this, which is: what I’ve seen over time is people moving higher and higher up the stack, becoming less and less concerned with, or interested in, solving the problems of actual orchestration. There was a time when it was cool to know the details of spinning up a Spark cluster, how to run it at scale, how to run jobs at scale, how to keep track of all these things — and these are complex things; you don’t wanna be building all of them in-house. So I think it’s just interesting. You have these big battles — the warehouse versus lake battles going on — and there’s a lot of convergence, and people are trying to settle on a particular storage back-end or a particular technology first, but they’re all vying for each other’s use cases. So I’m personally interested to see how this plays out, but I know that we’ll be fine either way, because [chuckle] I think we’re fairly agnostic. And yeah, it’s an exciting time.

LD: That it is, that it is. Gentlemen, thank you so much for joining us today. Nandan, I think you may have to just get used to the fact that we’re probably gonna have you back on. This was wonderful — you and Sean; we’re probably gonna have you back on with Sheel; we’re just gonna start mixing and matching. You’re gonna have to do this all the time now. This was wonderful, thank you. I appreciate it.

NT:  Thank you.

SK: Awesome.

LD: It really was amazing to get a behind-the-scenes view of Nandan and the rest of the team building Ascend for Snowflake, and even more amazing to see customers start using it and building out use cases so quickly. I just love to see it. If you wanna learn more about the Ascend Data Automation Cloud for Snowflake, you can always visit us at ascend.io or reach out at [email protected]. Welcome to a new era of data engineering.