
Certifying Data Fresh and Organic with Data Lineage

In this episode, Sean and I talk all things data lineage with Ascend solutions architect Jon Saltzman. From its importance at every step of the data journey to how data organizations go about ensuring their data is “certified fresh and organic” or rather, easily traceable to where it’s been and who has touched it, we discuss how data lineage efforts can shape many facets of data workloads.

Episode Transcript

Leslie Denson: The next in our back to the basics series focuses on how you too can ensure your data is certified fresh and organic. Well, as our guest puts it at least. Sean and I are joined today by Jon Saltzman, one of our solutions engineers here at Ascend, to talk about data lineage in this episode of DataAware, a podcast about all things data engineering.

LD: Hey, everybody and welcome back to another episode of the DataAware podcast. I’m back once again with our favorite Sean Knapp. Sean, welcome back again. Just feels like coming home at this point, doesn’t it?

Sean Knapp: It really does. This is one of the highlights of my week, so…

LD: Excellent.

SK: I’m always happy to be back.

LD: I appreciate that. Today, we are gonna have a fun conversation and we are joined by the one, the only, Jon Saltzman, who is on our field team. So welcome, Jon. How’s it going?

Jon Saltzman: Well, hello everybody. I’m doing great as well. Happy to be here, thanks for having me.

LD: Excellent. Well, we are super stoked to talk to you today, and we are super stoked to talk to you today about… And it’s part of our back to the basics series on the podcast. So we’re gonna chat today about data lineage, which is A, fun, B, as I told these guys before we started, something that I actually, of all the things that we’ve chatted about, probably know the least about. So I’m super stoked ’cause I get to learn a lot during this one, just like, hopefully, you guys out there do as well. But I love hearing these guys chat back and forth, so I’m super stoked on this one. But to get us kicked off and started, Jon, why don’t you tell everybody a little bit about yourself since this is your first time on the podcast?

JS: Right, so I’m a solution architect at Ascend. I have a very diverse background. I’ve been a principal data architect, I’ve been a CIO, I’ve been the guy doing desktop support, I’ve done every job that there is in IT pretty much, and a lot of data plumbing. So there’s the less glamorous aspects of what we do and the more glamorous aspects, data lineage being one of the more surgical and interesting things that we do in our industry. So definitely happy to share some of my experiences out there in the field doing data governance and data lineage hands-on, but yeah, that’s a quick overview of my background.

LD: Awesome, awesome, awesome. Well, let’s just dive right in. And for those of us out there who are like me and just don’t know, what is data lineage? As I’ve been saying in a lot of these back to the basics episodes, what is the 101 of data lineage? What is it? If there’s somebody who’s kind of new out there to the data engineering space, what exactly is data lineage?

JS: Well, at least from my perspective, data flows through organizations. Data lineage is all about understanding how data flows through organizations, how it’s sourced, how it’s transformed along the way. It’s a bit like when you’re sourcing vegetables and you talk about the supply chain. That’s a good analogy, I think, for it. I see Sean is smiling over there. So Sean, what do you think about that? My tomato is going through the supply chain…

SK: Yes, totally. That’s funny ’cause I was thinking through… I always love the analogies.

LD: We talk a lot about vegetables on this podcast too. For some reason, we compare data engineering with vegetables a lot on this podcast. I’m unclear, but we like our vegetables here at Ascend. To say a lot about us, we like our veggies.

SK: We do.

LD: Just saying.

SK: It’s funny, as we record this, it’s 10:00 AM, but I gotta be honest, I am starving already, so my brain is already on food. And I was thinking about those analogies, and it’s funny ’cause it is very much that supply chain. I think everybody oftentimes agrees that we should really know and invest in data lineage, and we’re gonna get into, I’m sure, all sorts of the nuances and importance and details of this, but it comes down to knowing what you have, where it came from, how it was processed. So we use this food analogy. Imagine going to a grocery store and buying not just the raw ingredients like the produce and so on, but imagine actually buying boxed goods, refined products, just as we oftentimes take refined data and feed it into, say, a BI tool, for example, and there literally not being any ingredients listed.

SK: You have no idea what was put into this thing. Was it a bunch of sugar or healthy complex carbohydrates? You have no idea what that thing is or where it came from, how it was made. So it’s actually kind of a scary thought, right? Oftentimes, we want to know what we’re putting in our own bodies, but when we think about data lineage, more often than not, a lot of companies and teams don’t know where that data came from or have a rough hunch of where it came from, but they actually don’t know what happened to it to get it to where they are.

JS: Yeah, and I would say more than ever, when you look at a particular piece of data in an organization, like where that data has been, who has touched it, what’s happened to it along the way, there’s the kind of problem that you have the telephone game that happens in organizations. Well, I think that happens with data, the more you copy it, the more you mess with it, the more it deviates from maybe what it originally was. And so, how do you know what happened to it along the way? And if you just get a piece of data and nothing else, you have none of that story of where did it come from, what happened to it along its way, what was its journey that it ended up with? I can definitely say I’ve seen that in organizations myself. I guess we also talk a lot about the silo effect of organizations, kind of an interesting thing about how we throw data from one silo to the next. I think this problem is a well-known problem in organizations, especially ones that are more fragmented too.

LD: I think people probably, or at least I… ‘Cause when I think about data lineage, with the little that I know about it, and the way that you guys are describing it, I would think about it mostly with needing to know about it in regards to like PII data and needing to make sure that only people who should have access to certain data have access to it, and only people who should have seen data have seen that data, HIPAA data, that sort of thing. But I’m guessing it goes a lot deeper than that just to any sorts of data. Like you need to be able to know that this price didn’t change or this whatever didn’t change erroneously, I’m guessing.

JS: Well, so the use case you were just talking about, certainly a large part of data lineage is about auditability, and going back to that supply chain, certified organic data. That’s gonna be a thing.

[laughter]

JS: Somebody’s gonna come up with a little sticker or something. Actually, I jokingly say that, and it’s not completely untrue. I’ve seen in a lot of BI tools, for example, different BI tools will label data sets as this is the validated or the certified data set, and this is… Certainly auditability, security of the data, where it came from is certainly part of it. It’s not just that. The sticker on it is metadata in a sense. It’s like it’s additional descriptive data about that data itself, where it came from, what happened to it, who owns it, what has happened to it along its journey. Some of that data would probably take the form of a log. Here we could probably go off into all kinds of interesting tangents about immutable logs and things like what happens with data that’s being tampered in transit from one point to another, and ensuring that it’s not tampered with, and so there’s a lot of things down that avenue for sure, but I also would just go back to the very beginning.

JS: What people… And we’re talking about data lineage, but you ask what is data? What is data itself? Data itself is an observation of something. Right there, when you observe something and you make data, there’s a whole bunch of information to know. Who took the observation? Where was the observation made? How was the observation made? All this metadata is part of the story of the genesis of that data. I always like asking this question, “Where did your data come from?”, and seeing how people answer. It’s actually a pretty funny one.

SK: Well, I think you touched on a really cool concept, which is oftentimes we tend to think about the localized journey of data as opposed to the globalized journey of data. And part of this goes back to some of the conversations we’ve even had before, where whether you come from an ETL camp or an ELT camp, the reality is it’s ETL, TL, TL, TL, etcetera, or the converse and equivalent of that. But the data goes through many, many, many stages in its life cycle. It originates somewhere, based off of some observations and some measures, and somebody then plucks it and they make some transformations to it, and then they put it somewhere else, and somebody else or some other system picks it up and takes it to the next step of the journey, makes other transformations, and over the life cycle it’s continually refined, aggregated, enriched, etcetera.

SK: But I think the important thing, Jon, that you’re hitting on here, and actually that even goes back to some of the originating part of your question, Leslie, is in a simple world and a localized world, maybe it matters less in the sense that, “Well, I just grabbed it from point X, did a little bit to it, and put it to point Y.” But when we think about the globalized tracing of data, it really is that ability to know where did it come from all the way and how did it actually arrive here? What is everything that happened to it to arrive here? But then even more importantly, I would contend, and this goes back to that PII question as well, is embracing the fact that rarely ever is the data that you are looking at at its final point in its journey and in its life cycle. It’s still going somewhere else.

SK: And so oftentimes when we think of PII and GDPR and CCPA and so on, a lot of the lineage question then becomes not just where did this data come from, but where did it go to? And what happened to it from there? And so it does become… If you think of this life cycle as this whole graph and journey of data, it really does come down to having to track it and really answer both sides of that question from any point in that graph. Where did the data come from? What happened to it? Where’d it go? What happened to it then? And I see Leslie smiling, and so I’m super curious what… She is cracking up and trying so hard not to laugh out loud.

LD: You kept saying where did it come from, where did it go? And all I can think of in my head is it’s the Cotton Eye Joe song.

[laughter]

LD: Literally all I can think of in my head. I’m crying over here trying to not laugh and keep it together, so I didn’t have to say that out loud.

SK: Can that be our intro song?

LD: Maybe, possibly.

SK: Or just when we edit this, can we just have that fade just a little bit into the background while we’re talking and then just fade out and see who notices?

LD: Oh my gosh. So basically what you’re saying instead of lineage is the Cotton Eye Joe data engineering. Okay, got it.

SK: I like that song. It is very upbeat. It is high energy. I feel like it is very appropriate for an Ascend theme song here.

LD: Maybe, possibly. I think there’s other connotations for that song.

SK: Oh no, that’s such a nice response. You’re like, “Oh yeah, we’ll take that under advisement.”

LD: I think there’s other connotations to that song. Unclear, I need to check. Let’s do a little digging on that a little bit more.
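Sean's two questions, where did the data come from and where did it go, can be asked from any point in a lineage graph. The following is a minimal sketch of that idea, not a description of how Ascend or any other product implements it. The `LineageGraph` class and the dataset names are hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph where an edge source -> derived means
    'derived was built from source'. Hypothetical illustration only."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, source, derived):
        self._downstream[source].add(derived)
        self._upstream[derived].add(source)

    def _walk(self, start, neighbors):
        # Breadth-first walk; the 'seen' set also protects against loops.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def upstream(self, dataset):
        """Everything this dataset came from, directly or indirectly."""
        return self._walk(dataset, self._upstream)

    def downstream(self, dataset):
        """Everything this dataset went to, directly or indirectly."""
        return self._walk(dataset, self._downstream)


# Made-up datasets for the example.
graph = LineageGraph()
graph.add_edge("raw_signups", "cleaned_signups")
graph.add_edge("cleaned_signups", "daily_signup_counts")
graph.add_edge("cleaned_signups", "marketing_email_list")
```

For a privacy request in the GDPR or CCPA sense, `graph.downstream("raw_signups")` would list every derived dataset that might still hold the affected records, while `graph.upstream("daily_signup_counts")` answers the "where did it come from" side.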

JS: Well, so now it’s an interesting question talking about the global scale and looking at data from that perspective. I think some would argue that part of the reason why data lineage as a concept is difficult to understand, obviously in Ascend we have a visualization of the data graph, how data flows through the system and understand the lineage of what’s connected and what are those transformations, but what happens outside of that system even for example? And I think one of the reasons data lineage is hard to understand is because when you think of it in this global supply chain of data, it can get hard to see across contexts. Where did this data really come from, especially if you have data that’s coming from outside your organization, flowing through your organization, maybe flowing out of your organization into another.

JS: And so I think that’s one of the reasons why it’s hard to understand, it’s a big concept, and I don’t know that you always will have all the information you need to truly understand the full data lineage of a particular data product. I’ll use the word data product since this is kind of a hot topic now with data mesh and data products and treating data like a product. Again products, we have to track where products come from and maybe what ingredients they have in them and so forth, but will you always know all of it? One of my favorite stories is about, I don’t know if you guys have heard the story about the folks in Colombia that harvest the beans, Cacao beans? So somebody went down and brought them chocolate bars that were made with the cacao beans that they harvested basically in the rain forest in Colombia, and they were shocked. They were like, “What? You mean the beans that I make turn into this?” They had no idea that it even happened. And I think that same effect happens a lot with data.

JS: You make some data, you put it out there into the world. How do you know where it goes and what happened to it? You might not, and you might not know what it turns into. It’s pretty interesting when you look at it from that larger scale. So I think lineage, especially within an organization, is never about knowing every last detail about where it came from and where it went and everything that happened to it, because I don’t know if that’s even possible. But it is about knowing as much as you can, knowing that you’re dealing with good quality data, that this is data that’s been vetted, that we have some confidence in. I’ve had data that we put out there into, let’s say, the ecosystem of the organization, and somehow through a giant loop, it ends up coming back to you in a different form and you’re looking at it and you’re going, “This looks really familiar. Wait a minute.” And it turns out that it is the same data, but it’s just gone through a whole bunch of transformations. Data lineage might help with this problem. What do you think?

SK: I totally agree. I think you touch on probably another really interesting sub-topic of this, which is where people start with data lineage. I think companies oftentimes go through various stages of this journey, and oftentimes it doesn’t even start with the lineage part, but with the cataloging part, the question of, “What do we have?”, which usually starts with something that may be as simple as listing your S3 bucket and seeing what pops up or what sits there, and then starts to evolve into more formal cataloging and so on.

SK: And I’ve certainly seen this with a lot of companies that we see as they go through their journey and are maturing and investing more in lineage and automation. Oftentimes you watch this journey: to see what data they have access to, they’re literally listing a bucket or maybe talking to a Hive-compatible catalog and so on. Well, what happens then is that second stage goes from not just what do we have, but answers that second question of, “Where did it come from? What other data was derived from it?” And oftentimes we start to see people solve for that second question: this data came from this other data, with maybe some more qualitative metadata around it. And even with the sub-part of, “And here’s the other data that was derived from it,” to create more of a dependency graph.

SK: But I think the third part that becomes incredibly important that you highlight, Jon, is we also want to know not just where data came from or where it went to, but what specifically was done to it. And as you mentioned, “Gosh, this data looks really darn familiar, like not exactly the same, but really interestingly familiar.” You would see that from that second stage of investment. Where did my data come from? Where did it go? But oftentimes, then you wanna actually know what was done to this data, what produced it. Were certain portions of that data filtered out for quality reasons or for other heuristics and logical reasons? Was it enriched in certain ways, was it time shifted because of something or other, or was the geography updated because we had to find a higher-fidelity geo lookup that was an offline process? All sorts of really interesting use cases.

SK: And so the reason why I think this is critically important as we start to think about lineage is, lineage is more than just data dependencies. It is data and code dependencies bound together, and being able to track very concretely what specifically happened to this data to produce that next derived data product. I think that’s critical.
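One way to picture lineage as data and code bound together is a record, written each time a step runs, that stores a fingerprint of the input data next to a fingerprint of the code that transformed it. This is a rough sketch under stated assumptions, not any particular product's format: `run_step`, `fingerprint`, the record fields, and the sample rows are all hypothetical, and the code is represented by a plain string standing in for the real transform source or a commit hash.

```python
import hashlib
import json


def fingerprint(text):
    """Short, stable identifier for a piece of text (data or code)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def run_step(name, code_text, transform, rows, input_name):
    """Run a transform and return its output plus a lineage record
    tying together the input data, the code, and the result."""
    output = transform(rows)
    record = {
        "output": name,
        "input": input_name,
        "input_fingerprint": fingerprint(json.dumps(rows, sort_keys=True)),
        "code_fingerprint": fingerprint(code_text),
        "rows_in": len(rows),
        "rows_out": len(output),
    }
    return output, record


# Hypothetical example: filter out invalid orders for quality reasons.
FILTER_CODE = "SELECT * FROM raw_orders WHERE amount > 0"
rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": -5},
    {"id": 3, "amount": 7},
]
valid, record = run_step(
    "valid_orders",
    FILTER_CODE,
    lambda rs: [r for r in rs if r["amount"] > 0],
    rows,
    "raw_orders",
)
```

With a record like this, "what was done to this data" has a concrete answer: the row counts show that something was filtered out, and a changed `code_fingerprint` between two runs shows that the logic itself changed, not just the input.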

LD: Here comes the part of the podcast that I’m sure everybody listening just loves, where a connection forms in my brain, and I go, “Oh yeah.” So this is like when in my marketing automation system, I screwed up a couple of weeks ago, and I changed a touch point for an entire list of people, where I meant to add a touch point to everybody and instead just changed the touch point for how everybody had interacted with us. And I went to go just double check on somebody and I realized, “This doesn’t seem right, ’cause I know that they’ve interacted with more than one piece of content that we have.” And so then I could click into the details and see the history: on this date, they interacted with this piece of content, and so it added this piece, on this day it did this, and these were all done by workflows. But then today, this idiot Leslie cleared all of these and added this touch point. And I’m like, “Oh, wow. That idiot Leslie should get fired.”

SK: You better talk to that Leslie girl. What is she doing in there?

LD: Yeah, that Leslie girl is a dummy and should never be allowed to touch the marketing automation system. And so then I had to go back through it and redo everything, which it took some time, it was fine. But it’s kind of the flip side of what you’re saying, which is it allowed me to say, this data looks familiar, but not familiar enough. It looks familiar as in I know I’ve seen this data before, but something is wrong, and I know something has been changed, so let me go look at the lineage of it, let me look at the history of it and see what’s changed. “Oh, it’s that idiot Leslie, let me go fire her super fast and then fix what she did to get it back to the state, at which I needed to do proper analysis.”

SK: I think you highlight something super cool about lineage, which is, we have this principle at Ascend in general, which is to make change cheap so that you can move fast and be nimble and agile. And in many ways, that’s what we’re doing a lot on the marketing side with our investments in marketing automation, and that’s the cool part about lineage. Oftentimes when you’re moving fast, we also get to log things. You get to go back and say, “See, oh, here’s the record of what happened, here’s where it came from.” And as a result, I have such higher fidelity around what’s going on, that that Leslie woman who’s…

LD: Such an idiot.

SK: Trying super hard and going really fast, it’s like, “Ah ha! Boom Boom.” And I think it’s funny ’cause in many ways, lineage, when we tie it with some of the other observability pieces, is not too dissimilar from what we see in a lot of the DevOps domain for software engineering: logging, monitoring, alerting. It gives us a lot of that confidence in our systems and our data, knowing that things happen over time, but that the systems detect them, that we have the ability to see where it came from, what happened, and rectify issues fairly quickly. And so I do think lineage, and the broader observability category, are of absolutely critical importance to data engineering as we continue to advance data engineering into a much higher velocity profession.

JS: Yeah, I would even add that if we look at some of the projects that people have been doing, I wouldn’t say it’s a totally new idea, but it’s been newer. Things like git applied to data, so that if you have a data set, you check it into a repository, and every change that’s made to it is then version controlled, and you can see all that. Exactly, that’s a good example of the software engineering tools and the data engineering tools combining together to produce this stream of observable, auditable changes basically. Absolutely.

LD: Kind of taking it the next step, let’s say a company doesn’t have a lineage solution, platform, process, insert word here in place. What should they do?

SK: I would just say, “Call us in, use Ascend.”

[laughter]

LD: This is an Ascend.io podcast, but other than that…

SK: Yeah, of course. Jon, you were about to say something that I think was far more productive and serious, so I’ll defer to you.

JS: No, just call us at Ascend, that’s what I was gonna say. No, that’s what you said.

[laughter]

JS: I think data lineage falls under the data governance and data management practice, and this is historically a notoriously difficult thing to get started in an organization, especially if you have nothing in place and you’re starting out at level one, where things are a little bit more chaotic and uncontrolled. The best advice that I have there from my own experiences is, “Look… ” I come back to the organic label, but, “Start organically, look for problems that are gonna help your organization.” A good example of it that I personally ran into: let’s say you had a data set about companies, a data set of company data. That data set started, maybe, in your finance department, and then you share that same data set with your marketing department and you share that same data set with your sales team. And each one of those different domains of users is gonna have a slightly different use for that customer data. They may change it, they may make it suit their needs.

JS: And let’s say that all this data comes from a single source, we’re just making a hypothetical situation. Suppose you knew the lineage of the data: you knew that a copy of this customer data went to your finance department, and you knew that same copy of data from finance was being used by marketing. In other words, the marketing group is not getting it from the source, but they’re getting a copy of the data from finance. So you knew the lineage of what this data had flowed through. In that situation, you might ask yourself, “Now, which copy of the data should I use for my project?” And this can create a lot of confusion, a lot of costs in organizations, because maybe you start with a certain data set, it’s the only one you have access to, maybe you’re in sales and you use marketing’s copy of customer, because that’s all you can get. This can lead to inconsistencies, it can lead to inefficiency, maybe some of the data you need is not there, and you start asking why.

JS: These are the kinds of questions which lineage starts to help with. You go, “Oh, where did this data come from? Oh, the copy that’s in marketing came from finance, and the copy that’s in finance came from the original source. Maybe I should try to get the data from the original source rather than getting this one that came through three different departments in my organization.” So those are tangible wins that you can have in an organization, when you start having this visibility into where did the data start, where did it come from, and how was it used in these intermediate steps along the way? That certainly could be an entry point into this. But again, if you start too big and you start trying to get the lineage for everything, maybe that’s not gonna work. Start with something where you have a problem that you can solve with this.
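Jon's example of a copy that passed through several departments can be sketched as a simple walk back to the original source. This is an illustrative sketch only: the `copied_from` mapping, the dataset names, and the `crm.customers` source are made up for the example, and real lineage metadata would come from a catalog or platform, not a hand-written dictionary.

```python
# Hypothetical record of where each copy of the customer data was taken from.
copied_from = {
    "sales.customers": "marketing.customers",
    "marketing.customers": "finance.customers",
    "finance.customers": "crm.customers",
    "crm.customers": None,  # the original source
}


def trace_to_source(dataset, copied_from):
    """Follow the 'copied from' pointers back to the original source.
    Returns the full path, starting at the given dataset."""
    path = [dataset]
    while copied_from.get(path[-1]) is not None:
        parent = copied_from[path[-1]]
        if parent in path:
            # Data that loops back to where it started, as Jon described.
            raise ValueError(f"cycle in lineage at {parent}")
        path.append(parent)
    return path


path = trace_to_source("sales.customers", copied_from)
hops = len(path) - 1
```

Here `path[-1]` is the original source and `hops` counts how many copies sit between it and the data set in hand, which is the number that tells the sales team it may be worth asking for access to the source instead.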

SK: Yeah, I would really agree with you on this one, Jon. I think we see this and we’d probably say very similar things about whether it’s implementing advanced orchestration or data automation or lineage, but you’ll probably see a lot of the, I’d say, seasoned folks, those with a lot of battle scars around the data ecosystem, provide similar guidance, which is… I’m 110% with you on, “Don’t try and boil the ocean, don’t try and create the solution that will fit for everything,” because it will probably fit maybe like 20% for everything, but you’re gonna miss the mark on a ton of things and it’s gonna take you two, three years to deliver it for that company. You’ll have missed a ton of opportunity in the meantime. And so I think…

SK: We tend to see the most successful data teams out there are the ones that do exactly what Jon is describing here, which is figure out what the core goals are or find a handful of very specific use cases, even find a couple that are diverse enough that will stretch some of the early use cases. But go after specific problems. Log tons of stuff, collect tons of data. And one of the things that you all will hear from me a lot too is what one of our investors calls AMDs, Arbitrary Mandatory Deadlines. They’re incredible at driving focus for how do we get to a checkpoint where there’s a bunch of value for this sooner rather than later. Because I gotta be honest, oftentimes we see teams go and spend a ton of time, quarters, if not years, building the next gen X, Y or Z, but that usually takes so long that the world has already passed them by, by the time that thing starts to see the light of day.

SK: And the teams that we see are so successful are the ones that in a matter of weeks and maybe months at most, put these incredible wins up on the board because they can move so fast, and they’re the ones that get more opportunity and more resources because they’re driving a lot of value quickly. And so when we think about lineage, go for fast wins, and you can create really great organizational incentives for people to participate in that lineage construct by sharing their data, how it was generated, what code operated on it, who’s accessing it. And if you can create the right organizational construct to not mandate but incentivize teams to participate in that, I would contend you will see greatly outsized rates of success.

LD: So one of the things that we talk about pretty frequently on here is how fast data technologies innovate and change. Is it the same on the lineage side, and maybe it’s just because it’s not the most talked about piece of the data technology stack that I don’t hear as much about it. But has there been a lot of movement on innovation on that in recent years, or has it stayed pretty steady? What has that innovation looked like?

JS: You know, here I would say data architecture in general, or data governance, has not necessarily been seen as a glamorous, exciting, the coolest cutting edge thing, although obviously it has great importance and value in organizations. One of the best pieces of advice I ever heard, by the way, on that topic was, if you’re going to do a data governance initiative, whatever you do, don’t call it a data governance initiative, because that’ll just shut people off right away, right there. Do it. And do it right. But do it quietly. Do it organically. And then later on when everybody goes, “Wow, things have gotten better,” you go, “Yeah, that’s because we have a data governance program,” so anyway.

SK: You know, I would say when we think about the pace of innovation, I don’t think that there has been a tremendous acceleration in the pace of innovation in data lineage, and I think the reasons are a couple fold. I think it starts first and foremost with, data lineage is hard. I think there are far faster-moving and better products out there that have emerged in data cataloging, ’cause that’s a step in that journey of data lineage, and I think there are great products out there for that, and I think that they continue to progress quite nicely. I think the reason why data lineage has yet to see what I would categorize as even a modestly exciting pace of innovation is because it is the intersection of both storage and processing, or cataloging and ETL-ing, if you will. It is the linkage between two very distinct domains with different technologies, different products, different vendors, which makes it hard.

SK: As we established in sort of the three stages of advancement in data lineage, you start with cataloging and then you move to dependency chains, but really what matters is data dependency and code dependency tied together. That is really where the most interesting things happen. And I think because of that, we tend to see most investments in lineage still happening in bespoke platforms. There’s some really interesting AI and ML style of approaches to lineage, I think, that have come into the market. I’m very much a pragmatist engineer at heart, so I look at those and I think that’s a hard way to solve for the problem, but I think they’ll do some really interesting things, and I think you can solve for significant portions probably in that way.

SK: I think where we are going to see tremendous acceleration in innovation in the data lineage space is going to be tied to the category of data automation. And I think that the reason that I put it there, and not even just data orchestration, but specifically data automation is the dependency and the requirements to innovate in data automation, which is orchestration with significant amounts of system level intelligence that is fueled by metadata, the same metadata that powers lineage needs to power data automation. I think that’s where we’re going to see the residual effect of innovation on the lineage side, because we’re solving for increasingly complex problems in the data automation space.

JS: I would also say that I think it’s not necessarily a new problem, but the challenge of how to do this is also not solved completely either. If you look at the old school data flow diagrams, the old… One of the first things you learn when you study structured modeling is data flow diagrams. This was an attempt to show inputs, outputs and processing, but it was a document usually made by a human or potentially made by a computer. And this was an early attempt at understanding data flow across organizations in context, to give people a map so that they could understand where their data came from, where it lived and what happened to it along the way. And I think there’s a concept of active data lineage versus passive data lineage, things that are the by-product of data flowing through systems that can be recorded in logs and then looked at later to understand what happened to it versus actively at runtime linking everything together. I would say Ascend is more on the active side, the way we link code, data and partitions of data and so forth.

JS: But anyways, if you look at the technology, this is still an area that, as with everything in data, is still growing and changing, and there’s a lot of opportunities. I think the hard part about this, and I’ll go back to the data product discussion for just a minute, is how do you make something timeless, because technology does change quickly, but this need doesn’t go away. It’s something that’s always there. I think some of the ideas that folks have are around the portability of the metadata that contains that lineage information, so that if you have a “data product,” somehow additional metadata goes along with it to help you understand it. This is probably an area where things are gonna continue to develop, or are even developing now with different architectures, because if you think about it, we don’t really do that today. When somebody sends you a data file, maybe you get a specification, maybe you get the actual data itself, but do you get the history of where it came from in the background and all this other information about it? Probably not.

JS: I still think this is an area that’s probably still developing and changing, but obviously ever since the advent of Big Data, especially, we need this technology more than ever. We’ve got oceans of data, and we’re like, “But where did the stuff in the ocean come from?”
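The portable-metadata idea mentioned above can be sketched as a sidecar file that travels with the data file. This is only an illustration under assumed conventions: the `.lineage.json` suffix, the field names, and the `write_with_lineage` and `verify_lineage` helpers are hypothetical, not an existing standard or an Ascend feature.

```python
import hashlib
import json
from pathlib import Path


def write_with_lineage(path, payload: bytes, source: str, steps: list):
    """Write a data file plus a JSON sidecar describing where it came from."""
    path = Path(path)
    path.write_bytes(payload)
    sidecar = {
        "file": path.name,
        # Checksum ties the lineage record to these exact bytes.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "source": source,   # where the data originated
        "steps": steps,     # what happened to it along the way
    }
    sidecar_path = path.with_name(path.name + ".lineage.json")
    sidecar_path.write_text(json.dumps(sidecar, indent=2))
    return sidecar_path


def verify_lineage(path):
    """Return the lineage record if it still matches the file, else None."""
    path = Path(path)
    sidecar_path = path.with_name(path.name + ".lineage.json")
    sidecar = json.loads(sidecar_path.read_text())
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return sidecar if actual == sidecar["sha256"] else None
```

A recipient who gets both files can call `verify_lineage("orders.csv")` to read the history and confirm it describes the bytes they actually received; if the file was changed after the sidecar was written, the check returns `None`.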

LD: So to kind of wrap it up, if there is one tidbit of information or one thing that you would want people to know about lineage or that you think is a misconception, or that you just wish was out there a bit more, ’cause again, I think it is… I don’t wanna call it the unsung hero, I don’t think that’s what it is, but I do think it’s the lesser… One of the lesser talked about parts of the data engineering workload. So if there’s the one thing that you are like, “I wish people paid more attention to this, or I wish people realized this more, about lineage,” what would that be? What is your last piece of wisdom that you wish to impart? Make it good.

JS: Well, I’ll go first, I guess. Mine would be that there’s a misconception that this should be easy, and it is not easy. And I think with that misconception goes, “Well, if it’s not easy or if it’s gonna slow us down or it’s gonna take time and resources and other things away from other things, then we shouldn’t do it because we don’t have time or we don’t have resources to do it.” I think that if you have that misconception, it won’t actually cause you a problem right away, but it’ll cause you a problem down the road when your organization grows and questions start coming up about the validity of data and why does this report show this and why does that report or this dashboard show something different. And it’s not a problem you see right away, but it’s one you see down the road. And so it’s not worth necessarily taking this problem head on if you’re…

JS: Especially if you’re a smaller organization and you don’t have the time and resources to do this. Be aware of it, keep it in your mindset, find tools that support you to have best practices in place as you’re growing and as you’re changing, and keep this in the back of your mind. Because some day if your organization gets to a point where you have a lot of data, and even if you’re a small organization, you could have a lot of data, you’re gonna wanna know where it came from and what happened to it along the way, so… It isn’t easy. Don’t expect it to be easy. It’s not impossible, the tools are growing and maturing. Ascend is proof of this, that the tools help you more with this now than ever before, but just because it might be hard, don’t ignore it either. That would be my thought. Sean, how about you?

SK: Yeah, I love where you’re going with this. And the second thing that I would add to that, specifically on the point that data lineage is hard: be sobered by that fact. Because it’s so hard, if you’re just trying to dip your toe into it, oftentimes, if it’s not done right, you can do more damage than good because you’re seeding false confidence, confidence in your data, confidence in your systems. And so I think it’s very important to invest and make sure you’re doing it well. If you’re going to do it, do it well. I think a key component that I would suggest, to try and tease out something additive to what you’re suggesting, Jon, is this: a lot of organizations first approach lineage under the umbrella of governance. And it is far more stick than it is carrot. And what we find with really fast moving teams is they need carrot. Because as we found in our last annual survey, 96% of data teams are at or over capacity. And so find ways for lineage to help make people’s lives easier.

SK: If you can make change easier and cheaper so that users can adopt these new technologies, so they can build faster, so they can have more confidence that the things they’re building are valuable and accurate and being used, those data products as you described them, Jon, if you can align those organizational benefits and incentives with a data lineage strategy, I think that’s how you get the ultimate return on that investment.

LD: Gentlemen, I feel a lot smarter after the end of this. So I hope everybody listening to us does as well. Thank you both so much. I appreciate it. And Jon, you did fantastic. So you’ll be back. I keep saying this to everybody as they come on the podcast. It’s not like I think anybody’s gonna do a terrible job. But we got a bunch of really smart people at Ascend. And so I’m just gonna start bringing everybody back. Maybe at some point I’ll let Sean stop coming back, but not now, not anytime soon. So, thank you guys both very much.

JS: Thank you too, Leslie, it was an honor.

LD: It’s sort of incredible to think about all the facets that go into understanding full data lineage and just why they’re all so important. Big thank you to Jon for joining us to chat through all of it. And as usual, if you have any comments, questions, et cetera, you can always find us at Ascend.io. Welcome to a new era of data engineering.