Operationalizing Machine Learning Pipelines [Podcast]

In this episode, Sheel and I sat down with Shawn Azman, Head of Operations at Vidora, to chat about how organizations are moving forward with building out machine learning pipelines, overcoming common pitfalls, and overall, successfully operationalizing their machine learning practices.

And if you’re interested in learning a bit more about Vidora or how Vidora and Ascend can work together, check out this recent blog post on “How To: Operationalize Machine Learning for Marketing and Product Teams Using Ascend and Vidora.”

Transcript

Leslie Denson: So you have your data, you may even have data pipelines, you may even have ML pipelines, but now what the heck do you do with them? That’s where this next podcast comes in. Sheel Choksi and I had the chance to chat with Shawn Azman, who is Head of Operations at Vidora, about how companies these days can operationalize their machine learning pipelines, which is growing more and more important as companies have figured out how to get their data, and the actual scaling of the infrastructure isn’t the problem; it’s more about scaling the teams and scaling how they are actually able to use the data. So learn more about operationalizing ML in this episode of DataAware, a podcast about all things data engineering.

LD: Hey everybody, and welcome back to our next episode of the DataAware Podcast. I am joined today with Ascend’s Sheel Choksi. Sheel, welcome back for what you’ve just told me was your second trip down the podcast lane. So I’m glad we didn’t scare you off too bad the first time.

Sheel Choksi: No, not too bad. Thanks for having me.

LD: Yeah, of course. Sheel and I are actually joined today with Shawn Azman, Head of Operations at Vidora, and we are gonna have a really, really, really awesome chat about something that is near and dear both to our hearts. And then also, if you guys have heard of Vidora, and if you haven’t, we’re gonna learn a lot more about it today but also their hearts, which is around how companies can and how they should operationalize AI, ML, and really what they should be looking out for in that realm. So hey Shawn, how’s it going? Welcome to the podcast.

Shawn Azman: Leslie, Sheel, it’s great to be here with you, thanks for inviting me.

LD: Yeah, of course. We’re glad that we’re able to finally get you on the podcast. For listeners out there, we have some joint customers and we’ve worked together in the past, and so we have a lot of respect and a lot of appreciation for what Vidora is doing in the AI and ML space, and so super excited to dive into that with you.

SA: Absolutely.

LD: Yeah. So on that note, why don’t you talk to us a little bit about yourself, your history, and also Vidora and give just kind of the listeners just kind of a quick overview and background of you and Vidora and what you guys do?

SA: Yeah, absolutely, I’d be happy to. And again, thank you for inviting me on. And so as you mentioned, my name is Shawn Azman. I work in Operations here at Vidora. And I’ve spent the better part of the last decade working in the data and analytics space, so really working with companies to help understand what’s the best data to be capturing, but more important than capturing data is how do we leverage that data? And it’s really that path that led me to where I am today at Vidora. And just a quick background on Vidora, we focus on operationalizing machine learning. And what that means to us is that we specifically focus on consumer data. Because so many companies capture so much data, we help them take that data and automatically transition it into machine learning predictions, and we do this in an automated and continuous way so that customers can always rest assured that they have the best data possible to make informed decisions about how to interact with their users, be it within the product, through marketing channels, or through advertising channels. And so it’s really about creating an operationalized machine learning pipeline that can continuously learn over time.

LD: It’s obviously something that I think people are seeing more and more. We actually just put out a new survey, and one of the findings that we found really interesting was that only about 21% of the respondents, and it was across data engineers, data architects, analysts and data scientists, only about 21% of the people found that scaling their infrastructure was their biggest issue with creating and scaling their data projects these days. Now it is the massive number of data projects that they have to get done. It’s funny and it’s interesting, because for so long it was like, “How do you get access to data?” then it became like, “Okay, so how do I make my infrastructure work with that data?” and to your point, and with what Vidora does, now it’s like, “Oh, how do I use that data? How do I actually make this data something that our company can interact with and use and get more value from?” And that’s just an interesting transition to me, at least.

SA: Yeah, no, absolutely. We hear this all the time. We often interact with customers that are sort of going through that transition. As you mentioned, most of the companies that we work with already have existing data. In fact, I would say most companies have more data than they know what to do with. And so the question that we get most of the time is, “How do I become more efficient at using the data that I already have?” And so we’re often working with customers around identifying, “What data do you have and how is it best to then extend the value of that data using things like machine learning?” And so it becomes really important that we sort of have that foundation of data which most companies have, and then it’s all about testing, validation, and deployment. So it all starts with, “Can we even learn using machine learning from the data that we have?” A lot of companies that are just starting machine learning honestly don’t know the answer to that question. And so we spend a lot of time early on with customers just simply setting the foundation of, “Do we have enough historic data to make accurate predictions about the future?” and then fingers crossed, assuming that we can make accurate predictions, then it becomes, “Okay, so if I can do this once, how do I do this ongoing every day, every week, ongoing in case any new data comes in, new customers come in, how do we always adapt and learn?”

SA: So you’re absolutely right. I feel like we’ve sort of graduated from, “Should I be capturing data?” ’cause most companies have it, to, “Okay, I already have the data, how can I be using it better?”

SC: What you’ve mentioned just reminded me of something where, as more practices have been established on collecting data and doing analytical value and things like this, most folks have a good understanding as to how to lay out your schemas, how to use a proper BI tool and get practice around reporting so that the business has visibility. As we move that over to ML, it’s a new frontier for a lot of people. A lot of folks are maybe trying to start really basic around, “Maybe we need some Python scripts that run a small algorithm,” to the really complex of, “Maybe we need a full ML ops platform, so that we can do drift detection on our model that’s deployed in real-time production.” How do you think about all of that in the varying degrees? And how do companies really start establishing that foothold and growing in this space?

SA: Yeah, this is a huge question, and to be honest, I actually have experience with this outside of just working at Vidora. So in some of our previous companies, we’ve actually gone down this exact path where you hire out a few, maybe data scientists or sort of PhDs within data, and sort of give them over the responsibility of exploring the data, creating some initial models, and again, doing that testing and validation. And so often that’s where we’ll see customers start. However, there are a few shortcomings when it comes to this approach. Typically, what you’ll see when customers take this approach is, Sheel, let’s say you’re the sort of data scientist, what I could do for you is, here’s 5% of my historic data that’s a static snapshot of what has happened in the past, and I can give that to you in a cloud, and so you can start working on this locally. And so then what you’re gonna do is, like you said, I’m maybe gonna create some initial models, I’m gonna test out a few different algorithms, I’m going to see what the predictions are and see if they can hold true. And oftentimes, we can create something that’s pretty legitimate from that snapshot of data.

SA: However, that can often be the easy part. What then is the hard part is, okay, if I’m only using this on a small percentage of historic data, what happens when I scale it up to 100%? What happens when I get new data that’s coming into the schema that the model hasn’t seen before? And what happens when I wanna create new predictions that are slightly different than the ones that I’ve done in the past? And so while that’s often a good way to sort of initially say, “How can I sort of make sure that I have the right data to do machine learning?” it’s often an even bigger step to then take that into production. And that’s where a company like Vidora comes into play, because we essentially think of those two parts of the equation as just one platform. So we have a concept of a machine learning pipeline which essentially says, “I wanna ask a question on my data. Which customers are most likely to churn? Which of my subscribers are most likely to upgrade to a new package?” And we can do that testing and validation using all of the historic data, so that when we’re ready to deploy and operationalize sort of ongoing, it’s really just taking that model that’s already built, scale’s built in, and we’re just sort of re-training and re-deploying that model over time.

SA: And so that’s one thing to definitely think about. It’s not only enough just to say, “Can I create the model on a small set of data?” Scale and continuous learning is a really big aspect to that and it’s something that should be thought of ahead of time, ’cause I think it’s often underestimated how big of a step it is to go from that sort of local small train model to one that’s deployed and ongoing.

SC: Yeah, that’s fascinating. And it’s funny that you mentioned it that way because I’ve always thought of it as such a big step. So I’ve always been thinking about, “Well, how do we get more folks empowered?” and to your point, there’s definitely folks on the other side of that equation who are like over-simplifying what that might look like in order to make that step. And it sounds like in your recommendation there that helping people with the platform and technology really helps enable folks. And to Leslie’s point earlier, one of the trends we’ve been seeing in data engineering, which I’m curious if it applies, is as there’s been more projects, as there’s been more to do, the goal has really been to enable more people to do it. And so outside of data engineers who are handling and transforming a lot of that data, you might see data analysts pitching in and resolving some of their own pipelines and what that needs to look like, same thing with data scientists. Is that a trend that you’re seeing? Does that apply?

SA: Absolutely. Yeah, and I think there’s a lot to what you just mentioned. So maybe I can go through a few different things that we see with our customers. And I don’t want to belittle how hard it is to create those initial models, ’cause honestly, there’s a lot of expertise that needs to go in. One of the things with machine learning is you often need to format your data in a specific way for machine learning to actually be able to learn from it. It’s not enough just to say, “Oh, here’s my database for my analytics provider, just go learn off that.” There usually has to be some kind of a transformation that takes place, and that requires machine learning understanding. And so that’s where the data scientists really shine because they can help to transition that data into the right feature set so that the machine can actually do some learning. And so I don’t wanna say that that’s easy, per se. I don’t wanna do that. But what I was meaning before is that that skill set of being really good at understanding algorithms and machine learning models is often a different skill set than the sort of, call it operational engineers who are going to take that and scale it out to a model that’s gonna be 100X the size that’s gonna run every day over the next few years. So because these are often two different skill sets, it’s really important how you set up those teams in order to have both.

SA: And so that’s a really big thing. But to your point on giving more resources and giving more access to more people in the organization, I really do think if we’re sort of pointing ourselves into the future, let’s say I’m now becoming a prediction model, I think that’s where the industry is going to go. I think in the past, this really was the realm of people who knew how to actually dive into the code, build their own models, create their own predictions, but what you naturally run into is simply a resource constraint. Either you’re not gonna have enough budget in order to scale it up to be as big as it needs to be, or more likely than not, you’re not gonna have enough employees as internal resources to build all the models that you need. So often what we see is, and this is not uncommon for any part of an organization, but it’s a priority stack rank. You’re gonna get the really important ones, the ones that move the bottom line or the top line, those are gonna rise to the top, and then everything else is gonna fall to the bottom.

SA: And really there’s gonna be that threshold where you might be on a team that doesn’t have a use case that’s important enough to rise to that threshold, and in that case, you’re sort of out of luck as to, “Okay, so what do I do? I obviously need this. This is important to my role, but maybe it doesn’t reach the top five for our organization to really prioritize it.” And these are oftentimes where we can sort of step into that equation, because what we’re trying to do is obfuscate all of the technical underpinnings of machine learning models and really turn it into more of a business question. As I mentioned before, most companies know the types of predictions that they want, and if we’re dealing with customers, I wanna know who’s going to get acquired, who’s gonna register, I wanna know who’s going to convert, and I probably wanna know who’s gonna churn. Now, there’s a myriad of other predictions that you want, but these are extremely common; any time you have customers, you’re gonna be worried about these things. And so our question is, can we obfuscate and automate that enough that the person on the business team who’s going to actually be using those predictions can go ahead and create that model for themselves?

SA: And the answer is yes, but they need to be convinced that it’s working. And as you mentioned, this is where we do a lot of our support with our customers, ’cause let’s say that somebody were doing a churn prediction, and all that we were to give them is a high, medium or low score, and nothing else. I think the first question someone is gonna ask is, “Well, why is this user in this bucket and not the other bucket? What is the behavior that they’re exhibiting that sort of dictates in which bucket they are?” And so when we get a first customer, that’s why I always say testing and validation is always the first step before we should even worry about deploying and operationalizing, because if I don’t have confidence that this prediction is accurately describing who my customers are and their behaviors, I’m not really gonna have confidence to say, “I wanna run my product on top of this, I wanna run my marketing on top of this.” And so really important things to look at are, what are the drivers of the predictions that are saying if a user is really high value or really high likely to churn or really low? And I think it’s a really interesting thing, ’cause if you look at this from how would you do this without machine learning, typically what you would do is an analytics-driven approach where you would use sort of data you know to segment out your customers.

SA: So who’s gonna churn? Well, it’s probably gonna be users who haven’t shown up a lot, maybe they haven’t purchased. These are the common things that you would look for. And so when creating your machine learning model, it’s always the first thing we do with our customers, what would you assume would be the behaviors that most exhibit this kind of behavior, and then we look at the model and we say, “Are those the behaviors that are sort of rising to the top that are the most predictive of that customer behavior?” And that’s a really nice first step to say, “Okay, wow, this machine learning model is actually picking up something that I knew inherently, but it did it just from the data.” And then there’s one other important aspect that I’ll point out here which is, that’s how we would do it as humans, the machine is gonna take it much further.

SA: Even though we may choose three to five data points to use for our segmentation, the model has access to all of the data points. And so what’s really interesting is not only to see which are the most important, but what’s interesting to see is how each of the other data points also adds value to the overall prediction, and these are the kinds of incremental values that you get from phrasing these questions as predictions rather than historic segmentation, because maybe you wouldn’t have considered how often are you opening up your marketing emails, how often are you contacting your support team? These are things that may add some incremental value to your prediction that we would just never think to put in our historic segmentation, and this is why we can start to get sort of incremental ROI from creating predictions on top of our customers because it’s just learning from a much wider set of data than we might consider if we were doing it ourselves.
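
Editor’s note: the validation step Shawn describes (checking whether the behaviors you would expect to drive churn are the ones that rise to the top) can be illustrated with a minimal sketch. This is not Vidora’s method. It swaps in a plain churn-rate comparison for a real model’s feature importances, and the function name `rank_drivers`, the feature names, and the six sample users are all hypothetical.

```python
from typing import Dict, List, Tuple


def rank_drivers(rows: List[Dict[str, bool]], label: str) -> List[Tuple[str, float]]:
    """Rank boolean features by the gap in label rate when the feature is on vs. off.

    A stand-in for model feature importances, used only for illustration.
    """
    features = [name for name in rows[0] if name != label]
    ranked = []
    for feature in features:
        on = [row[label] for row in rows if row[feature]]
        off = [row[label] for row in rows if not row[feature]]
        if not on or not off:
            continue  # the feature never varies, so it cannot separate users
        gap = sum(on) / len(on) - sum(off) / len(off)
        ranked.append((feature, round(gap, 3)))
    # Largest absolute gap first: these are the strongest candidate drivers.
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical sample data: six users with three behaviors and a churn label.
users = [
    {"inactive_30d": True,  "opened_email": False, "contacted_support": True,  "churned": True},
    {"inactive_30d": True,  "opened_email": False, "contacted_support": False, "churned": True},
    {"inactive_30d": True,  "opened_email": True,  "contacted_support": False, "churned": False},
    {"inactive_30d": False, "opened_email": True,  "contacted_support": False, "churned": False},
    {"inactive_30d": False, "opened_email": True,  "contacted_support": True,  "churned": False},
    {"inactive_30d": False, "opened_email": False, "contacted_support": False, "churned": False},
]

drivers = rank_drivers(users, "churned")
for name, gap in drivers:
    print(f"{name}: {gap:+.3f}")
```

In this toy data the expected drivers (inactivity, not opening emails) rank highest, while support contact adds a smaller signal, which is the kind of incremental data point a manual segmentation might leave out.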

SC: Yeah, absolutely, that makes tons of sense to me. And to further on that point, maybe you didn’t think of being able to analyze, or maybe just didn’t have the bandwidth to, and past analytic teams and experiences there, you prioritize, and, “Alright, look, we have time to slice and dice maybe these six, but there’s probably more to be found, but who’s got the time to do it? And it doesn’t seem likely enough to be worth exploring, even though there could be a negative information there in predictability that comes out of that feature.” So that makes a lot of sense to me and, well, really cool to see the pattern of automating what was once manual. That’s always an exciting thing. One of the things that you mentioned while you were describing that was as you build these platforms, you’re able to take very common patterns, for example, figuring out if a user will churn, and really aggregate and optimize those cases. And that’s the benefit of platforms. That’s the benefit of tools that are built more specifically to solve these use cases: you get that aggregation across many different customers, and you get that best feature set built into there. And what that made me start thinking of is that a lot of companies, even though they have their own data and they have enough variety of data, don’t necessarily have enough data of that same feature to train a model and to get accurate predictions.

SC: And what I was curious about is, with a platform like Vidora, could you in fact train user churn prediction across everybody’s data and then you can benefit from it with your own data as well?

SA: Yeah, it’s a really good question. So I should preface this by saying it’s not something that we currently do at Vidora, but it’s theoretically something you could do. However, one of the things that we’ve actually seen from customers is just sort of macro trends that actually point them in the opposite direction. So two things that I’ll point out that sort of come directly to mind are the sort of getting rid of third party cookies within websites, browsers are starting to do this, and the second, as we know with iOS, them starting to say, “Hey, don’t track me across apps.” And so one of the things that we’re seeing is this idea of stitching user profiles across different parts of the web is becoming less and less common. And so what that means to companies that are sort of directly interacting with these customers is that first-party data that you’re capturing about your users is actually becoming the most important thing that you can have with respect to machine learning predictions. And so that’s a really big thing that we work with our customers on is what is that internal data set that you have, and can we learn from it? Because where we might be able to join that with other data sets specifically for that customer, oftentimes that’s going away.

SA: And now, I think you bring up a really important question which is, okay, well, I have the data that I have, and let’s say that that’s the universe of data from which I can learn. The natural question is, is that enough? Can I actually learn from that? And the honest answer is it honestly sometimes depends. So for really common things that happen all the time within your business, again, I’m thinking of the big funnel conversions, registrations, subscriptions, purchases, churns, more likely than not, once we get to a certain user base, those kinds of things should be predictable because we should be seeing them more and more commonly over time, except for churn, we’ll never see that ever again, but all those other ones, we’ll be able to predict going forward. However, let’s just say that there’s an event that’s a once in a blue moon event, it happens once a year and one random customer is gonna do it, I’m obviously sort of being hyperbolic about this, but there are those patterns where there’s just not enough examples of customers actually taking these actions in the past to make a prediction.

SA: We would simply just be guessing. And so again, in that testing and validation phase, any time we create a new prediction for our customers, we automatically give it an accuracy score. And to be honest, some accuracies are poor, and that’s not to say that the model is broken or it’s not working, it’s just simply to say that we don’t have enough examples, to find patterns that are predictive of this behavior, and it’s just not something that’s common enough for us to predict.

SA: So we do run into that quite frequently with those more rare events, but we can usually find, again, those big conversion points within a customer journey that are predictable, and usually we find those are the ones that can sort of bring the biggest value to your customers. If you can get a few percent more conversions, if you can prevent a few percent of churns, those are gonna really drive your business, and those are often the ones that we do have enough data to make a prediction on.
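
Editor’s note: the point about rare events can be made concrete with a small sketch. It is an illustration only, not how Vidora computes its accuracy score. The function name `evaluate`, the `min_positives` threshold of 50, and the sample numbers are assumptions chosen for the example. The idea is to compare a prediction’s accuracy against the trivial baseline of always guessing the majority outcome, and to flag cases where there are too few past examples to learn from.

```python
from typing import Dict, List, Union


def evaluate(
    actual: List[bool],
    predicted: List[bool],
    min_positives: int = 50,  # arbitrary illustrative threshold, not a standard
) -> Dict[str, Union[int, float, bool]]:
    """Compare accuracy with a majority-class baseline and check example count."""
    total = len(actual)
    positives = sum(actual)
    correct = sum(a == p for a, p in zip(actual, predicted))
    accuracy = correct / total
    # Accuracy you would get by always predicting the more common outcome.
    baseline = max(positives, total - positives) / total
    return {
        "positives": positives,
        "accuracy": accuracy,
        "baseline": baseline,
        "enough_examples": positives >= min_positives,
        "beats_baseline": accuracy > baseline,
    }


# Hypothetical rare event: only 2 of 1,000 users ever took the action,
# and the "model" has learned to predict that nobody will.
actual = [True, True] + [False] * 998
predicted = [False] * 1000

report = evaluate(actual, predicted)
print(report)
```

Here the raw accuracy looks high, yet it is no better than the baseline and there are only two positive examples, so the prediction would amount to guessing. A frequent event such as a purchase or a churn would normally clear both checks.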

SC: Yeah, that makes total sense. These are the big needle movers for the business, and that tends to be where there’s enough data. And thinking back to the world of if you were doing just straight forward analytics on that, you presumably have the same issue there on these more rare events and trying to come up with business distillation and learnings from that, and it’s like, “What was this based on? Oh, 20 users. Uh-oh.”

LD: Yeah.

SA: Typically, what you’re probably gonna be doing is, you’re gonna find an event within your analytics that stands out to you. Maybe there’s a spike in something, and then you’re going to do just sort of natural exploration, “Okay. Who are these users? What are their segments? Is it location? Is it… When they registered, is it their subscription type?” and you’re still gonna start to guess, you’re gonna guess this, check it out, see if that’s important, you’re gonna look at this feature, check it out, see if that’s indicative of that user group, and essentially, what you’re doing is literally what the machine learning model is gonna do, except it’s gonna do it for all of those different data points. It’s gonna say, “Okay, I know who are the past users who have exhibited these actions, and what I’m gonna do is look at their past, what are the events that led up to that conversion point and try and find what are the most predictive behaviors that sort of say, ‘Okay, this user is creeping more towards this subscription event or this churn event,’” and essentially, because it has the entire sort of data universe of that company to learn from, it can start to suss out those patterns better than we could if we’re just sort of manually looking through it.

SA: And so, it does make it a bit more of an efficient process. Again, assuming that we have enough data from which to learn, it can really start to find those patterns very accurately.

SC: Well, so you know me, I’m convinced. I wanna go down this ML path, I don’t wanna do any of this manual work anymore, I want the insight from the machines. How do we start getting there? Does it depend on how organizations are structured? Does it depend on the team skill set? What does this process even start looking like?

SA: Yeah, for sure. So I think organizational structure, it’s important. When we start engaging with customers initially, it’s one of the first questions that we ask is, “How is the organization structured?” and typically, we see one of two paradigms, if customers are already starting to think about machine learning. One, would be a centralized data team, consider it the hub off which spokes come, and so I have a centralized data team and they work with marketing, they work with product, they work with finance, and so it’s typically that team that we would be dealing with, to do machine learning. However, we’re often seeing this sort of become diffused within an organization. And so, maybe instead of having a centralized team, we’re starting to see, there’s a Data Scientist specifically assigned to the marketing team, or there’s a data science team specifically assigned to the product team. And often times, this is where that interplay between, “Okay, I really understand the data models and how to build them,” versus, “Okay, I really understand our business and how to use those models.”

SA: And so, you really do need both sides of those equations, in order to really be able to take advantage of machine learning. And so, that’s a really important thing, is to be able to have that expertise internally, to know how to do both.

SA: However, I think there’s one layer on top of this. We talked about the fact that companies… Basically, every company has too much data to know how to deal with. However, one of the struggles we often see from companies is, “How do I organize it in a way that’s actually useful?” and that’s a big thing that we run into as well. And so, some common pitfalls that we see there, just always on top of our minds are, “Is there a way to combine these different data sets in a way that makes sense?” And one thing that’s really important is, “Are you tracking users consistently, across all of your different touch points with those users? Are you using email in one source and a subscriber ID in another source and a cookie ID in a third source, and do you have a way to combine those?” Obviously, with machine learning, we wanna make sure that we have all of the behaviors that user is exhibiting, in order to be able to make predictions about what they may be doing in the future, or who that user actually is. And so it’s sort of the organization of data and how well the data sets speak to each other internally that really sets the stage for how well we can do machine learning on top of that data.

SA: And so, on top of understanding the organizational structure of who’s responsible for what, it’s often understanding what the data architecture looks like, and, “Do we have access to all the different data points we would need and or want, in order to make really accurate predictions?”
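
Editor’s note: combining an email in one source, a subscriber ID in another, and a cookie ID in a third is an identity-stitching problem. One common way to approach it is a union-find (disjoint set) structure over pairs of identifiers that were observed together. The sketch below is a minimal illustration under that assumption, not a description of how Vidora or Ascend does it. The function name `stitch_profiles`, the `type:value` identifier format, and the sample links are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def stitch_profiles(links: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group identifiers into profiles, given pairs seen together in some source."""
    parent: Dict[str, str] = {}

    def find(ident: str) -> str:
        # Register unseen identifiers as their own root, then walk to the root.
        parent.setdefault(ident, ident)
        while parent[ident] != ident:
            parent[ident] = parent[parent[ident]]  # path halving
            ident = parent[ident]
        return ident

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two profiles

    profiles: Dict[str, Set[str]] = defaultdict(set)
    for ident in list(parent):
        profiles[find(ident)].add(ident)
    return list(profiles.values())


# Hypothetical links: each pair was observed together in one data source.
links = [
    ("email:ana@example.com", "subscriber:1001"),  # e.g. billing system
    ("subscriber:1001", "cookie:abc"),             # e.g. logged-in web session
    ("email:bo@example.com", "cookie:xyz"),        # e.g. newsletter click
]

profiles = stitch_profiles(links)
for profile in profiles:
    print(sorted(profile))
```

With the links above, the three identifiers for the first user collapse into one profile even though no single source contains all three, which is what lets a model see all of that user’s behavior together.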

SC: Yeah. So it’s sort of two different organizations in that sense. It’s what you’re calling out: the organization of the people who are getting this project done and the organization of the data. Do you find that getting that organization of the data ready and making sure that, yes, you do have a cookie ID that stitches back to an email address and having those ties, do you find that to be something that in today’s day and age, as people use more of these SaaS services, is taken for granted, or something where organizations need to invest and get ahead of it?

SA: I would say we’re in the inflection point. Having worked in this space for 10 years, I would say five to 10 years ago, this was the dream. I would say, right now, it’s the reality for many companies, but not all companies. However, if it’s not a reality, it’s sort of like the top priority. When we talk to companies, often times they’ll be like, “I really want to do this. However, I need to organize the data first, before it can even become a reality.” However, they know. And so, I would say we’re sort of right at that point where we already are capturing the data, so now it’s making sure that it’s organized, architected in the right way, so that we can reap the benefits of all of the sort of work that we have done to collect and aggregate this data. And so, I would say if you haven’t already done it, it’s probably something that you know you want to work on, and as you mentioned, there are a lot of different providers that can help companies do this. So it’s becoming easier and easier to actually do it, but with these organizations that are capturing a large amount of data, there’s obviously work that needs to be done, so it’s not a truly old step yet, but I would say it’s definitely becoming easier, over time.

SC: Yeah, and I would even echo that at my previous company, we got to the point where as we were bringing on another platform or a provider to do another task, that was one of the first questions was, “How are we gonna identify people? What’s gonna be the thing that’s consistent, as we integrate from this source to that source?” And to your point, on the inflection point, I think we did hit that inflection point where it was, “Well, if we’re going to bring on a new provider, let’s make sure we square with these details,” it was part of the checklist, always.

SA: Yep. And once you make that priority internally, it’s always a necessity for anything that you do from that point forward. You don’t wanna do all the work and then have another source that doesn’t map into that same structure. So, totally. Yeah, I definitely think it’s one… It’s something that’s becoming more and more common, going forward.

SC: And one of the points that you mentioned when you were talking about the two different structures of the organization, where you have… Honestly, not even specific to data scientists and machine learning, where you have data folks that are either one centralized data team or where these folks who have knowledge of data, are assigned specifically to product or supply chain or finance, things like this. Is there a particular paradigm that you’re finding, is faster, more productive and getting transformed into ML, or does it not matter?

SA: I would say, in general, I think both structures can work, but I think what you really want is alignment between, let’s just simplify and say two sides of that equation, which is, the ones that are gonna be really responsible for the data strategies, and the ones that are gonna be responsible for, let’s call it the business strategies. And so, on the data side, this is exactly what we were talking about. Somebody who can really understand, “What is our architecture? Where is this data stored? If we needed to transition it, aggregate it, integrate it with another service, how would we go about that?” And that’s a different skill set than, let’s say, the marketer or product manager who says, “Okay, assuming that that’s all taken care of, how can I now take advantage of all that data that we have, to make my product better, to personalize the experience more to each user, to make sure that my marketing is relevant to anybody to whom I’m sending?” And so, really what we want is alignment between those teams. If I have a business team that’s really focused on sort of using data in a particular way, but they don’t know how to get the data into that format, well then all we have is wishes, we don’t actually have the ability to take action on it.

SA: And so, I would say, regardless of which organizational structure we’re using, it’s really important that there’s that alignment, so that we can make sure we’re architecting and organizing the data in a way that our, again, I’m gonna just call it business teams, can actually use that data to the best effect for their customers.

LD: So let me jump in and ask real quick. ‘Cause we have heard, in a couple of different organizations, and I think we’ve actually talked about it on the podcast once or twice, folks actually out there hiring for machine learning engineers, even taking the data engineer role more specific into machine learning, but it is like finding a unicorn. And I think the reason why it’s come up is because it is incredibly… Hard as it is to find data engineers right now, it is 10 times or more harder to find machine learning engineers. Is that something that you’re also seeing more and more companies start to look for, or what is your experience around organizations that may have that role or are looking for that role or have decided not to go for that role?

SA: Yeah. Absolutely. I would totally agree, that for companies that are looking for it, they’re finding it hard to find those people, the really specific skill set. And again, I find that one of the big things is, “What’s the expertise in? Is it in building models and building features and creating predictions?” that’s really valuable. But then again, it’s, “Okay, so let’s say we can create a model. Do we actually have the internal expertise to scale that model, again, to production sizes, that’s streaming data in constantly, that’s growing in size?” They’re the data infrastructure tasks married with the machine learning tasks, and so my advice would be to make sure that we’re focused on both of those, not just building models locally on small data sets, but also being able to say, “Okay, now that we have a model, how do we productionize it?” I find those are two different skill sets, but we wanna make sure that we have both.

SA: However, one of the things that I often find when talking to customers is, even if they have that expertise internally, they maybe have a few. And again, what that means is, they can only focus on a few projects at a time. And so then, the question becomes, “If I didn’t make the cut, if my use case wasn’t prioritized enough, then what do I do? Am I just back to our old techniques, using what we’ve always done to get the job done, or are there other ways by which I can go about sort of taking better advantage of the data that we have?” And that’s usually where Vidora can start to enter into the conversation, because it can sort of open up the access to people who can’t build those models themselves, who aren’t the data engineers who can get into the code, so they can really start to test out, validate, and really use in production these different models.

SA: And so, I think, definitely it’s important to have internal expertise, especially for the most important tasks within an organization. I can think of, for example, if we are a company that makes most of our money on display advertising, I may want to own the algorithm that’s optimizing for which ads I’m showing my customers, and that is probably the most important decision I can make as a business. And so for me, it makes most sense to make that bespoke, custom, something we own, can control, can change whenever we want, but that’s no light task. That’s gonna take a really long time and it’s gonna be a continuous focus on making sure that it’s working. And so, if that’s what we’re focused on with our internal team, what does your marketing team have access to? What if we wanted to do different kinds of predictions? And so, those are the questions that you’ll often run into, where you’re choosing your highest priority tasks and then it’s a decision of, “Do we simply just wait for all the others, or do we have… Do we find other opportunities to give them what they need, through other avenues?”

LD: Going back to the whole idea of how companies are really starting to better use their data and machine learning, kind of taking in a little bit, what are some of the more interesting ways that you’ve seen companies get to that point? And they don’t have to be pretty, in fact, they’re kind of funny when they’re not pretty. That’s usually… The way that companies have figured out how to do it, and you look at it and you’re like, “Wow, that’s not how I necessarily would have done it. That is super interesting,” can be some of the ones that you learn the most from. What are some of the best ways that you’ve seen companies really make that… I hate to call it jump, but it is a little bit of a hurdle that they have to cross, if that makes sense.

SA: Yeah, yeah, definitely. I’m obviously drawing a blank right here, in real time. But typically, the way that we’ll see it, is very similar to how we talked about before. If we have an internal team, the first thing is about defining what are the highest priority initiatives that we can accomplish and then it’s about breaking down that task into its individual chunks. What data do we have access to? What’s the most important data that we have, that we should be learning from, and what models can I build on top of that data, to make the most accurate predictions? And this is a really important phase because again, it tests and validates our ideas to say, “Is this even within the realm of possibilities?” And so, that is often where most of the time, in the strictly machine learning realm, is spent, it’s saying, “What are the most important features to train the models? Which models should I be testing on, and how do I make sure that the models are accurate, when I do run them?” And then it’s about deploying those models and actually taking advantage of those.

SA: And so, questions there become, “How often do I need to run this model? Is it enough for me to make one prediction, and then that’s it?” Most often, not. Usually, we’re making predictions because we wanna take some action on them, and so where we often see companies sort of fall down in this deployment is, they’ve set up an infrastructure that allows me to… I’m gonna generalize, click run, and I get these predictions for my customers, but then I change my product and the data schema changes slightly, or I introduce a new feature, data that it’s never seen before. How easy is it to take these new data regimes, retrain the model and re-update those predictions? Sometimes that’s easy, it really depends on how we’ve architected the model; sometimes it’s basically like starting over. And that’s really what you want to avoid. You wanna be able to have this built out in a scalable way, so that new data introduced automatically gets trained, and it’s not a manual process to go and have to manually re-run and re-train these models, but something that can be automated and continued out over time.

SA: And so, I would often say, those are kind of some of the pitfalls that we see, is that, again, we don’t think about how difficult it will be to transition smaller models into larger models, and we don’t think about how to really operationalize this for changes over time. New users, new features, new websites, those kinds of things. How do those impact the model and how do we make sure that those can scale with our models over time? So I’m not sure I answered your question specifically, but those are just a few that came to mind that we’ve sort of talked through, with different customers.
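[Editor’s illustration] The retraining pitfall Shawn describes can be sketched roughly as follows. This is a minimal sketch and not Vidora’s or Ascend’s implementation: the function names `new_columns` and `retrain_if_schema_changed`, the column names, and the `retrain` callback are all hypothetical. The idea is simply that each incoming batch is compared against the columns the model was last trained on, and retraining is triggered automatically when something new appears, instead of waiting for someone to notice and re-run the job by hand.

```python
from typing import Callable, Dict, Iterable, List, Set


def new_columns(batch: Iterable[Dict[str, object]], known: Set[str]) -> Set[str]:
    """Return the column names in the batch that the model has never seen."""
    seen: Set[str] = set()
    for row in batch:
        seen.update(row.keys())
    return seen - known


def retrain_if_schema_changed(
    batch: List[Dict[str, object]],
    known: Set[str],
    retrain: Callable[[List[str]], None],
) -> bool:
    """Call retrain() with the full column list when new columns appear.

    `known` is updated in place so the next batch is compared against
    the schema the model was most recently trained on.
    """
    added = new_columns(batch, known)
    if not added:
        return False
    known |= added
    retrain(sorted(known))  # hypothetical hook into the training job
    return True


# Example: a product change adds a "watch_time" field to the event stream.
known_cols = {"user_id", "clicks"}
retrain_calls: List[List[str]] = []

unchanged = retrain_if_schema_changed(
    [{"user_id": 1, "clicks": 3}], known_cols, retrain_calls.append
)
changed = retrain_if_schema_changed(
    [{"user_id": 1, "clicks": 3, "watch_time": 12.5}], known_cols, retrain_calls.append
)
```

In a real pipeline the check would more likely sit in the ingestion layer and also cover type changes and new users or sites, but the shape is the same: detect the change, then retrain without a manual step.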

LD: Yeah, no, I think that makes… You did. And I think that makes a lot of sense, and I think Sheel will agree. In a slightly different vein, because obviously we’re not as… While a lot of our customers are using data pipelines for AI and ML stuff and those workflows, we see a lot of that in the same way where they… It takes a bit of a mindset shift, in order to say, “Let me architect this for what we have now, with an eye to knowing that things might be… Are going to adjust in the future.” If you don’t know what the future is necessarily going to look like, it’s hard to say, “I’m gonna create this, where it works now, without needing to change everything when we add in 10 new team members, or when we add in new data sets or when we add in something different that maybe throws the whole thing off a little bit.”

SA: There’s kind of a stat floating around that says about 80% of machine learning projects that get started within organizations never actually make it to production and get deployed. And so, what we find is that there’s a lot of internal testing of machine learning, but not a lot of deployment and productionization of that machine learning. And these are exactly the kinds of things that we see and run into. And so, some ways that we see that actually manifest itself within an organization: we have data, I download a little bit of it and I put it into my local storage and I’m running my model off there, but I realize when I actually scale it out to the actual production data set, it breaks under the strain, or it takes too long to actually be able to effectively use this model in a real-time production environment, or as you mentioned, data is changing over time, so how do I make sure that my infrastructure is robust enough, that I’m not manually saying, “This is exactly how it should learn,” but I’m more dynamically saying, “Take the data that we have and figure out what we should be learning from, so that it’s always continually learning over time.”

SA: And so, I think you mentioned it, right, it’s that pre-planning that goes into it. Two things that we often see customers underestimate, is the timing and resources needed to get these done internally, it’s often gonna take longer than you might initially have expected and usually cost more throughout that process, than you might have initially expected, if you’re trying to take this load on and simply do it all from scratch internally. And so, it’s really important to sort of go in with clear eyes, to make sure that we have the right team in place and we have the right plan in place, to not just build one model, run it once to see if it works, but really to architect it from the beginning, assuming that it’s going to work and going to need to be continuously running over time. These are really hard things to do ahead of time, and so I think that’s where when you were mentioning getting machine learning engineers with expertise, who have actually productionized those models before, I think the more you say, “Have you done it before? At what scale?” those kinds of things, that’s where the search becomes so hard, because you really want somebody who’s sort of been there and experienced it, and can say ahead of time, “Actually, here are the things that we need to look out for,” and plan those into the process.

LD: Well, Shawn, I have one final question for you. Let me stump you. No, it’s not gonna stump you at all. And we ask this of everybody, and it’s probably my favorite question out there. But what are you most looking forward to, in the next, we’ll call it two to five years, in the ML space or really just the data space in general? Things seem to change so fast that two to five years seems kind of crazy, but what is it that you are really excited about?

SA: Yeah. Absolutely. I think the thing I’m most excited about, I think Sheel brought this up right at the beginning of the call, and it’s really, giving access to different parts of the organization, giving them access to machine learning where it’s never existed before. And I really do believe, in the next two to five years, this will become a reality. So we spent a lot of the conversation talking about how much expertise you need internally, to do this yourselves. Even when you have this expertise, it’s often extremely hard to do it right the first time. And so, you might need to take more time, put more resources into actually getting it done, and even then it’s really hard to get it into production. But what we’re seeing in the industry is, you have tools and services that make it much easier to automate a lot of it. And to sort of speak to how we think about this, the way that we typically do this is, instead of trying to solve everything under the sun and do every kind of machine learning imaginable, we really like to focus very specifically on a subset of machine learning problems specific to customers and consumer data.

SA: And what that allows us to do is, automate more of the process of taking raw data and creating predictions, we can automate more of that than we could if we were trying to solve sort of every machine learning problem that we could imagine. And the benefit of this is, giving access to those teams who know they have that kind of data and know that they want those kind of predictions, we are now seeing marketing teams themselves create these predictions, so they know which channel, which time, which marketing offers to give to which users, and they can create these predictions themselves. We work with a lot of product teams, to do… The common use case is going to be personalized recommendations, but we’re also seeing things like, let’s say I’m a subscription company and I have different offers I can give a user, to entice them to subscribe. Well, I don’t wanna be giving everybody the offer, ’cause that’s just money off the table, if they were gonna do it anyway, but if I could myself create a prediction that says, I have a few offers, at which point in the funnel and which offer should I give that user, to make sure it’s best for them so that they feel most comfortable subscribing, that can really move the needle for a company.
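[Editor’s illustration] The subscription-offer reasoning above comes down to a small piece of arithmetic. The sketch below is an assumption-laden illustration and not Vidora’s method: the function `best_offer`, the offer names, the probabilities, and the dollar values are all made up. It assumes some model has already predicted each user’s chance of subscribing with no offer and with each offer, and it picks the offer whose expected value beats doing nothing, or no offer at all if the user would likely subscribe anyway.

```python
from typing import Dict, Optional


def best_offer(
    baseline_prob: float,
    offer_probs: Dict[str, float],
    offer_costs: Dict[str, float],
    subscription_value: float,
) -> Optional[str]:
    """Pick the offer with the highest expected gain over making no offer.

    Expected value with no offer:  baseline_prob * subscription_value
    Expected value with an offer:  p * (subscription_value - cost)
    Gain = (p - baseline_prob) * subscription_value - p * cost
    Returns None when no offer has a positive gain.
    """
    best_name: Optional[str] = None
    best_gain = 0.0
    for name, p in offer_probs.items():
        gain = (p - baseline_prob) * subscription_value - p * offer_costs[name]
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name


costs = {"10_off": 10.0, "free_month": 30.0}

# A user who would very likely subscribe anyway: the discount is wasted money.
likely_user = best_offer(0.90, {"10_off": 0.92, "free_month": 0.93}, costs, 100.0)

# A user on the fence: the larger offer is worth what it costs.
fence_user = best_offer(0.20, {"10_off": 0.35, "free_month": 0.50}, costs, 100.0)
```

With these made-up numbers, the likely subscriber gets no offer and the on-the-fence user gets the free month, which is the behavior Shawn describes wanting from the prediction.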

SA: And so, where before, this might take one of those really large internal projects where we have to hire out new team members, get that expertise internally, now, this is something that I as a product team can automate myself. And so I think that’s the thing that we’re most excited about, looking into the future, two to five years is, “How can we give easier access to machine learning to these different teams, who for better or worse, may often fall down on that priority ladder and give them that ability to really optimize their responsibilities and optimize the data they have?” And I think this is all to the benefit of the customers. They’ll get better offers, they’ll get better personalized experiences, and that will also lead to better results for the company. And so those are the things that we’re most looking forward to.

LD: That’s awesome. Well, we’re looking forward to that as well. I think that’s a lot of… That’s one of the things that we love doing as well. So yeah, we’re looking forward to that also. Well, Shawn, thank you so much for joining us today. Sheel… I will speak for both Sheel and I. He can feel free to disagree with me, if he so chooses, but… He’s over here laughing. But we really enjoyed the conversation, and I’m sure we will have you back on again, to talk more about this. So, thank you so much.

SA: Absolutely. And again, I really appreciate you guys inviting me on. We always enjoy talking with everybody from your team. Leslie, Sheel, it’s been great talking with you. And yeah, happy to come back any time.

LD: Awesome. Thank you so much to Shawn, and Sheel, for that matter, for this great episode. Operationalizing ML, AI and even just data in general is a huge topic and one that, again, I’ll say it again, is near and dear to our hearts. It’s one of those things, I mentioned, our DataAware poll survey is… Folks now have their data, they have scaled up their infrastructure. That is not the problem that they have. The problem now is how they scale their teams in order to actually operationalize the data that they have, how they get it out to people, how they actually use that data in production. So it’s always interesting to hear from someone like Shawn at Vidora, about how they’re helping their customers do that more and more. So if you’d like to learn more, you can always jump on over, visit Vidora, check them out, or you can always reach out to us, at [email protected], or at Twitter @Ascend_IO, and we’d be happy to chat. Welcome to a new era of data engineering.

Ep 11 – Operationalizing Machine Learning Pipelines with Vidora’s Shawn Azman