In the latest episode of the DataAware podcast, Sean and I chatted with Miguel Alvarado, CTO at Lumiata, a company on the cutting edge of enabling ML & AI for healthcare organizations, about how data science is moving the needle in the healthcare industry, as well as what organizations should be looking for when it comes to AI & ML centric data teams.
Learn how AI & ML are impacting healthcare—and impacting data team structure—in this episode of DataAware, a podcast about all things data engineering.
Leslie Denson: Working with data in the healthcare industry brings its own set of challenges. However, it also brings its own unique set of rewards. Today, Sean and I had the chance to chat with Miguel Alvarado, CTO of Lumiata, to talk about the importance of AI and ML in healthcare, and why finding the right team can be so challenging. In this episode of DataAware, a podcast about all things data engineering.
LD: Hey everybody, welcome back to another episode of the DataAware podcast. Today, I am back with Sean Knapp. Hey Sean, how’s it going?
Sean Knapp: Hey, everybody.
LD: What you can’t see but I can ’cause we’re on video, is that Sean just chugged a five-hour energy, which means this is gonna be a good podcast. I know enough to know now that this is gonna be a good podcast.
LD: So today, Sean and I are joined by somebody that Ascend adores. It is Miguel Alvarado, CTO of Lumiata, who is a partner of ours. Hey Miguel, how’s it going?
Miguel Alvarado: Hi. It’s going well. Thank you for having me.
LD: Of course, of course. We’re super happy to have you on the line. I have heard such fantastic things about Lumiata and what you guys are doing in the healthcare space with AI and ML, that I know Sean and I are both really excited to chat with you today and start diving in on some of that, ’cause it’s a fascinating, fascinating space. So thanks for joining, thanks for digging in on this.
LD: Why don’t you start us off by giving us just a little bit of background, or the listeners really a little bit of background on yourself and a little bit of background on Lumiata, and then we’ll really just start diving into it?
MA: Sure. I’ll start with my background. So I have a little bit over 25 years of experience with software. I started my career back in 1996 at Microsoft back in the day. So I had a few different individual contributor gigs before I got into leadership. Then I started managing a couple of people, then three people.
MA: And then I became part of this little startup that was four people, then we got acquired, it was called MetaStories, we got acquired by a company called Brightcove, which was an online media platform, which was actually the biggest competitor to a company that Sean used to be a co-founder of, called Ooyala.
MA: Anyway, since Brightcove, I spent a lot of time building analytics systems, building big data and analytics systems, and that continued over to Intel, Verizon and VEVO. And throughout that journey, machine learning popped up at some point about seven years ago, which is kind of like the natural progression, you start with big data and then you start thinking about the cool things you can do with the big data, and machine learning is a natural progression.
MA: And at some point, a couple of years ago, I thought that I wanted to do… That I have been doing media for a long time, and that was really cool and interesting, and really fun and challenging in a lot of ways, because of scale. But I felt like I needed to go somewhere where my work or the products that I built had more social impact than… Something a little bit juicier than just entertainment. So I felt that there were two areas that could be of interest, one was education, and the other one was healthcare.
MA: Those to me, they’re kind of like the two pillars of any society, education and healthcare. And in that journey, I found Lumiata and Lumiata had the perfect combination of a strong, ambitious mission and vision. They had a good team, and it had a lot of data, which is rare. Sometimes when you are looking at the startup environment when it comes to AI, you hear really interesting problems and then you ask people, “Well, how are you doing it, what’s your data? Where are you getting your data from? How much data do you have?” They’re like, “Well, we don’t have much data.” It’s like, “Okay, well that’s gonna be a problem if you wanna do machine learning.”
MA: So now pivoting the conversation a little bit into Lumiata, but Lumiata is… Well, first of all, I’m CTO at Lumiata, so I oversee software engineering, data science and product management. And we are a company who has built an AI platform that allows us to deliver machine learning models to solve very specific problems in healthcare.
MA: And the use cases that we go after range from predicting cost and risk for individuals and groups of individuals, for the sake of health insurance underwriting, all the way over to the clinical side, which is more around predicting disease onset and medical events. And we have been following a land-and-expand strategy, where we’ve been landing in the area of cost and risk prediction. We felt that that was a very pragmatic, very obtainable area to focus on.
MA: You’re predicting cost, so it’s not very different from financial companies predicting financial outcomes. So there’s prior art there, so we’re landing there, that’s where we’re gaining our momentum, but we’re expanding to the clinical side. And an example of that is a pharmacy company called FGC in Canada.
MA: And what they’ve done is they’ve deployed Lumiata models in 12 pharmacies, so models in the physical environment. And we have data for a lot of their customers, so when a customer comes in to buy their meds, they fill out a few questions and then we do a prediction for Diabetes Type 2 and another condition, I forget which one at the moment.
MA: But if that prediction comes true or it turns out to be true, then the pharmacist asks the customer if they wanna spend some time with a physician assistant. So they do like a mini intervention there and then the outcome of that could be that there’s a doctor appointment that gets scheduled. So this is all in spirit of bettering health outcomes and getting ahead of things.
MA: And I find it fascinating because it’s AI in a physical environment. And FGC is doing an A/B test between 12 pharmacies that have this and 12 pharmacies that don’t have it, and they’re trying to gauge what the satisfaction metrics say from their customers. So that’s just… Go ahead.
LD: I said that’s really cool.
MA: Yeah, no, it’s awesome. So that just shows you kind of an idea of the breadth of things we’re doing. Part of what we’re doing too is we’ve been building tools for ourselves to industrialize the machine learning lifecycle from raw data over to fully productionalized models, and we’ve been exposing some of those tools to our customers, so that they could accelerate the way that they go about building machine learning models and whatnot.
MA: That area hasn’t gained as much momentum, but we’re finding a sweet spot for that in the actuarial practices. So actuaries turn out to be the closest thing to a data scientist that some of these healthcare companies have, and actuaries happen to know math and they happen to know statistics, but they don’t program, they don’t code, but they have the ability to understand machine learning way better than a lot of other people.
MA: So now we’re pivoting with our tools to build machine learning models into building tools for actuaries to build machine learning models. And I’ll stop there. There’s a lot more to say about Lumiata but I think that’s the gist of it.
SK: Awesome. And to add on for those… Since we’re on podcast and not on video, but for those who are recording, I could take copious notes, old school style written down, so I’ve been frantically writing all sorts of questions as I hear Miguel walk us through a bunch of this. So many different threads to pull on and areas to focus on. And so I really wanna dive more into the healthcare-specific stuff, but I have a feeling we’ll spend so much time there.
SK: Wanted to start first with a little bit more sort of general. As we think about your journey all the way back from Microsoft to especially a lot of the ecosystem that in, when you’ve been in the data world, media to healthcare, and you even talked about a lot of the predictive things that you all are doing at Lumiata and how there are similarities between that and the finance domain as well.
SK: Walk us through this, ’cause we have listeners on the podcast from all sorts of different domains and industries. What are the similarities, what are the things that are transferable between these different industries? And then also what are the things that will then bridge into more healthcare specific stuff, what are the things that are really distinctly different about healthcare from others?
MA: Yeah, that’s a good question. I think part of the answer has to do with why I jumped from media to healthcare. Basically, I felt that there are some things that are truths about data and machine learning across any vertical, and I felt like, “Hey, so we’ve solved some problems in media, healthcare is a little bit behind.” There’s an opportunity to bring some of the stuff that we’ve done in media into healthcare and help move the needle forward.
MA: So some of the similarities are… So when it comes to data, everybody has the same problems and we’ve been having the same problems forever, and quality, as you know, is really hard to still get right. And when it comes to machine learning, garbage in, garbage out. And the worst thing that you can have is a data scientist troubleshooting a model for weeks and then they realize that the model is fine, the problem was the source data to begin with, all the way down here.
MA: So those are problems that we’ve had in software and big data for quite some time. So I think that’s a similarity. I think dealing with data problems, while the specific manifestation of the problems is different depending on what domain you’re in, at the end it’s the same core problems. How do you ensure data quality? How do you generally validate data? How do you keep the lineage between source data and derivatives of the data? How do you treat data almost like source code? How do you version it? I could keep going on and on, but you get the idea. It’s the same data problems.
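To make the questions above concrete, here is a minimal sketch of validating records, fingerprinting a dataset so it can be versioned like source code, and recording lineage for a derived dataset. The schema, field names, and function names are hypothetical illustrations, not Lumiata's tooling, and the fingerprint is sensitive to row order.

```python
import hashlib
import json

# Hypothetical schema: field name -> (expected type, required?)
SCHEMA = {
    "member_id": (str, True),
    "age": (int, True),
    "total_cost": (float, False),
}


def validate(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, (expected_type, required) in SCHEMA.items():
        value = record.get(field)
        if value is None:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(value, expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # A simple range check on top of the type checks.
    age = record.get("age")
    if isinstance(age, int) and not 0 <= age <= 120:
        problems.append("age out of range")
    return problems


def fingerprint(records):
    """Content hash of a dataset, usable as a version identifier.

    The hash depends on row order, so sort rows first if order is not meaningful.
    """
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def derive(records, source_version, step):
    """Keep only clean rows and record which source version they came from."""
    clean = [r for r in records if not validate(r)]
    return {
        "rows": clean,
        "lineage": {"source_version": source_version, "step": step},
        "version": fingerprint(clean),
    }
```

Storing the source fingerprint next to each derived dataset is one simple way to trace a misbehaving model back to the exact input it was trained on.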
MA: When it comes to machine learning, I think that you can approach machine learning as a generalist, where if you understand regression, classification, and time series problems and their solutions, and then you understand things like traditional machine learning algorithms and what each algorithm is capable of doing, supervised, unsupervised, semi-supervised, going all the way to deep learning, the different architectures and what each architecture is good for, if you can understand those things in an abstract way, you can apply them to any vertical.
MA: So if you know how to predict time series data, well, you could go to Uber and predict what the ride behavior is gonna look like at any point in time in the future. Or you could go to finance and predict the stock market. Or you can come to Lumiata and predict what somebody’s health timeline is gonna look like in a year from now. So again, those things are very transferable.
SK: When we think about the things that are really unique and distinct. What are the things that you’ve had to build on top of this foundation for very specifically Lumiata and the healthcare industry?
MA: Yeah, it’s a good question too. I think there’s a couple of different things. One I think that’s very distinct is that, and maybe we share this with the financial industry, which is security and privacy are at the heart of it all. You have to be super careful. I don’t know if you know this or not, but if you have a breach on healthcare data, the fees that you have to pay as a penalty for that breach, they’re ridiculous. It’s like from $50,000 per person.
MA: So if you have a data set of a million people and you have a breach, you’ll probably go out of business as a small startup. So as a small start-up, you have to go above and beyond to have a very secure environment that has all the guardrails for privacy as well, so that’s one way in which it’s different.
MA: Here’s another thing that’s really interesting. When you’re dealing with healthcare data, it’s incredibly fragmented. So for instance, if we get data from a provider, when it comes to an individual, the provider is only gonna have information for that individual for the times that they went to see that provider. So if you’re a person that’s seen a lot of different doctors, well, we’re only gonna see the data from that hospital system or that doctor, but we can’t see any of the data from the other providers. So that’s a problem, we only have a limited view for you.
MA: If the data’s coming from a payer, so a health insurance company, we’re only gonna get the data for the time that you were covered by that insurance, if you went from one job to another, the insurance company is only gonna have data for you when you were employed by the company that had insurance policies with that payer.
MA: Even if you went from one employer to another, the insurance company may have you broken down as two different people and they haven’t connected you as a single individual, so that makes it incredibly hard to create models that will predict with decent level of accuracy at the individual level, if you’re trying to predict things like disease onset, because to predict conditions you need to have a very accurate health history for individuals, and it’s always going to be fragmented.
MA: There’s ways around that, by the way, but it’s hard. Whereas, yeah, if you’re a media company and you’re building a recommender for content, it’s likely that the company has all the data for that individual for when they’ve been consuming video on that property. So it’s a little different.
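The "two different people" problem described above is usually called record linkage. The sketch below is a deliberately naive illustration that matches on a normalized name plus date of birth; the field names and sample shape are hypothetical, and this is not a description of how Lumiata links records. Exact matching like this misses typos and name changes and can wrongly merge distinct people, which is why production systems tend to use probabilistic matching.

```python
from collections import defaultdict


def link_key(record):
    """Build a naive match key from a normalized name and date of birth."""
    name = "".join(ch for ch in record["name"].lower() if ch.isalpha())
    return (name, record["dob"])


def link_members(records):
    """Group records that appear to describe the same person."""
    groups = defaultdict(list)
    for record in records:
        groups[link_key(record)].append(record)
    return list(groups.values())


def merged_history(group):
    """Combine claims from all linked records, ordered by service date.

    Dates are assumed to be ISO strings (YYYY-MM-DD), which sort correctly as text.
    """
    claims = [claim for record in group for claim in record["claims"]]
    return sorted(claims, key=lambda claim: claim["date"])
```

With two enrollment records for the same person under different employers, `link_members` puts them in one group and `merged_history` returns a single timeline, which is the kind of input a disease-onset model would need.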
SK: And so one of the things that you touched on too, which we see certainly from the Ascend perspective, a shift across industries around data and privacy and security as well. Clearly, we’re entering these heightened states of awareness around privacy and security, anyway. As we’ve talked over the years, I know that you all, because you work with all sorts of different customers and partners and many providers and so on, you’re dealing with people who have data across different cloud systems, you have people who literally have records sitting on a server in somebody’s closet inside of the hospital.
SK: How do you deal with this? If one of our listeners is like thinking about starting their own company in the healthcare space, what are the things that they should know around, these are the things that you are going to meet from a requirements perspective, when you want to go tap into healthcare data?
MA: Yeah, it’s a good question. I mean, I think that… So here’s something that’s interesting about Lumiata, I think that initially some time ago, Lumiata wanted to make a difference when it comes to machine learning, and in that journey, they discovered like, “Wow, before we even get to machine learning, we gotta get the data part right.” [chuckle] And that requires way more effort than anybody imagines, right? And in healthcare, it’s really messy.
MA: There are very few companies that have a very pristine environment. I gotta say that probably insurance companies have the cleanest data, because their business depends, the claims datasets are mission critical, so they better be right. So their whole business depends on it. But when it comes to the provider side, it gets messier and messier.
MA: If you go to the ACO side, they have data from a lot of different places, so and every place has a different data format. So I guess if you’re somebody getting into this space, you just gotta know that there’s no common ground and you’re just gonna have to deal with a lot of different data schemas from different people and just have to figure out a way to normalize that to something.
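One common way to handle many incoming schemas is a per-source mapping onto a single canonical record. The sketch below assumes two hypothetical sources (`payer_a`, `provider_b`) with made-up column names and units; it illustrates the normalization idea only and is not Lumiata's pipeline.

```python
# Canonical fields every normalized record will carry.
CANONICAL_FIELDS = ("member_id", "service_date", "amount_usd")


def cents_to_dollars(value):
    """Convert an integer-like cents value to dollars."""
    return int(value) / 100


def us_date_to_iso(value):
    """Convert M/D/YYYY to YYYY-MM-DD."""
    month, day, year = value.split("/")
    return f"{year}-{int(month):02d}-{int(day):02d}"


# Hypothetical mappings: source column -> (canonical field, converter)
MAPPINGS = {
    "payer_a": {
        "MBR_ID": ("member_id", str),
        "SVC_DT": ("service_date", str),
        "PAID_AMT": ("amount_usd", float),
    },
    "provider_b": {
        "patient": ("member_id", str),
        "visit_date": ("service_date", us_date_to_iso),
        "charge_cents": ("amount_usd", cents_to_dollars),
    },
}


def normalize(source, row):
    """Map one source row onto the canonical schema; absent values stay None."""
    out = {field: None for field in CANONICAL_FIELDS}
    for column, (field, convert) in MAPPINGS[source].items():
        if column in row and row[column] not in (None, ""):
            out[field] = convert(row[column])
    return out
```

Keeping the mappings as data rather than code means onboarding a new source is mostly a matter of adding one more dictionary entry and, where units differ, a small converter.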
MA: When I joined Lumiata, I thought, “Well, hey, isn’t everybody just using FHIR HL7?” Which is the standard. Well, yeah, everybody would like to be using it, but a lot of people are not. And even when people are using it, there’s a disconnect between organizations. Like some…
MA: You go to a payer, maybe some IT Department is using FHIR for inter-op on some systems, but then we’re dealing with the analytics team and they’re disconnected from that IT Department, so when they share data to us, it’s just table dumps from data warehouses. So there’s fragmentation even within these companies themselves.
SK: And so this becomes a little bit interesting too, because as we’ve talked about it in the past, all of this stuff is basically just getting in the way of you doing the really cool, fancy stuff. What is it? I forget the woman that I think had this quote, which was, “The greatest minds of our generation are fixing commas inside of files.” So try and get things to parse, and like splitting fields and so on.
MA: Spot on, for sure.
SK: Which I’m sure has gotta be disheartening for many data scientists early on in their careers who are like, “I thought I was going to save the world, and I’m reading through a spreadsheet to top stuff up,” but like a lot of things, you kinda gotta grind your way through some of that monotony to get there, and technology is making this easier over time.
SK: I think the one interesting question that I always like to ask folks from different industries is, what does your team composition look like, and what are those ideal ratios? I think we had Jesse Anderson on, a couple of podcasts ago, and… There is a lot of it and it’s a moving target, right? Depending on the products and the technologies available in the space, like how many analytics people you want versus how many data scientists, versus machine learning engineers versus data engineers.
SK: In your mind, what are these ratios? And what are the, even more importantly, the skill sets you look for across these, so that you don’t have your data engineer trying to do, modeling the data and you don’t have your data scientists trying to manage your infrastructure? What does that look like?
MA: This is a really great question because we’re growing, so we’ve been recruiting a lot recently so I’ve been talking to a lot of candidates, and here is a huge problem that I see, even before I get to ratios. The variance in the industry of what a data engineer is or what a machine learning engineer is, is incredibly high, the skill sets are all over the place. And it’s almost like talking to a data engineer that has been doing a bunch of SQL, and now they do it in Spark, so now they’re a data engineer.
MA: But I’m sorry if you just know SQL, to me, that’s not a data engineer. Or I’ve been interviewing so many machine learning engineers that say, “I’m a machine learning engineer.” And they’ve been coding because they’ve been training models in a Jupyter Notebook, but that’s the extent of the coding. It’s like, “Well, yeah, I code Jupyter. I train a model in Jupyter and then I create a Docker container and I push that to production.”
MA: Well, that’s still not software development. I mean, the amount of code that you’ve had to write is very limited. So I think the key thing is to define these roles. So to me, a data engineer is somebody that understands distributed systems, maybe hasn’t had to build a distributed system, but they understand the inner workings of something like Spark very well. And they know one widespread language, at least, like it could be Java, it could be Scala.
MA: I mean, if you come to that world, Scala, it wouldn’t be out of the ordinary. And you’re very strong with… It could be Python as well, but you’re very, very strong in that language, and you understand what software design patterns are. I can filter out the majority of people from data engineering, machine learning engineering, by asking them, “What are design patterns, and what are the most popular design patterns you use?” Some people are like, “Well, I write code that can scale.” “Well, you haven’t answered my question, that is not a design pattern.” [chuckle]
MA: It’s just like a requirement, but that’s a very ambiguous requirement. So I think that a data engineer, a machine learning engineer, first of all, they need to have a strong software development foundation, period. They need to understand design patterns, they need to understand how to write maintainable, well-structured code, you need to know how to write tests for that code. You’d be surprised so many people don’t write their own tests.
MA: Anyway, assuming that that foundation is there, then you have the specialization of data and machine learning. Well, if you’re a data engineer, you’re a very strong engineer, but now you have a passion to solve data problems, you have an interest in solving data-specific problems, and you wanna do that at that scale.
MA: On the machine learning side, again, you have the software foundation, but you also, you’re a little bit of a data scientist too. You know the math, you know the statistics, you read a lot of papers, so you know how to train models, you understand the algorithms, so you know what algorithms to apply for what problems. And you also understand the underlying infrastructure, like you understand Kubernetes, you understand Kubeflow, you understand what distributed training is, all those things.
MA: So I think like, with this definition, makes a machine learning engineer a bit of a unicorn because I expect people to be a bit of a systems engineer, a bit of a distributed systems engineer, but also know all these things that a data scientist knows. Well, so then what’s a data scientist?
MA: A data scientist is somebody that understands the math and statistics of different algorithms to solve data problems. It doesn’t have to be only machine learning, but I think the statistical background gives them the toolkit to solve many different kinds of problems. In our case, we do want a data scientist to know machine learning and be a bit of a machine learning practitioner.
MA: The difference, machine learning engineer, data scientist, is that the data scientist is not expected to have the software foundation that we talked about, distributed systems and whatnot, they’re gonna be more on the math side of things, more of the, “This is how the algorithms work and how I can creatively solve problems with maths, statistics and machine learning algorithms.”
MA: Now that that definition is there, the ratio, I would say one data scientist for every four to five engineers. And from that, probably you wanna have… It depends on what type of shop you are, but I would say three data engineers and two machine learning engineers. And this is without counting the engineers that you would wanna have building services and building front end stuff, this is just kind of strictly on the data and machine learning side.
SK: Interesting. So what would you say to… We encounter a ton of companies where the ratios are flipped. There are five data scientists for one data… A star engineer, as in like, data engineer, ML engineer, infrastructure engineer. Besides feeling bad for that poor engineer, who has got so many folks they gotta support, what would you do? ‘Cause at some point, you end up finding all those data scientists, they’re doing data engineering work, right? They…
MA: Well, that’s…
SK: Worried about it.
MA: So that’s a problem that they… Exactly. If you have the ratio flipped, you end up having data scientists doing a bunch of data engineering, and sometimes they’re willing to do it and they’re happy to do it, but they don’t have the software foundation. You’re gonna end up with a lot of very well-structured spaghetti code. That’s a spaghetti… Very well-structured spaghetti code.
SK: That’s a good one.
LD: Second time I’ve heard “spaghetti code” today, by the way. [chuckle]
SK: Oh was it in this…
LD: Yeah, pretty sure it was.
SK: We do an internal tech talk weekly, so.
SK: I was drawing a lot of the stuff on the whiteboard. Well, okay, so this is really, I’d love to keep pulling on this thread, but it’s, we got a bunch of other things I’d love to talk through. So one of the things you mentioned earlier, and I wanna pop back up to that, which is really interesting, you talked about coming from media, where you end up with this insane volume of data usually.
SK: Like media, online video consumption, we both played in the same industries, like, “Oh, we can track every heartbeat from the player, so we know what seconds of video you watched when you came in, when you abandoned.” All of this stuff, you get so much data. And generally speaking, assuming good data in, good outcome, but you need certain volumes of data, and you touched on this earlier.
SK: So I’m gonna ask you like a… It’s probably a really hard question to answer, but I’m gonna ask it anyway. I think that a lot of people, as they get started, have to answer this question, which is, how much data do you need? I know the answer you’re gonna start with, “It depends.”
MA: It depends.
SK: And so then I’m gonna go back sooner or later… But like, how much do you need? Is it…
MA: Yeah, well…
SK: How do you know if you have statistically significant data or if you need to go get more, etcetera?
MA: Well that’s a super hard question to answer, like exactly how much you need, but I mean, roughly, I’ll say that I have… From what I’ve seen in media and in healthcare, you need data, at least in the hundreds of thousands of samples for machine learning to really move the needle. You can develop things with dozens of records, hundreds of records, a few thousand, and that’ll get you going on the development side, but if you really wanna get to maximum performance of the model, I think you do need large volumes starting in.
MA: I think for cost and risk prediction, I want to say that we found the minimum to be, if I’m not mistaken, around 600,000 person records, but that was the bare minimum, you do wanna go up from there. And then it depends what kind of machine learning algorithms you’re using. Traditional machine learning kind of plateaus.
MA: Like if you’re using gradient-boosted decision trees, which is very popular for tabular data, for structured data, at some point you plateau and it doesn’t matter how much more data you add, it’s not gonna give you that much more performance. But if you’re doing deep learning, then you start throwing a lot of data at the problem and then you need many million person records to move the needle.
SK: Interesting, and so from a general philosophy perspective, is it just get as much data as you can? Obviously within boundaries of, especially in the healthcare space, knowing what you can collect, and what you could or should and shouldn’t collect. But really you’re just trying to absorb as much data as possible?
MA: Yeah, and it’s… So we at Lumiata internally, we have around 120 million person records, combining claims data and your charts, so that’s probably a third of the US population, so it’s a pretty big dataset. But we have the luxury of having the healthcare connections that we have because of our investors and whatnot, not everybody has that luxury, so you have to get practical and creative.
MA: I’ve seen people have a level of success, some level of success with synthetic data. That’s becoming a big thing in machine learning now, like creating synthetic data for the problems you’re trying to solve. Now the question is, well, how do you create synthetic data? You’re not just gonna make up data and have things work. And so there’s a few things that I’ve seen, we haven’t tried any of this, but there’s a few things that I’ve read people do in papers and in articles.
MA: One is, given a real data set from a population, use GANs (generative adversarial networks) to generate more data that looks like that. So that’s one technique that I’ve seen people use. Another technique, which is a lot more elaborate, there’s another start-up, another Khosla Ventures startup, that has created expert systems that simulate what real-world interactions look like, and these simulations generate health records that are very similar to reality.
MA: So the expert systems simulate interactions between patient and doctor, and there’s a lot of medical literature baked into these systems, and then they run the simulations and then you end up with data that looks more or less like real data. And I’ve heard that that works really well as well.
LD: Super interesting.
MA: That’s super interesting. [chuckle]
LD: I could not have guessed that, personally, that that was a thing. But it makes sense as you say it, it’s just not something that I’ve ever…
MA: Because sometimes you don’t have the data, and there’s a lot of start-ups that just start cold. They start from nothing. So how do they get the data? Especially in healthcare, it’s incredibly expensive to buy data. You can buy data, but it costs several hundreds of thousands of dollars.
SK: So, as you were saying that, I thought of another question on the thread before, so I’m gonna ping-pong all over the place. This is the benefit of having a five-hour shot before the call.
SK: And then we’re gonna circle back to this. But I think it was really… As we talked about this ratio, this is sort of 1 data science to 4 to 5 eng, and of which are 3 to 2 on data eng to ML eng. It feels like we see so many people continue to have the upside down ratios on this. And of course, yes, right, this is the Ascend Podcast, of course, our plug for, “Hey, we can really help you get through the engineering leverage.” Yes, okay, cool. So we got that out of the way.
SK: Let’s assume this really painful, dark world, where just like Ascend just doesn’t exist and everybody has to go figure it out for themselves. Like it’s just gonna rain all the time and be cloudy, and people are just gonna be sad. That world. How do you go and find more data engineers? All these ratios, how do you do it? In this horrible world where Ascend doesn’t exist and you gotta go figure it out on your own? Like most folks are upside down on their ratios.
LD: ‘Cause they’re not there. It’s so hard to find the data engineer.
SK: And that’s the thing. How do you go get them?
MA: Yeah, well, like I was saying earlier, there’s a lot of data engineers that have the data part, but they don’t have the engineering part right. That makes it very hard. That’s why in some ways, we’ve taken the philosophy of like, let’s just hire very strong software engineers. Forget about the data and ML monikers. That’s the… Maybe those monikers will go away one day, to be honest, it’s an interesting specialization that I think is doing the industry more harm than good.
MA: But let’s just hire really, really strong software engineers that understand distributed systems, that have very good principles, that wanna solve data and machine learning problems. And that, you can find way better engineering manpower that way than if you specialize. If your job description just says “data engineer”, you’re gonna get a lot of noise unfortunately on the candidates.
MA: We try it all, we put a bunch of job descriptions for the same role because we want to… You’re gonna have to these days, because these things mean different things in different companies. Even in big companies, these titles mean different things, depending on the team you work on. Like we talked with a lady from Apple recently, and she shared her title was Machine Learning Engineer. She was more of a research engineer.
MA: But then we talked with another gentleman from Apple and he was more like what I think of a machine learning engineer being, but even in the same company, there’s variance of what these titles mean.
LD: It’s interesting that you say that too, because as Sean mentioned, we had Jesse Anderson on the podcast a couple episodes ago, who has spent an inordinate amount of time researching what makes up a good data team, and one of the things that he said was, to your point, it can’t just be a software engineer, it has to be a software engineer that loves data.
MA: Yeah, yeah.
LD: But the underlying point there is it has to be a software engineer.
MA: That’s right, that’s right, that’s right.
LD: Who loves data and wants to work with data products. And as crazy as it may seem to us, because we’re so steeped in the data industry, there are those software engineers out there who maybe don’t wanna work on data products.
MA: Oh yeah, there’s many. I have mutual friends who are like, “Dude, that’s just crazy. I don’t wanna…”
LD: They don’t wanna deal with it, but it has to start with that foundation of a software engineer, but who loves data and wants to work with data, and that’s where you get to it.
MA: Exactly. No, that’s exactly right. And the thing about data engineers, there’s a… The reason why there’s a lot of variance is because some people come from the software background, but some people come from a data analyst background. They were data analysts, they use BI products at some point, they learned SQL. But there’s a difference between that and writing a Spark job that scales well. So I think that’s the nature of things.
MA: Now, I think there are tools that allow these people that come more from the analyst background to do more data engineering things, and that’s fine. But the other thing, like for us is like we’re a small shop and I just like generalists. A couple of our best engineers that we have today are the ones that could be working on a security problem right now, but then tomorrow they’re training our model, but the day after they’re creating a REST API, but then the day after they’re working on the data pipeline.
MA: Those are I think the generalists. I call it now, full stack machine learning engineers that can do anything, right? And they’re hard to find, but if you can find those, those are super awesome to have, ’cause you can throw them at all sorts of problems.
SK: So that reminds me of something you touched on a little bit earlier too, and finding these unicorns. You mentioned asking about design patterns, and hopefully this doesn’t totally hose your future interviews, but what’s a great answer to that question? If you were coaching other CTOs of like, “Hey, this is my one question to really determine if they’re gonna be a great data engineer or not,” what’s an answer you look for?
MA: That’s a very good question, and I don’t expect a specific answer to that question, I expect a conversation. For instance, I just did a 30-minute phone screen with a gentleman earlier today, and I asked him about design patterns, and I said, “Well, let’s talk about design patterns. What are the design patterns… What are design patterns that you have found helpful in building software? Like let’s just talk about the design patterns.”
MA: And he goes, “Oh, design patterns,” I’m like, “Oh.” He sounds kind of overwhelmed, like, “Where is this gonna go?” And he’s like, “Well, no, I think that I’ve had to go through two iterations of learning design patterns.” He’s like, “I was doing object-oriented design and programming for a long time, and so I learned all the patterns there, and it was all about classes and how you structure these classes and so forth. Then I moved to a gateway where I was using Scala and it was all functional.”
MA: “Now, so I went from object-oriented to functional programming so I had to learn a whole other slew of patterns. They’re all functional patterns.” And we kind of geeked out a little bit talking about what that meant, and we talked about combining object-oriented with functional programming. It was a good discussion, clearly the guy had been around, and clearly the guy had done some thinking around well-designed software, functional and object-oriented, so he passed my test.
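[Editor’s illustration] The shift the candidate describes, from class-based patterns to functional ones, can be sketched in a few lines. This is only a minimal sketch and not anything shown in the interview or used at Lumiata: the names `CleaningStep`, `StripWhitespace`, `Lowercase`, `clean_oo`, `compose`, and `clean_fp` are hypothetical, and Python stands in for the Scala he mentions. The first half expresses a small data-cleaning pipeline as a Strategy-style class hierarchy; the second half expresses the same pipeline as plain functions combined by composition.

```python
from abc import ABC, abstractmethod
from functools import reduce
from typing import Callable, Iterable


# Object-oriented style: each cleaning rule is a class behind a shared interface.
class CleaningStep(ABC):
    @abstractmethod
    def apply(self, value: str) -> str:
        """Return the cleaned version of value."""


class StripWhitespace(CleaningStep):
    def apply(self, value: str) -> str:
        return value.strip()


class Lowercase(CleaningStep):
    def apply(self, value: str) -> str:
        return value.lower()


def clean_oo(value: str, steps: Iterable[CleaningStep]) -> str:
    # Run the value through each step object in order.
    for step in steps:
        value = step.apply(value)
    return value


# Functional style: the same rules are plain functions, combined by composition.
def compose(*funcs: Callable[[str], str]) -> Callable[[str], str]:
    # Build one function that applies funcs left to right.
    return lambda value: reduce(lambda acc, f: f(acc), funcs, value)


clean_fp = compose(str.strip, str.lower)

print(clean_oo("  Hello World ", [StripWhitespace(), Lowercase()]))
print(clean_fp("  Hello World "))
```

Both calls produce the same cleaned string. The difference is where the structure lives: in a class hierarchy in the first case, in function composition in the second.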
MA: I wasn’t looking for a specific answer, I was looking for a conversation. If you can reason around patterns one way or another and have an eloquent conversation about it, I think that, unfortunately, always puts you in the top 5-10 percent of the data engineering and machine learning engineering candidates out there.
SK: That’s really helpful, and I think it highlights that there is no perfect answer, or a uniform “one size fits all” answer, for all problems. And so…
MA: That’s right.
SK: To your comment of really wanting to find these incredibly high leverage generalists that are, frankly, A-players in every part they touch, you want those people with that flexibility to find that right pattern for the problem space that they’re in.
MA: Exactly, exactly. Another example is another guy from this… Earlier today. He doesn’t have a ton of real world machine learning experience, mainly from his Master’s, but he was at Microsoft for 10 years, so clearly he was doing something at Microsoft. And so I asked him the same question, “Let’s talk about patterns,” and he said, “Oh man, patterns. I thought that I could write good software, and then somebody recommended me this book, Design Patterns,” by Erich Gamma, Richard Helm and two other authors I can’t remember, which is a classic.
MA: “I read that, I’m like, ‘Oh my God, where have I been? I haven’t been writing software the right way.’” So he had a really good cool story to tell regarding design patterns. So he also passed my test. It’s like he knows the book, he’s read the book. It’s a little old school for some people, but it has a lot of things that still apply to today’s era. So yeah, it’s more of a conversation that I’m looking for, more than an answer.
SK: Cool. So I know we have a few more minutes. So there’s two questions that I really love to ask here as we start to wrap this up. First is, what are some of those really incredible, that you can share, obviously, “aha” moments you’ve seen at Lumiata with your customers, where the rubber meets the road and this really practical application? What are some of those?
MA: Yeah, there’s definitely a few, but there’s one really cool one that I’m excited about. So I mentioned earlier, beyond the models that are deployed to specific use cases, there’s the model builder product that we’ve been working on. And I think we have missed the mark a little bit. It tries to be like a no-code way to build models.
MA: When you look at the UI, it’s like it follows the same steps that a data scientist would follow, so it defeats the purpose of having a UI to build a model, because it’s supposed to be used by somebody that is not a data scientist. We realized, “Hey, we need to do a little bit of work on this to make it hit the spot.” And as we’ve been researching, like I said earlier, it turns out that actuaries have been in the business of predicting things for quite some time.
MA: They’ve been in the business of predicting things for decades. It happens to be that they’re probably the most advanced when it comes to statistics and math, in organizations like health insurance companies. The similarity between an actuary and a data scientist is vast. And then it happens to be that that world is changing. Like in the UK, for instance, now, I believe it’s 10 to 20% of the actuarial curriculum includes machine learning, 10% of the certification test includes machine learning.
MA: The US will eventually catch on and be there too, so it was an “aha” moment of realizing, “Oh man! Actuaries are the geeks that I’ve been looking for inside healthcare, beyond data scientists, and nobody has built a product for them.” I think that hopefully, me saying this in this podcast doesn’t now handicap our ability to build something unique and have somebody else go build it. I joke.
MA: I think that it was a bit of an “aha” moment, just realizing like, “Wow, here’s an opportunity to build something very cool for a customer or a persona that has been neglected for quite some time.”
SK: Awesome, that’s really cool. I have one more question, and Leslie, I don’t know if you have another one too. But I wrote this down, ’cause I definitely wanted to make sure we hit this. Which was, so we’re on this journey as both a technology industry, but also the healthcare industry more specifically is on this journey, of really leveraging data and applying AI and ML to drive far greater good and impact for society.
SK: How far are we on this journey? Are we there yet? Are we at the end? Are we at the start? How much of the theoretical potential benefits are we seeing? And what’s this gonna go look like in 10 years?
MA: Yeah, that’s a very good question. Let’s talk also about the negative repercussions that we’re already seeing. [chuckle] I think data and machine learning are still in their infancy. I think that we’re just getting the ball rolling. The problem with machine learning, is that it doesn’t… It’s not like you’re really learning that much with any given model. You’re optimizing for one objective function, so you’re learning enough to solve something very specific.
MA: But if you’re trying to mimic intelligence as a whole, you can see how limiting that is. However, things like reinforcement learning start to look like real intelligence, because you have the simulated worlds where the agents have a reward-and-penalty kinda way to learn. And that to me, looks more like a little baby learning what’s around in the world, right?
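[Editor’s illustration] The reward-and-penalty learning Miguel describes can be sketched with tabular Q-learning, one common reinforcement learning technique, in a tiny simulated world. This is only a minimal sketch under stated assumptions: the five-cell corridor, the hyperparameters, and the names `step`, `train`, and `greedy_policy` are all hypothetical and are not taken from the conversation or from any Lumiata system. The agent starts in the middle, gets a penalty of -1 at the left end and a reward of +1 at the right end, and learns from trial and error which way to walk.

```python
import random

N_STATES = 5          # cells 0..4; cell 0 gives a penalty, cell 4 gives a reward
LEFT, RIGHT = 0, 1
START = 2


def step(state, action):
    """Move one cell and return (next_state, reward, done)."""
    nxt = state - 1 if action == LEFT else state + 1
    if nxt == 0:
        return nxt, -1.0, True        # penalty end of the corridor
    if nxt == N_STATES - 1:
        return nxt, 1.0, True         # reward end of the corridor
    return nxt, 0.0, False


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values q[state][action] by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = START, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice((LEFT, RIGHT))   # explore
            else:
                # exploit the best known action, breaking ties at random
                action = max((LEFT, RIGHT),
                             key=lambda a: (q[state][a], rng.random()))
            nxt, reward, done = step(state, action)
            # Q-learning update: nudge the estimate toward the reward
            # plus the discounted value of the best next action
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def greedy_policy(q):
    """Preferred direction in each non-terminal cell (cells 1 to 3)."""
    return ["right" if q[s][RIGHT] > q[s][LEFT] else "left"
            for s in range(1, N_STATES - 1)]


q_table = train()
print(greedy_policy(q_table))
```

The agent is never told which end is good. It only sees rewards and penalties after acting, which is the sense in which this resembles learning by exploring. It also illustrates the limitation raised in the conversation: the learned table is specific to this one corridor and says nothing about any other problem.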
MA: That’s starting to look more like intelligence, but still the very beginning, where you have reinforcement learning systems beating somebody at games, but it’s a very specific problem. We yet need to figure out how we are gonna generalize learning. How do you create systems that can just learn many different things, and in many different domains? They’re not specific to one thing.
MA: I think that what we’re gonna see 10 years from now, is likely that. Systems that can learn a lot of different things, so you combine that with robotics and then you can have something that simulates more of a human expression of learning, if that makes sense. That said, we gotta be careful with what we build, because if you haven’t seen The Social Dilemma on Netflix, or if you have, you can see the huge negative effect that AI already has on our society.
MA: So I think people build systems in social media that, they didn’t do it maliciously, but nobody really thought about the potential ramifications that things would have. Now that we know that things can have negative ramifications, the ethical part of AI is gonna be very important, moving forward.
SK: Yeah, that sort of investing and taking the time, not just to figure out if we can do something, but whether or not we should.
MA: That’s right. I hear that Google is taking that philosophy very seriously, it’s not just, “Can we?” but, “Should we?” And that’s something we all collectively have to be aware of, and kind of operate with those boundaries.
LD: Well, I think that was a better topic to end it on than I would have, so I’m happy.
LD: Miguel, thank you so much. This was, again, instead of leaning into it, I find this space incredibly, incredibly fascinating. Because to your point, the options and the possibilities are somewhat endless. Again, just because you can, doesn’t mean you should, but some of the stuff that can be done is amazing.
LD: So this is super interesting, and I am certain we will have you on again, talking about this in more detail, and diving in on some of those threads that maybe we didn’t have a chance to pull today. We really appreciate you coming onboard. Thank you so much.
MA: Thank you for having me. It was fun. It’s always fun to talk about these things.
LD: With his background in some pretty diverse industries, Miguel’s perspective is always interesting, as is his insight, especially when it comes to team dynamics and skills. Now, if you’re interested in learning more about Lumiata, you can find them at lumiata.com. And as always, if there are any questions for us, you can find us at Ascend.io. Welcome to a new era of data engineering.