Transcript
Leslie Denson: Since breaking onto the scene as one of the first direct-to-consumer companies out there, Harry’s has continued to grow in both company size and data size. Today, our own Sheel Choksi sits down with Harry’s data analyst, William Knighting, to learn how Harry’s shaves time off their analysis pipelines with intelligent data orchestration in this episode of DataAware, a podcast about all things data engineering.
Sheel Choksi: Hi, I’m Sheel Choksi, a solution architect here at Ascend. Today, we have William Knighting from the Harry’s data analytics team chatting about Harry’s data platform. Hey, William.
William Knighting: Hey, Sheel.
SC: How’s it going?
WK: It’s going well. How are you?
SC: Good, good. So, well, tell me a little bit about yourself, William, what you do and how you got to this point.
WK: Yeah, so I’ve been in the analytics space for about eight years now. Most recently, I was at a start-up in Seattle called PicMonkey, in the photo-editing space. It’s a subscription business, really focused on user engagement and hit-level data, and on moving data out of our warehouse, into our visualization layer, and into the hands of the stakeholders. And you can imagine there are a lot of difficulties throughout that process. I’m at Harry’s now. I’ve been at Harry’s for a little more than a year, and it’s got a really good problem set. We have similar issues: we have a lot of data, and we’re trying to get it into the warehouse, analyze it, and get it to the point where it can be presented to stakeholders so they can make good, informed decisions. There are definitely some challenges we face along the way, but it’s something that keeps me coming to work every day.
SC: Awesome, awesome, glad to hear it. Now, you mentioned a couple of different things there in terms of warehouses and data from different sources. One of the patterns we’ve been noticing is that analyst teams and data science teams are developing a lot more technical expertise. So it sounds like you’re starting to have a real tech stack. What does that look like for your team? What are the different components involved?
WK: So right now, everything’s up in the AWS cloud. We use Redshift, primarily, with data stored in S3 buckets. A lot of that then gets pulled into Looker, which is how we visualize most of our data, and then it gets presented out. We definitely have a smattering of Excel sheets and Google Sheets being used for different reporting needs. And one of the new things we’ve added is the Ascend platform, for how we pipe data in, how we expose it to Looker, and how we can easily iterate through different projects and get data quickly from one place to another to support reporting and different aspects of the business. That’s our tech stack right now.
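For listeners who want to picture the S3-to-Redshift leg of that stack, here is a minimal sketch using Redshift’s COPY command, run from Python with psycopg2. The table, bucket, and IAM role names are illustrative assumptions, not Harry’s actual configuration.

import psycopg2

# Redshift loads files straight from S3 with COPY; this wraps that one
# statement in a short Python job. All names below are hypothetical.
COPY_SQL = """
COPY raw.events
FROM 's3://analytics-landing/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
FORMAT AS JSON 'auto';
"""

with psycopg2.connect("dbname=warehouse host=redshift.example.com user=etl") as conn:
    with conn.cursor() as cur:
        cur.execute(COPY_SQL)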
SC: Okay, cool, makes sense. And of course, we’re happy to have you on board as an Ascend customer. To dive into some details there, what does it really look like? Are you ingesting data? From what sources? Where are you sending it to? How are you powering Looker? What does that all look like?
WK: So like most analytics teams, we have a data engineering team that takes care of the bulk of the ingestion into the warehouse, making sure things get populated into our tables for most of our reporting, piped into Looker, and everything along those lines. One of the new challenges we’re facing here at Harry’s is that we have a host of other types of data now coming in that we need to get into the warehouse and analyze.
WK: Harry’s is a CPG company: we sell online, and we also sell through retailers. Getting that retail information into our warehouse to support our retail teams, so they have the data to do what they need to do, has opened up a host of new problems for us to solve. One of the ways we’ve used Ascend is to connect to the various APIs these retailers provide and get that data into S3 buckets. It doesn’t require anybody on our team to have especially deep technical expertise to do the data engineering work to ingest this data and expose it out. We use the Ascend platform to grab the data out of the S3 buckets once we put it there, run it through multiple transformations, and then expose it to Looker. So it’s really changed the way we iterate through projects quite a bit.
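As a rough illustration of the landing step William describes, pulling a report from a retailer API and dropping it into S3 for a read connector to pick up might look something like the Python sketch below. The endpoint, bucket, and key names are assumptions made for the example, not Harry’s actual setup.

import datetime

import boto3
import requests

API_URL = "https://retailer.example.com/api/v1/sales"  # hypothetical retailer endpoint
BUCKET = "analytics-retail-landing"                    # hypothetical landing bucket

def land_retail_report(api_token: str) -> str:
    """Fetch one day's retail report and write it to S3 as raw JSON."""
    resp = requests.get(API_URL, headers={"Authorization": f"Bearer {api_token}"})
    resp.raise_for_status()

    key = f"retail/sales/{datetime.date.today():%Y/%m/%d}/report.json"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=resp.content)
    return key  # a read connector watching this prefix takes it from here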
SC: Okay, interesting. “Changed the way we iterate” is a little ambiguous for me here. Are we talking about faster iteration, more iterations, easier iterations? What are we talking about?
WK: Yeah, I think all of the above, and definitely faster iterations. Before, the way we would typically get something into our production database was through a PR process: there’s a dev environment, there’s a rebuild. So there’s a bit of a time delay between when we get the data and when fresh data shows up in the warehouse. With Ascend, the analytics team has complete control over the pipeline, so we can iterate quickly, whether we want to change the way the data looks or change the source of the data. And then there’s the ease with which we can iterate: the bar to entry for using a platform like Ascend to bring in our data is much lower, given the web UI, the other tools, and the support as well. It’s really allowed us to iterate quickly, easily, and effectively.
SC: Awesome, I’ll take it, sounds like a trifecta. [chuckle] So with all these different tools in the tech stack helping you get data into the warehouse and from the warehouse into Looker, each one of these tools often helps derive value by transforming the data further. For example, Ascend helps transform it and get it laid out in the warehouse; the warehouse lets you run bigger batch queries to get it ready for reporting; and Looker, with LookML, allows further transformation to get it ready for analysts to explore. How do you think about all these different tools in a big data pipeline, if you will, and where you might want to use a certain tool for a certain thing, say doing something in Looker versus in Redshift?
WK: Yeah, those are all great tools to have. For us here at Harry’s, it’s really been situational which tool we reach for, depending on the problem we’re trying to solve. Say we have the data inside Redshift at that point, and we need to make a series of transforms to it, potentially building a fact table out of a series of tables. That becomes a process we have to go through inside Redshift, which has a PR process. If we want to iterate quickly instead, maybe we lean more on LookML and Looker to do some of those transformations, to get the data looking the way it needs to look for whatever report we’re trying to put out at that point.
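For a concrete picture of the fact-table case, here is a minimal sketch of a CREATE TABLE AS statement run against Redshift from Python. The schema, table, and column names are assumptions for illustration, not Harry’s actual model.

import psycopg2

# Build a fact table by joining a series of source tables; the join keys and
# measures here are hypothetical.
FACT_SQL = """
CREATE TABLE analytics.fact_orders AS
SELECT o.order_id,
       o.ordered_at,
       c.customer_segment,
       p.product_line,
       o.quantity * p.unit_price AS revenue
FROM raw.orders o
JOIN raw.customers c ON c.customer_id = o.customer_id
JOIN raw.products  p ON p.product_id  = o.product_id;
"""

with psycopg2.connect("dbname=warehouse host=redshift.example.com user=etl") as conn:
    with conn.cursor() as cur:
        cur.execute(FACT_SQL)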
WK: On the other side, getting data into the warehouse has probably been the classic problem any data organization has. We have a few homegrown tools we use internally that let us get CSVs into the warehouse. The files have to meet a certain format, there are other criteria, and sometimes those builds don’t work out. But we have different ways to get the data in and clean it. It’s really become situational: what tools are at our disposal, and how quickly we can use them to get whatever project we need over the finish line, hopefully without taking on too much tech debt, and then iterate back and keep going.
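The format checks William mentions in those homegrown CSV loaders might look something like this small sketch, which rejects a file whose header doesn’t match the expected layout before it ever reaches the warehouse. The expected columns are made up for the example.

import csv

EXPECTED_HEADER = ["order_id", "sku", "quantity", "ordered_at"]  # assumed contract

def validate_csv(path: str) -> None:
    """Raise ValueError if the file's header doesn't match the expected format."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header {header}; the build would fail on this file")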
SC: Okay, makes sense. It sounds like a very practical approach: the right tool for the right job, with circumstance factored into the decision. And what I’m hoping, of course, is that having Ascend in your tool belt has made it quite a bit easier to get more data sources and more formats through that warehouse step as well.
WK: Yeah, it definitely has. It’s opened up a whole new plethora of options for getting data in and, for the most part, exposing it in Looker. All the stakeholders want a dashboard: build a Looker dashboard, show it to me in a dashboard. And our only pipeline for getting data in and exposing it in Looker used to be our main ETL pipeline. Having Ascend, being able to pull data from S3 buckets, and potentially wiring in things like Google Sheets has been a great value add as well. It’s given us that secondary pipeline where the analytics team can independently get data in and expose it in Looker outside of our normal ETL process, and that’s really the thing Ascend has unblocked for us.
SC: Awesome. Well, I think that’s a good segue, then, to talk about what it actually feels like to build in Ascend and iterate on these components. If you could touch on your experience with the visual interface, that’d be helpful.
WK: Yeah, the visual interface has been extremely helpful. One of the things our analytics team really relies on is having some kind of organization to whatever DAG flow we have, and one of the beautiful things about Ascend is that we essentially have the pipeline visualized in front of us: all the transformations, all the nodes are there. We can use the groupings to take a few nodes, say some of the read connectors coming from our retail sources, and group them together. That way, somebody coming in who just wants to understand what this data flow is doing can look and see: okay, this is our retail data, these are the retail transforms, this is the output of the data. It’s really easy to visually digest what’s going on with the Ascend UI. And as far as using the UI to set up read connectors, to read from S3 or from Google Sheets, with the templates and everything, it goes back to that really low bar to entry. I don’t need to write Python, and I don’t need proprietary knowledge of how some API is set up. I can just get my S3 credentials, put them into Ascend, start pulling data, and then start making my transforms in the platform.
SC: Cool, yeah, makes sense. And I’ve heard that one quite frequently, especially being able to view the output records at every step of the way. So one of the things I’m curious about: when you’re actually building this pipeline, and you’re one transform in and you think, okay, that one is done, on to the next one, are you building these pipelines iteratively and interactively, or do you have a grand vision in mind and build one step at a time? How do you think about it? And what does that experience feel like?
WK: So one nice thing about the Ascend platform is that once I go into my SQL transform, I can go down and look at my partitions and at high-level facts about the data set: the span of the data, and if there are timestamps, what the min and max timestamps are, just to understand the high-level shape of the data. Personally, I’d rather use the SQL editor inside of Ascend, and I like to do that, first off, just for looking at the data: doing a quick select star, seeing what the data looks like in a table, and then starting to iterate when I’m thinking about what that next transform is gonna look like.
WK: So now I’ve made my first transform, I have the data looking the way I want it to look, and I can see visually in Ascend how it looks in a table. Now I start thinking: okay, how do I want to build that next transform? Do I want to join it to an additional source? Whatever I need to do, I can iteratively see what that looks like inside the Ascend editor, and once I like it, I can quickly make a transform out of it, and then I’m off to the races, on to the next transform, just iterating through the pipeline. I’ve never actually made a pipeline straight through before. It’s typically that iterative process: okay, this transform looks the way I want it to look now, let’s go build the next one, and so on to completion.
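The workflow William walks through, a quick preview, a sanity check on the timestamps, and then a query that graduates into the next transform, might look roughly like the queries below. The table and column names are hypothetical; in Ascend these would be run from the built-in SQL editor against an upstream component.

# Queries of the kind you might run while iterating; all names are made up.
PREVIEW_SQL = "SELECT * FROM retail_sales_cleaned LIMIT 100;"  # quick select star

# High-level facts: row count and the min/max timestamps of the data set.
PROFILE_SQL = """
SELECT COUNT(*)        AS row_count,
       MIN(ordered_at) AS earliest,
       MAX(ordered_at) AS latest
FROM retail_sales_cleaned;
"""

# Once the preview looks right, the working query becomes the next transform.
NEXT_TRANSFORM_SQL = """
SELECT s.*, r.retailer_name
FROM retail_sales_cleaned s
JOIN retailer_dim r ON r.retailer_id = s.retailer_id;
"""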
SC: Okay, awesome. Yeah, and for folks listening in, what that SQL editor gives you is this: as you’re building components in Ascend, you can write SQL queries against any interstitial state of the pipeline, check the data, and then, in fact, form the next transform off of it. So William, yeah, it sounds like you’re using that, which is awesome, as well as the partition profiles, which I didn’t immediately think about in those terms: are these timestamps in the range I thought they would be in? Is this data looking like the quality data you’re expecting?
WK: Yeah.
SC: So it’s definitely one thing to build these pipelines, but it’s another to maintain them and make sure they’re running correctly, that you’ve got fresh data in your Redshift, that your Looker dashboards are up to date. What does that part of it look like in Ascend?
WK: When I think about that part, I think about what it looked like before: a lot of our ETL jobs were Python running on a Lambda, and whether they would successfully run and pull in the data depended on other resources. Now we’re able to use Ascend to automate a lot of that. As soon as we get the data into S3, we can automate the freshness of the data, pulling every three hours, or every n hours, via Ascend into the platform. From that point, once we have our read connector set up and we’ve read our data in, we can start making changes in the nodes and transforms. If we need to change any column names, the impact on any downstream columns all carries through via Ascend; it all cascades. Changes at the beginning of the DAG cascade down and affect everything throughout the DAG, which removes the need for us to go in and maintain things by hand. If we’re gonna make one column change, if we’re gonna change a data type, that’s all taken care of by the Ascend platform.
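For contrast, here is a bare-bones version of the pre-Ascend pattern William mentions: a Python job on AWS Lambda, triggered on a schedule, that pulls from an upstream source and fails whenever that source does. The URL, bucket, and key are assumptions for the example.

import json

import boto3
import requests

def handler(event, context):
    """Runs every n hours via an EventBridge schedule rule (hypothetical setup)."""
    resp = requests.get("https://retailer.example.com/api/v1/sales")  # hypothetical
    resp.raise_for_status()  # if the upstream is down, the whole run fails

    boto3.client("s3").put_object(
        Bucket="analytics-retail-landing",  # hypothetical bucket
        Key="retail/latest.json",
        Body=json.dumps(resp.json()),
    )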
SC: Yeah, awesome. That is one of my favorite features. With a more naive pipeline orchestrator, you make one change over here and then you have to decide: okay, do I re-run my pipeline? How much historical data do I need to re-run it for? And there’s that feeling you get every time in Ascend when you just update it, and all of a sudden everything starts running, and you’re like, “Okay, sweet. My data is gonna be there soon.”
WK: Saves me some time, that’s for sure.
SC: Yeah, very cool. And one of the other things I know we’ve talked about is how Ascend has helped democratize more of the technology as well: beyond just SQL transforms, giving people access to data science tools and making programming more accessible. Can you shed some more detail on that?
WK: Yeah, so here at Harry’s we have a data science team, and they’ve definitely been able to come in and use some of the PySpark features Ascend offers to parallelize some of the scripts they run, so they’re less constrained by the resources on their local machines. That’s really opened up some doors for us as well. When I think about democratizing, it’s about opening up the number of people who can come in, get data, and use the platform. If it’s a Google Sheet, you don’t even necessarily need to worry about the SQL part of the transform: you can just go in, hook up your Google Sheet, and pipe it up to Looker, and that suffices for a good amount of cases where we just need raw data exposed in Looker. It’s really opened up a lot of different avenues that a lot of different people on the team are using the platform for.
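A minimal sketch of the PySpark pattern William describes: a script that used to run on a laptop, rewritten so the work is partitioned across Spark executors instead of being limited by local memory. The paths and column names are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("retail-scoring").getOrCreate()

# Reads are distributed across partitions rather than loaded into local RAM.
sales = spark.read.parquet("s3://analytics-retail-landing/retail/sales/")  # hypothetical path

# The aggregation runs in parallel across executors.
scores = (
    sales.groupBy("customer_id")
    .agg(F.sum("revenue").alias("lifetime_revenue"),
         F.countDistinct("order_id").alias("order_count"))
)

scores.write.mode("overwrite").parquet("s3://analytics-retail-landing/scores/")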
SC: Yeah, so it sounds like two aspects there. One is more people being able to do things even if they don’t necessarily know SQL. And second, at the other end of the spectrum, the folks who are far beyond SQL and have much more technical expertise, like the data scientists, have been able to productionize some of these Python scripts and make them more robust and parallel by taking advantage of PySpark just being available in there. Does that sound fair?
WK: Yeah. Yes, 100%.
SC: That’s awesome, cool. Well, thanks for sharing a deep dive into the Harry’s analytics team and its technology stack. I really appreciate your time today, and as usual, if anybody has any questions, please let us know. And thanks for joining us today.
WK: Thanks, Sheel.
LD: It is always fantastic to hear from someone who’s building, or otherwise working with, data pipelines on a daily basis. You get knowledge and insight there that you just can’t find anywhere else, and it’s really fantastic to hear William talk about it. We hope you found how Harry’s and William are handling things as educational as we did. As always, if you have any questions, you can find us at Ascend.io. Welcome to a new era of data engineering.