Queryable Dataflows: Combining the Interactivity of Warehouses with the Scale of Pipelines

Ascend.io

data-eng@ascend.io

Last month, we unveiled the Ascend Autonomous Dataflow Service, our solution for painlessly building and managing pipelines, which are described declaratively as Dataflows and built from SQL and PySpark transforms. With Ascend, data engineering teams have been dramatically reducing the code and time it takes to implement production pipelines by focusing on what needs to be done with the data rather than on how to orchestrate the underlying jobs and systems that produce the resulting datasets.

One of the challenges when building pipelines is that what data to produce is not always obvious at project inception. It can take a significant amount of data discovery, profiling, exploration, and experimentation to distill the right value from the large amounts of data available at most organizations. This usually requires a separate interactive tool that can quickly query the data (such as notebooks connected to high-performance compute infrastructure, or a downstream data warehouse) so that you can get an initial understanding of where to start. Once the exploration is complete, you have to capture the results and translate them back into the pipeline development tooling. Reverse-engineering, reimplementing, and validating pipelines against the original analytics experiments takes an inordinate amount of time and disrupts the overall development process.

For us, that process is error-prone and a time sink. What if, instead, the pipeline-building world and the interactive-query world coexisted? Any transform stage in a pipeline could be queried like a table in a database, and a query could be instantly converted to a transform once its results look correct. For the data engineers we work with, this was a pretty powerful proposition, so we developed the new Queryable Dataflow capability (now available in technical preview).

Now in Ascend, all stages of all Dataflows are queryable without switching tools or disrupting your development process. As part of this capability, we’ve built an interactive query editor that lets you work with Connectors and Transforms in Dataflows, as well as Data Feeds broadcast from other Data Services, as though they were read-only tables in a SQL database. The embedded editor auto-completes SQL keywords as well as table and column names, so you can write queries faster. Queries are submitted asynchronously: you can wait for a query to finish, start a new one, or navigate back through the list of previous queries and review past results without rerunning them. Teams collaborating on the same Dataflow can even share their queries (and the resulting datasets) with each other.

To demonstrate the power of Queryable Dataflows, let’s look at a common use case: source data profiling. Let’s say you’re looking to build a pipeline to predict demand for e-bikes, based on San Francisco bike share rental data from 2013 to 2018. To even begin building, you need to get an understanding of the data you’ll be working with. Instead of spinning up a separate analytics tool or pushing sample data into a warehouse, you can start profiling the data directly using Ascend. Maybe you’d like to start by seeing how this dataset has evolved on a month-by-month basis and the trends in the data itself, such as:

What were the shortest and longest rides (in minutes) each month?
How many total bike-hours were clocked each month?
How many distinct bike share locations did trips originate from each month?

You can submit each of these questions as a query and get near-immediate results.
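As a rough illustration, the three questions above can be answered with a single monthly aggregation. The sketch below is not taken from Ascend: it runs the SQL against an in-memory SQLite database so that it is self-contained, and the table name `trips` and the columns `start_time`, `duration_sec`, and `start_station_id` are assumptions about how the bike share data might be laid out. In a Spark SQL environment, the month bucket would more likely be written as `date_format(start_time, 'yyyy-MM')` instead of SQLite's `strftime`.

```python
import sqlite3

# Hypothetical schema and rows; the real bike share dataset may differ.
SAMPLE_TRIPS = [
    ("2013-08-29 09:00:00", 600, 1),    # 10-minute ride from station 1
    ("2013-08-30 10:00:00", 1800, 2),   # 30-minute ride from station 2
    ("2013-09-01 08:00:00", 300, 1),    # 5-minute ride from station 1
    ("2013-09-02 08:30:00", 3600, 1),   # 60-minute ride from station 1
    ("2013-09-03 18:00:00", 3300, 3),   # 55-minute ride from station 3
]

# One row per month: shortest/longest ride, total bike-hours, distinct origins.
MONTHLY_PROFILE_SQL = """
SELECT strftime('%Y-%m', start_time)    AS month,
       MIN(duration_sec) / 60.0         AS shortest_min,
       MAX(duration_sec) / 60.0         AS longest_min,
       SUM(duration_sec) / 3600.0       AS bike_hours,
       COUNT(DISTINCT start_station_id) AS start_stations
FROM trips
GROUP BY month
ORDER BY month
"""


def build_sample_db():
    """Create an in-memory database holding the assumed `trips` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trips ("
        "start_time TEXT, duration_sec INTEGER, start_station_id INTEGER)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", SAMPLE_TRIPS)
    return conn


def profile_by_month(conn):
    """Run the monthly profiling query and return its rows as tuples."""
    return conn.execute(MONTHLY_PROFILE_SQL).fetchall()


if __name__ == "__main__":
    for row in profile_by_month(build_sample_db()):
        print(row)
```

Folding all three questions into one query keeps the monthly grouping in a single place; they could equally be submitted as three separate queries.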

At this point, you have enough information to start building. A key benefit is that any query you’ve already run can be instantly productionized as a Transform within your Dataflow. Because queries and Transforms share the same Dataflow Control Plane and underlying Spark infrastructure, you don’t need to recode the queries, and no data needs to be reprocessed once a query is converted to a Transform: the result is immediately materialized and ready to be built on or shared out. You can also keep running interactive queries against the data as you build, which lets you confirm results faster and develop the pipeline more accurately before anything is sent to downstream data consumers.
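The idea of promoting an exploratory query to a pipeline stage without rewriting it can be sketched by analogy in plain SQL. This is not Ascend's API, and the article does not show how the conversion is performed in the product. The helper `promote_to_stage`, the table `trips`, and the stage name `trips_per_station` are all hypothetical. Note one difference from the behavior described above: `CREATE TABLE ... AS` re-executes the query here, whereas the article states that Ascend reuses the already-computed result.

```python
import sqlite3

# The exact text of a query written during exploration (hypothetical schema).
EXPLORATORY_SQL = """
SELECT start_station_id, COUNT(*) AS trip_count
FROM trips
GROUP BY start_station_id
ORDER BY start_station_id
"""


def make_sample_db():
    """In-memory database with an assumed `trips` table of five rides."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (trip_id INTEGER, start_station_id INTEGER)")
    conn.executemany(
        "INSERT INTO trips VALUES (?, ?)",
        [(1, 1), (2, 2), (3, 1), (4, 1), (5, 3)],
    )
    return conn


def promote_to_stage(conn, stage_name, select_sql):
    """Materialize an unchanged SELECT as a named table (a stand-in for a Transform)."""
    # The name is interpolated into DDL, so accept only simple identifiers.
    if not stage_name.isidentifier():
        raise ValueError(f"unsafe stage name: {stage_name!r}")
    conn.execute(f"CREATE TABLE {stage_name} AS {select_sql}")


def busiest_station(conn, stage_name):
    """A downstream step that builds on the materialized stage."""
    if not stage_name.isidentifier():
        raise ValueError(f"unsafe stage name: {stage_name!r}")
    return conn.execute(
        f"SELECT start_station_id, trip_count FROM {stage_name} "
        "ORDER BY trip_count DESC, start_station_id LIMIT 1"
    ).fetchone()


if __name__ == "__main__":
    conn = make_sample_db()
    promote_to_stage(conn, "trips_per_station", EXPLORATORY_SQL)
    print(busiest_station(conn, "trips_per_station"))
```

The point of the sketch is only that the query text explored interactively and the definition of the stage are the same string, so nothing has to be reimplemented between exploration and pipeline development.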

With Queryable Dataflows, you no longer need to choose between the interactivity of data warehouses and analytics tools and the scale and flexibility of data pipelines. You can rapidly iterate between data exploration and pipeline development without impacting the integrity of your production environments or disrupting your end-to-end development process. One-off requests and inquiries also become easier to handle, because you can answer them from the Service you’re already building against. Additionally, your downstream analysts and data scientists can connect their preferred BI tools or notebooks directly to any Transform for fast, on-demand access, without having to wait for the resulting data to load into a warehouse environment.

Try it out. Your future self will thank you :)