By Ken Kubiak, Product Management at Ascend
This week, the team at Ascend is launching our Autonomous Dataflow Service, which enables data engineers to build, scale, and operate continuously optimized, Apache Spark-based pipelines.
The Ascend product is the culmination of more than three years of research and development. Along the way, we have learned from our early adopters spanning a range of industries including media, digital consumer, health care, and logistics. Through these partnerships we have discovered considerable consensus on the challenges facing Big Data teams and initiatives; challenges that are not adequately addressed by the array of tools that have come to market in recent years. As many of you know all too well, most data engineering teams must piece together solutions from several vendors or open source projects, keep current on the latest announcements of cloud service providers, and track advancements and vulnerabilities on multiple fronts.
Instead, we believe that you should be able to focus on the data, not on configuring multiple cloud environments, spinning up Spark clusters, diagnosing failed jobs, removing obsolete data to reduce storage costs, or securing sensitive information. With Ascend, you define Data Services that communicate with each other and with the rest of the enterprise via Data Feeds. Data Services are built from Dataflows, which define the connections to external data locations and the series of transforms required to clean, convert, join, and aggregate data into Data Feeds.
From prototype to production, data engineers rely on Ascend to manage and maintain the intricacies of data pipelines, so they can focus on the data. Let’s take a look at some of the unique features that make this possible.
There are many advantages to being able to operate across multiple cloud providers. From its inception, Ascend has been designed to be cloud-agnostic and portable. A majority of our customers either run Ascend on more than one cloud provider or have migrated their Ascend Data Services from one cloud provider to another. Migration is a simple matter of exporting the Dataflow definitions from one environment and importing them into the other; Ascend automatically brings all Data Feeds in the new environment up to date. This means you can enjoy the flexibility of choosing where to run workloads based on the cost of compute resources, or proximity to your other data environments.
Concealing the variances between cloud providers under a common API gives Ascend the flexibility to adapt our platform to the underlying infrastructure. An example of this is the object storage technology we deploy: MinIO. The MinIO gateway presents an Amazon S3-compatible API on top of Google Cloud or Azure. With Ascend, your data is always accessible via an S3-compatible “virtual bucket”, even when Ascend is not running on AWS. This approach also enables a clear path to on-premises deployments, which will utilize NAS drives for storage while maintaining the same virtual bucket API as a cloud deployment.
With Ascend, you will never need to learn about Amazon EKS vs Google Kubernetes Engine, or how to spin up a Spark service on each cloud platform; we handle that for you. Most Ascend users are not even aware of which cloud provider the environment is running on. With Ascend, you have the freedom to choose your target cloud platform based on economic or architectural concerns, and the entire environment can later be ported between cloud providers without retraining users, rewriting code, or adapting APIs.
The Ascend platform is built around the Dataflow model of computation. A Dataflow is a directed acyclic graph (DAG) in which every node contains a transform that is evaluated on its inputs. Transforms can be arbitrarily complex, but they have no side-effects and only depend on their declared inputs. The leaf (and root) nodes of this DAG are connectors, which are the inputs (and outputs) of the Dataflow. For every node of the Dataflow, we can derive the expression it computes by labeling the connectors and flattening the sub-DAG.
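Deriving a node's expression by flattening its sub-DAG can be sketched in a few lines of Python. The DAG encoding and transform names below are purely illustrative, not Ascend's internal representation:

```python
# Illustrative sketch: derive the expression a Dataflow node computes by
# recursively flattening its sub-DAG. Connectors are leaf nodes; every
# other node applies a named, side-effect-free transform to its inputs.

def flatten(dag, node):
    """Return a string expression for `node` by expanding its inputs."""
    transform, inputs = dag[node]
    if not inputs:  # a connector: its label is the whole expression
        return transform
    args = ", ".join(flatten(dag, i) for i in inputs)
    return f"{transform}({args})"

# A tiny Dataflow: two read connectors feeding a join, then an aggregation.
dag = {
    "orders":  ("read_orders", []),
    "users":   ("read_users", []),
    "joined":  ("join_on_user_id", ["orders", "users"]),
    "summary": ("sum_by_region", ["joined"]),
}

print(flatten(dag, "summary"))
# sum_by_region(join_on_user_id(read_orders, read_users))
```

Because transforms are pure functions of their declared inputs, this flattened expression fully determines the value a node computes, which is what makes the de-duplication and caching described later possible.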
To create a new Dataflow in Ascend, you construct the DAG and then define the necessary connectors and transforms. Connectors can specify the location, connection method, and credentials for external data, or reference another Dataflow using data feeds. Transforms are specified in the user’s choice of programming language. The initial release of Ascend supports SQL and PySpark, with more options in the works. Multiple programming languages can be combined to build the same Dataflow.
Transforms also handle partitioning for their inputs and outputs. For SQL transforms, Ascend automatically derives the partitioning pattern from the code. For other languages, you choose from a number of common patterns, including mapping, full reduction, partial reduction, and union.
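The common patterns can be illustrated on plain Python lists standing in for partitions. This is a simplified sketch of the pattern semantics, not Ascend's API:

```python
# Illustrative partition-pattern sketch: each pattern describes how a
# transform's output partitions relate to its input partitions.

def mapping(partitions, fn):
    """One output partition per input partition."""
    return [fn(p) for p in partitions]

def full_reduction(partitions, fn):
    """All input partitions collapse into a single output partition."""
    return [fn(partitions)]

def partial_reduction(partitions, key, fn):
    """Input partitions are grouped by key; one output per group."""
    groups = {}
    for p in partitions:
        groups.setdefault(key(p), []).append(p)
    return [fn(g) for g in groups.values()]

def union(*partition_lists):
    """Output partitions are the concatenation of all inputs."""
    return [p for ps in partition_lists for p in ps]

# Example: daily (date, count) partitions.
days = [("2019-07-01", 3), ("2019-07-02", 5), ("2019-08-01", 2)]
doubled = mapping(days, lambda p: (p[0], p[1] * 2))
total   = full_reduction(days, lambda ps: sum(v for _, v in ps))
monthly = partial_reduction(days, key=lambda p: p[0][:7],
                            fn=lambda ps: sum(v for _, v in ps))
```

The pattern a transform declares is exactly what lets the Control Plane reason about which output partitions depend on which input partitions.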
Ascend infers the output schema of each transform from the code defined in either SQL or PySpark. Our early users have found this feature very helpful when writing downstream transforms. We’ve also observed that it encourages an incremental approach to building queries, much like how you usually work in a SQL console or Python notebook.
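To give a flavor of what schema inference does, here is a toy sketch that extracts output column names from a flat `SELECT` list with a regular expression. Real inference operates on the full SQL or PySpark semantics; this simplification only handles simple select lists and is not how Ascend implements it:

```python
import re

# Toy sketch of output-schema inference: pull output column names from a
# simple `SELECT ... FROM ...` statement. An aliased item contributes its
# alias; otherwise the last dotted component of the item is used.

def output_columns(sql):
    select_list = re.search(r"select\s+(.*?)\s+from\s", sql,
                            re.IGNORECASE | re.DOTALL).group(1)
    cols = []
    for item in select_list.split(","):
        item = item.strip()
        alias = re.search(r"\s+as\s+(\w+)$", item, re.IGNORECASE)
        cols.append(alias.group(1) if alias else item.split(".")[-1])
    return cols

# Hypothetical transform over equally hypothetical `orders`/`users` tables:
sql = ("SELECT u.region, COUNT(*) AS order_count "
       "FROM orders o JOIN users u ON o.user_id = u.user_id "
       "GROUP BY u.region")
print(output_columns(sql))
# ['region', 'order_count']
```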
The Ascend web app includes a graphical editor to design your Dataflow. Soon, you will also have the choice of defining the Dataflow “blueprint” as declarative code, analogous to the way services are defined in Kubernetes. You can switch back and forth between graphical and code descriptions and Ascend will keep everything in sync.
Elastic Data Fabric
Another choice you have to make when architecting a Big Data platform is which compute engine to use. Compute doesn't stand alone: it needs to read and write data somewhere, so it requires co-located storage. It needs to be robust, so you don't have to worry about failed jobs, which are inevitable in any large-scale data processing task. It must balance performance and cost, intelligently scaling the compute cluster and automatically choosing appropriate instance types. And just when you think you've found the ideal compute engine, the cloud providers or the open-source community reveal a new technology du jour.
To address all of these issues, we developed an elastic data fabric. “Elastic” because it automatically grows and shrinks its footprint based on the workload, and “data fabric” because it coordinates compute and persistent storage. The current elastic data fabric utilizes a Spark cluster deployed on Kubernetes, combined with the cloud provider’s persistent object store. Ascend manages the specifics under the hood so you don’t need to worry about managing Kubernetes deployments and Spark clusters. The compute cluster is auto-scaled and draws from both permanent (on-demand) and preemptible (spot) instances to optimize performance versus cost.
We’ve also made this data fabric serverless, centered around the functions being evaluated instead of individual jobs. With the Dataflow model, the transforms you code are functions, so every computation involves a functional expression — the application of a function to specific input values. The elastic data fabric accepts requests to evaluate an expression, and derives from that expression a token, which is returned. Multiple requests with the same expression return the same token and are, therefore, automatically de-duplicated. At any later time, you can request the value of the expression using this token. You never need to track Spark jobs, because the data fabric intelligently diagnoses failed and timed out jobs and re-executes any that do not require human intervention. All errors reported back to you are actionable.
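The request/token/de-duplication flow can be sketched with a content-addressed cache. The hashing scheme, class, and method names here are illustrative assumptions, not Ascend's internals:

```python
import hashlib

# Sketch of token-based, de-duplicated evaluation: a request to evaluate
# an expression returns a token derived from the expression text itself,
# so identical requests map to the same token and the same cached result.

class DataFabric:
    def __init__(self):
        self._results = {}  # token -> computed value

    def submit(self, expression, evaluate):
        """Request evaluation of `expression`; return its token."""
        token = hashlib.sha256(expression.encode()).hexdigest()[:16]
        if token not in self._results:      # de-duplication: compute once
            self._results[token] = evaluate(expression)
        return token

    def value(self, token):
        """Retrieve the value of a previously submitted expression."""
        return self._results[token]

fabric = DataFabric()
t1 = fabric.submit("sum(read_orders)", lambda e: 42)
t2 = fabric.submit("sum(read_orders)", lambda e: 42)  # same expression,
                                                      # same token, no recompute
```

Because the token is a pure function of the expression, any caller holding the token can fetch the result later without ever knowing which Spark job produced it.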
Finally, the entire elastic data fabric is encrypted, both at rest and in transit, with access controlled at the level of individual Dataflows. Teams can control access to sensitive data and decide which derived Data Feeds to share with other teams.
At Ascend, we are dedicated to staying at the forefront of distributed computing technologies. We’ve deliberately designed the data fabric interface so that, in the future, you will be able to swap in alternative implementations, such as to support interactive queries or real-time streaming. Stay tuned for more developments on this front.
Dataflow Control Plane
Now that we have a multi-cloud, multi-language, elastic data fabric, we need a way to effectively control all that power, and to observe and confirm that it is operating at peak performance given specific cost constraints. There’s been a recent trend toward declarative specifications for cloud computing: declare the state you want to end up in, and let the system handle getting it there. An Ascend Dataflow is essentially a blueprint that describes what data you want; the Dataflow Control Plane decides how to make it so.
Track the Data, not the Tasks
Earlier attempts at orchestrating Big Data computation have centered around executing scripts at prescribed times (e.g., via `crontab`), or defining the dependencies between tasks as a DAG and kicking off runs of the entire DAG at regular intervals (e.g., Airflow). While the dependency information in an Airflow DAG is a step toward a declarative specification, Airflow tasks have side-effects and can read data from anywhere, not just from their upstreams. There is nothing in Airflow that verifies the correctness of the dependency graph, which can lead to subtle errors. In contrast, Ascend Dataflow transforms are guaranteed to be side-effect free and depend only on their upstreams. We already described above how this functional representation enables a serverless, elastic data fabric, but it also allows the Dataflow Control Plane to optimize execution based on the Dataflow semantics.
This combination of an elastic data fabric and the Dataflow Control Plane is extremely robust: The Control Plane references the Dataflow DAG to schedule computation on the data fabric. Each scheduling iteration determines which transforms are ready to be computed and generates requests to the data fabric for those computations, but does not wait for the results or attempt to track running jobs. Instead, it just looks for more work to schedule. At any time, the value of each expression in the DAG is either computed or pending computation; data is never inconsistent.
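A single stateless scheduling iteration can be sketched as a pure function of the DAG and the set of already-computed nodes. The encoding is illustrative:

```python
# Sketch of one Control Plane scheduling iteration: request every node
# whose inputs have all been computed, without waiting on results or
# tracking running jobs. Repeated iterations drive the DAG to completion.

def schedule_iteration(dag, computed):
    """Return the set of nodes ready to be computed."""
    return {
        node for node, inputs in dag.items()
        if node not in computed and all(i in computed for i in inputs)
    }

dag = {
    "orders": [], "users": [],
    "joined": ["orders", "users"],
    "summary": ["joined"],
}

ready = schedule_iteration(dag, computed={"orders", "users"})
print(ready)
# {'joined'}
```

Because each iteration only inspects current state, a crashed or restarted scheduler loses nothing: the next iteration recomputes the ready set from scratch.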
Data Partitioning and Incremental Recomputation
The Dataflow Control Plane also understands how data partitioning is affected by each operation, so it can leverage parallelism to the greatest degree possible.
Every value may be partitioned, and every transform defines the partitioning pattern for its output. Partitions are hierarchical, which allows for efficient handling of data with millions of partitions. At any time, you can drill down into individual partitions to determine which are available, which must be recomputed, and which have errors that require user correction.
Because the Dataflow Control Plane manages data partitioning, it can also perform incremental recomputation. Frequently, when new data arrives, only a small subset of the partitions are affected. The Dataflow Control Plane detects this and determines exactly which partitions of each transform must be recomputed based on the partitioning patterns associated with the transform. You don’t have to write additional logic for each transform to handle incremental computation. It’s also easy to backfill data: when you increase the range of the read connector(s) to include the historical data, the Dataflow Control Plane will schedule computation of the relevant partitions for all downstream transforms.
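The partition-level dependency logic can be sketched as a function from a transform's partitioning pattern and the set of changed input partitions to the set of output partitions needing recomputation. Pattern names follow the text above; the encoding is an illustrative assumption:

```python
# Sketch of incremental recomputation: given the input partitions that
# changed, derive which output partitions of a transform must be
# recomputed, based on its declared partitioning pattern.

def dirty_outputs(pattern, dirty_inputs, group_key=None):
    if pattern == "mapping":            # 1:1 — recompute matching outputs only
        return set(dirty_inputs)
    if pattern == "full_reduction":     # any change dirties the single output
        return {"ALL"} if dirty_inputs else set()
    if pattern == "partial_reduction":  # recompute only the affected groups
        return {group_key(p) for p in dirty_inputs}
    raise ValueError(f"unknown pattern: {pattern}")

# One new daily partition arrives among many:
changed = {"2019-08-01"}
per_day   = dirty_outputs("mapping", changed)
per_month = dirty_outputs("partial_reduction", changed,
                          group_key=lambda day: day[:7])
print(per_day, per_month)
# {'2019-08-01'} {'2019-08'}
```

A backfill works the same way: widening the read connector's range marks the historical partitions dirty, and the same propagation schedules exactly the downstream partitions they touch.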
In Big Data systems, failures are inevitable. Computation sets have become so large that even with a very small probability of error per task, the probability that there will be at least one error approaches 1.0. Redundancy doesn’t help, because all redundant computations are equally likely to fail. What does work is being able to recompute just the failing portion of the computation and stitch the correct data into the result. When the elastic data fabric encounters a processing error, it tells the Dataflow Control Plane which partition(s) failed so it doesn’t need to reschedule the entire transform to recover. Instead it only schedules the partition(s) that failed. If the error requires human intervention (e.g., a corrupted input file), you don’t need to wait until the next nightly run to see if your fix worked. Once the correction is made, the Dataflow Control Plane will discover it and recompute only the relevant partitions.
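The "approaches 1.0" claim is just the arithmetic of independent failures: with per-task failure probability p over n tasks, the chance of at least one failure is 1 − (1 − p)^n. A quick check of the numbers:

```python
# With per-task failure probability p, the chance that at least one of
# n independent tasks fails is 1 - (1 - p)**n, which approaches 1 as n grows.

def p_any_failure(p, n):
    return 1 - (1 - p) ** n

# Even a 0.01% per-task failure rate is near-certain to bite at scale:
small = p_any_failure(0.0001, 100)      # ~1% chance over 100 tasks
large = p_any_failure(0.0001, 100_000)  # ~99.995% chance over 100,000 tasks
```

This is why recomputing only the failing partition, rather than the whole job, is the strategy that scales.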
All of these capabilities are available today with the Ascend Dataflow Control Plane, and we are already hard at work on further automated Dataflow optimizations that will continue to drive down cloud spend.
If you’re a data engineer who prefers wrangling data over managing cloud infrastructure, I encourage you to give Ascend a try. We are currently offering a free trial of our hosted service, no strings attached, so you can get up and building quickly. Check it out here.
Hope to see you soon on Ascend!