The Anti-Pattern in Big Data


Ascend.io

data-eng@ascend.io

Today’s scale of data creation and ingestion has fueled an Icarus-like obsession with data-driven business decisions. The desire for velocity in analytical processing, machine learning, and visualization has only widened the gap between the vision of a data-powered intelligence engine and the tools actually available to build it. The jargon-filled alphabet soup (data lakes, warehouses, marts, mines, and so on) that we have come to accept as reality has fallen far short of the original value proposition for Big Data. We fight our fires with increasingly large data engineering teams, menial data cleansing and preparation work, and flaky custom logic that fails to scale with data diversity and size. Ultimately, companies are often forced to choose between cost and data quality.

At Ascend, we believe this not only makes existing data engineering efforts untenable for most organizations but also fails to deliver the scalable solutions necessary for data science experimentation. Innovation thus far has pushed towards ever-larger storage and increasingly complex, hand-stitched pipeline solutions. Yet the metrics that actually matter for data science, like model development speed, query latency, and dataset flexibility, have been deprioritized. The data world today tends to throw a large budget at data engineering problems rather than adopt a managed, scalable solution.

Data Warehouses: Where It All Went Right

Data warehouses originated from the desire for a centralized source of truth for high-quality relational data. They represented everything right for a singular data problem. A data scientist or modeler could run numerous SQL queries with relatively low latency to retrieve highly curated datasets for model training and pattern analysis.

However, with time, the age-old issue of scale caught up with the data warehouse. As the demand for large quantities of high-quality data grew exponentially, the cost of these data warehouses, and of the ETL operations needed to feed them, followed suit. In an effort not to lose any data for future use, we lowered the bar for data quality in storage and drove the popularity of schema-less, non-relational data repositories: data lakes and, even worse, data swamps.

In retrospect, we now view data warehouses as the original, ideal attempt at unlocking data-driven insights and innovation. But the reality of cost and scale turned this utopian vision into an unmanageable, budget-draining monster. Additionally, limits on scale, cost, and performance dictated increasing levels of data preparation, slowing down the overall development cycle for data teams.

Data Pipelines: A Means To An End

Steering into the skid of Big Data’s overused aqueous analogies of lakes and pipelines, we can look at data lakes as modern-day reservoirs. They hold value in an unstructured, nebulous form, but that value still requires transport to another destination before it can be used. Data pipelines became the answer to the “schema-on-read” requirement for data lake reads. Opening the door to data pipelines pushed the industry into digging itself a deeper and deeper hole of technical debt. Custom data pipelines drove the need for powerful processing frameworks like MapReduce and Spark. Scaling pipelines drove the need to schedule thousands of pipelines around different SLA priorities, which encouraged the inception of tools like Airflow.
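To make “schema-on-read” concrete, here is a minimal Python sketch. The record layout, the `Purchase` type, and the `read_with_schema` helper are all hypothetical and chosen only for illustration; the point is that nothing is validated when data lands in storage, so every reader has to carry its own parsing and coercion logic.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# Hypothetical raw records as they might sit in a data lake. No schema was
# enforced at write time, so fields are renamed, mistyped, or missing.
RAW_LINES = [
    '{"user_id": 1, "amount": "19.99", "ts": "2019-03-01"}',
    '{"user": 2, "amount": 5}',
    'not even json',
]


@dataclass
class Purchase:
    user_id: int
    amount: float
    ts: Optional[str]


def read_with_schema(lines: Iterable[str]) -> Tuple[List[Purchase], List[str]]:
    """Apply the schema at read time and keep the lines that do not fit it."""
    parsed: List[Purchase] = []
    rejected: List[str] = []
    for line in lines:
        try:
            rec = json.loads(line)
            # Tolerate the two field names seen in the raw data (an assumption).
            user_id = rec.get("user_id", rec.get("user"))
            parsed.append(Purchase(int(user_id), float(rec["amount"]), rec.get("ts")))
        except (ValueError, KeyError, TypeError, AttributeError):
            rejected.append(line)
    return parsed, rejected


purchases, rejected = read_with_schema(RAW_LINES)
```

Every consumer of the same raw data ends up repeating some version of this coercion step, which is one way the duplicated pipeline logic discussed in this post accumulates.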

The modern data industry has found itself at a point where the sheer amount of data to extract value from has become so large that, instead of depending on highly curated stores like data warehouses, we choose to dump petabytes and exabytes of data into data lakes. Just to touch and work with this data in storage, we construct complex networks of data pipelines and dependencies that fight over scarce compute resources.

The current concept of a data pipeline actually contradicts the free-flowing, queryable nature of data lakes. It is as if, to balance out the schema-less aspect of data lakes, pipelines enforce rigidity and static behavior. This rigidity manifests itself as cascading, multi-staged workflows that can really only be used at two points: the start of the pipeline and the end of the pipeline. Furthermore, the black box that is the core processing of the pipeline will also swallow errors and require constant toggling for debugging. Ultimately, this configuration forces data engineers to maintain flaky, single-use pipelines and to constantly duplicate or rewrite common logic for pipelines that serve similar business goals.
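The contrast between a pipeline that is usable only at its two ends and one whose intermediate stages can be examined can be sketched in a few lines of Python. The stage functions and both runners below are hypothetical illustrations, not how Spark, Airflow, or any particular product implements this.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[dict]
Stage = Tuple[str, Callable[[Rows], Rows]]


def drop_invalid(rows: Rows) -> Rows:
    return [r for r in rows if r.get("amount") is not None]


def to_cents(rows: Rows) -> Rows:
    return [{**r, "cents": round(r["amount"] * 100)} for r in rows]


def total_by_user(rows: Rows) -> Rows:
    totals: Dict[str, int] = {}
    for r in rows:
        totals[r["user"]] = totals.get(r["user"], 0) + r["cents"]
    return [{"user": u, "cents": c} for u, c in sorted(totals.items())]


STAGES: List[Stage] = [
    ("drop_invalid", drop_invalid),
    ("to_cents", to_cents),
    ("total_by_user", total_by_user),
]


def run_black_box(rows: Rows) -> Rows:
    """Only the input and the final output are visible to the caller."""
    for _, fn in STAGES:
        rows = fn(rows)
    return rows


def run_inspectable(rows: Rows) -> Dict[str, Rows]:
    """Keep every stage's output and name the stage when one fails."""
    outputs: Dict[str, Rows] = {}
    for name, fn in STAGES:
        try:
            rows = fn(rows)
        except Exception as exc:
            raise RuntimeError(f"stage {name!r} failed") from exc
        outputs[name] = rows
    return outputs


data = [
    {"user": "a", "amount": 1.5},
    {"user": "b", "amount": None},
    {"user": "a", "amount": 2.25},
]
final = run_black_box(data)
by_stage = run_inspectable(data)
```

With `by_stage`, the rows that survived `drop_invalid` or the per-row `cents` values can be read directly rather than recovered by re-running the job with extra logging switched on.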

This house of cards of code and tools ends up as an anti-pattern for Big Data’s actual value. In attempting to bring structure to structureless data storage, pipelines end up so restrictive that further development after the initial version usually requires a major overhaul of the code or a new pipeline altogether.

Experimentation Requires Data Freedom

We all understand the general Big Data premise that value exists in the sheer amounts of data we now record and store. But having exabytes’ worth of unprocessed, schema-less data sitting in a data lake is almost as bad as having no data at all. To support and drive the masses of aspiring data engineers and data scientists, the data industry needs to prioritize the freedom of data movement above anything else. Being able to reshape and transform data at any point in its flow from storage to querying is the most important task for any data engineering team today.

At Ascend, we have built our Autonomous Dataflow Service with this core mission in mind. Data science is ultimately still a “science”. It is driven by experimentation and repetition. We have introduced the term Dataflow because data within Ascend can be accessed and analyzed at any stage in its transformation. Our Dataflow Control Plane pushes compute to optimal efficiency, which guarantees the best SLAs for data insights and frees engineers from digging through Spark logs.

The current state of the data world forces data scientists to run data pipeline deployments, write complex MapReduce or Spark jobs, and manage bloated data warehouses. Ascend draws the line of responsibility between data scientists and data engineers. We allow both groups to do what they love to do and what they do best. By superpowering data engineers, Ascend’s platform gives time and bandwidth back to data scientists for model development, insight analysis, and accuracy tests. The world of data today is still bogged down by manual management of data lakes, data warehouses, and data pipelines by everyone involved, scientists and engineers alike. The world of data tomorrow will be driven by products like Ascend that empower data engineering teams and enable data scientists to meet the increasing demand for data-driven decisions.

