The data engineering world is full of tips and tricks on how to handle specific patterns that recur with every data pipeline. A prime example of such patterns is orphaned datasets. These are datasets that exist in a database or data storage system but no longer have a relevant link or relationship to other data, to any of the analytics, or to the main application — making them a deceptively challenging issue to tackle.
Typically, these orphaned datasets are addressed with ad-hoc manual protocols and after-the-fact garbage collection processes. In a traditional modern data stack, this is a major time sink for data engineers.
In this article, we illustrate why dealing with orphaned datasets is paramount, and how Ascend goes beyond mere garbage collection by never creating orphaned datasets in the first place. This proactive approach eliminates wasteful maintenance work for your data engineers, improving both efficiency and alignment. Let’s dive into this transformative strategy.
The High Stakes of Neglecting Orphaned Datasets
Why should we concern ourselves with orphaned datasets? To start, the implications of this problem extend far beyond the confines of databases and have serious ramifications for businesses at large. As far back as 2016, IBM estimated the cost of bad data at over three trillion dollars, and that was before the chaos of data lakes emerged and orphaned datasets began to swamp the land.
Orphaned datasets arise for several reasons: changes in data relationships, ad-hoc archiving of deleted records, restarts after system failures and errors, or design flaws in data management strategies. A common consequence we observe is a lack of awareness among stakeholders about which datasets in the database are officially approved and actively maintained. Even the data stewards responsible for the accuracy and business relevance of the data are often unaware of the nuances of orphaned datasets. They are blindsided when they read from a table they’ve been granted access to, only to find out it was abandoned by the data engineers weeks ago.
These orphaned datasets lead to inefficiencies, including wasted storage space and degraded system performance. More importantly, they create legal risk when they involve competitive information or personal data subject to regulations such as GDPR and CCPA, and they can seriously compromise data integrity, eroding trust in data and confidence to make decisions with it.
To avoid the problems related to orphaned datasets, many data teams institute rigorous data hygiene standards and time-consuming change management procedures. Before any changes are made, every contingency has to be carefully mapped out, manually reviewed, and tested under various scenarios. The data team also spends much of its time on internal custodial processes to reduce data bloat, and the business has to intervene regularly to help sort out what’s useful and what is not. Innovation slows to a crawl, and staffing costs soar while the team’s productivity sags. But what if there was a better way? Let’s take a closer look at our unique approach to handling the orphaned datasets problem.
Where do orphaned datasets appear in Ascend?
In general, the root of the orphaned data problem lies in the interim storage of data between the individual processing steps that make up a data pipeline. As in any pipeline, these processing steps consist of ingestion, transformation, and data sharing. On Ascend, these steps are organized into dataflows (essentially pipelines) that run in the form of workloads in the data clouds. Ascend groups these dataflows into data services, where the user selects which data cloud they should run in. These data clouds are also where the interim storage between steps is kept. In Snowflake or BigQuery, these are database tables; for a lakehouse like Databricks, these are partitions in Delta Lake.
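The hierarchy above can be sketched in a few lines. This is an illustrative model only: the data-service and dataflow names, and the mapping of clouds to storage kinds, are hypothetical assumptions for the example, not Ascend's actual internals.

```python
# Illustrative sketch: a data service groups dataflows, and each step in a
# dataflow gets interim storage in the chosen data cloud. All names and the
# cloud-to-storage mapping below are hypothetical.

INTERIM_STORAGE = {
    "snowflake": "database table",
    "bigquery": "database table",
    "databricks": "Delta Lake partition set",
}

data_service = {
    "cloud": "snowflake",
    "dataflows": {
        "orders_flow": ["ingest_orders", "clean_orders", "share_orders"],
    },
}

def interim_objects(service):
    """List (object_name, storage_kind) for each step's interim storage."""
    kind = INTERIM_STORAGE[service["cloud"]]
    return [(f"{flow}.{step}", kind)
            for flow, steps in service["dataflows"].items()
            for step in steps]

print(interim_objects(data_service))
```

Even this toy example makes the scale obvious: every step of every dataflow carries its own interim storage object, so the object count grows multiplicatively with pipelines and steps.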
As you can imagine, the complexity of interim storage grows quickly when an enterprise runs many data pipelines, each with multiple processing steps, distributed across all the data clouds. But the orphaned dataset problem really explodes when pipelines begin to change. In most data pipeline systems, new tables are created in response to any change, and the old ones are abandoned in order to leave a complete trace behind:
- When source schemas entering the pipeline at ingestion change, forcing the interim storage schemas along the entire downstream pipeline to change.
- When the logic in a transformation step changes to follow new business rules, forcing the schemas and partitions downstream of it to change.
- When performance tuning involves partitioning changes, affecting the size and organization of interim storage.
- When data differs before and after any change, so orphaned datasets for the same processing step look different over time, making them hard to match up.
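A minimal sketch of why each of these changes fans out: any step downstream of the changed one needs its interim storage updated too. The dependency graph and step names here are hypothetical, invented for illustration.

```python
# Minimal sketch: one upstream change affects every downstream step's
# interim storage. The graph and step names are hypothetical.

downstream = {
    "ingest_orders": ["clean_orders"],
    "clean_orders": ["daily_totals", "customer_stats"],
    "daily_totals": [],
    "customer_stats": [],
}

def affected_steps(changed):
    """Return every step downstream of `changed`, breadth-first."""
    result, queue = [], list(downstream[changed])
    while queue:
        step = queue.pop(0)
        if step not in result:
            result.append(step)
            queue.extend(downstream[step])
    return result

print(affected_steps("ingest_orders"))
```

A single schema change at ingestion touches three downstream storage objects here; in a real deployment with hundreds of steps, each change multiplies into that many candidate orphans.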
While engineers create these interim datasets with good intentions, other people with direct access to the database can’t tell which tables fit where in the pipeline, or whether they are active, archived, or abandoned. And when you multiply the number of interim storage tables by the variety of sources of change, this problem quickly spirals out of control. But as our users can attest, on Ascend all this operational detail is fully automated. Let’s step through it.
How does Ascend do its magic?
Instead of simply garbage-collecting obsolete orphaned data with heuristics, the platform roots out the causes, with powerful options to keep datasets aligned with the pipeline processes in real time.
- Automation #1: When a user first creates a data service and a dataflow, Ascend automatically creates a database for the data service and a database schema for that flow, following a templated naming convention.
- Automation #2: When the user starts creating ingestion, transformation, and sharing components, Ascend automatically creates the interim storage tables for each of them, in real time.
- Automation #3: When any of the sources of change above occurs, Ascend updates the interim storage automatically and in real time. For example, renaming a component in Ascend instantly renames the corresponding table in the database.
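The three automations above amount to keeping warehouse DDL in lockstep with the pipeline definition. The sketch below shows the *kind* of statements a platform could issue for them; the Snowflake-style statement shapes, the naming convention, and all identifiers are illustrative assumptions, not Ascend's actual commands.

```python
# Hedged sketch of DDL a platform might issue for Automations #1-#3.
# Statement shapes are Snowflake-style; names are illustrative only.

def ddl_for_new_flow(service, flow, components):
    """Automations #1 and #2: database per data service, schema per
    dataflow, one interim table per component."""
    stmts = [
        f"CREATE DATABASE IF NOT EXISTS {service}",
        f"CREATE SCHEMA IF NOT EXISTS {service}.{flow}",
    ]
    stmts += [f"CREATE TABLE IF NOT EXISTS {service}.{flow}.{c}"
              for c in components]
    return stmts

def ddl_for_rename(service, flow, old, new):
    """Automation #3: a component rename maps directly to a table rename,
    so the warehouse never drifts from the pipeline definition."""
    return f"ALTER TABLE {service}.{flow}.{old} RENAME TO {service}.{flow}.{new}"

print(ddl_for_new_flow("sales_service", "orders_flow",
                       ["ingest_orders", "clean_orders"]))
print(ddl_for_rename("sales_service", "orders_flow",
                     "clean_orders", "orders_cleaned"))
```

The design point is that the pipeline definition is the single source of truth: storage objects are derived from it, so there is never a table whose role cannot be traced back to a component.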
Here is the part where you can choose what Ascend does in response to change:
- Strategy #1: The change is applied directly to the live table. This strategy minimizes clutter and storage space, but could lose some data that is no longer useful going forward. However, Ascend always keeps a record of every change in its metadata, so the tables themselves are not needed to look back at schemas, when changes were made, who made them, etc.
- Strategy #2: Ascend clones the table, removes the clone from the pipeline, and changes its name to append a timestamp. Then Ascend applies the changes to the table in the active pipeline. This creates a clear, easily distinguishable trail of full-fledged artifacts that some businesses require, tracked in Ascend metadata.
- Strategy #3: Instead of renaming the clone with a timestamp, Ascend simply keeps the superseded table as-is and ignores it.
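The three strategies can be contrasted side by side. This is a pseudo-operation sketch under stated assumptions: the operation strings, the fixed timestamp, and the reading of Strategy #3 as "keep the superseded copy without a timestamped rename" are all illustrative, not Ascend's actual behavior or API.

```python
# Illustrative sketch of the three change-handling strategies. Operation
# strings are pseudo-DDL; the timestamp and names are hypothetical.

def plan_change(table, strategy, timestamp="20240101T000000"):
    """Return the operations each strategy implies for one interim table."""
    if strategy == 1:
        # Strategy 1: alter the live table in place; change history is
        # preserved in platform metadata, not in extra tables.
        return [f"alter {table} in place"]
    if strategy == 2:
        # Strategy 2: clone, rename the clone with a timestamp as an
        # archive artifact, then alter the live table.
        return [f"clone {table} -> {table}_{timestamp}",
                f"alter {table} in place"]
    # Strategy 3 (as interpreted here): keep the superseded copy without
    # a timestamped rename, ignore it, and alter the live table.
    return [f"keep superseded copy of {table} (no rename)",
            f"alter {table} in place"]

for s in (1, 2, 3):
    print(s, plan_change("orders_flow.clean_orders", s))
```

The trade-off is storage versus auditability: Strategy 1 minimizes clutter, Strategy 2 leaves a clearly labeled archive trail, and Strategy 3 keeps the history without renaming anything.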
To see these capabilities in action, check out this brief demo.
How can you up your game?
By now you have realized that Ascend has addressed the problem of orphaned datasets by not creating any of them at all. Ascend’s data pipeline automation has eliminated this bugbear of data pipelines, freeing up data engineers to pursue more value-generating work, such as creating new data pipelines to drive new business insights.
- This level of automation dramatically reduces the need for dataset management to be included in the CI/CD workflows for building and delivering data pipelines. You no longer need to separately track the data assets that will be created, or those that will be outdated — the platform handles that for you, and its metadata provides the auditability of the work being done.
- In a future release, Ascend will be able to roll back entire pipelines to a previous state, at zero cost, by simply reinstating the archive tables along with the previous transformation logic and lineage. If this sounds familiar, you may have been an Ascend user for several years: this was a feature supported on the original Spark-based compute engine. Soon it will be available on all data clouds, even for pipelines that span multiple data clouds.
Additional Reading and Resources