Spotlight: Data Partitioning

Michael Leppitsch

michael@ascend.io

Many enterprise data teams choose Ascend for its ease of use, single pane of glass, and high levels of automation. The data engineers on those teams are often pleasantly surprised when they discover the powerful partitioning capabilities of the platform, and the precise level of control they have over how Ascend handles data. Let's take a closer look at these capabilities.

What is data partitioning?

In this age of big data, it is not uncommon to deal with billions of records and terabytes of data. Processing such volumes requires smart approaches to segmenting the data into manageable subsets in order to:

Distribute and manage workloads over time and resources
Apply processing techniques that save time and money
Avoid failures when large datasets exceed the limits of the underlying technology

Partitioning creates subsets of the data that can greatly improve performance and reduce costs in that they:

Can be stored on distributed file systems
Can accelerate processing across multiple servers or nodes in parallel
Can process individual subsets of the overall data
Can use smaller and cheaper compute resources
Can be operated on in bulk, for example merged, moved, indexed, or deleted

Common partitioning techniques include:

Horizontal partitioning or sharding: The partitions contain complete rows of data, grouped by the values of one of the attributes of the schema, or an added unique partitioning key.
Range partitioning: In this form of horizontal partitioning, the partitions are split into contiguous subsets of rows based on ranges of values in the existing attributes of the schema.
Vertical partitioning: The partitions contain subsets of the columns of data, and share one of the columns containing a unique value for all the rows to link them.
Normalization partitioning: In this form of vertical partitioning, columns containing redundant data are split into their own partitions.
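To make the four techniques more concrete, here is a minimal sketch in plain Python. It is not tied to Ascend or any particular engine; the sample records, the field names, and the helper functions (horizontal_partition, range_partition, vertical_partition, normalize) are all hypothetical and exist only for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical sample records; the field names are illustrative only.
ROWS = [
    {"order_id": 1, "customer": "acme", "country": "US", "day": "2023-01-01", "amount": 10.0},
    {"order_id": 2, "customer": "bolt", "country": "DE", "day": "2023-01-01", "amount": 25.0},
    {"order_id": 3, "customer": "acme", "country": "US", "day": "2023-01-02", "amount": 40.0},
    {"order_id": 4, "customer": "core", "country": "FR", "day": "2023-02-10", "amount": 5.0},
]


def horizontal_partition(rows, key):
    """Group complete rows by the value of one attribute."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


def range_partition(rows, key, boundaries):
    """Split rows into contiguous ranges of `key`.

    `boundaries` must be sorted. Partition i holds rows with
    boundaries[i-1] <= value < boundaries[i].
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return parts


def vertical_partition(rows, link_key, column_groups):
    """Split columns into groups; each group keeps `link_key` so rows can be rejoined."""
    return [
        [{link_key: row[link_key], **{c: row[c] for c in cols}} for row in rows]
        for cols in column_groups
    ]


def normalize(rows, key, dependent_cols):
    """Move columns that repeat for every `key` value into their own lookup partition."""
    lookup = {}
    slim = []
    for row in rows:
        lookup[row[key]] = {c: row[c] for c in dependent_cols}
        slim.append({k: v for k, v in row.items() if k not in dependent_cols})
    return slim, lookup


by_customer = horizontal_partition(ROWS, "customer")
by_month = range_partition(ROWS, "day", ["2023-02-01"])
by_columns = vertical_partition(ROWS, "order_id", [["customer", "country"], ["day", "amount"]])
orders, customers = normalize(ROWS, "customer", ["country"])
```

ISO-formatted date strings sort in chronological order, which is why the range example can compare them directly as text.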

Why does data partitioning matter in Ascend?

Modern data clouds have removed the need for you to explicitly partition data for performance reasons. Similarly, the Ascend platform has harnessed partitioning techniques to achieve scalability that is unmatched by conventional ETL/ELT tools. With Ascend's DataAware™ technology, data is partitioned in order to reduce the number of times it needs to be processed over the lifespan of the pipeline.

When developing your pipelines in Ascend, you have direct access to these capabilities. You can control how data is partitioned and processed in order to tune your intelligent data pipelines for performance, or rely on the defaults built into the platform, which are already optimized for the most common use cases.

What are the best strategies for partitioning in Ascend?

The primary purpose of partitioning Ascend data pipelines is to reduce the re-processing of data as changes happen to either the code or the data. There are two primary strategies to partition data:

When the data is a time series: Range partitioning is particularly efficient because changes usually affect only the most recent partitions that are still in active use, so older partitions do not need to be reprocessed.
When the data contains specific objects: Horizontal object-based partitioning can dramatically reduce resource usage in subsequent transformation processing.
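The idea of reducing reprocessing can be sketched in a few lines of Python. This is only an illustration of the general technique under stated assumptions, not a description of how Ascend or DataAware is implemented: the fingerprint scheme, the cache layout, and the function names (fingerprint, transform, refresh) are all hypothetical. Each partition gets a fingerprint derived from its data and the version of the transformation code, and a partition is recomputed only when that fingerprint changes.

```python
import hashlib
import json


def fingerprint(rows, code_version):
    """Hash a partition's rows together with the version of the transform code."""
    payload = json.dumps(rows, sort_keys=True) + code_version
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def transform(rows):
    """Stand-in for an expensive transformation: total amount per partition."""
    return sum(row["amount"] for row in rows)


def refresh(partitions, cache, code_version="v1"):
    """Recompute only the partitions whose data or code has changed.

    `cache` maps partition key -> (fingerprint, result) and is updated in place.
    Returns the keys of the partitions that were recomputed.
    """
    recomputed = []
    for key, rows in partitions.items():
        fp = fingerprint(rows, code_version)
        if cache.get(key, (None, None))[0] != fp:
            cache[key] = (fp, transform(rows))
            recomputed.append(key)
    return recomputed


# Time series data, range-partitioned by month (hypothetical values).
monthly = {
    "2023-01": [{"amount": 10.0}, {"amount": 25.0}],
    "2023-02": [{"amount": 5.0}],
}
results = {}

print(refresh(monthly, results))          # first run: every partition is processed
monthly["2023-02"].append({"amount": 7.0})
print(refresh(monthly, results))          # only the partition that received new data
print(refresh(monthly, results, "v2"))    # a code change invalidates all partitions
```

With a time series, new records land in the latest partition, so the second call touches a single month and leaves the older one alone.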

How to Do Partitioning in Ascend

On Ascend, data engineers can adapt partitioning strategies as data flows through their intelligent pipelines. You can change the partitioning at different points along the way:

At the entry point of data: Ascend Read Connectors will apply your partitioning preference automatically as data enters your intelligent pipelines.
During transformation steps: Ascend automatically provides a default "after-state" partitioning strategy for you, which you can override with your own configuration for special cases.
At output: Ascend Write Connectors automatically choose between time series and full reduction partitions in order to minimize the rewriting of data to output systems whenever possible. You can override this behavior with your own configuration.
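As a rough illustration of changing the partitioning along a pipeline, the sketch below regroups data that arrived partitioned by day into partitions by customer, and then collapses everything into a single output partition. It does not use any Ascend API or configuration syntax. The function names (repartition, full_reduction) are hypothetical, and reading "full reduction" as "collapse all partitions into one" is an assumption made for this example.

```python
from collections import defaultdict


def repartition(partitions, key_fn):
    """Regroup rows from existing partitions under a new partition key."""
    out = defaultdict(list)
    for rows in partitions.values():
        for row in rows:
            out[key_fn(row)].append(row)
    return dict(out)


def full_reduction(partitions):
    """Collapse all partitions into a single one (assumed meaning of the term)."""
    return {"all": [row for rows in partitions.values() for row in rows]}


# Entry point: data arrives partitioned by day (hypothetical values).
by_day = {
    "2023-01-01": [
        {"customer": "acme", "amount": 10.0},
        {"customer": "bolt", "amount": 25.0},
    ],
    "2023-01-02": [{"customer": "acme", "amount": 40.0}],
}

# Transformation step: switch to object-based partitions, one per customer.
by_customer = repartition(by_day, lambda row: row["customer"])

# Output step: a single partition that holds every row.
single = full_reduction(by_customer)
```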

Data Partitioning Still Matters

While modern data platforms have largely eliminated the need to partition data explicitly for performance, partitioning data in pipelines remains a powerful tool to reduce the amount of reprocessing triggered by changing data and revisions in pipeline logic. The Ascend data pipeline automation platform saves you money by automatically applying partitioning to all of your data. In addition, the platform puts several partitioning strategies at your fingertips to tune and optimize your pipelines further.

Try it out. Your future self will thank you :)