In the world of cloud computing, efficiency isn’t just about running operations faster or smoother — it’s also about achieving more with less. It’s about ensuring that the resources consumed deliver maximum value and avoid unnecessary expenditures. 

Understanding and controlling cloud costs is a fundamental part of how Ascend manages the cloud infrastructure of our dedicated deployment customers. These are customers where the entire Ascend software stack is installed in their cloud account.

In this blog post, we delve into the factors that shape how Ascend runs in the cloud, focusing on the primary areas that drive costs in this context: storage, compute, networking, and retries. We then detail the concrete strategies we apply to wring more value out of your infrastructure, making your intelligent data pipelines more efficient and cost-effective than pipelines built any other way.

But if reading is not your thing, dive right into the fascinating details of how we master cloud costs in the video below.

Understanding What Drives Cloud Costs

So, what exactly are the primary factors driving cloud costs within a typical data processing environment? To gain a concrete understanding and provide tangible insights for data pipeline optimization, we’ve monitored the performance of one of our production pipeline networks — an established system that handles significant data volumes and undergoes updates approximately every 30 minutes.

To expose actual infrastructure costs, this reference system runs on open-source Spark. Users of Databricks, Snowflake, and BigQuery additionally benefit from optimizations that those providers perform behind the scenes.

After carefully examining our reference pipeline network, we’ve pinpointed four areas that play a critical role in driving cloud costs:

  • Storage: When dealing with vast amounts of data, you need to store that data somewhere to be accessible for processing. In the large-scale data pipeline networks operated by Ascend, storage can account for 30% of total cloud costs.
  • Compute: This refers to all processes responsible for actually handling your data. Whether it’s a read connector drawing in data from an external source, or a component performing transformations, these operations consume the bulk of compute resources. For our reference pipeline, compute tends to account for approximately 20% of the total cloud bill.
  • Networking: This category covers all costs associated with moving your data around, from data acquisition via read connectors to data delivery via write connectors. Cloud providers usually charge for data movement, which in our reference pipeline network makes up another 30% of the bill.
  • Retries: A more subtle but nonetheless important cost driver is retries. Whenever data processing fails, for any of a variety of reasons, a retry commonly resolves the problem. The cost of retries varies, but Murphy’s Law seems to hold: jobs tend to fail at the worst possible time, creating unforeseen expenses on your cloud bill.

Based on a comprehensive understanding of these cost drivers, our engineering team is constantly improving how the platform utilizes your cloud infrastructure, making your data operations more predictable and efficient. Let’s explore each of these components in turn.

Storage Costs

Even though storage is generally considered inexpensive, its costs can quickly accumulate in a high-volume data environment. As a result, storage costs often end up being a significant portion of your total cloud bill. So how does Ascend manage them and what solutions do we apply to mitigate these expenses?

Understanding Storage Costs

Most bulk data storage in Ascend-powered pipelines occurs in our customers’ Snowflake, BigQuery, and Databricks accounts. For the portions of the pipelines running natively on Ascend, the data lands on blob storage, which costs around 2 cents per gigabyte per month, though this varies by cloud provider and region.
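As a back-of-the-envelope illustration (the rate below is nominal and varies by provider and region), a pipeline's storage footprint adds up quickly:

```python
# Rough monthly blob storage cost at a nominal $0.02/GB-month.
# The rate is illustrative; check your provider's pricing for your region.

def monthly_storage_cost(terabytes: float, usd_per_gb_month: float = 0.02) -> float:
    """Estimate the monthly blob storage bill for a given data volume."""
    return terabytes * 1024 * usd_per_gb_month

# A 50 TB pipeline footprint lands around $1,000 per month before any tiering.
print(f"${monthly_storage_cost(50):,.2f} per month")
```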

Reducing storage costs can be quite challenging. Unlike compute resources, where you can simply delete a VM to stop the cost, reducing storage costs involves deleting the underlying data objects. This deletion requires knowledge of whether or not those objects are necessary, which isn’t always straightforward. And if you delete something you later find out you need, regenerating it can be a burdensome process.

Strategies to Reduce Storage Costs

The Ascend platform leverages two effective techniques designed to keep cloud storage costs under control and optimize your budget.

Intelligent Tiering and Autoclass

Intelligent-Tiering and Autoclass are features offered by AWS and GCP, respectively, that can help reduce storage costs over time. They automatically lower the cost of storing data that has not been accessed for a certain period: costs for data untouched for 30 days can drop by about 50%, and for data untouched for 90 days by up to 80%. Since these features are driven dynamically by how data is accessed over time, actual savings depend on your usage patterns.

These features work by monitoring the read activity on data objects and transitioning idle objects to colder, less expensive storage over time. There’s no penalty for accessing data from cold storage, so this can be a beneficial strategy for reducing costs without impacting operations.
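As a minimal sketch of what enabling this looks like on AWS, the following boto3 call (with a hypothetical bucket name) transitions objects into S3 Intelligent-Tiering via a lifecycle rule; on GCP, Autoclass is a bucket-level setting:

```python
import boto3

s3 = boto3.client("s3")

# Transition all objects to S3 Intelligent-Tiering, which automatically
# demotes objects that haven't been read for 30/90+ days to cheaper tiers.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-pipeline-data",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "intelligent-tiering-all-objects",
                "Filter": {"Prefix": ""},  # apply to every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```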

Unfortunately, Azure does not have a similar feature, but it does offer lifecycle management policies that can also help manage storage costs.

Ascend Views

Ascend Views is a new component in the platform that transforms your data without persisting it, reducing both end-to-end processing time and storage costs. Rather than persisting interim datasets in the middle of your pipelines, you can chain together multiple transformations and persist only the final dataset in the last components of the pipeline. This approach not only reduces underlying storage costs, but can also significantly improve the performance of each individual pipeline, regardless of whether it runs on Snowflake, Databricks, BigQuery, or natively on Ascend.
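To make the idea concrete in plain Spark terms (this is an analogy, not Ascend’s Views API, and the paths are hypothetical), the chained transformations below stay lazy and only the final dataset is ever written:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("views-sketch").getOrCreate()

raw = spark.read.parquet("s3://my-bucket/raw/events/")  # hypothetical path

# Chain transformations lazily: nothing below writes intermediate data.
cleaned = raw.dropna(subset=["user_id"])
sessions = cleaned.groupBy("user_id", "session_id").count()

# Only the final dataset is persisted; the interim steps stay virtual,
# so they incur no storage cost (but must be recomputed on a retry).
sessions.write.mode("overwrite").parquet("s3://my-bucket/curated/sessions/")
```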

However, one trade-off to consider: because Views are not persisted, they are not backed by a caching layer, which can increase compute consumption under certain retry scenarios discussed later. You can choose to use this type of component where it is most appropriate for your unique circumstances.

Compute Costs

Compute costs for the Ascend platform itself are driven by the virtual machine (VM) nodes in your cloud account and can vary widely with cloud provider, region, instance type, and the choice between spot and on-demand instances. While these variations might be confusing, they offer our platform opportunities for cost reduction through intelligent management strategies.

Understanding Compute Costs

In an ideal scenario, our platform would use spot instances at 100% utilization at all times, for maximum cost efficiency. In the real world, however, spot instances can disappear at any time, and optimal utilization is unattainable due to poor packing, fragmentation, and the tricky nature of compacting compute resources in such a dynamic environment.

Given the big-data nature of Ascend, our platform utilizes Spark under the hood for many of the data management operations that make it tick. Such a distributed Spark cluster may run on multiple compute nodes for an extended period. Due to variations in the workloads, these clusters don’t always fully utilize the nodes they are running on, leading to resource fragmentation.

Even more challenging is the compaction of compute resources, particularly for stateful applications like Spark jobs. If a long-running Spark job is terminated due to a disappearing spot instance, progress can be lost and previous state can be expensive to recover.

Strategies to Optimize Compute Costs

Thankfully, the Ascend platform is very good at addressing these challenges and squeezing the most value out of your cloud infrastructure. We incorporate several strategies: spot compute, pod packing, idle executor preemption, and the efficiencies of Ascend’s processing engine, which also drives the BigQuery, Databricks, and Snowflake data planes.

Spot Compute

Ascend has long used spot nodes, especially for running transformation jobs that are not being served by Snowflake, Databricks, or BigQuery. Spot nodes can offer savings of up to 90% compared to on-demand or reserved instances. Even with the aforementioned recovery operations, a significant discount of about 60-70% can be achieved.
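For a sense of how a workload opts in to spot capacity on Kubernetes, here is a sketch using the Kubernetes Python client. The node selector and taint shown follow GKE’s spot conventions (other clouds use different labels), and the pod name and image are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()

# A pod that opts in to spot capacity. The label/taint key shown is GKE's
# (cloud.google.com/gke-spot); EKS and AKS use different conventions.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "spark-exec-spot"},
    "spec": {
        "nodeSelector": {"cloud.google.com/gke-spot": "true"},
        "tolerations": [{
            "key": "cloud.google.com/gke-spot",
            "operator": "Equal",
            "value": "true",
            "effect": "NoSchedule",
        }],
        "containers": [{
            "name": "executor",
            "image": "apache/spark:3.5.0",  # placeholder image
            "command": ["sleep", "3600"],
        }],
    },
}

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```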

Pod Packing

Kubernetes is at the heart of Ascend, and changing its default scheduling behavior can yield significant compute cost reductions. We tune Kubernetes to safely pack workloads for efficiency while preserving the integrity and performance of our services.
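The underlying intuition is classic bin packing. The toy first-fit-decreasing sketch below is not our scheduler (real schedulers, such as kube-scheduler with a MostAllocated scoring strategy, are far more nuanced), but it shows why packing pods tightly onto fewer nodes saves money:

```python
# Toy first-fit-decreasing bin packing: the intuition behind "pod packing".

def pack(pod_cpu_requests: list[float], node_cpu: float) -> list[list[float]]:
    nodes: list[list[float]] = []
    for req in sorted(pod_cpu_requests, reverse=True):
        # Place each pod on the first node with room, else start a new node.
        for node in nodes:
            if sum(node) + req <= node_cpu:
                node.append(req)
                break
        else:
            nodes.append([req])
    return nodes

pods = [3.0, 1.5, 1.5, 1.0, 0.5, 0.5]  # vCPU requests
print(len(pack(pods, node_cpu=4.0)), "nodes needed")  # -> 2 nodes
```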

Idle Executor Preemption

We apply our long experience running Spark to identify idle executors that have completed their workloads but continue to incur overhead. These are marked as eligible for compaction, reducing the need to keep unnecessary VMs alive. We estimate this feature can deliver a 50% reduction in compute costs when running small data volumes.
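Spark’s built-in dynamic allocation offers a comparable, openly documented mechanism: executors that sit idle past a timeout are released so their capacity can be reclaimed. A minimal configuration sketch:

```python
from pyspark.sql import SparkSession

# Executors idle past executorIdleTimeout are released, freeing their VMs.
spark = (
    SparkSession.builder
    .appName("idle-executor-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)
```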

Ascend Compute Upgrades

Ascend has recently upgraded its core compute model, bringing with it significant improvements. Previously, each job was given its own container to run in. With this upgrade, the platform now uses a warehouse model where compute entities have a controlled level of access across various resource pools in your cloud account. This results in better control over low-level resource consumption by the platform, with significant performance gains and up to a 60% reduction in cost.

Networking Costs

Networking, another critical pillar of cloud costs, is primarily driven by inter-zone traffic and the use of Network Address Translation (NAT) gateways. Both of these components, if not managed correctly, can escalate costs unexpectedly.

Understanding Networking Costs

Inter-zone Traffic

Inter-zone traffic, which occurs when data moves between availability zones within the same region, can contribute surprisingly significant costs. Providers like Amazon charge for both egress and ingress on this traffic, meaning you’re effectively billed twice for moving the same data.

This effect is particularly pertinent when data is shuffled during operations like repartitioning. Providers like BigQuery, Databricks, and Snowflake optimize these operations internally, but where Ascend uses Spark, shuffles can move large volumes of data between distributed executors. Without optimization, these transfers can exceed the total volume of data being processed and quickly inflate your cloud bill.
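For intuition, both operations in the PySpark sketch below (with a hypothetical input path) trigger a full shuffle, in which every executor exchanges data with every other executor; if those executors span availability zones, that exchange is billed as inter-zone traffic in both directions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
df = spark.read.parquet("s3://my-bucket/raw/events/")  # hypothetical path

# Both operations below shuffle the full dataset across executors.
repartitioned = df.repartition(400, "user_id")
aggregated = df.groupBy("user_id").agg(F.count("*").alias("events"))
```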

NAT Gateways

NAT gateways enable internet access for nodes in private subnets that do not have public IP addresses. This allows data ingestion from sources outside the subnet, as well as access for authenticated users. While data ingestion from outside the network directly into a cloud storage bucket is generally free, network traffic over the public internet caused by the orchestration of Spark jobs to process that data can cost around 4.5 cents per gigabyte.

This translates to about $46 per terabyte and can escalate rapidly with larger data volumes. Furthermore, these costs apply not only to data ingestion from the public internet, but also to pulling the Ascend application images into the containers that run your data pipelines inside your cloud account.

Tackling Networking Costs

To mitigate these costs, the Ascend deployment strategies include zone affinity and NAT clusters.

Zone Affinity

With this strategy, we ensure that the Spark-based components of Ascend always run within a single zone. This effectively eliminates the inter-zone traffic caused by shuffle operations. The strategy has been recognized as a best practice for operating Spark clusters.
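As a sketch of what zone affinity looks like in raw Kubernetes terms (the zone name is illustrative, and this is not Ascend’s actual pod spec), the standard topology label can pin a workload to one zone:

```python
# Pin a Spark workload to a single availability zone using the standard
# well-known Kubernetes topology label; the zone name is illustrative.
spark_pod_spec = {
    "affinity": {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        "key": "topology.kubernetes.io/zone",
                        "operator": "In",
                        "values": ["us-central1-a"],
                    }]
                }]
            }
        }
    },
    "containers": [{"name": "executor", "image": "apache/spark:3.5.0"}],
}
```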

NAT Clusters

We are currently addressing the root causes of the high cost of data transfer across NAT gateways, and anticipate reducing these costs by up to 90%. Watch for this improvement in our regular release notes and in your cloud invoice.

Retry Costs

Retries are a useful recovery mechanism in computing, but can lead to escalating cloud costs if not well managed. They incur additional processing costs, contribute to storage and network expenses, increase processing time, and can require human intervention in case of failures. While retries are essential for dealing with interruptions in spot instances and network glitches, their ideal number is zero.

Understanding Retry Costs

Retries usually kick in after ephemeral failures to recover a compute job. These failures are not uncommon and are more prevalent when using spot instances. While retrying is better than the job failing completely, it incurs reprocessing costs and additional storage and network charges. The extra processing time also delays job completion, affecting overall operational efficiency. Furthermore, these failures can escalate to the point of requiring human triage, which translates into additional expense in both time and money.
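A toy model makes the economics visible: every failed attempt burns the compute it consumed before dying, so the expected cost climbs quickly with the failure rate. The sketch below (with invented failure rates and cost units) is purely illustrative:

```python
import random
import time

def run_with_retries(job, max_attempts: int = 3, base_delay_s: float = 5.0):
    """Retry a flaky job with exponential backoff, tracking wasted compute."""
    cost = 0.0
    for attempt in range(1, max_attempts + 1):
        try:
            result, attempt_cost = job()
            return result, cost + attempt_cost
        except RuntimeError:
            cost += 1.0  # a failed attempt still consumed a unit of compute
            time.sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError("job failed after all retries; paging a human now")

def flaky_job(failure_rate: float = 0.2):
    # Stand-in for a Spark job on spot capacity; fails ~20% of the time.
    if random.random() < failure_rate:
        raise RuntimeError("spot instance reclaimed")
    return "ok", 1.0

print(run_with_retries(flaky_job))  # ("ok", total cost including retries)
```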

Many infrastructure incidents, such as unexpected network partitions and spot instance interruptions, are by definition beyond Ascend’s control. While the resulting retries can rarely be prevented, we take measures to mitigate their impact.

Mitigating Retry Costs

The Ascend platform uses several strategies to avoid unnecessary retries, including network encryption with Cilium and disk isolation.

Network Encryption with Cilium

Ascend ensures that all your data is encrypted both in transit and at rest. The internal Spark-based engine offers native encryption capabilities, but these are not optimized for data pipelines and often cause job failures, usually at the most inconvenient times.

To overcome this limitation, the platform now uses a network layer based on Cilium that handles transparent encryption more reliably. By leveraging the built-in functionality of the underlying nodes, Cilium ensures that pipeline jobs don’t fail due to encryption issues at the last minute, significantly reducing retries.

Disk Isolation

When infrastructure is utilized at the limits of its capacity, it’s common for a compute node to spill data to disk when memory runs low. This can put a high IOPS load on the node, and when multiple jobs share a node, one job’s spill can destabilize the others’ workloads.

To address this source of instability and costs, we’re introducing dedicated ephemeral disks that protect workloads from each other. This strategy not only improves the stability of workloads but also reduces your total spend on disks at rest. Each node can retain the minimum disk required to run the system, while the processing workloads can access and utilize as much disk as they need for computation.
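In raw Kubernetes terms, this pattern resembles generic ephemeral volumes, where each pod claims its own scratch disk. The sketch below (the storage class and size are illustrative, and this is not Ascend’s actual configuration) shows the shape:

```python
# A pod spec fragment giving the workload its own ephemeral scratch volume,
# so disk spills don't compete with other pods on the node's shared disk.
pod_spec = {
    "containers": [{
        "name": "executor",
        "image": "apache/spark:3.5.0",
        "volumeMounts": [{"name": "scratch", "mountPath": "/tmp/spill"}],
    }],
    "volumes": [{
        "name": "scratch",
        "ephemeral": {
            "volumeClaimTemplate": {
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": "premium-ssd",  # hypothetical class
                    "resources": {"requests": {"storage": "200Gi"}},
                }
            }
        },
    }],
}
```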

Efficiently Harnessing Cloud Resources

In a world where cloud expenses can quickly spiral out of control, understanding and managing your resource usage becomes more important than ever. At Ascend, we don’t merely recognize this necessity; we act upon it and make it a cornerstone of our operations and our platform.

Controlling cloud costs while running large-scale data pipelines is no small feat, but our deep understanding of the cost drivers (storage, compute, networking, and retries) guides our successful optimization strategies. Ascend is continuously innovating and scouring the landscape of tools and cloud capabilities to maximize performance while minimizing the costs incurred in your cloud account.

Additional Reading and Resources