In this blog series we’ll be examining the Five Myths of Apache Spark Optimization. (Stay tuned for the entire series!)
The first myth reflects a common assumption among Spark users: observing and monitoring your Spark environment means you’ll be able to find the wasteful apps and tune them.
Certainly, identifying wasteful apps and tuning them for greater efficiency is a great starting point for cost optimization. This effort typically involves deploying services like Amazon CloudWatch or third-party application monitoring tools, which analyze running applications and propose settings or other configuration changes for increased efficiency. Some observability and monitoring solutions even provide specific tuning recommendations for individual applications, such as “change spark.driver.memory from 6g to 4g.”
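As a rough illustration, applying a recommendation like that one might look like the following PySpark sketch. The property names are real Spark settings, but the application and values are hypothetical; note also that driver memory is normally fixed at submit time (e.g., spark-submit --driver-memory 4g), since the driver JVM is already running by the time application code executes.

```python
from pyspark.sql import SparkSession

# Hypothetical job applying the recommendation "change spark.driver.memory
# from 6g to 4g." The property names are real Spark settings; the app name
# and values are illustrative only.
spark = (
    SparkSession.builder
    .appName("nightly-report")               # hypothetical application
    .config("spark.driver.memory", "4g")     # was 6g before the recommendation
    .config("spark.executor.memory", "8g")   # other settings left unchanged
    .getOrCreate()
)
```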
However, observability and monitoring tools do not actually eliminate the waste, and they certainly do not do so automatically. These tools can surface problems and often generate recommendations for remediation, but they do not implement the fixes.
This creates a gap, because finding waste is not fixing waste.
Tuning recommendations generated by observability and monitoring tools often translate into a lengthy to-do list, especially as the number of applications grows. What should you do with such a list? Most organizations go back to their developers, list in hand.
The Challenges of Implementing Recommendations
Implementing manual tuning recommendations requires significant effort and is a primary pain point for developers, according to the FinOps Foundation. Because developers are generally not responsible for the cost of their applications, asking them to adjust configurations to minimize cost can seem outside their scope of work. Developers may even be reluctant to tweak something that seems to be running well out of fear of breaking it, following a completely reasonable mindset of “if it ain’t broke, don’t fix it.” And no developer, no matter how dedicated, can keep pace with the real-time dynamism of modern applications and their volatile, ever-changing resource requirements.
Even assuming an army of developers is at the ready to tweak and tune Spark applications in real time, that still doesn’t solve the problem of waste inside the application. That waste stems from how a Spark application actually uses its resources over time.
With Spark Applications, Provisioning Often Means Overprovisioning
Many Spark applications utilize resources in a highly dynamic way because the data that these applications process is often bursty. Even the way they interact with other concurrently running applications may impact their resource requirements. As a result, the resource utilization profile for a typical Spark application might look like this:
Figure 1: The resource utilization of a Spark application can vary dramatically over time.
As is evident from this graph, most applications run at peak resource utilization for only a small fraction of their execution time.
Figure 2: Most applications reach peak resource utilization for only a small fraction of their runtime.
Spark developers are required to request a fixed allocation of memory and CPU for each of their applications. To prevent their applications from being killed due to insufficient resources, developers typically request enough memory and CPU to accommodate peak usage (and often a little extra on top, just to be safe).
Figure 3: Developers must allocate memory and CPU for each of their Spark applications. To prevent applications from being killed due to insufficient resources, they typically request resources that accommodate peak usage.
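For a sense of what that looks like in practice, here is a minimal sketch of a peak-oriented resource request, assuming a hypothetical batch job. The property names are real Spark settings; the values are invented for illustration.

```python
from pyspark.sql import SparkSession

# Hypothetical static allocation sized for peak demand. The job may need
# 16g per executor only during a few shuffle-heavy minutes, but the
# developer requests 16g (plus headroom) for the entire run.
spark = (
    SparkSession.builder
    .appName("peak-sized-batch-job")                # hypothetical name
    .config("spark.executor.instances", "20")       # enough executors for the burst
    .config("spark.executor.memory", "16g")         # sized for peak, not average
    .config("spark.executor.cores", "4")
    .config("spark.executor.memoryOverhead", "2g")  # extra headroom, just to be safe
    .getOrCreate()
)
```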
Some cost-conscious developers might make an effort to lower the provisioning line as far as possible, so that it just meets the application’s peak resource requirement.
Figure 4: From a cost-cutting perspective, the best a developer can do via manual tweaking and tuning is to reduce the resource request to match the peak of what an application requires. However, since most applications do not run at peak all the time, significant waste often remains.
However, even if a developer reduces the allocation to match the application’s peak requirement, they cannot “bend the allocation line” to follow actual usage as it rises and falls in real time. As a result, tweaking and tuning alone cannot eliminate the waste.
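To make that remaining waste concrete, here is a minimal back-of-the-envelope sketch with invented utilization numbers. Even when the allocation exactly matches the observed peak, the area between the flat allocation line and the fluctuating usage curve is paid for but never used.

```python
# Hypothetical per-minute memory utilization (GB) for one application run.
usage_gb = [4, 5, 6, 14, 16, 15, 6, 5, 4, 4]

allocation_gb = max(usage_gb)  # best case for manual tuning: match the peak

# Waste is the area between the flat allocation line and the usage curve.
total_gb_minutes = allocation_gb * len(usage_gb)
waste_gb_minutes = sum(allocation_gb - u for u in usage_gb)

print(f"Allocated: {total_gb_minutes} GB-minutes")
print(f"Wasted:    {waste_gb_minutes} GB-minutes "
      f"({waste_gb_minutes / total_gb_minutes:.0%})")
# With these made-up numbers, roughly half the allocation is wasted.
```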
Finding Waste ≠ Fixing Waste
Identifying wasteful apps is a great starting point for cost optimization, but observability and monitoring tools don’t fix waste, and attempts to remediate it through manual tweaking and tuning can only go so far. To peek ahead at a solution, check out this page on Pepperdata Apache Spark Optimization.
In our next blog entry in this series, we’ll examine the second myth, which centers around cluster autoscaling. Stay tuned!