"The power of creativity, the willingness to try out anything, along with a collaborative nature, yields great results."
Hey Kimoon! You’ve been at Pepperdata for ten years! How have things evolved in that time?
Well, it’s been awesome throughout. When we first got started, we only had a macro concept of the product we wanted to build. A lot of discussions, creative activities, and brainstorming ensued. Then, when it was time to execute, everybody stepped up. We all worked together, like a family.
We quickly found excellent customers. You know, a lot of startups in our industry fail to do this, and then they disappear. Not us. Over time, we learned a lot from our customers’ experiences with our products, and we diligently applied those lessons back to our product lines. Their insights also helped us identify which of our initial ideas were mistakes that we had to drop or adjust. Now our product is far more mature.
Today, we’re trying to explore really adventurous markets. All the while, we still spend a lot of time maintaining our product quality, adding more features as we go. Overall, it’s been a very, very interesting experience!
What makes the Pepperdata product unique? As an expert engineer, what do you really value about working with the platform?
Before I answer the question, here’s a bit of background. Before joining Pepperdata, I was a search engine engineer across three different search engine companies, including Yahoo and Google.
When I was at Yahoo, I was part of the content team. We adopted the big data platform Hadoop, and we actually invested heavily in making Hadoop mature and production-ready. Yahoo sort of became the mothership of the big data industry at that point.
Back then, scalability was everything. We were hoping to achieve production-grade scalability: handling billions and billions of data items, terabytes of information. But because we focused so much on scalability, the big data platform became extremely difficult to use. One shocking revelation was that there was no way for the workload owner to see what was happening right before a big data workload failed. There was no visibility, no drivability whatsoever.
Here's an example. Back then, your typical big data workload ran for about eight hours. Often, at the seven-hour mark, it would fail. You'd wait for seven hours just for the workload to fail, without any idea why it happened. So you would tweak a random set of parameters, launch it again, wait for seven hours, all for it to fail again. Madness!
That old approach to big data performance must have been very frustrating!
It really was. The cycle was horrible, almost inhumane. The lack of visibility was killing the big data industry. A lot of people realized that the platform only worked about ten percent of the time, and you had to spend so many human hours trying to get it to work that productivity suffered. There wasn't much return on that investment. That was the main problem.
Then Pepperdata came along and offered an innovative monitoring and observability solution to everyone operating large-scale data systems! However, this was not our original product concept. We'd actually built our product around something else. But when the initial customers used our product, they immediately said something along the lines of, “So you're saying your product works, but I cannot see it. Let me get a solution to actually look at it.”
And so we built an extra feature into our product: monitoring. That was a hit with our customers, and it evolved into a much more important product feature all on its own. Since then, we’ve gone from monitoring to observability and on to optimization, and we really own this space.
The cost of running large-scale compute workloads is becoming increasingly crucial, isn’t it?
Indeed. Cloud cost management is essential, but a lot of enterprises are really flying blind. The cloud offers enormous capability, but it can have some dire consequences, too. If you press the wrong button, you could easily double your cluster size by adding, say, 200 more nodes. Then the bill arrives with an absurdly huge amount to pay, and you might have no idea how it happened. And that’s just one example; there are so many ways that the OpEx model of the cloud can lead to massive overspend.
Customers make huge investments in their big data platforms. As those platforms move to the cloud, cloud cost management becomes a huge part of that investment, and customers want insight into where the money is going.
The ability of Pepperdata to control cloud spend must be a real feat of engineering.
Yes, it is a serious feat of engineering. But once we got our observability tools up and running, we realized how powerful they would be.
The first thing we noticed was that there's so much waste inside applications like Apache Spark. When you look at average cluster performance, how much memory and CPU are actually being used, the percentage is often very low. Often, we find that utilization is below twenty percent.
That was a shocking revelation, to customers and to us as well. Customers spend so much money building up these clusters, adding thousands and thousands of nodes, thinking they are running a huge number of workloads. But almost the entire time, those workloads are just backlogged. The queue looks full and cluster performance seems fine, yet utilization still shows a low percentage.
Why does this happen? Because a lot of workloads claim far more resources than they actually use. For example, one task asks for a huge amount of memory, say, 20 gigabytes per node across all 5,000 nodes. But when you look at what it actually uses, it’s only about 2 gigabytes. So essentially, these workloads take away gigabytes of memory that other workloads could have used. That's waste that customers are paying for all the time.
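To make that overallocation concrete, here is a minimal PySpark sketch. The application name and the numbers are illustrative only, not taken from any real customer; the point is that `spark.executor.memory` is a reservation the cluster manager honors whether or not the job ever touches that memory.

```python
from pyspark.sql import SparkSession

# Illustrative overallocated request: each executor reserves 20 GB from the
# cluster manager, even though observed peak usage is closer to 2 GB. The
# unused ~18 GB per executor stays "claimed" and unavailable to other jobs.
spark = (
    SparkSession.builder
    .appName("etl-job")                       # hypothetical job name
    .config("spark.executor.memory", "20g")   # requested, mostly idle
    .config("spark.executor.cores", "4")
    .getOrCreate()
)

# A right-sized request, based on what the job actually uses plus some
# headroom, would return the difference to the pool, e.g.:
#   .config("spark.executor.memory", "3g")
```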
What did you do in terms of building the product, when you saw this utilization issue?
When we realized that, we really took it upon ourselves to try and help. That's when we doubled down on building a cloud performance management feature that looks at this waste and engages with the scheduler to essentially eliminate it. That concept became Pepperdata Capacity Optimizer.
Capacity Optimizer does this by correcting the data the scheduler sees, so that the scheduler knows the truth: these resources are not really being consumed by the workload that claimed them, and they can be freed up and used by other workloads that need them. With this, we increased cluster performance and utilization a lot. Sometimes we double the current utilization; other times we push it close to ninety percent, as opposed to the previous twenty percent.
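As a rough illustration of the idea only, not of how Capacity Optimizer is actually implemented, the toy Python sketch below compares what workloads have reserved on a node against what monitoring observes them actually using, and reports how much capacity could be offered back to the scheduler. All names, numbers, and the safety margin are hypothetical.

```python
# Toy illustration of the idea only -- not the product's implementation.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    reserved_gb: float   # what the workload asked the scheduler for
    used_gb: float       # what monitoring observes it actually using

def reclaimable_gb(workloads, safety_margin_gb=1.0):
    """Memory the scheduler could hand to pending work, in this toy model."""
    return sum(
        max(w.reserved_gb - w.used_gb - safety_margin_gb, 0.0)
        for w in workloads
    )

node = [
    Workload("etl-job", reserved_gb=20.0, used_gb=2.1),
    Workload("report-job", reserved_gb=8.0, used_gb=6.5),
]

print(f"Reclaimable on this node: {reclaimable_gb(node):.1f} GB")
# -> Reclaimable on this node: 17.4 GB
```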
And so Capacity Optimizer is a godsend for cloud cost optimization.
In addition, the cloud has an autoscaling capability. It looks at the backlog and adds more nodes based on the amount of work it sees waiting. But if the current set of nodes is not fully utilized, the newly added nodes are not going to be fully utilized either. And so again, the customer just ends up paying a lot more money for an idling cluster.
To fix this, we recently launched a managed autoscaling feature on top of Capacity Optimizer. With this feature, we improve the utilization of the current set of nodes, and only when they are maxed out do we add more nodes. The end effect is that you save a lot more money on your bill for the same kind of performance.
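The decision logic described here can be sketched roughly as follows. The function name and thresholds are hypothetical and do not represent the actual managed autoscaling feature; the point is simply that a backlog alone should not trigger scale-out while the existing nodes sit mostly idle.

```python
# Hypothetical sketch of utilization-aware scale-up logic.
def should_add_nodes(avg_cpu_util, avg_mem_util, pending_tasks,
                     util_threshold=0.85):
    """Only scale out when the existing nodes are genuinely saturated."""
    nodes_maxed_out = (avg_cpu_util >= util_threshold
                       or avg_mem_util >= util_threshold)
    return pending_tasks > 0 and nodes_maxed_out

# A backlog of 150 tasks with ~20% CPU utilization: a naive autoscaler would
# add nodes, but this check keeps the cluster size flat until nodes fill up.
print(should_add_nodes(avg_cpu_util=0.20, avg_mem_util=0.35, pending_tasks=150))
# -> False
```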
Looking at cloud computing in the coming years, have any predictions?
It's very clear that more companies will move their workloads to the cloud. Large-scale data infrastructures are way too difficult for an average engineer to maintain, but the cloud makes it convenient because cloud providers manage and maintain those clusters for their customers.
At the same time, however, some of these cloud vendors do not offer built-in observability into workload performance and infrastructure. And when people move to the cloud, they actually demand more observability and transparency, because without that, companies could make the wrong investment and waste a lot of money. Pepperdata will be essential for that going forward, and Capacity Optimizer will be a particularly big help there. That’s one trend.
Another trend we see is people trying to migrate to Kubernetes. Often, they’re trying to move their workloads from Hadoop-based on-prem clusters to Kubernetes in the cloud. Not everything will make the transition, however, because some application frameworks, like MapReduce, are very hard to move to Kubernetes. But frameworks like Spark or TensorFlow are easier to move. We see a number of our customers beginning to plan the transition to Kubernetes and starting to make the initial move. At the same time, they're also asking Pepperdata to provide the same kind of observability into this infrastructure and the same kind of cost-saving solution, like Capacity Optimizer.
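For Spark specifically, the move can in principle be as simple as pointing the application at a Kubernetes API server instead of YARN. The minimal PySpark sketch below assumes a reachable cluster endpoint, a pre-built Spark container image, and a `spark` service account; the endpoint, image name, and account are placeholders, not values from any real deployment.

```python
from pyspark.sql import SparkSession

# Sketch of running a Spark application against Kubernetes instead of YARN.
# The API server URL, container image, and service account are placeholders.
spark = (
    SparkSession.builder
    .appName("spark-on-k8s-example")
    .master("k8s://https://kubernetes.example.com:6443")
    .config("spark.kubernetes.container.image", "myrepo/spark:3.5.0")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

df = spark.range(1_000_000)
print(df.count())
spark.stop()
```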
It’s a very, very exciting trend in the industry, and I’m happy with this development. More and more people will demand an easier toolset and cost optimization so they can move more workloads to the cloud and to Kubernetes, and I think Pepperdata will be a big help.
That’s brilliant, Kimoon, thank you. One last question, though: Any one project in Pepperdata you’re particularly proud of?
That’s a great question! Actually, there are a lot, considering most of our hugely important product features were almost serendipitous discoveries. A brainstorming session leads to someone on our team prototyping a solution, which, over time and with lots of refinement through customer feedback, often becomes a winning product.
It just shows that the power of creativity, the willingness to try out anything, along with a collaborative nature, yields great results.
The views expressed on this blog are those of the author and do not necessarily reflect the views of Pepperdata. Any solutions offered by the author are for illustration purposes only and are not part of the commercial solutions or support offered by Pepperdata.