Our “Pepperdata Profiles” series shines a light on our talented individuals and explores employee experiences. This week, we talked to Kimoon Kim, a veteran Pepperdata software engineer. Kimoon elaborates on how Pepperdata products have evolved with the needs of the big data industry in his six years with the company, and he offers some predictions on the future of cloud cost management and big data performance.
Hey Kimoon! You’ve been at Pepperdata for six years. How have things evolved in that time?
Well, it’s been awesome throughout. Six years ago, we only had a macro concept of the product line we wanted to build. So a lot of discussions, creative activities, and brainstorming ensued. Then, when it was time to execute, everybody stepped up. We all worked together, like a family unit.
We quickly found excellent customers. You know, a lot of startups in the big data industry fail to do this, and then they disappear. Not us. And over time, we learned a lot from our customers’ experiences with our products, and we diligently applied those lessons back to our product lines. Their insights also helped us identify which of our initial ideas were mistakes that we had to drop or adjust over time. Now our product is far more mature.
Today, we’re trying to explore really adventurous markets in the big data industry. All the while, we still spend a lot of time maintaining our product quality, adding more features as we go. So yeah, overall, it’s been a very, very interesting experience!
What makes the Pepperdata product unique? As an expert engineer, what do you really value about working with the platform?
Before I answer the question, here’s a bit of background: Before joining Pepperdata, I was a search engine engineer across three different search engine companies, including Yahoo and Google.
When I was at Yahoo, I was part of the content team. We adopted the big data platform Hadoop, and we actually invested heavily in making Hadoop mature and production-ready. Yahoo sort of became the mothership of the big data industry at that point.
Back then, scalability was everything. We were hoping to achieve production-grade scalability: handling billions and billions of data items, terabytes of information. But because we focused so much on scalability, the big data platform became extremely difficult to use. Unless you got the setup and configuration exactly right, when you ran your big data workload, more often than not it was bound to fail.
Moreover, since this was open source, there was no detailed documentation for those configuration parameters. A further shocking revelation was that there was no way for the owner to see what was happening right before the workload failed. There was no visibility whatsoever, no way to steer what was happening.
Case in point: Back then, your typical big data workload ran for about eight hours. What often happened was, at the seven-hour mark, it just failed. You waited for seven hours, just for the workload to fail, without any idea why it happened. So you would tweak a random set of parameters, launch it again, wait for seven hours, all for it to fail again. Madness!
That old approach to big data performance must have been very frustrating?
It really was. The cycle was horrible, almost inhumane. The lack of visibility was killing the big data industry. A lot of people realized that the platform only worked about 10% of the time, and you had to pour in a lot of human hours, which cut down productivity. There wasn't much return on that investment. That was the main problem.
Then, Pepperdata came along and, somehow, offered the monitoring and visibility solution to the whole big data industry! I say “somehow” because, like I said, this was not our original thought. We built our product based on something else we thought was more important.
But when the initial customers used our product, they immediately said something along the lines of, “So you're saying your product works, but I can't see it. Give me a way to actually look at it.” And so we built this extra feature on our product, which was monitoring. That was a hit with our customers, and it evolved into a much more important product feature all on its own. Since then we’ve gone from monitoring to true observability, and we really own this space.
The big data industry evolves fast. Going forward, do you still see Pepperdata solving the same problems and providing the same value?
Yes. I mean, the basic visibility and observability will always be a necessary tool for customers to have. But once they have that tool, they realize that they can do more.
Customers make huge investments in their big data platforms. Those platforms are moving to the cloud, and in terms of cloud cost management, it's a huge investment. And so they want insights into that investment.
They ask themselves, “Is it paying off? Am I making the right investment? Is my ROI increasing? Or am I just wasting my money?” A lot of customers realize that this long-term planning based on insights is more important. And so they ask Pepperdata to evolve as an essential tool for that cost need as well.
Increasingly, the cost aspect of big data performance is crucial, isn’t it?
Indeed. Cloud cost management is crucial, but a lot of enterprises are really flying blind. The cloud has enormous capabilities, but it can have some dire consequences, too.
If you press the wrong button, you could easily double the cluster size by adding, say, 200 more nodes. And when the bill arrives, you see this absurdly huge amount you have to pay, with no idea how it happened. And that’s just one example; there are so many ways the OpEx model of the cloud can lead to massive overspend.
The ability of Pepperdata to control cloud spend must be a real feat of engineering?
Yes, it is a serious feat of engineering. But once we got the visibility up and running, we realized how powerful it would be.
The first thing we noticed when we got visibility into the big data platform was that there is so much waste. Looking at average big data cluster performance, at how much memory and CPU is actually being used, the percentage is so low. Often, utilization is only in the 15-20% range.
That was a shocking revelation, to the customers and to us as well. Think about it: They spend so much money building up these clusters by adding thousands and thousands of nodes, thinking they are running a huge amount of workloads. But almost the entire time, these workloads are just backlogged. The queue looks filled up and cluster performance seems to be fine, but the utilization still shows a low percentage.
Why does this happen? Because a lot of the workloads currently running claim way more resources than they actually use. For example, one task asks for a huge amount of memory, say, 20 gigabytes per node across all 5,000 nodes. But when you look at what it actually uses, it’s only about 2 gigs. So essentially, these workloads take away gigabytes of memory that other workloads could have used.
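To put rough numbers on that waste, here is a quick back-of-the-envelope calculation using the illustrative figures Kimoon mentions (a 20 GB request that actually uses about 2 GB, across 5,000 nodes; these are example numbers from the conversation, not measurements from a specific cluster):

```python
# Back-of-the-envelope estimate of reserved-but-unused memory,
# using the illustrative figures from the example above.
nodes = 5000                 # nodes in the hypothetical cluster
requested_gb_per_node = 20   # memory the task asks the scheduler for
actual_gb_per_node = 2       # memory the task actually uses

reserved_gb = nodes * requested_gb_per_node   # 100,000 GB held by the scheduler
used_gb = nodes * actual_gb_per_node          # 10,000 GB actually consumed
stranded_gb = reserved_gb - used_gb           # 90,000 GB other workloads cannot touch

print(f"Reserved {reserved_gb:,} GB, used {used_gb:,} GB, "
      f"stranded {stranded_gb:,} GB ({stranded_gb / reserved_gb:.0%} of the reservation)")
```

In this example, 90% of the reserved memory sits idle while still being counted against the cluster, which is exactly the kind of gap Kimoon describes.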
What did you do in terms of building the product, when you saw this utilization issue?
When we realized that, we really took it upon ourselves to try and help. This was when we doubled down on building a big data cloud performance management feature, one that looks at this waste and manipulates the scheduler of the big data cluster. This became Pepperdata Capacity Optimizer.
The feature does this by changing the data the scheduler sees, so that the scheduler is “tricked” into seeing the truth: these resources are not really being used by the workload that claimed them, and they can be freed up for other workloads that need them. With this, we increased cluster performance and utilization a lot.
Sometimes we double the current utilization; other times we push it close to 90%, as opposed to the previous 20%.
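As a rough sketch of the idea Kimoon describes (a simplified, hypothetical illustration, not Pepperdata's actual implementation; the class and function names here are made up), the key difference is whether free capacity is computed from what workloads reserved or from what they actually use:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    requested_gb: float   # memory reserved with the scheduler
    used_gb: float        # memory actually consumed right now

def free_capacity_by_reservation(node_capacity_gb: float, workloads: list[Workload]) -> float:
    """What a reservation-based scheduler believes is free on the node."""
    return node_capacity_gb - sum(w.requested_gb for w in workloads)

def free_capacity_by_usage(node_capacity_gb: float, workloads: list[Workload]) -> float:
    """What is actually free if you look at real usage instead of reservations."""
    return node_capacity_gb - sum(w.used_gb for w in workloads)

# Example: a 64 GB node running two over-reserving tasks.
tasks = [Workload(requested_gb=20, used_gb=2), Workload(requested_gb=20, used_gb=3)]
print(free_capacity_by_reservation(64, tasks))  # 24 GB "free" by reservation
print(free_capacity_by_usage(64, tasks))        # 59 GB free by actual usage
```

Feeding the scheduler the usage-based number instead of the reservation-based one is what lets additional workloads be packed onto nodes that would otherwise look full.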
And so Capacity Optimizer is a godsend for big data cloud performance management, because the same kind of waste happens across other cloud-based clusters. The cloud, as it is, has an auto-scaling capability. It looks at the backlog and adds more nodes based on the amount of work it sees waiting. But the current set of nodes is not fully utilized, and the new nodes won't be fully utilized either. And so again, the customer just ends up paying a lot more money for an idling cluster, which is stupid.
To fix this, we have recently launched Pepperdata Managed Autoscaling, which is built on top of Capacity Optimizer. With this feature, we improve the utilization of the current set of nodes. Only when they are maxed out do we add more nodes. The end effect is that you save a lot of money on your bill for the same kind of performance.
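A minimal sketch of that decision rule, again hypothetical rather than the product's real logic (the 90% threshold and function name are illustrative assumptions): scale out only when there is a backlog and the existing nodes are genuinely close to capacity, instead of scaling on backlog alone.

```python
def should_add_nodes(pending_work: int, avg_utilization: float,
                     max_utilization: float = 0.90) -> bool:
    """Add nodes only when there is a backlog AND the current nodes
    are close to saturation, rather than reacting to backlog alone."""
    return pending_work > 0 and avg_utilization >= max_utilization

# Backlog but nodes are mostly idle: pack work onto existing nodes first.
print(should_add_nodes(pending_work=120, avg_utilization=0.20))  # False
# Backlog and nodes are genuinely saturated: now it pays to add nodes.
print(should_add_nodes(pending_work=120, avg_utilization=0.93))  # True
```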
Looking at the big data industry in the coming years, have any predictions?
It's very clear that more companies will move their workloads to the cloud. Big data infrastructures are way too difficult for an average engineer to maintain, but the cloud makes it convenient because cloud providers maintain those clusters for their customers.
At the same time, however, some of these cloud vendors do not offer built-in visibility into big data performance, infrastructure, and workloads. And when people move to the cloud, they actually demand more visibility and transparency, because without that, companies could make the wrong investment and waste a lot of money. Pepperdata's services will be valuable there, going forward. Capacity Optimizer and Managed Autoscaling will be a particularly big help. That’s one trend.
Another trend we see is people trying to combine their big data infrastructure and non-big data infrastructure into a single cluster management stack using Kubernetes.
Often, they’re trying to move their workloads from Hadoop-based big data infrastructure to Kubernetes. Not everything will be able to make the transition, however, because some of the older application frameworks like MapReduce are very hard to move to Kubernetes. But newer ones like Spark or TensorFlow, and other AI workloads, will make the transition.
This is just the start. We see a number of our customers beginning to plan the transition and starting to make the initial move. At the same time, they're also asking Pepperdata to provide the same kind of visibility into this infrastructure and the same kind of cost-saving solutions, like Capacity Optimizer.
It’s a very, very exciting trend in the big data industry, and I’m happy with this development. More and more people will demand an easier set of tools, and I think Pepperdata will be a big help.
That’s brilliant, Kimoon, thank you. One last question, though: Any one project in Pepperdata you’re particularly proud of?
That’s a great question! Actually, there are a lot, considering most of our hugely important product features were accidental discoveries. Our brainstorming sessions lead to someone on our team prototyping a solution, which often becomes a winning product.
Take Capacity Optimizer as an example. Honestly, the idea was born out of a joke. Four of us were walking to a coffee shop when someone joked about how stubborn big data schedulers were for not dealing with resources properly. Then someone else suggested changing the data structure that the scheduler uses so we can basically fool the scheduler into “doing the right thing.” Fool the scheduler; sounds weird, right? But one of the engineers built the prototype within three days, and lo and behold, it worked!
It just shows that creativity, a willingness to try anything, and a collaborative spirit yield great results.