Issue #5: Reducing your Cloud Bill the Common Sense Way
Hello, and welcome to the 5th issue of The Digital Atmosphere. I am Manu.
I know I have been irregular in sending out issues in the last few months — sorry about that. As they say, life gets in the way of well-laid plans, fortunately or unfortunately. But the things that knocked me off course are hopefully out of the way now, and I should be more regular in sending out editions of this newsletter.
Anyways, a long (long!) time ago, in the last edition, I had started discussing ways in which your cloud computing bills might be optimized. I say optimized and not reduced because there is always a chance that you are already running an optimized deployment, in terms of your cloud computing assets, that cannot be trimmed any further. In that case, congratulations! You can probably skip the rest of this newsletter.
However, for the rest of us deploying applications in public clouds, this is typically not the case. Given how easy most cloud providers have made it to provision resources, it is not unreasonable to expect that you are paying more for your cloud infrastructure than you would like to. And if this has become a significant spend in your organization, then optimizing these costs becomes necessary.
The path to optimizing cloud costs starts with knowing your workloads and then figuring out the infrastructure that can most efficiently serve the set of users you have while meeting the performance, security, reliability and availability promises that you make to your customers. However, after getting some practical experience with cloud infrastructure management over the last couple of years, I have come to the conclusion that most organizations, if they are not careful, end up paying for more cloud resources than they actually need. The reasons for this are manifold, and almost all of them have to do with culture and processes more than with engineering decisions themselves. Allow me to explain.
One of the worst “features” that ClickOps has provided is the extremely low barrier to creating cloud resources. And to some extent, there is some truth to the feature aspect. A browser-based console lets you spawn VMs, managed DB instances, load balancers, and LLM inference endpoints located in a datacenter halfway across the world, in a matter of seconds. Whatever you need, you can create right away. This can provide a surprising productivity boost to developers who are deploying software or creating experimental pipelines and architectures.
The benefits are obvious. One does not have to spend hours assembling, installing, and configuring hardware and software. At the push of a button, in all likelihood, what you wanted will be available and just work. But with great power comes great responsibility, and the latter is much harder for most of us to imbibe. If you give a large number of people the ability to create cloud resources at the click of a button, let there be no doubt that they will create more resources than they actually need. There is nothing malicious about this; it’s just human nature. Creating something you need now, right away (assuming that you have the ability to do it), is infinitely easier than going to someone and checking whether there is an existing resource you could use for the purpose. And before you realize it, this “feature” will be responsible for significant increases in your cloud bill.
Let me tell you a horror story in 70 words, from a Dev/Ops person’s point of view.
I need to create a new service endpoint.
I don’t know if it can be deployed on any existing resources?!
Wait, but I have access to the Cloud Console for the company’s provider!
Great - I’ll go ahead and create an asset to deploy this tiny little container to test the service.
(And promptly forget about it, allowing it to potentially live forever, adding to the company’s already massive cloud bill.)
So, what’s the first thing you should do to contain your cloud bill?
Takeaway 1: Restrict the number of people who can create cloud resources on your company’s billing account.
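What this looks like in practice depends on your provider, but to make it concrete, here is a minimal sketch of the idea assuming AWS as the provider and boto3 as the SDK. The group name, the policy name, and the specific denied actions are placeholders of my own choosing; GCP org policies or Azure custom roles can express the same restriction.

```python
# A minimal sketch of Takeaway 1, assuming AWS and boto3.
# "developers" and "DenyResourceCreation" are made-up names for illustration.
import json

import boto3

iam = boto3.client("iam")

# Explicitly deny the "create" actions that most often add to the bill,
# while leaving read-only access untouched.
deny_create_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "ec2:RunInstances",
                "rds:CreateDBInstance",
                "elasticloadbalancing:CreateLoadBalancer",
                "sagemaker:CreateEndpoint",
            ],
            "Resource": "*",
        }
    ],
}

policy = iam.create_policy(
    PolicyName="DenyResourceCreation",
    PolicyDocument=json.dumps(deny_create_policy),
)

# Attach the deny policy to the broad "developers" group; only a small
# platform/ops group keeps the ability to create billable resources.
iam.attach_group_policy(
    GroupName="developers",
    PolicyArn=policy["Policy"]["Arn"],
)
```

The point is not the specific SDK calls, but that creating billable resources becomes an explicit, narrowly granted permission rather than the default for everyone.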
Alright. Let’s say that the above is done. Only a few people in the organization can create cloud resources. But there will be many who need to use various types of resources, for example to deploy newly created services or carry out performance or load tests, among other things. So, if someone in the organization is in need of a resource, say a VM to deploy a container, they would probably create a new VM only if they don’t know whether there already exists a VM they can use for the purpose. Most people want to get the job done quickly, which means they want access to the cloud resource they need right away, without going through approvals up a chain of command. And if such a mechanism is in place (which could be as simple as a shared spreadsheet), they might be more inclined to re-use existing resources (which already contribute to the organization’s cloud bill) than to create something new.
But this also assumes a few other things. First, anyone who is in need of a certain type of resource can view the available inventory of cloud resources, which in turn means that the organization has a mechanism in place for making a list of all such resources visible to folks in, say, the Dev or DevOps groups. Second, assuming that such a list exists, folks can quickly get access to the resource they have identified. This can be done either by talking to the right people (e.g. the owners of the resource) or by accessing a shared document/webpage from where they can get the credentials for using the resource, and announcing the same to its owners. Again, the exact policy might differ from one organization to the next, but the fact remains that some amount of transparency needs to exist across the organization regarding available cloud resources, along with clearly defined policies on how to access them.
For this to work, an inventory of all cloud resources, with their utilization levels, needs to be carried out frequently. This serves two purposes. First, it ensures that all cloud resources are accounted for, and that there exist no resources which were created once and forgotten forever. It also helps ensure that the purpose of each resource’s existence is accounted for and that there are no “surprises”. Second, a periodic inventory helps drive a better understanding of which resources are being used and how well, which in turn allows for better utilization of potentially under-utilized resources. Conversely, it can also provide feedback about which resources are under pressure and might need more instances to meet the performance/availability/reliability guarantees for the application.
Takeaway 2: Inventory your cloud assets. Make the inventory available to the folks concerned; put processes in place for making those resources accessible to the people who’ll need them.
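To make the inventory idea concrete, here is a rough sketch, again assuming AWS and boto3, that lists running VMs along with their owner/purpose tags and a week of average CPU utilization. The Owner and Purpose tag names are a convention I am assuming, not anything the provider enforces, and a real inventory would of course cover every billable resource type, not just VMs.

```python
# A minimal inventory sketch for Takeaway 2, assuming AWS and boto3.
# Lists running EC2 instances with their (assumed) Owner/Purpose tags and
# their average CPU utilization over the last 7 days.
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            instance_id = instance["InstanceId"]
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}

            # Average CPU over the last week, one datapoint per day.
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                StartTime=start,
                EndTime=end,
                Period=86400,
                Statistics=["Average"],
            )
            datapoints = stats["Datapoints"]
            avg_cpu = (
                sum(d["Average"] for d in datapoints) / len(datapoints)
                if datapoints
                else 0.0
            )

            print(
                f"{instance_id} type={instance['InstanceType']} "
                f"owner={tags.get('Owner', 'UNKNOWN')} "
                f"purpose={tags.get('Purpose', 'UNKNOWN')} "
                f"avg_cpu_7d={avg_cpu:.1f}%"
            )
```

Even something this crude, run weekly and dumped into that shared spreadsheet, surfaces the forgotten VMs and the under-utilized ones.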
Then comes the human side of the optimization operations, which has more to do with an individual’s psychology towards money. Which is to say that you’d probably be less careful with spending money that is not coming from your own bank account. I’d bet good money that even if you know that creating cloud resources costs money, you’d be more willing to do it if it is someone else’s money (in this case, the organization’s) being spent rather than your own. And even if you are raking in all the VC ₹₹s, I believe it’s your responsibility to spend that money well. Well, it’s 2023 and there aren’t that many VC ₹₹s going around to begin with, but I digress.
It all boils down to having a culture of “treat the organization’s money as if it’s your own”. For starters, the organization itself needs to be fiscally responsible with its own money. In order to keep cloud costs in check, organizations should have a very good sense of how much money they can spend on the cloud, aka a budget. Once a budget is in place, it needs to be ensured that new cloud resources are instantiated only when the existing ones are being fully utilized. Well, as fully as they practically can be. Simply put, the organizational culture should not be to instantiate new resources willy-nilly just because you can, but to instantiate what is required only when none of the available resources can get the job done. That 15th VM should not be instantiated when all your workloads can be packed into three, with room to spare.
Takeaway 3: Make sure you have a Cloud Budget. Instantiate only what you need, after “fully” utilizing what you have already instantiated.
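And to close the loop on the budget, here is a small sketch, once more assuming AWS and boto3, that compares month-to-date spend against a budget figure. The budget number is a placeholder; most providers (AWS Budgets, for instance) can track this natively and send alerts, so treat this only as an illustration of the check you want somewhere in your process.

```python
# A minimal sketch for Takeaway 3, assuming AWS and boto3: compare
# month-to-date spend against a budget figure. MONTHLY_BUDGET_USD is a
# placeholder your organization would set.
from datetime import date

import boto3

MONTHLY_BUDGET_USD = 5000.0  # assumed budget for illustration

ce = boto3.client("ce")  # Cost Explorer

today = date.today()
month_start = today.replace(day=1)

# Note: the End date is exclusive, so run this after the 1st of the month.
response = ce.get_cost_and_usage(
    TimePeriod={
        "Start": month_start.isoformat(),
        "End": today.isoformat(),
    },
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
)

spend = sum(
    float(period["Total"]["UnblendedCost"]["Amount"])
    for period in response["ResultsByTime"]
)

print(f"Month-to-date spend: ${spend:,.2f} of ${MONTHLY_BUDGET_USD:,.2f}")
if spend > 0.8 * MONTHLY_BUDGET_USD:
    print("Over 80% of the monthly cloud budget used; time to look at utilization.")
```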
Optimizing cloud costs is an ongoing exercise, one that evolves as your deployment grows to include more features and services and to serve more users. And constant vigilance is the price that companies will have to pay to ensure that their cloud bills remain in check.