Several years ago I joined a company that was in the middle of a frantic architectural transition that prioritized speed over cost. During my first few months I watched our Amazon Web Services bill creep from $100K per month to over $350K. Our CEO at first accepted the growing costs as a necessary one-time expense, but when the AWS bill surpassed $350K, he realized we were putting our business at risk. The result: a new mandate to optimize the cost, usage, performance, availability and security of our cloud infrastructure.
What followed in the next year was a journey that took my team through what I call the Four Stages of Cloud Optimization. By the end of the journey we had slashed our monthly AWS expenses in half while simultaneously doubling our customer base. We also had learned how to effectively and securely operate thousands of cores of compute and petabytes of storage in the cloud with high efficiency.
In this article, I’d like to outline the Four Stages of Cloud Optimization so that others can assess where they are and what is required to reach the next level.
Phase 1 - Chaos
Chaos is exemplified by a lack of visibility and controls. An organization at this phase will often be relying on the best intentions of individuals to make effective use of the provisioned infrastructure. Common signs you might be in chaos include:
- Lack of understanding of the reasons for your growing AWS expenses.
- A nagging sense of waste in your usage of cloud infrastructure.
- No agreed-upon standards for security.
- Limited or no use of automation.
- Need for management to ask operations engineers for data to support business decisions.
- Limited or no monitoring to identify critical issues across cloud usage and performance.
- Little or no usage of cost optimization opportunities available from a cloud provider (e.g. reserved instances).
- Little or no usage of cloud-only features (e.g. burstable instance types, spot nodes).
There are many reasons why an organization might find itself in chaos, including unplanned growth, lack of supporting tools, decentralized organizational adoption, and/or limited cloud expertise. Chaos is particularly unforgiving for organizations that are growing rapidly or have highly dynamic workloads. Those who do not find a way out of chaos frequently become “cloud dropouts,” forsaking the cloud for the familiarity of physical infrastructure and/or traditional hosting providers.
The first step toward leaving chaos is to increase management-level visibility into your cloud infrastructure. Several years ago at my previous company, the tools available in the market were limited, so we had to invest heavily in internal tools and scripts. Today, however, there are a variety of open source and commercial tools that provide at least a partial increase in visibility, enough to let you reach the next phase: consolidation.
Phase 2 - Consolidation
Once an organization achieves visibility over its infrastructure, it can begin consolidation. Consolidation is often a high impact phase within an organization, and involves the elimination of waste and the start of standardization. The types of activities that occur during this phase include:
- Removal of obsolete infrastructure (e.g. unused EBS volumes, snapshots, orphaned instances).
- Review / cleanup of security policies.
- Initial use of features available in cloud for cost optimization (e.g. purchasing reserved instances to optimize compute costs).
- Basic reporting of allocation of cloud costs across key business perspectives.
- Increased visibility into the cost, availability, performance and/or security of the infrastructure for other stakeholders across the organization.
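As one illustration of the waste-removal activity in the list above, a minimal Python sketch might flag unattached volumes in an inventory export and estimate what they cost. The record fields (`VolumeId`, `State`, `Size`, `Attachments`) loosely follow the shape returned by the EC2 `DescribeVolumes` API; the sample inventory and the price per GB-month are assumptions made up for this example, not real figures.

```python
from typing import Dict, List

# Hypothetical price per GB-month, for illustration only.
ASSUMED_PRICE_PER_GB_MONTH = 0.10


def find_unattached_volumes(volumes: List[Dict]) -> List[Dict]:
    """Return volumes that are not attached to any instance."""
    return [
        v for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def estimated_monthly_waste(
    volumes: List[Dict],
    price_per_gb: float = ASSUMED_PRICE_PER_GB_MONTH,
) -> float:
    """Estimate the monthly cost of the given volumes from their size in GB."""
    return round(sum(v["Size"] for v in volumes) * price_per_gb, 2)


# Made-up inventory records standing in for a real export.
inventory = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Size": 500,
     "Attachments": [{"InstanceId": "i-111"}]},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 200, "Attachments": []},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 50, "Attachments": []},
]

unused = find_unattached_volumes(inventory)
print([v["VolumeId"] for v in unused])
print(estimated_monthly_waste(unused))
```

A report like this is only a starting point: a volume that looks orphaned may still hold data someone needs, so a review step before deletion is a reasonable precaution.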
Achieving consolidation can have a substantial impact on your organization, but it takes constant vigilance not to fall back into chaos. Although your visibility into cloud usage has increased, you still rely on individuals to maintain best practices. Stopping your journey at this phase also leaves substantial gaps in the cost, performance, availability and security of your cloud infrastructure.
Here are some signs you are in the consolidation phase:
- Unsure whether you are using the infrastructure you have provisioned cost-effectively.
- Unsure which instance types will give you the best price-to-performance ratio for specific application workloads.
- Unable to calculate cost per customer, or some other business perspective that requires correlation with data outside of a cloud provider.
- No authoritative inventory of what data you have in the cloud (e.g. S3 objects, snapshots) and whether it is being used.
- Limited or no understanding of the different workloads you operate in the cloud.
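The cost-per-customer item in the list above comes down to joining billing data with data the cloud provider does not have. The following is a minimal Python sketch of that join, assuming costs have already been grouped by a hypothetical `product` tag and that customer counts per product come from your own systems. All names and numbers here are invented for illustration.

```python
from typing import Dict, Optional


def cost_per_customer(
    cost_by_product: Dict[str, float],
    customers_by_product: Dict[str, int],
) -> Dict[str, Optional[float]]:
    """Divide each product's monthly cloud cost by its customer count.

    Products with no known customers map to None so that unallocated
    spend stays visible instead of being silently dropped.
    """
    result: Dict[str, Optional[float]] = {}
    for product, cost in cost_by_product.items():
        customers = customers_by_product.get(product, 0)
        result[product] = round(cost / customers, 2) if customers > 0 else None
    return result


# Made-up monthly costs grouped by a hypothetical "product" tag.
monthly_cost = {"search": 120000.0, "billing": 30000.0, "untagged": 5000.0}
# Made-up customer counts from a hypothetical internal CRM export.
customer_counts = {"search": 400, "billing": 150}

report = cost_per_customer(monthly_cost, customer_counts)
print(report)
```

Keeping the `untagged` bucket in the report is a deliberate choice in this sketch: the size of that bucket is itself a measure of how much spend cannot yet be tied to the business.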
While an investment in open source tools and scripts might take your organization to the consolidation phase, it will almost certainly not allow you to achieve standardization. Standardization requires a solution that does the following:
- Integrates data from your existing tools (e.g. Chef, Puppet, Nagios, PagerDuty).
- Provides insight from both a hypervisor-level perspective (e.g. Amazon CloudWatch metrics) and an operating system perspective (e.g. Linux kernel metrics).
- Provides information in the context of your business.
Phase 3 - Standardization
One of the primary objectives at this phase is to identify a standard operating environment for your infrastructure. A standard way to deploy infrastructure is often referred to as a reference architecture, and represents a blueprint that drives the provisioning and operations of your infrastructure. For example, whereas in phase 2 you might focus on ensuring all the nodes in your Cassandra clusters are being utilized, in phase 3 you are focused on standardizing the clusters for optimum cost, performance and availability of your specific workload. Some items you might standardize include:
- Critical performance drivers for workloads.
- Right instance types to optimize cost and performance.
- Usage of attached storage, including specific IOPS.
- Usage of availability zones for reliability.
- Planned operating characteristics (e.g. operate EBS volumes at 55-80% disk utilization for cost efficiency).
- Usage and lifecycle of data in the cloud.
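A planned operating characteristic such as the 55-80% disk utilization band mentioned above can be expressed as a simple check. This is a minimal Python sketch; the function name, labels, and sample volumes are assumptions for illustration, and the band should be whatever your own reference architecture specifies.

```python
from typing import Dict, Tuple

# Target band taken from the example in the text (55-80% disk utilization).
TARGET_LOW = 0.55
TARGET_HIGH = 0.80


def classify_utilization(
    used_gb: float,
    provisioned_gb: float,
    low: float = TARGET_LOW,
    high: float = TARGET_HIGH,
) -> str:
    """Label a volume relative to the target utilization band."""
    if provisioned_gb <= 0:
        raise ValueError("provisioned_gb must be positive")
    ratio = used_gb / provisioned_gb
    if ratio < low:
        return "under-utilized"   # paying for space that is not used
    if ratio > high:
        return "over-utilized"    # little headroom left before the disk fills
    return "within target"


# Made-up volumes: name -> (used GB, provisioned GB).
volumes: Dict[str, Tuple[float, float]] = {
    "data-1": (100, 500),
    "data-2": (350, 500),
    "data-3": (450, 500),
}

labels = {name: classify_utilization(used, size) for name, (used, size) in volumes.items()}
print(labels)
```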
Achieving standardization requires collecting and analyzing a large amount of disparate data from across your supporting systems and tools. There are two options available for this today: an internally developed solution or a commercial solution such as CloudHealth. Many of the early technical pioneers of large-scale cloud computing (e.g. Netflix, Engine Yard, Heroku) built substantial internal systems to reach phase 3. Increasingly, though, commercial products like CloudHealth let organizations purchase services and products that eliminate the need for expensive and labor-intensive internal systems.
Phase 4 - Optimization
Optimization is about automating and refining the standardization of your infrastructure. It is also about taking advantage of the unique characteristics of the cloud. Companies at this stage have a holistic business view of the cloud ecosystem, with tools that facilitate optimization, standardization, and reporting from a business perspective. Organizations at the optimization phase are trying to utilize as close to 100% of the resources provisioned as possible while maintaining availability and performance across their workloads.
Some signs that you have achieved this phase include:
- Full transparency into cloud cost, usage, performance, availability and security for all key stakeholders in an organization.
- Existence of a well-defined reference architecture for all applications / workloads, which justifies costs with the business value provided.
- Existence of documented policies for managing infrastructure to optimum efficiency.
- Moderate to heavy utilization of cloud-based opportunities for cost optimization (e.g. reserved instances, spot nodes, burstable instance types, elastic scaling groups).
- Automated scaling (up and down) of infrastructure to support workload at the right performance and cost.
- Proactive notification of deviation from reference architecture.
- Proactive notification of inefficient usage.
- Deep analytics to allow additional "what-if" investigations.
- Automated failover.
- Closed-loop process for continuing the optimization of your infrastructure.
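To make "proactive notification of deviation from reference architecture" from the list above more concrete, here is a minimal Python sketch that compares running nodes against a blueprint and returns human-readable findings. The reference architecture, node records, instance type names, and field names are all hypothetical; a real system would pull them from your provisioning and monitoring tools.

```python
from typing import Dict, List

# Hypothetical reference architecture: one expected instance type per
# workload and a minimum number of availability zones.
REFERENCE_ARCHITECTURE: Dict[str, Dict] = {
    "cassandra": {"instance_type": "type-large", "min_availability_zones": 3},
}


def find_deviations(
    workload: str,
    nodes: List[Dict[str, str]],
    reference: Dict[str, Dict] = REFERENCE_ARCHITECTURE,
) -> List[str]:
    """Return a list of findings where nodes differ from the blueprint."""
    spec = reference[workload]
    deviations: List[str] = []

    # Per-node check: instance type must match the blueprint.
    for node in nodes:
        if node["instance_type"] != spec["instance_type"]:
            deviations.append(
                f"{node['id']}: instance type {node['instance_type']} "
                f"differs from reference {spec['instance_type']}"
            )

    # Cluster-level check: nodes must be spread over enough zones.
    zones = {node["availability_zone"] for node in nodes}
    if len(zones) < spec["min_availability_zones"]:
        deviations.append(
            f"{workload}: spans {len(zones)} availability zone(s), "
            f"expected at least {spec['min_availability_zones']}"
        )
    return deviations


# Made-up cluster state.
cluster = [
    {"id": "node-1", "instance_type": "type-large", "availability_zone": "zone-a"},
    {"id": "node-2", "instance_type": "type-small", "availability_zone": "zone-a"},
    {"id": "node-3", "instance_type": "type-large", "availability_zone": "zone-b"},
]

findings = find_deviations("cassandra", cluster)
for finding in findings:
    print(finding)
```

In practice the findings would feed an alerting channel rather than `print`, which is what turns a periodic audit into the proactive notification described above.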
Netflix is a great example of a company that achieved the optimization phase, but a growing number of lower-profile organizations have also reached this level of cloud maturity by using solutions like CloudHealth.