“We are investing in cloud native stacks.”
“We are ‘born in the cloud’.”
“We are running cloud optimized architecture.”
Do any of these phrases remind you of a recent conversation you have had with your team or even a customer? If you are reading this, you have likely had these discussions multiple times a week, or even multiple times a day. And we are sure you have noticed that “cloud-native services” can mean different things to different people, teams, and organizations. The Cloud Native Computing Foundation (CNCF) has a standard definition for this term, which you can read here. In summary, the definition refers to the following design principles:
- Microservices approach to building apps
- Packaging these services as containers
- Taking a DevOps approach to deploying infrastructure
- Continuously delivering applications to customers
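To make the first two principles a little more concrete, here is a minimal, illustrative sketch of a stateless service that could be packaged into a container image. The endpoint path, port, and environment variable are placeholders for illustration only, not part of any specific product.

```python
# Minimal sketch of a stateless microservice suitable for packaging as a
# container image. Endpoint, port, and env var names are illustrative.
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


class ServiceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness probe: an orchestrator can restart the container if this fails.
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
        else:
            body = json.dumps({"error": "not found"}).encode()
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Configuration comes from the environment, keeping the container stateless.
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("0.0.0.0", port), ServiceHandler).serve_forever()
```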
For our team, all of these principles came into focus as we were building out our data management capabilities to be delivered as a service in the cloud. As a mid-sized company, we do not have a large data center footprint; we rely on public cloud infrastructure to run our business applications. We also consume SaaS applications for productivity, HR, business operations, CRM — pretty much wherever we can. So it was natural to expect that our customers would need something similar. In fact, they told us that they did.
However, delivering a service in the cloud is not the same as running a virtual appliance in the cloud. That approach is colloquially known as “cloud washing”. While it may be a quick and dirty way to plant a flag in the cloud, it rarely works in the long run. This is where the definition and relevance of cloud native services come in. To deliver a service in the cloud, you need to build it for the cloud, leveraging cloud technologies and making it scale in the cloud. Let us break this down, looking at each aspect of ‘cloud native’ and how it applied to what we learned.
Microservices Approach to Building Apps and Packaging Services as Containers
When you deliver software to be run by a customer, it needs to be seamless and simple for the customer to operate and for you to support. Thus, it is fairly typical to deliver the entire application in one piece, with a single upgrade and change event. With the emergence of hyperscale cloud providers, led by Amazon Web Services and Microsoft Azure, many enterprises began running software on cloud infrastructure and consuming business applications such as email or collaboration tools as a service (SaaS). Enterprises worldwide now have a mandate to reduce their data center footprint wherever they can use the cloud. We see this market transition playing out in our customer base: most customers are on a hybrid journey, with a shrinking data center footprint complemented by cloud infrastructure and SaaS applications. As a company that is simplifying data management with a modern approach, creating a SaaS option to support our customers wherever they are on that hybrid journey was only a matter of time. As a result, Data Management as a Service (DMaaS) was born.
To deliver DMaaS, one easy option was to simply run our software on AWS EC2 instances and spin up a new instance each time we had a new customer or deployment need. After all, our clustering model already supported auto-discovery of other clusters, tenant data isolation, and all the same constructs that we have been delivering to our customers in their data centers for years. However, several companies in the SaaS graveyard tell us why this would not work. On one hand, it would not be scalable, distributed, elastic, or cost effective; on the other, it would not give our customers self-service, flexibility, on-demand bursting, or high availability: the very capabilities that we associate with SaaS.
At Cohesity we have a motto that is plastered on our walls and, more importantly, in our minds: “No shortcuts”. True to that spirit, our architects had to consider the tradeoffs of breaking the application into components that could be delivered, upgraded, and supported independently. They had to think through the scenarios of splitting down to the most atomic components versus grouping loosely coupled services that logically belong together. As with most enterprise SaaS applications, splitting an application into microservices requires several considerations (a sketch of how these might be captured follows the list):
- Conway’s law — how we are organized determines how our components are delivered
- Value to the customer — how these components work and how the customers perceive value from them. Do customers value upgrades as often as they can get them? Or are there more stringent needs around change, especially to on-prem customer-managed components of the managed service?
- Disruption from downtime — monitoring capabilities for service availability, performance, scalability, and tenancy. Which components are most critical and most visible to the customer? What happens if they fail, and how do we recover from such component failures?
- Statelessness of the components — configuration and state management, and the user interaction needed with these services.
- Global distribution — making the services available to a global footprint to serve global customers and multinationals who may stretch their infrastructure across several regions around the world.
- Operational factors — costs, efficiency, and frequency of upgrades needed
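One way to make these considerations actionable during an architecture review is to capture them as a simple checklist per candidate service. The sketch below is purely illustrative; the field names, the example service, and the review rule are hypothetical, not our actual tooling.

```python
# Hypothetical sketch: capturing the decomposition considerations as a
# per-service checklist for an architecture review.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceCandidate:
    name: str
    owning_team: str             # Conway's law: ownership shapes delivery
    customer_critical: bool      # disruption from downtime
    stateless: bool              # statelessness of the component
    regions: List[str] = field(default_factory=list)  # global distribution
    upgrades_per_month: int = 4  # operational factors


def needs_extra_review(candidate: ServiceCandidate) -> bool:
    """Flag services whose failure modes and state handling deserve deeper scrutiny."""
    return candidate.customer_critical and not candidate.stateless


if __name__ == "__main__":
    catalog = ServiceCandidate(
        name="catalog-service",
        owning_team="data-plane",
        customer_critical=True,
        stateless=False,
        regions=["us-east-1", "eu-west-1"],
    )
    print(needs_extra_review(catalog))  # True: stateful and customer-critical
```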
For our DMaaS offerings, after a thorough architecture review, our team decided to begin conservatively, dividing our services into independent functional blocks that could be delivered, managed, and maintained independently. We decoupled our application into numerous services but decided to maintain functional integrity, and to a certain extent we mapped the boundaries to the teams developing, integrating, and delivering those services. One tradeoff we weighed was balancing agility with efficiency: although we could have modularized further, we chose to iterate with lower coordination costs between services, and we still have a considerable backlog of potential optimizations that could further streamline delivery. One of the key benefits of SaaS is that it offers ample opportunities to learn, iterate, and adjust; since the launch of our service, we have already learned a ton, and the underlying service delivery changes every day.
To summarize, delivering software as a service requires a fundamental architectural rethink. It is not as simple as running software in a cloud instance. We were fortunate to have a team of experts across various disciplines who have delivered multiple SaaS services, including our Cohesity Helios platform, which provides the platform services and management interface for managing customer infrastructure and has grown to serve thousands of customers over the past three years.
The next step, handling customer data, needed a whole new level of diligence. To achieve this, we got much-needed assistance, guidance, and review from the AWS Well-Architected team to ensure we are following best practices and using the capabilities and services available from AWS, as well as planning for the future as the service scales and we add more integrations, third-party services, regions, and workloads.
DevOps Culture and Continuous Delivery
In the previous section, we covered what it takes to architect software to be able to deliver it as a service. The key word we discuss in this segment is ‘deliver’.
Continuous delivery implies that every ‘release’ is ready and will work for customers. Sure, there may be issues, as there are with any software, but these get fixed forward in a subsequent release, which is not months away but could be just a few hours later. In our case, we needed to make a conscious switch from payload-based releases to a release-train model.
In this model, the train leaves the station at a scheduled time regardless of how many passengers there are. Cars get added or removed depending on demand. There is no concept of waiting for some passengers, even if they paid for executive class. To make this happen, teams needed to agree on ground rules that outlined the criteria and timeline for a feature to make a given release. We had to build automation to test against these criteria and to enforce them; any exceptions require change controls and sign-offs.
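As a rough illustration of that kind of gate (the criteria, names, and sign-off flag below are hypothetical, not our actual pipeline), automation can check each feature against the agreed criteria and hold back anything that does not meet them by departure time.

```python
# Hypothetical release-gate sketch: features that do not meet the agreed
# criteria by departure time are dropped from the train, not waited for.
from dataclasses import dataclass
from typing import List


@dataclass
class Feature:
    name: str
    tests_passing: bool
    docs_ready: bool
    has_exception_signoff: bool = False  # change-control exception


def board_the_train(features: List[Feature]) -> List[Feature]:
    """Return only the features allowed onto this release train."""
    boarded = []
    for feature in features:
        meets_criteria = feature.tests_passing and feature.docs_ready
        if meets_criteria or feature.has_exception_signoff:
            boarded.append(feature)
        else:
            print(f"holding back {feature.name}: criteria not met")
    return boarded


if __name__ == "__main__":
    train = board_the_train([
        Feature("tenant-reporting", tests_passing=True, docs_ready=True),
        Feature("new-dashboard", tests_passing=False, docs_ready=True),
    ])
    print([f.name for f in train])  # ['tenant-reporting']
```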
For this model to be successful, cloud DevOps and SRE teams needed to be empowered to make such decisions independently, based on data. Integrity and separation of duties between the development and operations teams are key to successfully delivering SaaS, and the healthy tension between these teams results in a better outcome for the team and for customers at large.
Operating the service could fill a whole blog post in itself, but ops teams live and die by the SLA, uptime, stability, and scalability of the service. To stay on top of this, operations teams must be able to monitor ‘everything’ and have a process to address issues that impact the fleet. Infrastructure monitoring in the cloud is fairly standardized, with tools such as DataDog, SumoLogic, Twistlock, and PagerDuty, plus many native services provided by infrastructure providers such as AWS.
In addition to monitoring infrastructure in the cloud, ops teams need to monitor the software being delivered. Many of our customers are service providers using our platform and software to deliver services such as managed backup, so we had already built instrumentation and licensing capabilities to monitor usage, utilization, scale, and performance for our software. We built operational dashboards to make triaging issues easy, and a process for working across teams to triangulate defects proactively, before customers start experiencing any issues.
Metrics and thresholds are established, along with an operational procedure that sets the tone for how people work together to resolve issues in the fleet. Such a procedure should be well integrated into a company’s support process so that it is transparent to a customer whether they are using an on-prem product or SaaS. In our case, our customers always sit somewhere on the hybrid spectrum, so we built these procedures, and our service, for the hybrid customer.
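To sketch what metrics and thresholds can look like in practice, the example below compares a handful of hypothetical fleet metrics against agreed limits and pages on a breach. The metric names, limits, and notify_oncall stub are illustrative only; in a real fleet, the monitoring and paging tools mentioned above do this work.

```python
# Simplified sketch of evaluating fleet metrics against agreed thresholds.
# Metric names, limits, and the notify_oncall stub are hypothetical.
from typing import Dict

THRESHOLDS: Dict[str, float] = {
    "api_error_rate_pct": 1.0,      # alert above 1% errors
    "p99_latency_ms": 500.0,        # alert above 500 ms
    "backup_job_failure_pct": 2.0,  # alert above 2% failed jobs
}


def notify_oncall(metric: str, value: float, limit: float) -> None:
    # Placeholder: a real system would page through its incident tooling.
    print(f"PAGE: {metric}={value} breached threshold {limit}")


def evaluate(metrics: Dict[str, float]) -> None:
    """Compare collected metrics against thresholds and page on any breach."""
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            notify_oncall(name, value, limit)


if __name__ == "__main__":
    evaluate({"api_error_rate_pct": 2.3, "p99_latency_ms": 310.0})
```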
Bringing It All Together
The next time you hear the phrase ‘cloud native’, hopefully you will be better able to appreciate what really goes into providing such services. Delivering cloud-native services as SaaS takes planning and collaborative alignment between teams. It starts with thinking about the architecture of the software, how to continuously deliver it and operate it as a service, and ultimately how to best address customer needs. It takes the right mix of product, people, and process coming together to deliver a turnkey service, removing complexity for the customer and giving them the peace of mind that they are taken care of. This is especially valuable when we are managing their crown jewels — their data.