One of Cohesity's core values is customer obsession. We believe in delighting our customers with our solutions and services, not just serving them. We put customers first in everything we do: if a customer runs into an issue with a product feature, the engineer responsible for it drops everything to fix it. Sometimes the entire team gets involved, and sometimes the entire leadership does too.
Our Product Support Engineer (PSE) team is one of the finest in the Valley. What does it take to serve our customers effectively and efficiently? With so many clusters in the field, manually monitoring every cluster, checking its system status and troubleshooting its issues is difficult and inefficient. When customers run into issues, they reach out to our PSE team first, and the team helps. This is reactive support: it comes only after an issue has manifested. We would rather work the other way around and reach out to our customers before they run into issues. That is proactive support.
To provide proactive support, we have developed a system that can monitor the clusters, analyze the data and predict actions that need to be taken on the clusters – either immediately or in the near future. This also helps us to avoid potential customer escalations.
Cohesity does not decide which clusters are monitored or which ones receive proactive support. That choice belongs to the customer. Customers who want proactive support from Cohesity simply enable the proactive monitoring feature on the cluster.
Monitoring
When proactive monitoring is enabled on a cluster, the monitoring agent is activated. At periodic intervals, the agent takes a snapshot of the status of different components and sends it to Cohesity over HTTPS.
The snapshot includes information such as:
- Resource usage in the cluster.
- Health status of the hardware components.
- Firmware details of the hardware components.
- Various stats about the internal components.
- Critical alerts generated in the cluster.
- Configuration and critical logs.
This information is essential for us to serve our customers better.
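To make the monitoring flow concrete, here is a minimal sketch in Python of what an agent of this kind could look like. It is not Cohesity's implementation: the field names, the `COLLECTOR_URL` endpoint, the JSON encoding and the `run_agent` loop are all assumptions made for illustration. Only the categories of collected data and the use of HTTPS come from the description above.

```python
import json
import time
import urllib.request
from dataclasses import dataclass, field, asdict

# Hypothetical endpoint; the real collection URL is not part of this post.
COLLECTOR_URL = "https://telemetry.example.com/v1/snapshots"


@dataclass
class Snapshot:
    """One periodic status snapshot. Field names are illustrative."""
    cluster_id: str
    taken_at: float
    resource_usage: dict = field(default_factory=dict)
    hardware_health: dict = field(default_factory=dict)
    firmware: dict = field(default_factory=dict)
    component_stats: dict = field(default_factory=dict)
    critical_alerts: list = field(default_factory=list)
    config_and_logs: dict = field(default_factory=dict)


def build_request(snapshot: Snapshot, url: str = COLLECTOR_URL) -> urllib.request.Request:
    """Serialize a snapshot to JSON and wrap it in an HTTPS POST request."""
    if not url.startswith("https://"):
        raise ValueError("snapshots must be sent over HTTPS")
    body = json.dumps(asdict(snapshot)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def run_agent(collect, interval_seconds: float, enabled) -> None:
    """Send a snapshot every interval_seconds while monitoring is enabled.

    collect() returns a Snapshot; enabled() reflects the customer's opt-in.
    """
    while enabled():
        request = build_request(collect())
        with urllib.request.urlopen(request, timeout=30) as response:
            response.read()
        time.sleep(interval_seconds)
```

The `enabled` callback mirrors the opt-in model: if the customer has not turned on proactive monitoring, the loop never sends anything.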
Analytics
Once the data is collected, we run analytics on it, and the analytics component generates actionable intelligence. Some examples of the actionable intelligence include:
- A disk in a cluster has too many read/write errors, which suggests it is going to fail and should be replaced.
- A disk in a cluster suddenly fails, the cluster throughput drops, and the disk needs to be replaced.
- A node in a cluster fails, the throughput drops and the node needs to be replaced. This is more critical than a disk failure.
- A hardware component from a particular vendor is faulty, and one of the clusters is already in a bad state. The same vendor's component on the rest of the clusters should be replaced.
- To upgrade to a certain software version, the cluster has to go through a recommended upgrade path. If a cluster hasn’t followed this path, that’s an anomaly. We can notify the customer about that.
- The firmware version running on a particular hardware component is different from the recommended version.
- A cluster is running non-recommended workloads.
- Clusters running a specific software version need patches applied.
The intelligence can be broadly categorized into anomalies, recommendations and potential hardware failures. Depending upon the severity of these actions, our PSE team gets alerted. They take it from there.
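A few of the checks above can be sketched as simple rules over a snapshot. The following Python is an illustration only, not the actual analytics component: the snapshot keys, the thresholds, the firmware and upgrade tables, and the numeric severity scale are all hypothetical. It shows findings being sorted into the three categories and only the severe ones being routed to the PSE team.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    ANOMALY = "anomaly"
    RECOMMENDATION = "recommendation"
    POTENTIAL_HARDWARE_FAILURE = "potential hardware failure"


@dataclass
class Finding:
    category: Category
    severity: int  # higher is more urgent; the scale is an assumption
    message: str


# Illustrative values only; real thresholds and versions are not in this post.
DISK_ERROR_THRESHOLD = 100
RECOMMENDED_FIRMWARE = {"nic": "2.4.1"}
ALLOWED_UPGRADE_STEPS = {("5.0", "6.0"), ("6.0", "6.3")}


def analyze(snapshot: dict) -> list[Finding]:
    """Apply a handful of rules to one snapshot and return the findings."""
    findings = []
    # Too many read/write errors suggests the disk is about to fail.
    for disk, errors in snapshot.get("disk_io_errors", {}).items():
        if errors > DISK_ERROR_THRESHOLD:
            findings.append(Finding(Category.POTENTIAL_HARDWARE_FAILURE, 2,
                                    f"disk {disk}: {errors} I/O errors, replace"))
    # A node failure is rated more critical than a disk problem.
    for node in snapshot.get("failed_nodes", []):
        findings.append(Finding(Category.POTENTIAL_HARDWARE_FAILURE, 3,
                                f"node {node} failed, replace"))
    # Firmware that differs from the recommended version.
    for component, version in snapshot.get("firmware", {}).items():
        recommended = RECOMMENDED_FIRMWARE.get(component)
        if recommended is not None and version != recommended:
            findings.append(Finding(Category.RECOMMENDATION, 1,
                                    f"{component}: firmware {version}, recommended {recommended}"))
    # Every upgrade step must be on the recommended path.
    history = snapshot.get("upgrade_history", [])
    for old, new in zip(history, history[1:]):
        if (old, new) not in ALLOWED_UPGRADE_STEPS:
            findings.append(Finding(Category.ANOMALY, 1,
                                    f"upgrade {old} -> {new} is off the recommended path"))
    return findings


def alerts_for_pse(findings: list[Finding], min_severity: int = 2) -> list[Finding]:
    """Keep findings severe enough to alert the PSE team, most urgent first."""
    urgent = [f for f in findings if f.severity >= min_severity]
    return sorted(urgent, key=lambda f: f.severity, reverse=True)
```

In this sketch, a snapshot with a failed node and a disk over the error threshold would produce two alerts with the node listed first, while a firmware mismatch would be recorded as a recommendation without paging anyone.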
Conclusion
As I said at the beginning of this post, we're customer obsessed. Cohesity consolidates secondary storage workflows onto a hyperconverged solution. Although what happens behind the scenes is complex, what our customers see when they use the Cohesity solution is a simple management interface that is easy to use and easy to navigate.