In a recent blog post, our CEO, Mohit Aron, debunked an industry myth that secondary data only refers to backups. Secondary data also includes files and objects, test/dev, and analytics workloads, and it comprises up to eighty percent of total infrastructure within a modern data center. In today’s post, I talk about how these secondary workloads are highly fragmented, and about how their consolidation under a single data platform is Cohesity’s primary focus. All data is important, especially the eighty percent you might assume isn’t.
How did a word – secondary – screw up an entire industry? What are we going to do about it?
Words mean things. I firmly believe people do things – they take actions or they react – based on how words make them feel. You can stand up in front of somebody and recite statistics until you’re blue in the face, but here’s what people will remember when they walk out of that room: How you made them feel.
Now, don’t get me wrong: You have to be credible; you have to know your stuff; you have to be grounded in facts. Otherwise, you’re just an entertainer. The magic happens when you can combine your facts with meaning and give people a reason to change. This is what makes them feel good about that decision.
One “swing thought” (for all you golfers out there): As you prepare a presentation, you have to answer three questions: What do I want them to think? What do I want them to feel? What do I want them to do? If you want to effect change, the most important thing to consider is feeling. Very few people change when presented with a fact.
How did secondary sneak up on an entire industry?
So, how does the word “secondary” make you feel? What do you do with “secondary” things? “Secondary” doesn’t get your time and attention. It doesn’t get a lot of your money (at least we don’t want it to). Often we look at secondary as a necessary evil. We do what’s minimally necessary. We find a quick fix, get it behind us, and get back to work on higher-priority things.
Our IT world is no different. Once we determine something is secondary, we look for the quick, reliable fix. For backup data, we find a tool, squish it down, and ship it offsite. For archive, we find a tool, squish it down, and ship it offsite. For tiering, we find a tool. For test/dev? Find a tool (or platform)! We have a lot of cool tools and platforms for archive, backup, tiering, and test/dev. (Amazon Web Services (AWS) topped $17B last year and is growing 45% year over year, and its #1 use case is still archive, backup, and test/dev.) We found a tool, solved the problem, and moved on.
Here’s the issue: We slapped the “secondary” label on a bunch of data. We told ourselves it’s not that important, and we acted the way you would think. We bought an archive product over here; put together some homegrown rsync scripts over there. We even went big with a backup vendor.
Unfortunately, when you step back and look at all that secondary data, it adds up to 80% of our information, and that 80% is growing 800% over the next 5 years.
It’s not that any of our solutions were necessarily bad. It’s that when you look at them in aggregate, you don’t get the ROI you expected: They don’t talk to each other; many blob up data into proprietary formats, so it’s hard to “see” into them; and each one has its own APIs, CLIs, and UIs. Throw a cloud strategy on top of that, and it’s no wonder that the data management unicorn is a single pane of glass (SPOG).
Stepping Back: A Holistic Approach to Secondary Data Consolidation
Mohit Aron and the engineering team at Cohesity did step back. They did take a look at “secondary” data challenges as a whole, and knew the design had to be holistic. We couldn’t just be another slick point product. That would simply add to the problem.
Fortunately, this was a familiar problem for the Cohesity engineering team. They knew the solution had to be a DataPlatform similar – at least in concept – to the Google File System they built years ago, which was a platform that provided the foundation to address new use cases. (Think about what you probably use every day – Maps, Gmail, YouTube, etc. Those are all built on the platform that is Google File System. If you want to read more about that history and Mohit’s part in the Google File System, check out this article.)
Start with Backup and Data Protection
Use Case 0: backup. For a young company, think how important it is to be excellent at Use Case 0. It’s existential. You’re betting the company on Use Case 0. Having this idea of a data platform is only great if you can prove your data platform actually works. Asked and answered! Customers responded by driving incredible (>600%) growth, industry recognition, and outstanding customer enthusiasm.
No Shortcuts
Now, you could develop some slick point products and maybe get to market a little faster, but the design goal wasn’t to smooth over backup or make archive a little less painful. The idea was to build a DataPlatform that could solve secondary data sprawl.
In order to do that, Mohit had to step back and develop a file system that included the properties you would need for secondary data:
- Intelligence to detect random vs. sequential workloads and optimize for each
- An N+1 scale-out architecture
- Global, variable-length dedupe (vs. dedupe by node, by workload, or by fixed block)
- Multiprotocol: NFS, CIFS, S3
- Encryption at rest and in flight
- Compression
- Snapshots, clones and replication
- Instant mass restore (i.e., hundreds at once)
- Archive
- Block or “chunk” tiering
- Erasure Coding
- A native MapReduce capability to analyze all of this secondary data
- REST APIs
- QoS
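
To make one item on that list concrete, here is a toy Python sketch of why global, variable-length dedupe matters compared with fixed-block dedupe. It is an illustration only and not how Cohesity's file system is implemented: the names `GEAR`, `fixed_chunks`, `variable_chunks`, `GlobalDedupeStore`, and `demo` are all hypothetical, and the rolling hash is a simplified gear-style hash chosen for brevity. The idea is that chunk boundaries are picked from the content itself, so a one-byte insert at the front of a file shifts every fixed block but leaves most variable-length chunks intact.

```python
import hashlib
import random

# Hypothetical gear table: one pseudo-random 32-bit value per possible byte.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def fixed_chunks(data, size=64):
    """Fixed-block chunking: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, mask=0x3F, min_size=16, max_size=256):
    """Content-defined chunking: cut where a rolling hash hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear-style rolling hash; the low bits depend only on recent bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


class GlobalDedupeStore:
    """One chunk index shared by every writer (node, workload, protocol)."""

    def __init__(self):
        self.chunks = {}  # SHA-256 hex digest -> chunk bytes

    def put(self, data, chunker):
        """Store data; return a recipe (list of digests) to rebuild it."""
        recipe = []
        for chunk in chunker(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # keep each chunk once
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        return b"".join(self.chunks[d] for d in recipe)

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


def demo():
    """Store a file and a lightly edited copy; compare bytes kept on disk."""
    rng = random.Random(42)
    original = bytes(rng.getrandbits(8) for _ in range(8192))
    edited = b"!" + original  # one byte inserted at the front
    results = {}
    for name, chunker in (("fixed", fixed_chunks), ("variable", variable_chunks)):
        store = GlobalDedupeStore()
        store.put(original, chunker)
        recipe = store.put(edited, chunker)
        assert store.get(recipe) == edited  # the recipe rebuilds the data
        results[name] = store.stored_bytes()
    return results


if __name__ == "__main__":
    print(demo())
```

Running `demo()` should show the fixed-block store holding roughly two full copies, while the variable-length store holds little more than one. Because both copies go through the same index, it does not matter which node or workload wrote them, which is the "global" part.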
Sometimes when you stand back, the big picture comes into focus. Cohesity was really the first to do this. We didn’t let the word “secondary” influence our thinking and impact our solution. All data is important, especially the eighty percent you might assume isn’t.
If you build a DataPlatform to start, all of this cognitive dissonance gets resolved. No more mental friction! If you have about 15 minutes, listen to Mohit himself talk about the road to developing a file system for secondary data.