Data classification is a security and cyber resiliency process in which data is organized into categories so that an entity can more easily discover its data and identify its risk exposure. Classifying data by specific attributes, policies, or security levels (such as confidential, secret, and top secret) simplifies how organizations identify what information they have; organize that data for discovery; protect that data from bad actors, ransomware, and insider threats; manage privacy policies; manage that data to gain insights that further business goals; and report on that data to meet compliance and other business requirements.
Traditionally, data classification, or tagging, has been a manual process or one supported by limited tools such as regular expressions (regex). As data volumes have exploded and cyberattacks have become more sophisticated and prevalent, organizations are turning to artificial intelligence (AI), specifically machine learning (ML) and natural language processing (NLP)-based pattern matching, to identify the sensitive and regulated data they need to safeguard. This often includes the personal, health, and financial data that bad actors target for ransom payouts.
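To make the traditional regex-based approach concrete, here is a minimal Python sketch of pattern-based tagging. The pattern names, the patterns themselves, and the `tag_text` helper are illustrative assumptions, not part of any product; production detectors typically add validation (such as checksums) and context to limit false positives.

```python
import re

# Hypothetical, simplified patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_number": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def tag_text(text: str) -> set[str]:
    """Return the names of all patterns that match anywhere in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


if __name__ == "__main__":
    sample = "Contact jane@example.com, SSN 123-45-6789"
    print(sorted(tag_text(sample)))
```

A sketch like this also shows why regex alone is considered limited: it matches on surface format only, so it cannot tell a real identifier from a lookalike string, which is the gap ML- and NLP-based techniques aim to close.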
Today’s businesses produce massive amounts of digital information in the form of both structured and unstructured data. Although much of it is unremarkable, some of it is highly valuable to cybercriminals seeking to exploit mission-critical data for financial gain. Sensitive data in organizations’ production environments, as well as in their backup and recovery environments, can contain intellectual property (IP), customer personally identifiable information (PII), supplier contracts, protected health information (PHI), payment card information (PCI), and more. Organizations that put in place comprehensive data classification practices are best positioned to understand the full impact of a potential data breach on their organization from all perspectives: financial, operational, and regulatory compliance.
Data classification is important for risk mitigation, governance, cost efficiency, and competitive reasons. The practice specifically helps an organization:
Organizations can choose their own levels of data classification or adopt levels in use by other entities. The key is to define levels in relation to how damaging it would be to the organization if the data were to fall into the wrong hands or be made available by cybercriminals on the dark web or to the public at large.
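One way to picture damage-based levels is as an ordered scale in which an item takes the highest level of any data type it contains. The Python sketch below assumes three hypothetical level names and a hypothetical mapping from detected data types to levels; an organization would substitute its own.

```python
from enum import IntEnum


class Level(IntEnum):
    # Hypothetical level names, ordered by potential damage if exposed.
    LOW = 1
    MODERATE = 2
    HIGH = 3


# Hypothetical mapping from a detected data type to its level.
LEVEL_BY_TAG = {
    "email": Level.MODERATE,
    "us_ssn": Level.HIGH,
    "card_number": Level.HIGH,
}


def classify(tags, default=Level.LOW):
    """Return the most damaging level among the tags, or the default if none apply."""
    return max((LEVEL_BY_TAG.get(tag, default) for tag in tags), default=default)


if __name__ == "__main__":
    print(classify({"email", "us_ssn"}).name)
```

Taking the maximum is a deliberately conservative choice in this sketch: a single highly damaging field raises the level of the whole item.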
For example, a popular approach to classifying documents in commercial settings is to use one of four levels:
Concurrently, business entities often look at three variables to determine data classification:
The U.S. government uses data classification levels for sensitive information whose disclosure could harm national security. According to the National Archives and Records Administration, there are three levels:
These classification levels should not be confused with the levels of security clearance required to view documents considered sensitive by the government, which may include:
There are a number of significant business and security benefits to organizations that thoroughly classify their data. These benefits include:
Manual data classification can be a tedious, time-consuming, and costly process, which is why more organizations are automating it.
Key steps in a modern data classification process include:
Cyber threats such as ransomware perpetrated by individuals and nation-states continue to increase in frequency and severity because successful cyberattacks deliver financial and political gains. Data classification processes enabled by Cohesity data security and management solutions boost cyber resiliency.
Cohesity DataHawk cloud service offerings include data classification that helps organizations discover and classify data to understand if and when sensitive data was potentially compromised during an attack. Specifically, Cohesity discovers and classifies sensitive and mission-critical data with highly accurate scanning based on more than 230 proven patterns as well as machine learning and natural language processing-based training techniques, spanning common personal, health, and financial data combinations. The solution supports regulatory requirements and privacy directives through custom policies.