Unstructured data is information that is not stored according to a predefined data model or schema, such as the one imposed by a relational database management system (RDBMS) or even by a non-relational (NoSQL) database. The vast majority of data in the world is unstructured, encompassing text, rich media, video, images, audio, sensor data from Internet of Things (IoT) devices, and more. Unstructured data can be created by humans or machines and is challenging to store or analyze using traditional data management strategies.
Data is increasingly recognized as the most important asset that businesses possess. Yet few organizations have been able to reap full value from the immense volumes of unstructured data — estimated by analysts to be 80 percent of all data they generate or otherwise acquire during the course of doing business. Managing unstructured data at scale using conventional file services approaches with network attached storage (NAS) devices has proven difficult and costly because of data replication, physical limitations, and governance challenges.
With the right tools, organizations can extract tremendous value from unstructured data. For example, businesses could mine social media posts for data that reflects satisfaction with their brands. Clinicians at hospitals could share a common—and massive—repository of genomic sequences for research purposes.
But how and where to store all this unstructured data, as files or objects, has continued to challenge businesses. Traditional NAS infrastructure helps with performance, but it is costly and doesn’t scale out. Next-generation scale-out NAS is available but not yet widely deployed. Software-defined object storage is beginning to be deployed but most enterprise workloads weren’t designed to use object storage. Adoption has been slow and difficult. Enterprises need a more scalable and efficient way to manage unstructured data.
Examples of unstructured data include the following:
Sources of unstructured data include the following:
Text files—Virtually every office file you’re used to handling is a source of unstructured data. This includes word-processing documents, presentations, and PDFs — anything that doesn’t have a pre-defined format.
Rich media files—Audio and video files do not fit into a structured data model, and neither do digital photographs. Each of these file types can come in its own format, making it even more difficult to analyze.
Email—Some aspects of email are considered semi-structured (the “to” and “from” and “subject” lines, for example), but mostly emails are the source of unstructured text.
Social media—Social media is also a source of unstructured data, although like email, some of it can be considered semi-structured.
IoT data—Device sensors generate an extraordinarily large volume of log files that are unstructured and difficult to analyze in conventional ways.
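To make the email item above concrete, here is a minimal Python sketch that separates the semi-structured part of a message (its header fields) from the unstructured part (its body). It uses only the standard library `email` package. The sample message and the `split_email` helper are invented for illustration and are not taken from any product.

```python
from email import message_from_string

# Hypothetical message, invented for this example.
RAW_MESSAGE = """\
From: alice@example.com
To: support@example.com
Subject: Order 1042 arrived damaged

Hi team,
The box was crushed and the screen is cracked. Can you send a replacement?
"""


def split_email(raw):
    """Return (header fields, body text) for a simple, non-multipart email."""
    msg = message_from_string(raw)
    # Header fields have fixed names, so they can be read like record columns.
    headers = {name: msg[name] for name in ("From", "To", "Subject")}
    # The body is free text; nothing tells us where the useful facts are.
    body = msg.get_payload()
    return headers, body


headers, body = split_email(RAW_MESSAGE)
print(headers)
print(body)
```

The headers can be loaded straight into a table, while the body still needs a person or a text-analysis step to work out that the customer is asking for a replacement.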
Unstructured data is used within every business function: finance (invoices), marketing (photos), IT (IoT data), sales (emails with customers), and customer service (social media).
Although it’s changing rapidly, at this point, much of the unstructured data collected and stored is processed manually, if at all. For example, email is mostly processed by a human reading it, extracting what is important (sometimes by copying and pasting into another email or into an application), and taking action based on its contents.
But with advancing AI technologies such as machine learning, machine vision, and natural language processing, more of this unstructured information can be harnessed and analyzed automatically, driving faster business insight.
Structured data is stored in a fixed place within a file or record. It’s typically stored in a relational database (RDBMS) but can also be found in NoSQL databases, for example. Structured data can be text, dates, or numbers.
Unstructured data has not been defined or stored in a predefined way. Although it most commonly consists of text, it can also include numbers, images, and audio.
Data classification is the process of analyzing data and categorizing it into buckets, typically based on metadata (data about data) such as the type of file, its contents, or its date.
Classifying unstructured data, for example by how sensitive it is, helps you manage it in line with your governance policies, because the classification determines where the data should be stored and who should have access to it.
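As a rough illustration of metadata-based classification, the Python sketch below sorts files into buckets using only their name, file type, and last-modified date. The bucket names, the keyword list, and the one-year archive threshold are all assumptions made for this example; a real policy would come from your own governance rules.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import PurePath


@dataclass(frozen=True)
class FileMeta:
    """Metadata about a file; the file contents are never read."""
    name: str
    modified: date


# Hypothetical policy values, chosen only for illustration.
SENSITIVE_KEYWORDS = ("invoice", "contract", "payroll")
MEDIA_SUFFIXES = {".jpg", ".png", ".mp4", ".wav"}


def classify(meta, today, archive_after_days=365):
    """Assign a file to one bucket; the first matching rule wins."""
    path = PurePath(meta.name.lower())
    if any(keyword in path.stem for keyword in SENSITIVE_KEYWORDS):
        return "sensitive"
    if (today - meta.modified).days > archive_after_days:
        return "archive"
    if path.suffix in MEDIA_SUFFIXES:
        return "media"
    return "general"


today = date(2024, 6, 1)
for meta in (
    FileMeta("Payroll_2023.pdf", date(2024, 1, 5)),
    FileMeta("team_photo.jpg", date(2020, 3, 1)),
    FileMeta("demo.mp4", date(2024, 5, 1)),
):
    print(meta.name, "->", classify(meta, today))
```

Checking sensitivity before age is a deliberate choice in this sketch: an old payroll file stays in the sensitive bucket rather than being moved to a general archive.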
Files can be either structured or unstructured data. Common examples of structured data are spreadsheets or SQL database files. Other files, like word-processing documents, presentations, and emails are unstructured. Some files—like invoice templates that display the exact same information in the exact same way every time the template is used—are called semi-structured because there’s a way of getting the information out of them without AI or machine-learning models. So it’s not a question of whether the data is in a file or not; the question is whether within that file the data is stored in a predefined format.
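The point about invoice templates can be sketched in a few lines of Python. Because a template prints the same labels in the same places every time, plain pattern matching is enough to pull the values out, with no AI or machine-learning model involved. The invoice layout, field labels, and `extract_invoice_fields` helper below are hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical invoice text, as it might look after conversion to plain text.
INVOICE_TEXT = """\
Invoice Number: INV-2024-0017
Invoice Date: 2024-03-15
Total Due: 1,250.00 USD
"""

# One pattern per field; each relies on the template's fixed label.
FIELD_PATTERNS = {
    "number": re.compile(r"^Invoice Number:\s*(\S+)", re.MULTILINE),
    "date": re.compile(r"^Invoice Date:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE),
    "total": re.compile(r"^Total Due:\s*([\d,]+\.\d{2})", re.MULTILINE),
}


def extract_invoice_fields(text: str) -> Dict[str, Optional[str]]:
    """Return each field's value, or None when its label is missing."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


print(extract_invoice_fields(INVOICE_TEXT))
```

The same approach fails on a free-form email that mentions an amount somewhere in a sentence, which is the practical difference between semi-structured and unstructured content.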
Unstructured data is information that either does not have a predefined data model or is not organized in a predefined manner. That means that it:
Approximately 80% of all data is unstructured, and that percentage grows higher every year.
There are several techniques that you can use to process unstructured data. Here are some of the most widely used:
Metadata analysis—This “data about data” is critical to analyzing unstructured data. For example, a blog post (unstructured text) has metadata consisting of title, author, URL, publishing date, any descriptive tags or keywords, and perhaps a category name. There are no universal metadata standards, so each business defines its own.
Image analysis—Images contain unstructured information that can be very valuable to extract for business, financial, medical, and scientific reasons. New AI-based systems can analyze an unstructured image and match it to a known image with similar characteristics. For example, optical character recognition (OCR) technology converts text in image files into machine-readable characters by matching the shapes in the image to the characters of a language.
Natural language processing (NLP)—This is a subset of AI/ML that aids in analyzing unstructured textual data. NLP uses several techniques, such as grammatical and semantic analysis, to extract meaning from unstructured text.
Data visualization—When teams choose to visualize data, they present it in a graphical form to allow viewers to understand and analyze it simply by looking at it.
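As a very small taste of text processing, the Python sketch below counts the most frequent meaningful words across a couple of product reviews. Real NLP goes much further, into grammar and semantics, so treat this only as a first step under simple assumptions: the sample posts and the short stop-word list are invented for the example.

```python
import re
from collections import Counter

# Tiny, hand-picked stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "to", "of", "was", "but"}

# Hypothetical social media posts about a product.
POSTS = [
    "The battery life is great, but the battery takes hours to charge.",
    "Great screen. The battery drains fast.",
]


def top_terms(text, n=3):
    """Return the n most common words, ignoring case and stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(n)


print(top_terms(" ".join(POSTS), n=2))
```

Even this crude count surfaces "battery" as the topic customers talk about most, which is the kind of signal a fuller analysis would then examine for sentiment.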
Cohesity’s software-defined, hyperscale platform simplifies data management by consolidating backups and unstructured data in the form of files and objects from multiple application workloads on a single platform. The platform is architected on Cohesity SpanFS, a unique globally distributed file system that supports various protocols, including NFS, SMB, and S3 object storage.
With Cohesity, your organization can protect existing NAS investments, and in fact optimize them, by using that storage only for higher-performance data while offloading infrequently accessed unstructured data to Cohesity SmartFiles. A modern approach to file and object management, SmartFiles eliminates legacy hardware forklift upgrades and costly, time-consuming manual infrastructure updates while guaranteeing that all of your unstructured data is protected wherever it resides: in the data center, in the cloud, or at the edge.
Cohesity SmartFiles also features: