Insights from One Billion Data Objects
What does unstructured data actually look like in real organisations? We analysed more than one billion data items to find out.
One Billion Data Insights
The following insights are based on the statistics of more than one billion unstructured data items gathered during the fourth quarter of 2024 from customers across various industries and geographic locations in Europe, Asia, and North America.
Unstructured Data Distribution
This graphic shows the distribution of unstructured data across Exchange, OneDrive, SharePoint Online / Teams, and file share locations.
Distribution of unstructured data by content source
Key finding: 90% of unstructured data lives in Exchange — making email by far the highest-risk content repository in any organisation.
Privacy Data Density
This graphic shows the rate of occurrence of privacy information across different industries. The average across all industries is 1.37%.
Rate of privacy data occurrence by industry — average 1.37%
Every organisation we examined is collecting a significant amount of privacy data: between 3 and 49 items out of every 1,000, well beyond what any organisation can manage by hand.
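To put these rates in perspective, the short sketch below converts the per-1,000 figures into percentages and estimates how many privacy items a corpus would hold at the 1.37% average. The corpus size of one million items and the function names are assumptions chosen only for illustration; they are not figures or tooling from the study.

```python
def per_thousand_to_percent(items_per_thousand: float) -> float:
    """Convert a rate expressed per 1,000 items into a percentage."""
    return items_per_thousand / 10.0


def expected_privacy_items(total_items: int, rate_percent: float) -> int:
    """Estimate how many items contain privacy data at a given rate."""
    return round(total_items * rate_percent / 100.0)


# The observed range of 3 to 49 items per 1,000, expressed as percentages.
low_percent = per_thousand_to_percent(3)
high_percent = per_thousand_to_percent(49)

# Hypothetical corpus size (an assumption, not a figure from the report).
corpus_size = 1_000_000
estimate = expected_privacy_items(corpus_size, 1.37)

print(f"Range: {low_percent}% to {high_percent}%")
print(f"Estimated privacy items in {corpus_size:,} items: {estimate:,}")
```

Even at the average rate, a modest corpus holds thousands of items that would each need review.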
Security Information
Plain-text login credentials represent one of the most dangerous risks found in unstructured data. The average is 16 plain-text login items per user, or 8 when the automotive industry outlier is excluded.
Average plain-text login items per user, by industry
Distribution of plain-text login items by content source
The vast majority of plain-text credentials are found in email — creating a straightforward attack vector for credential theft, privilege escalation, or unintended AI disclosure.
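As a rough illustration of what a plain-text credential looks like to a scanner, the sketch below flags text that follows a password-like label in an email body. The pattern, the function name, and the sample message are all hypothetical. This is not the detection method used to produce the figures above, and a simple pattern like this will both miss real credentials and raise false positives (for example, on "password is required").

```python
import re

# Hypothetical pattern for illustration only; real credential formats vary widely.
CREDENTIAL_PATTERN = re.compile(
    r"\b(?:password|passwd|pwd|passphrase)\b\s*(?:is|[:=])\s*(\S+)",
    re.IGNORECASE,
)


def find_plaintext_credentials(text: str) -> list[str]:
    """Return the candidate secrets that follow a password-like label."""
    return [match.group(1) for match in CREDENTIAL_PATTERN.finditer(text)]


# Invented sample message, not taken from any real mailbox.
sample_email = (
    "Hi Sam, the staging login is below.\n"
    "Username: sam.lee\n"
    "Password: Winter2024!\n"
    "Let me know if it fails."
)

hits = find_plaintext_credentials(sample_email)
print(f"Candidate credentials found: {len(hits)}")
```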
Privacy Data Distribution
This graphic shows the type of privacy data discovered. Across all industries, the most common types are national ID cards, salary and financial information, and payment cards.
Types of privacy data discovered across all industries
The type of privacy data varies by industry, but national ID cards, financial records, and payment card data are universally present — all subject to strict regulatory requirements.
Languages
This graphic shows the different languages used to store privacy data. The most common is English, followed by German and French.
Languages in which privacy data was discovered
Searching for privacy data in only one language is insufficient. Multi-language storage occurs in every organisation, including Spanish in the US and French in Canada, both for convenience and as a deliberate way to avoid security controls.
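The sketch below shows why a single-language search falls short: a keyword scan that knows only the English term misses the same concept written in German, French, or Spanish. The keyword lists and function name are hypothetical and deliberately tiny; a real scan would need much richer, reviewed vocabularies and proper language handling.

```python
# Hypothetical keyword lists, one salary-related term per language.
SALARY_TERMS = {
    "en": ["salary"],
    "de": ["gehalt"],
    "fr": ["salaire"],
    "es": ["salario"],
}


def languages_mentioning_salary(text: str, languages=None) -> list[str]:
    """Return the language codes whose salary terms appear in the text."""
    lowered = text.casefold()
    selected = languages if languages is not None else SALARY_TERMS.keys()
    return sorted(
        lang
        for lang in selected
        if any(term in lowered for term in SALARY_TERMS[lang])
    )


# Invented sample sentence in Spanish.
sample = "Su salario mensual se paga el día 28."

english_only = languages_mentioning_salary(sample, languages=["en"])
all_languages = languages_mentioning_salary(sample)
print(f"English-only scan: {english_only}")
print(f"Multi-language scan: {all_languages}")
```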
Privacy Data Age
This graphic shows the age, by year, of items containing privacy data. The average age of personal data is 3.22 years.
Age distribution of items containing privacy data
Most organisations aren’t removing privacy data once it’s no longer needed — allowing it to accumulate over time. This both reduces compliance and steadily increases the potential impact of a breach.
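One way to act on this is to compare each item's age against a retention limit. The sketch below computes item age in years and lists the items that exceed a threshold. The three-year limit, the file names, and the dates are hypothetical values chosen for illustration; actual retention periods depend on the regulation and the type of data.

```python
from datetime import date


def age_in_years(last_modified: date, today: date) -> float:
    """Approximate age in years, using an average year length of 365.25 days."""
    return (today - last_modified).days / 365.25


def past_retention(items: dict[str, date], today: date, max_age_years: float) -> list[str]:
    """Return the names of items older than the retention limit, sorted by name."""
    return sorted(
        name
        for name, modified in items.items()
        if age_in_years(modified, today) > max_age_years
    )


# Invented items and dates, for illustration only.
items = {
    "payroll_2018.xlsx": date(2018, 3, 1),
    "passport_scan.pdf": date(2021, 6, 30),
    "onboarding_form.docx": date(2023, 11, 15),
}

# Hypothetical retention limit of three years, evaluated at the end of 2024.
overdue = past_retention(items, today=date(2024, 12, 31), max_age_years=3.0)
print(f"Items past retention: {overdue}")
```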
What one billion objects tell us
Across more than one billion data objects, every organisation we examined is collecting a significant amount of privacy data. Privacy data density ranges from 3 to 49 items out of every 1,000, well beyond what can be managed by hand.
As expected, the type of privacy data varies by industry, but the vast majority is stored in email. This makes Exchange the highest-risk content repository in any organisation. Given the prevalence of phishing and email attacks, all organisations should revisit their policies around multi-factor authentication and email security.
The age data shows most organisations aren’t cleaning up historical privacy data — allowing it to accumulate and steadily increasing both the compliance gap and the risk of regulatory fines.
Of particular concern are plain-text passwords, which are extremely common. They give attackers an easy way to escalate privileges, and they allow Copilot and other AI tools to surface credentials to users who shouldn't have them.
Although a significant portion of the billion data objects comes from Europe, multi-language privacy data was found in every organisation examined, including Spanish in the US and French in Canada. This happens both for convenience and as a deliberate way to avoid security controls, and it is a scenario organisations must account for in their information management programmes.
See how your data compares
Request a Privacy Data Risk Assessment and benchmark your environment against these industry averages.