
Insights from One Billion Data Objects

What does unstructured data actually look like in real organisations? We analysed more than one billion data items to find out.

Q4 2024 | 1B+ data objects | Europe, Asia & North America | Multiple industries

About the Report

One Billion Data Insights

The following insights are based on the statistics of more than one billion unstructured data items gathered during the fourth quarter of 2024 from customers across various industries and geographic locations in Europe, Asia, and North America.

1B+ unstructured data objects analysed

Data gathered: Q4 2024

3 regions: Europe, Asia & North America

Finding 01

Unstructured Data Distribution

This graphic shows the distribution of unstructured data across Exchange, OneDrive, SharePoint Online / Teams, and file share locations.

Distribution of unstructured data by content source

Key finding: 90% of unstructured data lives in Exchange — making email by far the highest-risk content repository in any organisation.

Finding 02

Privacy Data Density

This graphic shows the rate of occurrence of privacy information across different industries. The average across all industries is 1.37%.

Rate of privacy data occurrence by industry — average 1.37%

Every organisation we examined is collecting a significant amount of privacy data: between 3 and 49 items out of every 1,000, well beyond what any organisation can manage by hand.
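To put that range in perspective, the rates can be converted into expected item counts. The sketch below is illustrative arithmetic only: the 10-million-item repository size is a hypothetical assumption rather than a figure from the report, and the function name is ours.

```python
def expected_privacy_items(total_items: int, per_thousand: float) -> int:
    """Expected number of items containing privacy data at a given density."""
    return round(total_items * per_thousand / 1000)

# Hypothetical repository size, chosen only for illustration.
REPOSITORY_ITEMS = 10_000_000

low = expected_privacy_items(REPOSITORY_ITEMS, 3)         # lowest rate: 3 per 1,000
high = expected_privacy_items(REPOSITORY_ITEMS, 49)       # highest rate: 49 per 1,000
average = expected_privacy_items(REPOSITORY_ITEMS, 13.7)  # 1.37% average

print(low, average, high)
```

Under that assumed size, even the lowest observed rate leaves tens of thousands of items to review.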

Finding 03

Security Information

Plain-text login credentials represent one of the most dangerous risks found in unstructured data. The average is 16 plain-text credential items per user, or 8 when the automotive industry, an outlier, is excluded.

Plain-text credentials per user by industry

Average plain-text login items per user, by industry

Plain-text credentials by content source

Distribution of plain-text login items by content source

The vast majority of plain-text credentials are found in email — creating a straightforward attack vector for credential theft, privilege escalation, or unintended AI disclosure.
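As a rough illustration of what a plain-text credential looks like in an email body, the sketch below flags a label such as "password" followed by a value. The pattern, the function name, and the sample message are all assumptions made for this example; this is not the detection method used for the report, and a pattern this simple would miss many real cases.

```python
import re

# Assumed, deliberately simple pattern: a credential label followed by ":" or "="
# and a non-space value. Real detection needs far more context than this.
CREDENTIAL_PATTERN = re.compile(
    r"\b(password|passwd|pwd)\s*[:=]\s*(\S+)", re.IGNORECASE
)

def find_plaintext_credentials(text: str) -> list[str]:
    """Return the matched labels with their values masked, so results are safe to log."""
    return [f"{m.group(1)}=****" for m in CREDENTIAL_PATTERN.finditer(text)]

# Hypothetical email body used only to exercise the pattern.
sample_email = "Hi, the staging login is admin, Password: Summer2024! Thanks."
hits = find_plaintext_credentials(sample_email)
print(hits)
```

Masking the matched value keeps the scan itself from creating yet another plain-text copy of the credential.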

Finding 04

Privacy Data Distribution

This graphic shows the type of privacy data discovered. Across all industries, the most common types are national ID cards, salary and financial information, and payment cards.

Types of privacy data discovered across all industries

The type of privacy data varies by industry, but national ID cards, financial records, and payment card data are universally present — all subject to strict regulatory requirements.

Finding 05

Languages

This graphic shows the different languages used to store privacy data. The most common is English, followed by German and French.

Languages in which privacy data was discovered

Searching for privacy data in only one language is insufficient. Multi-language storage occurs in every organisation, including Spanish in the US and French in Canada, both for convenience and to deliberately avoid security controls.
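A minimal sketch of why a single-language search falls short: the same message is checked against an English-only keyword list and a multi-language one. The keyword lists, category names, and sample text are assumptions for illustration, not the classifiers used in the report.

```python
import re

# Illustrative keyword lists (assumed) in English, German, French and Spanish.
PRIVACY_TERMS = {
    "salary": ["salary", "gehalt", "salaire", "salario"],
    "passport": ["passport", "reisepass", "passeport", "pasaporte"],
}
ENGLISH_ONLY = {"salary": ["salary"], "passport": ["passport"]}

def privacy_categories(text: str, terms: dict[str, list[str]]) -> set[str]:
    """Return the categories whose keywords appear as whole words in the text."""
    lowered = text.lower()
    return {
        category
        for category, keywords in terms.items()
        if any(re.search(rf"\b{re.escape(k)}\b", lowered) for k in keywords)
    }

# Hypothetical German message mentioning a salary and a passport copy.
german_email = "Anbei Ihr Gehalt und eine Kopie vom Reisepass."
print(privacy_categories(german_email, ENGLISH_ONLY))
print(privacy_categories(german_email, PRIVACY_TERMS))
```

The English-only list finds nothing in the German message, while the multi-language list flags both categories.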

Finding 06

Privacy Data Age

This graphic shows the age, by year, of items containing privacy data. The average age of these items is 3.22 years.

Age distribution of items containing privacy data

Most organisations aren’t removing privacy data once it’s no longer needed, so it accumulates over time. This both weakens compliance and steadily increases the potential impact of a breach.
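A clean-up policy of this kind can be sketched as a simple age check against a retention period. The three-year period, the file names, and the dates below are hypothetical; the report does not state a retention period, and real requirements vary by regulation and data type.

```python
from datetime import date

RETENTION_YEARS = 3  # hypothetical retention period, not taken from the report

def age_in_years(created: date, as_of: date) -> float:
    """Approximate age of an item in years."""
    return (as_of - created).days / 365.25

def is_past_retention(created: date, as_of: date,
                      retention_years: float = RETENTION_YEARS) -> bool:
    """True if the item is older than the retention period."""
    return age_in_years(created, as_of) > retention_years

# Hypothetical items containing privacy data, with their creation dates.
as_of = date(2024, 12, 31)
items = {
    "old_contract.docx": date(2019, 6, 1),
    "recent_payslip.pdf": date(2024, 1, 15),
}
overdue = [name for name, created in items.items() if is_past_retention(created, as_of)]
print(overdue)
```

Only the 2019 item is flagged here; the same check run on a schedule would keep privacy data from accumulating past the chosen period.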

Conclusion

What one billion objects tell us

Across more than one billion data objects, every organisation we examined is collecting a significant amount of privacy data. The density of privacy data is between 3 and 49 items out of every 1,000, well beyond what can be managed by hand.

As expected, the type of privacy data varies by industry, but the vast majority is stored in email. This makes Exchange the highest-risk content repository in any organisation. Given the prevalence of phishing and email attacks, all organisations should revisit their policies around multi-factor authentication and email security.

The age data shows most organisations aren’t cleaning up historical privacy data — allowing it to accumulate and steadily increasing both the compliance gap and the risk of regulatory fines.

Of particular risk are plain-text passwords, which are extremely common. They create an easy vector for attackers to escalate privileges — or for Copilot and other AI tools to surface credentials to users who shouldn’t have them.

Although a significant portion of the billion data objects comes from Europe, multi-language privacy data was found in every organisation examined, including Spanish in the US and French in Canada. This happens both for convenience and to deliberately avoid security controls, and it is a scenario organisations must account for in their information management programmes.

See how your data compares

Request a Privacy Data Risk Assessment and benchmark your environment against these industry averages.
