
Real-World Purview Benchmark: We Tested Microsoft Purview on 89,521 Files

Data & More · 4 min read

Nobody believes a small Danish company when we say Purview classification does not work for real-world compliance with data privacy legislation. So we did something about it: we created a synthetic benchmark dataset that anyone can download and test themselves.

The Dataset

We analyzed over 5 petabytes of real-world data and generated a synthetic dataset that mirrors real unstructured data, without using any actual personal data. The distribution of file types is based on statistical analysis and reflects the typical mix found in enterprise environments.

89,521 files total
73 GB total size
863 KB average file size
100% manually validated ground truth

The dataset is available on Hugging Face for anyone to download and reproduce: huggingface.co/datasets/kbillesk/synthetic-privacy
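For readers who prefer to script the download, here is a minimal sketch using the `huggingface_hub` Python package. The repository ID is taken from the URL above; the target directory name and the `download_dataset` helper are our own illustrative choices, and we make no assumption here about how the files are laid out inside the repository. Note that the full dataset is about 73 GB, so check disk space and bandwidth first.

```python
from pathlib import Path

# Repository ID taken from the dataset URL in the article.
DATASET_REPO = "kbillesk/synthetic-privacy"


def download_dataset(target_dir: str = "synthetic-privacy") -> Path:
    """Download the whole dataset repository into target_dir (hypothetical name)."""
    # Imported lazily so this module still loads where huggingface_hub
    # is not installed (pip install huggingface_hub).
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(
        repo_id=DATASET_REPO,
        repo_type="dataset",   # the repository is a dataset, not a model
        local_dir=target_dir,  # place the files here instead of the shared cache
    )
    return Path(local_path)


# Uncomment to start the (large, roughly 73 GB) download:
# root = download_dataset()
# print(f"Dataset downloaded to {root}")
```

From there, the files can be uploaded to a test SharePoint site with whatever tooling your tenant already uses.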

Privacy and Security Data in the Benchmark

The benchmark dataset contains 25 categories of privacy and security data. The two categories used for direct Purview comparison are Payment Card (5,286 files) and Passport (53 files).

Why Classification Matters

Everything in Microsoft Purview depends on correct classification. Sensitivity labels, auto-labeling, Data Loss Prevention (DLP), and retention management all rely on the ability to classify and tag data correctly. If classification fails, every downstream feature fails with it.

Benchmark Results: Passport Classification

We used Purview’s out-of-the-box Passport classification (set to medium confidence) and compared it against the known ground truth in our dataset. The data was stored in SharePoint.

99.8% misclassified by Purview’s classification. Purview relies on simple regular expressions, which produce both false positives and false negatives, and passports stored as images or scans go undetected.
98.1% not labeled by Purview’s auto-labeling. Only 1 of the 53 passport files was correctly labeled with a sensitivity label, because Purview only supports a limited number of file types.

Benchmark Results: Payment Card Classification

For payment card data, the results were somewhat better but still alarming:

91% misclassified by Purview. The regex-based classification produces far too many false positives, incorrectly flagging data as payment cards.
57% not labeled by auto-labeling. Only 43% of payment card data was correctly labeled, because Purview only supports a limited number of file formats.
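To illustrate why a bare digit pattern tends to over-match, here is a small generic sketch. It is not Purview's actual pattern or logic: the regular expression, the sample text, and the function names are assumptions made up for illustration. It shows a 16-digit pattern flagging an order reference as a card number, and a Luhn checksum (the standard check digit scheme for payment card numbers) rejecting it.

```python
import re

# Illustrative pattern only: 16 digits in groups of four,
# optionally separated by a space or a hyphen.
CANDIDATE = re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit and double every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Keep only the regex candidates that also pass the Luhn check."""
    candidates = [m.group(0) for m in CANDIDATE.finditer(text)]
    return [c for c in candidates if luhn_valid(c)]


# Hypothetical text: an order reference plus a widely used test card number.
sample = "Order ref 1234 5678 9012 3456, paid with 4111 1111 1111 1111."

print(CANDIDATE.findall(sample))   # both 16-digit strings match the pattern
print(find_card_numbers(sample))   # only the Luhn-valid number remains
```

A checksum does not remove every false positive, since a random 16-digit string still passes about one time in ten, but it shows how much pattern matching alone can over-flag.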

What Purview Is Missing for Data Privacy Compliance

Purview has no comprehensive data privacy compliance classification. It offers only a few imprecise samples in a limited number of languages. To achieve actual compliance, organizations would need to custom-develop classification for 28+ languages across dozens of missing privacy categories.

Missing national certificates: Birth, citizenship, marriage, divorce, registered partnership, ancestry, name change, proof of citizenship, marital status, place of residence, religion, and more.

Missing privacy data types: Health information (medicine, diagnosis, illness), political orientation, sexual orientation, ethnic origin, trade union affiliation, religious orientation, personal tax information, salary information, employment contracts, recruitment data, CVs, written warnings, work absence, criminal records, written consents, travel information, photo geolocation, and more.

The Labeling Dilemma

Purview offers two labeling options, and both have fundamental limitations:

Manual Labeling

Manual labeling only works on new files, but 99.99% of data already exists. Only 0.77% of data gets manually tagged, because users dislike labeling and do not do it voluntarily. It is also limited to 4–5 label options, which is insufficient for real compliance requirements.

Auto-Labeling

Auto-labeling does not support emails at rest or attachments, and it covers only ~5% of all unstructured data in real-world environments. Its classification has a 50–99% error rate. Additional per-page costs apply for PDF and image OCR on top of the E5 license.

Try It Yourself

Download the dataset from Hugging Face, load it into your own SharePoint or Exchange environment, and run Purview classification against it. Compare the results to the manually validated ground truth. We are confident you will see the same results.
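The comparison step can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the file names are toy data, and the sketch reports standard precision and recall rather than the exact "misclassified" percentage quoted above, whose formula is not spelled out in this article. In practice, you would build the two sets from the dataset's ground truth and from an export of the files Purview flagged for a given category.

```python
from dataclasses import dataclass
from typing import AbstractSet


@dataclass(frozen=True)
class Score:
    true_positives: int   # flagged and really in the category
    false_positives: int  # flagged but not in the category
    false_negatives: int  # in the category but not flagged

    @property
    def precision(self) -> float:
        """Share of flagged files that truly belong to the category."""
        flagged = self.true_positives + self.false_positives
        return self.true_positives / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        """Share of files in the category that were actually flagged."""
        actual = self.true_positives + self.false_negatives
        return self.true_positives / actual if actual else 0.0


def score_category(ground_truth: AbstractSet[str], flagged: AbstractSet[str]) -> Score:
    """Compare the files flagged for one category against the ground truth."""
    return Score(
        true_positives=len(ground_truth & flagged),
        false_positives=len(flagged - ground_truth),
        false_negatives=len(ground_truth - flagged),
    )


# Hypothetical toy data, not taken from the benchmark.
truth = {"a.pdf", "b.docx", "c.png"}
flagged = {"a.pdf", "d.xlsx", "e.txt", "f.csv"}

result = score_category(truth, flagged)
print(result)
print(f"precision={result.precision:.2f} recall={result.recall:.2f}")
```

Running the same function once per category (for example Payment Card and Passport) gives a per-category view of false positives and false negatives.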

Want to see how Data & More classifies the same data?

Get in touch for a demo and we’ll walk you through the comparison first-hand.
