
Multi-Factor Classification

Classifying some data is easy. Classifying everything required to meet legal, regulatory, and business requirements is much harder. That’s the secret sauce that makes Data & More unique.

The Challenge

Why Multi-Factor Classification Is Important

Fact

Data classification is hard. The industry standard — a combination of regular expressions and proximity keywords — works for simple cases but misses far too much in the real world.

Here’s what that looks like in practice, using US passport numbers as an example.

Regular expression — US passport number
^[A-Z]{2}\d{6}$

This pattern matches a string of exactly two uppercase letters followed by exactly six digits. But regular expressions alone produce too many false positives: in manufacturing, part numbers could match this pattern; in retail, so could account numbers. So proximity keywords are added:

passport, pass, passport number, passport no, passport id, passports, date of issue, date of expiry

A string matching the regex and containing a nearby keyword is more likely to be a real US passport. But significant problems remain:

False positives

The UK, India, China, Japan, South Korea, Singapore, Thailand, and Taiwan all use the same two-letter + digit format — so this pattern incorrectly flags all of them as US passports.

False negatives

If passport information is recorded without the specified keywords, or stored in another language, the proximity match fails entirely — and the data is missed.
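
The regex-plus-keyword approach described above, and both of its failure modes, can be sketched in a few lines of Python. This is a minimal illustration rather than anyone's production detector: it reuses the article's example pattern and keyword list, swaps the `^...$` anchors for word boundaries so the pattern can be found inside running text, and uses a hypothetical 40-character context window. The function name and sample strings are assumptions made for the example.

```python
import re

# The article's illustrative pattern, with word boundaries instead of ^...$
# so that it can match inside a longer text (an assumption of this sketch).
PASSPORT_RE = re.compile(r"\b[A-Z]{2}\d{6}\b")

# Proximity keywords taken from the list above.
KEYWORDS = [
    "passport", "pass", "passport number", "passport no",
    "passport id", "passports", "date of issue", "date of expiry",
]
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)

WINDOW = 40  # characters of context on each side; an arbitrary choice


def find_passport_candidates(text: str, window: int = WINDOW) -> list[str]:
    """Return pattern matches that have a proximity keyword nearby."""
    hits = []
    for m in PASSPORT_RE.finditer(text):
        start = max(0, m.start() - window)
        context = text[start:m.end() + window]
        if KEYWORD_RE.search(context):
            hits.append(m.group())
    return hits


# Intended case: keyword and number appear together.
print(find_passport_candidates("Passport no: AB123456, date of issue 2020-01-01"))
# False positive: a part number near the unrelated word "pass".
print(find_passport_candidates("Gate pass issued for part CD654321"))
# False negative: a German label ("Reisepass-Nr.") matches no keyword.
print(find_passport_candidates("Reisepass-Nr.: EF112233"))
```

The second and third calls show why this baseline is not enough: the part number is accepted because "pass" happens to be nearby, and the German record is skipped because none of the English keywords appear.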

“In the same way multi-factor authentication increases the level of security on your accounts, multi-factor classification increases the accuracy and reliability of your privacy data detection.”

Our Approach

How We Do It

Multi-Factor Classification is built on three foundational steps — and then extended with AI-based validation, multilingual language models, and a comprehensive library of counter-rules.

Step 1

Understand the Applicable Legislation

Read and understand every law and industry standard that applies to the collection, storage, and use of privacy data.

Step 2

Develop a Privacy Data Mapping

Translate each law and standard into a concrete mapping of privacy data and regulatory requirements that an organisation can implement.

Step 3

Create a Classification Library

Build a comprehensive privacy data classification library — then test and tune it across enough data and languages to minimise false positives and eliminate missed items.

[Figure: The Multi-Factor Classification process, from legislation to classification library]

A collection of regular expressions and proximity keywords isn’t enough. That’s why Data & More adds further validation factors:

AI-Based ID Card Analysis

Computer vision trained to identify and validate identity documents across countries and formats.

Multilingual Language Models

AI-trained models covering 27+ languages, ensuring privacy data is found regardless of the language it’s stored in.

Counter-Rules Library

A large library of counter-rules that systematically eliminate false positives and prevent misclassification.
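
One way to picture a counter-rule is as a pattern that, when found near a candidate match, vetoes the classification. The sketch below is a hypothetical illustration of that idea only; the rule names, the patterns, the 40-character window, and the `apply_counter_rules` function are all assumptions made for the example and do not describe the actual Data & More library.

```python
import re
from dataclasses import dataclass

# Candidate pattern reused from the passport example (illustrative only).
CANDIDATE_RE = re.compile(r"\b[A-Z]{2}\d{6}\b")


@dataclass(frozen=True)
class CounterRule:
    name: str
    pattern: re.Pattern


# Hypothetical counter-rules: context that suggests the value is NOT a passport.
COUNTER_RULES = [
    CounterRule(
        "part_number",
        re.compile(r"\b(?:part|sku|item)\s*(?:no\b|number\b|#)", re.IGNORECASE),
    ),
    CounterRule(
        "account_number",
        re.compile(r"\b(?:account|acct)\s*(?:no\b|number\b|#)", re.IGNORECASE),
    ),
]


def apply_counter_rules(text: str, window: int = 40):
    """Split candidates into kept values and rejected values (with the rule name)."""
    kept, rejected = [], {}
    for m in CANDIDATE_RE.finditer(text):
        context = text[max(0, m.start() - window):m.end() + window]
        rule = next((r for r in COUNTER_RULES if r.pattern.search(context)), None)
        if rule is not None:
            rejected[m.group()] = rule.name
        else:
            kept.append(m.group())
    return kept, rejected


print(apply_counter_rules("Part no. AB123456 in stock"))
print(apply_counter_rules("Passport no: AB123456"))
```

Recording which rule rejected a value, rather than silently dropping it, makes it easier to audit and tune the rules later.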

A Real Example

The Complexity in Practice

All 27+ North American privacy laws define marital status as Personally Identifiable Information. Accurately detecting marriage certificates, separation agreements, and civil union documents is already complex, and it becomes far more complex across languages and countries.

Germany
  • Registered partnership: Lebenspartnerschaftsurkunde
  • Civil status document: Personenstandsurkunde
  • Family status document: Familienstandsurkunde
  • Marital status record: Familienbuch
  • Dissolution of partnership: Lebenspartnerschaftsurkunde
  • Divorce / invalidity: Scheidungsurkunde
Poland
  • Short-form marriage cert.: odpis skrócony aktu małżeństwa
  • Long-form marriage cert.: odpis zupełny aktu małżeństwa
  • Certificate of civil status: zaświadczenie o stanie cywilnym
  • Court ruling on divorce: orzeczenie sądu o rozwodzie
  • Court ruling on separation: orzeczenie sądu o separacji
  • Annulment of marriage: orzeczenie sądu o unieważnieniu małżeństwa
Czechia
  • Confirmation of marriage: potvrzení o uzavření manželství
  • Registered partnership cert.: doklad o registrovaném partnerství
  • Legal capacity certificate: vysvědčení o právní způsobilosti
  • Family register data: potvrzení o údajích zapsaných v matriční knize
2,500+ classification patterns in the library

The example above covers part of just one classification pattern. If your privacy data classification isn’t finding all these documents — and their equivalents in every language you operate in — it’s missing data and exposing your organisation to risk.
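
To make the multilingual problem concrete, here is a minimal sketch of a per-language term lookup built from the German, Polish, and Czech document names listed above. The dictionary layout, the language codes, and the function names are assumptions for illustration. It also relies on plain substring matching after Unicode normalisation and case folding, so it will miss inflected or abbreviated forms that a real multilingual model would need to handle.

```python
import unicodedata

# Document names copied from the example above, keyed by language code.
MARITAL_STATUS_TERMS = {
    "de": [
        "Lebenspartnerschaftsurkunde", "Personenstandsurkunde",
        "Familienstandsurkunde", "Familienbuch", "Scheidungsurkunde",
    ],
    "pl": [
        "odpis skrócony aktu małżeństwa", "odpis zupełny aktu małżeństwa",
        "zaświadczenie o stanie cywilnym", "orzeczenie sądu o rozwodzie",
        "orzeczenie sądu o separacji",
        "orzeczenie sądu o unieważnieniu małżeństwa",
    ],
    "cs": [
        "potvrzení o uzavření manželství", "doklad o registrovaném partnerství",
        "vysvědčení o právní způsobilosti",
        "potvrzení o údajích zapsaných v matriční knize",
    ],
}


def normalize(text: str) -> str:
    """Compose accented characters (NFC) and fold case for comparison."""
    return unicodedata.normalize("NFC", text).casefold()


def find_marital_status_terms(text: str) -> list[tuple[str, str]]:
    """Return (language, term) pairs whose term occurs in the text."""
    haystack = normalize(text)
    return [
        (lang, term)
        for lang, terms in MARITAL_STATUS_TERMS.items()
        for term in terms
        if normalize(term) in haystack
    ]


print(find_marital_status_terms("W załączeniu ODPIS SKRÓCONY AKTU MAŁŻEŃSTWA."))
print(find_marital_status_terms("Anbei die Scheidungsurkunde."))
print(find_marital_status_terms("See attached marriage certificate."))
```

Even this small table needs fifteen distinct terms for three countries, which gives a sense of how quickly a single classification pattern grows.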

Classification Library

Data & More Privacy Classification

A summary of the major items included in the Privacy Data Classification Library, organised by data sensitivity level.

Confidential Personal Data
  • European & international ID: ID card, number or information
  • Social Security info: Social security card, number or information
  • Health cards: Health card, number or information
  • Drivers’ licenses: The card, number or information
  • Passports: The passport, number or information
  • Credit cards: The credit card, number or information
  • Tax information: Tax returns etc.
  • Residence permit: Permits and information in them
  • Salary information: Pay slips etc.
  • Employment documents: Contracts etc.
  • Recruitment: Applications, job offers, CVs, interviews
  • Bonus agreements
  • Dismissal or resignation: Terminations or resignations
  • Written warnings: Warnings and expulsions
Criminal Offences
  • Criminal record: Criminal records and related information
  • Offenses, fines & convictions: Convictions, fines etc.
Sensitive Personal Data
  • Health info: Diagnoses, illnesses, medication, sick leave
  • Trade union membership
  • Ethnic origin: Country of origin or ethnic background
  • Political orientation: Membership of a political party
  • Religious belief: Religious orientation or membership
  • Sexual orientation: Information about sexual orientation
Non-Sensitive Personal Data
  • Pictures with a face: Used across various document classes
  • Travel information: Bookings, reservations, flights showing location history
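
The grouping above could be represented as a simple lookup from category to sensitivity level. The sketch below is only an illustration of that data structure: the `Sensitivity` enum, the `CATEGORY_LEVELS` table (which covers just a handful of the categories listed), and the `group_by_level` helper are hypothetical names, not part of the product.

```python
from enum import Enum


class Sensitivity(Enum):
    CONFIDENTIAL = "Confidential Personal Data"
    CRIMINAL = "Criminal Offences"
    SENSITIVE = "Sensitive Personal Data"
    NON_SENSITIVE = "Non-Sensitive Personal Data"


# A small subset of the categories listed above, mapped to their level.
CATEGORY_LEVELS = {
    "Passports": Sensitivity.CONFIDENTIAL,
    "Credit cards": Sensitivity.CONFIDENTIAL,
    "Salary information": Sensitivity.CONFIDENTIAL,
    "Criminal record": Sensitivity.CRIMINAL,
    "Health info": Sensitivity.SENSITIVE,
    "Trade union membership": Sensitivity.SENSITIVE,
    "Pictures with a face": Sensitivity.NON_SENSITIVE,
    "Travel information": Sensitivity.NON_SENSITIVE,
}


def group_by_level(found_categories: list[str]) -> dict[Sensitivity, list[str]]:
    """Group detected categories by sensitivity level, skipping unknown ones."""
    grouped: dict[Sensitivity, list[str]] = {}
    for category in found_categories:
        level = CATEGORY_LEVELS.get(category)
        if level is not None:
            grouped.setdefault(level, []).append(category)
    return grouped


report = group_by_level(["Passports", "Health info", "Unknown", "Credit cards"])
for level, categories in report.items():
    print(f"{level.value}: {', '.join(categories)}")
```
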
The Solution

What Multi-Factor Classification Delivers

A comprehensive, highly accurate, tested, and curated privacy data classification library — combined with a high-performance processing engine that leverages AI for multi-language data discovery that can be precisely tuned and easily customised.

What you get

A privacy data classification library built directly from 27+ North American and international privacy laws, not general-purpose regex patterns

2,500+ classification patterns covering confidential, sensitive, and non-sensitive personal data categories

AI-based ID card analysis and multilingual language models covering 27+ languages

A counter-rules library that systematically eliminates false positives and prevents misclassification

Tested, tuned, and continuously maintained against real-world data across multiple industries and geographies
