Multi-Factor Classification (v2)

The Challenge

Why Multi-Factor Classification Is Important

The industry standard for privacy data discovery is a combination of regular expressions and proximity keywords. Here’s what that looks like in practice — using US passport numbers as an example.

Regular expression — US passport number
^[A-Z]{2}\d{6}$

This pattern matches a string made up of exactly two uppercase letters followed by exactly six digits, with nothing before or after. But regular expressions alone produce too many false positives: in manufacturing, part numbers could match this pattern; in retail, so could account numbers. So proximity keywords are added:

passport, pass, passport number, passport no, passport id, passports, date of issue, date of expiry

A string matching the regex and containing a nearby keyword is more likely to be a real US passport. But significant problems remain:

False positives

The UK, India, China, Japan, South Korea, Singapore, Thailand, and Taiwan all use the same two-letter + digit format — so this pattern incorrectly flags all of them as US passports.

False negatives

If passport information is recorded without the specified keywords, or stored in another language, the proximity match fails entirely — and the data is missed.
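The regex-plus-keyword approach described above can be sketched in a few lines of Python. This is only an illustration, not any vendor's implementation: the word-boundary form of the pattern, the 40-character `WINDOW`, and the function name `find_passport_candidates` are assumptions made for the example.

```python
import re

# Assumption: the anchored pattern ^[A-Z]{2}\d{6}$ is rewritten with word
# boundaries so that candidates can be found inside running text.
PASSPORT_RE = re.compile(r"\b[A-Z]{2}\d{6}\b")

# Proximity keywords taken from the list above.
KEYWORDS = (
    "passport", "pass", "passport number", "passport no",
    "passport id", "passports", "date of issue", "date of expiry",
)

WINDOW = 40  # hypothetical proximity window, in characters


def find_passport_candidates(text: str) -> list[tuple[str, bool]]:
    """Return (match, has_nearby_keyword) for every regex hit in text."""
    results = []
    for m in PASSPORT_RE.finditer(text):
        start = max(0, m.start() - WINDOW)
        context = text[start:m.end() + WINDOW].lower()
        # Plain substring test: a keyword anywhere in the window counts.
        has_keyword = any(k in context for k in KEYWORDS)
        results.append((m.group(), has_keyword))
    return results


print(find_passport_candidates("Passport no: AB123456"))   # keyword present
print(find_passport_candidates("Part AB123456 shipped"))   # regex hit only
print(find_passport_candidates("Paszport nr AB123456"))    # Polish keyword is missed
```

The second call shows why the regex alone over-matches, and the third shows the false negative: the Polish word for passport is not in the keyword list, so a real passport number is found without any supporting keyword.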

“In the same way multi-factor authentication increases the level of security on your accounts, multi-factor classification increases the accuracy and reliability of your privacy data detection.”

Our Approach

How We Do It

Multi-Factor Classification is built on three foundational steps — and then extended with AI-based validation, multilingual language models, and a comprehensive library of counter-rules.

Understand the Applicable Legislation

Fully read and understand every law and industry standard that governs the collection, storage, and use of privacy data.

Develop a Privacy Data Mapping

Translate each law and standard into a concrete mapping of privacy data and regulatory requirements that an organisation can implement.

Create a Classification Library

Build a comprehensive privacy data classification library — then test and tune it across enough data and languages to minimise false positives and eliminate missed items.

Figure: The Multi-Factor Classification process, from legislation to classification library

A collection of regular expressions and proximity keywords isn’t enough. That’s why Data & More has added further validation factors:

AI-Based ID Card Analysis

Computer vision trained to identify and validate identity documents across countries and formats.

Multilingual Language Models

AI-trained models covering 27+ languages, ensuring privacy data is found regardless of the language it’s stored in.

Counter-Rules Library

A large library of counter-rules that systematically eliminate false positives and prevent misclassification.
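To make the idea of a counter-rule concrete, here is a minimal sketch of how supporting evidence and counter-evidence might be combined for a single candidate. It does not describe Data & More's actual engine: the term lists, the `Finding` class, the `classify` function, and the rule that a counter-term always overrides a supporting term are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass

CANDIDATE_RE = re.compile(r"\b[A-Z]{2}\d{6}\b")

# Hypothetical supporting terms, including one non-English term.
SUPPORTING_TERMS = ("passport", "paszport", "date of expiry")
# Hypothetical counter-rule terms that suggest a non-privacy meaning.
COUNTER_TERMS = ("part number", "part no", "sku", "order ref")


@dataclass
class Finding:
    value: str
    supported: bool  # a supporting term appears near the match
    countered: bool  # a counter-rule term appears near the match

    @property
    def is_privacy_data(self) -> bool:
        # Assumed policy: a counter-rule overrides supporting evidence.
        return self.supported and not self.countered


def classify(text: str, window: int = 40) -> list[Finding]:
    """Evaluate every regex candidate against both term lists."""
    findings = []
    for m in CANDIDATE_RE.finditer(text):
        context = text[max(0, m.start() - window):m.end() + window].lower()
        findings.append(Finding(
            value=m.group(),
            supported=any(t in context for t in SUPPORTING_TERMS),
            countered=any(t in context for t in COUNTER_TERMS),
        ))
    return findings


for sample in ("Paszport nr AB123456",
               "Part number AB123456, see passport scan",
               "Invoice AB123456"):
    print(sample, "->", [f.is_privacy_data for f in classify(sample)])
```

A production library would weigh many more factors, but the sketch shows the basic shape: each factor is evaluated independently, and the final decision depends on their combination rather than on the regex hit alone.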

A Real Example

The Complexity in Practice

All 27+ North American privacy laws define marital status as Personally Identifiable Information. Accurately detecting marriage certificates, separation agreements, and civil union documents is already complex, and the complexity multiplies with every additional language and country.

Germany

Registered partnership: Lebenspartnerschaftsurkunde
Civil status document: Personenstandsurkunde
Family status document: Familienstandsurkunde
Marital status record: Familienbuch
Dissolution of partnership: Lebenspartnerschaftsurkunde
Divorce / invalidity: Scheidungsurkunde

Poland

Short-form marriage cert.: odpis skrócony aktu małżeństwa
Long-form marriage cert.: odpis zupełny aktu małżeństwa
Certificate of civil status: zaświadczenie o stanie cywilnym
Court ruling on divorce: orzeczenie sądu o rozwodzie
Court ruling on separation: orzeczenie sądu o separacji
Annulment of marriage: orzeczenie sądu o unieważnieniu małżeństwa

Czechia

Confirmation of marriage: potvrzení o uzavření manželství
Registered partnership cert.: doklad o registrovaném partnerství
Legal capacity certificate: vysvědčení o právní způsobilosti
Family register data: potvrzení o údajích zapsaných v matriční knize

2,500+ classification patterns in the library

The example above covers part of just one classification pattern. If your privacy data classification isn’t finding all these documents — and their equivalents in every language you operate in — it’s missing data and exposing your organisation to risk.
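As a rough illustration of how document names in several languages might feed one classification pattern, the sketch below groups a few of the German, Polish, and Czech terms listed earlier by language code and checks a text against all of them. The dictionary layout, the name `MARITAL_STATUS_TERMS`, and the function `detect_marital_status_terms` are hypothetical; only the terms themselves come from this page.

```python
# Terms copied from the country lists earlier on this page; the structure
# and all names are hypothetical.
MARITAL_STATUS_TERMS = {
    "de": [
        "Lebenspartnerschaftsurkunde",
        "Personenstandsurkunde",
        "Familienstandsurkunde",
        "Scheidungsurkunde",
    ],
    "pl": [
        "odpis skrócony aktu małżeństwa",
        "odpis zupełny aktu małżeństwa",
        "orzeczenie sądu o rozwodzie",
    ],
    "cs": [
        "potvrzení o uzavření manželství",
        "doklad o registrovaném partnerství",
    ],
}


def detect_marital_status_terms(text: str) -> list[tuple[str, str]]:
    """Return (language, term) pairs whose term occurs in text."""
    folded = text.casefold()  # case-insensitive comparison for non-ASCII text
    hits = []
    for lang, terms in MARITAL_STATUS_TERMS.items():
        for term in terms:
            if term.casefold() in folded:
                hits.append((lang, term))
    return hits


print(detect_marital_status_terms("Die Scheidungsurkunde liegt bei."))
print(detect_marital_status_terms("W załączeniu odpis skrócony aktu małżeństwa."))
print(detect_marital_status_terms("The marriage certificate is attached."))
```

Exact substring matching like this misses inflected or abbreviated forms, which is one reason a term list alone is not enough for reliable multilingual detection.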

Classification Library

Data & More Privacy Classification

A summary of the major items included in the Privacy Data Classification Library, organised by data sensitivity level.

Confidential Personal Data

European & international ID: ID card, number or information
Social Security info: Social security card, number or information
Health cards: Health card, number or information
Drivers’ licenses: The card, number or information
Passports: The passport, number or information
Credit cards: The credit card, number or information
Tax information: Tax returns etc.
Residence permit: Permits and information in them
Salary information: Pay slips etc.
Employment documents: Contracts etc.
Recruitment: Applications, job offers, CVs, interviews
Bonus agreements
Dismissal or resignation: Terminations or resignations
Written warnings: Warnings and expulsions

Criminal Offences

Criminal record: Criminal records and related information
Offenses, fines & convictions: Convictions, fines etc.

Sensitive Personal Data

Health info: Diagnoses, illnesses, medication, sick leave
Trade union membership
Ethnic origin: Country of origin or ethnic background
Political orientation: Membership of a political party
Religious belief: Religious orientation or membership
Sexual orientation: Information about sexual orientation

Non-Sensitive Personal Data

Pictures with a face: Used across various document classes
Travel information: Bookings, reservations, flights showing location history

The Solution

What Multi-Factor Classification Delivers

A comprehensive, highly accurate, tested, and curated privacy data classification library, combined with a high-performance processing engine that uses AI for multi-language data discovery and can be precisely tuned and easily customised.

What you get

A privacy data classification library built directly from 27+ North American and international privacy laws, not from general-purpose regex patterns

2,500+ classification patterns covering confidential, sensitive, and non-sensitive personal data categories

AI-based ID card analysis and multilingual language models covering 27+ languages

A counter-rules library that systematically eliminates false positives and prevents misclassification

Tested, tuned, and continuously maintained against real-world data across multiple industries and geographies

