Multi-Factor Classification
Classifying some data is easy. Classifying everything required to meet legal, regulatory, and business requirements is much harder. That’s the secret sauce that makes Data & More unique.
Why Multi-Factor Classification Is Important
Data classification is hard. The industry standard — a combination of regular expressions and proximity keywords — works for simple cases but misses far too much in the real world.
Here’s what that approach looks like in practice, using US passport numbers as an example.
^[A-Z]{2}\d{6}$
This pattern matches exactly two uppercase letters followed by exactly six digits. But regular expressions alone produce too many false positives: in manufacturing, part numbers could match this pattern; in retail, so could account numbers. So proximity keywords are added.
A string that matches the regex and has a keyword nearby is more likely to be a real US passport number. But significant problems remain:
- The UK, India, China, Japan, South Korea, Singapore, Thailand, and Taiwan all use the same two-letter + digit format, so this pattern incorrectly flags all of them as US passports.
- If passport information is recorded without the specified keywords, or stored in another language, the proximity match fails entirely and the data is missed.
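The regex-plus-proximity-keyword approach described above can be sketched in a few lines of Python. This is only an illustration, not Data & More's implementation: the keyword list, the 40-character window, and the function name `find_passport_candidates` are assumptions made for the example, and the pattern uses `\b` word boundaries instead of `^...$` so it can be found inside longer text.

```python
import re

# Pattern from the text: two uppercase letters followed by six digits.
# \b replaces ^...$ so the pattern can match inside a longer document.
CANDIDATE = re.compile(r"\b[A-Z]{2}\d{6}\b")

# Hypothetical keyword list and window size, chosen for illustration only.
KEYWORDS = ("passport", "travel document")
WINDOW = 40  # characters inspected on each side of a candidate


def find_passport_candidates(text: str) -> list[tuple[str, bool]]:
    """Return (candidate, keyword_nearby) for every regex match in text."""
    results = []
    for match in CANDIDATE.finditer(text):
        start = max(0, match.start() - WINDOW)
        end = min(len(text), match.end() + WINDOW)
        context = text[start:end].lower()
        keyword_nearby = any(keyword in context for keyword in KEYWORDS)
        results.append((match.group(), keyword_nearby))
    return results
```

With these assumed keywords, "Passport number: AB123456" yields a keyword-supported match, and "Replacement part AB123456 ships Monday" yields a match with no supporting keyword. The same number in the German phrase "Reisepass Nr. AB123456" is also found by the regex but has no English keyword nearby, which is the language gap described in the second point above.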
“In the same way multi-factor authentication increases the level of security on your accounts, multi-factor classification increases the accuracy and reliability of your privacy data detection.”
How We Do It
Multi-Factor Classification is built on three foundational steps — and then extended with AI-based validation, multilingual language models, and a comprehensive library of counter-rules.
Understand the Applicable Legislation
Fully read and understand every law and industry standard applicable to the proper collection, storage, and usage of privacy data.
Develop a Privacy Data Mapping
Translate each law and standard into a concrete mapping of privacy data and regulatory requirements that can be implemented in an organisation.
Create a Classification Library
Build a comprehensive privacy data classification library — then test and tune it across enough data and languages to minimise false positives and eliminate missed items.
The Multi-Factor Classification process — from legislation to classification library
A collection of regular expressions and proximity keywords isn’t enough. That’s why Data & More has added further validation factors:
AI-Based ID Card Analysis
Computer vision trained to identify and validate identity documents across countries and formats.
Multilingual Language Models
AI-trained models covering 27+ languages, ensuring privacy data is found regardless of the language it’s stored in.
Counter-Rules Library
A large library of counter-rules that systematically eliminate false positives and prevent misclassification.
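To make the idea of a counter-rule concrete, here is a minimal sketch in Python. It is an assumption-laden illustration rather than the actual library: the three counter-rule patterns, the 40-character window, and the function name `classify` are all hypothetical. A candidate that matches the passport pattern is rejected when its surrounding text matches any counter-rule, such as a reference to a part number or account number.

```python
import re

# Candidate pattern from the passport example: two letters, six digits.
CANDIDATE = re.compile(r"\b[A-Z]{2}\d{6}\b")

# Hypothetical counter-rules: context that suggests the match is NOT a passport.
COUNTER_RULES = [
    re.compile(r"\bpart\s+(no|number)\b", re.IGNORECASE),
    re.compile(r"\baccount\s+(no|number)\b", re.IGNORECASE),
    re.compile(r"\bsku\b", re.IGNORECASE),
]
WINDOW = 40  # characters inspected on each side of a candidate (assumed)


def classify(text: str) -> list[tuple[str, str]]:
    """Return (candidate, verdict); verdict is 'rejected' or 'kept'."""
    verdicts = []
    for match in CANDIDATE.finditer(text):
        context = text[max(0, match.start() - WINDOW):match.end() + WINDOW]
        rejected = any(rule.search(context) for rule in COUNTER_RULES)
        verdicts.append((match.group(), "rejected" if rejected else "kept"))
    return verdicts
```

In this sketch a counter-rule only removes candidates; it never confirms one. A kept candidate would still need positive evidence from the other factors before being classified as a passport number.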
The Complexity in Practice
All 27+ North American privacy laws define marital status as Personally Identifiable Information. Accurately detecting marriage certificates, separation agreements, and civil union documents is already complex, and it becomes far more complex across languages and countries.
- Registered partnership: Lebenspartnerschaftsurkunde
- Civil status document: Personenstandsurkunde
- Family status document: Familienstandsurkunde
- Marital status record: Familienbuch
- Dissolution of partnership: Lebenspartnerschaftsurkunde
- Divorce / invalidity: Scheidungsurkunde
- Short-form marriage cert.: odpis skrócony aktu małżeństwa
- Long-form marriage cert.: odpis zupełny aktu małżeństwa
- Certificate of civil status: zaświadczenie o stanie cywilnym
- Court ruling on divorce: orzeczenie sądu o rozwodzie
- Court ruling on separation: orzeczenie sądu o separacji
- Annulment of marriage: orzeczenie sądu o unieważnieniu małżeństwa
- Confirmation of marriage: potvrzení o uzavření manželství
- Registered partnership cert.: doklad o registrovaném partnerství
- Legal capacity certificate: vysvědčení o právní způsobilosti
- Family register data: potvrzení o údajích zapsaných v matriční knize
Classification patterns in the library
The example above covers part of just one classification pattern. If your privacy data classification isn’t finding all these documents — and their equivalents in every language you operate in — it’s missing data and exposing your organisation to risk.
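As a rough illustration of what matching such terms across languages involves, the sketch below looks for a handful of the document names listed above in free text. The term set is a small sample taken from that list; the matching strategy (Unicode NFC normalisation plus case folding and substring search) and the names `normalize` and `find_marital_status_terms` are assumptions for the example, not the product's method.

```python
import unicodedata

# A sample of terms from the list above (German, Polish, Czech).
MARITAL_STATUS_TERMS = {
    "Lebenspartnerschaftsurkunde",
    "Personenstandsurkunde",
    "Scheidungsurkunde",
    "odpis skrócony aktu małżeństwa",
    "orzeczenie sądu o rozwodzie",
    "potvrzení o uzavření manželství",
    "doklad o registrovaném partnerství",
}


def normalize(value: str) -> str:
    """Compose accented characters (NFC) and fold case for comparison."""
    return unicodedata.normalize("NFC", value).casefold()


# Map each normalised term back to its original spelling.
NORMALIZED_TERMS = {normalize(term): term for term in MARITAL_STATUS_TERMS}


def find_marital_status_terms(text: str) -> list[str]:
    """Return the listed terms that occur in text, sorted alphabetically."""
    haystack = normalize(text)
    return sorted(
        original
        for normalized, original in NORMALIZED_TERMS.items()
        if normalized in haystack
    )
```

Exact substring matching like this finds "SCHEIDUNGSURKUNDE" regardless of case, but it does not handle inflected forms, which are common in Polish and Czech, so a plain term list is only a starting point.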
Data & More Privacy Classification
A summary of the major items included in the Privacy Data Classification Library, organised by data sensitivity level.
- European & international ID: ID card, number or information
- Social Security info: Social security card, number or information
- Health cards: Health card, number or information
- Drivers’ licenses: The card, number or information
- Passports: The passport, number or information
- Credit cards: The credit card, number or information
- Tax information: Tax returns etc.
- Residence permit: Permits and information in them
- Salary information: Pay slips etc.
- Employment documents: Contracts etc.
- Recruitment: Applications, job offers, CVs, interviews
- Bonus agreements
- Dismissal or resignation: Terminations or resignations
- Written warnings: Warnings and expulsions
- Criminal record: Criminal records and related information
- Offenses, fines & convictions: Convictions, fines etc.
- Health info: Diagnoses, illnesses, medication, sick leave
- Trade union membership
- Ethnic origin: Country of origin or ethnic background
- Political orientation: Membership of a political party
- Religious belief: Religious orientation or membership
- Sexual orientation: Information about sexual orientation
- Pictures with a face: Used across various document classes
- Travel information: Bookings, reservations, flights showing location history
What Multi-Factor Classification Delivers
A comprehensive, highly accurate, tested, and curated privacy data classification library, combined with a high-performance processing engine that uses AI for multi-language data discovery and can be precisely tuned and easily customised.
What you get
- A privacy data classification library built directly from 27+ North American and international privacy laws, not general-purpose regex patterns
- 2,500+ classification patterns covering confidential, sensitive, and non-sensitive personal data categories
- AI-based ID card analysis and multilingual language models covering 27+ languages
- A counter-rules library that systematically eliminates false positives and prevents misclassification
- Tested, tuned, and continuously maintained against real-world data across multiple industries and geographies
