Technical Showcase · How the AI protects documents
A full walkthrough of the two-pass redaction pipeline, regional pattern recognition, entity classification, output format generation, and privacy architecture.
Architecture
The pipeline is intentionally layered — Pass 1 catches everything that can be caught by rules alone; Pass 2 handles context-dependent entities that rules cannot identify.
A deterministic regex pass runs before any LLM call. It scans the full input for identifiers matching the active regional profile:
- fødselsnummer (11-digit) and D-number

Pattern-matched tokens are replaced immediately and marked as already redacted so the LLM pass does not double-process them.
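As a rough illustration of this pattern pass, the Python sketch below replaces 11-digit identifiers with placeholders and records where each hit occurred. It is not the production code: the function names, the `[FNR]` and `[D_NUMBER]` labels, and the regex are assumptions made for this example. It checks only the length and the leading digit, with no checksum validation.

```python
import re

# Exactly 11 digits, not embedded in a longer run of digits (assumed pattern).
ID_PATTERN = re.compile(r"(?<!\d)\d{11}(?!\d)")


def classify_id(digits):
    """Return a placeholder label for an 11-digit candidate, or None."""
    if digits[0] in "0123":   # ordinary day of birth, 01-31
        return "FNR"
    if digits[0] in "4567":   # D-numbers add 40 to the day of birth
        return "D_NUMBER"
    return None


def pattern_pass(text):
    """Replace matches with placeholders; return the new text and a hit log."""
    hits = []

    def replace(match):
        label = classify_id(match.group())
        if label is None:
            return match.group()              # leave unrecognised numbers as-is
        hits.append((label, match.start()))   # offset in the original text
        return f"[{label}]"

    return ID_PATTERN.sub(replace, text), hits
```

The bracketed placeholders stay visible in the text, so a later pass can treat anything already in brackets as redacted.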
The LLM reads the full document (with pattern-replaced tokens visible as placeholders). It identifies all remaining named entities: persons, organisations, places, and personal dates.
Each matched entity is classified by type and role (FATHER, MOTHER, JUDGE, SOCIAL WORKER, etc.) and replaced according to the selected output format. The prompt explicitly instructs the model not to infer or guess identities — only replace what is explicitly named.
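To make the replacement step concrete, here is a minimal Python sketch that applies a list of classified entities to the text. The entity schema (`text`, `type`, `role`) and the function name `apply_entities` are hypothetical; the source does not specify the format the model returns.

```python
import re


def apply_entities(text, entities):
    """Replace each entity's surface text with a bracketed role or type label."""
    # Longest strings first, so "Kari Hansen" is handled before "Kari".
    ordered = sorted(entities, key=lambda e: len(e["text"]), reverse=True)
    for ent in ordered:
        # Prefer the role (FATHER, JUDGE, ...); fall back to the entity type.
        label = ent.get("role") or ent["type"].upper()
        pattern = r"\b" + re.escape(ent["text"]) + r"\b"
        text = re.sub(pattern, lambda _m, l=label: f"[{l}]", text)
    return text
```

Only strings that appear in the entity list are touched, which mirrors the instruction to replace what is explicitly named and nothing else.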
After both passes, PHP applies final transformations:
The final output is assembled and returned as plain text. The DOCX export converts this to an OOXML document via PHP ZipArchive.
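The DOCX step can be pictured with the following Python sketch, which uses the standard `zipfile` module in place of PHP's ZipArchive. It writes the three parts a minimal OOXML word-processing package needs. The function name `build_docx` and the one-paragraph-per-line layout are assumptions for illustration, not the exporter's actual behaviour.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_docx(plain_text):
    """Return the bytes of a minimal .docx with one paragraph per input line."""
    paragraphs = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(line)}</w:t></w:r></w:p>'
        for line in plain_text.split("\n")
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{paragraphs}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", RELS)
        archive.writestr("word/document.xml", document)
    return buffer.getvalue()
```

Escaping the text before embedding it matters here, because redacted legal text can contain `&` or `<` characters that would otherwise break the XML.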
Regional pattern sets
Regional profiles are cumulative — European adds to Nordic, ECHR adds to European, Global adds to ECHR.
| Region | Patterns covered | Notes |
|---|---|---|
| Nordic ★ | Fødselsnummer, D-number, +47 phone, email, Norwegian address | Default for Norwegian documents. All local ID formats. |
| European | + IBAN, Swedish personnummer, Danish CPR, Finnish HETU, UK NI | Cross-border EU documents, Nordic neighbours. |
| ECHR | + ECHR application numbers, DOB phrases, ECtHR case references | Complaints to the European Court of Human Rights. |
| Global | + US SSN, driver's licence formats, generic document numbers | Documents involving non-European parties or jurisdictions. |
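The cumulative relationship between profiles can be sketched as an ordered list in which each region contributes its own patterns on top of everything before it. The pattern identifiers below are hypothetical labels derived from the table, not names used by the product.

```python
# Order matters: each profile includes every profile listed before it.
REGION_ORDER = ["nordic", "european", "echr", "global"]

# Patterns each region adds on its own (identifiers are illustrative).
OWN_PATTERNS = {
    "nordic": ["fodselsnummer", "d_number", "phone_no", "email", "address_no"],
    "european": ["iban", "personnummer_se", "cpr_dk", "hetu_fi", "ni_uk"],
    "echr": ["echr_application_no", "dob_phrase", "ecthr_case_ref"],
    "global": ["ssn_us", "drivers_licence", "generic_document_no"],
}


def active_patterns(region):
    """Return all pattern names active for a region, including inherited ones."""
    last = REGION_ORDER.index(region)
    return [
        name
        for r in REGION_ORDER[: last + 1]
        for name in OWN_PATTERNS[r]
    ]
```

Under this layout, selecting `"echr"` activates the Nordic and European patterns as well, while `"nordic"` stays limited to the local formats.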
Entity classification
| Entity | What qualifies | Default output (contextual) |
|---|---|---|
| person | Any personal name: first, last, or full | [ROLE], inferred from context (FATHER, MOTHER, JUDGE, etc.) |
| organisation | Companies, agencies, authorities, institutions, clubs | [ORG: partial name] or generic [ORG] |
| place | Streets, towns, counties, countries, regions | [PLACE] or [CITY] or [ADDRESS] |
| date | Dates of birth, age references, personal date phrases | [DOB] or [AGE: xx] |
Output formats
| Format | Person example | Org example |
|---|---|---|
| Contextual ★ | [FATHER] or [JUDGE: Andersen] | [BARNEVERNET: Oslo] or [ORG] |
| Generic | [PERSON] | [ORG] |
| Pseudonym | Ola Nordmann (generated) | Nordnes AS (generated) |
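The three formats differ only in how a classified person is rendered, which the Python sketch below illustrates. The class name, the `[PERSON]` fallback when no role is known, the pseudonym pool, and the `Person N` overflow label are all assumptions; the source says only that pseudonyms are generated.

```python
class PlaceholderFormatter:
    """Render a person entity in one of the three output formats (sketch)."""

    def __init__(self, fmt, pseudonyms=("Ola Nordmann", "Kari Nordmann")):
        self.fmt = fmt
        self.pool = list(pseudonyms)   # hypothetical pool of stand-in names
        self.assigned = {}             # real name -> pseudonym, kept stable

    def person(self, name, role=None):
        if self.fmt == "contextual":
            # Use the role when one is known, else a generic label.
            return f"[{role}]" if role else "[PERSON]"
        if self.fmt == "generic":
            return "[PERSON]"
        if self.fmt == "pseudonym":
            if name not in self.assigned:
                idx = len(self.assigned)
                self.assigned[name] = (
                    self.pool[idx] if idx < len(self.pool) else f"Person {idx + 1}"
                )
            return self.assigned[name]
        raise ValueError(f"unknown format: {self.fmt}")
```

Keeping the name-to-pseudonym mapping on the formatter means the same person receives the same stand-in name throughout a document, so the redacted text stays readable.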
Engines
Both engines produce the same redaction output schema. Engine choice affects accuracy on complex documents and credit cost only.
| Engine | Model | Latency | Best for |
|---|---|---|---|
| Azure gpt-4o-mini ★ | gpt-4o-mini (Azure West Europe) | ~15 s | Default. Most documents, single subject, clear formatting. |
| Azure gpt-4o | gpt-4o (Azure West Europe) | ~45 s | Complex documents with many named parties, overlapping roles, or degraded source text. |
Privacy & security
Privacy by design
The Azure OpenAI deployment (gpt-4o, gpt-4o-mini) is configured in the West Europe region. Data processed via Azure OpenAI is not used for model training under the default enterprise agreement.

Free for Do Better Norge members.