How Redact works — Two-pass pipeline, regional patterns, entity classification

Architecture

Two passes. Deterministic first, intelligent second.

The pipeline is intentionally layered: Pass 1 catches everything that can be caught by rules alone, and Pass 2 handles context-dependent entities that rules cannot identify. A PHP post-processing step (labelled Pass 3 below) then applies the final transformations.

Pass 1 · PHP / regex

Detect & replace known patterns

A deterministic regex pass runs before any LLM call. It scans the full input for identifiers matching the active regional profile:

Norwegian fødselsnummer (11-digit) and D-number
Phone numbers in +47 format, Norwegian mobile (4xx/9xx) and landline patterns
Email addresses (RFC 5322 simplified)
Norwegian postal addresses: street name + number + postnummer + poststed
Additional patterns per region (IBAN, Danish CPR, ECHR application numbers, US SSN, etc.)

Pattern-matched tokens are replaced immediately and marked as already redacted so the LLM pass does not double-process them.
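
The pattern pass described above can be sketched as follows. This is a minimal illustration in Python, not the tool's PHP code: the three regular expressions, the `[LABEL_n]` placeholder shape, and the function name `pattern_pass` are all assumptions made for the example.

```python
import re

# Illustrative patterns only; the production expressions are not given in this document.
# Order matters: each pattern runs on the output of the previous one.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    # 11 digits, optionally split 6+5. D-numbers share this shape.
    ("FNR", re.compile(r"(?<!\d)\d{6}\s?\d{5}(?!\d)")),
    # Optional +47 prefix, then an 8-digit mobile number starting with 4 or 9.
    ("PHONE", re.compile(r"(?<!\d)(?:\+47\s?)?[49]\d{2}\s?\d{2}\s?\d{3}(?!\d)")),
]


def pattern_pass(text):
    """Replace every pattern match with a numbered placeholder such as [FNR_1].

    Returns the rewritten text and a dict mapping placeholder -> original value.
    """
    found = {}
    counters = {}

    for label, rx in PATTERNS:
        def substitute(match, label=label):
            counters[label] = counters.get(label, 0) + 1
            placeholder = f"[{label}_{counters[label]}]"
            found[placeholder] = match.group(0)
            return placeholder

        text = rx.sub(substitute, text)

    return text, found


redacted, found = pattern_pass(
    "Kari, fnr 01019012345, tlf +47 912 34 567, kari@example.no"
)
print(redacted)
```

Because the placeholders have a fixed, recognisable shape, a later stage can treat them as already handled and leave them alone.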

Pass 2 · gpt-4o-mini / gpt-4o

Sweep for named entities

The LLM reads the full document (with pattern-replaced tokens visible as placeholders). It identifies all remaining named entities:

Personal names — first name, surname, or full name; handles Norwegian, Sami, and foreign names
Organisations — companies, government agencies, NGOs, religious bodies, sports clubs
Places — streets, neighbourhoods, municipalities, counties, countries
Date-of-birth and age phrases (when the Dates entity type is checked)

Each matched entity is classified by type and role (FATHER, MOTHER, JUDGE, SOCIAL WORKER, etc.) and replaced according to the selected output format. The prompt explicitly instructs the model not to infer or guess identities — only replace what is explicitly named.
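
One way to picture the replacement step is to take the model's classified entities as a list and substitute tags while skipping placeholders left by Pass 1. The sketch below is in Python rather than the tool's PHP, and the entity dict shape (`text`, `type`, `role`), the placeholder pattern, and the `TYPE_TAGS` mapping are hypothetical; the document does not specify the model's response format.

```python
import re

# Hypothetical shape of a Pass 1 placeholder, e.g. [FNR_1].
PLACEHOLDER = re.compile(r"\[[A-Z]+_\d+\]")

# Fallback tag per entity type when no role was assigned (assumed mapping).
TYPE_TAGS = {"person": "PERSON", "organisation": "ORG", "place": "PLACE", "date": "DOB"}


def apply_entities(text, entities):
    """Replace each entity's surface text with a [ROLE] or type tag."""
    # Longest names first, so "Per Hansen" is handled before "Hansen".
    ordered = sorted(entities, key=lambda e: len(e["text"]), reverse=True)

    # Split on placeholders (kept in the list) so Pass 1 output is never touched.
    parts = re.split(f"({PLACEHOLDER.pattern})", text)
    for i, part in enumerate(parts):
        if PLACEHOLDER.fullmatch(part):
            continue
        for ent in ordered:
            label = ent.get("role") or TYPE_TAGS[ent["type"]]
            tag = f"[{label.upper()}]"
            part = re.sub(r"\b" + re.escape(ent["text"]) + r"\b", tag, part)
        parts[i] = part
    return "".join(parts)


entities = [
    {"text": "Hansen", "type": "person", "role": "FATHER"},
    {"text": "Per Hansen", "type": "person", "role": "FATHER"},
    {"text": "Oslo tingrett", "type": "organisation"},
]
print(apply_entities("Per Hansen ringte Oslo tingrett. Hansen oppga [FNR_1].", entities))
```

Only strings the model explicitly returned are replaced; nothing in this sketch guesses at names that were not listed.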

Pass 3 · PHP post-processor

Post-processing & alias substitution

After both passes, PHP applies final transformations:

Officials pass — if Keep official names is checked, named judges, experts, and caseworkers get labelled tags ([JUDGE: Andersen]) instead of generic ones
Alias substitution — user-defined aliases are applied as a final regex replacement, overriding whatever the LLM assigned
Exempt name protection — any token matching an exempt name is restored to its original value
Pseudonym generation — if Pseudonym output is selected, all role tags are replaced with plausible Norwegian names, phone numbers, and addresses drawn from a generation pool
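
The precedence between aliases, exempt names, and assigned tags can be sketched as below. This Python example swaps in a marker table for the regex replacement the tool uses: it assumes earlier passes leave internal markers such as `{{1}}` and record each original value with its assigned tag. The marker syntax, the table layout, and the rule that exempt names win over aliases are assumptions for illustration only.

```python
import re

# Hypothetical internal marker left by earlier passes, e.g. {{2}}.
MARKER = re.compile(r"\{\{(\d+)\}\}")


def resolve(text, table, aliases=None, exempt=None):
    """Turn each marker into its final output string.

    table:   marker id -> {"original": str, "tag": str}
    aliases: original name -> user-defined alias
    exempt:  set of names that must appear unredacted
    """
    aliases = aliases or {}
    exempt = exempt or set()

    def render(match):
        entry = table[int(match.group(1))]
        original = entry["original"]
        if original in exempt:
            return original           # exempt names are restored verbatim
        if original in aliases:
            return aliases[original]  # a user alias overrides the assigned tag
        return entry["tag"]

    return MARKER.sub(render, text)


table = {
    1: {"original": "Per Hansen", "tag": "[FATHER]"},
    2: {"original": "Kari Olsen", "tag": "[MOTHER]"},
    3: {"original": "Oslo tingrett", "tag": "[ORG]"},
}
print(resolve("{{1}} og {{2}} møtte i {{3}}.", table,
              aliases={"Kari Olsen": "Parent B"}, exempt={"Oslo tingrett"}))
```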

The final output is assembled and returned as plain text. The DOCX export converts this to an OOXML document via PHP ZipArchive.
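
A .docx file is a ZIP archive containing a small set of XML parts, which is why a ZIP library is enough to produce one. The sketch below shows the minimal package in Python using the standard `zipfile` module as a stand-in for PHP ZipArchive; the function name `build_docx` and the one-paragraph-per-line layout are assumptions, not the tool's actual export code.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_docx(text):
    """Return the bytes of a minimal .docx with one paragraph per input line."""
    paragraphs = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(line)}</w:t></w:r></w:p>'
        for line in text.split("\n")
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{paragraphs}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", RELS)
        archive.writestr("word/document.xml", document)
    return buffer.getvalue()


data = build_docx("[FATHER] & [MOTHER]\nline two")
print(len(data) > 0)
```

Text is XML-escaped before it is embedded, so characters such as `&` and `<` in the redacted output cannot break the document part.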

Regional pattern sets

Four regions. Each adds patterns to the last.

Regional profiles are cumulative — European adds to Nordic, ECHR adds to European, Global adds to ECHR.

Region	Patterns covered	Notes
Nordic ★	Fødselsnummer, D-number, +47 phone, email, Norwegian address	Default for Norwegian documents. All local ID formats.
European	+ IBAN, Swedish personnummer, Danish CPR, Finnish HETU, UK NI	Cross-border EU documents, Nordic neighbours.
ECHR	+ ECHR application numbers, DOB phrases, ECtHR case references	Complaints to the European Court of Human Rights.
Global	+ US SSN, driver's licence formats, generic document numbers	Documents involving non-European parties or jurisdictions.
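
The cumulative behaviour in the table can be modelled as an ordered list of regions, each contributing its own additions. The Python sketch below only illustrates that data structure; the pattern identifiers such as `cpr_dk` are hypothetical names, not the tool's internal keys.

```python
# Regions in inheritance order: each one includes everything before it.
REGION_ORDER = ["nordic", "european", "echr", "global"]

# What each region adds on top of the previous one (names are illustrative).
REGION_ADDS = {
    "nordic": ["fodselsnummer", "d_number", "phone_no", "email", "address_no"],
    "european": ["iban", "personnummer_se", "cpr_dk", "hetu_fi", "ni_uk"],
    "echr": ["echr_application_no", "dob_phrase", "ecthr_case_ref"],
    "global": ["ssn_us", "drivers_licence", "generic_document_no"],
}


def active_patterns(region):
    """Return every pattern name enabled for `region`, inherited ones first."""
    index = REGION_ORDER.index(region)  # raises ValueError for an unknown region
    names = []
    for name in REGION_ORDER[: index + 1]:
        names.extend(REGION_ADDS[name])
    return names


print(active_patterns("european"))
```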

Entity classification

What the LLM identifies and how it labels each type.

Named entity types

Entity	What qualifies	Default output (contextual)
`person`	Any personal name — first, last, or full	[ROLE] — inferred from context (FATHER, MOTHER, JUDGE, etc.)
`organisation`	Companies, agencies, authorities, institutions, clubs	[ORG: partial name] or generic [ORG]
`place`	Streets, towns, counties, countries, regions	[PLACE], [CITY], or [ADDRESS]
`date`	Dates of birth, age references, personal date phrases	[DOB] or [AGE: xx]

Output format comparison

Format	Person example	Org example
Contextual ★	`[FATHER]` or `[JUDGE: Andersen]`	`[BARNEVERNET: Oslo]` or `[ORG]`
Generic	`[PERSON]`	`[ORG]`
Pseudonym	Ola Nordmann (generated)	Nordnes AS (generated)
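
The person column of the table can be read as a small rendering function. The sketch below is a Python illustration under assumptions: the function name, the `keep_official` flag (standing in for the Keep official names option), and the fixed pseudonym are hypothetical, and real pseudonyms would come from the generation pool.

```python
def render_person(role, surname, fmt, keep_official=False, pseudonym="Ola Nordmann"):
    """Render one person entity under the selected output format."""
    if fmt == "generic":
        return "[PERSON]"
    if fmt == "pseudonym":
        return pseudonym  # placeholder for a name drawn from the pool
    if fmt == "contextual":
        if keep_official and surname:
            return f"[{role}: {surname}]"
        return f"[{role}]"
    raise ValueError(f"unknown output format: {fmt}")


print(render_person("JUDGE", "Andersen", "contextual", keep_official=True))
```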

Engines

Two engines, one redaction schema.

Both engines produce the same redaction output schema. Engine choice affects only latency, credit cost, and accuracy on complex documents.

Engine	Model	Latency	Best for
Azure gpt-4o-mini ★	`gpt-4o-mini` (Azure West Europe)	~15 s	Default. Most documents, single subject, clear formatting.
Azure gpt-4o	`gpt-4o` (Azure West Europe)	~45 s	Complex documents with many named parties, overlapping roles, or degraded source text.

Privacy & security

Processed in memory. Saved only when you say so.

Privacy by design

All uploaded files are extracted to text in memory using PHP's in-process file handlers. The raw binary is not written to disk on the server.
The redacted output is not retained unless you choose to save it. You can save to Min Sak (for use in other tools), to your corpus (for searchable reuse), download as .docx, or copy to clipboard.
Azure OpenAI (gpt-4o, gpt-4o-mini) is deployed in the West Europe region. Data processed via Azure OpenAI is not used for model training under the default enterprise agreement.
Azure OpenAI is called only for the LLM sweep pass. No document content is retained by Azure after the response is returned, per the enterprise data-handling agreement.
Telemetry logged: tool name, engine, mode, region, latency. No document text, entity names, or redacted content is logged.
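
One way to make the telemetry rule hard to violate is to give the log record a fixed set of fields, none of which can carry document content. The Python sketch below illustrates that idea; the class name, field names, and millisecond unit are assumptions, not the tool's actual log format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TelemetryEvent:
    """Only the five fields listed above; there is no slot for document text."""
    tool: str
    engine: str
    mode: str
    region: str
    latency_ms: int


def log_line(event):
    """Serialise an event as one JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


print(log_line(TelemetryEvent("redact", "gpt-4o-mini", "contextual", "nordic", 15000)))
```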

How Redact knows what to replace.