Normalize messy enterprise documents for LLMs (ISO‑8601 dates, ISO‑4217 currency, E.164 phones)
Normalize every field to canonical, machine-consumable standards so downstream LLMs, analytics, and RAG stay accurate and debuggable. This capability ships with opinionated defaults and schema constraints that enforce consistency across dates, currency, phone numbers, booleans, and IDs—no templates required.
-
Standards-first: ISO‑8601 for dates/timestamps, ISO‑4217 for currency codes, E.164 for phone numbers
-
Schema-driven: enums, patterns, and descriptions to prevent drift and synonym errors
-
Audit-ready: predictable keys, traceable values, and validation-friendly outputs
Phone Number Normalization (E.164)
Normalize all phone numbers to E.164: a leading plus (+), country code, and subscriber number with no spaces or punctuation. Avoid locale formats and descriptive strings.
-
Canonical format: +
(e.g., +14155550123) -
Do not include extensions inline; capture as a separate field when needed
-
Reject free text like "(415) 555-0123 ext. 9" in normalized fields
Before → After
| Field | Before Example | After Normalization |
|---|---|---|
| phone_number | (415) 555-0123 | +14155550123 |
| phone_number | 020 7183 8750 | +442071838750 |
| phone_number | +91-80-1234-5678 ext. 204 | +918012345678 |
JSON Schema constraints (example)
{
"type": "object",
"properties": {
"phone_number": {
"type": "string",
"description": "Primary contact phone in E.164 format (e.g., +14155550123)",
"pattern": "^\\+[1-9]\\d{1,14}$"
},
"phone_extension": {
"type": "string",
"description": "Numeric extension if present (do not append to phone_number)",
"pattern": "^\\d{1,10}$"
}
},
"required": ["phone_number"]
}
JSON Schema: Enums, Patterns, and Formats
Use enums and patterns to lock fields to allowed values and shapes:
-
Currency codes: restrict to ISO‑4217 enums; separate amount (number) from currency (string)
-
Status fields: enforce small, explicit sets (e.g., ["paid", "due", "overdue"]) instead of free text
-
Identifiers: define patterns that preserve leading zeros and expected length
Example schema fragment
{
"type": "object",
"properties": {
"invoice_date": {
"type": "string",
"description": "ISO-8601 date the invoice was issued (YYYY-MM-DD)",
"pattern": "^\\d{4}-\\d{2}-\\d{2}$"
},
"amount": {
"type": "number",
"description": "Monetary amount as a number (no symbols)"
},
"currency": {
"type": "string",
"description": "ISO-4217 currency code",
"enum": ["USD", "EUR", "GBP", "JPY", "CAD", "AUD", "INR"]
},
"status": {
"type": "string",
"description": "Payment status",
"enum": ["paid", "due", "overdue"]
},
"invoice_number": {
"type": "string",
"description": "Document-labeled invoice number; preserve leading zeros",
"pattern": "^[A-Za-z0-9-]{1,32}$"
}
},
"required": ["invoice_date", "amount", "currency", "invoice_number"]
}
Notes
-
For timestamps, prefer full ISO‑8601 (e.g., 2024-05-12T14:03:22Z). When only dates are present, use YYYY-MM-DD.
-
Keep key casing consistent (e.g., snake_case) and use all-caps for codes like currency.
Before/After: Combined Normalization
// Before
{
"InvoiceNumber": 4521,
"Invoice Date": "12th May, 2024",
"Amount": "$1,200.00 USD",
"Paid?": "yes",
"currencyType": "us dollar",
"Phone": "(415) 555-0123 ext. 9"
}
// After
{
"invoice_number": "4521",
"invoice_date": "2024-05-12",
"amount": 1200.00,
"currency": "USD",
"paid": true,
"phone_number": "+14155550123",
"phone_extension": "9"
}
Quality gates and audits
-
Validate: enforce JSON Schema before writing to downstream systems
-
Reject: block values that fail enums/patterns; log with source citation
-
Monitor: track normalization error rates by field and document type
-
Remediate: iterate schema descriptions to reduce ambiguity and drift
See also
-
Schema design pitfalls and fixes: Extraction Trouble? Here Are 5 Pitfalls to Avoid when Configuring Your JSON Schema
-
Why normalization improves RAG/search: Parsing API for LLM and RAG pipelines and Elasticsearch semantic search guide
Normalization for LLMs
In document processing pipelines, normalization is critical for transforming unstructured data into LLM-ready structured outputs. Proper normalization ensures that fields have consistent types, values, and formats—eliminating ambiguities that can confuse downstream models, increase hallucinations, or hinder retrieval-augmented generation (RAG).
Why Normalization Matters for LLMs
-
Exact Matching: Retrieval systems and downstream analytics rely on normalized, canonical outputs (e.g., dates, currencies) for filtering and query.
-
Minimized Drift: Consistent data types and values reduce errors and inconsistencies as documents traverse different systems.
-
LLM Performance: LLMs produce more reliable and accurate generations when their inputs are predictable.
-
AI Auditing: Compliance teams require normalized, auditable outputs for traceability and assurance.
Key Normalization Types
1. Typed Outputs
-
Always specify field types in your extraction schema:
string,number,boolean,object, orarray. -
Human-readable fields (like IDs or amounts) must not be ambiguously typed. E.g. Account numbers as
stringto preserve leading zeros; amounts asnumberfor sorting and math.
2. Use of Enums
-
Enumerated fields (enums) ensure outputs are limited to defined values, reducing drift caused by synonyms or misspellings.
-
Example: For a currency type field, enforce only ["USD", "EUR", "JPY"], NOT free-form strings that could include "US Dollar", etc.
3. Currency Normalization
-
Represent currency codes using the ISO 4217 standard (e.g., USD, EUR, JPY) in all monetary fields.
-
Separate amount (
number) from currency (string). -
Example:
{ "amount": 250.00, "currency": "USD" }
4. Date Normalization
-
Dates should conform to ISO 8601 format (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SSZ), not locale-dependent strings.
-
Always clarify in the schema description what date/timestamp format is expected.
-
Example:
{ "due_date": "2025-07-01" }
5. Boolean Normalization
-
Boolean fields should be strictly
trueorfalse, never text values like "yes", "no", "checked", or blank. -
Use
enumtype in the schema for rare cases where tristate (e.g., "yes", "no", "n/a") is needed.
6. Casing Normalization
-
Prefer
snake_casefor all key names. -
String value casing: Define and enforce requirements—such as all-caps for codes ("USD"), Title Case for names, etc.
-
Avoid mixed or unpredictable casing in outputs.
Schema Best Practices
The first line of defense for normalization is a robust extraction schema. Poorly defined schemas propagate errors; a precise schema minimizes confusion for both LLMs and downstream systems.
-
Add descriptions to every field.
-
Use descriptive, document-aligned field names (e.g.,
invoice_datenotid_32). -
Apply
enum,format, and pattern constraints where possible. -
Don’t prompt LLMs to perform calculations in extraction—extract only raw values.
For an in-depth guide and common schema pitfalls, read Extraction Trouble? Here Are 5 Pitfalls to Avoid when Configuring Your JSON Schema.
Example: Before and After Normalization
Before (inconsistent, unnormalized outputs)
{
"InvoiceNumber": 4521,
"Invoice Date": "12th May, 2024",
"Amount": "$1,200.00 USD",
"Paid?": "yes",
"currencyType": "us dollar"
}
After (normalized, LLM-ready outputs)
{
"invoice_number": "4521",
"invoice_date": "2024-05-12",
"amount": 1200.00,
"currency": "USD",
"paid": true
}
Key Improvements
-
All field names are
snake_case. -
Invoice number is string, supporting leading zeros.
-
Date is ISO 8601.
-
Amount is type
number; currency is separate and in ISO 4217. -
Boolean
paidusestrue/false.
Cross-link: Further Reading
For practical extraction schema tips, JSON field configuration, and avoiding normalization failures, visit Extraction Trouble? Here Are 5 Pitfalls to Avoid when Configuring Your JSON Schema.
For an overview of extraction pipelines and more before/after examples, see our docs.
Summary Table
| Field | Before Example | After Normalization | Type | Notes |
|---|---|---|---|---|
| Invoice Date | 12th May, 2024 | 2024-05-12 | string | ISO 8601 |
| Amount | $1,200.00 USD | 1200.00 | number | Only digits, separator . |
| Currency | us dollar | USD | string | ISO 4217 enum |
| Paid | yes | true | boolean | true/false only |
| InvoiceNumber | 4521 | "4521" | string | Preserve leading zeros |
Consistent normalization—enforced by a strong schema and clear conventions—is essential for scalable, maintainable, and trustworthy AI document pipelines.