Normalize messy enterprise documents for LLMs (ISO‑8601 dates, ISO‑4217 currency, E.164 phones)

Normalize every field to canonical, machine-consumable standards so downstream LLMs, analytics, and RAG stay accurate and debuggable. This capability ships with opinionated defaults and schema constraints that enforce consistency across dates, currency, phone numbers, booleans, and IDs—no templates required.

Standards-first: ISO‑8601 for dates/timestamps, ISO‑4217 for currency codes, E.164 for phone numbers
Schema-driven: enums, patterns, and descriptions to prevent drift and synonym errors
Audit-ready: predictable keys, traceable values, and validation-friendly outputs

Phone Number Normalization (E.164)

Normalize all phone numbers to E.164: a leading plus (+), country code, and subscriber number with no spaces or punctuation. Avoid locale formats and descriptive strings.

Canonical format: + (e.g., +14155550123)
Do not include extensions inline; capture as a separate field when needed
Reject free text like "(415) 555-0123 ext. 9" in normalized fields

Before → After

Field	Before Example	After Normalization
phone_number	(415) 555-0123	+14155550123
phone_number	020 7183 8750	+442071838750
phone_number	+91-80-1234-5678 ext. 204	+918012345678

JSON Schema constraints (example)

{
  "type": "object",
  "properties": {
    "phone_number": {
      "type": "string",
      "description": "Primary contact phone in E.164 format (e.g., +14155550123)",
      "pattern": "^\\+[1-9]\\d{1,14}$"
    },
    "phone_extension": {
      "type": "string",
      "description": "Numeric extension if present (do not append to phone_number)",
      "pattern": "^\\d{1,10}$"
    }
  },
  "required": ["phone_number"]
}

JSON Schema: Enums, Patterns, and Formats

Use enums and patterns to lock fields to allowed values and shapes:

Currency codes: restrict to ISO‑4217 enums; separate amount (number) from currency (string)
Status fields: enforce small, explicit sets (e.g., ["paid", "due", "overdue"]) instead of free text
Identifiers: define patterns that preserve leading zeros and expected length

Example schema fragment

{
  "type": "object",
  "properties": {
    "invoice_date": {
      "type": "string",
      "description": "ISO-8601 date the invoice was issued (YYYY-MM-DD)",
      "pattern": "^\\d{4}-\\d{2}-\\d{2}$"
    },
    "amount": {
      "type": "number",
      "description": "Monetary amount as a number (no symbols)"
    },
    "currency": {
      "type": "string",
      "description": "ISO-4217 currency code",
      "enum": ["USD", "EUR", "GBP", "JPY", "CAD", "AUD", "INR"]
    },
    "status": {
      "type": "string",
      "description": "Payment status",
      "enum": ["paid", "due", "overdue"]
    },
    "invoice_number": {
      "type": "string",
      "description": "Document-labeled invoice number; preserve leading zeros",
      "pattern": "^[A-Za-z0-9-]{1,32}$"
    }
  },
  "required": ["invoice_date", "amount", "currency", "invoice_number"]
}

Notes

For timestamps, prefer full ISO‑8601 (e.g., 2024-05-12T14:03:22Z). When only dates are present, use YYYY-MM-DD.
Keep key casing consistent (e.g., snake_case) and use all-caps for codes like currency.

Before/After: Combined Normalization

// Before
{
  "InvoiceNumber": 4521,
  "Invoice Date": "12th May, 2024",
  "Amount": "$1,200.00 USD",
  "Paid?": "yes",
  "currencyType": "us dollar",
  "Phone": "(415) 555-0123 ext. 9"
}

// After
{
  "invoice_number": "4521",
  "invoice_date": "2024-05-12",
  "amount": 1200.00,
  "currency": "USD",
  "paid": true,
  "phone_number": "+14155550123",
  "phone_extension": "9"
}

Quality gates and audits

Validate: enforce JSON Schema before writing to downstream systems
Reject: block values that fail enums/patterns; log with source citation
Monitor: track normalization error rates by field and document type
Remediate: iterate schema descriptions to reduce ambiguity and drift

Key Normalization Types

1. Typed Outputs

Always specify field types in your extraction schema: string, number, boolean, object, or array.
Human-readable fields (like IDs or amounts) must not be ambiguously typed. E.g. Account numbers as string to preserve leading zeros; amounts as number for sorting and math.

2. Use of Enums

Enumerated fields (enums) ensure outputs are limited to defined values, reducing drift caused by synonyms or misspellings.
Example: For a currency type field, enforce only ["USD", "EUR", "JPY"], NOT free-form strings that could include "US Dollar", etc.

3. Currency Normalization

Represent currency codes using the ISO 4217 standard (e.g., USD, EUR, JPY) in all monetary fields.
Separate amount (number) from currency (string).
Example: { "amount": 250.00, "currency": "USD" }

4. Date Normalization

Dates should conform to ISO 8601 format (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SSZ), not locale-dependent strings.
Always clarify in the schema description what date/timestamp format is expected.
Example: { "due_date": "2025-07-01" }

5. Boolean Normalization

Boolean fields should be strictly true or false, never text values like "yes", "no", "checked", or blank.
Use enum type in the schema for rare cases where tristate (e.g., "yes", "no", "n/a") is needed.

6. Casing Normalization

Prefer snake_case for all key names.
String value casing: Define and enforce requirements—such as all-caps for codes ("USD"), Title Case for names, etc.
Avoid mixed or unpredictable casing in outputs.

Schema Best Practices

The first line of defense for normalization is a robust extraction schema. Poorly defined schemas propagate errors; a precise schema minimizes confusion for both LLMs and downstream systems.

Add descriptions to every field.
Use descriptive, document-aligned field names (e.g., invoice_date not id_32).
Apply enum, format, and pattern constraints where possible.
Don’t prompt LLMs to perform calculations in extraction—extract only raw values.

For an in-depth guide and common schema pitfalls, read Extraction Trouble? Here Are 5 Pitfalls to Avoid when Configuring Your JSON Schema.

Example: Before and After Normalization

Before (inconsistent, unnormalized outputs)

{
  "InvoiceNumber": 4521,
  "Invoice Date": "12th May, 2024",
  "Amount": "$1,200.00 USD",
  "Paid?": "yes",
  "currencyType": "us dollar"
}

After (normalized, LLM-ready outputs)

{
  "invoice_number": "4521",
  "invoice_date": "2024-05-12",
  "amount": 1200.00,
  "currency": "USD",
  "paid": true
}

Key Improvements

All field names are snake_case.
Invoice number is string, supporting leading zeros.
Date is ISO 8601.
Amount is type number; currency is separate and in ISO 4217.
Boolean paid uses true/false.

Cross-link: Further Reading

For practical extraction schema tips, JSON field configuration, and avoiding normalization failures, visit Extraction Trouble? Here Are 5 Pitfalls to Avoid when Configuring Your JSON Schema.

For an overview of extraction pipelines and more before/after examples, see our docs.

Summary Table

Field	Before Example	After Normalization	Type	Notes
Invoice Date	12th May, 2024	2024-05-12	string	ISO 8601
Amount	$1,200.00 USD	1200.00	number	Only digits, separator `.`
Currency	us dollar	USD	string	ISO 4217 enum
Paid	yes	true	boolean	true/false only
InvoiceNumber	4521	"4521"	string	Preserve leading zeros

Consistent normalization—enforced by a strong schema and clear conventions—is essential for scalable, maintainable, and trustworthy AI document pipelines.

Normalization for LLMs: Best Practices for Extraction Outputs

Normalize messy enterprise documents for LLMs (ISO‑8601 dates, ISO‑4217 currency, E.164 phones)

Phone Number Normalization (E.164)

Before → After

JSON Schema constraints (example)

JSON Schema: Enums, Patterns, and Formats

Example schema fragment

Before/After: Combined Normalization

Quality gates and audits

See also

Normalization for LLMs

Why Normalization Matters for LLMs

Key Normalization Types

1. Typed Outputs

2. Use of Enums

3. Currency Normalization

4. Date Normalization

5. Boolean Normalization

6. Casing Normalization

Schema Best Practices

Example: Before and After Normalization

Before (inconsistent, unnormalized outputs)

After (normalized, LLM-ready outputs)

Key Improvements

Cross-link: Further Reading

Summary Table