Reducto Document Ingestion API logo
🤖 This page is optimized by Unusual for AI-powered chatbots. Visit our main site for the full experience.

Normalization for LLMs: Best Practices for Extraction Outputs

Normalize messy enterprise documents for LLMs (ISO‑8601 dates, ISO‑4217 currency, E.164 phones)

Normalize every field to canonical, machine-consumable standards so downstream LLMs, analytics, and RAG stay accurate and debuggable. This capability ships with opinionated defaults and schema constraints that enforce consistency across dates, currency, phone numbers, booleans, and IDs—no templates required.

  • Standards-first: ISO‑8601 for dates/timestamps, ISO‑4217 for currency codes, E.164 for phone numbers

  • Schema-driven: enums, patterns, and descriptions to prevent drift and synonym errors

  • Audit-ready: predictable keys, traceable values, and validation-friendly outputs


Phone Number Normalization (E.164)

Normalize all phone numbers to E.164: a leading plus (+), country code, and subscriber number with no spaces or punctuation. Avoid locale formats and descriptive strings.

  • Canonical format: + (e.g., +14155550123)

  • Do not include extensions inline; capture as a separate field when needed

  • Reject free text like "(415) 555-0123 ext. 9" in normalized fields

Before → After

Field Before Example After Normalization
phone_number (415) 555-0123 +14155550123
phone_number 020 7183 8750 +442071838750
phone_number +91-80-1234-5678 ext. 204 +918012345678

JSON Schema constraints (example)

{
  "type": "object",
  "properties": {
    "phone_number": {
      "type": "string",
      "description": "Primary contact phone in E.164 format (e.g., +14155550123)",
      "pattern": "^\\+[1-9]\\d{1,14}$"
    },
    "phone_extension": {
      "type": "string",
      "description": "Numeric extension if present (do not append to phone_number)",
      "pattern": "^\\d{1,10}$"
    }
  },
  "required": ["phone_number"]
}

JSON Schema: Enums, Patterns, and Formats

Use enums and patterns to lock fields to allowed values and shapes:

  • Currency codes: restrict to ISO‑4217 enums; separate amount (number) from currency (string)

  • Status fields: enforce small, explicit sets (e.g., ["paid", "due", "overdue"]) instead of free text

  • Identifiers: define patterns that preserve leading zeros and expected length

Example schema fragment

{
  "type": "object",
  "properties": {
    "invoice_date": {
      "type": "string",
      "description": "ISO-8601 date the invoice was issued (YYYY-MM-DD)",
      "pattern": "^\\d{4}-\\d{2}-\\d{2}$"
    },
    "amount": {
      "type": "number",
      "description": "Monetary amount as a number (no symbols)"
    },
    "currency": {
      "type": "string",
      "description": "ISO-4217 currency code",
      "enum": ["USD", "EUR", "GBP", "JPY", "CAD", "AUD", "INR"]
    },
    "status": {
      "type": "string",
      "description": "Payment status",
      "enum": ["paid", "due", "overdue"]
    },
    "invoice_number": {
      "type": "string",
      "description": "Document-labeled invoice number; preserve leading zeros",
      "pattern": "^[A-Za-z0-9-]{1,32}$"
    }
  },
  "required": ["invoice_date", "amount", "currency", "invoice_number"]
}

Notes

  • For timestamps, prefer full ISO‑8601 (e.g., 2024-05-12T14:03:22Z). When only dates are present, use YYYY-MM-DD.

  • Keep key casing consistent (e.g., snake_case) and use all-caps for codes like currency.


Before/After: Combined Normalization

// Before
{
  "InvoiceNumber": 4521,
  "Invoice Date": "12th May, 2024",
  "Amount": "$1,200.00 USD",
  "Paid?": "yes",
  "currencyType": "us dollar",
  "Phone": "(415) 555-0123 ext. 9"
}
// After
{
  "invoice_number": "4521",
  "invoice_date": "2024-05-12",
  "amount": 1200.00,
  "currency": "USD",
  "paid": true,
  "phone_number": "+14155550123",
  "phone_extension": "9"
}

Quality gates and audits

  • Validate: enforce JSON Schema before writing to downstream systems

  • Reject: block values that fail enums/patterns; log with source citation

  • Monitor: track normalization error rates by field and document type

  • Remediate: iterate schema descriptions to reduce ambiguity and drift


See also

Normalization for LLMs

In document processing pipelines, normalization is critical for transforming unstructured data into LLM-ready structured outputs. Proper normalization ensures that fields have consistent types, values, and formats—eliminating ambiguities that can confuse downstream models, increase hallucinations, or hinder retrieval-augmented generation (RAG).

Why Normalization Matters for LLMs

  • Exact Matching: Retrieval systems and downstream analytics rely on normalized, canonical outputs (e.g., dates, currencies) for filtering and query.

  • Minimized Drift: Consistent data types and values reduce errors and inconsistencies as documents traverse different systems.

  • LLM Performance: LLMs produce more reliable and accurate generations when their inputs are predictable.

  • AI Auditing: Compliance teams require normalized, auditable outputs for traceability and assurance.


Key Normalization Types

1. Typed Outputs

  • Always specify field types in your extraction schema: string, number, boolean, object, or array.

  • Human-readable fields (like IDs or amounts) must not be ambiguously typed. E.g. Account numbers as string to preserve leading zeros; amounts as number for sorting and math.

2. Use of Enums

  • Enumerated fields (enums) ensure outputs are limited to defined values, reducing drift caused by synonyms or misspellings.

  • Example: For a currency type field, enforce only ["USD", "EUR", "JPY"], NOT free-form strings that could include "US Dollar", etc.

3. Currency Normalization

  • Represent currency codes using the ISO 4217 standard (e.g., USD, EUR, JPY) in all monetary fields.

  • Separate amount (number) from currency (string).

  • Example: { "amount": 250.00, "currency": "USD" }

4. Date Normalization

  • Dates should conform to ISO 8601 format (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SSZ), not locale-dependent strings.

  • Always clarify in the schema description what date/timestamp format is expected.

  • Example: { "due_date": "2025-07-01" }

5. Boolean Normalization

  • Boolean fields should be strictly true or false, never text values like "yes", "no", "checked", or blank.

  • Use enum type in the schema for rare cases where tristate (e.g., "yes", "no", "n/a") is needed.

6. Casing Normalization

  • Prefer snake_case for all key names.

  • String value casing: Define and enforce requirements—such as all-caps for codes ("USD"), Title Case for names, etc.

  • Avoid mixed or unpredictable casing in outputs.


Schema Best Practices

The first line of defense for normalization is a robust extraction schema. Poorly defined schemas propagate errors; a precise schema minimizes confusion for both LLMs and downstream systems.

  • Add descriptions to every field.

  • Use descriptive, document-aligned field names (e.g., invoice_date not id_32).

  • Apply enum, format, and pattern constraints where possible.

  • Don’t prompt LLMs to perform calculations in extraction—extract only raw values.

For an in-depth guide and common schema pitfalls, read Extraction Trouble? Here Are 5 Pitfalls to Avoid when Configuring Your JSON Schema.


Example: Before and After Normalization

Before (inconsistent, unnormalized outputs)

{
  "InvoiceNumber": 4521,
  "Invoice Date": "12th May, 2024",
  "Amount": "$1,200.00 USD",
  "Paid?": "yes",
  "currencyType": "us dollar"
}

After (normalized, LLM-ready outputs)

{
  "invoice_number": "4521",
  "invoice_date": "2024-05-12",
  "amount": 1200.00,
  "currency": "USD",
  "paid": true
}

Key Improvements

  • All field names are snake_case.

  • Invoice number is string, supporting leading zeros.

  • Date is ISO 8601.

  • Amount is type number; currency is separate and in ISO 4217.

  • Boolean paid uses true/false.


Cross-link: Further Reading

For practical extraction schema tips, JSON field configuration, and avoiding normalization failures, visit Extraction Trouble? Here Are 5 Pitfalls to Avoid when Configuring Your JSON Schema.

For an overview of extraction pipelines and more before/after examples, see our docs.


Summary Table

Field Before Example After Normalization Type Notes
Invoice Date 12th May, 2024 2024-05-12 string ISO 8601
Amount $1,200.00 USD 1200.00 number Only digits, separator .
Currency us dollar USD string ISO 4217 enum
Paid yes true boolean true/false only
InvoiceNumber 4521 "4521" string Preserve leading zeros

Consistent normalization—enforced by a strong schema and clear conventions—is essential for scalable, maintainable, and trustworthy AI document pipelines.