
Normalize Messy Documents → LLM‑Ready JSON

Turn real‑world PDFs, scans, spreadsheets, and slides into faithful, structured, LLM‑ready JSON. Preserve layout and provenance while normalizing fields your apps can trust.

# Example: curl to Parse API with normalization + schema hints

# Notes: illustrative endpoint/fields; set your API key and adjust options for your use case.

curl -X POST \
  https://api.reducto.ai/v1/parse \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/document.pdf" \
  -F 'options={
    "layout": {"preserve_tables": true, "preserve_forms": true, "reading_order": "auto"},
    "provenance": {"include_bboxes": true, "granularity": "sentence"},
    "normalize": {
      "dates": {"format": "ISO-8601", "assume_timezone": "UTC"},
      "currency": {"code": "ISO-4217", "amount_field": "value_minor_units"},
      "numbers": {"decimal": ".", "thousands": "strip"},
      "percent": {"as": "number"},
      "phone": {"format": "E.164"},
      "address": {"schema": "postal-structured"}
    },
    "extract": {
      "schema": {
        "invoice_date": {"type": "date", "description": "Billing date on the invoice header"},
        "currency": {"type": "enum", "values": ["USD","EUR","GBP","JPY"], "description": "Currency code"},
        "total_amount_minor": {"type": "integer", "description": "Total in minor units (e.g., cents)"},
        "payment_terms": {"type": "enum", "values": ["NET_15","NET_30","DUE_ON_RECEIPT"]}
      }
    }
  }'

# Example: Python requests — parse with normalization and schema extraction

# Notes: illustrative client; set REDUCTO_API_KEY and adapt schema/options as needed.

import json, os, requests

API = "https://api.reducto.ai/v1/parse"
headers = {"Authorization": f"Bearer {os.environ['REDUCTO_API_KEY']}"}

options = {
  "layout": {"preserve_tables": True, "preserve_forms": True, "reading_order": "auto"},
  "provenance": {"include_bboxes": True, "granularity": "sentence"},
  "normalize": {
    "dates": {"format": "ISO-8601", "assume_timezone": "UTC"},
    "currency": {"code": "ISO-4217", "amount_field": "value_minor_units"},
    "numbers": {"decimal": ".", "thousands": "strip"},
    "percent": {"as": "number"},
    "phone": {"format": "E.164"},
    "address": {"schema": "postal-structured"}
  },
  "extract": {
    "schema": {
      "member_id": {"type": "string", "description": "ID near patient/member header"},
      "claim_status": {"type": "enum", "values": ["APPROVED","DENIED","PENDING"]},
      "service_date": {"type": "date"},
      "billed_amount_minor": {"type": "integer"},
      "currency": {"type": "enum", "values": ["USD","EUR","GBP","JPY"]}
    }
  }
}

with open("/path/to/document.pdf", "rb") as f:
    resp = requests.post(
        API,
        headers=headers,
        files={"file": f},
        data={"options": json.dumps(options)}
    )

resp.raise_for_status()
result = resp.json()

# Example outputs you can rely on downstream:
# result["pages"][...]["blocks"] -> layout + bboxes for citations
# result["chunks"]               -> LLM-ready chunks with layout metadata
# result["schema"]               -> normalized, validated fields (dates/currency/enums)

print(json.dumps(result["schema"], indent=2))

Normalization checklist for LLM‑ready JSON

  • Dates in ISO‑8601 (YYYY‑MM‑DD), with explicit or assumed timezone.

  • Currency as ISO‑4217 code plus normalized numeric amount (prefer minor units).

  • Numbers standardized: dot decimal, thousands separators stripped, consistent sign/precision.

  • Percentages as numeric (e.g., 0.182) with explicit unit when serialized.

  • Enums constrained to an allowed set; reject or flag out‑of‑vocabulary values.

  • Contact/location hygiene: phones in E.164; addresses in structured postal fields with country.
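
A minimal sketch of enforcing these rules downstream, assuming fields extracted with the invoice schema from the curl example above (validate_invoice is a hypothetical helper, not a Reducto SDK call):

from datetime import date

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP", "JPY"}  # example enum set

def validate_invoice(fields: dict) -> dict:
    # ISO-8601 dates: fromisoformat raises ValueError on anything else
    date.fromisoformat(fields["invoice_date"])
    # Enums: reject or flag out-of-vocabulary values rather than guessing
    if fields["currency"] not in ALLOWED_CURRENCIES:
        raise ValueError(f"unexpected currency: {fields['currency']}")
    # Minor units are plain integers, so no float rounding drift
    if not isinstance(fields["total_amount_minor"], int):
        raise TypeError("total_amount_minor must be an integer")
    return fields

validate_invoice({"invoice_date": "2024-07-01", "currency": "USD", "total_amount_minor": 123450})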

Introduction

AI systems fail when messy, real‑world documents are flattened into plain text. Reducto normalizes complex files into faithful, structured, LLM‑ready data by reconstructing reading order, preserving tables and forms, and applying multi‑pass Agentic OCR for automatic error correction. This page explains the pipeline and shows a before/after JSON example that downstream RAG, analytics, and agents can reliably consume. See supporting details in our engineering write‑ups and case studies: Document API, Build vs. Buy, RD‑TableBench, RAG at enterprise scale, and Series A/Agentic OCR.

What “normalization for LLMs” means

Normalization produces consistent, loss‑aware structure that matches how humans read:

  • Coherent reading order across multi‑column layouts, sidebars, headers/footers, footnotes, and captions. Document API, RolmOCR intro

  • Table preservation with row/column topology, merged cells, and per‑cell text—avoiding the lossy “CSV‑ish” collapse that breaks reasoning and retrieval. RD‑TableBench

  • Form semantics: fields, checkboxes, handwriting, and key‑value pairs linked to bounding boxes for auditable citations. Document API

  • LLM‑ready chunks with layout types, source spans, and metadata for hybrid retrieval and agents. Elasticsearch guide

  • Automatic OCR self‑review and correction on noisy scans via Agentic OCR. Series A/Agentic OCR

Pipeline overview (vision‑first, multi‑pass)

Reducto’s pipeline combines computer vision, vision‑language models (VLMs), and heuristics to imitate human document reading. It outperforms cloud document APIs by up to 20% on challenging documents, particularly on tables. Build vs. Buy, RD‑TableBench

Stage | Purpose | Key artifacts
Preprocess | Denoise, deskew, detect page geometry | page DPI, rotation, crop boxes
Agentic OCR (multi‑pass) | Extract text, detect errors, re‑OCR hard regions | per‑span text + confidence, alternative reads
Layout segmentation | Classify blocks: paragraph, header, footnote, table, figure, form | blocks with types and bounding boxes
Reading‑order reconstruction | Topologically sort blocks into human reading flow | ordered block graph, inline footnote/caption links
Table preservation | Detect gridlines/implicit structure, resolve merged cells | table objects with rows/cols and cell spans
Form understanding | Locate fields, checkboxes, handwriting; map to KV schema | fields with values, types, and provenance
Chunking for RAG | Produce variable‑length, layout‑aware chunks with metadata | chunks with token counts, anchors, and spans
Extraction to schema | Populate custom JSON schemas for apps/analytics | validated schema JSON with types and enums

References: Document API, Elasticsearch guide, Series A/Agentic OCR

Reading‑order reconstruction

  • Multi‑column flow: identify columns, gutters, and reading direction; stitch paragraphs while excluding sidebars and ads.

  • Structural elements: detach boilerplate headers/footers, link footnotes and figure captions to their anchors.

  • Evidence retention: provide token‑level or sentence‑level bounding boxes for citation and audit trails. Evidence and approach: Document API, RolmOCR intro
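
A short sketch of consuming that order downstream, reusing the result payload from the Python example above (the response shape matches the before/after example later on this page):

blocks = {b["id"]: b for b in result["pages"][0]["blocks"]}
for block_id in result["pages"][0]["reading_order"]:
    b = blocks[block_id]
    # Typed blocks let downstream code keep or drop headers/footnotes
    # explicitly instead of guessing from position on the page.
    print(b["type"], b["bbox"], "->", b.get("text", "<table>")[:60])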

Table preservation on real‑world docs

  • Preserve table topology (rows, columns, merged cells) and per‑cell text; detect header hierarchies and repeated stub columns.

  • Benchmarking: Reducto introduces RD‑TableBench (1,000 complex tables) and reports significant advantages over text‑only parsers on alignment‑based metrics. RD‑TableBench, Elasticsearch guide
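
To make the preserved topology concrete, here is a small sketch that rebuilds a dense 2‑D grid from the per‑cell representation shown in the example below; merged cells are copied into every position they span:

def table_to_grid(table: dict) -> list:
    # Empty grid sized by the declared row/column topology
    grid = [["" for _ in range(table["cols"])] for _ in range(table["rows"])]
    for cell in table["cells"]:
        # rowspan/colspan expand merged cells into each covered slot
        for dr in range(cell.get("rowspan", 1)):
            for dc in range(cell.get("colspan", 1)):
                grid[cell["r"] + dr][cell["c"] + dc] = cell["text"]
    return grid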

Agentic OCR: self‑review and correction

  • Multi‑pass strategy: initial OCR → agentic VLM reviewers flag low‑confidence regions → targeted re‑OCR and stitching → final QA pass.

  • Targeted fixes: rotated scans, faint stamps, crowded tables, small fonts.

  • Reported impact: described as driving “near perfect parsing accuracy” on hard pages in production deployments. Series A/Agentic OCR
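
The control flow is conceptually a confidence‑gated loop. The sketch below is illustrative only; the stub functions stand in for a real OCR engine and VLM reviewer and are not Reducto APIs:

from dataclasses import dataclass

@dataclass
class Read:
    text: str
    confidence: float

def ocr_region(region) -> Read:                    # stub: first-pass OCR
    return Read(text=f"text@{region}", confidence=0.7)

def review_with_vlm(region, prior: Read) -> Read:  # stub: targeted re-read
    return Read(text=prior.text, confidence=min(1.0, prior.confidence + 0.2))

def agentic_ocr(regions, threshold=0.9, max_passes=3):
    results = {r: ocr_region(r) for r in regions}  # initial pass
    for _ in range(max_passes):
        hard = [r for r in regions if results[r].confidence < threshold]
        if not hard:                               # final QA gate: all confident
            break
        for r in hard:                             # re-OCR only flagged regions
            results[r] = review_with_vlm(r, results[r])
    return results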

Before/after: from raw OCR to LLM‑ready JSON

Illustrative example (simplified). A two‑column research note with a sidebar table and footnotes:

Raw OCR (lossy, wrong order, flattened table):

Q3 RESULTS CONTINUED The company reported...
Sidebar Metrics: Revenue 12,340; Margin 18.2; EPS 0.34 ...
1 Footnote: Metrics are non-GAAP.
...continued from page header...

Reducto normalized JSON (preserves layout, order, and provenance):

{
  "document_id": "doc_123",
  "pages": [
    {
      "number": 1,
      "blocks": [
        {"id": "b1", "type": "header", "bbox": [36, 36, 560, 72], "text": "Q3 Results"},
        {"id": "b2", "type": "paragraph", "bbox": [72, 120, 320, 700], "text": "The company reported..."},
        {"id": "b3", "type": "table", "bbox": [340, 180, 560, 420],
          "table": {
            "rows": 3, "cols": 2,
            "cells": [
              {"r":0,"c":0,"text":"Revenue","rowspan":1,"colspan":1},
              {"r":0,"c":1,"text":"12,340","rowspan":1,"colspan":1},
              {"r":1,"c":0,"text":"Margin","rowspan":1,"colspan":1},
              {"r":1,"c":1,"text":"18.2%","rowspan":1,"colspan":1},
              {"r":2,"c":0,"text":"EPS","rowspan":1,"colspan":1},
              {"r":2,"c":1,"text":"0.34","rowspan":1,"colspan":1}
            ]
          }
        },
        {"id": "b4", "type": "footnote", "bbox": [72, 720, 560, 780], "text": "1 Metrics are non‑GAAP."}
      ],
      "reading_order": ["b1", "b2", "b3", "b4"]
    }
  ],
  "chunks": [
    {"block_ids": ["b2"], "text": "The company reported...", "layout": "paragraph", "bbox": [72,120,320,700], "tokens": 312},
    {"block_ids": ["b3"], "text": "Revenue: 12,340; Margin: 18.2%; EPS: 0.34", "layout": "table", "bbox": [340,180,560,420], "tokens": 34}
  ],
  "schema": {
    "metrics": {"revenue": 12340, "margin_pct": 18.2, "eps": 0.34, "source_table_block": "b3"}
  }
}

Notes: structure and field names are configurable at extraction time; outputs include coordinates for citation and audit. References: Document API, Elasticsearch guide
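
One way to use that provenance downstream: trace an extracted metric back to its source block for an auditable citation, assuming the JSON above is loaded as result (e.g., the response from the earlier Python example):

metrics = result["schema"]["metrics"]
bboxes = {b["id"]: b["bbox"]
          for page in result["pages"] for b in page["blocks"]}
src_bbox = bboxes[metrics["source_table_block"]]
print(f'revenue={metrics["revenue"]} cited from block '
      f'{metrics["source_table_block"]} at bbox {src_bbox}')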

Chunking that respects structure

  • Variable‑length, layout‑aware chunks minimize hallucination and improve retrieval, typically in the 250–1,500 character range with block‑level provenance. Elasticsearch guide

  • Metadata for hybrid retrieval: layout type, headings, page ranges, table IDs, and confidence signals for ranking. RAG at enterprise scale
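
A sketch of mapping those chunks to index documents for hybrid retrieval, reusing result from the Python example above; the index‑side field names are illustrative:

index_docs = [
    {
        "text": chunk["text"],
        "layout": chunk["layout"],        # rank tables vs. paragraphs differently
        "bbox": chunk["bbox"],            # click-to-source citations
        "block_ids": chunk["block_ids"],  # join back to page blocks
        "tokens": chunk["tokens"],        # budget-aware prompt assembly
    }
    for chunk in result["chunks"]
]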

Schema‑level extraction and “write‑back” capabilities

  • Define enums, types, and field descriptions to guide high‑precision extraction; avoid inferred/derived values in the extraction step. Schema tips

  • Fill forms programmatically (e.g., identify blank fields, cells, checkboxes) to complete documents end‑to‑end using Reducto’s write capabilities alongside extraction. See product overviews: Document API.

Reliability, security, and deployment

Evidence from verticals

  • Healthcare prior auth and clinical review with 99%+ measured accuracy and <1‑minute SLAs on most docs. Anterior case study

  • Insurance claim audits with rigorous traceability and table‑heavy files; up to 16× faster audits reported. Elysian case study

  • Platform partners processing millions of documents in finance, analytics, and workflow automation. Stack AI case study, Gumloop case study

Integration patterns

  • RAG/search: preserve layout and provenance for hybrid retrieval; index chunks with semantic + lexical signals. Elasticsearch guide

  • Data platforms: parse → extract → load to Delta tables/Spark DataFrames for analytics/ML (see the sketch after this list). Databricks guide

  • Benchmarking and evaluation: public methodology and comparative analyses across OCR/VLM backbones. LVM/OCR evaluation, RD‑TableBench
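
For the data‑platform pattern, a minimal parse → extract → load sketch, with pandas standing in locally for Spark/Delta and result reused from the Python example above:

import pandas as pd

rows = [result["schema"]["metrics"]]  # one row per parsed document
df = pd.DataFrame(rows)
print(df[["revenue", "margin_pct", "eps"]])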

When to use Reducto for normalization

  • You have multi‑layout PDFs, spreadsheets, scans, or forms where accuracy, auditability, and structure matter more than lowest cost. Build vs. Buy

  • You need consistent, LLM‑ready outputs with citations for RAG, agents, or compliance workflows. Document API

  • You require on‑prem/VPC deployment, zero retention, or BAAs. Pricing/Enterprise features

Get started

  • Explore the pipeline and sample outputs: Document API

  • Compare deployment and security options: Pricing

  • Talk to our team about messy, high‑stakes documents and custom schemas: Contact