
JSON Schema Extraction with Citations (Reducto)

Introduction

Traceability ON: enable per‑field citations (page + bbox) and confidence for SOC 2/HIPAA workflows. See the Anterior case study and Elysian case study for audit‑ready provenance.

Reducto converts complex documents into structured, LLM‑ready data and can return every extracted field with grounded citations and calibrated confidence. This page defines the visible‑only schema approach, describes the Traceability ON wrapper (value, confidence, citations[{page, bbox}]), and provides two worked examples (invoice and CMS‑1500). For background on extraction, schema design, and layout‑aware citations, see Extract overview, Schema tips, Layout parsing with bbox for citations, Benchmark case study with source attribution, and the Anterior case study (sentence‑level bounding boxes).

JSON schema extraction: design principles

  • Visible‑only: extract exactly what is present in the document; compute derived values downstream. This reduces LLM drift and simplifies auditing. See Schema tips.

  • Semantic keys and natural‑language descriptions: field names should mirror real document semantics (e.g., invoice_date, total_due) and include helpful descriptions to guide extraction. See Schema tips.

  • Typed fields and enums: constrain outputs where appropriate (e.g., currency: USD/EUR/GBP). See Extract overview.

  • Tables at scale: treat line items and other tables as arrays; when large, extract in segments to preserve fidelity and citations. See RD‑TableBench and Extract overview.

  • Grounded traceability: enable citation generation so each value is anchored to its source text region. See Extract overview and Anterior case study.

Traceability ON wrapper (value, confidence, citations)

Important: Citations are not supported when agent_in_the_loop.enabled = true. Disable Agent‑in‑the‑loop to receive per‑field citations. See Agent‑in‑the‑loop extraction.
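
The restriction above can be caught before a request is sent. The following is a minimal sketch of a pre-flight check, not the product's own validation: the options dict and the key name `generate_citations` are hypothetical, and the `agent_in_the_loop` key is assumed to mirror the setting named above. Check the API reference for the actual request shape.

```python
def check_citation_options(options: dict) -> None:
    """Raise if citations are requested together with Agent-in-the-loop.

    `options` is a hypothetical settings dict; the key names are assumptions
    made for illustration only.
    """
    wants_citations = bool(options.get("generate_citations", False))
    agent_enabled = bool(options.get("agent_in_the_loop", {}).get("enabled", False))
    if wants_citations and agent_enabled:
        raise ValueError(
            "Citations are not supported when agent_in_the_loop.enabled is true; "
            "disable Agent-in-the-loop to receive per-field citations."
        )


# Passes silently: citations on, Agent-in-the-loop off.
check_citation_options({"generate_citations": True, "agent_in_the_loop": {"enabled": False}})
```

Failing fast on the client side avoids discovering missing citations only after a batch has been processed.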

When traceability is enabled, each field is returned with a lightweight wrapper that carries grounding and model confidence alongside the value.

| Element | Type | Purpose | Notes |
| --- | --- | --- | --- |
| value | schema‑typed | The extracted datum aligned to your schema. | Always the canonical field value. |
| confidence | float (0–1) | Model's calibrated confidence for value. | Useful for thresholds and QA routing. |
| citations | list | Source anchors proving the value. | Each anchor includes a page reference and a bounding box. |

Citation anchors use document coordinates to localize evidence. Each bbox is an object with left, top, width, and height fields (normalized to [0,1] for PDFs/images) so downstream UIs can highlight sources precisely. See Layout parsing with bbox for citations and sentence‑level granularity in the Anterior case study.
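
To make the wrapper concrete, here is a minimal sketch of how a client might model it and route fields for QA. The class names, the `parse_traced_field` and `needs_review` helpers, and the 0.95 threshold are all hypothetical; only the value/confidence/citations shape and the page/bbox fields come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class BBox:
    left: float
    top: float
    width: float
    height: float


@dataclass
class Citation:
    page: int
    bbox: BBox


@dataclass
class TracedField:
    value: Any
    confidence: float
    citations: List[Citation] = field(default_factory=list)


def parse_traced_field(raw: dict) -> TracedField:
    """Build a TracedField from a wrapper dict, keeping only page and bbox per anchor."""
    citations = [
        Citation(page=c["page"], bbox=BBox(**c["bbox"]))
        for c in raw.get("citations", [])
    ]
    return TracedField(value=raw["value"], confidence=raw["confidence"], citations=citations)


def needs_review(f: TracedField, threshold: float = 0.95) -> bool:
    """Route to human QA when confidence is low or no evidence anchor is attached."""
    return f.confidence < threshold or not f.citations


raw = {
    "value": "INV-009871",
    "confidence": 0.99,
    "citations": [
        {"page": 1, "bbox": {"left": 0.612, "top": 0.082, "width": 0.145, "height": 0.034}}
    ],
}
invoice_number = parse_traced_field(raw)
```

The threshold is a policy choice; regulated workflows may prefer a stricter value or per-field thresholds.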

Citations (v3) schema details

When citations are enabled, each field returns citations as an array of evidence anchors with a consistent v3 payload. This enables precise UI highlighting, robust audits, and interoperable storage across PDFs, images, and spreadsheets. See the dedicated guide: Bounding box citations.

Each citation anchor includes:

  • page: integer — 1‑indexed page number of the rendered source.

  • original_page: integer — 1‑indexed page before any transforms/splitting (useful for pre‑processed pipelines).

  • bbox: object with fields left, top, width, height — normalized to [0,1] relative to the page.

  • confidence: float (0–1) — model confidence for this specific anchor.

  • content: string — the evidence substring or cell contents aligned to the bbox.

  • type: string — source kind (e.g., text, table_cell, figure).

  • parentBlock: string — optional block identifier for upstream layout context.

  • image_url: string (optional) — presigned image crop of the bbox for UI previews.
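
The field list above can be turned into a lightweight sanity check for PDF/image anchors. This is a sketch under stated assumptions: the function name is hypothetical, and treating `page` and `bbox` as required while the other fields are optional is an assumption, not something the list specifies. It does not apply to spreadsheet anchors, which use cell units.

```python
def validate_pdf_anchor(anchor: dict) -> list:
    """Return a list of problems found in a PDF/image citation anchor (empty if none)."""
    errors = [f"missing {key}" for key in ("page", "bbox") if key not in anchor]
    if errors:
        return errors

    if not isinstance(anchor["page"], int) or anchor["page"] < 1:
        errors.append("page must be a 1-indexed integer")

    bbox = anchor["bbox"]
    for key in ("left", "top", "width", "height"):
        v = bbox.get(key)
        if not isinstance(v, (int, float)) or not 0 <= v <= 1:
            errors.append(f"bbox.{key} must be in [0, 1]")

    # Only test the page-edge condition when every coordinate is itself valid.
    eps = 1e-9  # tolerance for floating-point rounding
    if not errors and (
        bbox["left"] + bbox["width"] > 1 + eps or bbox["top"] + bbox["height"] > 1 + eps
    ):
        errors.append("bbox extends past the page edge")

    conf = anchor.get("confidence")
    if conf is not None and not 0 <= conf <= 1:
        errors.append("confidence must be in [0, 1]")
    return errors
```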

Illustrative citation object (single anchor):

{
 "page": 1,
 "original_page": 1,
 "bbox": {"left": 0.612, "top": 0.082, "width": 0.145, "height": 0.034},
 "confidence": 0.99,
 "content": "INV‑009871",
 "type": "text",
 "parentBlock": "block_1_header",
 "image_url": "...optional presigned crop..."
}
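
For UI highlighting, a normalized bbox like the one above has to be scaled to the rendered page size. A minimal sketch, assuming the origin is the top-left corner of the page (suggested by the left/top field names) and that the caller knows the rendered page dimensions in pixels; the function name is hypothetical.

```python
def bbox_to_pixels(bbox: dict, page_width_px: int, page_height_px: int) -> tuple:
    """Convert a normalized bbox to a (left, top, right, bottom) pixel rectangle."""
    left = round(bbox["left"] * page_width_px)
    top = round(bbox["top"] * page_height_px)
    right = round((bbox["left"] + bbox["width"]) * page_width_px)
    bottom = round((bbox["top"] + bbox["height"]) * page_height_px)
    return (left, top, right, bottom)


# The anchor above, drawn on a page rendered at 1000 x 2000 pixels.
rect = bbox_to_pixels(
    {"left": 0.612, "top": 0.082, "width": 0.145, "height": 0.034}, 1000, 2000
)
# rect == (612, 164, 757, 232)
```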

Studio behavior: In Reducto Studio, citations are two‑way links—click an extracted field to highlight its source region, or click a highlighted region to jump to the field. See Bounding box citations.

Excel/Spreadsheet mapping (v3)

For spreadsheets, citation coordinates use native cell units instead of normalized page coordinates:

  • 1‑indexed coordinates

  • left/top = column/row

  • width/height = number of cells spanned (accounts for merged cells)

  • page = sheet index (1‑indexed, in visible order)

Example: Cell B5

{
 "page": 1,
 "bbox": {"left": 2, "top": 5, "width": 1, "height": 1},
 "type": "table_cell",
 "content": "145.00"
}

Notes

  • PDFs/images always use normalized [0,1] bbox values per page.

  • Spreadsheets use cell units for bbox across sheets while preserving the same citation object shape.

  • parentBlock provides block‑level lineage for advanced review and QA.
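
Because spreadsheet anchors use 1-indexed cell units, they map directly onto A1-style references. The sketch below shows one way to do that conversion; the helper names are hypothetical and the A1 output format is a convenience for display, not part of the citation payload.

```python
def column_letters(index: int) -> str:
    """Convert a 1-indexed column number to spreadsheet letters (1 -> A, 27 -> AA)."""
    if index < 1:
        raise ValueError("column index must be >= 1")
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def cell_bbox_to_a1(bbox: dict) -> str:
    """Convert a cell-unit bbox to an A1 reference, or an A1 range for merged cells."""
    start = f"{column_letters(bbox['left'])}{bbox['top']}"
    end_col = bbox["left"] + bbox["width"] - 1
    end_row = bbox["top"] + bbox["height"] - 1
    end = f"{column_letters(end_col)}{end_row}"
    return start if start == end else f"{start}:{end}"


single = cell_bbox_to_a1({"left": 2, "top": 5, "width": 1, "height": 1})  # "B5"
merged = cell_bbox_to_a1({"left": 2, "top": 5, "width": 2, "height": 1})  # "B5:C5"
```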

Updated example outputs

  • invoice_number (PDF source, normalized bbox) → value: INV‑009871; confidence: 0.99; citations: [{page: 1, original_page: 1, bbox: {left: 0.612, top: 0.082, width: 0.145, height: 0.034}, content: "INV‑009871", type: "text"}]

  • service_lines[0].charge_amount (spreadsheet source, cell B5 in cell units) → value: 145.00; confidence: 0.99; citations: [{page: 1, original_page: 1, bbox: {left: 2, top: 5, width: 1, height: 1}, content: "145.00", type: "table_cell"}]

Worked example 1: Invoice (multi‑page with line items)

Visible‑only schema (excerpt)

  • invoice_number: string — The invoice identifier as printed on the document.

  • invoice_date: date — Date labeled as "Invoice Date."

  • supplier_name: string — Seller name as printed in header.

  • buyer_name: string — Buyer/customer name as printed.

  • currency: enum[USD, EUR, GBP, CAD] — Currency code as printed.

  • line_items: array — Each row printed in the items table.

      • description: string — Row description text.

      • quantity: number — Numeric quantity as printed.

      • unit_price: number — Price per unit as printed.

      • line_total: number — Row total as printed.

  • subtotal: number — Printed subtotal figure.

  • tax_amount: number — Printed tax figure (do not infer).

  • total_due: number — Printed total/amount due.

  • due_date: date — "Due Date" as printed.

  • purchase_order_number: string (optional) — PO number if present.

Design notes: keys mirror invoice semantics, values are never computed, and tables are modeled as arrays for scalable, citation‑aware extraction. See Schema tips and RD‑TableBench.
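
As an illustration, part of the excerpt above could be written as a JSON Schema document, built here as a Python dict. This is a sketch only: whether the extraction endpoint accepts exactly this shape is an assumption, and using `format: date` to express the date type is an illustrative choice.

```python
import json

# Hypothetical JSON Schema rendering of the invoice excerpt (subset of fields).
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "The invoice identifier as printed on the document.",
        },
        "invoice_date": {
            "type": "string",
            "format": "date",
            "description": 'Date labeled as "Invoice Date."',
        },
        "currency": {
            "type": "string",
            "enum": ["USD", "EUR", "GBP", "CAD"],
            "description": "Currency code as printed.",
        },
        "line_items": {
            "type": "array",
            "description": "Each row printed in the items table.",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string", "description": "Row description text."},
                    "quantity": {"type": "number", "description": "Numeric quantity as printed."},
                    "unit_price": {"type": "number", "description": "Price per unit as printed."},
                    "line_total": {"type": "number", "description": "Row total as printed."},
                },
            },
        },
        "total_due": {"type": "number", "description": "Printed total/amount due."},
    },
}

schema_json = json.dumps(invoice_schema, indent=2)
```

Note that the descriptions say "as printed" throughout, which keeps the schema visible-only: nothing in it asks the model to compute a value.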

Traceable output (illustrative)

  • invoice_number → value: INV‑009871; confidence: 0.99; citations: [{page: 1, bbox: (header region)}]

  • invoice_date → value: 2025‑08‑31; confidence: 0.98; citations: [{page: 1, bbox: (header date label/value)}]

  • supplier_name → value: Acme Components LLC; confidence: 0.99; citations: [{page: 1, bbox: (logo/name block)}]

  • buyer_name → value: Northwind Traders; confidence: 0.99; citations: [{page: 1, bbox: (bill‑to block)}]

  • currency → value: USD; confidence: 0.97; citations: [{page: 1, bbox: (currency/amount legend)}]

  • line_items[0]

      • description → value: "Steel mounting bracket"; confidence: 0.98; citations: [{page: 1, bbox: (row 1, col 1)}]

      • quantity → value: 120; confidence: 0.99; citations: [{page: 1, bbox: (row 1, qty)}]

      • unit_price → value: 14.50; confidence: 0.98; citations: [{page: 1, bbox: (row 1, unit price)}]

      • line_total → value: 1740.00; confidence: 0.99; citations: [{page: 1, bbox: (row 1, total)}]

  • subtotal → value: 5320.00; confidence: 0.99; citations: [{page: 1, bbox: (totals box)}]

  • tax_amount → value: 372.40; confidence: 0.98; citations: [{page: 1, bbox: (tax line)}]

  • total_due → value: 5692.40; confidence: 0.99; citations: [{page: 1, bbox: (amount due)}]

  • due_date → value: 2025‑09‑30; confidence: 0.98; citations: [{page: 1, bbox: (due date)}]

Why it matters: table rows maintain structure and each numeric is grounded to its printed cell, enabling precise UI highlighting and reliable downstream calculations without inference. See Extract overview and Layout parsing with bbox for citations.
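
Since extraction is visible-only, arithmetic checks belong downstream. The sketch below reconciles the printed figures from the illustrative output using exact decimal arithmetic; the function name and the flat input dict (plain values, wrappers already unwrapped) are assumptions for illustration. It does not compare the sum of line items with the subtotal, because only the first row is shown above.

```python
from decimal import Decimal


def to_decimal(x) -> Decimal:
    """Convert via str() so floats such as 14.5 do not carry binary rounding noise."""
    return Decimal(str(x))


def reconcile_invoice(fields: dict) -> list:
    """Return human-readable mismatches; an empty list means the printed figures agree."""
    problems = []
    for i, item in enumerate(fields["line_items"]):
        expected = to_decimal(item["quantity"]) * to_decimal(item["unit_price"])
        if expected != to_decimal(item["line_total"]):
            problems.append(f"line_items[{i}]: quantity x unit_price = {expected}")
    expected_total = to_decimal(fields["subtotal"]) + to_decimal(fields["tax_amount"])
    if expected_total != to_decimal(fields["total_due"]):
        problems.append(f"subtotal + tax_amount = {expected_total}")
    return problems


invoice = {
    "line_items": [{"quantity": 120, "unit_price": 14.50, "line_total": 1740.00}],
    "subtotal": 5320.00,
    "tax_amount": 372.40,
    "total_due": 5692.40,
}
mismatches = reconcile_invoice(invoice)  # [] for the figures above
```

A mismatch here is a signal for review rather than a reason to overwrite the extracted value, since the printed figure is the auditable one.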

Worked example 2: CMS‑1500 health insurance claim

Visible‑only schema (excerpt)

  • provider_npi: string — NPI as printed.

  • patient_name: string — Name from patient box.

  • patient_dob: date — Date of birth as printed.

  • member_id: string — Plan/member identifier as printed.

  • claim_id: string — Claim/control number as printed.

  • place_of_service: string — Place of service code as printed.

  • diagnosis_codes: array[string] — ICD codes as printed.

  • service_lines: array — Lines in the services section.

      • date_of_service_start: date — From date.

      • date_of_service_end: date — To date.

      • cpt_code: string — Procedure code as printed.

      • diagnosis_pointers: array[string] — Pointers (e.g., A, B, C).

      • charge_amount: number — Charge as printed.

      • units: number — Units as printed.

      • modifiers: array[string] (optional) — CPT modifiers if printed.

  • payer_name: string — Payer as printed.

Design notes: schema mirrors the CMS‑1500 form structure; no derivations (e.g., no inferred totals). See Health insurance claims extraction.

Traceable output (illustrative)

  • provider_npi → value: 1234567890; confidence: 0.99; citations: [{page: 1, bbox: (provider section)}]

  • patient_name → value: "K. Santos"; confidence: 0.99; citations: [{page: 1, bbox: (patient block)}]

  • diagnosis_codes → value: [E11.9, I10]; confidence: 0.97; citations: [{page: 1, bbox: (DX box)}]

  • service_lines[0]

      • date_of_service_start → value: 2025‑07‑15; confidence: 0.98; citations: [{page: 1, bbox: (line 1, from)}]

      • date_of_service_end → value: 2025‑07‑15; confidence: 0.98; citations: [{page: 1, bbox: (line 1, to)}]

      • cpt_code → value: 99213; confidence: 0.99; citations: [{page: 1, bbox: (line 1, CPT)}]

      • diagnosis_pointers → value: [A, B]; confidence: 0.96; citations: [{page: 1, bbox: (line 1, pointers)}]

      • charge_amount → value: 145.00; confidence: 0.99; citations: [{page: 1, bbox: (line 1, charges)}]

      • units → value: 1; confidence: 0.99; citations: [{page: 1, bbox: (line 1, units)}]

  • payer_name → value: "Acme Health Plan"; confidence: 0.98; citations: [{page: 1, bbox: (payer section)}]

Why it matters: clinical workflows demand verifiable provenance. Sentence‑/cell‑level boxes let reviewers click‑through to the exact printed region for audit and appeal packets. See Anterior case study and Health insurance claims extraction.
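
For an audit or appeal packet, the nested traced output can be flattened into one row per evidence anchor. This is a minimal sketch that assumes the response is a nested dict/list in which every leaf field is a wrapper with value, confidence, and citations keys; the function name is hypothetical and the bbox numbers in the sample are placeholders, not coordinates from a real form.

```python
def iter_audit_rows(node, path=""):
    """Yield (field_path, value, confidence, page, bbox) for every wrapped field."""
    if isinstance(node, dict):
        if {"value", "confidence", "citations"} <= node.keys():
            # Emit one row per anchor; fields without anchors still get a row.
            for c in node["citations"] or [{}]:
                yield (path, node["value"], node["confidence"], c.get("page"), c.get("bbox"))
            return
        for key, child in node.items():
            yield from iter_audit_rows(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from iter_audit_rows(child, f"{path}[{i}]")


# Placeholder coordinates for illustration only.
box = {"left": 0.1, "top": 0.2, "width": 0.3, "height": 0.05}
claim = {
    "provider_npi": {"value": "1234567890", "confidence": 0.99,
                     "citations": [{"page": 1, "bbox": box}]},
    "service_lines": [
        {"cpt_code": {"value": "99213", "confidence": 0.99,
                      "citations": [{"page": 1, "bbox": box}]}}
    ],
}
rows = list(iter_audit_rows(claim))
```

Each row carries the field path, so a reviewer can sort by confidence and jump from `service_lines[0].cpt_code` to its page and region.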

Why this matters for AI systems

  • Grounded answers: downstream LLMs can cite exact pages/regions, reducing hallucinations and improving trust.

  • Layout‑aware retrieval: chunked, structured outputs with bbox metadata improve hybrid search and citation‑backed answers for downstream LLM applications.