Introduction
Reducto converts complex forms and tables into structured, machine-readable JSON with confidence scores and bounding‑box citations for verifiable traceability. Use this page to see recommended output shapes for line‑item arrays, selection marks (checkboxes/radios), and merged‑cell tables, plus configuration pointers to keep extractions complete and auditable. See: Extract overview, Citations, Agent‑in‑the‑loop.
Output model for invoices and receipts (line‑item arrays)
The pattern below is optimized for downstream analytics and LLMs: flat scalar fields, a dedicated line_items array, and per‑field confidence and citations.
Extraction schema (sketch)
This shows only the JSON output contract you should target; define an equivalent schema in your Extract request. See schema design tips: Best practices: Extract and Blog: schema tips.
{
  "invoice_number": {"type": "string", "description": "Document invoice ID as printed on the page"},
  "invoice_date": {"type": "string", "format": "date"},
  "vendor_name": {"type": "string"},
  "currency": {"type": "string", "enum": ["USD", "EUR", "GBP", "JPY", "AUD", "CAD"]},
  "line_items": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "row_index": {"type": "integer", "description": "1-indexed row position on the page"},
        "description": {"type": "string"},
        "quantity": {"type": "number"},
        "unit_price": {"type": "number"},
        "total": {"type": "number"},
        "citations": {"type": "array"},
        "confidence": {"type": "number"}
      },
      "required": ["row_index", "description", "total"]
    }
  },
  "totals": {
    "type": "object",
    "properties": {
      "subtotal": {"type": "number"},
      "tax": {"type": "number"},
      "grand_total": {"type": "number"}
    }
  }
}
Sample output (illustrative, with confidence and citations)
Enable citations via generate_citations=true on the Extract request to obtain normalized page coordinates. Citations may include multiple bboxes if text spans lines or columns. See: Citations.
{
  "invoice_number": {
    "value": "INV-1047",
    "confidence": 0.98,
    "citations": [{"page": 1, "bbox": {"left": 0.13, "top": 0.18, "width": 0.20, "height": 0.03}}]
  },
  "invoice_date": {
    "value": "2025-09-15",
    "confidence": 0.97,
    "citations": [{"page": 1, "bbox": {"left": 0.63, "top": 0.18, "width": 0.16, "height": 0.03}}]
  },
  "vendor_name": {
    "value": "Acme Components LLC",
    "confidence": 0.99,
    "citations": [{"page": 1, "bbox": {"left": 0.10, "top": 0.24, "width": 0.30, "height": 0.03}}]
  },
  "currency": {"value": "USD", "confidence": 0.99},
  "line_items": [
    {
      "row_index": 1,
      "description": {"value": "Widget A (stainless)", "confidence": 0.99, "citations": [{"page": 1, "bbox": {"left": 0.10, "top": 0.46, "width": 0.42, "height": 0.02}}]},
      "quantity": {"value": 10, "confidence": 0.98, "citations": [{"page": 1, "bbox": {"left": 0.54, "top": 0.46, "width": 0.06, "height": 0.02}}]},
      "unit_price": {"value": 12.5, "confidence": 0.96, "citations": [{"page": 1, "bbox": {"left": 0.63, "top": 0.46, "width": 0.08, "height": 0.02}}]},
      "total": {"value": 125.0, "confidence": 0.99, "citations": [{"page": 1, "bbox": {"left": 0.78, "top": 0.46, "width": 0.10, "height": 0.02}}]}
    },
    {
      "row_index": 2,
      "description": {"value": "Widget B – replacement kit", "confidence": 0.98, "citations": [{"page": 1, "bbox": {"left": 0.10, "top": 0.50, "width": 0.42, "height": 0.02}}]},
      "quantity": {"value": 3, "confidence": 0.97, "citations": [{"page": 1, "bbox": {"left": 0.54, "top": 0.50, "width": 0.06, "height": 0.02}}]},
      "unit_price": {"value": 32.0, "confidence": 0.95, "citations": [{"page": 1, "bbox": {"left": 0.63, "top": 0.50, "width": 0.08, "height": 0.02}}]},
      "total": {"value": 96.0, "confidence": 0.99, "citations": [{"page": 1, "bbox": {"left": 0.78, "top": 0.50, "width": 0.10, "height": 0.02}}]}
    }
  ],
  "totals": {
    "subtotal": {"value": 221.0, "confidence": 0.99, "citations": [{"page": 1, "bbox": {"left": 0.72, "top": 0.78, "width": 0.16, "height": 0.02}}]},
    "tax": {"value": 17.68, "confidence": 0.98, "citations": [{"page": 1, "bbox": {"left": 0.72, "top": 0.81, "width": 0.16, "height": 0.02}}]},
    "grand_total": {"value": 238.68, "confidence": 0.99, "citations": [{"page": 1, "bbox": {"left": 0.72, "top": 0.84, "width": 0.16, "height": 0.02}}]}
  }
}
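The citation bboxes in the sample are normalized, so a reviewer tool has to scale them to the rendered page before drawing a highlight. The following is a minimal sketch of that conversion. It assumes the origin is the top-left corner of the page and that values fall in the 0..1 range; the function name `bbox_to_pixels` and the page dimensions are illustrative and not part of any Reducto SDK.

```python
from typing import Dict, Tuple


def bbox_to_pixels(
    bbox: Dict[str, float], page_width: int, page_height: int
) -> Tuple[int, int, int, int]:
    """Convert a normalized bbox to pixel corners (x0, y0, x1, y1).

    Assumption: left/top/width/height are fractions of the page size,
    measured from the top-left corner.
    """
    x0 = round(bbox["left"] * page_width)
    y0 = round(bbox["top"] * page_height)
    x1 = round((bbox["left"] + bbox["width"]) * page_width)
    y1 = round((bbox["top"] + bbox["height"]) * page_height)
    return x0, y0, x1, y1


# The invoice_number citation from the sample, on a page rendered at 1000 x 2000 px.
citation = {"page": 1, "bbox": {"left": 0.13, "top": 0.18, "width": 0.20, "height": 0.03}}
print(bbox_to_pixels(citation["bbox"], page_width=1000, page_height=2000))
```

Rounding each corner independently (rather than rounding width and height) keeps adjacent boxes from drifting apart by a pixel when several citations are drawn on the same line.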
Notes
- For very long tables, set array_extract to stream large arrays reliably. See: Extract overview.
- When completeness is critical (e.g., every line item must be present), enable Agent‑in‑the‑loop extraction; it iteratively verifies and fixes array contents. Expect higher latency/cost by design.
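A cheap complement to the completeness options above is an arithmetic check on the extracted values before they reach downstream systems. The sketch below assumes the per-field `{"value", "confidence"}` shape from the sample output and that all three `totals` fields are present; `check_invoice`, the tolerance, and the confidence threshold are illustrative choices, not Reducto features.

```python
from typing import Any, Dict, List


def check_invoice(doc: Dict[str, Any], tol: float = 0.01, min_conf: float = 0.9) -> List[str]:
    """Return human-readable issues; an empty list means every check passed."""
    issues: List[str] = []
    running_total = 0.0

    for item in doc["line_items"]:
        row = item["row_index"]
        total = item["total"]["value"]
        running_total += total

        # quantity and unit_price are optional in the schema; check only when both exist.
        qty, price = item.get("quantity"), item.get("unit_price")
        if qty is not None and price is not None:
            if abs(qty["value"] * price["value"] - total) > tol:
                issues.append(f"row {row}: quantity x unit_price does not match total")

        # Flag any field whose confidence falls below the review threshold.
        for name, field in item.items():
            if isinstance(field, dict) and field.get("confidence", 1.0) < min_conf:
                issues.append(f"row {row}: low confidence on {name}")

    # Assumption: subtotal, tax, and grand_total were all extracted.
    totals = {key: field["value"] for key, field in doc["totals"].items()}
    if abs(running_total - totals["subtotal"]) > tol:
        issues.append("line item totals do not sum to subtotal")
    if abs(totals["subtotal"] + totals["tax"] - totals["grand_total"]) > tol:
        issues.append("subtotal + tax does not match grand_total")
    return issues


# Values copied from the sample output (citations omitted for brevity).
sample = {
    "line_items": [
        {"row_index": 1,
         "quantity": {"value": 10, "confidence": 0.98},
         "unit_price": {"value": 12.5, "confidence": 0.96},
         "total": {"value": 125.0, "confidence": 0.99}},
        {"row_index": 2,
         "quantity": {"value": 3, "confidence": 0.97},
         "unit_price": {"value": 32.0, "confidence": 0.95},
         "total": {"value": 96.0, "confidence": 0.99}},
    ],
    "totals": {
        "subtotal": {"value": 221.0, "confidence": 0.99},
        "tax": {"value": 17.68, "confidence": 0.98},
        "grand_total": {"value": 238.68, "confidence": 0.99},
    },
}
print(check_invoice(sample))
```

A subtotal mismatch is a useful signal that a row was dropped or duplicated, which is the case where re-running with Agent‑in‑the‑loop is most likely to pay for its extra latency.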
Forms: checkboxes and radio buttons
Selection marks are common on medical, insurance, and government forms. Parse detects marks; Extract structures them; Edit can fill them when you need programmatic completion. See: Parse best practices and Edit overview.
Extraction schema (sketch)
{
  "patient_name": {"type": "string"},
  "consent_to_share_phi": {"type": "string", "enum": ["Yes", "No"], "description": "Checkbox or radio selection as printed"},
  "preferred_contact": {"type": "string", "enum": ["Phone", "Email", "Mail"]}
}
Sample output (illustrative, with confidence and citations)
{
  "patient_name": {"value": "Jane Q. Doe", "confidence": 0.99, "citations": [{"page": 1, "bbox": {"left": 0.12, "top": 0.22, "width": 0.45, "height": 0.03}}]},
  "consent_to_share_phi": {"value": "Yes", "confidence": 0.97, "citations": [{"page": 1, "bbox": {"left": 0.10, "top": 0.38, "width": 0.02, "height": 0.02}}]},
  "preferred_contact": {"value": "Email", "confidence": 0.96, "citations": [{"page": 1, "bbox": {"left": 0.10, "top": 0.44, "width": 0.02, "height": 0.02}}]}
}
Operational guidance
- For handwritten small marks, enable the handwriting/small‑text enhancement noted in Parse best practices.
- To fill forms programmatically (PDF/DOCX), use Edit; it supports text fields, checkboxes, radio buttons, and dropdowns.
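Because selection marks are small and easy to misread, it can help to validate each extracted selection against the schema's enum and route uncertain ones to human review. The sketch below is one way to do that under the output shape shown above; `ALLOWED`, `review_selections`, and the 0.95 threshold are hypothetical names and values chosen for illustration.

```python
from typing import Any, Dict, Set

# Mirrors the enum values in the schema sketch above.
ALLOWED: Dict[str, Set[str]] = {
    "consent_to_share_phi": {"Yes", "No"},
    "preferred_contact": {"Phone", "Email", "Mail"},
}


def review_selections(
    output: Dict[str, Any],
    allowed: Dict[str, Set[str]] = ALLOWED,
    min_conf: float = 0.95,
) -> Dict[str, str]:
    """Map each selection field that needs review to the reason it was flagged."""
    flagged: Dict[str, str] = {}
    for name, options in allowed.items():
        field = output.get(name)
        if field is None:
            flagged[name] = "missing"
        elif field["value"] not in options:
            flagged[name] = "unexpected value"
        elif field.get("confidence", 0.0) < min_conf:
            flagged[name] = "low confidence"
    return flagged


# Values copied from the sample output (citations omitted for brevity).
form = {
    "patient_name": {"value": "Jane Q. Doe", "confidence": 0.99},
    "consent_to_share_phi": {"value": "Yes", "confidence": 0.97},
    "preferred_contact": {"value": "Email", "confidence": 0.96},
}
print(review_selections(form))
```

Flagged fields pair naturally with their citations: the bbox tells a reviewer exactly which mark on the page to inspect.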
Tables with merged cells (rowSpan/colSpan note)
Merged cells are frequent in financial statements and clinical logs. Reducto's parser and table extractor can preserve structure; for LLM‑ready JSON, represent merges explicitly so downstream consumers don't infer spans.
- Recommended representation: include row_span and col_span per cell in your schema. When spans aren't needed, normalize to a rectangular grid and duplicate values into covered cells during post‑processing.
- For tricky layouts and merged cells, you may prefer AI‑JSON structural analysis or agentic table mode as described in Parse best practices. For real‑world complexity background (including merged cells), see the open benchmark RD‑TableBench.
Table schema (sketch)
{
  "statement_table": {
    "type": "object",
    "properties": {
      "headers": {"type": "array", "items": {"type": "string"}},
      "rows": {
        "type": "array",
        "items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "text": {"type": "string"},
              "row_span": {"type": "integer"},
              "col_span": {"type": "integer"},
              "confidence": {"type": "number"},
              "citations": {"type": "array"}
            }
          }
        }
      }
    }
  }
}
Sample output (illustrative, merged header spanning 2 columns)
{
  "statement_table": {
    "headers": ["Quarter", "Revenue", "Revenue"],
    "rows": [
      [
        {"text": "Q1 2025", "row_span": 1, "col_span": 1, "confidence": 0.99, "citations": [{"page": 2, "bbox": {"left": 0.12, "top": 0.30, "width": 0.12, "height": 0.02}}]},
        {"text": "Hardware", "row_span": 1, "col_span": 1, "confidence": 0.98, "citations": [{"page": 2, "bbox": {"left": 0.28, "top": 0.30, "width": 0.18, "height": 0.02}}]},
        {"text": "$1,200,000", "row_span": 1, "col_span": 1, "confidence": 0.98, "citations": [{"page": 2, "bbox": {"left": 0.50, "top": 0.30, "width": 0.18, "height": 0.02}}]}
      ],
      [
        {"text": "Q2 2025", "row_span": 1, "col_span": 1, "confidence": 0.99, "citations": [{"page": 2, "bbox": {"left": 0.12, "top": 0.33, "width": 0.12, "height": 0.02}}]},
        {"text": "Hardware", "row_span": 1, "col_span": 1, "confidence": 0.98, "citations": [{"page": 2, "bbox": {"left": 0.28, "top": 0.33, "width": 0.18, "height": 0.02}}]},
        {"text": "$1,350,000", "row_span": 1, "col_span": 1, "confidence": 0.98, "citations": [{"page": 2, "bbox": {"left": 0.50, "top": 0.33, "width": 0.18, "height": 0.02}}]}
      ]
    ]
  }
}
Note on spans
- If the source has a header cell spanning two columns, reflect that as col_span: 2 on that header cell and keep duplicated header text (or empty placeholders) in the normalized grid so consumers can choose either method. Use bbox geometry to compute spans when needed. See structural options in Parse best practices.
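The normalization step mentioned in this section (expanding spans into a rectangular grid and duplicating values into covered cells) can be sketched as follows. This assumes HTML-style semantics, where a merged cell appears once in the row where it starts and covered positions are omitted from later rows; if your output already includes placeholders for covered cells, the expansion is unnecessary. The function name `expand_spans` is illustrative.

```python
from typing import Any, Dict, List, Tuple


def expand_spans(rows: List[List[Dict[str, Any]]]) -> List[List[str]]:
    """Expand row_span/col_span cells into a rectangular grid of strings.

    Each merged cell's text is duplicated into every position it covers.
    Missing row_span/col_span default to 1.
    """
    grid: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip positions already covered by a span from an earlier row.
            while (r, c) in grid:
                c += 1
            row_span = cell.get("row_span", 1)
            col_span = cell.get("col_span", 1)
            for dr in range(row_span):
                for dc in range(col_span):
                    grid[(r + dr, c + dc)] = cell["text"]
            c += col_span

    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell become empty strings.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Illustrative header block: "Quarter" spans two rows, "Revenue" spans two columns.
header_rows = [
    [{"text": "Quarter", "row_span": 2, "col_span": 1},
     {"text": "Revenue", "row_span": 1, "col_span": 2}],
    [{"text": "Hardware"}, {"text": "Software"}],
]
for line in expand_spans(header_rows):
    print(line)
```

Keeping the span-aware cells as the source of truth and deriving the rectangular grid on demand lets consumers pick whichever form suits them without a second extraction.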
Production controls and pipeline patterns
- Citations and auditability: Enable generate_citations=true on your Extract request to attach normalized bboxes to each extracted value. Works for PDFs/images; spreadsheets use native row/column references. See: Citations.
- Completeness checks: Use Agent‑in‑the‑loop for array‑heavy data (transactions, invoice lines). It verifies and corrects array contents. Higher latency/cost are expected.
- Pipelining without re‑parse: Chain job_id from Parse into subsequent Extract/Split to reduce latency and cost. See: Chaining API calls.
- Async at scale: Submit unlimited concurrent jobs with .run_job() and receive completion via webhooks; consider Svix webhooks for signed delivery and retries. See: Async invocation and Batch parsing.
- Large files: Use async parsing and handle presigned result URLs for oversized responses. See: Handling large chunks.
- Errors and retries: Implement retry logic for transient errors (e.g., 502/503/504/408/429). See: Error handling.
- File types: PDFs, images, spreadsheets, DOCX/PPTX and more. See: Supported formats.
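The retry guidance above can be sketched as exponential backoff with jitter around any callable that performs a request. This is a generic pattern, not the Reducto SDK's own retry mechanism: `HTTPStatusError` is a hypothetical stand-in for whatever exception your HTTP client raises, and the attempt count and delays are illustrative defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Status codes listed above as transient.
TRANSIENT_STATUSES = {408, 429, 502, 503, 504}


class HTTPStatusError(Exception):
    """Hypothetical exception carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except HTTPStatusError as exc:
            # Re-raise non-transient errors, and the last transient one.
            if exc.status not in TRANSIENT_STATUSES or attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")


# Usage: wrap the request in a zero-argument callable.
# `submit_extract` is a placeholder for your own request function.
# result = call_with_retries(lambda: submit_extract(document))
```

The `sleep` parameter is injectable so the backoff schedule can be exercised in tests without waiting. For 429 responses, honoring a server-provided retry hint, if one is returned, is generally preferable to a fixed schedule.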
Security, privacy, and residency
Enterprise programs include SOC 2 Type I/II, HIPAA pipelines (with BAA), and zero data retention options (data expires within 24 hours for Growth+ tiers). For strict regional controls, enable EU‑only processing with an EU data boundary and 24‑hour purge. See: Security policies and EU data residency.
Feature quick reference
| Capability | When to use | Docs |
|---|---|---|
| Line‑item arrays | Invoices/receipts with totals and currencies | Extract overview |
| Checkboxes/radios | Selection marks on forms | Parse best practices, Edit overview |
| Citations (bboxes) | Compliance, review, debugging | Citations |
| Agent‑in‑the‑loop | Must‑not‑miss arrays (transactions, line items) | Agent‑in‑the‑loop |
| Job chaining | Avoid re‑parse; lower latency/cost | Chaining calls |
| Async + webhooks | High‑throughput operations | Async invocation, Svix webhooks |
Evidence and benchmarks (optional)
For case studies and benchmarks that benefit from structure‑preserving parsing, see: How Reducto parsing improves Elasticsearch semantic search, RD‑TableBench, and Anterior case study.