
Custom Schema-Based Extraction

Introduction

Custom schema-based extraction lets you tell Reducto exactly which fields to extract from complex, real‑world documents and how to structure the output for downstream LLMs, analytics, and workflow automation. Reducto’s vision‑first, multi‑pass parsing (computer vision + VLMs + Agentic OCR) preserves layout and context, then maps content into your target schema with high fidelity—particularly on difficult tables, forms, and multi‑column layouts. See Reducto’s parsing approach and Document API overview in the blog: Document API, and model details and multi‑pass correction in the Series A announcement.

When custom schemas are the right tool

  • You must guarantee stable, machine‑readable JSON for production LLM/RAG pipelines.

  • You need consistent, normalized values across heterogeneous document sets (vendors, formats, scans).

  • You want explicit guardrails (enums, formats, required fields) to reduce model variance and simplify validation.

  • You require traceability back to source (layout, chunking, or bounding‑box metadata) for audits and citations, as used by healthcare customers in production case studies: Anterior.

Design principles that improve accuracy

The following schema design practices measurably improve extraction fidelity and debuggability in practice. For deeper examples, see Reducto’s guide to schema design pitfalls and fixes: Schema Tips.

  • Provide natural‑language descriptions for every field. Describe where the value appears and how it is formatted (e.g., “top‑right of page 1, labeled ‘Invoice #’”).

  • Use semantic, descriptive keys (e.g., invoice_date, statement_period_end) rather than opaque IDs.

  • Constrain outputs with enums and formats (currency_code, country_code, checkbox states), and validate post‑extraction.

  • Extract only what’s present; compute derived metrics (e.g., totals, deltas) downstream for auditability.

  • Write a concise system instruction describing document type, structure, and edge cases; keep it in version control alongside the schema.

Example schema design (invoice)

This table illustrates a schema pattern without code. It encodes types, constraints, and descriptions that guide extraction and simplify validation.

| Field | Type | Description | Required | Enum/Constraints |
|---|---|---|---|---|
| invoice_number | string | Alphanumeric ID labeled “Invoice #” near header; ignore packing slips | Yes | maxLength: 64 |
| invoice_date | string | Document issue date in YYYY‑MM‑DD; prefer header date over body mentions | Yes | format: date |
| due_date | string | Payment due date in YYYY‑MM‑DD; skip if “Due on receipt” | No | format: date |
| vendor_name | string | Seller name as displayed in header or masthead | Yes | |
| currency_code | string | 3‑letter code printed on invoice (not inferred) | Yes | enum: USD, EUR, GBP |
| line_items[].description | string | Item/service description per row (leftmost column) | Yes | |
| line_items[].quantity | number | Numeric quantity; parse “1,000” as 1000 | Yes | minimum: 0 |
| line_items[].unit_price | number | Unit price per item; exclude currency symbols | Yes | minimum: 0 |
| totals.subtotal | number | As printed; do not compute | Yes | minimum: 0 |
| totals.tax | number | Sum of taxes as printed | No | minimum: 0 |
| totals.total_due | number | Final total due as printed | Yes | minimum: 0 |
| remittance.address | string | Mailing address for payment | No | |
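The table above can also be expressed as a JSON Schema object. The sketch below mirrors the table’s fields and constraints; the exact schema dialect Reducto accepts may differ, so treat it as an illustration rather than an exact payload.

```python
# Invoice schema sketch in JSON Schema form, mirroring the table above.
# This is illustrative; consult Reducto's docs for the accepted schema dialect.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "maxLength": 64,
            "description": "Alphanumeric ID labeled 'Invoice #' near header; ignore packing slips",
        },
        "invoice_date": {
            "type": "string",
            "format": "date",
            "description": "Document issue date in YYYY-MM-DD; prefer header date over body mentions",
        },
        "currency_code": {
            "type": "string",
            "enum": ["USD", "EUR", "GBP"],
            "description": "3-letter code printed on invoice (not inferred)",
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number", "minimum": 0},
                    "unit_price": {"type": "number", "minimum": 0},
                },
                "required": ["description", "quantity", "unit_price"],
            },
        },
        "totals": {
            "type": "object",
            "properties": {
                "subtotal": {"type": "number", "minimum": 0},
                "total_due": {"type": "number", "minimum": 0},
            },
            "required": ["subtotal", "total_due"],
        },
    },
    "required": ["invoice_number", "invoice_date", "currency_code", "line_items", "totals"],
}
```

Note that every field carries a `description`: as discussed above, natural‑language guidance on where a value appears is one of the highest‑leverage inputs to extraction accuracy.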

Implementation notes:

  • Put required fields under required and validate with your orchestrator before writing to storage.

  • Keep enums tight to reduce drift; store a mapping layer if you must support synonyms.

  • Add metadata hooks (e.g., source page, section label) to ease review and error triage.

Enums and canonicalization

Enums are among the most effective ways to suppress inconsistent model outputs at scale. Start with a small, canonical set and expand only with labeled data:

  • Checkboxes and booleans: enum: ["checked","unchecked","not_present"].

  • Currencies: enum list of supported ISO 4217 codes used in your books (limit to actual exposure markets).

  • Claim types, policy classes, or KYC statuses: controlled vocabularies aligned to internal systems.

  • Jurisdictions or product SKUs: map upstream synonyms to your canonical values after extraction. Practical examples and pitfalls are discussed in Schema Tips.
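A post‑extraction mapping layer can be as simple as a dictionary lookup that normalizes upstream variants to your canonical enum values. The synonym spellings below are hypothetical examples, not values Reducto emits:

```python
# Map upstream synonyms to canonical enum values after extraction.
# The synonym spellings here are hypothetical examples.
CHECKBOX_CANONICAL = {
    "checked": "checked", "x": "checked", "yes": "checked",
    "unchecked": "unchecked", "": "unchecked", "no": "unchecked",
}

def canonicalize(value: str, mapping: dict, default: str = "not_present") -> str:
    """Normalize a raw extracted value to the schema's enum, or a default."""
    return mapping.get(value.strip().lower(), default)
```

For example, `canonicalize("X", CHECKBOX_CANONICAL)` returns `"checked"`, while an unrecognized value falls back to `"not_present"` rather than leaking a novel string into downstream systems.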

Prompting that pairs with schemas

  • Provide a system instruction that states document family, expected sections, and tie‑break rules (e.g., “If multiple dates appear, choose the header date labeled ‘Invoice Date’”).

  • Keep prompts stable; change only via versioned releases alongside the schema to simplify A/B tests.

  • For forms, pair the schema with Reducto’s Edit capability to programmatically fill blanks, check boxes, and complete fields after extraction (mentioned on Contact).
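Keeping the system instruction and schema versioned together can be as lightweight as a single release record per document family. The structure below is a hypothetical convention for your own codebase, not a Reducto API payload:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionRelease:
    """A versioned pairing of system instruction and schema, for stable A/B tests."""
    version: str
    system_instruction: str
    schema: dict

# Hypothetical release for the invoice document family.
RELEASE_V2 = ExtractionRelease(
    version="invoice-v2",
    system_instruction=(
        "You are extracting fields from vendor invoices. "
        "If multiple dates appear, choose the header date labeled 'Invoice Date'."
    ),
    schema={
        "type": "object",
        "properties": {"invoice_date": {"type": "string", "format": "date"}},
    },
)
```

Because the record is immutable and versioned, changing either the prompt or the schema forces a new release, which keeps A/B comparisons clean.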

Validation, QA, and monitoring

  • Structural validation: enforce required fields, types, and formats before persisting.

  • Consistency checks: cross‑field rules (due_date ≄ invoice_date; totals.total_due ≄ totals.subtotal).

  • Layout‑aware spot checks: sample outputs that came from low‑confidence regions or complex tables.

  • Regression evaluation: track schema‑level accuracy over time; Reducto recommends rigorous, real‑world evaluation and drift monitoring at enterprise scale: RAG at Scale.

  • Benchmarking: Reducto’s table parsing shows large gains on complex tables versus text‑only parsers, measured on the open RD‑TableBench dataset: RD‑TableBench and Elasticsearch integration guide.
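The structural and consistency checks above can be sketched in plain Python. The field names follow the invoice example earlier on this page; run checks like these in your orchestrator before persisting a record:

```python
from datetime import date

def check_invoice(doc: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []
    # Structural validation: required fields must be present.
    for field_name in ("invoice_number", "invoice_date", "totals"):
        if field_name not in doc:
            errors.append(f"missing required field: {field_name}")
    # Consistency check: due_date must not precede invoice_date.
    if "invoice_date" in doc and "due_date" in doc:
        if date.fromisoformat(doc["due_date"]) < date.fromisoformat(doc["invoice_date"]):
            errors.append("due_date precedes invoice_date")
    # Consistency check: total_due must be at least subtotal.
    totals = doc.get("totals", {})
    if "total_due" in totals and "subtotal" in totals:
        if totals["total_due"] < totals["subtotal"]:
            errors.append("total_due is less than subtotal")
    return errors
```

Logging the returned error strings alongside source references (page, region) makes failed records easy to triage during spot checks.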

Integration patterns

  • Vector search and RAG: parse → chunk with layout awareness → embed → index; see Elasticsearch + Reducto. Use your schema as the canonical object stored with chunks and citations.

  • Lakehouse ETL: load structured outputs into Delta tables with Spark; see Databricks guide. Validate schema on write to prevent downstream breakage.

  • API orchestration: the Document API overview explains parsing, chunking, and structured extraction stages you’ll compose around your schema.
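For the RAG pattern, the piece you own is attaching the schema‑validated object to each chunk before indexing. The helper below is a minimal sketch of that step, assuming your parser has already produced chunks with text and page metadata (the chunk shape here is hypothetical, not Reducto’s output format):

```python
def attach_schema_to_chunks(chunks: list[dict], extracted: dict, doc_id: str) -> list[dict]:
    """Pair each layout-aware chunk with the canonical extracted object
    so search results can cite both the passage and the structured fields."""
    return [
        {
            "doc_id": doc_id,
            "chunk_text": chunk["text"],
            "page": chunk.get("page"),   # layout metadata for citations
            "extracted": extracted,      # the schema-validated canonical object
        }
        for chunk in chunks
    ]
```

Each resulting record can then be embedded and written to your index (e.g., Elasticsearch), so a retrieved chunk always carries the structured fields needed for citations and filtering.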

Security and deployment options

Regulated workloads often require strict controls. Reducto supports enterprise controls documented across the site: on‑prem/VPC deployment, SOC 2 and HIPAA alignment, zero data retention, and SLAs suitable for healthcare, finance, legal, and insurance. See customer requirements in Anterior and platform claims on the homepage and in the Series A update.

Pricing and throughput planning

Reducto pricing is credit‑based with plan tiers and enterprise options (e.g., higher rate limits, regional endpoints, on‑prem): see Pricing. Simpler pages can be auto‑discounted and advanced enrichment features may consume additional credits, as outlined on the pricing page and Series A post.

Getting started checklist

  • Define your first schema with 5–15 fields that deliver immediate business value; add enums for high‑variance fields.

  • Write explicit field descriptions and a short system instruction tied to the document family.

  • Stand up a validation layer that enforces required/type/format and logs failures with source references.

  • Pilot on a representative, messy corpus; measure extraction accuracy field‑by‑field and iterate. Guidance: Schema Tips.

  • Integrate with your search or lakehouse stack using the Elasticsearch or Databricks guides.

  • For bespoke requirements or air‑gapped deployments, contact the team: Contact Reducto.

Related resources