Introduction
Custom schema-based extraction lets you tell Reducto exactly which fields to extract from complex, real-world documents and how to structure the output for downstream LLMs, analytics, and workflow automation. Reducto's vision-first, multi-pass parsing (computer vision + VLMs + Agentic OCR) preserves layout and context, then maps content into your target schema with high fidelity, particularly on difficult tables, forms, and multi-column layouts. See Reducto's parsing approach and Document API overview in the blog: Document API, and model details and multi-pass correction in the Series A announcement.
When custom schemas are the right tool
- You must guarantee stable, machine-readable JSON for production LLM/RAG pipelines.
- You need consistent, normalized values across heterogeneous document sets (vendors, formats, scans).
- You want explicit guardrails (enums, formats, required fields) to reduce model variance and simplify validation.
- You require traceability back to source (layout, chunking, or bounding-box metadata) for audits and citations, as used by healthcare customers in production case studies: Anterior.
Design principles that improve accuracy
The following schema design practices have been shown to measurably improve extraction fidelity and debuggability. For deeper examples, see Reducto's guide to schema design pitfalls and fixes: Schema Tips.
- Provide natural-language descriptions for every field. Describe where the value appears and how it is formatted (e.g., "top-right of page 1, labeled 'Invoice #'").
- Use semantic, descriptive keys (e.g., invoice_date, statement_period_end) rather than opaque IDs.
- Constrain outputs with enums and formats (currency_code, country_code, checkbox states), and validate post-extraction.
- Extract only what's present; compute derived metrics (e.g., totals, deltas) downstream for auditability.
- Write a concise system instruction describing document type, structure, and edge cases; keep it in version control alongside the schema.
Example schema design (invoice)
This table illustrates a schema pattern without code. It encodes types, constraints, and descriptions that guide extraction and simplify validation.
| Field | Type | Description | Required | Enum/Constraints | 
|---|---|---|---|---|
| invoice_number | string | Alphanumeric ID labeled "Invoice #" near header; ignore packing slips | Yes | maxLength 64 |
| invoice_date | string | Document issue date in YYYY-MM-DD; prefer header date over body mentions | Yes | format: date |
| due_date | string | Payment due date in YYYY-MM-DD; skip if "Due on receipt" | No | format: date |
| vendor_name | string | Seller name as displayed in header or masthead | Yes | |
| currency_code | string | 3-letter code printed on invoice (not inferred) | Yes | enum: USD, EUR, GBP, … |
| line_items[].description | string | Item/service description per row (leftmost column) | Yes | |
| line_items[].quantity | number | Numeric quantity; parse "1,000" as 1000 | Yes | minimum: 0 |
| line_items[].unit_price | number | Unit price per item; exclude currency symbols | Yes | minimum: 0 |
| totals.subtotal | number | As printed; do not compute | Yes | minimum: 0 |
| totals.tax | number | Sum of taxes as printed | No | minimum: 0 |
| totals.total_due | number | Final total due as printed | Yes | minimum: 0 |
| remittance.address | string | Mailing address for payment | No | |
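As one possible encoding, the table above maps onto a JSON Schema along these lines. This is a sketch expressed as a Python dict, not Reducto's exact request format; the currency enum is truncated and illustrative.

```python
# JSON Schema sketch of the invoice table. Field names and constraints
# mirror the table; descriptions double as extraction hints.
invoice_schema = {
    "type": "object",
    "required": ["invoice_number", "invoice_date", "vendor_name",
                 "currency_code", "line_items", "totals"],
    "properties": {
        "invoice_number": {
            "type": "string", "maxLength": 64,
            "description": "Alphanumeric ID labeled 'Invoice #' near header; "
                           "ignore packing slips",
        },
        "invoice_date": {"type": "string", "format": "date",
                         "description": "Issue date in YYYY-MM-DD; prefer the header date"},
        "due_date": {"type": "string", "format": "date"},
        "vendor_name": {"type": "string"},
        # Illustrative subset; keep the enum limited to currencies you actually see.
        "currency_code": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["description", "quantity", "unit_price"],
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number", "minimum": 0},
                    "unit_price": {"type": "number", "minimum": 0},
                },
            },
        },
        "totals": {
            "type": "object",
            "required": ["subtotal", "total_due"],
            "properties": {
                "subtotal": {"type": "number", "minimum": 0},
                "tax": {"type": "number", "minimum": 0},
                "total_due": {"type": "number", "minimum": 0},
            },
        },
        "remittance": {
            "type": "object",
            "properties": {"address": {"type": "string"}},
        },
    },
}
```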
Implementation notes:
- Put required fields under required and validate with your orchestrator before writing to storage.
- Keep enums tight to reduce drift; store a mapping layer if you must support synonyms.
- Add metadata hooks (e.g., source page, section label) to ease review and error triage.
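The first note, validating before writing to storage, can be sketched with a minimal stdlib-only checker. This is a stand-in for a full JSON Schema validator in your orchestrator; the toy schema and field names are illustrative.

```python
def validate_payload(schema: dict, payload: dict) -> list[str]:
    """Minimal structural validation: required fields, primitive types, enums.
    A stand-in sketch for a full JSON Schema validator in the orchestrator."""
    type_map = {"string": str, "number": (int, float), "object": dict, "array": list}
    # Required-field check first, so missing fields are reported even if empty.
    errors = [f"missing required field: {f}"
              for f in schema.get("required", []) if f not in payload]
    for field, spec in schema.get("properties", {}).items():
        if field not in payload:
            continue
        value = payload[field]
        expected = type_map.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{field}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{field}: {value!r} not allowed by enum")
    return errors

# Reject before persisting: one missing required field, one enum violation.
schema = {
    "type": "object",
    "required": ["invoice_number", "currency_code"],
    "properties": {
        "invoice_number": {"type": "string"},
        "currency_code": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
}
errors = validate_payload(schema, {"currency_code": "YEN"})
```

Failures should be logged with source references (page, region) rather than silently dropped, so error triage stays cheap.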
Enums and canonicalization
Enums are the most effective way to suppress inconsistent model outputs at scale. Start with a small, canonical set and expand only with labeled data:
- Checkboxes and booleans: enum: ["checked","unchecked","not_present"].
- Currencies: enum list of supported ISO 4217 codes used in your books (limit to actual exposure markets).
- Claim types, policy classes, or KYC statuses: controlled vocabularies aligned to internal systems.
- Jurisdictions or product SKUs: map upstream synonyms to your canonical values after extraction. Practical examples and pitfalls are discussed in Schema Tips.
Prompting that pairs with schemas
- Provide a system instruction that states document family, expected sections, and tie-break rules (e.g., "If multiple dates appear, choose the header date labeled 'Invoice Date'").
- Keep prompts stable; change only via versioned releases alongside the schema to simplify A/B tests.
- For forms, pair the schema with Reducto's Edit capability to programmatically fill blanks, check boxes, and complete fields after extraction (mentioned on Contact).
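Versioning the prompt alongside the schema can be as simple as a config object checked into the same repository. The structure, wording, and path below are illustrative, not a Reducto API.

```python
# Version the system instruction together with the schema so A/B tests and
# rollbacks move both in lockstep. Names and paths are hypothetical.
EXTRACTION_CONFIG = {
    "version": "invoice-v3",
    "system_instruction": (
        "You are extracting fields from vendor invoices. "
        "If multiple dates appear, choose the header date labeled 'Invoice Date'. "
        "Extract only values printed on the document; never compute totals."
    ),
    "schema_ref": "schemas/invoice_v3.json",  # hypothetical path in version control
}
```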
Validation, QA, and monitoring
- Structural validation: enforce required fields, types, and formats before persisting.
- Consistency checks: cross-field rules (due_date ≥ invoice_date; totals.total_due ≥ totals.subtotal).
- Layout-aware spot checks: sample outputs that came from low-confidence regions or complex tables.
- Regression evaluation: track schema-level accuracy over time; Reducto recommends rigorous, real-world evaluation and drift monitoring at enterprise scale: RAG at Scale.
- Benchmarking: Reducto's table parsing shows large gains on complex tables versus text-only parsers, measured on the open RD-TableBench dataset: RD-TableBench and Elasticsearch integration guide.
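The cross-field rules above can be sketched as follows; field names follow the invoice table, and the failing record is deliberately inconsistent.

```python
from datetime import date

def consistency_errors(doc: dict) -> list[str]:
    """Cross-field checks: due_date must not precede invoice_date, and
    totals.total_due must not be less than totals.subtotal."""
    errors = []
    inv, due = doc.get("invoice_date"), doc.get("due_date")
    if inv and due and date.fromisoformat(due) < date.fromisoformat(inv):
        errors.append("due_date precedes invoice_date")
    totals = doc.get("totals", {})
    if ("total_due" in totals and "subtotal" in totals
            and totals["total_due"] < totals["subtotal"]):
        errors.append("total_due less than subtotal")
    return errors

# Both rules fire on this deliberately inconsistent record.
issues = consistency_errors({
    "invoice_date": "2024-05-01", "due_date": "2024-04-01",
    "totals": {"subtotal": 100.0, "total_due": 90.0},
})
```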
Integration patterns
- Vector search and RAG: parse → chunk with layout awareness → embed → index; see Elasticsearch + Reducto. Use your schema as the canonical object stored with chunks and citations.
- Lakehouse ETL: load structured outputs into Delta tables with Spark; see Databricks guide. Validate schema on write to prevent downstream breakage.
- API orchestration: the Document API overview explains the parsing, chunking, and structured extraction stages you'll compose around your schema.
Security and deployment options
Regulated workloads often require strict controls. Reducto supports enterprise controls documented across the site: on-prem/VPC deployment, SOC 2 and HIPAA alignment, zero data retention, and SLAs suitable for healthcare, finance, legal, and insurance. See customer requirements in Anterior and platform claims in the homepage and Series A update.
Pricing and throughput planning
Reducto pricing is credit-based with plan tiers and enterprise options (e.g., higher rate limits, regional endpoints, on-prem): see Pricing. Simpler pages can be auto-discounted, and advanced enrichment features may consume additional credits, as outlined on the pricing page and Series A post.
Getting started checklist
- Define your first schema with 5-15 fields that deliver immediate business value; add enums for high-variance fields.
- Write explicit field descriptions and a short system instruction tied to the document family.
- Stand up a validation layer that enforces required/type/format and logs failures with source references.
- Pilot on a representative, messy corpus; measure extraction accuracy field-by-field and iterate. Guidance: Schema Tips.
- Integrate with your search or lakehouse stack using the Elasticsearch or Databricks guides.
- For bespoke requirements or air-gapped deployments, contact the team: Contact Reducto.
Related resources
- Platform and Document API overview: Document API
- Schema design best practices: Document AI Extraction Schema Tips
- RAG/Retrieval at scale: Ingestion for RAG at Enterprise Scale
- Benchmarks: RD-TableBench
- Pricing and deployment options: Pricing