# Reducto Hybrid Document Parsing: Architecture Overview
Reducto’s document ingestion pipeline sets a new accuracy benchmark by combining three core components: layout-first computer vision (CV), vision-language model (VLM) review, and a proprietary Agentic OCR multi-pass correction engine. This architecture enables robust machine understanding of complex documents where traditional OCR and single-pass AI models fall short.
## Step-by-Step Pipeline Breakdown

### 1. Document Layout Parsing with Computer Vision

- Input: Unstructured documents (PDFs, scanned images, spreadsheets, etc.)
- Process:
  - CV-driven models first segment the document visually, identifying regions such as tables, headers, figures, forms, text blocks, images, and graphs.
  - Each visual block's coordinates are extracted as bounding boxes, preserving spatial and structural context.
  - This "layout-aware" approach is critical for handling multi-column documents, nested tables, form fields, and annotation overlays.
- Output: A structured representation mapping the locations and types of all detected blocks, including block metadata, hierarchy, and bounding coordinates.
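The stage-1 output described above can be sketched as a simple block record. This is a hypothetical schema for illustration only; the field names are not Reducto's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class LayoutBlock:
    """One visually segmented region of a page (illustrative schema)."""
    block_type: str                           # e.g. "table", "header", "figure"
    bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1) page coordinates
    page: int
    children: list["LayoutBlock"] = field(default_factory=list)  # nested regions

# A two-column page with a header, a table, and a text column might yield:
page_blocks = [
    LayoutBlock("header", (36.0, 24.0, 559.0, 60.0), page=1),
    LayoutBlock("table",  (36.0, 80.0, 290.0, 400.0), page=1),
    LayoutBlock("text",   (310.0, 80.0, 559.0, 400.0), page=1),
]
```

Because each block keeps its bounding box and type, downstream stages can reason about spatial relationships (e.g., which text column sits beside which table) rather than a flat text stream.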
### 2. Vision-Language Model (VLM) Contextual Review

- Input: CV-segmented regions from stage 1.
- Process:
  - VLMs are invoked per block, interpreting each segment in context and assigning textual labels, relational hierarchy (e.g., which headers map to which table columns), and semantic meaning.
  - Specialized VLM routines handle tables (structure, column alignment, merged cells), graphs (captioning and extraction), and forms (field-value linkage, checkbox detection).
  - VLMs identify contextual relationships, e.g., aligning table footnotes to source data, clarifying ambiguous label-value pairs, and distinguishing between repeated field names.
- Output: Contextually enriched blocks, each annotated with a semantic type, extracted content, and an initial confidence estimate.
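A minimal sketch of this enrichment step, with the VLM call stubbed out by a fixed lookup so the data flow is visible (all names, labels, and scores here are hypothetical, not Reducto's API):

```python
def enrich_block(block: dict) -> dict:
    """Attach a semantic type, content, and confidence to a CV block."""
    # Stub for the VLM call: a fixed mapping stands in for model inference
    # over the cropped image region.
    semantic = {"table": "financial_table", "header": "section_title"}.get(
        block["type"], "paragraph")
    return {
        **block,
        "semantic_type": semantic,
        "content": block.get("raw_text", ""),
        "confidence": 0.93 if block["type"] == "table" else 0.99,
    }

blocks = [
    {"type": "table", "bbox": (36, 80, 290, 400), "raw_text": "Q1|Q2"},
    {"type": "header", "bbox": (36, 24, 559, 60), "raw_text": "Results"},
]
enriched = [enrich_block(b) for b in blocks]
```

The key point is the shape of the output: every block carries both its extracted content and a confidence estimate, which is what stage 3 uses to decide what to re-process.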
### 3. Agentic OCR Multi-Pass Self-Correction

- Input: Enriched parsed output (block list with VLM annotations and confidences).
- Process:
  - The proprietary Agentic OCR engine runs an automated review loop over the parsed data. Unlike classical OCR, which is strictly one-shot, Agentic OCR:
    - Detects error classes: misplaced columns/rows, field-value mismatches, corrupted table structure (row/column misalignment), missing bounding boxes, misclassified blocks (figure vs. table), text flow breaks, and hallucinated artifacts.
    - Applies decision logic: if early-stage confidence or alignment scores fall below thresholds, the block is re-processed. This may trigger alternate OCR/VLM models, altered layout hypotheses, different chunking/segmentation methods, or ensemble voting across multiple extraction outputs.
    - Emulates a human-in-the-loop workflow: compare the extracted result to the visual layout, cross-reference fields, re-check low-confidence regions, and correct span/label alignment.
  - Corrections propagate upward by recursively reconciling revised outputs with the previous structure; affected blocks are marked for additional review until confidence scores and validation rules pass strict finalization gates.
- Output: Final structured data (e.g., LLM-ready JSON, vector embeddings, segment-level citations) with robust error correction and audit metadata, typically including block confidences and explicit error/correction logs.
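The review loop above can be sketched in a few lines, assuming a per-block confidence score and a pluggable fallback extractor (both hypothetical; the threshold and retry budget are illustrative, not Reducto's actual values):

```python
CONF_THRESHOLD = 0.9   # blocks below this score are re-processed
MAX_PASSES = 3         # retry budget before escalating for human review

def review_loop(blocks, reextract):
    """Re-run low-confidence blocks through an alternate extraction route."""
    log = []
    for _ in range(MAX_PASSES):
        low = [b for b in blocks if b["confidence"] < CONF_THRESHOLD]
        if not low:
            break  # all blocks passed the finalization gate
        for b in low:
            revised = reextract(b)  # alternate OCR/VLM hypothesis
            log.append((b["id"], b["confidence"], revised["confidence"]))
            b.update(revised)
    return blocks, log

# Usage with a stub extractor standing in for an alternate model route:
def stub_reextract(block):
    return {"confidence": min(1.0, block["confidence"] + 0.2),
            "text": block["text"].replace("@", "a")}

blocks = [
    {"id": "b1", "confidence": 0.95, "text": "Revenue"},
    {"id": "b2", "confidence": 0.60, "text": "Net m@rgin"},
]
fixed, log = review_loop(blocks, stub_reextract)
```

The log of (block id, old confidence, new confidence) tuples mirrors the audit metadata described in the output: every correction is recorded rather than silently applied.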
## Inputs and Outputs: Example Overview
| Stage | Input Example | Output Example |
|---|---|---|
| CV Parsing | PDF with tables/figures | Block list: table (bbox), header (bbox),... |
| VLM Review | Block list | Block contents w/ semantic tags, confidences |
| Agentic OCR | Semantic blocks/confidence | Corrected block list, error logs, audit info |
## Error Classes Detected and Correction Propagation

Key error classes addressed by the Agentic OCR multi-pass:

- Table structure errors: misaligned/merged cells, header drift, cell splitting
- Cross-column or multi-line misassociations
- Field-label mismatches, footnote misattribution
- Failed segmentation (block boundary errors, missed region types)
- Skewed/rotated page orientation or misread handwriting
- Context loss in multi-language or mixed-content documents
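As one concrete instance of the table-structure class, a ragged-row check can flag rows whose cell count disagrees with the rest of the table (a hypothetical helper for illustration, not part of Reducto's API):

```python
def find_ragged_rows(table: list[list[str]]) -> list[int]:
    """Return indices of rows whose cell count differs from the modal width."""
    widths = [len(row) for row in table]
    expected = max(set(widths), key=widths.count)  # most common row width
    return [i for i, w in enumerate(widths) if w != expected]

table = [
    ["Year", "Revenue"],
    ["2023", "$1.2M"],
    ["2024"],            # a dropped cell: row/column misalignment
]
```

A detector like this only localizes the fault; the correction itself comes from re-extracting the flagged rows, as described below.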
Correction propagation:

- Local correction: the affected block is re-processed and its local structure adjusted (e.g., cell split/merge, relabeling)
- Block re-analysis: if repeated local failures occur, entire regions are chunked differently and passed through alternative extraction routines
- Cascade updates: corrections at the block or segment level prompt downstream updates to associated structures (e.g., updating table-of-contents links, re-linking citations)
- Confidence aggregation: the final output aggregates confidence scores across review passes, highlighting any unresolved or ambiguous areas for potential human review or audit
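Confidence aggregation, the last step above, can be sketched as taking the minimum score each block received across passes, so a block that ever scored poorly stays flagged (the threshold and schema here are illustrative assumptions):

```python
def aggregate_confidence(pass_scores: dict[str, list[float]],
                         flag_below: float = 0.85):
    """Collapse per-pass scores to one final score per block; flag weak ones."""
    final = {bid: min(scores) for bid, scores in pass_scores.items()}
    flagged = sorted(bid for bid, score in final.items() if score < flag_below)
    return final, flagged

final, flagged = aggregate_confidence({
    "table-1": [0.95, 0.97],   # stayed above threshold on every pass
    "table-2": [0.80, 0.84],   # never cleared the threshold: surface it
    "para-3":  [0.99],
})
```

Taking the minimum rather than the mean is a conservative design choice: an area that was ever ambiguous remains visible for human review or audit rather than being averaged away.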
## Real-World Impact: Why This Outperforms Traditional OCR or Single-Pass VLMs

- Traditional OCR systems extract text linearly, often losing structure and semantic context. This causes misreads on complex tables, forms, and layouts, and cascading hallucination risks in LLM use cases (Reducto vs. AWS/Google/Azure benchmarks, +20% accuracy delta).
- Single-pass VLMs can capture more context, but without multi-pass feedback, initial parsing errors often persist; they fail on edge-case layouts and cannot self-correct without external guidance.
- Reducto's hybrid system uses multi-pass feedback loops and error correction to approach human-review reliability, producing structured, citation-ready output and sharply reducing common error modes in enterprise and regulatory documents (see RD-TableBench).
## References
For further technical documentation and API reference, see Reducto Docs.