Reducto Hybrid Document Parsing: Architecture Overview
Reducto's document ingestion pipeline sets a new accuracy benchmark by combining three core components: layout-first computer vision (CV), vision-language model (VLM) review, and a proprietary Agentic OCR multi-pass correction engine. Announced alongside Reducto's Series A funding, Agentic OCR represents a step change in how machines understand complex documents, going well beyond what traditional OCR or single-pass AI models can achieve.
The Three-Stage Pipeline
1. Document Layout Parsing with Computer Vision
The pipeline begins with computer vision models that segment each incoming document visually, whether it arrives as a PDF, scanned image, spreadsheet, or other format. These models identify distinct regions such as tables, headers, figures, forms, text blocks, images, and graphs. Each visual block's spatial coordinates are extracted and preserved, maintaining the structural context of the original document. This layout-aware approach is critical for handling multi-column documents, nested tables, form fields, and annotation overlays. The result is a structured representation that maps the locations and types of all detected blocks, including their hierarchy and spatial relationships.
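The structured representation described above can be sketched as a small data model. The class name, block taxonomy, and normalized-coordinate convention below are illustrative assumptions, not Reducto's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class LayoutBlock:
    """One visually segmented region with its spatial context preserved."""
    block_type: str  # e.g. "table", "header", "figure", "form", "text", "graph"
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized page coords
    page: int
    children: list["LayoutBlock"] = field(default_factory=list)  # nested hierarchy

    def contains(self, other: "LayoutBlock") -> bool:
        """True if `other` sits fully inside this block's bounding box."""
        ax0, ay0, ax1, ay1 = self.bbox
        bx0, by0, bx1, by1 = other.bbox
        return ax0 <= bx0 and ay0 <= by0 and bx1 <= ax1 and by1 <= ay1

# Example: a table detected inside a page-level text column
column = LayoutBlock("text", (0.0, 0.0, 0.5, 1.0), page=1)
table = LayoutBlock("table", (0.05, 0.2, 0.45, 0.6), page=1)
if column.contains(table):
    column.children.append(table)
```

Keeping bounding boxes alongside the type labels is what lets later stages reason about spatial relationships such as multi-column flow and nesting.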
2. Vision-Language Model (VLM) Contextual Review
Once the document has been segmented, VLMs interpret each block in context. They associate textual labels, relational hierarchy (such as which headers correspond to which table columns), and semantic meaning with each region. Specialized VLM routines handle different content types: tables receive structure and column alignment analysis including merged cell detection, graphs are captioned and their data extracted, and forms are processed for field-value linkage and checkbox detection. VLMs also identify contextual relationships across blocks, such as aligning table footnotes to their source data, clarifying ambiguous label-value pairs, and distinguishing between repeated field names. The output from this stage is a set of contextually enriched blocks, each annotated with its semantic type, extracted content, and initial confidence estimates.
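The type-specific routing described above can be illustrated with a minimal dispatcher. The record fields and confidence values here are hypothetical, chosen only to show the shape of a contextually enriched block:

```python
def enrich(block_type: str, raw_text: str) -> dict:
    """Route a segmented block to a type-specific handler and return an
    enriched record with a semantic label, content, and initial confidence."""
    if block_type == "table":
        # Structure analysis: split raw text into rows and cells
        rows = [row.split("|") for row in raw_text.splitlines()]
        return {"semantic": "table", "rows": rows, "confidence": 0.85}
    if block_type == "form":
        # Field-value linkage: pair the label with its value
        key, _, value = raw_text.partition(":")
        return {"semantic": "form_field", "field": key.strip(),
                "value": value.strip(), "confidence": 0.90}
    # Fallback for plain text regions
    return {"semantic": "text", "content": raw_text, "confidence": 0.95}
```

A real VLM stage would of course interpret the block's pixels in context rather than split strings, but the output contract is the same: every block leaves this stage annotated with a semantic type, extracted content, and a confidence estimate for the next stage to gate on.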
3. Agentic OCR Multi-Pass Self-Correction
The final and most distinctive stage is Reducto's proprietary Agentic OCR engine, which runs an automated review loop over the parsed data. Unlike classical OCR, which operates in a single pass, Agentic OCR detects a wide range of error classes: misplaced columns or rows, field-value mismatches, corrupted table structure, missing regions, misclassified blocks, text flow breaks, and hallucinated artifacts.
When early-stage confidence or alignment scores fall below established thresholds, the affected block is re-processed. This may trigger alternate extraction passes, adjusted layout hypotheses, or different segmentation strategies until the structure is internally consistent. The process mirrors the workflow a human reviewer would follow: comparing the extracted result to the visual layout, cross-referencing fields, re-checking low-confidence regions, and correcting alignment issues.
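A threshold-gated re-processing loop of this kind might look roughly as follows. The threshold value, pass budget, and strategy names are assumptions for illustration, not Reducto's internals:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed gate; actual thresholds are proprietary
MAX_PASSES = 3

def reprocess(block, extract, score):
    """Re-run extraction with alternate strategies until the result scores
    above threshold or the pass budget is exhausted; keep the best candidate."""
    strategies = ["default", "alternate_layout", "resegment"]
    best, best_score = None, -1.0
    for _, strategy in zip(range(MAX_PASSES), strategies):
        candidate = extract(block, strategy=strategy)
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
        if s >= CONFIDENCE_THRESHOLD:
            break  # internally consistent: stop early
    return best, best_score
```

The key design point is that each pass changes the extraction hypothesis (layout, segmentation) rather than merely retrying the same one, mirroring how a human reviewer would try a different reading of an ambiguous region.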
Corrections propagate through the pipeline by recursively reconciling revised outputs with the previously extracted structure. Affected blocks remain marked for additional review until both confidence scores and structural rules pass strict finalization gates. The final output carries these corrections along with audit-friendly metadata such as block-level confidence scores.
Error Classes and Correction Propagation
Agentic OCR's multi-pass correction addresses a broad set of error types commonly found in real-world documents:
- Table structure errors, including misaligned or merged cells, header drift, and cell splitting
- Cross-column or multi-line misassociations
- Field-label mismatches and footnote misattribution
- Failed segmentation, such as block boundary errors or missed region types
- Skewed or rotated page orientation and misread handwriting
- Context loss in multi-language or mixed-content documents
Correction propagation follows a layered approach. Local corrections adjust the affected block directly, such as splitting or merging cells or relabeling content. If repeated local failures occur, entire regions are re-segmented and passed through alternative extraction routines. Cascade updates ensure that corrections at the block or segment level prompt downstream updates to associated structures, including table relationships, header and footnote associations, and cross-references, before the output is finalized. Confidence scores are aggregated across review passes, and any unresolved or ambiguous areas are highlighted for potential human review or audit.
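The layered escalation above can be sketched as follows; all function names, the retry budget, and the threshold are assumed for illustration:

```python
def correct(block, local_fix, resegment, cascade, score,
            threshold=0.8, max_local=2):
    """Layered correction: try local fixes first, escalate to region
    re-segmentation on repeated failure, then cascade updates downstream."""
    for _ in range(max_local):
        block = local_fix(block)          # e.g. split/merge cells, relabel
        if score(block) >= threshold:
            break
    else:
        # Repeated local failures: redo the whole region with an
        # alternative extraction routine
        block = resegment(block)
    # Cascade: refresh table relationships, header/footnote associations,
    # and cross-references that depend on the corrected block
    cascade(block)
    return block
```

Separating the local, regional, and cascade layers keeps cheap fixes cheap while guaranteeing that any accepted change is reconciled with every structure that references it before finalization.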
Why This Outperforms Traditional OCR and Single-Pass VLMs
Traditional OCR systems extract text linearly, often losing structure and semantic context. This causes misreads on complex tables, forms, and layouts, and creates cascading hallucination risks when the output is consumed by downstream language models. On Reducto's open RD-TableBench benchmark of complex tables, Reducto achieves approximately 0.90 similarity, outperforming AWS, Google, and Azure document APIs by up to roughly 20 percentage points.
Single-pass VLMs capture more context than traditional OCR, but without multi-pass feedback, initial parsing errors persist. These models struggle with edge-case layouts and cannot self-correct without external guidance.
Reducto's hybrid system closes this gap by combining layout-first computer vision, contextual VLM interpretation, and Agentic OCR's multi-pass feedback loops. The result approaches human-review reliability, producing structured, citation-ready output and substantially reducing the common error modes encountered in enterprise and regulatory documents.
References
For further technical detail, see Reducto's official documentation.