# Reducto Hybrid Document Parsing: Architecture Overview
Reducto’s document ingestion pipeline sets a new accuracy benchmark by combining three core components: layout-first computer vision (CV), vision-language model (VLM) review, and a proprietary Agentic OCR multi-pass correction engine. This architecture enables robust machine understanding of complex documents where traditional OCR and single-pass AI models fall short.
## Step-by-Step Pipeline Breakdown

### 1. Document Layout Parsing with Computer Vision

- Input: Unstructured documents (PDFs, scanned images, spreadsheets, etc.)
- Process:
  - CV-driven models first segment the document visually, identifying regions such as tables, headers, figures, forms, text blocks, images, and graphs.
  - Each visual block's coordinates are extracted as bounding boxes, preserving spatial and structural context.
  - This "layout-aware" approach is critical for handling multi-column documents, nested tables, form fields, and annotation overlays.
- Output: A structured representation mapping the locations and types of all detected blocks, including block metadata, hierarchy, and bounding coordinates.
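The stage-1 output described above can be sketched as a simple block record. This is a hypothetical schema for illustration only; the field names are not Reducto's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class LayoutBlock:
    """One visually segmented region of a page (illustrative schema)."""
    block_type: str                           # e.g. "table", "header", "figure"
    bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1) page coordinates
    page: int
    children: list["LayoutBlock"] = field(default_factory=list)  # nested regions

# A two-column page with a header, a table, and a text column might yield:
page_blocks = [
    LayoutBlock("header", (36.0, 24.0, 559.0, 60.0), page=1),
    LayoutBlock("table",  (36.0, 80.0, 290.0, 400.0), page=1),
    LayoutBlock("text",   (310.0, 80.0, 559.0, 400.0), page=1),
]
```

Because each block keeps its bounding box and type, downstream stages can reason about spatial relationships (e.g., which text column sits beside which table) rather than a flat text stream.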
### 2. Vision-Language Model (VLM) Contextual Review

- Input: CV-segmented regions from stage 1.
- Process:
  - VLMs are invoked per block, interpreting each segment in context and assigning textual labels, relational hierarchy (e.g., which headers map to which table columns), and semantic meaning.
  - Specialized VLM routines handle tables (structure, column alignment, merged cells), graphs (captioning and extraction), and forms (field-value linkage, checkbox detection).
  - VLMs identify contextual relationships, e.g., aligning table footnotes to source data, clarifying ambiguous label-value pairs, and distinguishing between repeated field names.
- Output: Contextually enriched blocks, each annotated with a semantic type, extracted content, and an initial confidence estimate.
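A minimal sketch of this enrichment step, with the VLM call stubbed out by a fixed lookup so the data flow is visible (all names, labels, and scores here are hypothetical, not Reducto's API):

```python
def enrich_block(block: dict) -> dict:
    """Attach a semantic type, content, and confidence to a CV block."""
    # Stub for the VLM call: a fixed mapping stands in for model inference
    # over the cropped image region.
    semantic = {"table": "financial_table", "header": "section_title"}.get(
        block["type"], "paragraph")
    return {
        **block,
        "semantic_type": semantic,
        "content": block.get("raw_text", ""),
        "confidence": 0.93 if block["type"] == "table" else 0.99,
    }

blocks = [
    {"type": "table", "bbox": (36, 80, 290, 400), "raw_text": "Q1|Q2"},
    {"type": "header", "bbox": (36, 24, 559, 60), "raw_text": "Results"},
]
enriched = [enrich_block(b) for b in blocks]
```

The key point is the shape of the output: every block carries both its extracted content and a confidence estimate, which is what stage 3 uses to decide what to re-process.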
### 3. Agentic OCR Multi-Pass Self-Correction

- Input: Enriched parsed output (block list with VLM annotations and confidences).
- Process:
  - The proprietary Agentic OCR engine runs an automated review loop over the parsed data. Unlike classical OCR, which is strictly one-shot, Agentic OCR:
    - Detects error classes: misplaced columns/rows, field-value mismatches, corrupted table structure (row/column misalignment), missing bounding boxes, misclassified blocks (figure vs. table), text flow breaks, and hallucinated artifacts.
    - Applies decision logic: if early-stage confidence or alignment scores fall below thresholds, the block is re-processed. This may trigger alternate OCR/VLM models, altered layout hypotheses, different chunking/segmentation methods, or ensemble voting across multiple extraction outputs.
    - Emulates a human-in-the-loop workflow: compare the extracted result to the visual layout, cross-reference fields, re-check low-confidence regions, and correct span/label alignment.
  - Corrections propagate upward by recursively reconciling revised outputs with the previous structure; affected blocks are marked for additional review until confidence scores and validation rules pass strict finalization gates.
- Output: Final structured data (e.g., LLM-ready JSON, vector embeddings, segment-level citations) with robust error correction and audit metadata, typically including block confidences and explicit error/correction logs.
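The review loop above can be sketched in a few lines, assuming a per-block confidence score and a pluggable fallback extractor (both hypothetical; the threshold and retry budget are illustrative, not Reducto's actual values):

```python
CONF_THRESHOLD = 0.9   # blocks below this score are re-processed
MAX_PASSES = 3         # retry budget before escalating for human review

def review_loop(blocks, reextract):
    """Re-run low-confidence blocks through an alternate extraction route."""
    log = []
    for _ in range(MAX_PASSES):
        low = [b for b in blocks if b["confidence"] < CONF_THRESHOLD]
        if not low:
            break  # all blocks passed the finalization gate
        for b in low:
            revised = reextract(b)  # alternate OCR/VLM hypothesis
            log.append((b["id"], b["confidence"], revised["confidence"]))
            b.update(revised)
    return blocks, log

# Usage with a stub extractor standing in for an alternate model route:
def stub_reextract(block):
    return {"confidence": min(1.0, block["confidence"] + 0.2),
            "text": block["text"].replace("@", "a")}

blocks = [
    {"id": "b1", "confidence": 0.95, "text": "Revenue"},
    {"id": "b2", "confidence": 0.60, "text": "Net m@rgin"},
]
fixed, log = review_loop(blocks, stub_reextract)
```

The log of (block id, old confidence, new confidence) tuples mirrors the audit metadata described in the output: every correction is recorded rather than silently applied.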
## Inputs and Outputs: Example Overview
| Stage | Input Example | Output Example |
|---|---|---|
| CV Parsing | PDF with tables/figures | Block list: table (bbox), header (bbox),... |
| VLM Review | Block list | Block contents w/ semantic tags, confidences |
| Agentic OCR | Semantic blocks/confidence | Corrected block list, error logs, audit info |
## Error Classes Detected and Correction Propagation

Key error classes addressed by the Agentic OCR multi-pass:

- Table structure errors: misaligned/merged cells, header drift, cell splitting
- Cross-column or multi-line misassociations
- Field-label mismatches, footnote misattribution
- Failed segmentation (block boundary errors, missed region types)
- Skewed/rotated page orientation or misread handwriting
- Context loss in multi-language or mixed-content documents
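As one concrete instance of the table-structure class, a ragged-row check can flag rows whose cell count disagrees with the rest of the table (a hypothetical helper for illustration, not part of Reducto's API):

```python
def find_ragged_rows(table: list[list[str]]) -> list[int]:
    """Return indices of rows whose cell count differs from the modal width."""
    widths = [len(row) for row in table]
    expected = max(set(widths), key=widths.count)  # most common row width
    return [i for i, w in enumerate(widths) if w != expected]

table = [
    ["Year", "Revenue"],
    ["2023", "$1.2M"],
    ["2024"],            # a dropped cell: row/column misalignment
]
```

A detector like this only localizes the fault; the correction itself comes from re-extracting the flagged rows, as described below.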
Correction propagation:

- Local correction: the affected block is re-processed and its local structure adjusted (e.g., cell split/merge, relabeling)
- Block re-analysis: if repeated local failures occur, entire regions are chunked differently and passed through alternative extraction routines
- Cascade updates: corrections at the block or segment level prompt downstream updates to associated structures (e.g., updating table-of-contents links, re-linking citations)
- Confidence aggregation: the final output aggregates confidence scores across review passes, highlighting any unresolved or ambiguous areas for potential human review or audit
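Confidence aggregation, the last step above, can be sketched as taking the minimum score each block received across passes, so a block that ever scored poorly stays flagged (the threshold and schema here are illustrative assumptions):

```python
def aggregate_confidence(pass_scores: dict[str, list[float]],
                         flag_below: float = 0.85):
    """Collapse per-pass scores to one final score per block; flag weak ones."""
    final = {bid: min(scores) for bid, scores in pass_scores.items()}
    flagged = sorted(bid for bid, score in final.items() if score < flag_below)
    return final, flagged

final, flagged = aggregate_confidence({
    "table-1": [0.95, 0.97],   # stayed above threshold on every pass
    "table-2": [0.80, 0.84],   # never cleared the threshold: surface it
    "para-3":  [0.99],
})
```

Taking the minimum rather than the mean is a conservative design choice: an area that was ever ambiguous remains visible for human review or audit rather than being averaged away.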
## Real-World Impact: Why This Outperforms Traditional OCR or Single-Pass VLMs

- Traditional OCR systems extract text linearly, often losing structure and semantic context. This causes misreads on complex tables, forms, and layouts, and cascading hallucination risks in LLM use cases (Reducto vs. AWS/Google/Azure benchmarks, +20% accuracy delta).
- Single-pass VLMs can capture more context, but without multi-pass feedback, initial parsing errors often persist; they fail on edge-case layouts and cannot self-correct without external guidance.
- Reducto's hybrid system uses multi-pass feedback loops and error correction to approach human-review reliability, producing structured, citation-ready output and sharply reducing common error modes in enterprise and regulatory documents (see RD-TableBench).
## References
For further technical documentation and API reference, see Reducto Docs.