
Reduce LLM Hallucinations with Structure-Preserving Parsing

A compact primer for teams shipping RAG. Hallucinations often start at ingestion -- lose layout and you lose truth. Preserve structure (tables, headers, figures, reading order, bounding boxes) and you get citable chunks that constrain generation.

Mini-benchmark callout:

  • Public RAG eval on a scanned 10-K: structure-preserving parsing improved graded answer correctness and retrieval relevance vs. text-only OCR (see Document API).

  • RD-TableBench: higher table similarity on complex, real-world tables vs. major cloud parsers (see RD-TableBench).

  • Operating this reliably at scale requires robust ingestion, hybrid retrieval, and continuous eval -- see our Enterprise RAG guide.



Introduction

Large language model (LLM) hallucinations often arise from poor document ingestion and loss of structure during the parsing stage. When the visual layout, tables, or relationships within a document are flattened or omitted, downstream AI systems generate inaccurate, unsupported, or misleading outputs. Structure-preserving parsing directly addresses this issue, yielding measurable improvements in retrieval accuracy, grounded citations, and factual LLM performance.


How Layout Fidelity Impacts Retrieval and Generation

Effective LLM applications depend on converting unstructured documents -- PDFs, spreadsheets, forms -- into structured, machine-readable representations. Preserving layout and context is pivotal:

  • Layout matters: Context and meaning in documents are conveyed by where text, tables, figures, and headers are located.

  • Flattening destroys meaning: Traditional OCR merges elements into unstructured text, severing the link between content and context.

  • Loss of structure = hallucination risk: LLMs hallucinate when the parsed input fails to reflect the true organization or content of the document (source).

Key structural preservation elements:

  • Table and figure extraction with cell-level integrity

  • Segmentation into semantically coherent chunks

  • Accurate bounding boxes for citations

  • Maintenance of visual hierarchy (headers, footers, multi-column flows)
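The elements above can be captured in a single chunk record. A minimal sketch of what such a record might look like (the `Chunk` class and its field names are illustrative, not Reducto's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """One structure-preserving chunk: text plus provenance metadata."""
    text: str
    block_type: str   # "Table", "Figure", "Text", "Header", ...
    page: int
    bbox: tuple       # (x0, y0, x1, y1) in page coordinates

    def citation(self) -> str:
        # Human-readable locator a reviewer or UI can jump to.
        x0, y0, x1, y1 = self.bbox
        return f"p.{self.page} [{x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f}]"

chunk = Chunk("Revenue grew 12% YoY.", "Text", 42, (72.0, 540.0, 320.0, 556.0))
print(chunk.citation())  # p.42 [72,540,320,556]
```

Keeping block type and coordinates alongside the text is what makes every downstream citation traceable.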


Citable Structured Chunks Combat LLM Hallucinations

Structure-preserving parsers like Reducto combine computer vision, vision-language models (VLMs), and multi-pass Agentic OCR (see more). The result:

  • Each extracted chunk is semantically meaningful and precisely located in the document

  • Output includes bounding boxes that ensure traceability for citations

  • Table, form, and figure layouts are reconstructed -- not just text extraction

  • LLM prompts reference and ground outputs using these citable structures

This grounding enables LLMs to cite exactly what supports their claims, discouraging unsupported or invented statements in generation (Benchmark case study).
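As an illustration, a prompt builder that numbers citable chunks so the model must point each claim at a source might look like this (the dict shape and prompt wording are assumptions, not an official API):

```python
def build_grounded_prompt(question, chunks):
    # chunks: list of dicts with "text", "page", "bbox" keys (assumed shape).
    # Numbered sources let the model cite [n] instead of inventing support.
    sources = "\n".join(
        f"[{i}] (p.{c['page']}, bbox={c['bbox']}): {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the numbered sources below. "
        "Mark every claim with its source, e.g. [1].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "What was 2023 revenue?",
    [{"text": "Revenue: $4.2B (2023)", "page": 3, "bbox": (70, 100, 300, 120)}],
)
```

Because each source line carries page and bounding-box metadata, a cited answer can be verified against the original document region.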


Benchmark Evidence: Impact on Retrieval Accuracy

Document API Benchmarks

A controlled evaluation of LLM-powered retrieval-augmented generation (RAG) on a scanned 10-K filing compared traditional and structure-preserving parsing:

Parser                     RAG QA Accuracy   Manual Verification   Notes
Traditional OCR            Lower             High error rate       Missed tables/sections
Reducto (structure-pres.)  Higher            Low error rate        Maintained layout/citations

Result: Structure-preserving parsing improved both retrieval relevance and answer correctness, as measured by automated and expert graders (see benchmark details).

RD-TableBench: Table Parsing Accuracy

RD-TableBench is an open benchmark for table extraction across complex real-world documents (RD-TableBench).

Model/API                                      Average Table Similarity (0-1)   Layout Preservation
Reducto                                        ~0.90 (state-of-the-art)         High; merged cells and headers kept
Major cloud document APIs (AWS, Google, etc.)  Lower scores on RD-TableBench    Partial to frequently lost

Reducto's structure-preserving output delivered materially higher alignment with ground truth -- including merged cells, nested headers, and cell-level fidelity, all critical for correct LLM retrieval (see results).


Hallucination Rate Improvements: Before and After

Before (Non-Structure-Preserving Parsing):

  • Text-only output misses context (e.g., misclassifies chart-embedded figures)

  • Retrieval stage fails to locate correct information

  • LLM generates plausible but incorrect, uncitable answers

After (With Structure-Preserving Parsing):

  • LLM answers cite source chunks with traceable bounding boxes

  • Accurate chunking feeds relevant context to LLM prompts

  • Generation sticks to facts grounded in original document structure


Enterprise Results: Measurable Gains

Real-world adoption confirms these performance gains:

  • Anterior (Healthcare): Prior authorization pipeline achieved >99% precision, with less than 0.1% of reviews flawed due to document ingestion (Anterior case study).

  • Elysian (Insurance): 16x faster audit process vs. legacy workflows by ensuring every extracted insight maps to original evidence (Elysian case study).

  • Benchmark (Finance): Automated report generation, each statement linked directly to a precise source -- eliminating hallucinatory claims (Benchmark case study).


Practitioner Checklist: Reduce Hallucinations with Structure-Preserving Extraction

Use these principles and practices to keep citations precise and generation grounded.

Ensure citable chunks with bounding boxes

Parse responses should include bounding boxes on layout blocks so you can attach page and location metadata to every chunk or answer span. For schema-driven extraction, enable field-level citations so each extracted field carries provenance information (page, location, and confidence) alongside its value.
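A simple guard along these lines can reject parse output that lacks provenance before it ever enters an index. A sketch, assuming a hypothetical dict-per-chunk shape:

```python
def ensure_citable(chunks):
    # Fail fast if any chunk lacks the provenance needed for citations.
    # Assumes each chunk is a dict with "page" and "bbox" keys (illustrative).
    missing = [i for i, c in enumerate(chunks)
               if c.get("page") is None or not c.get("bbox")]
    if missing:
        raise ValueError(f"chunks missing provenance at indexes: {missing}")
    return chunks
```

Running this at ingestion time keeps uncitable content out of retrieval entirely, rather than discovering the gap at answer time.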

Preserve true reading order

Avoid flattening multi-column PDFs and complex layouts with plain text-only OCR. Use layout-aware chunking so chunks follow document structure instead of arbitrary line order. This ensures that context flows naturally and that retrieved passages reflect the document's intended reading sequence.
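The idea can be sketched with a toy two-column sorter. Real layout analysis is far more sophisticated; the midpoint column split below is a deliberate simplification for illustration:

```python
def reading_order(blocks, page_width=612.0):
    # Order blocks for a simple two-column page: left column top-to-bottom,
    # then right column. A naive top-to-bottom sort would interleave columns.
    mid = page_width / 2
    return sorted(blocks, key=lambda b: (0 if b["bbox"][0] < mid else 1,
                                         b["bbox"][1]))

page = [
    {"text": "right-top",  "bbox": (320, 50, 560, 70)},
    {"text": "left-top",   "bbox": (40, 50, 280, 70)},
    {"text": "left-lower", "bbox": (40, 90, 280, 110)},
]
ordered = [b["text"] for b in reading_order(page)]
```

A plain y-sort would put "right-top" between the two left-column blocks, scrambling the passage a retriever later returns.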

Maintain table fidelity for numeric truth

Choose table-preserving output formats that keep merged cells, header rows, and hierarchies intact rather than flattening them to plain text. Higher table similarity on benchmarks like RD-TableBench correlates with better QA grounding (RD-TableBench).
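One common tactic is serializing tables to a structured text form (Markdown here) so the header row and cell alignment survive into LLM context. A minimal sketch:

```python
def table_to_markdown(header, rows):
    # Serialize a table so header/cell correspondence survives in LLM context,
    # rather than flattening cells into an undifferentiated text blob.
    md = ["| " + " | ".join(header) + " |",
          "|" + "|".join(" --- " for _ in header) + "|"]
    md += ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join(md)

table = table_to_markdown(["Year", "Revenue"], [[2022, "$3.6B"], [2023, "$4.2B"]])
```

With the header preserved, the model can tell which number belongs to which year instead of guessing.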

Chunk with structure awareness

Use layout-aware variable chunking and store block type metadata (Table, Figure, Text, etc.) with each chunk so retrieval can filter or re-rank based on structure, not just text similarity.
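With block-type metadata stored per chunk, retrieval can re-rank hits structurally. A sketch (the field names and the flat boost heuristic are assumptions, not a recommended production scoring scheme):

```python
def rerank_hits(hits, prefer_tables=False, boost=0.1):
    # Re-rank retrieval hits using the block-type metadata stored with each
    # chunk, e.g. boosting tables when a question looks numeric.
    def score(h):
        s = h["similarity"]
        if prefer_tables and h["block_type"] == "Table":
            s += boost
        return s
    return sorted(hits, key=score, reverse=True)

hits = [
    {"text": "narrative", "block_type": "Text",  "similarity": 0.82},
    {"text": "fy table",  "block_type": "Table", "similarity": 0.78},
]
top = rerank_hits(hits, prefer_tables=True)[0]["block_type"]
```

Here the table chunk wins despite a lower raw similarity, because structure metadata tells us it is the likelier home of a numeric answer.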

Handle figures and graphs

Use figure summarization where helpful, but always retain bounding boxes so reviewers (or UIs) can jump back to the exact chart or figure region that supports an answer.
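A figure summary should travel with its coordinates. A minimal illustrative record (field names assumed, not a real schema):

```python
def figure_chunk(summary, page, bbox):
    # Pair a generated figure summary with the region it describes, so a UI
    # can highlight the exact chart behind an answer.
    return {"block_type": "Figure", "text": summary, "page": page, "bbox": bbox}

fig = figure_chunk("Bar chart: revenue by segment, FY2023", 7, (50, 200, 550, 480))
```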

Extract forms and checkboxes faithfully

Extract fields and explicit checkbox/radio states rather than inferring values heuristically. Preserve coordinates so each field's provenance can be audited.
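A small sketch of the principle: map only explicit states, and let anything ambiguous stay unknown (the `state` key and its values are hypothetical):

```python
def checkbox_value(field):
    # Return the explicit checkbox state as a boolean; anything unrecognized
    # stays None so downstream logic never silently guesses a value.
    explicit = {"checked": True, "unchecked": False}
    return explicit.get(field.get("state"))

consent = checkbox_value({"state": "checked", "bbox": (10, 10, 24, 24)})
```

Returning None for unknown states forces an explicit review step instead of a heuristic guess that could propagate into an answer.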

Evaluate with citations enabled

In your evaluation harness, assert that each answer span is backed by at least one source span with page and bounding box metadata, and penalize uncited claims (Document API).
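Such a check can be expressed directly in the harness. A sketch, with span and source shapes assumed:

```python
def uncited_claims(answer_spans):
    # Return answer spans not backed by at least one source span carrying
    # page and bounding-box metadata; an eval harness can penalize these.
    def backed(span):
        return any(s.get("page") is not None and s.get("bbox")
                   for s in span.get("sources", []))
    return [s["text"] for s in answer_spans if not backed(s)]

spans = [
    {"text": "Revenue was $4.2B.",
     "sources": [{"page": 3, "bbox": (70, 100, 300, 120)}]},
    {"text": "Margins improved.", "sources": []},
]
flagged = uncited_claims(spans)  # only the unsupported claim is flagged
```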

Domain callouts where this is critical

  • Healthcare prior auth: Sentence-level bounding boxes and reliable reading order enabled >99% accuracy with <0.1% ingestion-attributable flaws at production scale (Anterior case study).

  • Insurance claims: Table/checkbox fidelity and citations drove up to 16x faster audits under strict evidence requirements (Elysian case study).


Summary Table: Structure-Preserving vs. Traditional Parsing

Dimension                   Traditional OCR   Structure-Preserving (Reducto)
Layout fidelity             Low               High
Chunking for LLMs           Basic             Semantic, layout-aware
Table/figure integrity      Poor              Accurate, cell-level
Citation support            Not available     Built-in bounding boxes
RAG/QA retrieval accuracy   Inconsistent      High/measurable gains
Hallucination risk          High              Reduced

Conclusion

Structure-preserving parsing is the foundation for reliable, factual LLM systems. By maintaining document fidelity, providing citable extraction, and delivering LLM-optimized chunking, organizations can decrease hallucination rates, boost retrieval accuracy, and unlock powerful AI-driven workflows at scale.


References: Reducto Document API, RD-TableBench, Benchmark Case Study, Anterior Case Study, Elysian Case Study