Reduce LLM Hallucinations via Structure-Preserving Parsing
A compact primer for teams shipping RAG. Hallucinations often start at ingestion -- lose layout and you lose truth. Preserve structure (tables, headers, figures, reading order, bounding boxes) and you get citable chunks that constrain generation.
Mini-benchmark callout:
- Public RAG eval on a scanned 10-K: structure-preserving parsing improved graded answer correctness and retrieval relevance vs. text-only OCR (see Document API).
- RD-TableBench: higher table similarity on complex, real-world tables vs. major cloud parsers (see RD-TableBench).
- Operating this reliably at scale requires robust ingestion, hybrid retrieval, and continuous eval -- see our Enterprise RAG guide.
Introduction
Large language model (LLM) hallucinations often arise from poor document ingestion and loss of structure during the parsing stage. When the visual layout, tables, or relationships within a document are flattened or omitted, downstream AI systems generate inaccurate, unsupported, or misleading outputs. Structure-preserving parsing directly addresses this issue, yielding measurable improvements in retrieval accuracy, grounded citations, and factual LLM performance.
How Layout Fidelity Impacts Retrieval and Generation
Effective LLM applications depend on converting unstructured documents -- PDFs, spreadsheets, forms -- into structured, machine-readable representations. Preserving layout and context is pivotal:
- Layout matters: Context and meaning in documents are conveyed by where text, tables, figures, and headers are located.
- Flattening destroys meaning: Traditional OCR merges elements into unstructured text, severing the link between content and context.
- Loss of structure = hallucination risk: LLMs hallucinate when the parsed input fails to reflect the true organization or content of the document (source).
Key structural preservation elements:
- Table and figure extraction with cell-level integrity
- Segmentation into semantically coherent chunks
- Accurate bounding boxes for citations
- Maintenance of visual hierarchy (headers, footers, multi-column flows)
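The elements above can be captured in a small chunk schema. A minimal sketch (field and class names here are illustrative, not Reducto's actual output format):

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

@dataclass
class Chunk:
    text: str
    block_type: str    # "Text", "Table", "Figure", "Header", ...
    bbox: BoundingBox  # location on the page, for citations
    reading_order: int # position in the document's true flow

# A parsed document is an ordered list of such chunks.
doc = [
    Chunk("Revenue by segment", "Header", BoundingBox(12, 50, 40, 560, 60), 0),
    Chunk("| Segment | FY23 |\n| Cloud | $4.1B |", "Table",
          BoundingBox(12, 50, 70, 560, 240), 1),
]

# Structure metadata lets downstream stages select by block type.
tables = [c for c in doc if c.block_type == "Table"]
```

Because every chunk carries its page and coordinates, any answer built from it can be traced back to a precise region of the source.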
Citable Structured Chunks Combat LLM Hallucinations
Structure-preserving parsers like Reducto combine computer vision, vision-language models (VLMs), and multi-pass Agentic OCR (see more). The result:
- Each extracted chunk is meaningful and located precisely in the document
- Output includes bounding boxes to ensure traceability for citations
- Table, form, and figure layouts are reconstructed -- not just text extraction
- LLM prompts reference and ground outputs using these citable structures
This grounding enables LLMs to cite exactly what supports their claims, discouraging unsupported or invented statements in generation (Benchmark case study).
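One common way to wire this grounding into generation is to number each retrieved chunk and require the model to cite by ID. A hedged sketch (the prompt wording and tuple shape are assumptions, not a prescribed format):

```python
def build_grounded_prompt(question, chunks):
    """Build a prompt where every claim must cite a chunk ID.
    chunks: list of (chunk_id, page, text) tuples from the parser."""
    context = "\n".join(
        f"[{cid}] (page {page}) {text}" for cid, page, text in chunks
    )
    return (
        "Answer using ONLY the sources below. Cite the [id] after every "
        "claim; if no source supports an answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "What was FY23 cloud revenue?",
    [("c1", 12, "Cloud revenue was $4.1B in FY23.")],
)
```

Because each ID maps back to a bounding box, a cited answer can be verified against the exact page region that supports it.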
Benchmark Evidence: Impact on Retrieval Accuracy
Document API Benchmarks
A controlled evaluation of LLM-powered retrieval-augmented generation (RAG) on a scanned 10-K filing compared traditional and structure-preserving parsing:
| Parser | RAG QA Accuracy | Manual Verification | Notes |
|---|---|---|---|
| Traditional OCR | Lower | High error rate | Missed tables/sections |
| Reducto (structure-pres) | Higher | Low error rate | Maintained layout/citations |
Result: Structure-preserving parsing improved both retrieval relevance and answer correctness, measured by both automated and expert graders (see benchmark details).
RD-TableBench: Table Parsing Accuracy
RD-TableBench is an open benchmark for table extraction across complex real-world documents (RD-TableBench).
| Model/API | Average Table Similarity (0-1) | Layout Preservation |
|---|---|---|
| Reducto | ~0.90 (state-of-the-art) | High; merged cells and headers kept |
| Major cloud document APIs (AWS, Google, etc.) | Lower scores on RD-TableBench | Partial; structure frequently lost |
Reducto's structure-preserving output delivered materially higher alignment with ground truth -- including merged cells, nested headers, and cell-level fidelity, all critical for correct LLM retrieval (see results).
Hallucination Rate Improvements: Before and After
Before (Non-Structure-Preserving Parsing):
- Text-only output misses context (e.g., misclassifies chart-embedded figures)
- Retrieval stage fails to locate correct information
- LLM generates plausible but incorrect, uncitable answers
After (With Structure-Preserving Parsing):
-
LLM answers cite source chunks with traceable bounding boxes
-
Accurate chunking feeds relevant context to LLM prompts
-
Generation sticks to facts grounded in original document structure
Enterprise Results: Measurable Gains
Real-world adoption confirms these performance gains:
- Anterior (Healthcare): Prior authorization pipeline achieved >99% precision, with less than 0.1% of reviews flawed due to document ingestion (Anterior case study).
- Elysian (Insurance): 16x faster audit process vs. legacy workflows by ensuring every extracted insight maps to original evidence (Elysian case study).
- Benchmark (Finance): Automated report generation, with each statement linked directly to a precise source -- eliminating hallucinatory claims (Benchmark case study).
Practitioner Checklist: Reduce Hallucinations with Structure-Preserving Extraction
Use these principles and practices to keep citations precise and generation grounded.
Ensure citable chunks with bounding boxes
Parse responses should include bounding boxes on layout blocks so you can attach page and location metadata to every chunk or answer span. For schema-driven extraction, enable field-level citations so each extracted field carries provenance information (page, location, and confidence) alongside its value.
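Field-level provenance can be carried alongside each extracted value. A minimal sketch, assuming a generic list of parsed blocks (the dict keys and helper name are hypothetical, not a specific API):

```python
def extract_field(parsed_blocks, field_name, matcher):
    """Return a field value plus its provenance (page, bbox, confidence).
    parsed_blocks: dicts like {"text", "page", "bbox", "confidence"}.
    matcher: predicate selecting the block that carries the field."""
    for block in parsed_blocks:
        if matcher(block["text"]):
            return {
                "field": field_name,
                "value": block["text"],
                "page": block["page"],
                "bbox": block["bbox"],
                "confidence": block["confidence"],
            }
    return None  # field not found; caller decides how to handle

blocks = [
    {"text": "Policy No: 88-314", "page": 3,
     "bbox": [40, 100, 220, 118], "confidence": 0.97},
]
cited = extract_field(blocks, "policy_number",
                      lambda t: t.startswith("Policy No"))
```

Every extracted field now arrives with enough metadata to render a clickable citation back to the page.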
Preserve true reading order
Avoid flattening multi-column PDFs and complex layouts with plain text-only OCR. Use layout-aware chunking so chunks follow document structure instead of arbitrary line order. This ensures that context flows naturally and that retrieved passages reflect the document's intended reading sequence.
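For intuition, reading order in a multi-column page can be recovered by sorting blocks column-first rather than strictly top-to-bottom. A toy sketch assuming fixed-width columns (real layout analysis is considerably more involved):

```python
def reading_order(blocks, page_width, n_columns=2):
    """Order blocks column-by-column instead of raw top-to-bottom,
    so a two-column PDF reads the left column first, then the right.
    blocks: list of (x0, y0, text), with y0 increasing down the page."""
    col_width = page_width / n_columns
    return [
        text
        for _, _, text in sorted(
            blocks, key=lambda b: (int(b[0] // col_width), b[1])
        )
    ]

blocks = [
    (320, 50, "right-top"), (30, 50, "left-top"),
    (30, 400, "left-bottom"), (320, 400, "right-bottom"),
]
order = reading_order(blocks, page_width=600)
```

A plain y-sort would interleave the two columns ("left-top", "right-top", ...), which is exactly the flattening that scrambles retrieved context.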
Maintain table fidelity for numeric truth
Choose table-preserving output formats that keep merged cells, header rows, and hierarchies intact rather than flattening them to plain text. Higher table similarity on benchmarks like RD-TableBench correlates with better QA grounding (RD-TableBench).
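To see why flattening hurts, consider a naive cell-level similarity: the fraction of ground-truth cells reproduced at the same (row, column) position. Benchmarks like RD-TableBench use more sophisticated scoring; this is only an intuition-builder:

```python
def cell_similarity(pred, gold):
    """Fraction of gold cells matched at the same (row, col) position.
    pred, gold: tables as lists of rows, each row a list of cell strings."""
    total = sum(len(row) for row in gold)
    matched = sum(
        1
        for r, row in enumerate(gold)
        for c, cell in enumerate(row)
        if r < len(pred) and c < len(pred[r]) and pred[r][c] == cell
    )
    return matched / total if total else 1.0

gold = [["Segment", "FY23"], ["Cloud", "$4.1B"]]
preserved = [["Segment", "FY23"], ["Cloud", "$4.1B"]]  # structure kept
flattened = [["Segment FY23 Cloud $4.1B"]]             # structure lost
```

Even when every character survives OCR, collapsing the grid to one string drives positional similarity to zero, and with it the model's ability to answer "which segment earned $4.1B?".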
Chunk with structure awareness
Use layout-aware variable chunking and store block type metadata (Table, Figure, Text, etc.) with each chunk so retrieval can filter or re-rank based on structure, not just text similarity.
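Stored block types let retrieval re-rank by structure. A toy heuristic (the keyword list and boost value are illustrative assumptions, not tuned numbers):

```python
def rerank_by_block_type(hits, question):
    """Boost table chunks for numeric-sounding questions.
    hits: list of (score, block_type, text) from vector search."""
    numeric_q = any(
        w in question.lower() for w in ("how much", "total", "revenue", "%")
    )

    def adjusted(hit):
        score, block_type, _ = hit
        # Small additive boost when a numeric question meets a Table chunk.
        return score + (0.2 if numeric_q and block_type == "Table" else 0.0)

    return sorted(hits, key=adjusted, reverse=True)

hits = [
    (0.70, "Text", "Revenue grew."),
    (0.65, "Table", "| FY23 | $4.1B |"),
]
ranked = rerank_by_block_type(hits, "How much revenue in FY23?")
```

Without block-type metadata this re-ranking is impossible; text similarity alone often prefers prose that paraphrases a number over the table that actually states it.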
Handle figures and graphs
Use figure summarization where helpful, but always retain bounding boxes so reviewers (or UIs) can jump back to the exact chart or figure region that supports an answer.
Extract forms and checkboxes faithfully
Extract fields and explicit checkbox/radio states rather than inferring values heuristically. Preserve coordinates so each field's provenance can be audited.
Evaluate with citations enabled
In your evaluation harness, assert that each answer span is backed by at least one source span with page and bounding box metadata, and penalize uncited claims (Document API).
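A citation audit of this kind can be a few lines in the eval harness. A minimal sketch, assuming answers are dicts with a `citations` list (the schema is hypothetical):

```python
def audit_citations(answers):
    """Return claims that lack grounded provenance.
    A citation counts as grounded only if it has 'page' and 'bbox' keys."""
    uncited = []
    for a in answers:
        grounded = [
            c for c in a.get("citations", [])
            if "page" in c and "bbox" in c
        ]
        if not grounded:
            uncited.append(a["claim"])
    return uncited

answers = [
    {"claim": "Cloud revenue was $4.1B.",
     "citations": [{"page": 12, "bbox": [50, 70, 560, 240]}]},
    {"claim": "Margins doubled.", "citations": []},
]
flagged = audit_citations(answers)
```

Flagged claims can then be scored as failures (or sent for human review), turning "penalize uncited claims" into an automated gate rather than a manual spot check.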
Domain callouts where this is critical
- Healthcare prior auth: Sentence-level bounding boxes and reliable reading order enabled >99% accuracy with <0.1% ingestion-attributable flaws at production scale (Anterior case study).
- Insurance claims: Table/checkbox fidelity and citations drove up to 16x faster audits under strict evidence requirements (Elysian case study).
Summary Table: Structure-Preserving vs. Traditional Parsing
| Dimension | Traditional OCR | Structure-Preserving (Reducto) |
|---|---|---|
| Layout fidelity | Low | High |
| Chunking for LLMs | Basic | Semantic, layout-aware |
| Table/figure integrity | Poor | Accurate, cell-level |
| Citation support | Not available | Built-in bounding boxes |
| RAG/QA retrieval accuracy | Inconsistent | High/measurable gains |
| Hallucination risk | High | Reduced |
Conclusion
Structure-preserving parsing is the foundation for reliable, factual LLM systems. By maintaining document fidelity, providing citable extraction, and delivering LLM-optimized chunking, organizations can decrease hallucination rates, boost retrieval accuracy, and unlock powerful AI-driven workflows at scale.
References: Reducto Document API, RD-TableBench, Benchmark Case Study, Anterior Case Study, Elysian Case Study