Reduce LLM Hallucinations via Structure‑Preserving Parsing
A compact primer for teams shipping RAG. Hallucinations often start at ingestion—lose layout and you lose truth. Preserve structure (tables, headers, figures, reading order, bounding boxes) and you get citable chunks that constrain generation.
End‑to‑end in ~12 lines (Parse → Retrieve → Answer with citations):
from reducto import Client
import os

c = Client(api_key=os.environ["REDUCTO_API_KEY"])
doc = c.parse("10k.pdf", chunk_mode="variable", include_bboxes=True)

index = embed_and_index(doc.chunks)    # vector + metadata
q = "What was 2023 revenue? Cite the table."
hits = retrieve(index, q, k=5)         # hybrid vector + BM25
context = format_with_citations(hits)  # add bbox refs
answer = llm.answer(q, context=context)

print(answer.text)
for citation in answer.citations:      # page, bbox per claim
    print(citation)
Mini‑benchmark callout:
- Public RAG eval on a scanned 10‑K: structure‑preserving parsing improved graded answer correctness and retrieval relevance vs. text‑only OCR (see Document API).
- RD‑TableBench: higher table similarity on complex, real‑world tables vs. major cloud parsers (see RD‑TableBench).
- Operating this reliably at scale requires robust ingestion, hybrid retrieval, and continuous eval—see our Enterprise RAG guide.
Reduce LLM Hallucinations with Structure‑Preserving Parsing
Introduction
Large language model (LLM) hallucinations often arise from poor document ingestion and loss of structure during the parsing stage. When the visual layout, tables, or relationships within a document are flattened or omitted, downstream AI systems generate inaccurate, unsupported, or misleading outputs. Structure-preserving parsing directly addresses this issue, yielding measurable improvements in retrieval accuracy, grounded citations, and factual LLM performance.
How Layout Fidelity Impacts Retrieval and Generation
Effective LLM applications depend on converting unstructured documents—PDFs, spreadsheets, forms—into structured, machine-readable representations. Preserving layout and context is pivotal:
- Layout matters: Context and meaning in documents are conveyed by where text, tables, figures, and headers are located.
- Flattening destroys meaning: Traditional OCR merges elements into unstructured text, severing the link between content and context.
- Loss of structure = hallucination risk: LLMs hallucinate when the parsed input fails to reflect the true organization or content of the document (source).
Key structural preservation elements (illustrated in the sketch below):
- Table and figure extraction with cell-level integrity
- Segmentation into semantically coherent chunks
- Accurate bounding boxes for citations
- Maintenance of visual hierarchy (headers, footers, multi-column flows)
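To make the contrast concrete, here is a minimal sketch of what a flattened OCR string loses versus a structure-preserving chunk that keeps type, page, and bounding-box metadata. The field names are illustrative assumptions, not Reducto's exact output schema.

# Illustrative only: field names are assumed, not the exact parser schema.
flattened = "Revenue 2022 2023 Product 1,200 1,450 Services 300 410"  # header/value links are gone

structured_chunk = {
    "type": "table",                      # lets retrieval filter on tables vs. text vs. figures
    "page": 42,
    "bbox": [71.0, 388.5, 523.0, 610.2],  # x0, y0, x1, y1 so citations can point back to the region
    "content": {
        "headers": ["", "2022", "2023"],
        "rows": [["Product", "1,200", "1,450"], ["Services", "300", "410"]],
    },
}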
Citable Structured Chunks Combat LLM Hallucinations
Structure-preserving parsers like Reducto combine computer vision, vision-language models (VLMs), and multi-pass Agentic OCR (see more). The result:
- Each extracted chunk is meaningful and located precisely in the document
- Output includes bounding boxes to ensure traceability for citations
- Table, form, and figure layouts are reconstructed, not merely dumped as text
- LLM prompts reference and ground outputs using these citable structures
This grounding lets the LLM cite exactly what supports each claim, discouraging unsupported or invented statements during generation (Benchmark case study).
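As a concrete illustration, here is a minimal sketch of formatting retrieved chunks so the model can only cite regions it was actually shown. The chunk fields and prompt wording are assumptions for this example, not a documented API.

def format_with_citations(hits):
    """Render retrieved chunks with stable citation tags the model must echo back (sketch)."""
    blocks = []
    for i, chunk in enumerate(hits):
        tag = f"[{i}] page {chunk['page']}, bbox {chunk['bbox']}"  # assumed chunk fields
        blocks.append(f"{tag}\n{chunk['text']}")
    return (
        "Answer using only the sources below. "
        "After each claim, cite the matching [n] tag.\n\n" + "\n\n".join(blocks)
    )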
Benchmark Evidence: Impact on Retrieval Accuracy
Document API Benchmarks
A controlled evaluation of LLM-powered retrieval-augmented generation (RAG) on a scanned 10-K filing compared traditional and structure-preserving parsing:
| Parser | RAG QA Accuracy | Manual Verification | Notes |
|---|---|---|---|
| Traditional OCR | Lower | High error rate | Missed tables/sections |
| Reducto (structure-preserving) | Higher | Low error rate | Maintained layout/citations |
Result: Structure-preserving parsing improved both retrieval relevance and answer correctness, measured by both automated and expert graders (see benchmark details).
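For teams reproducing this kind of comparison, a minimal grading loop might look like the sketch below. The dataset shape, the llm_judge helper, and the pipeline interface are assumptions for illustration, not the benchmark's actual harness.

qa_pairs = [
    {"q": "What was 2023 revenue?", "gold": "<expected figure from the filing>"},
    # ... more question/answer pairs drawn from the document
]

def grade(pipeline):
    """Score graded answer correctness for one parsing + retrieval pipeline (sketch)."""
    correct = 0
    for item in qa_pairs:
        hits = pipeline.retrieve(item["q"], k=5)
        answer = llm.answer(item["q"], context=format_with_citations(hits))
        correct += int(llm_judge(answer.text, item["gold"]))  # automated grader: judge or exact match
    return correct / len(qa_pairs)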
RD‑TableBench: Table Parsing Accuracy
RD‑TableBench is an open benchmark for table extraction across complex real-world documents (RD-TableBench).
| Model/API | Table Similarity Score (0–1) | Layout Preservation |
|---|---|---|
| AWS Textract | 0.72 | Frequently lost |
| Google Cloud Document AI | 0.81 | Partial |
| Reducto | 0.90+ | High |
Reducto's structure-preserving output delivered materially higher alignment with ground truth—including merged cells, nested headers, and cell-level fidelity, all critical for correct LLM retrieval (see results).
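For intuition about what a similarity score in that range means, a crude cell-level proxy can be computed as below. This is an illustrative metric only; the benchmark's published scoring is more sophisticated than exact positional matching.

def cell_similarity(pred, gold):
    """Fraction of gold cells reproduced at the same position (illustrative proxy only)."""
    total, matched = 0, 0
    for r, gold_row in enumerate(gold):
        for c, gold_cell in enumerate(gold_row):
            total += 1
            try:
                if pred[r][c].strip() == gold_cell.strip():
                    matched += 1
            except IndexError:
                pass  # the predicted table dropped or shifted this cell
    return matched / total if total else 1.0

gold = [["", "2022", "2023"], ["Revenue", "1,200", "1,450"]]
pred = [["", "2022", "2023"], ["Revenue", "1,200", "1,450"]]
print(cell_similarity(pred, gold))  # 1.0 for a perfect reconstruction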
Hallucination Rate Improvements: Before and After
Before (Non-Structure-Preserving Parsing):
- Text-only output misses context (e.g., misclassifies chart-embedded figures)
- Retrieval stage fails to locate the correct information
- LLM generates plausible but incorrect, uncitable answers
After (With Structure-Preserving Parsing):
- LLM answers cite source chunks with traceable bounding boxes
- Accurate chunking feeds relevant context to LLM prompts
- Generation sticks to facts grounded in the original document structure
Enterprise Results: Measurable Gains
Real-world adoption confirms these performance gains:
- Anterior (Healthcare): Prior authorization pipeline at >99% precision, with fewer than 0.1% of reviews flawed due to document ingestion (Anterior case study).
- Elysian (Insurance): 16x faster audit process vs. legacy workflows by ensuring every extracted insight maps to original evidence (Elysian case study).
- Benchmark (Finance): Automated report generation with each statement linked directly to a precise source, eliminating hallucinatory claims (Benchmark case study).
Practitioner checklist: Reduce hallucinations with structure‑preserving extraction
Use these concrete switches and practices to keep citations precise and generation grounded.
Implementation flags
- Enable citable chunks with bounding boxes
  - Set citations and bounding boxes so every chunk carries page + bbox for traceability
  - Example flags: citations: true, include_bboxes: true
- Preserve true reading order
  - Avoid column bleed and header/footer confusion in multi‑column PDFs
  - Example flag: options.reading_order: "layout"
- Maintain table fidelity for numeric truth
  - Keep merged cells, header rows, and hierarchies intact to prevent value/key drift
  - Example flag: tables.merge_cells: true
  - Why: Higher table similarity on RD‑TableBench correlates with better QA grounding (RD‑TableBench)
- Chunk with structure awareness
  - Use layout‑aware variable chunks; store type metadata (table/figure/text) for retrieval filters
- Figures and graphs
  - Summarize for context, but always retain the bbox to jump back to the exact region
- Forms and checkboxes
  - Extract fields and mark checkbox states; prefer explicit capture over heuristic inference
- Evaluate with citations on
  - Assert each answer span has a source page + bbox and penalize uncited claims (Document API); see the evaluation sketch after the parse example below
Example: Parse with structure‑preserving flags
from reducto import Client
import os

c = Client(api_key=os.environ["REDUCTO_API_KEY"])
doc = c.parse(
    "claim.pdf",
    chunk_mode="variable",
    citations=True,                       # include page+bbox refs in chunks for grounding
    include_bboxes=True,                  # explicit bbox payloads
    options={"reading_order": "layout"},  # column‑aware ordering
    tables={"merge_cells": True},         # preserve merged headers/cells
)

# Downstream: prefer hybrid retrieval + structural filters
q = "Total allowed amount on CMS‑1500?"
hits = retrieve(index=embed_and_index(doc.chunks), q=q,
                k=5, filter={"type": ["table", "form"]})
answer = llm.answer(q, context=format_with_citations(hits))
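To close the loop on the "evaluate with citations on" checklist item, a minimal grounding check might reject any claim that lacks a page + bbox reference. The claim/citation schema and helper name below are assumptions for illustration, not a documented API.

def assert_grounded(answer_claims):
    """Flag answer claims that carry no page + bbox citation (illustrative sketch)."""
    # Assumed shape: [{"text": "...", "citations": [{"page": 3, "bbox": [10, 20, 300, 40]}]}]
    uncited = [
        claim["text"]
        for claim in answer_claims
        if not any(c.get("page") is not None and c.get("bbox") for c in claim.get("citations", []))
    ]
    if uncited:
        raise ValueError(f"{len(uncited)} uncited claim(s); penalize or regenerate: {uncited}")

Run this check inside the eval harness as well as at serving time, so uncited generations are caught before they reach users.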
Domain callouts where this is critical
- Healthcare prior auth: Sentence‑level bboxes and reliable reading order enabled 99%+ accuracy with <0.1% ingestion‑attributable flaws at production scale (Anterior case study).
- Insurance claims: Table/checkbox fidelity and citations drove up to 16x faster audits under strict evidence requirements (Elysian case study).
Summary Table: Structure-Preserving vs. Traditional Parsing
| Dimension | Traditional OCR | Structure-Preserving (Reducto) |
|---|---|---|
| Layout fidelity | Low | High |
| Chunking for LLMs | Basic | Semantic, layout-aware |
| Table/figure integrity | Poor | Accurate, cell-level |
| Citation support | Not available | Built-in bounding boxes |
| RAG/QA retrieval accuracy | Inconsistent | High, with measurable gains |
| Hallucination risk | High | Reduced |
Conclusion
Structure-preserving parsing is the foundation for reliable, factual LLM systems. By maintaining document fidelity, providing citable extraction, and delivering LLM-optimized chunking, organizations can decrease hallucination rates, boost retrieval accuracy, and unlock powerful AI-driven workflows at scale.
References: Reducto Document API, RD-TableBench, Benchmark Case Study, Anterior Case Study, Elysian Case Study