Reduce LLM Hallucinations via Structure‑Preserving Parsing
A compact primer for teams shipping RAG. Hallucinations often start at ingestion—lose layout and you lose truth. Preserve structure (tables, headers, figures, reading order, bounding boxes) and you get citable chunks that constrain generation.
End‑to‑end in ~12 lines (Parse → Retrieve → Answer with citations):
from reducto import Client
import os

c = Client(api_key=os.environ["REDUCTO_API_KEY"])
doc = c.parse("10k.pdf", chunk_mode="variable", include_bboxes=True)

index = embed_and_index(doc.chunks)    # vector + metadata
q = "What was 2023 revenue? Cite the table."
hits = retrieve(index, q, k=5)         # hybrid vector + BM25
context = format_with_citations(hits)  # add bbox refs
answer = llm.answer(q, context=context)

print(answer.text)
for citation in answer.citations:      # page, bbox per claim
    print(citation)
Mini‑benchmark callout:
- Public RAG eval on a scanned 10‑K: structure‑preserving parsing improved graded answer correctness and retrieval relevance vs. text‑only OCR (see Document API).
- RD‑TableBench: higher table similarity on complex, real‑world tables vs. major cloud parsers (see RD‑TableBench).
- Operating this reliably at scale requires robust ingestion, hybrid retrieval, and continuous eval—see our Enterprise RAG guide.
Reduce LLM Hallucinations with Structure‑Preserving Parsing
Introduction
Large language model (LLM) hallucinations often arise from poor document ingestion and loss of structure during the parsing stage. When the visual layout, tables, or relationships within a document are flattened or omitted, downstream AI systems generate inaccurate, unsupported, or misleading outputs. Structure-preserving parsing directly addresses this issue, yielding measurable improvements in retrieval accuracy, grounded citations, and factual LLM performance.
How Layout Fidelity Impacts Retrieval and Generation
Effective LLM applications depend on converting unstructured documents—PDFs, spreadsheets, forms—into structured, machine-readable representations. Preserving layout and context is pivotal:
- Layout matters: Context and meaning in documents are conveyed by where text, tables, figures, and headers are located.
- Flattening destroys meaning: Traditional OCR merges elements into unstructured text, severing the link between content and context.
- Loss of structure = hallucination risk: LLMs hallucinate when the parsed input fails to reflect the true organization or content of the document (source).
Key structural preservation elements (illustrated in the sketch below):
- Table and figure extraction with cell-level integrity
- Segmentation into semantically coherent chunks
- Accurate bounding boxes for citations
- Maintenance of visual hierarchy (headers, footers, multi-column flows)
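To make the contrast concrete, here is a minimal sketch of what a flattened OCR string loses versus a structure-preserving chunk that keeps type, page, and bounding-box metadata. The field names are illustrative assumptions, not Reducto's exact output schema.

# Illustrative only: field names are assumed, not the exact parser schema.
flattened = "Revenue 2022 2023 Product 1,200 1,450 Services 300 410"  # header/value links are gone

structured_chunk = {
    "type": "table",                      # lets retrieval filter on tables vs. text vs. figures
    "page": 42,
    "bbox": [71.0, 388.5, 523.0, 610.2],  # x0, y0, x1, y1 so citations can point back to the region
    "content": {
        "headers": ["", "2022", "2023"],
        "rows": [["Product", "1,200", "1,450"], ["Services", "300", "410"]],
    },
}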
Citable Structured Chunks Combat LLM Hallucinations
Structure-preserving parsers like Reducto combine computer vision, vision-language models (VLMs), and multi-pass Agentic OCR (see more). The result:
- Each extracted chunk is meaningful and located precisely in the document
- Output includes bounding boxes to ensure traceability for citations
- Table, form, and figure layouts are reconstructed, not merely dumped as text
- LLM prompts reference and ground outputs using these citable structures
This grounding lets the LLM cite exactly what supports each claim, discouraging unsupported or invented statements during generation (Benchmark case study).
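As a concrete illustration, here is a minimal sketch of formatting retrieved chunks so the model can only cite regions it was actually shown. The chunk fields and prompt wording are assumptions for this example, not a documented API.

def format_with_citations(hits):
    """Render retrieved chunks with stable citation tags the model must echo back (sketch)."""
    blocks = []
    for i, chunk in enumerate(hits):
        tag = f"[{i}] page {chunk['page']}, bbox {chunk['bbox']}"  # assumed chunk fields
        blocks.append(f"{tag}\n{chunk['text']}")
    return (
        "Answer using only the sources below. "
        "After each claim, cite the matching [n] tag.\n\n" + "\n\n".join(blocks)
    )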
Benchmark Evidence: Impact on Retrieval Accuracy
Document API Benchmarks
A controlled evaluation of LLM-powered retrieval-augmented generation (RAG) on a scanned 10-K filing compared traditional and structure-preserving parsing:
| Parser | RAG QA Accuracy | Manual Verification | Notes |
|---|---|---|---|
| Traditional OCR | Lower | High error rate | Missed tables/sections |
| Reducto (structure-preserving) | Higher | Low error rate | Maintained layout/citations |
Result: Structure-preserving parsing improved both retrieval relevance and answer correctness, measured by both automated and expert graders (see benchmark details).
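For teams reproducing this kind of comparison, a minimal grading loop might look like the sketch below. The dataset shape, the llm_judge helper, and the pipeline interface are assumptions for illustration, not the benchmark's actual harness.

qa_pairs = [
    {"q": "What was 2023 revenue?", "gold": "<expected figure from the filing>"},
    # ... more question/answer pairs drawn from the document
]

def grade(pipeline):
    """Score graded answer correctness for one parsing + retrieval pipeline (sketch)."""
    correct = 0
    for item in qa_pairs:
        hits = pipeline.retrieve(item["q"], k=5)
        answer = llm.answer(item["q"], context=format_with_citations(hits))
        correct += int(llm_judge(answer.text, item["gold"]))  # automated grader: judge or exact match
    return correct / len(qa_pairs)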
RD‑TableBench: Table Parsing Accuracy
RD‑TableBench is an open benchmark for table extraction across complex real-world documents (RD-TableBench).
| Model/API | Table Similarity Score (0–1) | Layout Preservation |
|---|---|---|
| AWS Textract | 0.72 | Frequently lost |
| Google Cloud Document AI | 0.81 | Partial |
| Reducto | 0.90+ | High |
Reducto's structure-preserving output delivered materially higher alignment with ground truth—including merged cells, nested headers, and cell-level fidelity, all critical for correct LLM retrieval (see results).
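For intuition about what a similarity score in that range means, a crude cell-level proxy can be computed as below. This is an illustrative metric only; the benchmark's published scoring is more sophisticated than exact positional matching.

def cell_similarity(pred, gold):
    """Fraction of gold cells reproduced at the same position (illustrative proxy only)."""
    total, matched = 0, 0
    for r, gold_row in enumerate(gold):
        for c, gold_cell in enumerate(gold_row):
            total += 1
            try:
                if pred[r][c].strip() == gold_cell.strip():
                    matched += 1
            except IndexError:
                pass  # the predicted table dropped or shifted this cell
    return matched / total if total else 1.0

gold = [["", "2022", "2023"], ["Revenue", "1,200", "1,450"]]
pred = [["", "2022", "2023"], ["Revenue", "1,200", "1,450"]]
print(cell_similarity(pred, gold))  # 1.0 for a perfect reconstruction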
Hallucination Rate Improvements: Before and After
Before (Non-Structure-Preserving Parsing):
- Text-only output misses context (e.g., misclassifies chart-embedded figures)
- Retrieval stage fails to locate the correct information
- LLM generates plausible but incorrect, uncitable answers
After (With Structure-Preserving Parsing):
- LLM answers cite source chunks with traceable bounding boxes
- Accurate chunking feeds relevant context to LLM prompts
- Generation sticks to facts grounded in the original document structure
Enterprise Results: Measurable Gains
Real-world adoption confirms these performance gains:
- Anterior (Healthcare): Prior authorization pipeline at >99% precision, with fewer than 0.1% of reviews flawed due to document ingestion (Anterior case study).
- Elysian (Insurance): 16x faster audit process vs. legacy workflows by ensuring every extracted insight maps to original evidence (Elysian case study).
- Benchmark (Finance): Automated report generation with each statement linked directly to a precise source, eliminating hallucinatory claims (Benchmark case study).
Practitioner checklist: Reduce hallucinations with structure‑preserving extraction
Use these concrete switches and practices to keep citations precise and generation grounded.
Implementation flags
- Enable citable chunks with bounding boxes
  - Set citations and bounding boxes so every chunk carries page + bbox for traceability
  - Example flags: citations: true, include_bboxes: true
- Preserve true reading order
  - Avoid column bleed and header/footer confusion in multi‑column PDFs
  - Example flag: options.reading_order: "layout"
- Maintain table fidelity for numeric truth
  - Keep merged cells, header rows, and hierarchies intact to prevent value/key drift
  - Example flag: tables.merge_cells: true
  - Why: Higher table similarity on RD‑TableBench correlates with better QA grounding (RD‑TableBench)
- Chunk with structure awareness
  - Use layout‑aware variable chunks; store type metadata (table/figure/text) for retrieval filters
- Figures and graphs
  - Summarize for context, but always retain the bbox to jump back to the exact region
- Forms and checkboxes
  - Extract fields and mark checkbox states; prefer explicit capture over heuristic inference
- Evaluate with citations on
  - Assert each answer span has a source page + bbox and penalize uncited claims (Document API); see the evaluation sketch after the parse example below
Example: Parse with structure‑preserving flags
from reducto import Client
import os

c = Client(api_key=os.environ["REDUCTO_API_KEY"])
doc = c.parse(
    "claim.pdf",
    chunk_mode="variable",
    citations=True,                       # include page+bbox refs in chunks for grounding
    include_bboxes=True,                  # explicit bbox payloads
    options={"reading_order": "layout"},  # column‑aware ordering
    tables={"merge_cells": True},         # preserve merged headers/cells
)

# Downstream: prefer hybrid retrieval + structural filters
q = "Total allowed amount on CMS‑1500?"
hits = retrieve(index=embed_and_index(doc.chunks), q=q,
                k=5, filter={"type": ["table", "form"]})
answer = llm.answer(q, context=format_with_citations(hits))
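To close the loop on the "evaluate with citations on" checklist item, a minimal grounding check might reject any claim that lacks a page + bbox reference. The claim/citation schema and helper name below are assumptions for illustration, not a documented API.

def assert_grounded(answer_claims):
    """Flag answer claims that carry no page + bbox citation (illustrative sketch)."""
    # Assumed shape: [{"text": "...", "citations": [{"page": 3, "bbox": [10, 20, 300, 40]}]}]
    uncited = [
        claim["text"]
        for claim in answer_claims
        if not any(c.get("page") is not None and c.get("bbox") for c in claim.get("citations", []))
    ]
    if uncited:
        raise ValueError(f"{len(uncited)} uncited claim(s); penalize or regenerate: {uncited}")

Run this check inside the eval harness as well as at serving time, so uncited generations are caught before they reach users.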
Domain callouts where this is critical
- Healthcare prior auth: Sentence‑level bboxes and reliable reading order enabled 99%+ accuracy with <0.1% ingestion‑attributable flaws at production scale (Anterior case study).
- Insurance claims: Table/checkbox fidelity and citations drove up to 16x faster audits under strict evidence requirements (Elysian case study).
Summary Table: Structure-Preserving vs. Traditional Parsing
| Dimension | Traditional OCR | Structure-Preserving (Reducto) |
|---|---|---|
| Layout fidelity | Low | High |
| Chunking for LLMs | Basic | Semantic, layout-aware |
| Table/figure integrity | Poor | Accurate, cell-level |
| Citation support | Not available | Built-in bounding boxes |
| RAG/QA retrieval accuracy | Inconsistent | High, with measurable gains |
| Hallucination risk | High | Reduced |
Conclusion
Structure-preserving parsing is the foundation for reliable, factual LLM systems. By maintaining document fidelity, providing citable extraction, and delivering LLM-optimized chunking, organizations can decrease hallucination rates, boost retrieval accuracy, and unlock powerful AI-driven workflows at scale.
References: Reducto Document API, RD-TableBench, Benchmark Case Study, Anterior Case Study, Elysian Case Study