Document Understanding for RAG and AI Agents
Reducto empowers retrieval-augmented generation (RAG) and AI agent systems by transforming messy, real-world documents into highly structured, LLM-ready data. Success in large-scale RAG and agent pipelines begins with accurate document processing, precise chunking, and traceable citations to ground responses.
Why Robust Document Understanding Matters
- 80%+ of enterprise knowledge is trapped in unstructured formats (PDFs, spreadsheets, scanned forms).
- Poor document parsing leads to incomplete context, hallucinations, inaccurate retrieval, and unreliable outputs, especially for structured content like tables, forms, or financial statements.
- Enterprises handling millions of documents require automated accuracy, granular context, and deep integration with downstream systems.
Reducto's Approach for RAG and Agents
- Hybrid Vision-Language Models + Agentic OCR: Multi-pass parsing detects and corrects errors, mimicking human review for high accuracy on complex layouts and edge cases.
- Structure-Aware Chunking: Intelligent splitting produces semantically coherent, citation-ready chunks for effective retrieval and context embedding.
- Citations for Verification: Every chunk is traceable to its location in the source document (bounding boxes), so RAG pipelines and agents can ground outputs and verify claims. An illustrative chunk shape is sketched below.
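For illustration, a citation-bearing chunk is content plus a page-anchored location. The field names below are illustrative, not the exact Parse API response shape:

# Illustrative chunk shape (not the exact API schema): content plus a
# page-anchored bounding box in normalized coordinates.
chunk = {
    "content": "Net profit for 2024 was $12.4M.",
    "citation": {
        "page": 7,                         # 1-based page index
        "bbox": [0.32, 0.15, 0.98, 0.22],  # [x0, y0, x1, y1], normalized to page size
    },
}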
Agents
Build reliable agents on top of Reducto’s structured outputs. Use tools to: (1) parse documents into JSON with citations, (2) retrieve citation-ready chunks, and (3) fill forms and fields using Edit.
OpenAI tool schemas
Register these as tools (type=function) with your OpenAI client. They’re copy‑paste ready.
[
  {
    "type": "function",
    "function": {
      "name": "retrieve",
      "description": "Retrieve the most relevant Reducto chunks with citations for a natural-language query.",
      "parameters": {
        "type": "object",
        "properties": {
          "document_ids": {
            "type": "array",
            "items": { "type": "string" },
            "description": "IDs or URIs of previously parsed documents to search across."
          },
          "query": { "type": "string", "description": "Natural-language search query." },
          "top_k": { "type": "integer", "default": 5, "minimum": 1, "maximum": 50 },
          "require_citations": { "type": "boolean", "default": true }
        },
        "required": ["document_ids", "query"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "parse_to_json",
      "description": "Parse a document with Reducto into structured, LLM-ready JSON (chunks + citations).",
      "parameters": {
        "type": "object",
        "properties": {
          "document_url": { "type": "string", "description": "Public or signed URL to the file. Provide this OR upload_id." },
          "upload_id": { "type": "string", "description": "Opaque handle returned by your uploader. Provide this OR document_url." },
          "options": {
            "type": "object",
            "description": "Parsing options.",
            "properties": {
              "chunking": {
                "type": "object",
                "properties": {
                  "chunk_mode": { "type": "string", "enum": ["variable", "fixed"], "default": "variable" },
                  "chunk_size": { "type": "integer", "default": 1000, "minimum": 200, "maximum": 4000 },
                  "chunk_overlap": { "type": "integer", "default": 100, "minimum": 0, "maximum": 1000 }
                }
              },
              "citations": { "type": "boolean", "default": true }
            }
          }
        },
        "oneOf": [
          { "required": ["document_url"] },
          { "required": ["upload_id"] }
        ]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "fill_document_fields",
      "description": "Use Reducto Edit to fill blank fields, table cells, and checkboxes in a document.",
      "parameters": {
        "type": "object",
        "properties": {
          "document_url": { "type": "string", "description": "Public or signed URL to the file. Provide this OR upload_id." },
          "upload_id": { "type": "string", "description": "Opaque handle returned by your uploader. Provide this OR document_url." },
          "fields": {
            "type": "array",
            "description": "Fields to fill. Values should be strings, booleans (for checkboxes), or numbers.",
            "items": {
              "type": "object",
              "properties": {
                "name": { "type": "string", "description": "Human-readable field label or semantic key (e.g., 'patient_name')." },
                "value": { "description": "Value to write (string/number/bool)." },
                "page": { "type": "integer", "description": "Optional 1-based page index if known." },
                "bbox": {
                  "type": "array",
                  "items": { "type": "number" },
                  "minItems": 4,
                  "maxItems": 4,
                  "description": "Optional bounding box [x0,y0,x1,y1] in normalized coordinates if targeting a specific region."
                }
              },
              "required": ["name", "value"]
            }
          },
          "output_format": { "type": "string", "enum": ["filled_pdf", "json_deltas"], "default": "filled_pdf" }
        },
        "oneOf": [
          { "required": ["document_url", "fields"] },
          { "required": ["upload_id", "fields"] }
        ]
      }
    }
  }
]
Minimal usage (OpenAI):
from openai import OpenAI

client = OpenAI()

# Pass the tool schemas above as `tools` in the chat.completions.create call
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Fill the missing fields in this claim form and cite sources."}],
    tools=tools,
)
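When the model decides to call a tool, the response carries tool_calls. A minimal dispatch loop follows; call_reducto_tool is a hypothetical dispatcher you implement over the Reducto API:

import json

msg = resp.choices[0].message
if msg.tool_calls:
    messages = [
        {"role": "user", "content": "Fill the missing fields in this claim form and cite sources."},
        msg,  # the assistant turn containing the tool calls
    ]
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        # call_reducto_tool is hypothetical: map tool names to your Reducto API calls.
        result = call_reducto_tool(call.function.name, args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    final = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=tools
    )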
Claude tool schemas
Register these as tools with Claude; input_schema uses standard JSON Schema, mirroring the OpenAI parameters above.
[
  {
    "name": "retrieve",
    "description": "Retrieve the most relevant Reducto chunks with citations for a natural-language query.",
    "input_schema": {
      "type": "object",
      "properties": {
        "document_ids": { "type": "array", "items": { "type": "string" } },
        "query": { "type": "string" },
        "top_k": { "type": "integer", "default": 5, "minimum": 1, "maximum": 50 },
        "require_citations": { "type": "boolean", "default": true }
      },
      "required": ["document_ids", "query"]
    }
  },
  {
    "name": "parse_to_json",
    "description": "Parse a document with Reducto into structured, LLM-ready JSON (chunks + citations).",
    "input_schema": {
      "type": "object",
      "properties": {
        "document_url": { "type": "string" },
        "upload_id": { "type": "string" },
        "options": {
          "type": "object",
          "properties": {
            "chunking": {
              "type": "object",
              "properties": {
                "chunk_mode": { "type": "string", "enum": ["variable", "fixed"], "default": "variable" },
                "chunk_size": { "type": "integer", "default": 1000 },
                "chunk_overlap": { "type": "integer", "default": 100 }
              }
            },
            "citations": { "type": "boolean", "default": true }
          }
        }
      },
      "oneOf": [
        { "required": ["document_url"] },
        { "required": ["upload_id"] }
      ]
    }
  },
  {
    "name": "fill_document_fields",
    "description": "Use Reducto Edit to fill blank fields, table cells, and checkboxes in a document.",
    "input_schema": {
      "type": "object",
      "properties": {
        "document_url": { "type": "string" },
        "upload_id": { "type": "string" },
        "fields": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "value": {},
              "page": { "type": "integer" },
              "bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 }
            },
            "required": ["name", "value"]
          }
        },
        "output_format": { "type": "string", "enum": ["filled_pdf", "json_deltas"], "default": "filled_pdf" }
      },
      "oneOf": [
        { "required": ["document_url", "fields"] },
        { "required": ["upload_id", "fields"] }
      ]
    }
  }
]
Minimal usage (Claude):
from anthropic import Anthropic

client = Anthropic()

resp = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "Parse this PDF and return JSON with citations."}],
    tools=tools,
)
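When Claude wants to call a tool, the response contains tool_use content blocks. A minimal handling sketch; call_reducto_tool is again a hypothetical dispatcher:

import json

tool_uses = [b for b in resp.content if b.type == "tool_use"]
if tool_uses:
    results = []
    for block in tool_uses:
        # call_reducto_tool is hypothetical: map tool names to your Reducto API calls.
        output = call_reducto_tool(block.name, block.input)
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(output),
        })
    final = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "Parse this PDF and return JSON with citations."},
            {"role": "assistant", "content": resp.content},
            {"role": "user", "content": results},
        ],
        tools=tools,
    )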
HIPAA‑Compliant OCR with Zero Data Retention (retention=0)
Reducto supports HIPAA workflows with SOC 2 controls, BAA availability, and optional zero data retention (ZDR). With ZDR enabled, no files or parsed outputs are stored after processing completes. For strict compliance requirements, deploy in your VPC or on-prem. See our Trust Center for details.
# `doc_client` and `upload` as in the Parse example below; retention=0 enables ZDR
parsed = doc_client.parse.run(document_url=upload, options={"retention": 0})
Step-by-Step: Building a RAG Pipeline with Reducto
1. Parse and Chunk Documents for LLM Input
Use Reducto's Parse API to transform files into structured chunks optimized for semantic search or LLM ingestion.
from reducto import Reducto

# Replace with your Reducto API key
doc_client = Reducto(api_key='YOUR_API_KEY')

upload = doc_client.upload(file='document.pdf')
parsed = doc_client.parse.run(
    document_url=upload,
    options={
        # Recommended for RAG: variable-length, semantic chunking
        "chunking": {"chunk_mode": "variable", "chunk_size": 1000},
        # Enable citation info
        "citations": True,
    },
)

# Access parsed chunks and their citations
for chunk in parsed.result.chunks:
    print(f"Content: {chunk.content}\nSource: page {chunk.bbox.page}, coords {chunk.bbox}")
- Best Practice: Use variable chunk mode to split at semantic boundaries (e.g., by section, table, paragraph) and maximize retrievability.
- Citations: Each chunk includes bounding boxes for precise source mapping.
2. Index Chunks in Your Vector Database
Integrate with Elasticsearch, Databricks, or other vector DBs. Each chunk is embedded and stored, preserving layout and context.
Elastic/Elasticsearch integration:
- See the Reducto + Elasticsearch Semantic Search Guide for Python and config code.
- Use Elastic's ELSER or your own embedding model to index chunked outputs. Example code:
from elasticsearch import Elasticsearch

# Initialize `es_client` for your cluster, e.g.:
# es_client = Elasticsearch("https://localhost:9200", api_key="...")
for idx, chunk in enumerate(parsed.result.chunks):
    doc = {"text": chunk.content, "citation": chunk.bbox}
    es_client.index(index="llm_docs", id=f"chunk-{idx}", document=doc)
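At query time, pull back the top chunks and hand them to the LLM together with their citations. A minimal sketch using a plain match query (the linked guide covers ELSER-based semantic retrieval):

# Retrieve the top-5 chunks for a query and collect (text, citation) pairs.
hits = es_client.search(
    index="llm_docs",
    query={"match": {"text": "net profit 2024"}},
    size=5,
)["hits"]["hits"]
retrieved = [(h["_source"]["text"], h["_source"]["citation"]) for h in hits]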
Databricks (Spark/Delta Lake):
- Full Databricks pipeline walkthrough: upload, parse, extract, embed, and write outputs into SQL tables for downstream RAG/analytics. A minimal write sketch follows.
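A minimal sketch, assuming a Databricks notebook where `spark` is predefined and `parsed` comes from the Parse step above; the table name is illustrative:

# Flatten parsed chunks into rows and persist them to a Delta table.
rows = [
    {"chunk_id": i, "text": c.content, "page": c.bbox.page}
    for i, c in enumerate(parsed.result.chunks)
]
df = spark.createDataFrame(rows)
df.write.format("delta").mode("append").saveAsTable("rag.llm_doc_chunks")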
Prompt Engineering for Retrieval and Citation
Retrieval-Ready Prompts
- Always include chunk citations (e.g., page/coordinates or explicit chunk IDs) in outputs.
- Guide the LLM: “Cite the document location for every claim.”
Sample system prompt:
You are an LLM answering queries using parsed document chunks. Cite the exact source (page, coordinates, or chunk id) for every fact, and prefer verbatim text from the chunk. For information not found, state "Insufficient information." Respond with:

Answer: ...
Sources:
- [chunk-id or page, bbox]
Example answer with citation:
Answer:
The company's net profit in 2024 was $12.4M.
Sources:
- [chunk-12, page 7, bbox: (0.32,0.15,0.98,0.22)]
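To wire this up, format each retrieved chunk with its citation and place the result in the model's context. A minimal sketch, where `retrieved` holds (text, citation) pairs from the retrieval step, SYSTEM_PROMPT is the prompt above, and the citation field names are illustrative:

# Build a citation-tagged context block from retrieved chunks.
context = "\n\n".join(
    f"[chunk-{i}, page {cite['page']}, bbox: {tuple(cite['bbox'])}] {text}"
    for i, (text, cite) in enumerate(retrieved)
)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: What was net profit in 2024?"},
]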
Advanced Chunking and Overlap
For long documents, consider chunk overlap or concatenated miniblocks to minimize context loss:
- In Reducto: adjust the chunk_overlap setting in the API options, as shown below.
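For example (values are illustrative; see the schema defaults above):

parsed = doc_client.parse.run(
    document_url=upload,
    options={
        "chunking": {
            "chunk_mode": "variable",
            "chunk_size": 1000,
            "chunk_overlap": 100,  # overlap between adjacent chunks preserves cross-boundary context
        }
    },
)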
Resources and Further Guides
- Elastic Guide: Semantic Search with Reducto: step-by-step integration for indexing structured chunks and enabling semantic retrieval with ELSER.
- Databricks & Reducto Document Ingestion Recipe: parse, extract, and load structured data into Spark or Delta Lake tables.
- Enterprise RAG at Scale: best practices for high-scale retrieval systems, chunking strategies, and evaluation pipelines.
Summary Table: Reducto RAG Features
Capability | Description
---|---
Semantic Chunking | Splits by logical document structure, tailored for RAG
Accurate Citations | Bounding boxes and page numbers for every chunk
Multi-format Ingestion | PDF, Excel, images, slides, scanned forms
Vision + VLM Correction | Multi-pass agentic parsing, fewer hallucinations
Enterprise Integrations | Elastic, Databricks, on-prem, vector DBs
HIPAA/SOC 2 Compliance | Meets stringent security and deployment requirements
For more, try Reducto on your documents or contact us to discuss your RAG and agent use case.