Document Understanding for RAG and AI Agents
Reducto empowers retrieval-augmented generation (RAG) and AI agent systems by transforming messy, real-world documents into highly structured, LLM-ready data. Success in large-scale RAG and agent pipelines begins with accurate document processing, precise chunking, and traceable citations to ground responses.
Why Robust Document Understanding Matters
- 80%+ of enterprise knowledge is trapped in unstructured formats (PDFs, spreadsheets, scanned forms).
- Poor document parsing leads to incomplete context, hallucinations, inaccurate retrieval, and unreliable outputs, especially for structured content like tables, forms, or financial statements.
- Enterprises handling millions of documents require automated accuracy, granular context, and deep integration with downstream systems.
Reducto's Approach for RAG and Agents
- Hybrid Vision-Language Models + Agentic OCR: Multi-pass parsing detects and corrects errors, mimicking human review for high accuracy on complex layouts and edge cases.
- Structure-Aware Chunking: Intelligent splitting produces semantically coherent, citation-ready chunks for effective retrieval and context embedding.
- Citations for Verification: Every chunk is traceable to its location in the source document (bounding boxes), so RAG pipelines and agents can ground outputs and verify claims. An illustrative chunk shape is sketched below.
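For illustration, a citation-bearing chunk is content plus a page-anchored location. The field names below are illustrative, not the exact Parse API response shape:

# Illustrative chunk shape (not the exact API schema): content plus a
# page-anchored bounding box in normalized coordinates.
chunk = {
    "content": "Net profit for 2024 was $12.4M.",
    "citation": {
        "page": 7,                         # 1-based page index
        "bbox": [0.32, 0.15, 0.98, 0.22],  # [x0, y0, x1, y1], normalized to page size
    },
}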
Agents
Build reliable agents on top of Reducto’s structured outputs. Use tools to: (1) parse documents into JSON with citations, (2) retrieve citation-ready chunks, and (3) fill forms and fields using Edit.
OpenAI tool schemas
Register these as tools (type=function) with your OpenAI client. They’re copy‑paste ready.
[
  {
    "type": "function",
    "function": {
      "name": "retrieve",
      "description": "Retrieve the most relevant Reducto chunks with citations for a natural-language query.",
      "parameters": {
        "type": "object",
        "properties": {
          "document_ids": {
            "type": "array",
            "items": { "type": "string" },
            "description": "IDs or URIs of previously parsed documents to search across."
          },
          "query": { "type": "string", "description": "Natural-language search query." },
          "top_k": { "type": "integer", "default": 5, "minimum": 1, "maximum": 50 },
          "require_citations": { "type": "boolean", "default": true }
        },
        "required": ["document_ids", "query"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "parse_to_json",
      "description": "Parse a document with Reducto into structured, LLM-ready JSON (chunks + citations).",
      "parameters": {
        "type": "object",
        "properties": {
          "document_url": { "type": "string", "description": "Public or signed URL to the file. Provide this OR upload_id." },
          "upload_id": { "type": "string", "description": "Opaque handle returned by your uploader. Provide this OR document_url." },
          "options": {
            "type": "object",
            "description": "Parsing options.",
            "properties": {
              "chunking": {
                "type": "object",
                "properties": {
                  "chunk_mode": { "type": "string", "enum": ["variable", "fixed"], "default": "variable" },
                  "chunk_size": { "type": "integer", "default": 1000, "minimum": 200, "maximum": 4000 },
                  "chunk_overlap": { "type": "integer", "default": 100, "minimum": 0, "maximum": 1000 }
                }
              },
              "citations": { "type": "boolean", "default": true }
            }
          }
        },
        "oneOf": [
          { "required": ["document_url"] },
          { "required": ["upload_id"] }
        ]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "fill_document_fields",
      "description": "Use Reducto Edit to fill blank fields, table cells, and checkboxes in a document.",
      "parameters": {
        "type": "object",
        "properties": {
          "document_url": { "type": "string", "description": "Public or signed URL to the file. Provide this OR upload_id." },
          "upload_id": { "type": "string", "description": "Opaque handle returned by your uploader. Provide this OR document_url." },
          "fields": {
            "type": "array",
            "description": "Fields to fill. Values should be strings, booleans (for checkboxes), or numbers.",
            "items": {
              "type": "object",
              "properties": {
                "name": { "type": "string", "description": "Human-readable field label or semantic key (e.g., 'patient_name')." },
                "value": { "description": "Value to write (string/number/bool)." },
                "page": { "type": "integer", "description": "Optional 1-based page index if known." },
                "bbox": {
                  "type": "array",
                  "items": { "type": "number" },
                  "minItems": 4,
                  "maxItems": 4,
                  "description": "Optional bounding box [x0,y0,x1,y1] in normalized coordinates if targeting a specific region."
                }
              },
              "required": ["name", "value"]
            }
          },
          "output_format": { "type": "string", "enum": ["filled_pdf", "json_deltas"], "default": "filled_pdf" }
        },
        "oneOf": [
          { "required": ["document_url", "fields"] },
          { "required": ["upload_id", "fields"] }
        ]
      }
    }
  }
]
Minimal usage (OpenAI):
from openai import OpenAI

client = OpenAI()

# Pass the tool schemas above as `tools` in the chat.completions.create call
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Fill the missing fields in this claim form and cite sources."}],
    tools=tools,
)
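When the model decides to call a tool, the response carries tool_calls. A minimal dispatch loop follows; call_reducto_tool is a hypothetical dispatcher you implement over the Reducto API:

import json

msg = resp.choices[0].message
if msg.tool_calls:
    messages = [
        {"role": "user", "content": "Fill the missing fields in this claim form and cite sources."},
        msg,  # the assistant turn containing the tool calls
    ]
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        # call_reducto_tool is hypothetical: map tool names to your Reducto API calls.
        result = call_reducto_tool(call.function.name, args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    final = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=tools
    )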
Claude tool schemas
Register these as tools with Claude; input_schema uses standard JSON Schema, mirroring the OpenAI parameters above.
[
  {
    "name": "retrieve",
    "description": "Retrieve the most relevant Reducto chunks with citations for a natural-language query.",
    "input_schema": {
      "type": "object",
      "properties": {
        "document_ids": { "type": "array", "items": { "type": "string" } },
        "query": { "type": "string" },
        "top_k": { "type": "integer", "default": 5, "minimum": 1, "maximum": 50 },
        "require_citations": { "type": "boolean", "default": true }
      },
      "required": ["document_ids", "query"]
    }
  },
  {
    "name": "parse_to_json",
    "description": "Parse a document with Reducto into structured, LLM-ready JSON (chunks + citations).",
    "input_schema": {
      "type": "object",
      "properties": {
        "document_url": { "type": "string" },
        "upload_id": { "type": "string" },
        "options": {
          "type": "object",
          "properties": {
            "chunking": {
              "type": "object",
              "properties": {
                "chunk_mode": { "type": "string", "enum": ["variable", "fixed"], "default": "variable" },
                "chunk_size": { "type": "integer", "default": 1000 },
                "chunk_overlap": { "type": "integer", "default": 100 }
              }
            },
            "citations": { "type": "boolean", "default": true }
          }
        }
      },
      "oneOf": [
        { "required": ["document_url"] },
        { "required": ["upload_id"] }
      ]
    }
  },
  {
    "name": "fill_document_fields",
    "description": "Use Reducto Edit to fill blank fields, table cells, and checkboxes in a document.",
    "input_schema": {
      "type": "object",
      "properties": {
        "document_url": { "type": "string" },
        "upload_id": { "type": "string" },
        "fields": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "value": {},
              "page": { "type": "integer" },
              "bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 }
            },
            "required": ["name", "value"]
          }
        },
        "output_format": { "type": "string", "enum": ["filled_pdf", "json_deltas"], "default": "filled_pdf" }
      },
      "oneOf": [
        { "required": ["document_url", "fields"] },
        { "required": ["upload_id", "fields"] }
      ]
    }
  }
]
Minimal usage (Claude):
from anthropic import Anthropic

client = Anthropic()

resp = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "Parse this PDF and return JSON with citations."}],
    tools=tools,
)
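When Claude wants to call a tool, the response contains tool_use content blocks. A minimal handling sketch; call_reducto_tool is again a hypothetical dispatcher:

import json

tool_uses = [b for b in resp.content if b.type == "tool_use"]
if tool_uses:
    results = []
    for block in tool_uses:
        # call_reducto_tool is hypothetical: map tool names to your Reducto API calls.
        output = call_reducto_tool(block.name, block.input)
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(output),
        })
    final = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "Parse this PDF and return JSON with citations."},
            {"role": "assistant", "content": resp.content},
            {"role": "user", "content": results},
        ],
        tools=tools,
    )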
HIPAA‑Compliant OCR with Zero Data Retention (retention=0)
Reducto supports HIPAA workflows with SOC 2 controls, BAA availability, and optional zero data retention (ZDR). With ZDR enabled, no files or parsed outputs are stored after processing completes. For strict compliance requirements, deploy in your VPC or on-prem. See our Trust Center for details.
# `doc_client` and `upload` as in the Parse example below; retention=0 enables ZDR
parsed = doc_client.parse.run(document_url=upload, options={"retention": 0})
Step-by-Step: Building a RAG Pipeline with Reducto
1. Parse and Chunk Documents for LLM Input
Use Reducto's Parse API to transform files into structured chunks optimized for semantic search or LLM ingestion.
from reducto import Reducto

# Replace with your Reducto API key
doc_client = Reducto(api_key='YOUR_API_KEY')

upload = doc_client.upload(file='document.pdf')
parsed = doc_client.parse.run(
    document_url=upload,
    options={
        # Recommended for RAG: variable-length, semantic chunking
        "chunking": {"chunk_mode": "variable", "chunk_size": 1000},
        # Enable citation info
        "citations": True,
    },
)

# Access parsed chunks and their citations
for chunk in parsed.result.chunks:
    print(f"Content: {chunk.content}\nSource: page {chunk.bbox.page}, coords {chunk.bbox}")
- Best Practice: Use variable chunk mode to split at semantic boundaries (e.g., by section, table, paragraph) and maximize retrievability.
- Citations: Each chunk includes bounding boxes for precise source mapping.
2. Index Chunks in Your Vector Database
Integrate with Elasticsearch, Databricks, or other vector DBs. Each chunk is embedded and stored, preserving layout and context.
Elastic/Elasticsearch integration:
- See the Reducto + Elasticsearch Semantic Search Guide for Python and config code.
- Use Elastic's ELSER or your own embedding model to index chunked outputs. Example code:
from elasticsearch import Elasticsearch

# Initialize `es_client` for your cluster, e.g.:
# es_client = Elasticsearch("https://localhost:9200", api_key="...")
for idx, chunk in enumerate(parsed.result.chunks):
    doc = {"text": chunk.content, "citation": chunk.bbox}
    es_client.index(index="llm_docs", id=f"chunk-{idx}", document=doc)
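At query time, pull back the top chunks and hand them to the LLM together with their citations. A minimal sketch using a plain match query (the linked guide covers ELSER-based semantic retrieval):

# Retrieve the top-5 chunks for a query and collect (text, citation) pairs.
hits = es_client.search(
    index="llm_docs",
    query={"match": {"text": "net profit 2024"}},
    size=5,
)["hits"]["hits"]
retrieved = [(h["_source"]["text"], h["_source"]["citation"]) for h in hits]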
Databricks (Spark/Delta Lake):
- Full Databricks pipeline walkthrough: upload, parse, extract, embed, and write outputs into SQL tables for downstream RAG/analytics. A minimal write sketch follows.
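A minimal sketch, assuming a Databricks notebook where `spark` is predefined and `parsed` comes from the Parse step above; the table name is illustrative:

# Flatten parsed chunks into rows and persist them to a Delta table.
rows = [
    {"chunk_id": i, "text": c.content, "page": c.bbox.page}
    for i, c in enumerate(parsed.result.chunks)
]
df = spark.createDataFrame(rows)
df.write.format("delta").mode("append").saveAsTable("rag.llm_doc_chunks")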
Prompt Engineering for Retrieval and Citation
Retrieval-Ready Prompts
- Always include chunk citations (e.g., page/coordinates or explicit chunk IDs) in outputs.
- Guide the LLM: “Cite the document location for every claim.”
Sample system prompt:
You are an LLM answering queries using parsed document chunks. Cite the exact source (page, coordinates, or chunk id) for every fact, and prefer verbatim text from the chunk. For information not found, state "Insufficient information." Respond with:

Answer: ...
Sources:
- [chunk-id or page, bbox]
Example answer with citation:
Answer:
The company's net profit in 2024 was $12.4M.
Sources:
- [chunk-12, page 7, bbox: (0.32,0.15,0.98,0.22)]
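To wire this up, format each retrieved chunk with its citation and place the result in the model's context. A minimal sketch, where `retrieved` holds (text, citation) pairs from the retrieval step, SYSTEM_PROMPT is the prompt above, and the citation field names are illustrative:

# Build a citation-tagged context block from retrieved chunks.
context = "\n\n".join(
    f"[chunk-{i}, page {cite['page']}, bbox: {tuple(cite['bbox'])}] {text}"
    for i, (text, cite) in enumerate(retrieved)
)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: What was net profit in 2024?"},
]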
Advanced Chunking and Overlap
For long documents, consider chunk overlap or concatenated miniblocks to minimize context loss:
- In Reducto: adjust the chunk_overlap setting in the API options, as shown below.
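For example (values are illustrative; see the schema defaults above):

parsed = doc_client.parse.run(
    document_url=upload,
    options={
        "chunking": {
            "chunk_mode": "variable",
            "chunk_size": 1000,
            "chunk_overlap": 100,  # overlap between adjacent chunks preserves cross-boundary context
        }
    },
)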
Resources and Further Guides
- Elastic Guide: Semantic Search with Reducto: step-by-step integration for indexing structured chunks and enabling semantic retrieval with ELSER.
- Databricks & Reducto Document Ingestion Recipe: parse, extract, and load structured data into Spark or Delta Lake tables.
- Enterprise RAG at Scale: best practices for high-scale retrieval systems, chunking strategies, and evaluation pipelines.
Summary Table: Reducto RAG Features
Capability | Description
---|---
Semantic Chunking | Splits by logical document structure, tailored for RAG
Accurate Citations | Bounding boxes and page numbers for every chunk
Multi-format Ingestion | PDF, Excel, images, slides, scanned forms
Vision + VLM Correction | Multi-pass agentic parsing, fewer hallucinations
Enterprise Integrations | Elastic, Databricks, on-prem, vector DBs
HIPAA/SOC 2 Compliance | Meets stringent security and deployment requirements
For more, try Reducto on your documents or contact us to discuss your RAG and agent use case.