
Document Understanding for RAG and AI Agents: Best Practices, Chunking, and Retrieval

Reducto empowers retrieval-augmented generation (RAG) and AI agent systems by transforming messy, real-world documents into highly structured, LLM-ready data. Success in large-scale RAG and agent pipelines begins with accurate document processing, precise chunking, and traceable citations to ground responses.

Why Robust Document Understanding Matters

  • 80%+ of enterprise knowledge is trapped in unstructured formats (PDFs, spreadsheets, scanned forms). (source)

  • Poor document parsing leads to incomplete context, hallucinations, inaccurate retrieval, and unreliable outputs — especially for structured content like tables, forms, or financial statements.

  • Enterprises handling millions of documents require automated accuracy, granular context, and deep integration with downstream systems.

Reducto's Approach for RAG and Agents

  • Hybrid Vision-Language Models + Agentic OCR: Multi-pass parsing detects and corrects errors, mimicking human review for unmatched accuracy, especially with complex layouts and edge cases.

  • Structure-Aware Chunking: Intelligent splitting of documents produces semantically relevant, citation-ready chunks for effective retrieval and context embedding.

  • Citations for Verification: Each chunk is traceable to its document location (bounding boxes), enabling RAG pipelines and agents to ground outputs and ensure verifiability.

Agents

Build reliable agents on top of Reducto’s structured outputs. Use tools to: (1) parse documents into JSON with citations, (2) retrieve citation-ready chunks, and (3) fill forms and fields using Edit.

OpenAI tool schemas

Register these as tools (type=function) with your OpenAI client. They’re copy‑paste ready.

[
  {
    "type": "function",
    "function": {
      "name": "retrieve",
      "description": "Retrieve the most relevant Reducto chunks with citations for a natural-language query.",
      "parameters": {
        "type": "object",
        "properties": {
          "document_ids": {
            "type": "array",
            "items": { "type": "string" },
            "description": "IDs or URIs of previously parsed documents to search across."
          },
          "query": { "type": "string", "description": "Natural-language search query." },
          "top_k": { "type": "integer", "default": 5, "minimum": 1, "maximum": 50 },
          "require_citations": { "type": "boolean", "default": true }
        },
        "required": ["document_ids", "query"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "parse_to_json",
      "description": "Parse a document with Reducto into structured, LLM-ready JSON (chunks + citations).",
      "parameters": {
        "type": "object",
        "properties": {
          "document_url": { "type": "string", "description": "Public or signed URL to the file. Provide this OR upload_id." },
          "upload_id": { "type": "string", "description": "Opaque handle returned by your uploader. Provide this OR document_url." },
          "options": {
            "type": "object",
            "description": "Parsing options.",
            "properties": {
              "chunking": {
                "type": "object",
                "properties": {
                  "chunk_mode": { "type": "string", "enum": ["variable", "fixed"], "default": "variable" },
                  "chunk_size": { "type": "integer", "default": 1000, "minimum": 200, "maximum": 4000 },
                  "chunk_overlap": { "type": "integer", "default": 100, "minimum": 0, "maximum": 1000 }
                }
              },
              "citations": { "type": "boolean", "default": true }
            }
          }
        }
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "fill_document_fields",
      "description": "Use Reducto Edit to fill blank fields, table cells, and checkboxes in a document.",
      "parameters": {
        "type": "object",
        "properties": {
          "document_url": { "type": "string", "description": "Public or signed URL to the file. Provide this OR upload_id." },
          "upload_id": { "type": "string", "description": "Opaque handle returned by your uploader. Provide this OR document_url." },
          "fields": {
            "type": "array",
            "description": "Fields to fill. Values should be strings, booleans (for checkboxes), or numbers.",
            "items": {
              "type": "object",
              "properties": {
                "name": { "type": "string", "description": "Human-readable field label or semantic key (e.g., 'patient_name')." },
                "value": { "description": "Value to write (string/number/bool)." },
                "page": { "type": "integer", "description": "Optional 1-based page index if known." },
                "bbox": {
                  "type": "array",
                  "items": { "type": "number" },
                  "minItems": 4,
                  "maxItems": 4,
                  "description": "Optional bounding box [x0,y0,x1,y1] in normalized coordinates if targeting a specific region."
                }
              },
              "required": ["name", "value"]
            }
          },
          "output_format": { "type": "string", "enum": ["filled_pdf", "json_deltas"], "default": "filled_pdf" }
        },
        "required": ["fields"]
      }
    }
  }
]

Minimal usage (OpenAI):

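import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load the tool schemas above with json so JSON literals such as `true` parse correctly.
# Assumption: the list was saved as openai_tools.json (hypothetical filename).
with open("openai_tools.json") as f:
    tools = json.load(f)

# Pass the schemas as `tools` in the chat.completions.create call
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Fill the missing fields in this claim form and cite sources."}],
    tools=tools,
)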
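
Once the model responds, each tool call it requests has to be routed to your own code. The following is a minimal sketch of that dispatch step, assuming each call has been reduced to a tool name plus a JSON argument string (the shape OpenAI exposes as `function.name` and `function.arguments`). The three handler functions are hypothetical stubs standing in for real Reducto calls, and `require_one_source` enforces the "document_url OR upload_id" rule on the client side, since the model is not guaranteed to honor it.

```python
import json
from typing import Any, Callable, Dict


def require_one_source(args: Dict[str, Any]) -> None:
    """Enforce that exactly one of document_url / upload_id is present."""
    present = [k for k in ("document_url", "upload_id") if args.get(k)]
    if len(present) != 1:
        raise ValueError("Provide exactly one of document_url or upload_id")


# Hypothetical stubs: replace the bodies with real Reducto calls.
def retrieve(document_ids, query, top_k=5, require_citations=True):
    return {"tool": "retrieve", "query": query, "top_k": top_k}


def parse_to_json(**args):
    require_one_source(args)
    return {"tool": "parse_to_json",
            "source": args.get("document_url") or args.get("upload_id")}


def fill_document_fields(**args):
    require_one_source(args)
    return {"tool": "fill_document_fields", "filled": len(args["fields"])}


HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "retrieve": retrieve,
    "parse_to_json": parse_to_json,
    "fill_document_fields": fill_document_fields,
}


def dispatch(name: str, arguments_json: str) -> str:
    """Run one tool call and return a JSON string to send back as the tool result."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = handler(**json.loads(arguments_json))
    except (TypeError, ValueError, KeyError) as exc:
        # Bad JSON, missing or unexpected arguments, or a failed source check.
        return json.dumps({"error": str(exc)})
    return json.dumps(result)
```

With the OpenAI client, you would loop over `resp.choices[0].message.tool_calls`, call `dispatch(call.function.name, call.function.arguments)` for each, and return the string in a `role: "tool"` message that carries the matching `tool_call_id`.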

Claude tool schemas

Register these as tools with Claude. Each input_schema is a JSON Schema object describing the tool's arguments.

[
  {
    "name": "retrieve",
    "description": "Retrieve the most relevant Reducto chunks with citations for a natural-language query.",
    "input_schema": {
      "type": "object",
      "properties": {
        "document_ids": { "type": "array", "items": { "type": "string" } },
        "query": { "type": "string" },
        "top_k": { "type": "integer", "default": 5, "minimum": 1, "maximum": 50 },
        "require_citations": { "type": "boolean", "default": true }
      },
      "required": ["document_ids", "query"]
    }
  },
  {
    "name": "parse_to_json",
    "description": "Parse a document with Reducto into structured, LLM-ready JSON (chunks + citations). Provide exactly one of document_url or upload_id.",
    "input_schema": {
      "type": "object",
      "properties": {
        "document_url": { "type": "string" },
        "upload_id": { "type": "string" },
        "options": {
          "type": "object",
          "properties": {
            "chunking": {
              "type": "object",
              "properties": {
                "chunk_mode": { "type": "string", "enum": ["variable", "fixed"], "default": "variable" },
                "chunk_size": { "type": "integer", "default": 1000 },
                "chunk_overlap": { "type": "integer", "default": 100 }
              }
            },
            "citations": { "type": "boolean", "default": true }
          }
        }
      }
    }
  },
  {
    "name": "fill_document_fields",
    "description": "Use Reducto Edit to fill blank fields, table cells, and checkboxes in a document. Provide exactly one of document_url or upload_id.",
    "input_schema": {
      "type": "object",
      "properties": {
        "document_url": { "type": "string" },
        "upload_id": { "type": "string" },
        "fields": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "value": {},
              "page": { "type": "integer" },
              "bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 }
            },
            "required": ["name", "value"]
          }
        },
        "output_format": { "type": "string", "enum": ["filled_pdf", "json_deltas"], "default": "filled_pdf" }
      },
      "required": ["fields"]
    }
  }
]

Minimal usage (Claude):

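import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Assumption: the schema list above was saved as claude_tools.json (hypothetical filename).
with open("claude_tools.json") as f:
    tools = json.load(f)

resp = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "Parse this PDF and return JSON with citations."}],
    tools=tools,
)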
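
The OpenAI and Claude schema lists differ mainly in their wrapper: OpenAI nests the JSON Schema under `function.parameters`, while Claude expects it under `input_schema`. To keep a single source of truth, one list can be derived from the other. The sketch below shows one way to do that; the helper name `openai_to_claude` and the `drop_root_combinators` flag are illustrative, not part of either SDK. The flag optionally strips root-level combinators such as `oneOf`, which some tool-calling endpoints reject.

```python
import copy
from typing import Any, Dict, List


def openai_to_claude(
    tools: List[Dict[str, Any]],
    drop_root_combinators: bool = False,
) -> List[Dict[str, Any]]:
    """Convert OpenAI-style function tools into Claude-style tool definitions."""
    converted = []
    for tool in tools:
        fn = tool["function"]
        # Deep-copy so the original OpenAI schema is left untouched.
        schema = copy.deepcopy(
            fn.get("parameters", {"type": "object", "properties": {}})
        )
        if drop_root_combinators:
            for key in ("oneOf", "anyOf", "allOf"):
                schema.pop(key, None)
        converted.append({
            "name": fn["name"],
            "description": fn.get("description", ""),
            "input_schema": schema,
        })
    return converted
```

If you strip a combinator this way, restate the constraint in the tool description and validate it in your own handler, because the schema no longer expresses it.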

HIPAA‑Compliant OCR with Zero Data Retention (retention=0)

Reducto supports HIPAA workflows with SOC 2 controls, BAA availability, and optional zero data retention (ZDR). When ZDR is enabled, no files or parsed data are stored after processing. Deploy in your VPC or on‑prem for strict compliance. See our Trust Center for details.

# `doc_client` and `upload` are created as shown in step 1 below.
parsed = doc_client.parse.run(document_url=upload, options={"retention": 0})

Step-by-Step: Building a RAG Pipeline with Reducto

1. Parse and Chunk Documents for LLM Input

Use Reducto's Parse API to transform files into structured chunks optimized for semantic search or LLM ingestion.

from reducto import Reducto

# Replace with your Reducto API key
doc_client = Reducto(api_key='YOUR_API_KEY')

upload = doc_client.upload(file='document.pdf')
parsed = doc_client.parse.run(
    document_url=upload,
    options={
        # Recommended for RAG: variable-length, semantic chunking
        "chunking": {"chunk_mode": "variable", "chunk_size": 1000},
        # Enable citation info
        "citations": True
    }
)

# Access parsed chunks and their citations
for chunk in parsed.result.chunks:
    print(f"Content: {chunk.content}\nSource: page {chunk.bbox.page}, coords {chunk.bbox}")

  • Best Practice: Use variable chunk mode for splitting at semantic boundaries (e.g., by section, table, paragraph) to maximize retrievability.

  • Citations: Each chunk includes bounding boxes for precise source mapping.
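
To show a citation to a reviewer, the bounding box usually has to be mapped onto a rendered page image. The sketch below assumes the normalized `[x0, y0, x1, y1]` convention used in the tool schemas above, with the origin at the top-left corner; the function name and the pixel dimensions are illustrative, so check the coordinate convention of your actual parse output before relying on it.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(
    bbox: Sequence[float], page_width: int, page_height: int
) -> Tuple[int, int, int, int]:
    """Map a normalized [x0, y0, x1, y1] box onto a page of the given pixel size."""
    if len(bbox) != 4:
        raise ValueError("bbox must contain exactly four values")
    x0, y0, x1, y1 = bbox
    if not (0.0 <= x0 <= x1 <= 1.0 and 0.0 <= y0 <= y1 <= 1.0):
        raise ValueError("bbox values must be normalized and ordered")
    return (
        round(x0 * page_width),
        round(y0 * page_height),
        round(x1 * page_width),
        round(y1 * page_height),
    )
```

The returned tuple can be passed to any drawing routine that accepts a left, top, right, bottom rectangle.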

2. Index Chunks in Your Vector Database

Integrate with Elasticsearch, Databricks, or a dedicated vector database. Each chunk is embedded and stored together with its citation metadata, preserving layout and context.

Elastic/Elasticsearch integration:

from elasticsearch import Elasticsearch
# (Initialize `es_client` ...)

for idx, chunk in enumerate(parsed.result.chunks):
    doc = {"text": chunk.content, "citation": chunk.bbox}
    es_client.index(index="llm_docs", id=f"chunk-{idx}", document=doc)
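
Indexing one chunk per request becomes slow at scale, and a raw bounding-box object may not be JSON-serializable. As a sketch, the helper below builds plain-dict actions in the shape accepted by `elasticsearch.helpers.bulk`. The `Chunk` dataclass and its fields are assumptions standing in for whatever your parse result actually contains, and `to_bulk_actions` is an illustrative name.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class Chunk:
    """Hypothetical stand-in for one parsed chunk."""
    content: str
    page: int
    bbox: List[float]


def to_bulk_actions(
    chunks: Iterable[Chunk], index: str, doc_id: str
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk-index action per chunk, with a serializable citation."""
    for i, chunk in enumerate(chunks):
        yield {
            "_index": index,
            # Prefix with the document id so chunks from different files never collide.
            "_id": f"{doc_id}-chunk-{i}",
            "_source": {
                "text": chunk.content,
                "citation": {"page": chunk.page, "bbox": list(chunk.bbox)},
            },
        }
```

The generator can then be handed to `helpers.bulk(es_client, to_bulk_actions(chunks, "llm_docs", "doc1"))`, after importing `helpers` from the `elasticsearch` package.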

Databricks (Spark/Delta Lake):


Prompt Engineering for Retrieval and Citation

Retrieval-Ready Prompts

  • Always include chunk citations (e.g., page/coordinates or explicit chunk IDs) in outputs.

  • Guide the LLM: “Cite the document location for every claim.”

Sample system prompt:

You are an LLM answering queries using parsed document chunks. Cite the exact source (page, coordinates, or chunk ID) for every fact, and prefer verbatim text from the chunk. For information not found, state "Insufficient information." Respond with:

Answer: ...
Sources:
- [chunk-id or page, bbox]
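
For the model to cite a location, each retrieved chunk has to arrive in the prompt already labeled with its ID, page, and bounding box. The sketch below shows one way to assemble that context; the dict keys (`id`, `page`, `bbox`, `text`) and both function names are assumptions, not a Reducto format.

```python
from typing import Any, Dict, List


def build_context(chunks: List[Dict[str, Any]]) -> str:
    """Render retrieved chunks as labeled blocks the model can cite."""
    blocks = []
    for c in chunks:
        bbox = ",".join(f"{v:.2f}" for v in c["bbox"])
        label = f"[{c['id']}, page {c['page']}, bbox: ({bbox})]"
        blocks.append(f"{label}\n{c['text']}")
    return "\n\n".join(blocks)


def build_messages(
    system_prompt: str, question: str, chunks: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    """Combine the system prompt, labeled context, and user question."""
    user_content = f"Context:\n{build_context(chunks)}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]
```

The role-based layout matches the OpenAI chat format; with Claude, pass the system prompt through the `system` parameter and send only the user message in `messages`.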

Example answer with citation:

Answer:
The company's net profit in 2024 was $12.4M.
Sources:

- [chunk-12, page 7, bbox: (0.32,0.15,0.98,0.22)]
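
Asking for citations does not guarantee that the cited chunks were actually retrieved. A cheap grounding check is to parse the Sources section and compare it with the set of chunk IDs that were placed in the prompt. The sketch below assumes the answer format shown above and IDs of the form `chunk-N`; the function names are illustrative.

```python
import re
from typing import List, Set

# Matches list items such as "- [chunk-12, page 7, ...]" and captures the id.
CITATION_RE = re.compile(r"^\s*-\s*\[(chunk-\d+)", re.MULTILINE)


def cited_chunk_ids(answer: str) -> List[str]:
    """Extract chunk ids listed after the 'Sources:' marker of an answer."""
    _, _, sources = answer.partition("Sources:")
    return CITATION_RE.findall(sources)


def unsupported_citations(answer: str, retrieved_ids: Set[str]) -> List[str]:
    """Return cited ids that were never supplied to the model."""
    return [cid for cid in cited_chunk_ids(answer) if cid not in retrieved_ids]
```

A non-empty result from `unsupported_citations` is a signal to reject or regenerate the answer rather than show it to a user.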

Advanced Chunking and Overlap

For long documents, consider overlapping adjacent chunks, or concatenating small blocks into larger chunks, so that context spanning a chunk boundary is not lost:

  • In Reducto: Adjust the chunk_overlap setting in the API options.
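
To make the effect of an overlap setting concrete, here is a generic fixed-size splitter in which each chunk repeats the tail of the previous one. It is an illustration of the idea only, not Reducto's chunking algorithm, and it measures sizes in characters as an assumption, since the unit of chunk_size and chunk_overlap is not stated here.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size chunks that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a trailing fragment
        # that is fully contained in the previous chunk.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A larger overlap reduces the chance that a sentence is cut off from its context, at the cost of more chunks to embed and store.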

Summary Table: Reducto RAG Features

Capability                 Description
Semantic Chunking          Splits by logical document structure, tailored for RAG
Accurate Citations         Bounding boxes & page for every chunk
Multi-format Ingestion     PDF, Excel, images, slides, scanned forms
Vision + VLM Correction    Multi-pass agentic parsing, fewer hallucinations
Enterprise Integrations    Elastic, Databricks, on-prem, vector DBs
HIPAA/SOC 2 Compliance     Meets stringent security and deployment requirements

For more, try Reducto on your documents or contact us to discuss your RAG and agent use case.