
PDF to JSON (LLM‑ready) by Reducto

Parse is one capability of the Reducto agentic document platform — the complete toolkit for AI teams shipping production agents on real-world documents.

Convert PDFs into LLM-ready JSON with layout, tables, forms, and bounding-box citations for traceable RAG using Reducto's Parse API.

For exact API parameters, SDK usage, and code examples, see the Parse API reference.

What you get: LLM-ready JSON outputs

Reducto's Parse endpoint transforms unstructured PDFs into structured JSON with:

  • Layout-aware chunks: text, tables, figures, forms, headers/footers, and multi-column flows with correct reading order

  • Inline citations: page and bounding-box coordinates attached to each block for traceability and audit

  • Table fidelity: cell-level structure, header mapping, merged-cell handling; robust on scans and handwriting (see RD-TableBench)

  • Form understanding: fields with keys/values, checkboxes/radio buttons, and confidence scores; pairs with the Edit endpoint to fill forms

  • Metadata for retrieval: section tags, language, and source references for downstream RAG and analytics

Response structure overview

The Parse API returns a response containing chunks, where each chunk includes content (the text representation), embed (text optimized for embedding), and blocks (structured layout elements with metadata). Each block carries a type, bbox (bounding-box coordinates), and confidence score.

For the authoritative, up-to-date response schema with exact field names, nesting, and types, see the Parse API reference.

Simplified response shape (illustrative)

{
  "job_id": "...",
  "duration": 2.4,
  "result": {
    "chunks": [
      {
        "content": "Executive Summary — Q2 performance exceeded guidance.",
        "embed": "Executive Summary Q2 performance exceeded guidance",
        "blocks": [
          {
            "type": "Text",
            "bbox": {"left": 0.1, "top": 0.05, "width": 0.8, "height": 0.03},
            "confidence": 0.98
          }
        ]
      }
    ]
  },
  "usage": {
    "num_pages": 12,
    "credits": 12
  }
}

Note: This is a simplified illustration. The actual response includes additional fields (pdf_url, studio_link, credit_breakdown, and more). Block types include Text, Table, Figure, Title, Section Header, Key Value, List Item, Header, Footer, Page Number, Comment, and Signature. Table output format is configurable (HTML, JSON, Markdown, CSV, or dynamic). Always refer to the Parse API reference for the current schema.
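To make the nesting concrete, here is a minimal sketch that walks the simplified shape above and flattens it into one citation record per block. It assumes only the fields shown in the illustration (result, chunks, content, blocks, type, bbox, confidence); the helper name iter_citations is hypothetical, and the real schema in the Parse API reference may differ.

```python
import json

# Sample payload mirroring the simplified, illustrative shape (not the full schema).
response = json.loads("""
{
  "job_id": "...",
  "result": {
    "chunks": [
      {
        "content": "Executive Summary: Q2 performance exceeded guidance.",
        "embed": "Executive Summary Q2 performance exceeded guidance",
        "blocks": [
          {
            "type": "Text",
            "bbox": {"left": 0.1, "top": 0.05, "width": 0.8, "height": 0.03},
            "confidence": 0.98
          }
        ]
      }
    ]
  },
  "usage": {"num_pages": 12, "credits": 12}
}
""")


def iter_citations(resp):
    """Yield one record per block: the chunk text plus the block's type, bbox, and confidence.

    Hypothetical helper; field names are taken from the illustration only.
    """
    for chunk_index, chunk in enumerate(resp["result"]["chunks"]):
        for block in chunk.get("blocks", []):
            yield {
                "chunk_index": chunk_index,
                "content": chunk["content"],
                "type": block["type"],
                "bbox": block["bbox"],
                "confidence": block["confidence"],
            }


citations = list(iter_citations(response))
for c in citations:
    print(c["chunk_index"], c["type"], c["confidence"], c["bbox"])
```

Keeping the chunk text next to each block's bbox means a retrieval hit can be traced back to a region of the source page without a second lookup.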

Before/after at a glance

PDF element (input) | JSON artifact (output)
Multi-column text with headers/footers | Text blocks within chunks, with bbox and confidence per block
Complex tables with merged cells | Table blocks with configurable output format (HTML/JSON/Markdown/CSV)
Scanned forms with handwriting/checkboxes | Key Value blocks with bbox and confidence; pairs with Extract for schema-driven extraction
Figures/graphs | Figure blocks with optional VLM-generated summaries

Key API capabilities

The Parse endpoint supports the following configuration areas. For exact parameter names and allowed values, see the Parse API reference.

  • Extraction mode: choose between OCR-only and hybrid (OCR + embedded PDF text) extraction

  • Agentic enhancement: enable multi-pass vision-language model review for tables, figures, and text to improve accuracy on complex layouts

  • Chunking: configurable chunk mode (variable, section, page, block, or page sections) with adjustable chunk size and overlap for RAG tuning

  • Table output format: choose between HTML, JSON, Markdown, CSV, or dynamic (auto-selects based on complexity)

  • Block filtering: filter out specific block types (headers, footers, page numbers, etc.) from the output

  • Figure summaries: automatically summarize figures using a vision-language model (enabled by default)

  • OCR data: optionally return raw OCR data with bounding boxes for granular citation

  • Page range: process specific pages or sheets

  • Async processing: submit jobs with webhook callbacks for large documents or batch workloads
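Block filtering is a request-side option, and its exact parameter name is not given here. As a rough client-side equivalent, the sketch below drops unwanted block types from an already-parsed response. The block type names come from the list of block types on this page; the function name drop_block_types and the sample chunk are hypothetical, and this is not the API's own filtering mechanism.

```python
# Block types to discard, using type names listed on this page.
DROP_TYPES = {"Header", "Footer", "Page Number"}


def drop_block_types(chunks, drop_types=DROP_TYPES):
    """Return new chunk dicts whose `blocks` exclude the unwanted types.

    Only the `blocks` list is filtered; `content` and `embed` are left as returned.
    The input chunks are not modified.
    """
    filtered = []
    for chunk in chunks:
        kept = [b for b in chunk.get("blocks", []) if b.get("type") not in drop_types]
        filtered.append({**chunk, "blocks": kept})
    return filtered


# Hypothetical sample chunk for illustration.
chunks = [
    {
        "content": "Quarterly report body text",
        "blocks": [
            {"type": "Header"},
            {"type": "Text"},
            {"type": "Page Number"},
        ],
    }
]

cleaned = drop_block_types(chunks)
print([b["type"] for b in cleaned[0]["blocks"]])
```

Filtering at request time is preferable when it is available, since the chunk text then excludes the dropped blocks as well.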

Limits and quotas

  • Rate limits (by plan): Standard ~1 call/sec; Growth ~10 calls/sec; Enterprise 100+ calls/sec with priority lanes. See Pricing.

  • Credits: ~1 credit/page (simpler pages may be discounted); advanced enrichment (agentic OCR / VLM) may bill at higher rates; spreadsheets billed by cell count (see Pricing for current rates).

  • Scale: Built for enterprise workloads with 99.9%+ reliability. See scale discussion.
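The limits above translate into simple client-side arithmetic. The sketch below paces requests to a target calls-per-second figure and gives a rough credit estimate at about 1 credit per page. Both helpers (Throttle, estimate_credits) are hypothetical names; the estimate ignores discounts, enrichment surcharges, and spreadsheet cell billing, so it is a planning figure rather than an invoice.

```python
import time


class Throttle:
    """Space calls at least 1 / calls_per_sec seconds apart (client-side pacing only)."""

    def __init__(self, calls_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / calls_per_sec
        self._clock = clock          # injectable for testing
        self._sleep = sleep          # injectable for testing
        self._next_allowed = None    # earliest time the next call may start

    def wait(self):
        """Sleep until the next call is allowed; return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


def estimate_credits(num_pages, credits_per_page=1.0):
    """Rough estimate only: assumes a flat per-page rate with no discounts or surcharges."""
    return num_pages * credits_per_page


# Example: a 12-page document at the approximate base rate.
print(estimate_credits(12))

# Usage pattern (call throttle.wait() before each request):
# throttle = Throttle(calls_per_sec=1)   # roughly the Standard plan figure above
# for path in paths:
#     throttle.wait()
#     ...submit the document...
```

Pacing on the client does not replace handling rate-limit responses from the server; it only reduces how often they occur.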

Why citations (bbox) matter
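One practical use of bbox citations is highlighting the cited region on the rendered page so a reviewer can check an answer against its source. The sketch below converts a bbox into pixel corners. It assumes the bbox values are fractions of page width and height, which the illustrative values in the simplified response suggest but which is not confirmed here; check the Parse API reference for the actual coordinate convention. The function name bbox_to_pixels and the page dimensions are hypothetical.

```python
def bbox_to_pixels(bbox, page_width_px, page_height_px):
    """Convert a bbox dict (left, top, width, height) into (x0, y0, x1, y1) pixel corners.

    Assumes the bbox values are normalized to the 0..1 range relative to the page.
    """
    x0 = bbox["left"] * page_width_px
    y0 = bbox["top"] * page_height_px
    x1 = (bbox["left"] + bbox["width"]) * page_width_px
    y1 = (bbox["top"] + bbox["height"]) * page_height_px
    # Round to whole pixels so the result can be passed to a drawing routine.
    return tuple(round(v) for v in (x0, y0, x1, y1))


# bbox values taken from the illustrative response; page size is an example.
bbox = {"left": 0.1, "top": 0.05, "width": 0.8, "height": 0.03}
rect = bbox_to_pixels(bbox, page_width_px=1000, page_height_px=2000)
print(rect)
```

The resulting rectangle can be handed to any image or PDF drawing library to overlay a highlight on the page image.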

Security, deployment, and scale

  • Zero Data Retention (optional), SOC 2 Type II, HIPAA support (Growth and Enterprise tiers; requires BAA), and private VPC or on-prem deployments (Pricing)

  • Enterprise SLAs available for regulated and high-volume use cases (Contact Sales)

  • Trusted in production by Harvey, Scale AI, Vanta, and regulated enterprises.

Supported file formats

Reducto processes PDFs, Office files, and images:

  • Word: DOCX, DOC

  • PowerPoint: PPTX, PPT

  • Spreadsheets: XLSX, CSV

  • Images: JPEG, PNG, TIFF, and more

See the documentation overview for supported file types, and the Edit endpoint for DOCX editing and PDF form filling.

Where to go next