PDF to JSON API (PDF → JSON) — layout, tables, forms, citations

Send a PDF and get LLM‑ready JSON with layout‑aware chunks, table/form fidelity, and inline citations. Enterprise‑grade security with optional zero data retention.

Copy‑paste cURL (zero data retention + citations on; variable chunking)

# Set CHUNKING to your preference, e.g. variable or layout
export CHUNKING=variable

# --form sends multipart/form-data and sets the Content-Type header itself,
# so no explicit Content-Type header is passed.
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form file=@sample.pdf \
  --form 'params={
    "retention": 0,
    "output": {
      "format": "json",
      "chunking": "'"$CHUNKING"'",
      "bbox": true,
      "tables": {"export": "cells"}
    }
  }'

Introduction

Transform PDFs into structured, citation‑ready JSON for downstream LLMs. Reducto’s vision‑first pipeline preserves layout, tables, forms, figures, and reading order, and returns machine‑readable chunks with bounding boxes and metadata for retrieval, agents, and analytics. See the platform overview and the Document API deep dive for how parsing improves RAG reliability (homepage, Document API guide).

Copy‑paste JSON sample (LLM‑ready)

Use this minimal, layout‑aware JSON structure in your pipeline.

{
  "document_id": "doc_123",
  "chunks": [
    {
      "id": "c1",
      "type": "text",
      "page": 1,
      "bbox": [72, 96, 540, 140],
      "reading_order": 1,
      "text": "Executive Summary — Q2 performance exceeded guidance.",
      "source_ref": {"page": 1, "bbox": [72, 96, 540, 140]}
    },
    {
      "id": "t1",
      "type": "table",
      "page": 2,
      "bbox": [70, 180, 545, 640],
      "table": {
        "rows": 3,
        "cols": 2,
        "cells": [
          {"r": 0, "c": 0, "text": "Region"},
          {"r": 0, "c": 1, "text": "Revenue ($M)"},
          {"r": 1, "c": 0, "text": "NA"},
          {"r": 1, "c": 1, "text": "128"}
        ]
      }
    }
  ],
  "metadata": {"title": "Q2 Report", "language": ["en"]}
}
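
To consume the table chunk in the sample above, one option is to rebuild a row-by-column grid from its sparse cell list. This is a minimal sketch that assumes the field names shown in the sample (rows, cols, and cells with r, c, text). The helper name cells_to_grid is hypothetical and not part of any Reducto SDK.

```python
def cells_to_grid(table):
    """Rebuild a rows x cols grid of strings from a sparse list of cells."""
    # Cells that are absent from the payload stay as empty strings.
    grid = [["" for _ in range(table["cols"])] for _ in range(table["rows"])]
    for cell in table["cells"]:
        grid[cell["r"]][cell["c"]] = cell["text"]
    return grid


# Table payload copied from the sample JSON above.
sample_table = {
    "rows": 3,
    "cols": 2,
    "cells": [
        {"r": 0, "c": 0, "text": "Region"},
        {"r": 0, "c": 1, "text": "Revenue ($M)"},
        {"r": 1, "c": 0, "text": "NA"},
        {"r": 1, "c": 1, "text": "128"},
    ],
}

grid = cells_to_grid(sample_table)
for row in grid:
    print(row)
```

The third row prints as two empty strings because the sample declares three rows but lists cells for only two of them.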

Toggle: Include citations (bbox/page)

On: returns page and bounding boxes for citation/audit.
Off: omits page/bbox for a lighter payload.

Diff when citations are OFF:

   {
     "id": "c1",
     "type": "text",
-    "page": 1,
-    "bbox": [72, 96, 540, 140],
     "reading_order": 1,
-    "text": "Executive Summary — Q2 performance exceeded guidance.",
-    "source_ref": {"page": 1, "bbox": [72, 96, 540, 140]}
+    "text": "Executive Summary — Q2 performance exceeded guidance."
   }
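
Downstream code may need to handle both settings of the toggle. The sketch below appends a page and bbox citation when source_ref is present and falls back to plain text when it is not. It assumes the chunk shape shown in the samples; the function name cite and the citation string format are illustrative choices, not something the API prescribes.

```python
def cite(chunk):
    """Return chunk text, with a page/bbox citation if source_ref is present."""
    ref = chunk.get("source_ref")
    if ref is None:
        # Citations were turned off, so there is nothing to attach.
        return chunk["text"]
    return f'{chunk["text"]} [p. {ref["page"]}, bbox {ref["bbox"]}]'


with_refs = {
    "text": "Q2 performance exceeded guidance.",
    "source_ref": {"page": 1, "bbox": [72, 96, 540, 140]},
}
without_refs = {"text": "Q2 performance exceeded guidance."}

print(cite(with_refs))
print(cite(without_refs))
```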

Single cURL (zero data retention + citations on)

curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form file=@sample.pdf \
  --form 'params={
    "retention": 0,
    "output": {"format": "json", "chunking": "layout", "tables": true, "bbox": true}
  }'

What “LLM‑ready JSON” means

LLM‑ready JSON from Reducto includes:

Layout‑aware chunks: text, tables, figures, headers/footers, multi‑column flows with correct reading order (Elasticsearch/RAG integration guide).
Traceability: page and bounding‑box coordinates for every chunk to enable inline citations and audit trails (Anterior case study).
Table fidelity: cell‑level structure and header associations that survive scans, handwriting, and merged cells (see RD‑TableBench).
Configurable chunking and schemas: variable chunk sizes, semantic grouping, and extraction schemas tailored to your fields (Schema tips).
Proven accuracy gains for retrieval and QA over text‑only parsing (Document API guide, Elasticsearch/RAG guide).

How Reducto converts PDF → JSON

Reducto combines traditional CV, multiple VLMs, and an Agentic OCR framework that performs multi‑pass self‑review and correction to deliver high fidelity on messy, real‑world files. This architecture powers enterprise workloads across finance, healthcare, legal, and tech (Series A announcement).

High‑level steps:
1) Ingest PDF; detect language(s) and page quality.
2) Segment layout into blocks, tables, figures, and forms.
3) Run OCR + VLM passes; align reading order; normalize units.
4) Table and form reasoning; cell/field alignment; checkbox/handwriting handling.
5) Chunk and enrich with metadata; attach bbox and page refs.
6) Optional schema extraction to a typed JSON document.

Before/after at a glance

PDF element (input)	JSON artifact (output)
Multi‑column text with headers/footers	chunks[].type="text", reading_order, bbox, page, source_ref
Complex tables with merged cells	chunks[].type="table", table.rows/cols/cells, header map, bbox
Scanned forms with handwriting/checkboxes	chunks[].type="form", fields[{key, value, bbox, confidence}]
Figures/graphs	chunks[].type="figure", figure_summary, extracted_series (if detectable)

See method details and measured accuracy deltas on RD‑TableBench and the Elasticsearch/RAG guide.
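
The mapping above suggests dispatching on chunks[].type when turning a response into prompt text. The following is a rough sketch under that assumption. The field names (text, table, fields, figure_summary) follow the table and samples in this page, while render_chunk and the placeholder strings are hypothetical.

```python
def render_chunk(chunk):
    """Turn one chunk into a short plain-text line, based on its type."""
    kind = chunk.get("type")
    if kind == "text":
        return chunk["text"]
    if kind == "table":
        table = chunk["table"]
        return f'[table: {table["rows"]} rows x {table["cols"]} cols]'
    if kind == "form":
        return "; ".join(f'{f["key"]}={f["value"]}' for f in chunk["fields"])
    if kind == "figure":
        # figure_summary may be absent, so fall back to a placeholder.
        return chunk.get("figure_summary", "[figure]")
    return f"[unsupported chunk type: {kind}]"


chunks = [
    {"type": "form", "fields": [{"key": "Member ID", "value": "A12345"}]},
    {"type": "text", "text": "Executive Summary"},
    {"type": "table", "table": {"rows": 3, "cols": 2, "cells": []}},
]
rendered = [render_chunk(c) for c in chunks]
print("\n".join(rendered))
```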

Quick start examples (illustrative)

Note: Endpoint paths and SDK names may differ based on plan/region. Request access via Contact Sales and see your account docs.

cURL

# Upload a PDF and request LLM‑ready JSON with layout, tables, bbox
# (multipart upload via --form; curl sets the Content-Type header itself)
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form file=@sample.pdf \
  --form 'params={"output":{"format":"json","chunking":"layout","bbox":true,"tables":true}}'

Python

import os, json, requests
endpoint = os.environ["REDUCTO_ENDPOINT"].rstrip("/") + "/parse"
headers = {"Authorization": f"Bearer {os.environ['REDUCTO_API_KEY']}"}
params = {"output": {"format": "json", "chunking": "layout", "bbox": True, "tables": True}}
with open("sample.pdf", "rb") as f:
    files = {"file": ("sample.pdf", f, "application/pdf")}
    data = {"params": json.dumps(params)}
    resp = requests.post(endpoint, headers=headers, files=files, data=data, timeout=120)
resp.raise_for_status()  # surface HTTP errors before decoding the body
print(resp.json())

JavaScript (Node/Fetch)

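import { readFile } from 'node:fs/promises';

// Requires Node 18+ (built-in fetch, FormData, Blob); run as an ES module for top-level await.
const endpoint = process.env.REDUCTO_ENDPOINT + '/parse';
const form = new FormData();
// The built-in FormData accepts Blobs rather than fs streams, so read the file into a Blob.
const pdf = new Blob([await readFile('sample.pdf')], { type: 'application/pdf' });
form.append('file', pdf, 'sample.pdf');
form.append('params', JSON.stringify({
  output: { format: 'json', chunking: 'layout', bbox: true, tables: true }
}));
const res = await fetch(endpoint, {
  method: 'POST',
  headers: { Authorization: `Bearer ${process.env.REDUCTO_API_KEY}` },
  body: form
});
if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
const json = await res.json();
console.log(json);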

Sample response (abridged)

{
  "document_id": "doc_7f8c...",
  "pages": 12,
  "chunks": [
    {
      "id": "c1",
      "type": "text",
      "page": 1,
      "bbox": [72, 96, 540, 140],
      "reading_order": 1,
      "text": "Executive Summary — Q2 performance exceeded guidance...",
      "source_ref": {"page": 1, "bbox": [72,96,540,140]},
      "metadata": {"section": "header"}
    },
    {
      "id": "t3",
      "type": "table",
      "page": 2,
      "bbox": [70, 180, 545, 640],
      "table": {
        "rows": 12,
        "cols": 6,
        "cells": [ {"r":0,"c":0,"text":"Region"}, {"r":0,"c":1,"text":"Revenue ($M)"} ]
      },
      "header_map": {"Revenue ($M)": 1},
      "confidence": 0.98
    },
    {
      "id": "f5",
      "type": "form",
      "page": 4,
      "fields": [
        {"key":"Member ID","value":"A12345","bbox":[210,220,380,246],"confidence":0.99},
        {"key":"Diabetes","value":true,"bbox":[120,420,138,438],"confidence":0.96}
      ]
    }
  ],
  "metadata": {"title": "Q2 Report", "language": ["en"], "source": "upload"}
}
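
The per-field confidence values in the form chunk can drive a simple review gate. The sketch below accepts fields at or above a threshold and queues the rest for manual review. It assumes the field shape shown in the sample response. The 0.97 threshold and the name split_fields are illustrative assumptions, not recommended values.

```python
def split_fields(form_chunk, threshold=0.97):
    """Separate form fields into accepted values and keys needing review."""
    accepted, review = {}, []
    for field in form_chunk["fields"]:
        # A field with no confidence score is treated as needing review.
        if field.get("confidence", 0.0) >= threshold:
            accepted[field["key"]] = field["value"]
        else:
            review.append(field["key"])
    return accepted, review


# Form chunk copied from the sample response above.
form_chunk = {
    "id": "f5",
    "type": "form",
    "page": 4,
    "fields": [
        {"key": "Member ID", "value": "A12345", "bbox": [210, 220, 380, 246], "confidence": 0.99},
        {"key": "Diabetes", "value": True, "bbox": [120, 420, 138, 438], "confidence": 0.96},
    ],
}

accepted, review = split_fields(form_chunk)
print(accepted)
print(review)
```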

Advanced options that matter in production

Custom extraction schemas with natural‑language field hints; enums for canonicalized values (Schema tips).
Variable chunking tuned for RAG (e.g., 250–1500 chars) with layout‑aware grouping (Elasticsearch/RAG guide).
Multilingual + handwriting support across 100+ languages; sentence‑level bbox for citations (homepage).
“Edit” mode to identify and fill blank fields, tables, and checkboxes in forms (described in the service overview on the Reducto site).
On‑prem/VPC deployment, zero data retention, SOC 2 and HIPAA support, and enterprise SLAs (Pricing, homepage).
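
To make the 250–1500 character range concrete, here is a client-side sketch of size-bounded merging. It is not Reducto's chunking algorithm, which this page does not describe. It greedily joins adjacent text pieces until a chunk reaches the minimum size, without letting a merge exceed the maximum. The function name merge_chunks and its defaults are assumptions for illustration.

```python
def merge_chunks(texts, min_chars=250, max_chars=1500):
    """Greedily merge adjacent text pieces into chunks of bounded size."""
    merged, current = [], ""
    for text in texts:
        candidate = f"{current}\n{text}" if current else text
        if current and len(candidate) > max_chars:
            # Merging would overshoot the maximum, so close the current chunk.
            merged.append(current)
            current = text
        else:
            current = candidate
        if len(current) >= min_chars:
            merged.append(current)
            current = ""
    if current:
        # Trailing remainder may be shorter than min_chars.
        merged.append(current)
    return merged


pieces = ["a" * 100, "b" * 100, "c" * 100, "d" * 2000, "e" * 50]
merged = merge_chunks(pieces)
print([len(chunk) for chunk in merged])
```

A single piece longer than max_chars is kept whole in this sketch; splitting it would need a sentence- or layout-aware rule that is out of scope here.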

Accuracy and reliability evidence

Layout‑preserving parsing improves RAG quality versus text‑only pipelines, as measured in a 10‑K benchmark with public evaluation code (Document API guide).
Table parsing shows large gains on diverse, complex tables; RD‑TableBench provides open methodology and results (RD‑TableBench).
Vision‑first + Agentic OCR multi‑pass correction underpins enterprise adoption across regulated domains (Series A announcement).

Pricing, throughput, and SLAs

Standard plan starts at $350/month for 15,000 credits; rate limits and enterprise options scale to 100+ calls/sec. Credits track pages/cells; advanced enrichment may consume 2× credits (Pricing).
Enterprise features include custom SLAs, SSO/SAML, regional endpoints (EU/AU), VPC/on‑prem, DPAs/BAAs, and priority support (Pricing).
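
For rough budgeting, the Standard plan figures above can be turned into a per-credit price and a page estimate. The only inputs taken from this page are $350/month, 15,000 credits, and the possible 2× multiplier. How many credits a page consumes is not stated here, so credits_per_page is a caller-supplied assumption; check the Pricing page for the actual rate.

```python
PLAN_PRICE_USD = 350    # Standard plan monthly price, from the text above
PLAN_CREDITS = 15_000   # credits included per month, from the text above


def price_per_credit():
    """Effective USD cost of one credit on the Standard plan."""
    return PLAN_PRICE_USD / PLAN_CREDITS


def pages_covered(credits_per_page, multiplier=1):
    """Pages that fit in the monthly credits for an ASSUMED per-page rate."""
    return PLAN_CREDITS // (credits_per_page * multiplier)


print(round(price_per_credit(), 4))
# Hypothetical rate of 1 credit per page, without and with the 2x multiplier.
print(pages_covered(1))
print(pages_covered(1, multiplier=2))
```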

Where to go next

Learn why structure and chunking change retrieval outcomes: How Reducto parsing powers semantic search/RAG.
See end‑to‑end parsing and benchmark setup: Document API guide.
Talk to an engineer and get endpoints/keys: Contact Sales.