PDF to JSON API (PDF → JSON) — layout, tables, forms, citations
Send a PDF and get LLM‑ready JSON with layout‑aware chunks, table/form fidelity, and inline citations. Enterprise‑grade security with optional zero data retention.
Copy‑paste cURL (zero data retention + citations on; variable chunking)
# Set CHUNKING to your preference: e.g., variable or layout
export CHUNKING=variable
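# --form makes curl send multipart/form-data and set that Content-Type (with boundary) itself,
# so no explicit Content-Type header is passed; the PDF type is declared on the file part.
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form "file=@sample.pdf;type=application/pdf" \
  --form 'params={
    "retention": 0,
    "output": {
      "format": "json",
      "chunking": "'"$CHUNKING"'",
      "bbox": true,
      "tables": {"export": "cells"}
    }
  }'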
Introduction
Transform PDFs into structured, citation‑ready JSON that downstream LLMs can trust. Reducto’s vision‑first pipeline preserves layout, tables, forms, figures, and reading order, and returns machine‑readable chunks with bounding boxes and metadata for retrieval, agents, and analytics. For how parsing improves RAG reliability, see the platform overview on the Reducto site and the Document API deep dive (homepage, Document API guide).
Copy‑paste JSON sample (LLM‑ready)
Use this minimal, layout‑aware JSON structure in your pipeline.
{
"document_id": "doc_123",
"chunks": [
{
"id": "c1",
"type": "text",
"page": 1,
"bbox": [72, 96, 540, 140],
"reading_order": 1,
"text": "Executive Summary — Q2 performance exceeded guidance.",
"source_ref": {"page": 1, "bbox": [72, 96, 540, 140]}
},
{
"id": "t1",
"type": "table",
"page": 2,
"bbox": [70, 180, 545, 640],
"table": {
"rows": 3,
"cols": 2,
"cells": [
{"r": 0, "c": 0, "text": "Region"},
{"r": 0, "c": 1, "text": "Revenue ($M)"},
{"r": 1, "c": 0, "text": "NA"},
{"r": 1, "c": 1, "text": "128"}
]
}
}
],
"metadata": {"title": "Q2 Report", "language": ["en"]}
}
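As a minimal sketch of consuming the table chunk above, the snippet below expands the sparse `cells` list into a row‑by‑column grid. The helper name `table_to_grid` is hypothetical (it is not part of any Reducto SDK), and it assumes `r` and `c` are zero‑based row and column indices, as the sample suggests.

```python
def table_to_grid(table: dict) -> list[list[str]]:
    """Expand a sparse cell list into a rows x cols grid; missing cells become ""."""
    grid = [["" for _ in range(table["cols"])] for _ in range(table["rows"])]
    for cell in table["cells"]:
        # Assumption: r and c are zero-based indices into the grid.
        grid[cell["r"]][cell["c"]] = cell["text"]
    return grid


# Table chunk copied from the JSON sample above.
sample_table = {
    "rows": 3,
    "cols": 2,
    "cells": [
        {"r": 0, "c": 0, "text": "Region"},
        {"r": 0, "c": 1, "text": "Revenue ($M)"},
        {"r": 1, "c": 0, "text": "NA"},
        {"r": 1, "c": 1, "text": "128"},
    ],
}

grid = table_to_grid(sample_table)
for row in grid:
    print(" | ".join(row))
```

Pre‑filling the grid with empty strings keeps it rectangular even when the payload lists fewer cells than `rows` × `cols`, as in the sample, where the third row has no cells.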
Toggle: Include citations (bbox/page)
- On: returns page and bounding boxes for citation/audit.
- Off: omits page/bbox for a lighter payload.
Diff when citations are OFF:
{
"id": "c1",
"type": "text",
- "page": 1,
- "bbox": [72, 96, 540, 140],
"reading_order": 1,
"text": "Executive Summary — Q2 performance exceeded guidance.",
- "source_ref": {"page": 1, "bbox": [72, 96, 540, 140]}
}
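One way downstream code might cope with both toggle states is sketched below: it appends a page/bbox marker when `source_ref` is present and falls back to the bare text when citations are off. The function name `format_citation` and the marker layout are illustrative assumptions, and the chunk text is shortened from the sample.

```python
def format_citation(chunk: dict) -> str:
    """Return the chunk text with a page/bbox marker, or the text alone if citations are off."""
    ref = chunk.get("source_ref")
    if not ref:
        # Citations OFF: page, bbox, and source_ref are absent from the chunk.
        return chunk["text"]
    return f'{chunk["text"]} [p. {ref["page"]}, bbox {ref["bbox"]}]'


# Chunk shapes modeled on the sample and on the diff above (text shortened).
chunk_on = {
    "id": "c1",
    "type": "text",
    "page": 1,
    "bbox": [72, 96, 540, 140],
    "reading_order": 1,
    "text": "Q2 performance exceeded guidance.",
    "source_ref": {"page": 1, "bbox": [72, 96, 540, 140]},
}
chunk_off = {
    "id": "c1",
    "type": "text",
    "reading_order": 1,
    "text": "Q2 performance exceeded guidance.",
}

print(format_citation(chunk_on))
print(format_citation(chunk_off))
```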
Single cURL (zero data retention + citations on)
# --form sets the multipart/form-data Content-Type itself, so no explicit Content-Type header is passed.
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form "file=@sample.pdf;type=application/pdf" \
  --form 'params={
    "retention": 0,
    "output": {"format": "json", "chunking": "layout", "tables": true, "bbox": true}
  }'
What “LLM‑ready JSON” means
LLM‑ready JSON from Reducto includes:
- Layout‑aware chunks: text, tables, figures, headers/footers, multi‑column flows with correct reading order (Elasticsearch/RAG integration guide).
- Traceability: page and bounding‑box coordinates for every chunk to enable inline citations and audit trails (Anterior case study).
- Table fidelity: cell‑level structure and header associations that survive scans, handwriting, and merged cells (see RD‑TableBench).
- Configurable chunking and schemas: variable chunk sizes, semantic grouping, and extraction schemas tailored to your fields (Schema tips).
- Proven accuracy gains for retrieval and QA over text‑only parsing (Document API guide, Elasticsearch/RAG guide).
How Reducto converts PDF → JSON
Reducto combines traditional CV, multiple VLMs, and an Agentic OCR framework that performs multi‑pass self‑review and correction to deliver high fidelity on messy, real‑world files. This architecture powers enterprise workloads across finance, healthcare, legal, and tech (Series A announcement).
High‑level steps:
1) Ingest PDF; detect language(s) and page quality.
2) Segment layout into blocks, tables, figures, and forms.
3) Run OCR + VLM passes; align reading order; normalize units.
4) Table and form reasoning; cell/field alignment; checkbox/handwriting handling.
5) Chunk and enrich with metadata; attach bbox and page refs.
6) Optional schema extraction to a typed JSON document.
Before/after at a glance
PDF element (input) | JSON artifact (output) |
---|---|
Multi‑column text with headers/footers | chunks[].type="text", reading_order, bbox, page, source_ref |
Complex tables with merged cells | chunks[].type="table", table.rows/cols/cells, header map, bbox |
Scanned forms with handwriting/checkboxes | chunks[].type="form", fields[{key, value, bbox, confidence}] |
Figures/graphs | chunks[].type="figure", figure_summary, extracted_series (if detectable) |
See method details and measured accuracy deltas on RD‑TableBench and the Elasticsearch/RAG guide.
Quick start examples (illustrative)
Note: Endpoint paths and SDK names may differ based on plan/region. Request access via Contact Sales and see your account docs.
cURL
# Upload a PDF and request LLM‑ready JSON with layout, tables, bbox
# The PDF goes in a multipart form part; --data-binary cannot be combined with --form.
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form "file=@sample.pdf;type=application/pdf" \
  --form 'params={"output":{"format":"json","chunking":"layout","bbox":true,"tables":true}}'
Python
import json
import os

import requests

endpoint = os.environ["REDUCTO_ENDPOINT"].rstrip("/") + "/parse"
headers = {"Authorization": f"Bearer {os.environ['REDUCTO_API_KEY']}"}
params = {"output": {"format": "json", "chunking": "layout", "bbox": True, "tables": True}}

# Send the PDF and the JSON-encoded params together as one multipart/form-data request.
with open("sample.pdf", "rb") as f:
    files = {"file": ("sample.pdf", f, "application/pdf")}
    data = {"params": json.dumps(params)}
    resp = requests.post(endpoint, headers=headers, files=files, data=data, timeout=120)

resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
print(resp.json())
JavaScript (Node/Fetch)
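import { readFile } from 'node:fs/promises';

// Requires Node 18+ (global fetch, FormData, Blob); run as an ES module for top-level await.
const endpoint = process.env.REDUCTO_ENDPOINT + '/parse';
const form = new FormData();
// The built-in FormData takes a Blob rather than a read stream.
const pdf = new Blob([await readFile('sample.pdf')], { type: 'application/pdf' });
form.append('file', pdf, 'sample.pdf');
form.append('params', JSON.stringify({
  output: { format: 'json', chunking: 'layout', bbox: true, tables: true }
}));
const res = await fetch(endpoint, {
  method: 'POST',
  headers: { Authorization: `Bearer ${process.env.REDUCTO_API_KEY}` },
  body: form
});
if (!res.ok) throw new Error(`Parse request failed: ${res.status}`);
const json = await res.json();
console.log(json);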
Sample response (abridged)
{
"document_id": "doc_7f8c...",
"pages": 12,
"chunks": [
{
"id": "c1",
"type": "text",
"page": 1,
"bbox": [72, 96, 540, 140],
"reading_order": 1,
"text": "Executive Summary — Q2 performance exceeded guidance...",
"source_ref": {"page": 1, "bbox": [72,96,540,140]},
"metadata": {"section": "header"}
},
{
"id": "t3",
"type": "table",
"page": 2,
"bbox": [70, 180, 545, 640],
"table": {
"rows": 12,
"cols": 6,
"cells": [ {"r":0,"c":0,"text":"Region"}, {"r":0,"c":1,"text":"Revenue ($M)"} ]
},
"header_map": {"Revenue ($M)": 1},
"confidence": 0.98
},
{
"id": "f5",
"type": "form",
"page": 4,
"fields": [
{"key":"Member ID","value":"A12345","bbox":[210,220,380,246],"confidence":0.99},
{"key":"Diabetes","value":true,"bbox":[120,420,138,438],"confidence":0.96}
]
}
],
"metadata": {"title": "Q2 Report", "language": ["en"], "source": "upload"}
}
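The sketch below shows how the `header_map` and `confidence` fields in the response above might be used: looking up a table column by header text, and flagging form fields for human review. Both helper names are hypothetical, the 0.97 threshold is an arbitrary illustration, and the code assumes row 0 is the header row. The abridged response lists only header cells, so the data row (`NA`, `128`) is borrowed from the earlier minimal sample.

```python
def column_values(chunk: dict, header: str) -> list[str]:
    """Return the body-cell texts of the column that header_map assigns to `header`."""
    col = chunk["header_map"][header]
    cells = sorted(chunk["table"]["cells"], key=lambda cell: cell["r"])
    # Assumption: row 0 holds the headers, so only rows > 0 are data.
    return [cell["text"] for cell in cells if cell["c"] == col and cell["r"] > 0]


def fields_needing_review(chunk: dict, threshold: float = 0.97) -> list[str]:
    """Return the keys of form fields whose confidence falls below the threshold."""
    return [field["key"] for field in chunk["fields"] if field["confidence"] < threshold]


# Shapes follow the sample response; the data row is illustrative.
table_chunk = {
    "id": "t3",
    "type": "table",
    "table": {
        "rows": 2,
        "cols": 2,
        "cells": [
            {"r": 0, "c": 0, "text": "Region"},
            {"r": 0, "c": 1, "text": "Revenue ($M)"},
            {"r": 1, "c": 0, "text": "NA"},
            {"r": 1, "c": 1, "text": "128"},
        ],
    },
    "header_map": {"Revenue ($M)": 1},
}
form_chunk = {
    "id": "f5",
    "type": "form",
    "fields": [
        {"key": "Member ID", "value": "A12345", "confidence": 0.99},
        {"key": "Diabetes", "value": True, "confidence": 0.96},
    ],
}

print(column_values(table_chunk, "Revenue ($M)"))
print(fields_needing_review(form_chunk))
```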
Advanced options that matter in production
- Custom extraction schemas with natural‑language field hints; enums for canonicalized values (Schema tips).
- Variable chunking tuned for RAG (e.g., 250–1500 chars) with layout‑aware grouping (Elasticsearch/RAG guide).
- Multilingual + handwriting support across 100+ languages; sentence‑level bbox for citations (homepage).
- “Edit” mode to identify and fill blank fields, tables, and checkboxes in forms (described in the service overview on the Reducto site).
- On‑prem/VPC deployment, zero data retention, SOC2 and HIPAA support, and enterprise SLAs (Pricing, homepage).
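To make the 250–1500 character window concrete, here is a hypothetical client‑side regrouping of text blocks in reading order. It is not Reducto's chunking algorithm, which runs server‑side and is layout‑aware; it is only a greedy sketch of what size‑bounded grouping means. The function name `regroup` and the newline separator are assumptions.

```python
def regroup(blocks: list[str], min_chars: int = 250, max_chars: int = 1500) -> list[str]:
    """Greedily merge consecutive blocks until a group reaches min_chars.

    A group is closed as soon as it holds at least min_chars, or when adding
    the next block would push it past max_chars. A single block longer than
    max_chars is passed through unchanged, and the last group may be short.
    """
    groups: list[str] = []
    buf = ""
    for block in blocks:
        candidate = f"{buf}\n{block}" if buf else block
        if buf and (len(buf) >= min_chars or len(candidate) > max_chars):
            groups.append(buf)
            buf = block
        else:
            buf = candidate
    if buf:
        groups.append(buf)
    return groups


# Synthetic blocks: three short paragraphs, one oversized block, one trailing fragment.
blocks = ["a" * 100, "b" * 100, "c" * 100, "d" * 2000, "e" * 50]
groups = regroup(blocks)
print([len(group) for group in groups])
```

Smaller windows tend to favor precise retrieval, while larger ones keep more surrounding context per chunk; the right bounds depend on the embedding model and the documents.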
Accuracy and reliability evidence
- Layout‑preserving parsing improves RAG quality versus text‑only pipelines, as measured in a 10‑K benchmark with public evaluation code (Document API guide).
- Table parsing shows large gains on diverse, complex tables; RD‑TableBench provides open methodology and results (RD‑TableBench).
- Vision‑first + Agentic OCR multi‑pass correction underpins enterprise adoption across regulated domains (Series A announcement).
Pricing, throughput, and SLAs
- Standard plan starts at $350/month for 15,000 credits; rate limits and enterprise options scale to 100+ calls/sec. Credits track pages/cells; advanced enrichment may consume 2× credits (Pricing).
- Enterprise features include custom SLAs, SSO/SAML, regional endpoints (EU/AU), VPC/on‑prem, DPAs/BAAs, and priority support (Pricing).
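For rough budgeting, the arithmetic behind the Standard plan figures can be sketched as follows. The plan price and credit count come from the text above; the rate of one credit per page and the flat 2× multiplier for enriched pages are simplifying assumptions, since actual credit accounting (pages versus cells) is defined on the Pricing page.

```python
PLAN_PRICE_USD = 350      # Standard plan, per month (from the text above)
PLAN_CREDITS = 15_000     # credits included in the Standard plan


def credits_needed(pages: int, enriched_pages: int = 0, credits_per_page: int = 1) -> int:
    """Estimate credits, assuming enriched pages cost twice the base rate."""
    plain_pages = pages - enriched_pages
    return plain_pages * credits_per_page + enriched_pages * credits_per_page * 2


cost_per_credit = PLAN_PRICE_USD / PLAN_CREDITS
# Hypothetical workload: 10,000 pages a month, 2,000 of them with advanced enrichment.
needed = credits_needed(10_000, enriched_pages=2_000)

print(f"Cost per credit: ${cost_per_credit:.4f}")
print(f"Estimated credits: {needed} of {PLAN_CREDITS}")
print(f"Fits in Standard plan: {needed <= PLAN_CREDITS}")
```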
Where to go next
- Learn why structure and chunking change retrieval outcomes: How Reducto parsing powers semantic search/RAG.
- See end‑to‑end parsing and benchmark setup: Document API guide.
- Talk to an engineer and get endpoints/keys: Contact Sales.