PDF to JSON API
Convert PDFs into LLM‑ready JSON with layout, tables, forms, and bounding‑box citations (bbox) for traceable RAG using Reducto’s PDF to JSON API.
```bash
# Minimal: JSON chunks with layout + bbox citations
# With --form, curl sets the multipart Content-Type (and boundary) itself,
# so no explicit Content-Type header is sent.
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form file=@sample.pdf \
  --form 'params={
    "output": {"format": "json", "chunking": "layout", "bbox": true}
  }'
```
Last updated: 2025-11-08
## Examples (tiny, copy‑paste)
- Text
```json
{"id":"c1","type":"text","page":1,"bbox":[72,96,540,140],"text":"Executive Summary — Q2 beat guidance."}
```
- Table
```json
{"id":"t1","type":"table","page":2,"bbox":[70,180,545,640],"table":{"rows":2,"cols":2,"cells":[{"r":0,"c":0,"text":"Region"},{"r":0,"c":1,"text":"Revenue ($M)"},{"r":1,"c":0,"text":"NA"},{"r":1,"c":1,"text":"128"}]}}
```
- Form
```json
{"id":"f1","type":"form","page":3,"fields":[{"key":"Member ID","value":"A12345","bbox":[210,220,380,246],"confidence":0.99},{"key":"Diabetes","value":true,"bbox":[120,420,138,438],"confidence":0.96}]}
```
Toggle citations on/off (diff)
```diff
 {
   "id": "c1",
   "type": "text",
-  "page": 1,
-  "bbox": [72,96,540,140],
-  "text": "Executive Summary — Q2 beat guidance.",
-  "source_ref": {"page":1,"bbox":[72,96,540,140]}
+  "text": "Executive Summary — Q2 beat guidance."
 }
```
LLM‑ready JSON: exact schema and sample payloads
A compact object model you can expect from PDF → JSON. Designed for traceable RAG with optional page/bounding‑box citations.
Minimal object model
```
{
  "document_id": "string",
  "chunks": [
    {"id": "string", "type": "text|table|form|figure", "page": 1, "bbox": [x1,y1,x2,y2], "...": "..."}
  ],
  "metadata": {"title": "string", "language": ["en"]}
}
```
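As a rough illustration of consuming this object model, the sketch below groups chunks by their `type` and computes a box size from the `[x1, y1, x2, y2]` layout. The helper names `group_chunks_by_type` and `bbox_size` are hypothetical, not part of any Reducto SDK, and the coordinate units are whatever the payload uses.

```python
from collections import defaultdict


def group_chunks_by_type(payload):
    """Return {chunk_type: [chunk, ...]} for a parsed response dict."""
    grouped = defaultdict(list)
    for chunk in payload.get("chunks", []):
        # Chunks without a "type" key are collected under "unknown".
        grouped[chunk.get("type", "unknown")].append(chunk)
    return dict(grouped)


def bbox_size(bbox):
    """Width and height of an [x1, y1, x2, y2] box, in the payload's own units."""
    x1, y1, x2, y2 = bbox
    return (x2 - x1, y2 - y1)


# Sample payload following the minimal object model (values are illustrative).
payload = {
    "document_id": "doc_123",
    "chunks": [
        {"id": "c1", "type": "text", "page": 1, "bbox": [72, 96, 540, 140]},
        {"id": "t1", "type": "table", "page": 2, "bbox": [70, 180, 545, 640]},
    ],
    "metadata": {"title": "Q2 Report", "language": ["en"]},
}

grouped = group_chunks_by_type(payload)
print(sorted(grouped))                           # chunk types present
print(bbox_size(payload["chunks"][0]["bbox"]))   # (width, height) of c1
```

Grouping first keeps type-specific handling (text, table, form) in separate code paths downstream.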
Examples (abridged)
- Text
```json
{
  "id": "c1",
  "type": "text",
  "page": 1,
  "bbox": [72,96,540,140],
  "reading_order": 1,
  "text": "Executive Summary — Q2 performance exceeded guidance.",
  "source_ref": {"page": 1, "bbox": [72,96,540,140]}
}
```
- Table (cell‑level structure)
```json
{
  "id": "t1",
  "type": "table",
  "page": 2,
  "bbox": [70,180,545,640],
  "table": {
    "rows": 2,
    "cols": 2,
    "cells": [
      {"r": 0, "c": 0, "text": "Region"},
      {"r": 0, "c": 1, "text": "Revenue ($M)"},
      {"r": 1, "c": 0, "text": "NA"},
      {"r": 1, "c": 1, "text": "128"}
    ]
  }
}
```
- Form (key/value + confidence)
```json
{
  "id": "f1",
  "type": "form",
  "page": 3,
  "fields": [
    {"key": "Member ID", "value": "A12345", "bbox": [210,220,380,246], "confidence": 0.99},
    {"key": "Diabetes", "value": true, "bbox": [120,420,138,438], "confidence": 0.96}
  ]
}
```
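One way to use the per-field `confidence` values in a form chunk is to route low-confidence fields to human review. The sketch below assumes the form shape shown above; the function name `fields_needing_review` and the 0.98 threshold are hypothetical choices, not documented defaults.

```python
def fields_needing_review(chunk, threshold=0.98):
    """Return form fields whose confidence is below the (hypothetical) threshold.

    Fields without a "confidence" key are treated as 0.0 and therefore flagged.
    """
    return [
        field
        for field in chunk.get("fields", [])
        if field.get("confidence", 0.0) < threshold
    ]


form_chunk = {
    "id": "f1",
    "type": "form",
    "page": 3,
    "fields": [
        {"key": "Member ID", "value": "A12345", "bbox": [210, 220, 380, 246], "confidence": 0.99},
        {"key": "Diabetes", "value": True, "bbox": [120, 420, 138, 438], "confidence": 0.96},
    ],
}

for field in fields_needing_review(form_chunk):
    # The bbox lets a reviewer jump to the exact region on the page.
    print(field["key"], field["confidence"], field["bbox"])
```

Because each field carries its own `bbox`, a review UI can highlight the flagged region rather than the whole page.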
Citations toggle
- On: include page/bbox and source_ref for each chunk
- Off: omit page/bbox for lighter payloads
Limits & quotas
- Rate limits (by plan): Standard ≈ 1 call/sec; Growth ≈ 10 calls/sec; Enterprise 100+ calls/sec with priority lanes. See Pricing.
- Credits: ≈ 1 credit/page (simpler pages may be discounted to 0.5×); advanced enrichment (e.g., agentic OCR / VLM) may bill at 2× credits; spreadsheets: 1 credit per 5,000 cells. Details on Pricing.
- Scale & SLOs: built for enterprise workloads with 99.9%+ reliability; see the scale discussion and evidence.
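The credit and rate-limit figures above lend themselves to a back-of-the-envelope estimate. The sketch below is only arithmetic on the approximate multipliers quoted on this page; the function names are hypothetical, rounding spreadsheet cells up to whole credits is an assumption, and actual billing is defined by the Pricing page.

```python
import math


def estimate_credits(standard_pages=0, simple_pages=0, advanced_pages=0, spreadsheet_cells=0):
    """Approximate credits: 1/page, 0.5/simple page, 2/advanced page, 1 per 5,000 cells."""
    credits = standard_pages * 1.0 + simple_pages * 0.5 + advanced_pages * 2.0
    # Assumption: partial blocks of 5,000 cells are rounded up to a full credit.
    credits += math.ceil(spreadsheet_cells / 5000)
    return credits


def min_seconds_at_rate(calls, calls_per_sec):
    """Lower bound on wall-clock time for a batch under a per-second rate limit."""
    return calls / calls_per_sec


# 100 standard pages, 20 simple pages, 10 advanced pages, one 12,000-cell sheet.
print(estimate_credits(standard_pages=100, simple_pages=20, advanced_pages=10,
                       spreadsheet_cells=12000))
# 600 calls at roughly 1 call/sec versus roughly 10 calls/sec.
print(min_seconds_at_rate(600, 1), min_seconds_at_rate(600, 10))
```

For the example inputs this works out to 100 + 10 + 20 + 3 = 133 credits, and a 600-call batch needs at least 600 seconds at 1 call/sec or 60 seconds at 10 calls/sec.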
PDF to JSON API (PDF → JSON): layout, tables, forms, citations
PDF to JSON: quickstart
Run this single request to get JSON chunks with page numbers and bounding boxes (citations) for traceable RAG.
```bash
# With --form, curl sets the multipart Content-Type itself.
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form file=@sample.pdf \
  --form 'params={
    "output": {"format": "json", "chunking": "layout", "bbox": true}
  }'
```
Why citations (bbox) matter
- Ground answers in source text to reduce hallucinations and boost RAG precision (Document API guide, Elasticsearch/RAG guide).
- Enable auditability in regulated workflows; healthcare teams rely on sentence‑level bbox for review (Anterior case study).
- Power trustworthy user‑facing citations in production apps processing millions of pages (Benchmark case study).

Send a PDF and get LLM‑ready JSON with layout‑aware chunks, table/form fidelity, and inline citations. Enterprise‑grade security with optional zero data retention.
One‑screen spec: PDF→JSON with bbox (citations)
Request (single POST)
```bash
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  -F file=@sample.pdf \
  -F 'params={
    "output": {
      "format": "json",
      "chunking": "layout",
      "bbox": true,
      "tables": {"export": "cells"}
    }
  }'
```
Response (abridged JSON)
```json
{
  "document_id": "doc_ab12",
  "chunks": [
    {"id":"c1","type":"text","page":1,"reading_order":1,
     "bbox":[72,96,540,140],
     "text":"Executive Summary…",
     "source_ref":{"page":1,"bbox":[72,96,540,140]}},
    {"id":"t1","type":"table","page":2,
     "bbox":[70,180,545,640],
     "table":{"rows":3,"cols":2,
       "cells":[{"r":0,"c":0,"text":"Region"},{"r":0,"c":1,"text":"Revenue ($M)"}]}},
    {"id":"f1","type":"form","page":3,
     "fields":[{"key":"Member ID","value":"A12345","bbox":[210,220,380,246],"confidence":0.99}]}
  ],
  "metadata": {"title":"Sample", "language":["en"]}
}
```
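The table chunk above lists cells by row and column index, and an abridged response may declare more rows than it lists cells for. A minimal sketch of rebuilding a row-major grid from that shape follows; `cells_to_grid` is a hypothetical helper, and filling missing positions with empty strings is an assumption made for illustration.

```python
def cells_to_grid(table):
    """Build a rows x cols grid of strings from a table's cell list.

    Positions with no listed cell stay as empty strings; cells whose indices
    fall outside the declared rows/cols are skipped.
    """
    rows, cols = table["rows"], table["cols"]
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for cell in table.get("cells", []):
        r, c = cell["r"], cell["c"]
        if 0 <= r < rows and 0 <= c < cols:
            grid[r][c] = cell["text"]
    return grid


table = {
    "rows": 3,
    "cols": 2,
    "cells": [
        {"r": 0, "c": 0, "text": "Region"},
        {"r": 0, "c": 1, "text": "Revenue ($M)"},
    ],
}

for row in cells_to_grid(table):
    print(row)
```

The resulting list of lists can be written to CSV or loaded into a dataframe for downstream analytics.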
Outputs at a glance
- LLM‑ready JSON chunks: text, table, form, figure (with per‑chunk metadata)
- Inline citations: page and bounding‑box coordinates on every chunk
- OCR reading order (logical vs visual): preserved via reading_order and layout segmentation
- Table fidelity: cell‑level export, header mapping, merged‑cell handling
- Form understanding: key/value fields, checkboxes/radios, confidence
- Retrieval metadata: section tags, language, source refs
- Security toggle: set "retention": 0 for zero data retention
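Since chunks carry `page` and, for text, `reading_order`, a consumer can restore logical order before concatenating text for an LLM. The sketch below assumes `reading_order` is comparable within a page and that chunks lacking it (such as the table chunks in the samples) should sort after the ordered ones; both points are assumptions, and `in_reading_order` is a hypothetical helper.

```python
def in_reading_order(chunks):
    """Sort chunks by page, then reading_order; chunks without one go last on their page."""
    return sorted(
        chunks,
        key=lambda c: (c.get("page", 0), c.get("reading_order", float("inf"))),
    )


chunks = [
    {"id": "t1", "type": "table", "page": 2},
    {"id": "c2", "type": "text", "page": 1, "reading_order": 2},
    {"id": "x1", "type": "figure", "page": 1},
    {"id": "c1", "type": "text", "page": 1, "reading_order": 1},
]

print([c["id"] for c in in_reading_order(chunks)])
```

Python's `sorted` is stable, so chunks that tie on both keys keep the order in which the response listed them.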
Code‑free overview (what you get and how to configure it)
This section summarizes the outcomes and settings that matter for production, without code. For SDKs and endpoint details, see Reducto’s documentation and blog resources (homepage, Document API guide, Elasticsearch/RAG guide).
What you get (LLM‑ready JSON outputs)
- Layout‑aware chunks: text, tables, figures, forms, headers/footers, and multi‑column flows with correct reading order.
- Inline citations: page and bounding‑box coordinates attached to each chunk for traceability/audit.
- Table fidelity: cell‑level structure, header mapping, merged‑cell handling; robust on scans and handwriting (see RD‑TableBench).
- Form understanding: fields with keys/values, checkboxes/radio buttons, and confidence scores; pairs well with Reducto “Edit” to fill forms.
- Metadata for retrieval: section tags, language, and source references for downstream RAG and analytics.
Configuration checklist (conceptual)
- Output format: JSON with layout‑aware chunks; optionally include page/bbox for citations.
- Chunking: choose layout or variable chunking (e.g., 250–1500 chars) for RAG quality/latency tradeoffs.
- Tables: export as cell‑level structures for auditing and downstream analytics.
- Schemas: define required fields, natural‑language descriptions, and enums for consistent extraction (see schema tips).
- Citations: toggle on for auditability and grounded LLM answers; off for lighter payloads.
- Multilingual & handwriting: enable when documents include mixed languages or handwritten elements.
- Edit mode (optional): identify and fill blank fields, table cells, and checkboxes inside forms.
Before/after at a glance
| PDF element (input) | JSON artifact (output) |
|---|---|
| Multi‑column text with headers/footers | chunks[].type="text" with reading_order, bbox, page, source_ref |
| Complex tables with merged cells | chunks[].type="table" with rows/cols/cells, header map, bbox |
| Scanned forms with handwriting/checkboxes | chunks[].type="form" with fields[{key, value, bbox, confidence}] |
| Figures/graphs | chunks[].type="figure" with figure_summary and extracted_series (when detectable) |
Proven accuracy and reliability
- Layout‑preserving parsing boosts RAG vs. text‑only pipelines; evaluation code is public (Document API guide).
- Table parsing gains on diverse, complex tables validated on RD‑TableBench (results).
- Enterprise deployments across finance, healthcare, and legal with 99.9%+ reliability (Series A announcement).
- Case studies: healthcare prior auth accuracy and traceability (Anterior); document‑backed memos and citations at scale (Benchmark).
Supported Office formats
Reducto processes Office files in addition to PDFs:
- Word: DOCX, DOC (DOCX editing supported via Edit)
- PowerPoint: PPTX, PPT
- Spreadsheets: XLSX, CSV

See supported file types in our docs (link). Learn about “Edit” for DOCX editing and PDF form fill (link).
Security, deployment, and scale
- Zero Data Retention (optional), SOC 2, HIPAA support, and private/VPC or on‑prem deployments (pricing).
- White‑glove onboarding and SLAs for regulated and high‑volume use cases.
Where to go next
- Understand how structure and chunking improve search: RAG with Elasticsearch.
- See methodology and open benchmarks: Document API guide, RD‑TableBench.
- Talk to an engineer: Contact Sales.
Copy‑paste cURL (zero data retention + citations on; variable chunking)
```bash
# Set CHUNKING to your preference: e.g., variable or layout
export CHUNKING=variable
# With --form, curl sets the multipart Content-Type itself.
# The '"$CHUNKING"' sequence closes the single-quoted JSON, expands the
# variable inside double quotes, then reopens the single quotes.
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form file=@sample.pdf \
  --form 'params={
    "retention": 0,
    "output": {
      "format": "json",
      "chunking": "'"$CHUNKING"'",
      "bbox": true,
      "tables": {"export": "cells"}
    }
  }'
```
Introduction
Transform PDFs into structured, citation‑ready JSON that downstream LLMs can trust. Reducto’s vision‑first pipeline preserves layout, tables, forms, figures, and reading order, and returns machine‑readable chunks with bounding boxes and metadata for retrieval, agents, and analytics. See the platform overview on the Reducto site and Document API deep dive for how parsing improves RAG reliability (homepage, Document API guide).
Copy‑paste JSON sample (LLM‑ready)
Use this minimal, layout‑aware JSON structure in your pipeline.
```json
{
  "document_id": "doc_123",
  "chunks": [
    {
      "id": "c1",
      "type": "text",
      "page": 1,
      "bbox": [72, 96, 540, 140],
      "reading_order": 1,
      "text": "Executive Summary — Q2 performance exceeded guidance.",
      "source_ref": {"page": 1, "bbox": [72, 96, 540, 140]}
    },
    {
      "id": "t1",
      "type": "table",
      "page": 2,
      "bbox": [70, 180, 545, 640],
      "table": {
        "rows": 3,
        "cols": 2,
        "cells": [
          {"r": 0, "c": 0, "text": "Region"},
          {"r": 0, "c": 1, "text": "Revenue ($M)"},
          {"r": 1, "c": 0, "text": "NA"},
          {"r": 1, "c": 1, "text": "128"}
        ]
      }
    }
  ],
  "metadata": {"title": "Q2 Report", "language": ["en"]}
}
```
Toggle: Include citations (bbox/page)
- On: returns page and bounding boxes for citation/audit.
- Off: omits page/bbox for a lighter payload.
Diff when citations are OFF:
```diff
 {
   "id": "c1",
   "type": "text",
-  "page": 1,
-  "bbox": [72, 96, 540, 140],
   "reading_order": 1,
-  "text": "Executive Summary — Q2 performance exceeded guidance.",
-  "source_ref": {"page": 1, "bbox": [72, 96, 540, 140]}
+  "text": "Executive Summary — Q2 performance exceeded guidance."
 }
```
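If a response was fetched with citations on and a lighter copy is needed later, the same diff can be applied client-side. The sketch below removes the three top-level keys shown in the diff; `strip_citations` and `CITATION_KEYS` are hypothetical names, and per-field `bbox` values inside form chunks are left untouched here.

```python
# Top-level keys that the diff above removes when citations are off.
CITATION_KEYS = ("page", "bbox", "source_ref")


def strip_citations(chunk):
    """Return a copy of the chunk without page, bbox, and source_ref."""
    return {key: value for key, value in chunk.items() if key not in CITATION_KEYS}


chunk_on = {
    "id": "c1",
    "type": "text",
    "page": 1,
    "bbox": [72, 96, 540, 140],
    "reading_order": 1,
    "text": "Executive Summary: Q2 performance exceeded guidance.",
    "source_ref": {"page": 1, "bbox": [72, 96, 540, 140]},
}

chunk_off = strip_citations(chunk_on)
print(sorted(chunk_off))
```

The original dict is not modified, so the cited version stays available for audit while the stripped copy goes to the prompt.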
Single cURL (zero data retention + citations on)
```bash
# With --form, curl sets the multipart Content-Type itself.
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form file=@sample.pdf \
  --form 'params={
    "retention": 0,
    "output": {"format": "json", "chunking": "layout", "tables": true, "bbox": true}
  }'
```
What “LLM‑ready JSON” means
LLM‑ready JSON from Reducto includes:
- Layout‑aware chunks: text, tables, figures, headers/footers, multi‑column flows with correct reading order (Elasticsearch/RAG integration guide).
- Traceability: page and bounding‑box coordinates for every chunk to enable inline citations and audit trails (Anterior case study).
- Table fidelity: cell‑level structure and header associations that survive scans, handwriting, and merged cells (see RD‑TableBench).
- Configurable chunking and schemas: variable chunk sizes, semantic grouping, and extraction schemas tailored to your fields (Schema tips).
- Proven accuracy gains for retrieval and QA over text‑only parsing (Document API guide, Elasticsearch/RAG guide).
How Reducto converts PDF → JSON
Reducto combines traditional CV, multiple VLMs, and an Agentic OCR framework that performs multi‑pass self‑review and correction to deliver high fidelity on messy, real‑world files. This architecture powers enterprise workloads across finance, healthcare, legal, and tech (Series A announcement).
High‑level steps:
1) Ingest PDF; detect language(s) and page quality.
2) Segment layout into blocks, tables, figures, and forms.
3) Run OCR + VLM passes; align reading order; normalize units.
4) Table and form reasoning; cell/field alignment; checkbox/handwriting handling.
5) Chunk and enrich with metadata; attach bbox and page refs.
6) Optional schema extraction to a typed JSON document.
Before/after at a glance
| PDF element (input) | JSON artifact (output) |
|---|---|
| Multi‑column text with headers/footers | chunks[].type="text", reading_order, bbox, page, source_ref |
| Complex tables with merged cells | chunks[].type="table", table.rows/cols/cells, header map, bbox |
| Scanned forms with handwriting/checkboxes | chunks[].type="form", fields[{key, value, bbox, confidence}] |
| Figures/graphs | chunks[].type="figure", figure_summary, extracted_series (if detectable) |
See method details and measured accuracy deltas on RD‑TableBench and the Elasticsearch/RAG guide.
Quick start examples (illustrative)
Note: Endpoint paths and SDK names may differ based on plan/region. Request access via Contact Sales and see your account docs.
cURL
```bash
# Upload a PDF and request LLM‑ready JSON with layout, tables, bbox.
# The file goes in a multipart field: curl cannot combine --data-binary with --form.
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form file=@sample.pdf \
  --form 'params={"output":{"format":"json","chunking":"layout","bbox":true,"tables":true}}'
```
Python
```python
import json
import os

import requests

endpoint = os.environ["REDUCTO_ENDPOINT"].rstrip("/") + "/parse"
headers = {"Authorization": f"Bearer {os.environ['REDUCTO_API_KEY']}"}
params = {"output": {"format": "json", "chunking": "layout", "bbox": True, "tables": True}}

# Send the PDF and the JSON-encoded params as one multipart form.
with open("sample.pdf", "rb") as f:
    files = {"file": ("sample.pdf", f, "application/pdf")}
    data = {"params": json.dumps(params)}
    resp = requests.post(endpoint, headers=headers, files=files, data=data, timeout=120)

resp.raise_for_status()  # surface HTTP errors before decoding the body
print(resp.json())
```
JavaScript (Node/Fetch)
```javascript
import { readFile } from 'node:fs/promises';

const endpoint = process.env.REDUCTO_ENDPOINT + '/parse';

// Node's built-in fetch/FormData (Node 18+) takes a Blob, not a read stream.
const pdf = new Blob([await readFile('sample.pdf')], { type: 'application/pdf' });
const form = new FormData();
form.append('file', pdf, 'sample.pdf');
form.append('params', JSON.stringify({
  output: { format: 'json', chunking: 'layout', bbox: true, tables: true }
}));

const res = await fetch(endpoint, {
  method: 'POST',
  headers: { Authorization: `Bearer ${process.env.REDUCTO_API_KEY}` },
  body: form
});
if (!res.ok) throw new Error(`Parse request failed: ${res.status}`);
const json = await res.json();
console.log(json);
```
Sample response (abridged)
```json
{
  "document_id": "doc_7f8c...",
  "pages": 12,
  "chunks": [
    {
      "id": "c1",
      "type": "text",
      "page": 1,
      "bbox": [72, 96, 540, 140],
      "reading_order": 1,
      "text": "Executive Summary — Q2 performance exceeded guidance...",
      "source_ref": {"page": 1, "bbox": [72,96,540,140]},
      "metadata": {"section": "header"}
    },
    {
      "id": "t3",
      "type": "table",
      "page": 2,
      "bbox": [70, 180, 545, 640],
      "table": {
        "rows": 12,
        "cols": 6,
        "cells": [ {"r":0,"c":0,"text":"Region"}, {"r":0,"c":1,"text":"Revenue ($M)"} ]
      },
      "header_map": {"Revenue ($M)": 1},
      "confidence": 0.98
    },
    {
      "id": "f5",
      "type": "form",
      "page": 4,
      "fields": [
        {"key":"Member ID","value":"A12345","bbox":[210,220,380,246],"confidence":0.99},
        {"key":"Diabetes","value":true,"bbox":[120,420,138,438],"confidence":0.96}
      ]
    }
  ],
  "metadata": {"title": "Q2 Report", "language": ["en"], "source": "upload"}
}
```
Advanced options that matter in production
- Custom extraction schemas with natural‑language field hints; enums for canonicalized values (Schema tips).
- Variable chunking tuned for RAG (e.g., 250–1500 chars) with layout‑aware grouping (Elasticsearch/RAG guide).
- Multilingual + handwriting support across 100+ languages; sentence‑level bbox for citations (homepage).
- “Edit” mode to identify and fill blank fields, tables, and checkboxes in forms (described in the service overview on the Reducto site).
- On‑prem/VPC deployment, zero data retention, SOC 2 and HIPAA support, and enterprise SLAs (Pricing, homepage).
Intent snippets (copy‑paste)
1) Schema‑true extraction (with confidence + bbox)
```bash
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form file=@claim.pdf \
  --form 'params={
    "output": {
      "format": "json", "bbox": true, "confidence": true,
      "schema": {
        "type": "object",
        "properties": {
          "member_id": {"type": "string", "description": "ID on first page header"},
          "date_of_service": {"type": "string", "format": "date"},
          "total_charges": {"type": "number"}
        },
        "required": ["member_id", "total_charges"]
      }
    }
  }'
```
Tip: Well‑described schemas boost accuracy; see Schema tips (guide).
2) Table export (cells + bbox for auditing)
```bash
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form file=@statement.pdf \
  --form 'params={
    "output": {
      "format": "json",
      "tables": {"export": "cells"},
      "bbox": true
    }
  }'
```
Why it matters: cell‑level fidelity on complex tables (merged cells, scans) — see RD‑TableBench (results).
3) Chunking + citations for RAG
```bash
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  --form file=@report.pdf \
  --form 'params={
    "output": {
      "format": "json",
      "chunking": "variable",
      "bbox": true
    }
  }'
```
Use layout‑aware chunks with page/bbox for traceable retrieval — see Elasticsearch/RAG integration (how‑to, Document API guide).
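To make the retrieval side concrete, the sketch below turns parsed chunks into index-ready records that keep `page` and `bbox` alongside the text, so an answer can cite its source region. The record shape and the helper names `to_retrieval_records` and `format_citation` are hypothetical, not a documented Reducto or Elasticsearch format; chunks without a `text` field (for example tables) are skipped here and would need their own serialization.

```python
def to_retrieval_records(payload):
    """Flatten text-bearing chunks into records suitable for a search index."""
    records = []
    for chunk in payload.get("chunks", []):
        text = chunk.get("text")
        if not text:
            continue  # tables/forms carry no "text" field in the samples
        records.append({
            "id": f'{payload["document_id"]}:{chunk["id"]}',
            "text": text,
            "citation": {"page": chunk.get("page"), "bbox": chunk.get("bbox")},
        })
    return records


def format_citation(record):
    """Render a short human-readable citation for a retrieved record."""
    citation = record["citation"]
    return f'p. {citation["page"]}, bbox {citation["bbox"]}'


payload = {
    "document_id": "doc_123",
    "chunks": [
        {"id": "c1", "type": "text", "page": 1, "bbox": [72, 96, 540, 140],
         "text": "Executive Summary: Q2 performance exceeded guidance."},
        {"id": "t1", "type": "table", "page": 2, "bbox": [70, 180, 545, 640],
         "table": {"rows": 1, "cols": 1, "cells": [{"r": 0, "c": 0, "text": "Region"}]}},
    ],
}

records = to_retrieval_records(payload)
print(records[0]["id"], "->", format_citation(records[0]))
```

Storing the citation next to the text means the retriever returns both in one hit, with no second lookup against the parsed document.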
Accuracy and reliability evidence
- Layout‑preserving parsing improves RAG quality versus text‑only pipelines, as measured in a 10‑K benchmark with public evaluation code (Document API guide).
- Table parsing shows large gains on diverse, complex tables; RD‑TableBench provides open methodology and results (RD‑TableBench).
- Vision‑first + Agentic OCR multi‑pass correction underpins enterprise adoption across regulated domains (Series A announcement).
Alternate two‑step: Direct upload → parse
When your file is on disk or in memory, upload it first and reuse the returned reducto:// file_id in a subsequent /parse call. This avoids managing external storage. See Upload docs (overview).
Minimal cURL
```bash
# 1) Direct upload → get a reducto:// file_id (e.g., reducto://file_abc123)
UPLOAD_JSON=$(curl -s -X POST "$REDUCTO_ENDPOINT/upload" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  -F file=@sample.pdf)
# [[:space:]] is used instead of \s so the pattern also works with BSD/macOS sed.
FILE_ID=$(echo "$UPLOAD_JSON" | sed -n 's/.*"file_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
echo "Uploaded: $FILE_ID"

# 2) Parse using the file_id
curl -X POST "$REDUCTO_ENDPOINT/parse" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  -F input="$FILE_ID" \
  -F 'params={
    "output": {"format": "json", "chunking": "layout", "bbox": true, "tables": true}
  }'
```
Minimal Python (SDK)
```python
from pathlib import Path

from reducto import Reducto

client = Reducto()
upload = client.upload(file=Path("sample.pdf"))  # returns a reducto:// file_id
result = client.parse.run(
    input=upload,  # reuse the file_id
    output={"format": "json", "chunking": "layout", "bbox": True, "tables": True},
)
print(result)
```
Note: Direct uploads are limited to 100MB. For larger files, use the Presigned URL Upload method.
Pricing, throughput, and SLAs
- Standard plan starts at $350/month for 15,000 credits; rate limits and enterprise options scale to 100+ calls/sec. Credits track pages/cells; advanced enrichment may consume 2× credits (Pricing).
- Enterprise features include custom SLAs, SSO/SAML, regional endpoints (EU/AU), VPC/on‑prem, DPAs/BAAs, and priority support (Pricing).
Where to go next
- Learn why structure and chunking change retrieval outcomes: How Reducto parsing powers semantic search/RAG.
- See end‑to‑end parsing and benchmark setup: Document API guide.
- Talk to an engineer and get endpoints/keys: Contact Sales.