Parse is one capability of the Reducto agentic document platform — the complete toolkit for AI teams shipping production agents on real-world documents.
Convert PDFs into LLM-ready JSON with layout, tables, forms, and bounding-box citations for traceable RAG using Reducto's Parse API.
For exact API parameters, SDK usage, and code examples, see the Parse API reference.
What you get: LLM-ready JSON outputs
Reducto's Parse endpoint transforms unstructured PDFs into structured JSON with:
- Layout-aware chunks: text, tables, figures, forms, headers/footers, and multi-column flows with correct reading order
- Inline citations: page and bounding-box coordinates attached to each block for traceability and audit
- Table fidelity: cell-level structure, header mapping, merged-cell handling; robust on scans and handwriting (see RD-TableBench)
- Form understanding: fields with keys/values, checkboxes/radio buttons, and confidence scores; pairs with the Edit endpoint to fill forms
- Metadata for retrieval: section tags, language, and source references for downstream RAG and analytics
Response structure overview
The Parse API returns a response containing chunks, where each chunk includes content (the text representation), embed (text optimized for embedding), and blocks (structured layout elements with metadata). Each block carries a type, bbox (bounding-box coordinates), and confidence score.
For the authoritative, up-to-date response schema with exact field names, nesting, and types, see the Parse API reference.
Simplified response shape (illustrative)
{
  "job_id": "...",
  "duration": 2.4,
  "result": {
    "chunks": [
      {
        "content": "Executive Summary — Q2 performance exceeded guidance.",
        "embed": "Executive Summary Q2 performance exceeded guidance",
        "blocks": [
          {
            "type": "Text",
            "bbox": {"left": 0.1, "top": 0.05, "width": 0.8, "height": 0.03},
            "confidence": 0.98
          }
        ]
      }
    ]
  },
  "usage": {
    "num_pages": 12,
    "credits": 12
  }
}
Note: This is a simplified illustration. The actual response includes additional fields (pdf_url, studio_link, credit_breakdown, and more). Block types include Text, Table, Figure, Title, Section Header, Key Value, List Item, Header, Footer, Page Number, Comment, and Signature. Table output format is configurable (HTML, JSON, Markdown, CSV, or dynamic). Always refer to the Parse API reference for the current schema.
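To show how the chunk fields fit together for RAG, here is a minimal Python sketch that turns a response shaped like the illustration into index records. It assumes only the fields shown in the simplified shape (content, embed, blocks, type, bbox, confidence). The function name to_index_records, the record keys, and the fallback from embed to content are hypothetical choices for this example, not part of the Parse API.

```python
# Sample shaped like the simplified illustration; real responses carry more fields.
sample_response = {
    "job_id": "...",
    "result": {
        "chunks": [
            {
                "content": "Executive Summary: Q2 performance exceeded guidance.",
                "embed": "Executive Summary Q2 performance exceeded guidance",
                "blocks": [
                    {
                        "type": "Text",
                        "bbox": {"left": 0.1, "top": 0.05, "width": 0.8, "height": 0.03},
                        "confidence": 0.98,
                    }
                ],
            }
        ]
    },
}


def to_index_records(response):
    """Build one record per chunk: text to embed, text to display, block citations."""
    records = []
    for chunk in response["result"]["chunks"]:
        citations = [
            {"type": b["type"], "bbox": b["bbox"], "confidence": b["confidence"]}
            for b in chunk.get("blocks", [])
        ]
        records.append(
            {
                # Falling back to content when embed is missing is an assumption
                # made for this sketch, not documented API behavior.
                "embed_text": chunk.get("embed") or chunk["content"],
                "display_text": chunk["content"],
                "citations": citations,
            }
        )
    return records


records = to_index_records(sample_response)
print(records[0]["embed_text"])
print(records[0]["citations"][0]["bbox"])
```

Keeping the bbox next to the embedded text means a retrieved chunk can be traced back to its location on the page without a second lookup.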
Before/after at a glance
| PDF element (input) | JSON artifact (output) |
|---|---|
| Multi-column text with headers/footers | Text blocks within chunks, with bbox and confidence per block |
| Complex tables with merged cells | Table blocks with configurable output format (HTML/JSON/Markdown/CSV) |
| Scanned forms with handwriting/checkboxes | Key Value blocks with bbox and confidence; pairs with Extract for schema-driven extraction |
| Figures/graphs | Figure blocks with optional VLM-generated summaries |
Key API capabilities
The Parse endpoint supports the following configuration areas. For exact parameter names and allowed values, see the Parse API reference.
- Extraction mode: choose between OCR-only and hybrid (OCR + embedded PDF text) extraction
- Agentic enhancement: enable multi-pass vision-language model review for tables, figures, and text to improve accuracy on complex layouts
- Chunking: configurable chunk mode (variable, section, page, block, or page sections) with adjustable chunk size and overlap for RAG tuning
- Table output format: choose between HTML, JSON, Markdown, CSV, or dynamic (auto-selects based on complexity)
- Block filtering: filter out specific block types (headers, footers, page numbers, etc.) from the output
- Figure summaries: automatically summarize figures using a vision-language model (enabled by default)
- OCR data: optionally return raw OCR data with bounding boxes for granular citation
- Page range: process specific pages or sheets
- Async processing: submit jobs with webhook callbacks for large documents or batch workloads
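Block filtering is configured on the API side (see the Parse API reference for the actual parameter). To illustrate the idea, the sketch below applies the same kind of filter client-side to an already parsed response. The helper name drop_block_types is hypothetical, and the response shape is assumed to match the simplified illustration. Block type names are taken from the list of block types in this document.

```python
import copy


def drop_block_types(response, excluded):
    """Return a copy of the response with blocks of the excluded types removed."""
    filtered = copy.deepcopy(response)  # leave the caller's response untouched
    for chunk in filtered["result"]["chunks"]:
        chunk["blocks"] = [
            block for block in chunk.get("blocks", []) if block["type"] not in excluded
        ]
    return filtered


# Hypothetical response with page furniture mixed in among the body text.
response = {
    "result": {
        "chunks": [
            {
                "content": "...",
                "blocks": [
                    {"type": "Header"},
                    {"type": "Text"},
                    {"type": "Page Number"},
                    {"type": "Footer"},
                ],
            }
        ]
    }
}

cleaned = drop_block_types(response, {"Header", "Footer", "Page Number"})
print([b["type"] for b in cleaned["result"]["chunks"][0]["blocks"]])
```

This sketch only prunes the blocks list; it does not rebuild the chunk's content text, which is one reason to prefer the API-side filter when it covers your case.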
Limits and quotas
- Rate limits (by plan): Standard ~ 1 call/sec; Growth ~ 10 calls/sec; Enterprise 100+ calls/sec with priority lanes. See Pricing.
- Credits: ~ 1 credit/page (simpler pages may be discounted); advanced enrichment (agentic OCR / VLM) may bill at higher rates; spreadsheets billed by cell count (see Pricing for current rates).
- Scale: Built for enterprise workloads with 99.9%+ reliability. See scale discussion.
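The limits above translate into simple client-side arithmetic: at N calls/sec, requests should be spaced at least 1/N seconds apart, and a baseline credit estimate is roughly one credit per page. The sketch below shows both. Throttle and estimate_credits are hypothetical helpers written for this example, not part of any Reducto SDK, and the estimate ignores discounts, enrichment surcharges, and spreadsheet billing.

```python
import time


def estimate_credits(num_pages, credits_per_page=1.0):
    """Baseline estimate at ~1 credit/page; actual billing may differ (see Pricing)."""
    return num_pages * credits_per_page


class Throttle:
    """Space calls at least 1/calls_per_sec seconds apart."""

    def __init__(self, calls_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / calls_per_sec
        self._clock = clock  # injectable so the spacing logic can be tested
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


# Example: pace calls for a plan limited to about 10 calls/sec.
throttle = Throttle(calls_per_sec=10)
for document in ["a.pdf", "b.pdf", "c.pdf"]:
    throttle.wait()
    # A real client would submit `document` to the Parse endpoint here.

print(estimate_credits(12))
```

A fixed-interval throttle is the simplest option; a token bucket would allow short bursts, at the cost of a little more state.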
Why citations (bbox) matter
- Ground answers in source text to reduce hallucinations and boost RAG precision (Document API guide, Elasticsearch/RAG guide)
- Enable auditability in regulated workflows; healthcare teams rely on sentence-level bbox for review (Anterior case study)
- Power trustworthy user-facing citations in production apps processing millions of pages (Benchmark case study)
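To render a citation as a highlight over a page image, the bbox has to be mapped to pixels. The sketch below assumes the bbox values are fractions of the page width and height with the origin at the top-left, which is how the illustrative values (all between 0 and 1) read; confirm the coordinate convention in the Parse API reference before relying on it. The helper name bbox_to_pixels is hypothetical.

```python
def bbox_to_pixels(bbox, page_width_px, page_height_px):
    """Convert a fractional bbox to an (x0, y0, x1, y1) pixel rectangle.

    Assumes left/top/width/height are fractions of the page size with a
    top-left origin (an assumption for this sketch).
    """
    x0 = round(bbox["left"] * page_width_px)
    y0 = round(bbox["top"] * page_height_px)
    x1 = round((bbox["left"] + bbox["width"]) * page_width_px)
    y1 = round((bbox["top"] + bbox["height"]) * page_height_px)
    return x0, y0, x1, y1


# bbox values from the simplified response illustration, on a 1000 x 2000 px page image.
bbox = {"left": 0.1, "top": 0.05, "width": 0.8, "height": 0.03}
print(bbox_to_pixels(bbox, 1000, 2000))  # (100, 100, 900, 160)
```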
Security, deployment, and scale
- Zero Data Retention (optional), SOC 2 Type II, HIPAA support (Growth and Enterprise tiers; requires BAA), and private VPC or on-prem deployments (Pricing)
- Enterprise SLAs available for regulated and high-volume use cases (Contact Sales)
- Trusted in production by Harvey, Scale AI, Vanta, and regulated enterprises.
Supported file formats
Reducto processes PDFs plus Office files and images:
- Word: DOCX, DOC
- PowerPoint: PPTX, PPT
- Spreadsheets: XLSX, CSV
- Images: JPEG, PNG, TIFF, and more
See the documentation overview for supported file types, and the Edit endpoint for DOCX editing and PDF form filling.
Where to go next
- Parse API reference -- exact parameters, SDK examples, and response schema
- Elasticsearch/RAG integration guide -- how structure and chunking improve search
- RD-TableBench -- open benchmark methodology and results
- Document API guide -- architecture overview
- Schema extraction tips -- best practices for the Extract endpoint
- Contact Sales -- custom volume, SLAs, and deployment options