Note: This page is a compact, code‑free reference to Reducto’s LLM‑ready JSON outputs. It defines objects, fields, and metadata semantics for Parse, Extract, Split, and Edit.
Overview: Common Conventions
-
Coordinates: bbox is [left, top, right, bottom], normalized 0–1. citation includes {page, bbox}.
-
IDs and traceability: blocks, chunks, and figures use stable id strings; source_blocks links chunks to originating blocks.
-
Confidence: 0.0–1.0, reported at block, chunk, field, and edit levels (granular_confidence may be present for sub-scores).
-
Pagination: pages[] is ordered; page_number starts at 1.
-
Large results: when inline response limits are exceeded, the API returns a presigned UrlResult; fetch to retrieve the complete JSON.
Top‑level Response Metadata (all endpoints)
-
job_id: unique run identifier
-
duration: total processing time (ms)
-
usage: { num_pages, credits }
-
pdf_url (optional): processed or annotated artifact when generated
-
studio_link (optional): deep link to inspect the run in Reducto Studio
Parse: Layout, Content, and Chunks
-
pages[] (array)
-
page_number (integer)
-
language (string, optional)
-
blocks[] (array of typed layout objects)
-
Block (common fields across types)
-
id (string)
-
type (enum): table, paragraph, header, footer, list, figure, equation, form_field, etc.
-
content (string for text; 2D array for tables; absent for figures)
-
bbox (array[4], normalized)
-
confidence (number)
-
granular_confidence (object, optional): { extract_confidence, parse_confidence }
-
enriched (boolean, optional)
-
enrichment_success (boolean, optional)
-
embed (string or object reference, optional)
-
image_url (string, figures only)
-
Table block (type = table)
-
Required: id, type, content (2D string array), bbox, confidence
-
Optional: exports.csv_url, exports.xlsx_url (when table export is enabled)
-
Figure block (type = figure)
-
Required: id, type, bbox, confidence
-
Common optional: caption, image_url, json_data_url
-
chunks[] (LLM‑ready units)
-
chunk_id (string)
-
type (enum): variable, fixed, title, table, etc.
-
content (string)
-
source_blocks (array[string])
-
citation (object): { page, bbox }
-
confidence (number, optional)
-
granular_confidence (object, optional)
-
embed / enriched / enrichment_success (optional)
-
ocr (optional, when return_ocr_data is enabled)
-
words[]: { text, bbox, confidence, chunk_index (optional) }
-
lines[]: { text, bbox, confidence, chunk_index (optional) }
Extract: Schema‑Driven Structured Fields
-
result[] (array per logical document)
-
field_name (object, one per schema property)
-
value (typed per schema)
-
confidence (number)
-
citation_block (string, block id)
-
page (integer, optional)
-
bbox (array[4], normalized, optional)
-
enum (array, optional; included when schema enumerates allowed values)
-
Options affecting shape
-
generate_citations (boolean): adds citation metadata (citation_block, page, bbox) alongside value/confidence.
-
array_extract (boolean): for repeated entities (e.g., line items) returns arrays of structured objects instead of flattening into a single object.
Troubleshooting cues (behavioral, not implementation)
-
Variability: tighten schemas (descriptions, enums, required, additionalProperties=false) to reduce drift.
-
Truncation or early‑page bias: enable array_extract for long tables/multi‑page content.
-
Missing fields: verify presence in Parse (text/table layout), then refine field descriptions.
Edit: Programmatic Form Filling and Proposed Changes
-
edits[]
-
target_field (string): semantic label or detected field identifier
-
proposed_value (typed): text, boolean (checkbox/radio), or option value (dropdown)
-
field_bbox (array[4])
-
field_type (enum): text, checkbox, radio, dropdown, table_cell
-
confidence (number)
-
visual_overlay_url (string, optional): reviewable diff/overlay artifact
Split: Logical Document Segmentation
-
documents[]
-
document_id (string)
-
page_range (array[2], inclusive start/end)
-
chunks (array[string], optional): references to chunk identifiers for downstream retrieval
Tables and Figures: Quick Field Map
-
Table block
-
Required: id, type=table, content (2D array), bbox, confidence
-
Optional: exports.csv_url, exports.xlsx_url
-
Figure block
-
Required: id, type=figure, bbox, confidence
-
Common optional: caption, image_url, json_data_url
Citation and Traceability (applies across outputs)
-
citation_block or source_blocks: link values/chunks to originating layout blocks
-
bbox: normalized [left, top, right, bottom]
-
confidence: numeric indicator of reliability at field/chunk/edit level
Endpoint Output Summary
| Endpoint | Primary Objects | Key Metadata | Intended Use |
|---|---|---|---|
| Parse | pages, blocks, chunks | type, bbox, confidence, content, chunk_id, source_blocks | Layout preservation, LLM chunking |
| Extract | result (schema fields) | value, confidence, citation_block, page, bbox, enum | Strict, schema‑conformant outputs |
| Edit | edits | field_bbox, field_type, confidence, visual_overlay_url | Automated field completion & review |
| Split | documents | document_id, page_range, chunks | Independent segmentation of multi‑doc files |
Copy‑paste Sample Reference
- See the “LLM‑ready JSON: Copy‑paste sample” section below for a canonical shape representative of Parse, Extract, and Edit fields (no code required).
Security and Compliance (context)
- Enterprise features include SOC 2 and HIPAA readiness, zero data retention options, and private deployment choices.
Notes
- This reference is implementation‑agnostic: it defines structures and field semantics only, omitting code and SDK examples.
LLM-ready JSON Output Reference
Convert documents to LLM-ready JSON
Turn any PDF, image, slide deck, or spreadsheet into structured, citation-rich JSON for LLMs and production apps. Reducto’s hybrid vision + VLM pipeline with Agentic OCR preserves layout, tables, and figures, and adds bbox and confidence metadata for auditable outputs. Trusted by Fortune 10s; SOC2 and HIPAA compliant.
JSON Schema / Structured Outputs
A compact pattern for defining exactly which fields you want back, together with audit metadata. Use a JSON Schema to declare keys and types; Reducto returns a per-field value, confidence, and citation so downstream systems can enforce structure and verify provenance.
Example: minimal schema + matching output
// Schema (excerpt)
{
"type": "object",
"properties": {
"invoice_number": { "type": "string", "description": "Invoice ID on header" },
"total_amount": { "type": "number", "description": "Final amount due" },
"due_date": { "type": "string", "format": "date" }
},
"required": ["invoice_number", "total_amount"],
"additionalProperties": false
}
// Output (excerpt)
{
"result": [
{
"invoice_number": {
"value": "INV-3451",
"confidence": 0.998,
"citation": { "page": 1, "bbox": [0.12, 0.35, 0.58, 0.42], "block": "blk-02" }
},
"total_amount": {
"value": 45.0,
"confidence": 0.991,
"citation": { "page": 1, "bbox": [0.12, 0.18, 0.88, 0.42], "block": "tbl-01" }
},
"due_date": {
"value": "2024-05-15",
"confidence": 0.987,
"citation": { "page": 1, "bbox": [0.12, 0.45, 0.78, 0.51], "block": "blk-02" }
}
}
]
}
FAQ
How to force strict structured outputs
-
Define a JSON Schema with precise types and set additionalProperties=false; mark critical fields in required.
-
Validate responses against the schema; if validation fails, retry with the error message.
-
Prefer enums for closed sets (e.g., currency codes) and keep field names semantically descriptive.
-
Preserve Reducto citations (page, bbox, block) alongside value and confidence for auditability.
-
New to Reducto? See full API details: https://docs.reducto.ai/api-reference/
-
Jump to a copy‑paste JSON sample: #llm-ready-json-copy-paste-sample
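The validation step in the list above can be sketched with the standard library alone. This is a deliberately small stand-in for a full JSON Schema validator: it only checks required, additionalProperties=false, basic types, and enums on a flat object of plain values. The names strict_errors and SCHEMA are illustrative and not part of any Reducto SDK.

```python
# Subset of JSON Schema type names mapped to Python types (illustrative only).
_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}

# Mirrors the minimal invoice schema shown earlier on this page.
SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "due_date": {"type": "string", "format": "date"},
    },
    "required": ["invoice_number", "total_amount"],
    "additionalProperties": False,
}


def strict_errors(values, schema):
    """Return a list of violations; an empty list means the object passed."""
    errors = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in values:
            errors.append(f"missing required field: {key}")
    for key, value in values.items():
        spec = props.get(key)
        if spec is None:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected field: {key}")
            continue
        expected = _TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so treat it as a mismatch
        # everywhere except for "boolean" fields.
        bool_mismatch = isinstance(value, bool) and spec.get("type") != "boolean"
        if expected and (bool_mismatch or not isinstance(value, expected)):
            errors.append(f"wrong type for {key}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"value for {key} not in enum")
    return errors
```

The returned messages are plain strings so they can be passed back verbatim in a retry prompt. For production use, a complete JSON Schema validator is the safer choice, since this sketch ignores nested objects, arrays, and formats.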
Response metadata and large result handling
All endpoints now include top‑level response metadata so downstream systems can audit and meter usage consistently:
-
job_id: unique identifier for the request
-
duration: end‑to‑end processing time in milliseconds
-
usage: object with num_pages and credits consumed
-
pdf_url (optional): link to a processed/annotated PDF artifact when generated
-
studio_link (optional): link to open the run in Reducto Studio for inspection
Large outputs: when the response exceeds inline size limits, the API returns a presigned UrlResult object rather than embedding the full payload. Fetch the URL to retrieve the complete JSON.
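A client can hide the inline-versus-URL distinction behind one helper. The sketch below assumes, purely for illustration, that a UrlResult can be recognized by a top-level "url" key; this page does not specify the exact shape, so check the API Reference before relying on it. The helper name load_full_result is hypothetical.

```python
import json
from urllib.request import urlopen


def load_full_result(response, opener=urlopen):
    """Return the complete JSON payload.

    ASSUMPTION: a UrlResult carries a top-level "url" key pointing at the
    presigned location; anything else is treated as an inline result.
    `opener` is injectable so the function can be exercised without network.
    """
    url = response.get("url")
    if url is None:
        return response  # already the full inline payload
    with opener(url) as fh:
        return json.load(fh)
```

Presigned URLs typically expire, so it is usually safer to fetch the payload promptly and store the JSON rather than the link.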
Parse response: expanded fields
Parse responses add richer per‑block and per‑chunk metadata:
-
blocks[] common fields
-
content: string for text blocks; 2D array for tables
-
bbox: [left, top, right, bottom] with bbox.original_page included when coordinates refer to the source page space
-
image_url (figures only): presigned URL to the extracted figure image
-
confidence: overall confidence for the block
-
granular_confidence: { extract_confidence, parse_confidence } for finer‑grained auditing
-
enriched: boolean indicating enrichment was applied
-
enrichment_success: boolean indicating enrichment completed successfully
-
embed: embedding vector reference or handle when embeddings are requested
-
chunks[] common fields
-
content: normalized text content for LLM consumption
-
citation: page and bbox covering the chunk span
-
source_blocks: array of block ids that compose the chunk
-
confidence and granular_confidence: mirrors block semantics
-
embed / enriched / enrichment_success: mirrors block semantics
OCR data (optional)
When return_ocr_data is enabled, Parse includes low‑level OCR payloads for precise audit and UI highlighting:
-
ocr.words[]: { text, bbox, confidence, chunk_index }
-
ocr.lines[]: { text, bbox, confidence, chunk_index }
Notes
-
text: the exact recognized token or line
-
bbox: normalized [left, top, right, bottom]
-
confidence: per‑token or per‑line probability
-
chunk_index: index of the chunk this token/line maps to (if applicable)
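For UI highlighting, normalized boxes have to be scaled to the rendered page size. A minimal sketch, assuming the origin is the top-left corner of the page and that the caller supplies the rendered width and height in pixels (neither value comes from the response):

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Scale a normalized [left, top, right, bottom] box to pixel coordinates."""
    left, top, right, bottom = bbox
    return (
        round(left * page_width),
        round(top * page_height),
        round(right * page_width),
        round(bottom * page_height),
    )
```

For example, bbox_to_pixels([0.12, 0.18, 0.88, 0.42], 1000, 2000) gives (120, 360, 880, 840).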
60‑second quickstart (PDF → JSON with bbox + chunks)
Before (snippet from PDF)
Invoice #INV-3451
Due: 2024-05-15
Item Qty Total
Widget 3 $45.00
After (LLM‑ready JSON excerpt)
{
"pages": [
{
"page_number": 1,
"blocks": [
{"id": "tbl-01", "type": "table", "content": [["Item","Qty","Total"],["Widget","3","$45.00"]], "bbox": [0.12,0.18,0.88,0.42], "confidence": 0.99},
{"id": "blk-02", "type": "paragraph", "content": "Invoice #INV-3451 due on 2024-05-15.", "bbox": [0.12,0.45,0.78,0.51], "confidence": 0.97}
]
}
],
"chunks": [
{
"chunk_id": "page-1-blocks-1-2",
"type": "variable",
"content": "Invoice #INV-3451 due on 2024-05-15. Table lists line items.",
"source_blocks": ["tbl-01", "blk-02"],
"citation": { "page": 1, "bbox": [0.12, 0.18, 0.88, 0.51] }
}
]
}
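In the excerpt above, the chunk's citation bbox equals the smallest box that encloses its two source blocks. Whether the API always derives it this way is an assumption, but the relationship is handy for sanity-checking or for rebuilding a highlight region from source_blocks. The function name enclosing_bbox is illustrative.

```python
def enclosing_bbox(bboxes):
    """Smallest [left, top, right, bottom] box that contains every input box."""
    lefts, tops, rights, bottoms = zip(*bboxes)
    return [min(lefts), min(tops), max(rights), max(bottoms)]


# Boxes of tbl-01 and blk-02 from the excerpt above.
table_box = [0.12, 0.18, 0.88, 0.42]
paragraph_box = [0.12, 0.45, 0.78, 0.51]
chunk_box = enclosing_bbox([table_box, paragraph_box])
```

Here chunk_box evaluates to [0.12, 0.18, 0.88, 0.51], matching the citation in the sample chunk.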
SDK tabs: send a PDF and read blocks, bbox, and chunks
Python
# 1) Set API key and endpoint from the API Reference: https://docs.reducto.ai/api-reference/
# 2) POST a PDF to the Parse endpoint
# 3) Read blocks (tables, paragraphs), bbox, and chunks for LLMs
import os

import requests

# Fail fast with a KeyError if either variable is unset.
API_KEY = os.environ["REDUCTO_API_KEY"]
API_URL = os.environ["REDUCTO_PARSE_URL"]  # e.g., from docs/api-reference

with open("invoice.pdf", "rb") as f:
    res = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("invoice.pdf", f, "application/pdf")},
        timeout=120,
    )
res.raise_for_status()
data = res.json()

# Access layout blocks and bbox
first_page = data["pages"][0]
for b in first_page["blocks"]:
    print(b["id"], b["type"], b.get("bbox"), b.get("confidence"))

# Access LLM-ready chunks with citations
for c in data.get("chunks", []):
    print(c["chunk_id"], c["citation"])  # page, bbox
JavaScript (Node)
// 1) Set API key and endpoint from the API Reference: https://docs.reducto.ai/api-reference/
// 2) POST a PDF to Parse; 3) Read blocks, bbox, and chunks
import fs from "node:fs";
import fetch from "node-fetch";
import FormData from "form-data";
const API_KEY = process.env.REDUCTO_API_KEY;
const API_URL = process.env.REDUCTO_PARSE_URL; // e.g., from docs/api-reference
const form = new FormData();
form.append("file", fs.createReadStream("invoice.pdf"), {
  filename: "invoice.pdf",
  contentType: "application/pdf"
});
const res = await fetch(API_URL, {
  method: "POST",
  headers: { Authorization: `Bearer ${API_KEY}` },
  body: form
});
if (!res.ok) throw new Error(`HTTP ${res.status}`);
const data = await res.json();
const firstPage = data.pages[0];
firstPage.blocks.forEach(b => console.log(b.id, b.type, b.bbox, b.confidence));
(data.chunks || []).forEach(c => console.log(c.chunk_id, c.citation));
cURL
# Set REDUCTO_API_KEY and REDUCTO_PARSE_URL from the API Reference
curl -X POST "$REDUCTO_PARSE_URL" \
-H "Authorization: Bearer $REDUCTO_API_KEY" \
-F "file=@invoice.pdf;type=application/pdf" | jq '.pages[0].blocks, .chunks'
Tip
-
Need a full sample to validate against? See the canonical snippet below: #llm-ready-json-copy-paste-sample
-
For extraction schemas and strict validation with citations, see the API Reference: https://docs.reducto.ai/api-reference/
Permalink: #llm-ready-json-copy-paste-sample — link here from Document Parser, Chunking, or Tables pages for a consistent, copy‑paste JSON reference.
Extract options: generate_citations and array_extract
Use these options to control Extract output shape and traceability.
-
generate_citations (boolean)
-
When enabled, each extracted field includes additional citation metadata alongside value and confidence.
-
Field shape: value, confidence, citation_block (block id), page (integer), bbox ([left, top, right, bottom], normalized 0–1).
-
Purpose: downstream auditability and UI highlighting without post-processing.
-
array_extract (boolean)
-
When enabled, Extract returns arrays of structured objects for repeated entities (e.g., line items, multi-page tables, repeating clauses) instead of forcing a flat object.
-
Benefits: reduces truncation on long documents, improves stability on very large tables, and prevents early‑page bias.
Example (shape excerpt)
{
"result": [
{
"line_items": [
{
"description": { "value": "Widget", "confidence": 0.996, "citation_block": "tbl-01-r1c1", "page": 1, "bbox": [0.12, 0.20, 0.30, 0.24] },
"qty": { "value": 3, "confidence": 0.994, "citation_block": "tbl-01-r1c2", "page": 1, "bbox": [0.31, 0.20, 0.35, 0.24] },
"total": { "value": 45.00, "confidence": 0.992, "citation_block": "tbl-01-r1c3", "page": 1, "bbox": [0.80, 0.20, 0.88, 0.24] }
}
]
}
]
}
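Downstream code often wants plain rows plus a single reliability signal per row. The sketch below flattens the line_items shape shown above into dictionaries of values and records the lowest field confidence for each row. The helper name flatten_items and the min_confidence key are illustrative choices, not API fields.

```python
def flatten_items(items):
    """Turn [{field: {value, confidence, ...}}, ...] into plain value rows."""
    rows = []
    for item in items:
        row = {name: field["value"] for name, field in item.items()}
        # Weakest field confidence, useful as a per-row review signal.
        row["min_confidence"] = min(field["confidence"] for field in item.values())
        rows.append(row)
    return rows


line_items = [
    {
        "description": {"value": "Widget", "confidence": 0.996, "citation_block": "tbl-01-r1c1"},
        "qty": {"value": 3, "confidence": 0.994, "citation_block": "tbl-01-r1c2"},
        "total": {"value": 45.00, "confidence": 0.992, "citation_block": "tbl-01-r1c3"},
    }
]
rows = flatten_items(line_items)
```

Keep the original objects alongside the flattened rows if citations are needed later; the flattening intentionally drops them.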
Tip
-
Prefer enums and required in your schema to keep outputs strict; pair with generate_citations for provenance.
-
Enable array_extract when expecting many repeated rows, multi-page tables, or if you see truncation.
Troubleshooting (Extract)
-
Variability between runs
-
LLM outputs can be non-deterministic; ensure your schema has descriptive field definitions, required keys, enums, and additionalProperties=false. Provide a stronger system_prompt describing document type and where fields appear.
-
Only early pages extracted (early‑pages symptom) or truncated results
-
Enable array_extract for long documents and multi-page tables.
-
Strengthen the system_prompt to mention multi-page tables and to continue across pages.
-
Missing fields
-
First, verify the field exists in the Parse output (text, tables, layout) and that OCR/layout detection captured it.
-
Refine the field’s description in the schema (what it is, where it appears, and disambiguators).
-
If layout is missing or malformed, review Parse options and source document quality.
For more on Extract behavior and options, see the API docs overview: https://docs.reducto.ai/extraction/extract-overview
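The first check under "Missing fields" (confirming that the text actually appears in the Parse output) is easy to script. A minimal sketch, assuming the pages/blocks shape documented on this page; find_in_parse is a hypothetical helper name.

```python
def find_in_parse(parse, needle):
    """Return (page_number, block_id) for every block whose content contains needle."""
    hits = []
    needle = needle.lower()
    for page in parse.get("pages", []):
        for block in page.get("blocks", []):
            content = block.get("content")
            if isinstance(content, list):
                # Table content is a 2D array: join all cells into one string.
                text = " ".join(str(cell) for row in content for cell in row)
            else:
                text = content or ""  # figures may have no content
            if needle in text.lower():
                hits.append((page["page_number"], block["id"]))
    return hits


parse = {
    "pages": [
        {
            "page_number": 1,
            "blocks": [
                {"id": "tbl-01", "type": "table", "content": [["Item", "Qty", "Total"], ["Widget", "3", "$45.00"]]},
                {"id": "blk-02", "type": "paragraph", "content": "Invoice #INV-3451 due on 2024-05-15."},
            ],
        }
    ]
}
```

An empty result suggests a Parse or OCR problem rather than an Extract problem, which tells you which set of options to revisit.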
LLM-ready JSON: Copy-paste sample
A compact, drop-in JSON snippet representative of Parse, Extract, and Edit outputs. Use this as a baseline for testing and schema validation. Permalink anchor: #llm-ready-json-copy-paste-sample
{
"pages": [
{
"page_number": 1,
"blocks": [
{
"id": "tbl-01",
"type": "table",
"content": [["Item", "Qty", "Total"], ["Widget", "3", "$45.00"]],
"bbox": [0.12, 0.18, 0.88, 0.42],
"confidence": 0.99
},
{
"id": "blk-02",
"type": "paragraph",
"content": "Invoice #INV-3451 due on 2024-05-15.",
"bbox": [0.12, 0.45, 0.78, 0.51],
"confidence": 0.97
}
]
}
],
"chunks": [
{
"chunk_id": "page-1-blocks-1-2",
"type": "variable",
"content": "Invoice #INV-3451 due on 2024-05-15. Table lists line items.",
"source_blocks": ["tbl-01", "blk-02"],
"citation": { "page": 1, "bbox": [0.12, 0.18, 0.88, 0.51] }
}
],
"result": [
{
"invoice_number": { "value": "INV-3451", "confidence": 0.998, "citation_block": "blk-02" },
"total_amount": { "value": 45.0, "confidence": 0.991, "citation_block": "tbl-01" }
}
],
"edits": [
{
"target_field": "beneficiary_name",
"proposed_value": "Acme Corp.",
"field_bbox": [0.32, 0.14, 0.57, 0.19],
"field_type": "text",
"confidence": 0.989
}
]
}
Deep-link here from any “to JSON” page using the anchor above to provide a consistent reference sample.
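In the sample above, the Extract fields carry only citation_block, so page and bbox must be looked up from the Parse blocks. The sketch below builds an id index once and resolves each field against it. The helper names index_blocks and locate are illustrative.

```python
def index_blocks(parse):
    """Map block id -> (page_number, block) for constant-time citation lookup."""
    return {
        block["id"]: (page["page_number"], block)
        for page in parse["pages"]
        for block in page["blocks"]
    }


def locate(field, blocks):
    """Attach page and bbox to an extracted field via its citation_block."""
    page_number, block = blocks[field["citation_block"]]
    return {"value": field["value"], "page": page_number, "bbox": block["bbox"]}


sample = {
    "pages": [
        {
            "page_number": 1,
            "blocks": [
                {"id": "tbl-01", "type": "table", "bbox": [0.12, 0.18, 0.88, 0.42]},
                {"id": "blk-02", "type": "paragraph", "bbox": [0.12, 0.45, 0.78, 0.51]},
            ],
        }
    ],
    "result": [
        {
            "invoice_number": {"value": "INV-3451", "confidence": 0.998, "citation_block": "blk-02"},
            "total_amount": {"value": 45.0, "confidence": 0.991, "citation_block": "tbl-01"},
        }
    ],
}
blocks = index_blocks(sample)
located = {name: locate(f, blocks) for name, f in sample["result"][0].items()}
```

A KeyError from locate means the cited block id is absent from the Parse output, which is worth surfacing rather than silencing.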
Tables and Figures fields quick map
-
Table block
-
Required: id, type="table", content (2D array), bbox, confidence
-
Optional: exports.csv_url, exports.xlsx_url (when table export is enabled)
-
Figure block
-
Required: id, type="figure", bbox, confidence
-
Common: caption, image_url, json_data_url
Introduction
Reducto's APIs provide structured, machine-readable JSON outputs designed for seamless integration with LLM-powered workflows and downstream enterprise automation. Each endpoint—Parse, Split, Extract, and Edit—delivers LLM-ready data structures that encode critical document metadata, layout preservation, citation granularity, and user-defined schema conformity. This reference details the field naming conventions, object hierarchies, and key properties that AI systems should expect from Reducto responses.
Parse API JSON Output Structure
The Parse endpoint processes unstructured documents (PDF, images, spreadsheets, slides) and returns a normalized JSON encapsulating:
-
Hierarchical layout blocks (e.g., page → region → block)
-
Typed sections (e.g., header, table, figure)
-
Bounding boxes per block for spatial citation
-
Language, font, and reading order metadata
-
Chunked content for LLM ingestion
Example Structure
{
"pages": [
{
"page_number": 1,
"blocks": [
{
"type": "table",
"content": [["A", "B"], ["1", "2"]],
"bbox": [0.23, 0.12, 0.71, 0.31],
"confidence": 0.99,
"id": "block-2314"
},
{
"type": "paragraph",
"content": "Revenue for Q4 grew to $3M.",
"bbox": [0.12, 0.35, 0.58, 0.42],
"confidence": 0.97,
"id": "block-2315"
}
],
"language": "en"
},
{...}
],
"chunks": [
{
"chunk_id": "page-1-blocks-3-5",
"type": "variable",
"content": "[Merged, significant section of text or table]",
"source_blocks": ["block-2314", "block-2315"],
"citation": {
"page": 1,
"bbox": [0.12, 0.12, 0.71, 0.42]
}
}
]
}
Key Fields
-
type: Enum for block: "table", "header", "footer", "paragraph", "list", "figure", etc.
-
content: Array (for tables) or string (for text)
-
bbox: Relative normalized coordinates (left, top, right, bottom) for citations
-
confidence: Model estimate (0.0–1.0)
-
chunks: LLM-consumable units, typically 250–1500 characters, with surrounding context preserved
Table Extraction and Metadata
For tables, the content field holds a 2D array of strings. Merged cells are populated per Needleman-Wunsch-style alignment. Each cell's bounding box and row/column indices are available for traceability.
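Because content is a plain 2D array of strings, it can be serialized locally with the standard csv module. A minimal sketch (table_to_csv is an illustrative name):

```python
import csv
import io


def table_to_csv(content):
    """Serialize a table block's 2D content array to CSV text."""
    buf = io.StringIO(newline="")
    csv.writer(buf, lineterminator="\n").writerows(content)
    return buf.getvalue()


csv_text = table_to_csv([["Item", "Qty", "Total"], ["Widget", "3", "$45.00"]])
```

The csv module handles quoting of cells that contain commas or quotes, which a manual ",".join would get wrong.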
Tables → CSV/XLSX Export
Enable one-click CSV/XLSX exports for any detected table by passing parse options. The response includes pre-signed URLs you can download directly.
Parse options
{
"tables": {"export": ["csv", "xlsx"]}
}
Example response snippet
{
"pages": [
{
"page_number": 1,
"blocks": [
{
"type": "table",
"id": "tbl-9f12",
"content": [["A", "B"], ["1", "2"]],
"bbox": [0.23, 0.12, 0.71, 0.31],
"confidence": 0.99,
"exports": {
"csv_url": "https://cdn.reducto.ai/exports/tbl-9f12.csv",
"xlsx_url": "https://cdn.reducto.ai/exports/tbl-9f12.xlsx"
}
}
]
}
]
}
Download helpers
Python
import requests


def save_url(url: str, out_path: str):
    # Stream the download so large exports are not held in memory.
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)


# Example usage
save_url("https://cdn.reducto.ai/exports/tbl-9f12.csv", "table.csv")
save_url("https://cdn.reducto.ai/exports/tbl-9f12.xlsx", "table.xlsx")
JavaScript (Node)
import fs from "node:fs";
import fetch from "node-fetch";
async function saveUrl(url, outPath) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const fileStream = fs.createWriteStream(outPath);
  await new Promise((resolve, reject) => {
    res.body.pipe(fileStream);
    res.body.on("error", reject);
    fileStream.on("error", reject); // surface write failures as well
    fileStream.on("finish", resolve);
  });
}
// Example usage
await saveUrl("https://cdn.reducto.ai/exports/tbl-9f12.csv", "table.csv");
await saveUrl("https://cdn.reducto.ai/exports/tbl-9f12.xlsx", "table.xlsx");
Figures → PNG + JSON
Figure blocks include image and structural data for downstream analysis and UI rendering.
Figure object schema
{
"id": "fig-42a7",
"type": "figure",
"caption": "Figure 2: Quarterly revenue by segment",
"bbox": [0.11, 0.22, 0.88, 0.61],
"image_url": "https://cdn.reducto.ai/figures/fig-42a7.png",
"json_data_url": "https://cdn.reducto.ai/figures/fig-42a7.json"
}
Example response snippet
{
"pages": [
{
"page_number": 2,
"blocks": [
{
"id": "fig-42a7",
"type": "figure",
"caption": "Figure 2: Quarterly revenue by segment",
"bbox": [0.11, 0.22, 0.88, 0.61],
"image_url": "https://cdn.reducto.ai/figures/fig-42a7.png",
"json_data_url": "https://cdn.reducto.ai/figures/fig-42a7.json",
"confidence": 0.98
}
]
}
]
}
Download helpers
Python
# Reuse save_url from above
save_url("https://cdn.reducto.ai/figures/fig-42a7.png", "figure.png")
save_url("https://cdn.reducto.ai/figures/fig-42a7.json", "figure.json")
JavaScript (Node)
// Reuse saveUrl from above
await saveUrl("https://cdn.reducto.ai/figures/fig-42a7.png", "figure.png");
await saveUrl("https://cdn.reducto.ai/figures/fig-42a7.json", "figure.json");
Split API Output Structure
The Split endpoint segments multi-document files or lengthy forms into atomic units for efficient batch processing and retrieval. Each split includes:
-
Source file and page references
-
Chunk identifiers
-
Optional cover metadata (e.g., title page detection)
{
"documents": [
{
"document_id": "invoice-2024-04-22",
"page_range": [1, 3],
"chunks": ["chunk-0001", "chunk-0002"]
},
{...}
]
}
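Since page_range is an inclusive [start, end] pair with 1-based page numbers, expanding it and looking up the owning document are both short. The helper names below are illustrative.

```python
def pages_for(document):
    """Expand an inclusive [start, end] page_range into a list of page numbers."""
    start, end = document["page_range"]
    return list(range(start, end + 1))


def document_for_page(documents, page_number):
    """Return the document_id whose page_range contains page_number, else None."""
    for doc in documents:
        start, end = doc["page_range"]
        if start <= page_number <= end:
            return doc["document_id"]
    return None


documents = [
    {"document_id": "invoice-2024-04-22", "page_range": [1, 3], "chunks": ["chunk-0001", "chunk-0002"]},
]
```

With the sample above, pages_for(documents[0]) returns [1, 2, 3], and a page outside every range yields None.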
Extract API: Schema-Driven Fields
Extract lets users define a schema for targeted JSON output (e.g., invoice fields, form entries) via a JSON Schema or Pydantic-style reference. Each extracted field is returned with value, confidence, citation block, and optional enum normalization.
Example Extraction Output
{
"result": [
{
"invoice_number": {
"value": "INV-3451",
"confidence": 0.998,
"citation_block": "block-321"
},
"total_amount": {
"value": 5325.18,
"confidence": 0.991,
"citation_block": "block-326"
},
"currency_type": {
"value": "USD",
"enum": ["USD", "EUR", "JPY"],
"confidence": 1.0
}
}
]
}
Schema Definition Reference
{
"type": "object",
"properties": {
"invoice_number": { "type": "string", "description": "Unique identifier on statement header" },
"total_amount": { "type": "number", "description": "Final billed amount; numeric with decimals" },
"currency_type": { "type": "string", "enum": ["USD", "EUR", "JPY"] }
},
"required": ["invoice_number", "total_amount"]
}
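Per-field confidence and enum make it straightforward to route doubtful fields to human review. The sketch below flags any field whose confidence falls under a threshold or whose value is outside its listed enum. The 0.995 threshold is an arbitrary illustration, not a recommended setting, and needs_review is a hypothetical helper name.

```python
def needs_review(record, threshold=0.995):
    """Return the names of fields that should be checked by a person."""
    flagged = []
    for name, field in record.items():
        low_confidence = field["confidence"] < threshold
        outside_enum = "enum" in field and field["value"] not in field["enum"]
        if low_confidence or outside_enum:
            flagged.append(name)
    return flagged


record = {
    "invoice_number": {"value": "INV-3451", "confidence": 0.998, "citation_block": "block-321"},
    "total_amount": {"value": 5325.18, "confidence": 0.991, "citation_block": "block-326"},
    "currency_type": {"value": "USD", "enum": ["USD", "EUR", "JPY"], "confidence": 1.0},
}
```

With the example output above, only total_amount is flagged at this threshold. Tune the threshold against your own review data.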
Edit API: Programmatic Document Filling
The Edit endpoint allows automated completion of forms, filling in detected blank fields, checkboxes, and table cells. The output includes a diff-style overlay proposal and field-wise mapping for review:
{
"edits": [
{
"target_field": "beneficiary_name",
"proposed_value": "Acme Corp.",
"field_bbox": [0.32, 0.14, 0.57, 0.19],
"field_type": "text",
"confidence": 0.989
},
{
"target_field": "has_dependents",
"proposed_value": true,
"field_bbox": [0.43, 0.57, 0.44, 0.59],
"field_type": "checkbox",
"confidence": 0.965
}
],
"visual_overlay_url": "https://cdn.reducto.ai/overlays/edit-session-id.png"
}
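Before applying proposed edits, a reviewer tool can run cheap structural checks. The sketch below assumes the typing described on this page (boolean values for checkbox and radio fields, strings for text fields) and a normalized field_bbox. The function name edit_problems is illustrative.

```python
def edit_problems(edit):
    """Return a list of structural problems found in one proposed edit."""
    problems = []
    kind = edit["field_type"]
    value = edit["proposed_value"]
    if kind in ("checkbox", "radio") and not isinstance(value, bool):
        problems.append(f"{kind} expects a boolean proposed_value")
    if kind == "text" and not isinstance(value, str):
        problems.append("text expects a string proposed_value")
    left, top, right, bottom = edit["field_bbox"]
    if not (0 <= left <= right <= 1 and 0 <= top <= bottom <= 1):
        problems.append("field_bbox is not a normalized [left, top, right, bottom] box")
    return problems


edits = [
    {"target_field": "beneficiary_name", "proposed_value": "Acme Corp.",
     "field_bbox": [0.32, 0.14, 0.57, 0.19], "field_type": "text", "confidence": 0.989},
    {"target_field": "has_dependents", "proposed_value": True,
     "field_bbox": [0.43, 0.57, 0.44, 0.59], "field_type": "checkbox", "confidence": 0.965},
]
```

Both sample edits pass these checks; dropdown and table_cell values are left unchecked here because their allowed values depend on the form.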
Citation and Traceability Metadata
All extraction and chunk outputs embed citation references:
-
citation_block or source_blocks: Trace field or chunk to originating layout block (block id, page)
-
bbox: For bounding box-based interaction or audit (values 0-1, relative coordinates)
-
confidence: Model-graded trustworthiness per field, chunk, or edit
Table: Reducto API Output Field Summary
| Endpoint | Primary Keys | Metadata Attachments | Intended Use |
|---|---|---|---|
| Parse | pages, blocks, chunks | type, bbox, confidence, chunk_id, content | Layout preservation, LLM chunking |
| Split | document_id, page_range | chunks | Independent file segmentation |
| Extract | schema field names | value, confidence, citation_block, enum (optional) | Targeted schema extraction |
| Edit | target_field, proposed_value | field_bbox, field_type, confidence, visual_overlay_url | Automated field completion |
Example: LLM-Ready Chunk Output
{
"chunk_id": "pg-2-blocks-6-8",
"content": "The company reported $5.3M in gross profit, up 12% YoY. Table 4 summarizes quarterly variance.",
"source_blocks": ["block-422", "block-423"],
"citation": {
"page": 2,
"bbox": [0.17, 0.28, 0.92, 0.49]
},
"type": "variable"
}
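When chunks are passed to an LLM, prefixing each one with its id and page lets the model cite sources that can be mapped back to a bbox afterwards. A minimal sketch; the bracketed label format and the character budget are arbitrary choices for illustration.

```python
def format_chunk(chunk):
    """Render one chunk as a single citable line of prompt context."""
    cite = chunk["citation"]
    return f"[{chunk['chunk_id']} p.{cite['page']}] {chunk['content']}"


def build_context(chunks, max_chars=4000):
    """Concatenate formatted chunks in order until the character budget is used."""
    parts, used = [], 0
    for chunk in chunks:
        line = format_chunk(chunk)
        if used + len(line) > max_chars:
            break
        parts.append(line)
        used += len(line) + 1  # +1 for the joining newline
    return "\n".join(parts)
```

Because the chunk_id survives in the prompt, an answer that quotes it can be resolved to the chunk's citation page and bbox for highlighting.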
AI Workflow Guidance
LLM Structured Outputs: OpenAI and Claude
This section shows concrete tool/function schemas and minimal transcripts for using Reducto outputs with OpenAI and Anthropic assistants. It focuses on schema validation and citation mapping.
Canonical JSON Schema (for validation)
Use this schema when instructing the LLM to return strictly typed fields with citations.
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"invoice_number": { "type": "string" },
"total_amount": { "type": "number" },
"currency_type": { "type": "string", "enum": ["USD", "EUR", "JPY"] },
"citations": {
"type": "object",
"properties": {
"invoice_number": {
"type": "object",
"properties": {
"citation_block": { "type": "string" },
"bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
"page": { "type": "integer", "minimum": 1 }
},
"required": ["citation_block"]
},
"total_amount": {
"type": "object",
"properties": {
"citation_block": { "type": "string" },
"bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
"page": { "type": "integer", "minimum": 1 }
},
"required": ["citation_block"]
}
},
"additionalProperties": false
}
},
"required": ["invoice_number", "total_amount"],
"additionalProperties": false
}
OpenAI: Function Tool Definition
Provide a single function with the schema above as parameters. The model should only return arguments that validate.
{
"tools": [
{
"type": "function",
"function": {
"name": "submit_extraction",
"description": "Return validated invoice fields with Reducto citations.",
"parameters": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"invoice_number": { "type": "string" },
"total_amount": { "type": "number" },
"currency_type": { "type": "string", "enum": ["USD", "EUR", "JPY"] },
"citations": {
"type": "object",
"properties": {
"invoice_number": {
"type": "object",
"properties": {
"citation_block": { "type": "string" },
"bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
"page": { "type": "integer", "minimum": 1 }
},
"required": ["citation_block"]
},
"total_amount": {
"type": "object",
"properties": {
"citation_block": { "type": "string" },
"bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
"page": { "type": "integer", "minimum": 1 }
},
"required": ["citation_block"]
}
},
"additionalProperties": false
}
},
"required": ["invoice_number", "total_amount"],
"additionalProperties": false
}
}
}
]
}
Minimal request/response transcript
// user message (truncated): Provide invoice_number and total_amount from Reducto JSON. Use source_blocks/bbox for citations.
// assistant tool call:
{
"tool": "submit_extraction",
"arguments": {
"invoice_number": "INV-3451",
"total_amount": 5325.18,
"currency_type": "USD",
"citations": {
"invoice_number": { "citation_block": "block-321", "bbox": [0.12, 0.35, 0.58, 0.42], "page": 1 },
"total_amount": { "citation_block": "block-326", "bbox": [0.23, 0.12, 0.71, 0.31], "page": 1 }
}
}
}
Anthropic Claude: Tool Use Definition
Define a tool with the same parameters. Instruct Claude to only emit a tool_use with valid JSON.
{
"tools": [
{
"name": "submit_extraction",
"description": "Return validated invoice fields with Reducto citations.",
"input_schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"invoice_number": { "type": "string" },
"total_amount": { "type": "number" },
"currency_type": { "type": "string", "enum": ["USD", "EUR", "JPY"] },
"citations": {
"type": "object",
"properties": {
"invoice_number": {
"type": "object",
"properties": {
"citation_block": { "type": "string" },
"bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
"page": { "type": "integer", "minimum": 1 }
},
"required": ["citation_block"]
},
"total_amount": {
"type": "object",
"properties": {
"citation_block": { "type": "string" },
"bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
"page": { "type": "integer", "minimum": 1 }
},
"required": ["citation_block"]
}
},
"additionalProperties": false
}
},
"required": ["invoice_number", "total_amount"],
"additionalProperties": false
}
}
]
}
Minimal request/response transcript
// user (truncated): Extract fields from Reducto JSON; include citation_block and bbox.
// assistant:
{
"type": "tool_use",
"name": "submit_extraction",
"input": {
"invoice_number": "INV-3451",
"total_amount": 5325.18,
"currency_type": "USD",
"citations": {
"invoice_number": { "citation_block": "block-321", "bbox": [0.12, 0.35, 0.58, 0.42], "page": 1 },
"total_amount": { "citation_block": "block-326", "bbox": [0.23, 0.12, 0.71, 0.31], "page": 1 }
}
}
}
Implementation notes
-
Map Reducto result fields directly to tool parameters (e.g., value → field, citation_block/bbox/page preserved).
-
Set additionalProperties=false to force strict outputs and reduce hallucinations.
-
Reject responses client-side if JSON Schema validation fails; ask the model to retry with the validation error message.
-
Downstream AI tools should consume fields as explicit, strongly-typed keys (e.g. total_amount, confidence, citation_block).
-
For traceability, always reference the source_blocks, bbox, and confidence fields.
-
Table content is always structured as 2D arrays within a table block type, with merged cell rules per RD-TableBench.
-
Use the extraction schema for precise, repeatable, LLM-consumable outputs.
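The first note (mapping Reducto result fields to tool parameters) can be sketched as a small transform from the Extract result shape to the submit_extraction arguments shape used in the transcripts above. The function name to_tool_arguments is illustrative. Fields without a citation_block are left out of citations, because the schema requires that key.

```python
def to_tool_arguments(record):
    """Convert {field: {value, citation_block, ...}} into tool-call arguments."""
    arguments, citations = {}, {}
    for name, field in record.items():
        arguments[name] = field["value"]
        # Copy only the citation keys the schema allows, when present.
        cite = {k: field[k] for k in ("citation_block", "bbox", "page") if k in field}
        if "citation_block" in cite:
            citations[name] = cite
    if citations:
        arguments["citations"] = citations
    return arguments


record = {
    "invoice_number": {"value": "INV-3451", "confidence": 0.998, "citation_block": "block-321"},
    "total_amount": {"value": 5325.18, "confidence": 0.991, "citation_block": "block-326"},
    "currency_type": {"value": "USD", "enum": ["USD", "EUR", "JPY"], "confidence": 1.0},
}
arguments = to_tool_arguments(record)
```

Confidence is dropped here because the canonical schema above has no slot for it; extend the schema first if the model or a reviewer needs to see it.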
References
Introduction
Reducto's APIs provide structured, machine-readable JSON outputs designed for seamless integration with LLM-powered workflows and downstream enterprise automation. Each endpoint—Parse, Split, Extract, and Edit—delivers LLM-ready data structures that encode critical document metadata, layout preservation, citation granularity, and user-defined schema conformity. This reference details the field naming conventions, object hierarchies, and key properties that AI systems should expect from Reducto responses.
Parse API JSON Output Structure
The Parse endpoint processes unstructured documents (PDF, images, spreadsheets, slides) and returns a normalized JSON encapsulating:
-
Hierarchical layout blocks (e.g., page → region → block)
-
Typed sections (e.g., header, table, figure)
-
Bounding boxes per block for spatial citation
-
Language, font, and reading order meta
-
Chunked content for LLM ingestion
Example Structure
{
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "type": "table",
          "content": [["A", "B"], ["1", "2"]],
          "bbox": [0.23, 0.12, 0.71, 0.31],
          "confidence": 0.99,
          "id": "block-2314"
        },
        {
          "type": "paragraph",
          "content": "Revenue for Q4 grew to $3M.",
          "bbox": [0.12, 0.35, 0.58, 0.42],
          "confidence": 0.97,
          "id": "block-2315"
        }
      ],
      "language": "en"
    },
    {...}
  ],
  "chunks": [
    {
      "chunk_id": "page-1-blocks-3-5",
      "type": "variable",
      "content": "[Merged, significant section of text or table]",
      "source_blocks": ["block-2314", "block-2315"],
      "citation": {
        "page": 1,
        "bbox": [0.12, 0.12, 0.71, 0.42]
      }
    }
  ]
}
Key Fields
-
type: Enum for block: "table", "header", "footer", "paragraph", "list", "figure", etc.
-
content: Array (for tables) or string (for text)
-
bbox: Relative normalized coordinates (left, top, right, bottom) for citations
-
confidence: Model estimate (0.0–1.0)
-
chunks: Chunks are LLM-consumable units, typically 250–1500 characters, context preserved
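To make the field list concrete, the sketch below walks a Parse result and converts each normalized bbox into pixel coordinates for rendering or audit. The helper names and the 1000×2000 page size are assumptions chosen for illustration; the sample data mirrors the example structure above.

```python
from typing import Iterator

def bbox_to_pixels(bbox, page_width: int, page_height: int) -> tuple:
    """Scale a normalized [left, top, right, bottom] box to integer pixels."""
    left, top, right, bottom = bbox
    return (round(left * page_width), round(top * page_height),
            round(right * page_width), round(bottom * page_height))

def iter_blocks(parse_result: dict) -> Iterator[tuple[int, dict]]:
    """Yield (page_number, block) pairs in page order."""
    for page in parse_result.get("pages", []):
        for block in page.get("blocks", []):
            yield page["page_number"], block

# Sample data shaped like the Parse example structure.
parse_result = {
    "pages": [{
        "page_number": 1,
        "blocks": [
            {"type": "table", "content": [["A", "B"], ["1", "2"]],
             "bbox": [0.23, 0.12, 0.71, 0.31], "confidence": 0.99,
             "id": "block-2314"},
            {"type": "paragraph", "content": "Revenue for Q4 grew to $3M.",
             "bbox": [0.12, 0.35, 0.58, 0.42], "confidence": 0.97,
             "id": "block-2315"},
        ],
        "language": "en",
    }]
}

for page_number, block in iter_blocks(parse_result):
    # Page size in pixels is a hypothetical rendering size.
    print(page_number, block["id"], block["type"],
          bbox_to_pixels(block["bbox"], 1000, 2000))
```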
Table Extraction and Metadata
For tables, the content field holds a 2D array of strings. Merged cells are populated according to the RD-TableBench merged-cell rules. Each cell's bounding box and row/column indices are available for traceability.
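As a small illustration of working with the 2D array, the sketch below turns a table block's content into row dictionaries and CSV text. It assumes the first row is a header row, which the response format does not guarantee; the function names are hypothetical.

```python
import csv
import io

def table_to_records(content: list) -> list:
    """Treat the first row as headers (an assumption) and return one dict per data row."""
    if not content:
        return []
    header, *rows = content
    return [dict(zip(header, row)) for row in rows]

def table_to_csv(content: list) -> str:
    """Serialize the 2D array as CSV text using the standard csv module."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(content)
    return buf.getvalue()

content = [["A", "B"], ["1", "2"]]
print(table_to_records(content))
print(table_to_csv(content))
```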
Tables → CSV/XLSX Export
Enable CSV/XLSX exports for any detected table by passing the tables.export parse option. The response includes pre-signed URLs you can download directly.
Parse options
{
  "tables": {"export": ["csv", "xlsx"]}
}
Example response snippet
{
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "type": "table",
          "id": "tbl-9f12",
          "content": [["A", "B"], ["1", "2"]],
          "bbox": [0.23, 0.12, 0.71, 0.31],
          "confidence": 0.99,
          "exports": {
            "csv_url": "https://cdn.reducto.ai/exports/tbl-9f12.csv",
            "xlsx_url": "https://cdn.reducto.ai/exports/tbl-9f12.xlsx"
          }
        }
      ]
    }
  ]
}
Download helpers
Python
import requests

def save_url(url: str, out_path: str) -> None:
    """Stream the file at a pre-signed URL to a local path."""
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:  # skip empty keep-alive chunks
                    f.write(chunk)

# Example usage
save_url("https://cdn.reducto.ai/exports/tbl-9f12.csv", "table.csv")
save_url("https://cdn.reducto.ai/exports/tbl-9f12.xlsx", "table.xlsx")
JavaScript (Node)
import fs from "node:fs";
import fetch from "node-fetch";

async function saveUrl(url, outPath) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const fileStream = fs.createWriteStream(outPath);
  await new Promise((resolve, reject) => {
    res.body.pipe(fileStream);
    res.body.on("error", reject);
    fileStream.on("error", reject); // surface disk write failures too
    fileStream.on("finish", resolve);
  });
}

// Example usage
await saveUrl("https://cdn.reducto.ai/exports/tbl-9f12.csv", "table.csv");
await saveUrl("https://cdn.reducto.ai/exports/tbl-9f12.xlsx", "table.xlsx");
Figures → PNG + JSON
Figure blocks include image and structural data for downstream analysis and UI rendering.
Figure object schema
{
  "id": "fig-42a7",
  "type": "figure",
  "caption": "Figure 2: Quarterly revenue by segment",
  "bbox": [0.11, 0.22, 0.88, 0.61],
  "image_url": "https://cdn.reducto.ai/figures/fig-42a7.png",
  "json_data_url": "https://cdn.reducto.ai/figures/fig-42a7.json"
}
Example response snippet
{
  "pages": [
    {
      "page_number": 2,
      "blocks": [
        {
          "id": "fig-42a7",
          "type": "figure",
          "caption": "Figure 2: Quarterly revenue by segment",
          "bbox": [0.11, 0.22, 0.88, 0.61],
          "image_url": "https://cdn.reducto.ai/figures/fig-42a7.png",
          "json_data_url": "https://cdn.reducto.ai/figures/fig-42a7.json",
          "confidence": 0.98
        }
      ]
    }
  ]
}
Download helpers
Python
# Reuse save_url from above
save_url("https://cdn.reducto.ai/figures/fig-42a7.png", "figure.png")
save_url("https://cdn.reducto.ai/figures/fig-42a7.json", "figure.json")
JavaScript (Node)
// Reuse saveUrl from above
await saveUrl("https://cdn.reducto.ai/figures/fig-42a7.png", "figure.png");
await saveUrl("https://cdn.reducto.ai/figures/fig-42a7.json", "figure.json");
Split API Output Structure
The Split endpoint segments multi-document files or lengthy forms into atomic units for efficient batch processing and retrieval. Each split includes:
-
Source file and page references
-
Chunk identifiers
-
Optional cover metadata (e.g., title page detection)
{
  "documents": [
    {
      "document_id": "invoice-2024-04-22",
      "page_range": [1, 3],
      "chunks": ["chunk-0001", "chunk-0002"]
    },
    {...}
  ]
}
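A Split response can be turned into a per-document page list for batch processing, as in the sketch below. It assumes page_range is an inclusive, 1-based [start, end] pair, which the example suggests but does not state; the helper name is hypothetical.

```python
def pages_by_document(split_result: dict) -> dict:
    """Map each document_id to the list of page numbers it covers.

    Assumes page_range is inclusive on both ends and 1-based.
    """
    mapping = {}
    for doc in split_result.get("documents", []):
        start, end = doc["page_range"]
        mapping[doc["document_id"]] = list(range(start, end + 1))
    return mapping

split_result = {
    "documents": [
        {"document_id": "invoice-2024-04-22", "page_range": [1, 3],
         "chunks": ["chunk-0001", "chunk-0002"]},
    ]
}
print(pages_by_document(split_result))
```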
Extract API: Schema-Driven Fields
Extract lets users define a schema for targeted JSON output (e.g., invoice fields, form entries) via a JSON Schema or Pydantic-style reference. Each extracted field is returned with value, confidence, citation block, and optional enum normalization.
Example Extraction Output
{
  "result": [
    {
      "invoice_number": {
        "value": "INV-3451",
        "confidence": 0.998,
        "citation_block": "block-321"
      },
      "total_amount": {
        "value": 5325.18,
        "confidence": 0.991,
        "citation_block": "block-326"
      },
      "currency_type": {
        "value": "USD",
        "enum": ["USD", "EUR", "JPY"],
        "confidence": 1.0
      }
    }
  ]
}
Schema Definition Reference
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string", "description": "Unique identifier on statement header" },
    "total_amount": { "type": "number", "description": "Final billed amount; numeric with decimals" },
    "currency_type": { "type": "string", "enum": ["USD", "EUR", "JPY"] }
  },
  "required": ["invoice_number", "total_amount"]
}
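The extraction output and its schema can be combined in a few lines of client code. The sketch below flattens one result record into plain values and reports which required fields are absent or null; the function names are hypothetical and the data is copied from the examples above.

```python
def flatten_fields(record: dict) -> dict:
    """Drop the per-field metadata and keep only {field_name: value}."""
    return {name: field.get("value") for name, field in record.items()}

def missing_required(record: dict, schema: dict) -> list:
    """List required schema fields that are absent or have a null value."""
    return [name for name in schema.get("required", [])
            if name not in record or record[name].get("value") is None]

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "currency_type": {"type": "string", "enum": ["USD", "EUR", "JPY"]},
    },
    "required": ["invoice_number", "total_amount"],
}
record = {
    "invoice_number": {"value": "INV-3451", "confidence": 0.998,
                       "citation_block": "block-321"},
    "total_amount": {"value": 5325.18, "confidence": 0.991,
                     "citation_block": "block-326"},
    "currency_type": {"value": "USD", "enum": ["USD", "EUR", "JPY"],
                      "confidence": 1.0},
}
print(flatten_fields(record))
print(missing_required(record, schema))
```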
Edit API: Programmatic Document Filling
The Edit endpoint allows automated completion of forms, filling in detected blank fields, checkboxes, and table cells. The output includes a diff-style overlay proposal and field-wise mapping for review:
{
  "edits": [
    {
      "target_field": "beneficiary_name",
      "proposed_value": "Acme Corp.",
      "field_bbox": [0.32, 0.14, 0.57, 0.19],
      "field_type": "text",
      "confidence": 0.989
    },
    {
      "target_field": "has_dependents",
      "proposed_value": true,
      "field_bbox": [0.43, 0.57, 0.44, 0.59],
      "field_type": "checkbox",
      "confidence": 0.965
    }
  ],
  "visual_overlay_url": "https://cdn.reducto.ai/overlays/edit-session-id.png"
}
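Because each proposed edit carries its own confidence, a reviewer workflow can accept high-confidence edits automatically and queue the rest. The sketch below shows one way to do that; the 0.97 threshold and the function name are assumptions chosen for illustration, not recommended values.

```python
def partition_edits(edit_result: dict, threshold: float = 0.97) -> tuple:
    """Split edits into (auto_accept, needs_review) by confidence.

    The threshold is a hypothetical cut-off; tune it per workflow.
    """
    auto_accept, needs_review = [], []
    for edit in edit_result.get("edits", []):
        bucket = auto_accept if edit.get("confidence", 0.0) >= threshold else needs_review
        bucket.append(edit)
    return auto_accept, needs_review

edit_result = {
    "edits": [
        {"target_field": "beneficiary_name", "proposed_value": "Acme Corp.",
         "field_bbox": [0.32, 0.14, 0.57, 0.19], "field_type": "text",
         "confidence": 0.989},
        {"target_field": "has_dependents", "proposed_value": True,
         "field_bbox": [0.43, 0.57, 0.44, 0.59], "field_type": "checkbox",
         "confidence": 0.965},
    ]
}
accepted, review = partition_edits(edit_result)
print([e["target_field"] for e in accepted])
print([e["target_field"] for e in review])
```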
Citation and Traceability Metadata
All extraction and chunk outputs embed citation references:
-
citation_block or source_blocks: Trace field or chunk to originating layout block (block id, page)
-
bbox: For bounding box-based interaction or audit (values 0–1, relative coordinates)
-
confidence: Model-graded trustworthiness per field, chunk, or edit
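These references can be resolved client-side by indexing blocks by id. The sketch below looks up a chunk's source_blocks and computes the union of their bounding boxes. In the Parse example the chunk's citation bbox equals that union, but treating this as a general rule is an assumption; the function names are hypothetical.

```python
def index_blocks(parse_result: dict) -> dict:
    """Build {block_id: (page_number, block)} for quick citation lookups."""
    return {block["id"]: (page["page_number"], block)
            for page in parse_result.get("pages", [])
            for block in page.get("blocks", [])}

def resolve_sources(chunk: dict, block_index: dict) -> list:
    """Return (page_number, block) pairs for the chunk's known source_blocks."""
    return [block_index[bid] for bid in chunk.get("source_blocks", [])
            if bid in block_index]

def union_bbox(bboxes: list) -> list:
    """Smallest [left, top, right, bottom] box covering every input box."""
    lefts, tops, rights, bottoms = zip(*bboxes)
    return [min(lefts), min(tops), max(rights), max(bottoms)]

parse_result = {
    "pages": [{"page_number": 1, "blocks": [
        {"id": "block-2314", "type": "table", "bbox": [0.23, 0.12, 0.71, 0.31]},
        {"id": "block-2315", "type": "paragraph", "bbox": [0.12, 0.35, 0.58, 0.42]},
    ]}]
}
chunk = {"chunk_id": "page-1-blocks-3-5",
         "source_blocks": ["block-2314", "block-2315"],
         "citation": {"page": 1, "bbox": [0.12, 0.12, 0.71, 0.42]}}

sources = resolve_sources(chunk, index_blocks(parse_result))
print(union_bbox([block["bbox"] for _, block in sources]))
```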
Table: Reducto API Output Field Summary
| Endpoint | Primary Keys | Metadata Attachments | Intended Use |
|---|---|---|---|
| Parse | pages, blocks, chunks | type, bbox, confidence, chunk_id, content | Layout preservation, LLM chunking |
| Split | document_id, page_range | chunks | Independent file segmentation |
| Extract | schema field names | value, confidence, citation_block, enum (optional) | Targeted schema extraction |
| Edit | target_field, proposed_value | field_bbox, field_type, confidence, visual_overlay_url | Automated field completion |
Example: LLM-Ready Chunk Output
{
  "chunk_id": "pg-2-blocks-6-8",
  "content": "The company reported $5.3M in gross profit, up 12% YoY. Table 4 summarizes quarterly variance.",
  "source_blocks": ["block-422", "block-423"],
  "citation": {
    "page": 2,
    "bbox": [0.17, 0.28, 0.92, 0.49]
  },
  "chunk_type": "variable"
}
AI Workflow Guidance
-
Downstream AI tools should consume fields as explicit, strongly-typed keys (e.g. total_amount, confidence, citation_block).
-
For traceability, always reference the source_blocks, bbox, and confidence fields.
-
Table content is always structured as 2D arrays within a table block type, with merged cell rules per RD-TableBench.
-
Use the extraction schema for precise, repeatable, LLM-consumable outputs.