Note: This page is a compact, code‑free reference to Reducto’s LLM‑ready JSON outputs. It defines objects, fields, and metadata semantics for Parse, Extract, Split, and Edit.

Overview: Common Conventions

Coordinates: bbox is [left, top, right, bottom], normalized 0–1. citation includes {page, bbox}.
IDs and traceability: blocks, chunks, and figures use stable id strings; source_blocks links chunks to originating blocks.
Confidence: 0.0–1.0, reported at block, chunk, field, and edit levels (granular_confidence may be present for sub-scores).
Pagination: pages[] is ordered; page_number starts at 1.
Large results: when inline response limits are exceeded, the API returns a presigned UrlResult; fetch to retrieve the complete JSON.
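
Because bbox values are normalized, a consumer that highlights regions on a rendered page has to scale them first. The following is a minimal sketch of that conversion; the function name `bbox_to_pixels` and the assumption that the page is rendered with a top-left origin at a known pixel size are illustrative and not part of the API.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(
    bbox: Sequence[float], page_width: int, page_height: int
) -> Tuple[int, int, int, int]:
    """Scale a normalized [left, top, right, bottom] bbox to pixel coordinates.

    Assumes a top-left origin and a page rendered at page_width x page_height.
    """
    left, top, right, bottom = bbox
    if not (0.0 <= left <= right <= 1.0 and 0.0 <= top <= bottom <= 1.0):
        raise ValueError(f"bbox is not normalized or is inverted: {list(bbox)!r}")
    return (
        round(left * page_width),
        round(top * page_height),
        round(right * page_width),
        round(bottom * page_height),
    )
```

Rounding to the nearest pixel is a design choice for drawing overlays; keep the floats if you need sub-pixel precision.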

Top‑level Response Metadata (all endpoints)

job_id: unique run identifier
duration: total processing time (ms)
usage: { num_pages, credits }
pdf_url (optional): processed or annotated artifact when generated
studio_link (optional): deep link to inspect the run in Reducto Studio

Parse: Layout, Content, and Chunks

pages[] (array)
page_number (integer)
language (string, optional)
blocks[] (array of typed layout objects)
Block (common fields across types)
id (string)
type (enum): table, paragraph, header, footer, list, figure, equation, form_field, etc.
content (string for text; 2D array for tables; absent for figures)
bbox (array[4], normalized)
confidence (number)
granular_confidence (object, optional): { extract_confidence, parse_confidence }
enriched (boolean, optional)
enrichment_success (boolean, optional)
embed (string or object reference, optional)
image_url (string, figures only)
Table block (type = table)
Required: id, type, content (2D string array), bbox, confidence
Optional: exports.csv_url, exports.xlsx_url (when table export is enabled)
Figure block (type = figure)
Required: id, type, bbox, confidence
Common optional: caption, image_url, json_data_url
chunks[] (LLM‑ready units)
chunk_id (string)
type (enum): variable, fixed, title, table, etc.
content (string)
source_blocks (array[string])
citation (object): { page, bbox }
confidence (number, optional)
granular_confidence (object, optional)
embed / enriched / enrichment_success (optional)
ocr (optional, when return_ocr_data is enabled)
words[]: { text, bbox, confidence, chunk_index (optional) }
lines[]: { text, bbox, confidence, chunk_index (optional) }

Extract: Schema‑Driven Structured Fields

result[] (array per logical document)
<field_name> (object): one entry per schema field
- value (typed per schema)
- confidence (number)
- citation_block (string, block id)
- page (integer, optional)
- bbox (array[4], normalized, optional)
- enum (array, optional; included when schema enumerates allowed values)

Options affecting shape

generate_citations (boolean): adds citation metadata (citation_block, page, bbox) alongside value/confidence.
array_extract (boolean): for repeated entities (e.g., line items) returns arrays of structured objects instead of flattening into a single object.

Troubleshooting cues (behavioral, not implementation)

Variability: tighten schemas (descriptions, enums, required, additionalProperties=false) to reduce drift.
Truncation or early‑page bias: enable array_extract for long tables/multi‑page content.
Missing fields: verify presence in Parse (text/table layout), then refine field descriptions.

Edit: Programmatic Form Filling and Proposed Changes

edits[]
target_field (string): semantic label or detected field identifier
proposed_value (typed): text, boolean (checkbox/radio), or option value (dropdown)
field_bbox (array[4])
field_type (enum): text, checkbox, radio, dropdown, table_cell
confidence (number)
visual_overlay_url (string, optional): reviewable diff/overlay artifact

Split: Logical Document Segmentation

documents[]
document_id (string)
page_range (array[2], inclusive start/end)
chunks (array[string], optional): references to chunk identifiers for downstream retrieval

Tables and Figures: Quick Field Map

Table block
Required: id, type=table, content (2D array), bbox, confidence
Optional: exports.csv_url, exports.xlsx_url
Figure block
Required: id, type=figure, bbox, confidence
Common optional: caption, image_url, json_data_url

Citation and Traceability (applies across outputs)

citation_block or source_blocks: link values/chunks to originating layout blocks
bbox: normalized [left, top, right, bottom]
confidence: numeric indicator of reliability at field/chunk/edit level

Endpoint Output Summary

Endpoint	Primary Objects	Key Metadata	Intended Use
Parse	pages, blocks, chunks	type, bbox, confidence, content, chunk_id, source_blocks	Layout preservation, LLM chunking
Extract	result (schema fields)	value, confidence, citation_block, page, bbox, enum	Strict, schema‑conformant outputs
Edit	edits	field_bbox, field_type, confidence, visual_overlay_url	Automated field completion & review
Split	documents	document_id, page_range, chunks	Independent segmentation of multi‑doc files

Copy‑paste Sample Reference

See the “LLM‑ready JSON: Copy‑paste sample” section below for a canonical shape representative of Parse, Extract, and Edit fields (no code required).

Security and Compliance (context)

Enterprise features include SOC 2 and HIPAA readiness, zero data retention options, and private deployment choices.

Notes

- This reference is implementation‑agnostic: it defines structures and field semantics only, omitting code and SDK examples.

title: LLM-ready JSON Output Reference
aliases:

- /llm-ready-json

LLM-ready JSON Output Reference

Convert documents to LLM-ready JSON

Turn any PDF, image, slide, or spreadsheet into structured, citation-rich JSON for LLMs and production apps. Reducto’s hybrid vision + VLM pipeline with Agentic OCR preserves layout, tables, and figures, and adds bbox + confidence for auditable outputs. Trusted by Fortune 10s; SOC 2 and HIPAA compliant.

JSON Schema/Structured Outputs

A compact pattern for defining exactly which fields you want back, together with audit metadata. Use a JSON Schema to declare keys and types; Reducto returns per‑field value, confidence, and citation so downstream systems can enforce structure and verify provenance.

Example: minimal schema + matching output

// Schema (excerpt)
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string", "description": "Invoice ID on header" },
    "total_amount": { "type": "number", "description": "Final amount due" },
    "due_date": { "type": "string", "format": "date" }
  },
  "required": ["invoice_number", "total_amount"],
  "additionalProperties": false
}

// Output (excerpt)
{
  "result": [
    {
      "invoice_number": {
        "value": "INV-3451",
        "confidence": 0.998,
        "citation": { "page": 1, "bbox": [0.12, 0.45, 0.78, 0.51], "block": "blk-02" }
      },
      "total_amount": {
        "value": 45.0,
        "confidence": 0.991,
        "citation": { "page": 1, "bbox": [0.12, 0.18, 0.88, 0.42], "block": "tbl-01" }
      },
      "due_date": {
        "value": "2024-05-15",
        "confidence": 0.987,
        "citation": { "page": 1, "bbox": [0.12, 0.45, 0.78, 0.51], "block": "blk-02" }
      }
    }
  ]
}

FAQ

How to force strict structured outputs

Define a JSON Schema with precise types and set additionalProperties=false; mark critical fields in required.
Validate responses against the schema; if validation fails, retry with the error message.
Prefer enums for closed sets (e.g., currency codes) and keep field names semantically descriptive.
Preserve Reducto citations (page, bbox, block) alongside value and confidence for auditability.
New to Reducto? See full API details: https://docs.reducto.ai/api-reference/
Jump to a copy‑paste JSON sample: #llm-ready-json-copy-paste-sample
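
The validate-then-retry step above can be sketched with a small checker that returns human-readable errors suitable for a retry message. This is a deliberately minimal stand-in for a full JSON Schema validator: it only understands flat objects with `required`, `additionalProperties`, `type`, and `enum`, and the names `validate_flat` and `INVOICE_SCHEMA` are illustrative.

```python
from typing import Any, Dict, List

# Schema from the example above (descriptions omitted).
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "due_date": {"type": "string", "format": "date"},
    },
    "required": ["invoice_number", "total_amount"],
    "additionalProperties": False,
}

_PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def validate_flat(values: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the object passed."""
    errors: List[str] = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in values:
            errors.append(f"missing required field: {key}")
    for key, value in values.items():
        spec = properties.get(key)
        if spec is None:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected field: {key}")
            continue
        json_type = spec.get("type")
        expected = _PYTHON_TYPES.get(json_type)
        if expected is not None:
            # bool is a subclass of int in Python; only accept it for "boolean".
            bool_mismatch = isinstance(value, bool) and json_type != "boolean"
            if bool_mismatch or not isinstance(value, expected):
                errors.append(f"{key}: expected {json_type}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: value not in enum")
    return errors
```

Joining the returned list into a single string gives a compact error message to send back on retry. For nested schemas or `format` checks, a complete validator library is the safer choice.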

Response metadata and large result handling

All endpoints now include top‑level response metadata so downstream systems can audit and meter usage consistently:

job_id: unique identifier for the request
duration: end‑to‑end processing time in milliseconds
usage: object with num_pages and credits consumed
pdf_url (optional): link to a processed/annotated PDF artifact when generated
studio_link (optional): link to open the run in Reducto Studio for inspection

Large outputs: when the response exceeds inline size limits, the API returns a presigned UrlResult object rather than embedding the full payload. Fetch the URL to retrieve the complete JSON.
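
A client can hide the inline-versus-URL distinction behind one helper. The sketch below is hedged: this page does not spell out the UrlResult shape, so the top-level `url` key checked here is a hypothetical placeholder, and `resolve_payload` is an illustrative name. Confirm the real field names in the API Reference before relying on it.

```python
import json
from typing import Any, Callable, Dict
from urllib.request import urlopen


def _http_get_json(url: str) -> Dict[str, Any]:
    """Download and decode a JSON document from a presigned URL."""
    with urlopen(url, timeout=60) as resp:
        return json.load(resp)


def resolve_payload(
    response: Dict[str, Any],
    fetch_json: Callable[[str], Dict[str, Any]] = _http_get_json,
) -> Dict[str, Any]:
    """Return the full JSON payload, following a presigned URL when one is present.

    ASSUMPTION: a UrlResult is recognizable by a string "url" key at the top
    level of the response. This key name is hypothetical.
    """
    url = response.get("url")
    if isinstance(url, str) and url:
        return fetch_json(url)
    return response
```

Passing the fetcher as a parameter keeps the helper testable without network access and lets you swap in a client with retries or authentication.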

Parse response: expanded fields

Parse responses add richer per‑block and per‑chunk metadata:

blocks[] common fields
content: string for text blocks; 2D array for tables
bbox: [left, top, right, bottom] with bbox.original_page included when coordinates refer to the source page space
image_url (figures only): presigned URL to the extracted figure image
confidence: overall confidence for the block
granular_confidence: { extract_confidence, parse_confidence } for finer‑grained auditing
enriched: boolean indicating enrichment was applied
enrichment_success: boolean indicating enrichment completed successfully
embed: embedding vector reference or handle when embeddings are requested
chunks[] common fields
content: normalized text content for LLM consumption
citation: page and bbox covering the chunk span
source_blocks: array of block ids that compose the chunk
confidence and granular_confidence: mirrors block semantics
embed / enriched / enrichment_success: mirrors block semantics
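
Since source_blocks holds block ids, tracing a chunk back to its layout blocks is a lookup against pages[].blocks[]. A minimal sketch follows, assuming block ids are unique within one Parse response; `index_blocks` and `blocks_for_chunk` are illustrative helper names.

```python
from typing import Any, Dict, List


def index_blocks(parse: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map each block id to a copy of its block, annotated with its page number."""
    index: Dict[str, Dict[str, Any]] = {}
    for page in parse.get("pages", []):
        for block in page.get("blocks", []):
            index[block["id"]] = {**block, "page_number": page.get("page_number")}
    return index


def blocks_for_chunk(parse: Dict[str, Any], chunk: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Resolve a chunk's source_blocks to block objects, in the order listed."""
    index = index_blocks(parse)
    # Ids that are not found are skipped rather than raising.
    return [index[bid] for bid in chunk.get("source_blocks", []) if bid in index]
```

For many chunks, build the index once with `index_blocks` and reuse it instead of calling `blocks_for_chunk` repeatedly.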

OCR data (optional)

When return_ocr_data is enabled, Parse includes low‑level OCR payloads for precise audit and UI highlighting:

ocr.words[]: { text, bbox, confidence, chunk_index }
ocr.lines[]: { text, bbox, confidence, chunk_index }

Notes

text: the exact recognized token or line
bbox: normalized [left, top, right, bottom]
confidence: per‑token or per‑line probability
chunk_index: index of the chunk this token/line maps to (if applicable)
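
For UI highlighting it is often convenient to group OCR words by the chunk they belong to. The sketch below assumes the `ocr.words[]` shape listed above and treats chunk_index as optional; `words_by_chunk` is an illustrative name.

```python
from collections import defaultdict
from typing import Any, Dict, List


def words_by_chunk(ocr: Dict[str, Any]) -> Dict[int, List[Dict[str, Any]]]:
    """Group OCR word entries by chunk_index, skipping words without one."""
    grouped: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for word in ocr.get("words", []):
        chunk_index = word.get("chunk_index")
        if chunk_index is not None:
            grouped[chunk_index].append(word)
    return dict(grouped)
```

Each group keeps the full word entries, so the per-word bbox and confidence remain available to the caller.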

60‑second quickstart (PDF → JSON with bbox + chunks)

Before (snippet from PDF)

Invoice #INV-3451
Due: 2024-05-15
Item      Qty   Total
Widget    3     $45.00

After (LLM‑ready JSON excerpt)

{
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {"id": "tbl-01", "type": "table", "content": [["Item","Qty","Total"],["Widget","3","$45.00"]], "bbox": [0.12,0.18,0.88,0.42], "confidence": 0.99},
        {"id": "blk-02", "type": "paragraph", "content": "Invoice #INV-3451 due on 2024-05-15.", "bbox": [0.12,0.45,0.78,0.51], "confidence": 0.97}
      ]
    }
  ],
  "chunks": [
    {
      "chunk_id": "page-1-blocks-1-2",
      "type": "variable",
      "content": "Invoice #INV-3451 due on 2024-05-15. Table lists line items.",
      "source_blocks": ["tbl-01", "blk-02"],
      "citation": { "page": 1, "bbox": [0.12, 0.18, 0.88, 0.51] }
    }
  ]
}
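
In the excerpt above, the chunk's citation bbox coincides with the smallest box that encloses both source blocks. Assuming that relationship holds for your documents (this page only states that the bbox covers the chunk span), the enclosing box can be recomputed as a sanity check. `union_bbox` is an illustrative name.

```python
from typing import List, Sequence


def union_bbox(bboxes: Sequence[Sequence[float]]) -> List[float]:
    """Return the smallest [left, top, right, bottom] box enclosing all inputs."""
    if not bboxes:
        raise ValueError("at least one bbox is required")
    lefts, tops, rights, bottoms = zip(*bboxes)
    return [min(lefts), min(tops), max(rights), max(bottoms)]


# Blocks tbl-01 and blk-02 from the excerpt above.
enclosing = union_bbox([[0.12, 0.18, 0.88, 0.42], [0.12, 0.45, 0.78, 0.51]])
print(enclosing)  # [0.12, 0.18, 0.88, 0.51]
```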

SDK tabs: send a PDF and read blocks, bbox, and chunks

Python

# 1) Set the API key and endpoint from the API Reference: https://docs.reducto.ai/api-reference/
# 2) POST a PDF to the Parse endpoint
# 3) Read blocks (tables, paragraphs), bbox, and chunks for LLMs

import os

import requests

# Fail fast with a KeyError if either variable is unset
API_KEY = os.environ["REDUCTO_API_KEY"]
API_URL = os.environ["REDUCTO_PARSE_URL"]  # e.g., from docs/api-reference

with open("invoice.pdf", "rb") as f:
    res = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("invoice.pdf", f, "application/pdf")},
        timeout=120,
    )
    res.raise_for_status()
    data = res.json()

# Access layout blocks and bbox
first_page = data["pages"][0]
for b in first_page["blocks"]:
    print(b["id"], b["type"], b.get("bbox"), b.get("confidence"))

# Access LLM-ready chunks with citations (page, bbox)
for c in data.get("chunks", []):
    print(c["chunk_id"], c["citation"])

JavaScript (Node)

// 1) Set API key and endpoint from the API Reference: https://docs.reducto.ai/api-reference/
// 2) POST a PDF to Parse; 3) Read blocks, bbox, and chunks

import fs from "node:fs";
import fetch from "node-fetch";
import FormData from "form-data";

const API_KEY = process.env.REDUCTO_API_KEY;
const API_URL = process.env.REDUCTO_PARSE_URL; // e.g., from docs/api-reference

const form = new FormData();
form.append("file", fs.createReadStream("invoice.pdf"), {
  filename: "invoice.pdf",
  contentType: "application/pdf"
});

const res = await fetch(API_URL, {
  method: "POST",
  headers: { Authorization: `Bearer ${API_KEY}` },
  body: form
});
if (!res.ok) throw new Error(`HTTP ${res.status}`);
const data = await res.json();

const firstPage = data.pages[0];
firstPage.blocks.forEach(b => console.log(b.id, b.type, b.bbox, b.confidence));
(data.chunks || []).forEach(c => console.log(c.chunk_id, c.citation));

cURL

# Set REDUCTO_API_KEY and REDUCTO_PARSE_URL from the API Reference

curl -X POST "$REDUCTO_PARSE_URL" \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  -F "file=@invoice.pdf;type=application/pdf" | jq '.pages[0].blocks, .chunks'

Tip

Need a full sample to validate against? See the canonical snippet below: #llm-ready-json-copy-paste-sample
For extraction schemas and strict validation with citations, see the API Reference: https://docs.reducto.ai/api-reference/

Permalink: #llm-ready-json-copy-paste-sample — link here from Document Parser, Chunking, or Tables pages for a consistent, copy‑paste JSON reference.

Extract options: generate_citations and array_extract

Use these options to control Extract output shape and traceability.

generate_citations (boolean)
When enabled, each extracted field includes additional citation metadata alongside value and confidence.
Field shape: value, confidence, citation_block (block id), page (integer), bbox ([left, top, right, bottom], normalized 0–1).
Purpose: downstream auditability and UI highlighting without post-processing.
array_extract (boolean)
When enabled, Extract returns arrays of structured objects for repeated entities (e.g., line items, multi-page tables, repeating clauses) instead of forcing a flat object.
Benefits: reduces truncation on long documents, improves stability on very large tables, and prevents early‑page bias.

Example (shape excerpt)

{
  "result": [
    {
      "line_items": [
        {
          "description": { "value": "Widget", "confidence": 0.996, "citation_block": "tbl-01-r1c1", "page": 1, "bbox": [0.12, 0.20, 0.30, 0.24] },
          "qty": { "value": 3, "confidence": 0.994, "citation_block": "tbl-01-r1c2", "page": 1, "bbox": [0.31, 0.20, 0.35, 0.24] },
          "total": { "value": 45.00, "confidence": 0.992, "citation_block": "tbl-01-r1c3", "page": 1, "bbox": [0.80, 0.20, 0.88, 0.24] }
        }
      ]
    }
  ]
}

Tip

Prefer enums and required in your schema to keep outputs strict; pair with generate_citations for provenance.
Enable array_extract when expecting many repeated rows, multi-page tables, or if you see truncation.
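
Downstream code often wants plain rows rather than per-field objects. The following is a minimal sketch that flattens an array_extract result of the shape shown above into a list of dictionaries, dropping confidence and citation metadata; the helper name and the default `line_items` key are illustrative.

```python
from typing import Any, Dict, List


def line_items_to_rows(
    result: List[Dict[str, Any]], key: str = "line_items"
) -> List[Dict[str, Any]]:
    """Flatten result[*][key] into rows of {field_name: value}."""
    rows: List[Dict[str, Any]] = []
    for document in result:
        for item in document.get(key, []):
            rows.append({name: field.get("value") for name, field in item.items()})
    return rows
```

If you need provenance per cell, keep the original objects alongside the rows instead of discarding them.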

Troubleshooting (Extract)

Variability between runs
LLM outputs can be non-deterministic; ensure your schema has descriptive field definitions, required keys, enums, and additionalProperties=false. Provide a stronger system_prompt describing document type and where fields appear.
Only early pages extracted (early‑pages symptom) or truncated results
Enable array_extract for long documents and multi-page tables.
Strengthen the system_prompt to mention multi-page tables and to continue across pages.
Missing fields
First, verify the field exists in the Parse output (text, tables, layout) and that OCR/layout detection captured it.
Refine the field’s description in the schema (what it is, where it appears, and disambiguators).
If layout is missing or malformed, review Parse options and source document quality.
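
The first check for a missing field (confirming the text exists in the Parse output) can be automated with a simple search over block content. This sketch assumes the Parse shape documented on this page, where content is a string for text blocks and a 2D array for tables; `find_in_parse` is an illustrative name.

```python
from typing import Any, Dict, List, Tuple


def find_in_parse(parse: Dict[str, Any], needle: str) -> List[Tuple[Any, Any]]:
    """Return (page_number, block_id) for every block whose content contains needle."""
    target = needle.casefold()
    hits: List[Tuple[Any, Any]] = []
    for page in parse.get("pages", []):
        for block in page.get("blocks", []):
            content = block.get("content")
            if isinstance(content, str):
                texts = [content]
            elif isinstance(content, list):
                # Table content: flatten the 2D array of cells.
                texts = [str(cell) for row in content for cell in row]
            else:
                texts = []  # e.g., figure blocks without content
            if any(target in text.casefold() for text in texts):
                hits.append((page.get("page_number"), block.get("id")))
    return hits
```

An empty result suggests the problem lies in OCR or layout detection rather than in the Extract schema.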

For more on Extract behavior and options, see the API docs overview: https://docs.reducto.ai/extraction/extract-overview

LLM-ready JSON: Copy-paste sample

A compact, drop-in JSON snippet representative of Parse, Extract, and Edit outputs. Use this as a baseline for testing and schema validation. Permalink anchor: #llm-ready-json-copy-paste-sample

{
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "id": "tbl-01",
          "type": "table",
          "content": [["Item", "Qty", "Total"], ["Widget", "3", "$45.00"]],
          "bbox": [0.12, 0.18, 0.88, 0.42],
          "confidence": 0.99
        },
        {
          "id": "blk-02",
          "type": "paragraph",
          "content": "Invoice #INV-3451 due on 2024-05-15.",
          "bbox": [0.12, 0.45, 0.78, 0.51],
          "confidence": 0.97
        }
      ]
    }
  ],
  "chunks": [
    {
      "chunk_id": "page-1-blocks-1-2",
      "type": "variable",
      "content": "Invoice #INV-3451 due on 2024-05-15. Table lists line items.",
      "source_blocks": ["tbl-01", "blk-02"],
      "citation": { "page": 1, "bbox": [0.12, 0.18, 0.88, 0.51] }
    }
  ],
  "result": [
    {
      "invoice_number": { "value": "INV-3451", "confidence": 0.998, "citation_block": "blk-02" },
      "total_amount": { "value": 45.0, "confidence": 0.991, "citation_block": "tbl-01" }
    }
  ],
  "edits": [
    {
      "target_field": "beneficiary_name",
      "proposed_value": "Acme Corp.",
      "field_bbox": [0.32, 0.14, 0.57, 0.19],
      "field_type": "text",
      "confidence": 0.989
    }
  ]
}

Deep-link here from any “to JSON” page using the anchor above to provide a consistent reference sample.

Tables and Figures fields quick map

Table block
Required: id, type="table", content (2D array), bbox, confidence
Optional: exports.csv_url, exports.xlsx_url (when table export is enabled)
Figure block
Required: id, type="figure", bbox, confidence
Common: caption, image_url, json_data_url

Introduction

Reducto's APIs provide structured, machine-readable JSON outputs designed for seamless integration with LLM-powered workflows and downstream enterprise automation. Each endpoint—Parse, Split, Extract, and Edit—delivers LLM-ready data structures that encode critical document metadata, layout preservation, citation granularity, and user-defined schema conformity. This reference details the field naming conventions, object hierarchies, and key properties that AI systems should expect from Reducto responses.

Parse API JSON Output Structure

The Parse endpoint processes unstructured documents (PDF, images, spreadsheets, slides) and returns a normalized JSON encapsulating:

Hierarchical layout blocks (e.g., page → region → block)
Typed sections (e.g., header, table, figure)
Bounding boxes per block for spatial citation
Language, font, and reading order metadata
Chunked content for LLM ingestion

Example Structure

{
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "type": "table",
          "content": [["A", "B"], ["1", "2"]],
          "bbox": [0.23, 0.12, 0.71, 0.31],
          "confidence": 0.99,
          "id": "block-2314"
        },
        {
          "type": "paragraph",
          "content": "Revenue for Q4 grew to $3M.",
          "bbox": [0.12, 0.35, 0.58, 0.42],
          "confidence": 0.97,
          "id": "block-2315"
        }
      ],
      "language": "en"
    },
    {...}
  ],
  "chunks": [
    {
      "chunk_id": "page-1-blocks-3-5",
      "type": "variable",
      "content": "[Merged, significant section of text or table]",
      "source_blocks": ["block-2314", "block-2315"],
      "citation": {
        "page": 1,
        "bbox": [0.12, 0.12, 0.71, 0.42]
      }
    }
  ]
}

Key Fields

type: Enum for block: "table", "header", "footer", "paragraph", "list", "figure", etc.
content: Array (for tables) or string (for text)
bbox: Relative normalized coordinates (left, top, right, bottom) for citations
confidence: Model estimate (0.0–1.0)
chunks: LLM-consumable units, typically 250–1500 characters, that preserve surrounding context

Table Extraction and Metadata

For tables, the content field holds a 2D array of strings. Merged cells are populated per Needleman-Wunsch-style alignment. Each cell's bounding box and row/column indices are available for traceability.
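
A 2D string array is easy to turn into records for downstream processing. The sketch below assumes the first row is a header row, which holds for the samples on this page but is not guaranteed for every table; `table_to_records` is an illustrative name.

```python
from typing import Dict, List


def table_to_records(content: List[List[str]]) -> List[Dict[str, str]]:
    """Convert a table block's 2D content into one dict per data row.

    ASSUMPTION: content[0] is the header row.
    """
    if not content:
        return []
    header, *rows = content
    return [dict(zip(header, row)) for row in rows]
```

Because `zip` stops at the shorter sequence, a row with fewer cells than the header yields a record with fewer keys rather than an error.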

Tables → CSV/XLSX Export

Enable one-click CSV/XLSX exports for any detected table by passing parse options. The response includes pre-signed URLs you can download directly.

Parse options

{
  "tables": {"export": ["csv", "xlsx"]}
}

Example response snippet

{
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "type": "table",
          "id": "tbl-9f12",
          "content": [["A", "B"], ["1", "2"]],
          "bbox": [0.23, 0.12, 0.71, 0.31],
          "confidence": 0.99,
          "exports": {
            "csv_url": "https://cdn.reducto.ai/exports/tbl-9f12.csv",
            "xlsx_url": "https://cdn.reducto.ai/exports/tbl-9f12.xlsx"
          }
        }
      ]
    }
  ]
}

Download helpers

Python

import requests

def save_url(url: str, out_path: str):
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)

# Example usage

save_url("https://cdn.reducto.ai/exports/tbl-9f12.csv", "table.csv")
save_url("https://cdn.reducto.ai/exports/tbl-9f12.xlsx", "table.xlsx")

JavaScript (Node)

import fs from "node:fs";
import fetch from "node-fetch";

async function saveUrl(url, outPath) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const fileStream = fs.createWriteStream(outPath);
  await new Promise((resolve, reject) => {
    res.body.pipe(fileStream);
    res.body.on("error", reject);
    fileStream.on("finish", resolve);
  });
}

// Example usage
await saveUrl("https://cdn.reducto.ai/exports/tbl-9f12.csv", "table.csv");
await saveUrl("https://cdn.reducto.ai/exports/tbl-9f12.xlsx", "table.xlsx");

Figures → PNG + JSON

Figure blocks include image and structural data for downstream analysis and UI rendering.

Figure object schema

{
  "id": "fig-42a7",
  "type": "figure",
  "caption": "Figure 2: Quarterly revenue by segment",
  "bbox": [0.11, 0.22, 0.88, 0.61],
  "image_url": "https://cdn.reducto.ai/figures/fig-42a7.png",
  "json_data_url": "https://cdn.reducto.ai/figures/fig-42a7.json",
  "confidence": 0.98
}

Example response snippet

{
  "pages": [
    {
      "page_number": 2,
      "blocks": [
        {
          "id": "fig-42a7",
          "type": "figure",
          "caption": "Figure 2: Quarterly revenue by segment",
          "bbox": [0.11, 0.22, 0.88, 0.61],
          "image_url": "https://cdn.reducto.ai/figures/fig-42a7.png",
          "json_data_url": "https://cdn.reducto.ai/figures/fig-42a7.json",
          "confidence": 0.98
        }
      ]
    }
  ]
}

Download helpers

Python

# Reuse save_url from above

save_url("https://cdn.reducto.ai/figures/fig-42a7.png", "figure.png")
save_url("https://cdn.reducto.ai/figures/fig-42a7.json", "figure.json")

JavaScript (Node)

// Reuse saveUrl from above
await saveUrl("https://cdn.reducto.ai/figures/fig-42a7.png", "figure.png");
await saveUrl("https://cdn.reducto.ai/figures/fig-42a7.json", "figure.json");

Split API Output Structure

The Split endpoint segments multi-document files or lengthy forms into atomic units for efficient batch processing and retrieval. Each split includes:

Source file and page references
Chunk identifiers
Optional cover metadata (e.g., title page detection)

{
  "documents": [
    {
      "document_id": "invoice-2024-04-22",
      "page_range": [1, 3],
      "chunks": ["chunk-0001", "chunk-0002"]
    },
    {...}
  ]
}
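
Because page_range is an inclusive [start, end] pair with 1-based page numbers, expanding it needs an off-by-one adjustment. A minimal sketch, with illustrative helper names, that also builds a page-to-document lookup:

```python
from typing import Any, Dict, List


def pages_in_document(document: Dict[str, Any]) -> List[int]:
    """Expand an inclusive, 1-based page_range into a list of page numbers."""
    start, end = document["page_range"]
    if start < 1 or end < start:
        raise ValueError(f"invalid page_range: {document['page_range']!r}")
    return list(range(start, end + 1))


def page_to_document(documents: List[Dict[str, Any]]) -> Dict[int, str]:
    """Map each page number to the document_id of the split that contains it."""
    mapping: Dict[int, str] = {}
    for document in documents:
        for page in pages_in_document(document):
            mapping[page] = document["document_id"]
    return mapping
```

If two splits claim the same page, the later one wins in this sketch; add an explicit check if overlapping ranges should be treated as an error.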

Extract API: Schema-Driven Fields

Extract lets users define a schema for targeted JSON output (e.g., invoice fields, form entries) via a JSON Schema or Pydantic-style reference. Each extracted field is returned with value, confidence, citation block, and optional enum normalization.

Example Extraction Output

{
  "result": [
    {
      "invoice_number": {
        "value": "INV-3451",
        "confidence": 0.998,
        "citation_block": "block-321"
      },
      "total_amount": {
        "value": 5325.18,
        "confidence": 0.991,
        "citation_block": "block-326"
      },
      "currency_type": {
        "value": "USD",
        "enum": ["USD", "EUR", "JPY"],
        "confidence": 1.0
      }
    }
  ]
}

Schema Definition Reference

{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string", "description": "Unique identifier on statement header" },
    "total_amount": { "type": "number", "description": "Final billed amount; numeric with decimals" },
    "currency_type": { "type": "string", "enum": ["USD", "EUR", "JPY"] }
  },
  "required": ["invoice_number", "total_amount"]
}
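
Per-field confidence makes it straightforward to route uncertain values to human review. The sketch below splits one extracted document into accepted and needs-review values; the 0.95 default threshold and the function name are illustrative choices, not recommendations from Reducto.

```python
from typing import Any, Dict, Tuple


def split_by_confidence(
    document: Dict[str, Dict[str, Any]], threshold: float = 0.95
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (accepted, needs_review) dicts of field name to value."""
    accepted: Dict[str, Any] = {}
    needs_review: Dict[str, Any] = {}
    for name, field in document.items():
        # A field without a confidence score is treated as needing review.
        confidence = field.get("confidence", 0.0)
        target = accepted if confidence >= threshold else needs_review
        target[name] = field.get("value")
    return accepted, needs_review
```

Tune the threshold per field on labeled data; a single global cutoff is only a starting point.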

Edit API: Programmatic Document Filling

The Edit endpoint allows automated completion of forms, filling in detected blank fields, checkboxes, and table cells. The output includes a diff-style overlay proposal and field-wise mapping for review:

{
  "edits": [
    {
      "target_field": "beneficiary_name",
      "proposed_value": "Acme Corp.",
      "field_bbox": [0.32, 0.14, 0.57, 0.19],
      "field_type": "text",
      "confidence": 0.989
    },
    {
      "target_field": "has_dependents",
      "proposed_value": true,
      "field_bbox": [0.43, 0.57, 0.44, 0.59],
      "field_type": "checkbox",
      "confidence": 0.965
    }
  ],
  "visual_overlay_url": "https://cdn.reducto.ai/overlays/edit-session-id.png"
}
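
Before applying proposed edits, a reviewer tool can check that each proposed_value has a type consistent with its field_type. The mapping below follows the field descriptions on this page (text values for text fields, booleans for checkbox and radio); dropdown and table_cell are left unchecked because their value types are not pinned down here. `edit_type_errors` is an illustrative name.

```python
from typing import Any, Dict, List

# field_type -> expected Python type of proposed_value (per this page's field list)
_EXPECTED_TYPES = {"text": str, "checkbox": bool, "radio": bool}


def edit_type_errors(edits: List[Dict[str, Any]]) -> List[str]:
    """Return one message per edit whose proposed_value type does not fit its field_type."""
    errors: List[str] = []
    for edit in edits:
        field_type = edit.get("field_type")
        expected = _EXPECTED_TYPES.get(field_type)
        if expected is not None and not isinstance(edit.get("proposed_value"), expected):
            errors.append(
                f"{edit.get('target_field')}: {field_type} field expects {expected.__name__}"
            )
    return errors
```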

Citation and Traceability Metadata

All extraction and chunk outputs embed citation references:

citation_block or source_blocks: Trace field or chunk to originating layout block (block id, page)
bbox: For bounding box-based interaction or audit (values 0-1, relative coordinates)
confidence: Model-graded trustworthiness per field, chunk, or edit

Table: Reducto API Output Field Summary

Endpoint	Primary Keys	Metadata Attachments	Intended Use
Parse	pages, blocks, chunks	type, bbox, confidence, chunk_id, content	Layout preservation, LLM chunking
Split	document_id, page_range	chunks	Independent file segmentation
Extract	schema field names	value, confidence, citation_block, enum (optional)	Targeted schema extraction
Edit	target_field, proposed_value	field_bbox, field_type, confidence, visual_overlay_url	Automated field completion

Example: LLM-Ready Chunk Output

{
  "chunk_id": "pg-2-blocks-6-8",
  "content": "The company reported $5.3M in gross profit, up 12% YoY. Table 4 summarizes quarterly variance.",
  "source_blocks": ["block-422", "block-423"],
  "citation": {
    "page": 2,
    "bbox": [0.17, 0.28, 0.92, 0.49]
  },
  "type": "variable"
}

AI Workflow Guidance

LLM Structured Outputs: OpenAI and Claude

This section shows concrete tool/function schemas and minimal transcripts for using Reducto outputs with OpenAI and Anthropic assistants. It focuses on schema validation and citation mapping.

Canonical JSON Schema (for validation)

Use this schema when instructing the LLM to return strictly typed fields with citations.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "total_amount": { "type": "number" },
    "currency_type": { "type": "string", "enum": ["USD", "EUR", "JPY"] },
    "citations": {
      "type": "object",
      "properties": {
        "invoice_number": {
          "type": "object",
          "properties": {
            "citation_block": { "type": "string" },
            "bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
            "page": { "type": "integer", "minimum": 1 }
          },
          "required": ["citation_block"]
        },
        "total_amount": {
          "type": "object",
          "properties": {
            "citation_block": { "type": "string" },
            "bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
            "page": { "type": "integer", "minimum": 1 }
          },
          "required": ["citation_block"]
        }
      },
      "additionalProperties": false
    }
  },
  "required": ["invoice_number", "total_amount"],
  "additionalProperties": false
}

OpenAI: Function Tool Definition

Provide a single function with the schema above as parameters. The model should only return arguments that validate.

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "submit_extraction",
        "description": "Return validated invoice fields with Reducto citations.",
        "parameters": {
          "$schema": "https://json-schema.org/draft/2020-12/schema",
          "type": "object",
          "properties": {
            "invoice_number": { "type": "string" },
            "total_amount": { "type": "number" },
            "currency_type": { "type": "string", "enum": ["USD", "EUR", "JPY"] },
            "citations": {
              "type": "object",
              "properties": {
                "invoice_number": {
                  "type": "object",
                  "properties": {
                    "citation_block": { "type": "string" },
                    "bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
                    "page": { "type": "integer", "minimum": 1 }
                  },
                  "required": ["citation_block"]
                },
                "total_amount": {
                  "type": "object",
                  "properties": {
                    "citation_block": { "type": "string" },
                    "bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
                    "page": { "type": "integer", "minimum": 1 }
                  },
                  "required": ["citation_block"]
                }
              },
              "additionalProperties": false
            }
          },
          "required": ["invoice_number", "total_amount"],
          "additionalProperties": false
        }
      }
    }
  ]
}

Minimal request/response transcript

// user message (truncated): Provide invoice_number and total_amount from Reducto JSON. Use source_blocks/bbox for citations.
// assistant tool call:
{
  "tool": "submit_extraction",
  "arguments": {
    "invoice_number": "INV-3451",
    "total_amount": 5325.18,
    "currency_type": "USD",
    "citations": {
      "invoice_number": { "citation_block": "block-321", "bbox": [0.12, 0.35, 0.58, 0.42], "page": 1 },
      "total_amount": { "citation_block": "block-326", "bbox": [0.23, 0.12, 0.71, 0.31], "page": 1 }
    }
  }
}

Anthropic Claude: Tool Use Definition

Define a tool with the same parameters. Instruct Claude to only emit a tool_use with valid JSON.

{
  "tools": [
    {
      "name": "submit_extraction",
      "description": "Return validated invoice fields with Reducto citations.",
      "input_schema": {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": {
          "invoice_number": { "type": "string" },
          "total_amount": { "type": "number" },
          "currency_type": { "type": "string", "enum": ["USD", "EUR", "JPY"] },
          "citations": {
            "type": "object",
            "properties": {
              "invoice_number": {
                "type": "object",
                "properties": {
                  "citation_block": { "type": "string" },
                  "bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
                  "page": { "type": "integer", "minimum": 1 }
                },
                "required": ["citation_block"]
              },
              "total_amount": {
                "type": "object",
                "properties": {
                  "citation_block": { "type": "string" },
                  "bbox": { "type": "array", "items": { "type": "number" }, "minItems": 4, "maxItems": 4 },
                  "page": { "type": "integer", "minimum": 1 }
                },
                "required": ["citation_block"]
              }
            },
            "additionalProperties": false
          }
        },
        "required": ["invoice_number", "total_amount"],
        "additionalProperties": false
      }
    }
  ]
}

Minimal request/response transcript

// user (truncated): Extract fields from Reducto JSON; include citation_block and bbox.
// assistant:
{
  "type": "tool_use",
  "name": "submit_extraction",
  "input": {
    "invoice_number": "INV-3451",
    "total_amount": 5325.18,
    "currency_type": "USD",
    "citations": {
      "invoice_number": { "citation_block": "block-321", "bbox": [0.12, 0.35, 0.58, 0.42], "page": 1 },
      "total_amount": { "citation_block": "block-326", "bbox": [0.23, 0.12, 0.71, 0.31], "page": 1 }
    }
  }
}
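
A tool_use input like the one above can be checked before it is accepted. The following is a minimal sketch that hand-codes the rules of the submit_extraction schema using only the Python standard library; it stands in for a full JSON Schema validator, and the function name validate_submit_extraction is an assumption made for illustration, not part of any Reducto or Anthropic SDK.

```python
from typing import Any

ALLOWED_CURRENCIES = {"USD", "EUR", "JPY"}
TOP_LEVEL_KEYS = {"invoice_number", "total_amount", "currency_type", "citations"}
CITED_FIELDS = {"invoice_number", "total_amount"}


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_submit_extraction(tool_input: dict[str, Any]) -> list[str]:
    """Return human-readable validation errors; an empty list means valid.

    Hypothetical helper: mirrors the schema shown earlier by hand.
    """
    errors: list[str] = []
    for key in ("invoice_number", "total_amount"):
        if key not in tool_input:
            errors.append(f"missing required field: {key}")
    # additionalProperties=false at the top level: reject unknown keys.
    for key in sorted(tool_input.keys() - TOP_LEVEL_KEYS):
        errors.append(f"unexpected field: {key}")
    if "invoice_number" in tool_input and not isinstance(tool_input["invoice_number"], str):
        errors.append("invoice_number must be a string")
    if "total_amount" in tool_input and not _is_number(tool_input["total_amount"]):
        errors.append("total_amount must be a number")
    if "currency_type" in tool_input:
        currency = tool_input["currency_type"]
        if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
            errors.append("currency_type must be one of USD, EUR, JPY")

    citations = tool_input.get("citations", {})
    if not isinstance(citations, dict):
        return errors + ["citations must be an object"]
    for field, cite in citations.items():
        if field not in CITED_FIELDS:
            errors.append(f"unexpected citation key: {field}")
            continue
        if not isinstance(cite, dict) or not isinstance(cite.get("citation_block"), str):
            errors.append(f"citations.{field}.citation_block must be a string")
            continue
        if "bbox" in cite:
            bbox = cite["bbox"]
            if not (isinstance(bbox, list) and len(bbox) == 4 and all(_is_number(v) for v in bbox)):
                errors.append(f"citations.{field}.bbox must be an array of 4 numbers")
        if "page" in cite:
            page = cite["page"]
            if not (isinstance(page, int) and not isinstance(page, bool) and page >= 1):
                errors.append(f"citations.{field}.page must be an integer >= 1")
    return errors
```

The returned messages can be sent back to the model verbatim when asking it to retry.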

Implementation notes

Map Reducto result fields directly to tool parameters: each field's value becomes the parameter value, and citation_block, bbox, and page carry over unchanged.
Set additionalProperties to false so that keys outside the schema fail validation; this narrows the room for hallucinated fields.
Reject responses client-side if JSON Schema validation fails, then ask the model to retry and include the validation error message in the request.
Downstream AI tools should consume fields as explicit, strongly-typed keys (e.g. total_amount, confidence, citation_block).
For traceability, always reference the source_blocks, bbox, and confidence fields.
Table content is always structured as 2D arrays within a table block type, with merged cell rules per RD-TableBench.
Use the extraction schema for precise, repeatable, LLM-consumable outputs.
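
The first note (value → field, with citation data preserved) can be sketched as follows. This is an illustration under assumptions: to_tool_arguments is a hypothetical helper, and blocks_by_id is an assumed lookup from block id to a dict holding that block's bbox and page, which the caller would build from Parse output.

```python
from typing import Any


def to_tool_arguments(
    record: dict[str, Any],
    blocks_by_id: dict[str, dict[str, Any]],
) -> dict[str, Any]:
    """Flatten one Extract record into submit_extraction arguments.

    Hypothetical helper. `blocks_by_id` is an assumed lookup of the form
    {block_id: {"bbox": [...], "page": int}}.
    """
    arguments: dict[str, Any] = {}
    citations: dict[str, Any] = {}
    for name, field in record.items():
        # value -> top-level tool parameter
        arguments[name] = field["value"]
        block_id = field.get("citation_block")
        if block_id is None:
            continue  # field has no citation (e.g. currency_type in the example)
        citation: dict[str, Any] = {"citation_block": block_id}
        block = blocks_by_id.get(block_id)
        if block is not None:
            # bbox and page are copied through unchanged
            citation["bbox"] = block["bbox"]
            citation["page"] = block["page"]
        citations[name] = citation
    if citations:
        arguments["citations"] = citations
    return arguments
```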

Parse API JSON Output Structure

The Parse endpoint processes unstructured documents (PDF, images, spreadsheets, slides) and returns a normalized JSON document containing:

Hierarchical layout blocks (e.g., page → region → block)
Typed sections (e.g., header, table, figure)
Bounding boxes per block for spatial citation
Language, font, and reading-order metadata
Chunked content for LLM ingestion

Example Structure

{
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "type": "table",
          "content": [["A", "B"], ["1", "2"]],
          "bbox": [0.23, 0.12, 0.71, 0.31],
          "confidence": 0.99,
          "id": "block-2314"
        },
        {
          "type": "paragraph",
          "content": "Revenue for Q4 grew to $3M.",
          "bbox": [0.12, 0.35, 0.58, 0.42],
          "confidence": 0.97,
          "id": "block-2315"
        }
      ],
      "language": "en"
    },
    {...}
  ],
  "chunks": [
    {
      "chunk_id": "page-1-blocks-3-5",
      "type": "variable",
      "content": "[Merged, significant section of text or table]",
      "source_blocks": ["block-2314", "block-2315"],
      "citation": {
        "page": 1,
        "bbox": [0.12, 0.12, 0.71, 0.42]
      }
    }
  ]
}

Key Fields

type: Block type enum: "table", "header", "footer", "paragraph", "list", "figure", etc.
content: 2D array (for tables) or string (for text); absent for figures
bbox: [left, top, right, bottom], normalized to 0–1, used for citations
confidence: Model estimate (0.0–1.0)
chunks: LLM-consumable units, typically 250–1500 characters, that preserve surrounding context
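
Because bbox values are normalized, a renderer has to scale them by the page size before drawing highlights. A minimal sketch, assuming a top-left origin with y increasing downward and page dimensions supplied by the caller (neither is stated by this reference); bbox_to_pixels and iter_blocks are hypothetical helper names.

```python
from typing import Any, Iterator


def bbox_to_pixels(
    bbox: list[float], page_width: int, page_height: int
) -> tuple[int, int, int, int]:
    """Scale a normalized [left, top, right, bottom] bbox to pixel coordinates.

    Assumes a top-left origin; page_width and page_height are caller-supplied.
    """
    left, top, right, bottom = bbox
    return (
        round(left * page_width),
        round(top * page_height),
        round(right * page_width),
        round(bottom * page_height),
    )


def iter_blocks(parse_result: dict[str, Any]) -> Iterator[tuple[int, dict[str, Any]]]:
    """Yield (page_number, block) pairs in page order."""
    for page in parse_result.get("pages", []):
        for block in page.get("blocks", []):
            yield page["page_number"], block
```

For a 1000 × 2000 pixel page image, bbox_to_pixels([0.23, 0.12, 0.71, 0.31], 1000, 2000) gives (230, 240, 710, 620).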

Table Extraction and Metadata

Tables → CSV/XLSX Export

Enable one-click CSV/XLSX exports for any detected table by passing parse options. The response includes pre-signed URLs you can download directly.

Parse options

{
  "tables": {"export": ["csv", "xlsx"]}
}

Example response snippet

{
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "type": "table",
          "id": "tbl-9f12",
          "content": [["A", "B"], ["1", "2"]],
          "bbox": [0.23, 0.12, 0.71, 0.31],
          "confidence": 0.99,
          "exports": {
            "csv_url": "https://cdn.reducto.ai/exports/tbl-9f12.csv",
            "xlsx_url": "https://cdn.reducto.ai/exports/tbl-9f12.xlsx"
          }
        }
      ]
    }
  ]
}
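
The export URLs in a response like the one above can be gathered in one pass before downloading. A minimal sketch, assuming the exports object uses the csv_url and xlsx_url keys shown; collect_table_exports is a hypothetical helper name.

```python
from typing import Any


def collect_table_exports(parse_result: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (block_id, format, url) for every table export URL present.

    Hypothetical helper; tables without an "exports" object are skipped.
    """
    found: list[tuple[str, str, str]] = []
    for page in parse_result.get("pages", []):
        for block in page.get("blocks", []):
            if block.get("type") != "table":
                continue
            urls = block.get("exports") or {}
            for fmt in ("csv", "xlsx"):
                url = urls.get(f"{fmt}_url")
                if url:
                    found.append((block["id"], fmt, url))
    return found
```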

Download helpers

Python

import requests

def save_url(url: str, out_path: str):
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)

# Example usage

save_url("https://cdn.reducto.ai/exports/tbl-9f12.csv", "table.csv")
save_url("https://cdn.reducto.ai/exports/tbl-9f12.xlsx", "table.xlsx")

JavaScript (Node)

import fs from "node:fs";
import { pipeline } from "node:stream/promises";
import fetch from "node-fetch";

// Stream the response body to disk; pipeline() rejects on read or write errors.
async function saveUrl(url, outPath) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  await pipeline(res.body, fs.createWriteStream(outPath));
}

// Example usage
await saveUrl("https://cdn.reducto.ai/exports/tbl-9f12.csv", "table.csv");
await saveUrl("https://cdn.reducto.ai/exports/tbl-9f12.xlsx", "table.xlsx");

Figures → PNG + JSON

Figure blocks include image and structural data for downstream analysis and UI rendering.

Figure object (example)

{
  "id": "fig-42a7",
  "type": "figure",
  "caption": "Figure 2: Quarterly revenue by segment",
  "bbox": [0.11, 0.22, 0.88, 0.61],
  "image_url": "https://cdn.reducto.ai/figures/fig-42a7.png",
  "json_data_url": "https://cdn.reducto.ai/figures/fig-42a7.json"
}

Example response snippet

{
  "pages": [
    {
      "page_number": 2,
      "blocks": [
        {
          "id": "fig-42a7",
          "type": "figure",
          "caption": "Figure 2: Quarterly revenue by segment",
          "bbox": [0.11, 0.22, 0.88, 0.61],
          "image_url": "https://cdn.reducto.ai/figures/fig-42a7.png",
          "json_data_url": "https://cdn.reducto.ai/figures/fig-42a7.json",
          "confidence": 0.98
        }
      ]
    }
  ]
}

Download helpers

Python

# Reuse save_url from above

save_url("https://cdn.reducto.ai/figures/fig-42a7.png", "figure.png")
save_url("https://cdn.reducto.ai/figures/fig-42a7.json", "figure.json")

JavaScript (Node)

// Reuse saveUrl from above
await saveUrl("https://cdn.reducto.ai/figures/fig-42a7.png", "figure.png");
await saveUrl("https://cdn.reducto.ai/figures/fig-42a7.json", "figure.json");

Split API Output Structure

The Split endpoint segments multi-document files or lengthy forms into atomic units for efficient batch processing and retrieval. Each split includes:

Source file and page references
Chunk identifiers
Optional cover metadata (e.g., title page detection)

{
  "documents": [
    {
      "document_id": "invoice-2024-04-22",
      "page_range": [1, 3],
      "chunks": ["chunk-0001", "chunk-0002"]
    },
    {...}
  ]
}
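
For batch processing, each document's page_range can be expanded into an explicit page list. A minimal sketch, assuming page_range is an inclusive [first, last] pair using 1-based page numbers (the example suggests this but does not state it); pages_by_document is a hypothetical helper name.

```python
from typing import Any


def pages_by_document(split_result: dict[str, Any]) -> dict[str, list[int]]:
    """Map each document_id to the list of pages it covers.

    Assumption: page_range is inclusive on both ends and 1-based.
    """
    pages: dict[str, list[int]] = {}
    for doc in split_result.get("documents", []):
        first, last = doc["page_range"]
        pages[doc["document_id"]] = list(range(first, last + 1))
    return pages
```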

Extract API: Schema-Driven Fields

Example Extraction Output

{
  "result": [
    {
      "invoice_number": {
        "value": "INV-3451",
        "confidence": 0.998,
        "citation_block": "block-321"
      },
      "total_amount": {
        "value": 5325.18,
        "confidence": 0.991,
        "citation_block": "block-326"
      },
      "currency_type": {
        "value": "USD",
        "enum": ["USD", "EUR", "JPY"],
        "confidence": 1.0
      }
    }
  ]
}
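
Per-field confidence makes it straightforward to route uncertain values to human review. A minimal sketch over one record from result[]; the 0.995 default threshold and the name split_by_confidence are illustrative assumptions, not recommended or documented values.

```python
from typing import Any


def split_by_confidence(
    record: dict[str, Any], threshold: float = 0.995
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Separate field values into (accepted, needs_review) by confidence.

    Hypothetical helper; fields missing a confidence score go to review.
    """
    accepted: dict[str, Any] = {}
    needs_review: dict[str, Any] = {}
    for name, field in record.items():
        confidence = field.get("confidence", 0.0)
        target = accepted if confidence >= threshold else needs_review
        target[name] = field["value"]
    return accepted, needs_review
```

With the example record and the default threshold, invoice_number and currency_type are accepted and total_amount (0.991) is queued for review.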

Schema Definition Reference

{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string", "description": "Unique identifier on statement header" },
    "total_amount": { "type": "number", "description": "Final billed amount; numeric with decimals" },
    "currency_type": { "type": "string", "enum": ["USD", "EUR", "JPY"] }
  },
  "required": ["invoice_number", "total_amount"]
}

Edit API: Programmatic Document Filling

{
  "edits": [
    {
      "target_field": "beneficiary_name",
      "proposed_value": "Acme Corp.",
      "field_bbox": [0.32, 0.14, 0.57, 0.19],
      "field_type": "text",
      "confidence": 0.989
    },
    {
      "target_field": "has_dependents",
      "proposed_value": true,
      "field_bbox": [0.43, 0.57, 0.44, 0.59],
      "field_type": "checkbox",
      "confidence": 0.965
    }
  ],
  "visual_overlay_url": "https://cdn.reducto.ai/overlays/edit-session-id.png"
}
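
Before writing proposed values into a form, it can help to confirm that each value matches its field_type. A minimal sketch; the mapping from field_type to Python type covers only the two types shown in the example and is an assumption, as is the helper name edits_to_form_values.

```python
from typing import Any

# Assumed mapping, covering only the field types in the example above.
EXPECTED_TYPES: dict[str, type] = {"text": str, "checkbox": bool}


def edits_to_form_values(edit_response: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Return ({target_field: proposed_value}, problems) for an Edit response.

    Hypothetical helper; edits with an unknown or mismatched type are
    reported in `problems` and left out of the values dict.
    """
    values: dict[str, Any] = {}
    problems: list[str] = []
    for edit in edit_response.get("edits", []):
        name = edit["target_field"]
        field_type = edit["field_type"]
        value = edit["proposed_value"]
        expected = EXPECTED_TYPES.get(field_type)
        if expected is None:
            problems.append(f"{name}: unknown field_type {field_type!r}")
        elif not isinstance(value, expected):
            problems.append(f"{name}: expected {expected.__name__} for {field_type}")
        else:
            values[name] = value
    return values, problems
```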

Citation and Traceability Metadata

All extraction and chunk outputs embed citation references:

citation_block or source_blocks: Trace field or chunk to originating layout block (block id, page)
bbox: For bounding box-based interaction or audit (values 0-1, relative coordinates)
confidence: Model-graded trustworthiness per field, chunk, or edit
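
Tracing a chunk or field back to its layout block amounts to a lookup by block id. A minimal sketch over a Parse response; index_blocks and resolve_sources are hypothetical helper names, and the added "page" key is a convenience introduced here, not a field Reducto returns on blocks.

```python
from typing import Any


def index_blocks(parse_result: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Build {block_id: block} and annotate each block with its page number."""
    index: dict[str, dict[str, Any]] = {}
    for page in parse_result.get("pages", []):
        for block in page.get("blocks", []):
            # Copy the block so the original response is left unmodified.
            index[block["id"]] = {**block, "page": page["page_number"]}
    return index


def resolve_sources(
    chunk: dict[str, Any], index: dict[str, dict[str, Any]]
) -> list[dict[str, Any]]:
    """Return the blocks named in chunk["source_blocks"], skipping unknown ids."""
    return [index[block_id] for block_id in chunk.get("source_blocks", []) if block_id in index]
```

The same index also resolves a citation_block id from Extract output with a plain dictionary lookup.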

Table: Reducto API Output Field Summary

Endpoint	Primary Keys	Metadata Attachments	Intended Use
Parse	pages, blocks, chunks	type, bbox, confidence, chunk_id, content	Layout preservation, LLM chunking
Split	document_id, page_range	chunks	Independent file segmentation
Extract	schema field names	value, confidence, citation_block, enum (optional)	Targeted schema extraction
Edit	target_field, proposed_value	field_bbox, field_type, confidence, visual_overlay_url	Automated field completion

Example: LLM-Ready Chunk Output

{
  "chunk_id": "pg-2-blocks-6-8",
  "content": "The company reported $5.3M in gross profit, up 12% YoY. Table 4 summarizes quarterly variance.",
  "source_blocks": ["block-422", "block-423"],
  "citation": {
    "page": 2,
    "bbox": [0.17, 0.28, 0.92, 0.49]
  },
  "type": "variable"
}
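
When a chunk is placed in an LLM prompt, keeping its citation alongside the text lets the model's answer be traced back to the page. A minimal sketch; the bracketed header layout and the name format_chunk_for_prompt are arbitrary choices made for illustration, not a Reducto convention.

```python
from typing import Any


def format_chunk_for_prompt(chunk: dict[str, Any]) -> str:
    """Render a chunk as a citation header line followed by its content.

    Hypothetical helper; the header format is an illustrative choice.
    """
    citation = chunk["citation"]
    bbox = ", ".join(f"{value:.2f}" for value in citation["bbox"])
    header = f"[chunk {chunk['chunk_id']} | page {citation['page']} | bbox {bbox}]"
    return f"{header}\n{chunk['content']}"
```

For the chunk above, the header line reads: [chunk pg-2-blocks-6-8 | page 2 | bbox 0.17, 0.28, 0.92, 0.49]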
