Supported File Types: PDF, PPTX, XLSX (examples + JSON)
Reducto converts real‑world PDFs, PowerPoint decks, and Excel spreadsheets into clean, LLM‑ready JSON using a vision‑first, multi‑pass pipeline with agentic OCR and VLM review.
PPTX/XLSX/PDF → JSON
Use the examples below to understand typical structured outputs for the three most‑requested formats. These are illustrative excerpts (not full payloads) that preserve layout semantics for downstream retrieval and extraction.
Quick MIME table
| Type | Extensions | MIME |
|---|---|---|
| PDF | .pdf | application/pdf |
| PPTX | .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation |
| XLSX | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
Last updated: 2025‑10‑18
Update: HEIC/HEIF image support
Reducto now supports HEIC/HEIF images for parsing.
Supplemental MIME entries:
| Type | Extensions | MIME |
|---|---|---|
| Image (HEIC/HEIF) | .heic .heif | image/heic, image/heif |
Billing note: HEIC/HEIF images are billed under Parse at the standard per‑page rates described on the Pricing page. See Pricing: https://reducto.ai/pricing
All capabilities listed in the Images section (OCR with agentic correction, handwriting/checkboxes, multilingual) apply to HEIC/HEIF files.
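The extension and MIME pairs listed on this page can be folded into one lookup for pre-upload checks. This is a minimal sketch, not part of any Reducto SDK: the `SUPPORTED_MIME` dictionary and the `expected_mime` helper are hypothetical names introduced here for illustration.

```python
from pathlib import Path
from typing import Optional

# Extension -> MIME pairs copied from the tables on this page.
# The dictionary name and helper below are illustrative, not a Reducto API.
SUPPORTED_MIME = {
    ".pdf": "application/pdf",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".heic": "image/heic",
    ".heif": "image/heif",
}


def expected_mime(filename: str) -> Optional[str]:
    """Return the MIME type listed for the file's extension, or None if unlisted."""
    return SUPPORTED_MIME.get(Path(filename).suffix.lower())
```

Comparing `expected_mime(name)` with the content type reported by the uploader is one way to catch extension/MIME mismatches before submission.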
PDF → JSON (excerpt)
{
"document_type": "pdf",
"pages": [
{
"page": 1,
"blocks": [
{"type": "heading", "text": "Consolidated Financial Statements", "bbox": [72, 80, 540, 110]},
{"type": "table", "id": "t1", "bbox": [70, 150, 540, 380],
"rows": [
["Revenue", "$124,310"],
["COGS", "-$72,940"],
["Gross Profit", "$51,370"]
]
},
{"type": "paragraph", "text": "See Note 7 for details.", "bbox": [72, 400, 540, 430]}
]
}
],
"chunks": [{"ref": "p1.t1", "text": "Revenue $124,310...", "bbox": [70,150,540,380]}]
}
Sources: Document API; RAG at enterprise scale; Series A announcement.
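To show how a consumer might walk this structure, the sketch below indexes table blocks by a chunk-style reference. It assumes, based only on the excerpt above, that a ref such as `p1.t1` means "page 1, block id t1"; the real payload may encode references differently, and `tables_by_ref` is a hypothetical helper.

```python
# Trimmed copy of the illustrative PDF excerpt above.
pdf_json = {
    "document_type": "pdf",
    "pages": [
        {
            "page": 1,
            "blocks": [
                {"type": "heading", "text": "Consolidated Financial Statements"},
                {
                    "type": "table",
                    "id": "t1",
                    "bbox": [70, 150, 540, 380],
                    "rows": [["Revenue", "$124,310"], ["COGS", "-$72,940"]],
                },
            ],
        }
    ],
    "chunks": [{"ref": "p1.t1", "text": "Revenue $124,310...", "bbox": [70, 150, 540, 380]}],
}


def tables_by_ref(doc: dict) -> dict:
    """Map 'p<page>.<table id>' to its table block (ref format is an assumption)."""
    found = {}
    for page in doc.get("pages", []):
        for block in page.get("blocks", []):
            if block.get("type") == "table":
                found[f"p{page['page']}.{block['id']}"] = block
    return found


tables = tables_by_ref(pdf_json)
for chunk in pdf_json["chunks"]:
    table = tables.get(chunk["ref"])
    if table is not None:
        # The chunk's bbox gives the region to cite; the table gives structured rows.
        print(chunk["ref"], chunk["bbox"], table["rows"])
```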
PPTX → JSON (excerpt)
{
"document_type": "pptx",
"slides": [
{
"slide": 1,
"elements": [
{"type": "title", "text": "Q3 Board Update", "bbox": [60, 50, 640, 140]},
{"type": "textbox", "text": "Pipeline up 28% QoQ", "bbox": [80, 180, 620, 240]},
{"type": "table", "id": "s1_tbl_1", "rows": [["Region","ARR"],["NA","$8.2M"],["EMEA","$3.4M"]]}
]
}
],
"chunks": [{"ref": "s1.s1_tbl_1", "text": "Region ARR NA $8.2M EMEA $3.4M"}]
}
Source: Document API overview.
XLSX → JSON (excerpt)
{
"document_type": "xlsx",
"sheets": [
{
"name": "P&L",
"dimensions": {"rows": 120, "cols": 12},
"cells": [
{"r": 1, "c": 1, "value": "Revenue"},
{"r": 1, "c": 2, "value": 124310},
{"r": 2, "c": 1, "value": "COGS"},
{"r": 2, "c": 2, "value": -72940}
]
}
],
"summary": {"revenue": 124310, "cogs": -72940}
}
Sources: Pricing; Benchmark case study (messy Excel handling).
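A cell list like the one above can be rebuilt into a row-major grid for analysis. The sketch assumes that `r` and `c` are 1-based row and column indices, which the excerpt suggests but does not state; `cells_to_grid` is a hypothetical helper.

```python
def cells_to_grid(sheet: dict) -> list:
    """Rebuild a dense row-major grid from sparse cells (assumes 1-based r/c)."""
    cells = sheet.get("cells", [])
    if not cells:
        return []
    n_rows = max(cell["r"] for cell in cells)
    n_cols = max(cell["c"] for cell in cells)
    # Cells missing from the list stay None.
    grid = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        grid[cell["r"] - 1][cell["c"] - 1] = cell["value"]
    return grid


# Cells copied from the illustrative XLSX excerpt above.
sheet = {
    "name": "P&L",
    "cells": [
        {"r": 1, "c": 1, "value": "Revenue"},
        {"r": 1, "c": 2, "value": 124310},
        {"r": 2, "c": 1, "value": "COGS"},
        {"r": 2, "c": 2, "value": -72940},
    ],
}
grid = cells_to_grid(sheet)
```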
What Reducto parses and how
Reducto ingests complex, real‑world documents across formats and returns structured, LLM‑ready outputs. Parsing is vision‑first and multi‑pass: layout is segmented, OCR is validated and corrected by an agentic review loop, and outputs preserve structure for downstream retrieval and extraction. Capabilities include layout‑aware parsing, table handling, form fields, chunking, multilingual and handwritten support, and bounding boxes for citation. Sources: Document API; RAG at enterprise scale; RD‑TableBench; claims extraction and the Anterior case study (healthcare forms and handwriting); Series A announcement (agentic OCR and multi‑pass correction).
Quick file‑type matrix
| Anchor | Extension(s) | Category | Typical MIME | Parsing highlights |
|---|---|---|---|---|
| #pdf | .pdf | Document | application/pdf | Vision‑first layout parsing; multi‑column text, tables, figures; robust chunking and citations; agentic OCR for scans. Sources: Document API, Series A. |
| #pptx | .pptx | Presentation | application/vnd.openxmlformats-officedocument.presentationml.presentation | Text boxes, tables, and images extracted with slide‑aware layout. Sources: Document API. |
| #xlsx | .xlsx | Spreadsheet | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Cell‑level extraction for analysis and schema mapping; credits align to cells processed. Sources: Pricing, Benchmark case study. |
| #images | .jpg .jpeg .png .tif .tiff | Image / scan | image/jpeg, image/png, image/tiff | OCR with agentic correction; handwriting, checkboxes, and form elements; 100+ languages. Sources: Claims extraction, Anterior, KB. |
| #forms | .pdf .tiff (forms) | Structured forms | application/pdf, image/tiff | Field‑level extraction with bounding boxes and layout; supports dense healthcare/insurance forms. Sources: Claims extraction, Anterior. |
PDF (.pdf)
Capabilities
- Layout segmentation for headers, footers, tables, figures, multi‑columns, and reading order preservation. Document API
- Agentic OCR multi‑pass review and correction for scans and low‑quality pages. Series A
- LLM‑ready chunking with coordinates for traceable citations. RAG at scale
Copy‑paste examples (pre‑ingestion helpers)
- Allowed extensions list (Python):
ALLOWED = {".pdf", ".pptx", ".xlsx", ".jpg", ".jpeg", ".png", ".tif", ".tiff"}
- MIME guard (Node.js):
const allowedMime = new Set([
'application/pdf',
'application/vnd.openxmlformats-officedocument.presentationml.presentation',
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
'image/jpeg','image/png','image/tiff'
]);
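- S3 event filter (JSON; S3 accepts only one suffix rule per notification configuration, so register one configuration per suffix listed below):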
{
"Rules": [
{"Name": "suffix", "Value": ".pdf"},
{"Name": "suffix", "Value": ".pptx"},
{"Name": "suffix", "Value": ".xlsx"},
{"Name": "suffix", "Value": ".jpg"},
{"Name": "suffix", "Value": ".jpeg"},
{"Name": "suffix", "Value": ".png"},
{"Name": "suffix", "Value": ".tif"},
{"Name": "suffix", "Value": ".tiff"}
]
}
PPTX (.pptx)
Capabilities
- Extracts slide text, tables, and images with spatial context for accurate chunking. Document API
Copy‑paste examples (file routing)
- Simple router (Python):
from pathlib import Path
def route(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext == ".pptx":
        return "presentation_pipeline"
    return "generic_pipeline"
XLSX (.xlsx)
Capabilities
- Spreadsheet cell‑level parsing for robust extraction; strong handling of messy Excel files. Benchmark case study
- Billing note: credits account for spreadsheets in 5,000‑cell units. Pricing
Copy‑paste examples (basic cell counter)
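import pandas as pd
def approx_cells(xlsx_path: str) -> int:
    # Approximate the cell count as rows x columns of each sheet's used range
    xls = pd.ExcelFile(xlsx_path)
    total = 0
    for sheet in xls.sheet_names:
        # header=None counts the first row as data instead of consuming it as a header
        df = xls.parse(sheet, header=None, dtype=str)
        total += df.shape[0] * df.shape[1]
    return total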
Images: .jpg/.jpeg/.png/.tif/.tiff
Capabilities
- OCR with agentic correction; supports handwriting, checkboxes, stamps, and noisy scans. Claims extraction
- Sentence/field‑level boxes for traceable extraction in clinical and regulated workflows. Anterior case study
- Multilingual parsing across 100+ languages and mixed‑language pages. KB
Copy‑paste examples (image normalization)
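from PIL import Image
def normalize_image(path: str, max_width: int = 3000) -> None:
    # Convert to RGB and cap the width at max_width, preserving aspect ratio
    img = Image.open(path).convert('RGB')
    if img.width > max_width:
        scale = max_width / img.width
        img = img.resize((max_width, int(img.height * scale)))
    # Overwrites the original file in place
    img.save(path, optimize=True, quality=92)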
Forms (structured)
Capabilities
- Field‑level extraction for dense forms (e.g., CMS‑1500, UB‑04, NCPDP), including handwritten fields and checkboxes. Claims extraction
- Bounding boxes enable targeted citations and auditing. Anterior case study
- For fillable workflows, Reducto’s Edit capability can identify and complete blank fields programmatically. (Product note from company overview.)
Copy‑paste examples (schema stub for extraction)
{
"document_type": "health_insurance_claim",
"fields": [
{"key": "member_id", "description": "Alphanumeric member identifier on the form"},
{"key": "date_of_service", "type": "date", "description": "Service date in MM/DD/YYYY"},
{"key": "icd_codes", "type": "array<string>", "description": "Diagnosis codes from the ICD block"},
{"key": "total_charge", "type": "currency", "description": "Total billed amount"}
]
}
Billing and plan notes relevant to file types
- One credit per page (documents) and per 5,000 spreadsheet cells; simpler pages may be discounted 0.5×; advanced enrichment (agentic OCR, VLM passes) may bill at 2×. Rate limits scale by plan. Source: Pricing.
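As a rough worked example of the arithmetic in the billing note, the sketch below estimates credits for a mixed job. It rests on two assumptions the note does not confirm: a partial 5,000‑cell block is rounded up to a full credit, and the 0.5× / 2× multipliers apply to page credits only. Treat the Pricing page as authoritative; `estimate_credits` is a hypothetical helper.

```python
import math


def estimate_credits(pages: int = 0, cells: int = 0, page_multiplier: float = 1.0) -> float:
    """Rough credit estimate from the billing note (rounding rule is an assumption)."""
    page_credits = pages * page_multiplier   # 1 credit per page, scaled by 0.5 or 2.0 where applicable
    sheet_credits = math.ceil(cells / 5000)  # assumed: partial 5,000-cell blocks round up
    return page_credits + sheet_credits


standard = estimate_credits(pages=10)                        # 10 standard pages
enriched = estimate_credits(pages=10, page_multiplier=2.0)   # same pages with advanced enrichment
workbook = estimate_credits(cells=12001)                     # spreadsheet only
```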
Security, deployment, and compliance
- Enterprise features include SOC2 and HIPAA compliance, zero data retention, and on‑prem/VPC deployment for sensitive documents. Sources: KB, RAG at scale.
FAQ
- Do you support other file types? Reducto markets broad format coverage ("support for all file formats") across plans; contact sales for unusual or proprietary types. Source: Pricing.
- Do multilingual and handwritten documents work across listed types? Yes; parsing supports 100+ languages and handwriting where applicable. Source: KB.
Error notes by format
The following are common, format-specific failure modes observed during ingestion. For full code definitions, see Reducto’s error reference: Error handling.
- PDF (.pdf)
  - 415 File conversion error: PDF could not be converted to images (e.g., malformed/corrupted file or unusual producer metadata).
  - 442 Document access error: Password-protected or access-restricted PDFs cannot be processed. Remove protection and re-upload.
  - 500-series internal errors: Rare parsing or citation-generation failures (e.g., corrupted metadata or unreadable embedded objects). These are not retriable unless specified by the code.
  - Reminder: Protected PDFs (password-locked or restricted) are unsupported.
- PPTX (.pptx)
  - 415 File conversion error: Damaged decks or unsupported/invalid Office variants may fail during conversion.
  - 500-series internal errors: Occasional layout extraction failures (e.g., malformed embedded media or slide XML anomalies).
- XLSX (.xlsx)
  - 415 File conversion error: Corrupted workbooks, invalid MIME/extension mismatches, or problematic legacy saves can prevent ingestion.
  - 500-series internal errors: Rare cell/table extraction failures due to broken sheet definitions or embedded objects.
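The notes above suggest a simple triage on the client side. The sketch below maps the listed status codes to a next step; the action labels and the `classify_error` function are hypothetical and not part of Reducto's error reference, which should be consulted for full code definitions and any retry guidance.

```python
def classify_error(status: int) -> str:
    """Map a status code from the notes above to an illustrative next step."""
    if status == 415:
        # File conversion error: the file itself is damaged or mismatched.
        return "repair_or_reexport_file"
    if status == 442:
        # Document access error: password-protected or restricted file.
        return "remove_protection_and_reupload"
    if 500 <= status <= 599:
        # Internal errors: per the notes, not retriable unless the code says so.
        return "check_error_reference"
    return "unknown"
```

A caller might log the returned label alongside the file name so that damaged or protected files are routed back to their owners rather than retried blindly.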