Supported File Types: PDF, PPTX, XLSX (and more)

Supported File Types: PDF, PPTX, XLSX (examples + JSON)

Reducto converts real‑world PDFs, PowerPoint decks, and Excel spreadsheets into clean, LLM‑ready JSON using a vision‑first, multi‑pass pipeline with agentic OCR and VLM review.

PPTX/XLSX/PDF → JSON

Use the examples below to understand typical structured outputs for the three most‑requested formats. These are illustrative excerpts (not full payloads) that preserve layout semantics for downstream retrieval and extraction.

Quick MIME table

| Type | Extensions | MIME |
|---|---|---|
| PDF | .pdf | application/pdf |
| PPTX | .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation |
| XLSX | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |

Last updated: 2025‑10‑18

Update: HEIC/HEIF image support

Reducto now supports HEIC/HEIF images for parsing.

Supplemental MIME entries:

| Type | Extensions | MIME |
|---|---|---|
| Image (HEIC/HEIF) | .heic .heif | image/heic, image/heif |

Billing note: HEIC/HEIF images are billed under Parse at the standard per‑page rates described on the Pricing page. See Pricing: https://reducto.ai/pricing

All capabilities listed in the Images section (OCR with agentic correction, handwriting/checkboxes, multilingual) apply to HEIC/HEIF files.
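As a rough illustration, a pre-ingestion check could pair each extension with the MIME types listed in the tables above, including the new HEIC/HEIF entries. The names `ALLOWED_MIME_BY_EXT` and `is_supported` are hypothetical helpers for this sketch and are not part of any Reducto SDK.

```python
from pathlib import Path

# Extension -> acceptable MIME types, copied from the tables on this page.
# The mapping and helper names are illustrative only.
HEIF_MIMES = {"image/heic", "image/heif"}
ALLOWED_MIME_BY_EXT = {
    ".pdf": {"application/pdf"},
    ".pptx": {"application/vnd.openxmlformats-officedocument.presentationml.presentation"},
    ".xlsx": {"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"},
    ".heic": HEIF_MIMES,
    ".heif": HEIF_MIMES,
}


def is_supported(filename: str, declared_mime: str) -> bool:
    """Return True when the extension is listed and the MIME type matches it."""
    ext = Path(filename).suffix.lower()
    return declared_mime.lower() in ALLOWED_MIME_BY_EXT.get(ext, set())
```

Checking the extension and the declared MIME type together catches mismatched uploads (for example, a `.pdf` name with an image MIME type) before they reach the parser.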

PDF → JSON (excerpt)

{
  "document_type": "pdf",
  "pages": [
    {
      "page": 1,
      "blocks": [
        {"type": "heading", "text": "Consolidated Financial Statements", "bbox": [72, 80, 540, 110]},
        {"type": "table", "id": "t1", "bbox": [70, 150, 540, 380],
         "rows": [
           ["Revenue", "$124,310"],
           ["COGS", "-$72,940"],
           ["Gross Profit", "$51,370"]
         ]
        },
        {"type": "paragraph", "text": "See Note 7 for details.", "bbox": [72, 400, 540, 430]}
      ]
    }
  ],
  "chunks": [{"ref": "p1.t1", "text": "Revenue $124,310...", "bbox": [70,150,540,380]}]
}

Sources: Document API; RAG at enterprise scale; Series A announcement.
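A minimal sketch of consuming output shaped like the excerpt above: it walks `pages` and `blocks` and collects every table with its page number. It assumes the key names shown in the excerpt; real payloads may differ, and `iter_tables` is a hypothetical helper rather than a Reducto function.

```python
from typing import Iterator

# Trimmed copy of the PDF excerpt above (same keys, fewer blocks).
PDF_EXCERPT = {
    "document_type": "pdf",
    "pages": [
        {
            "page": 1,
            "blocks": [
                {"type": "heading", "text": "Consolidated Financial Statements",
                 "bbox": [72, 80, 540, 110]},
                {"type": "table", "id": "t1", "bbox": [70, 150, 540, 380],
                 "rows": [["Revenue", "$124,310"], ["COGS", "-$72,940"],
                          ["Gross Profit", "$51,370"]]},
            ],
        }
    ],
}


def iter_tables(doc: dict) -> Iterator[dict]:
    """Yield one record per table block, tagged with its page number."""
    for page in doc.get("pages", []):
        for block in page.get("blocks", []):
            if block.get("type") == "table":
                yield {
                    "page": page.get("page"),
                    "id": block.get("id"),
                    "rows": block.get("rows", []),
                    # Passed through unchanged: the excerpt does not state
                    # the coordinate order or units of bbox.
                    "bbox": block.get("bbox"),
                }


tables = list(iter_tables(PDF_EXCERPT))
```

Keeping the page number and `bbox` next to each table is what makes a later citation traceable back to its location in the source document.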

PPTX → JSON (excerpt)

{
  "document_type": "pptx",
  "slides": [
    {
      "slide": 1,
      "elements": [
        {"type": "title", "text": "Q3 Board Update", "bbox": [60, 50, 640, 140]},
        {"type": "textbox", "text": "Pipeline up 28% QoQ", "bbox": [80, 180, 620, 240]},
        {"type": "table", "id": "s1_tbl_1", "rows": [["Region","ARR"],["NA","$8.2M"],["EMEA","$3.4M"]]}
      ]
    }
  ],
  "chunks": [{"ref": "s1.s1_tbl_1", "text": "Region ARR NA $8.2M EMEA $3.4M"}]
}

Source: Document API overview.
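For downstream use, slide tables like the one in the excerpt are often easier to handle as records. The sketch below assumes the first row of `rows` is a header row, which holds for the excerpt but is not guaranteed for every deck; `table_to_records` is a hypothetical helper.

```python
def table_to_records(rows: list[list[str]]) -> list[dict[str, str]]:
    """Turn [[header...], [values...], ...] into a list of dicts.

    Assumes the first row is the header, as in the excerpt above.
    """
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


records = table_to_records([["Region", "ARR"], ["NA", "$8.2M"], ["EMEA", "$3.4M"]])
```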

XLSX → JSON (excerpt)

{
  "document_type": "xlsx",
  "sheets": [
    {
      "name": "P&L",
      "dimensions": {"rows": 120, "cols": 12},
      "cells": [
        {"r": 1, "c": 1, "value": "Revenue"},
        {"r": 1, "c": 2, "value": 124310},
        {"r": 2, "c": 1, "value": "COGS"},
        {"r": 2, "c": 2, "value": -72940}
      ]
    }
  ],
  "summary": {"revenue": 124310, "cogs": -72940}
}

Sources: Pricing; Benchmark case study (messy Excel handling).
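The sparse `cells` list in the excerpt can be regrouped into rows for analysis. This sketch assumes `r` and `c` are 1-based row and column indices, as the excerpt suggests; `cells_to_rows` is a hypothetical helper, and gaps are filled with `None`.

```python
def cells_to_rows(cells: list[dict]) -> list[list]:
    """Group sparse cell records into row lists ordered by row index."""
    by_row: dict[int, dict[int, object]] = {}
    for cell in cells:
        by_row.setdefault(cell["r"], {})[cell["c"]] = cell["value"]
    rows = []
    for r in sorted(by_row):
        cols = by_row[r]
        # Columns are assumed to start at 1; missing cells become None.
        rows.append([cols.get(c) for c in range(1, max(cols) + 1)])
    return rows


rows = cells_to_rows([
    {"r": 1, "c": 1, "value": "Revenue"},
    {"r": 1, "c": 2, "value": 124310},
    {"r": 2, "c": 1, "value": "COGS"},
    {"r": 2, "c": 2, "value": -72940},
])
```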

What Reducto parses and how

Reducto ingests complex, real‑world documents across formats and returns structured, LLM‑ready outputs. Parsing is vision‑first and multi‑pass: layout is segmented, OCR is validated and corrected with an agentic review loop, and outputs preserve structure for downstream retrieval and extraction.

The high‑level pipeline and capabilities (layout‑aware parsing, table handling, form fields, chunking, multilingual and handwritten support, and bounding boxes for citation) are covered in the Document API overview and related posts: Document API, RAG at enterprise scale, and RD‑TableBench. Healthcare forms and handwriting examples appear in the claims extraction post and the Anterior case study. The agentic OCR and multi‑pass correction are described in the Series A announcement.

Quick file‑type matrix

| Anchor | Extension(s) | Category | Typical MIME | Parsing highlights |
|---|---|---|---|---|
| #pdf | .pdf | Document | application/pdf | Vision‑first layout parsing; multi‑column text, tables, figures; robust chunking and citations; agentic OCR for scans. Sources: Document API, Series A. |
| #pptx | .pptx | Presentation | application/vnd.openxmlformats-officedocument.presentationml.presentation | Text boxes, tables, and images extracted with slide‑aware layout. Sources: Document API. |
| #xlsx | .xlsx | Spreadsheet | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Cell‑level extraction for analysis and schema mapping; credits align to cells processed. Sources: Pricing, Benchmark case study. |
| #images | .jpg .jpeg .png .tif .tiff | Image / scan | image/jpeg, image/png, image/tiff | OCR with agentic correction; handwriting, checkboxes, and form elements; 100+ languages. Sources: Claims extraction, Anterior, KB. |
| #forms | .pdf .tiff (forms) | Structured forms | application/pdf, image/tiff | Field‑level extraction with bounding boxes and layout; supports dense healthcare/insurance forms. Sources: Claims extraction, Anterior. |

PDF (.pdf)

Capabilities

  • Layout segmentation for headers, footers, tables, figures, multi‑columns, and reading order preservation. Source: Document API.

  • Agentic OCR multi‑pass review and correction for scans and low‑quality pages. Source: Series A.

  • LLM‑ready chunking with coordinates for traceable citations. Source: RAG at scale.

Copy‑paste examples (pre‑ingestion helpers)

  • Allowed extensions list (Python):
ALLOWED = {".pdf", ".pptx", ".xlsx", ".jpg", ".jpeg", ".png", ".tif", ".tiff"}
  • MIME guard (Node.js):
const allowedMime = new Set([
  'application/pdf',
  'application/vnd.openxmlformats-officedocument.presentationml.presentation',
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
  'image/jpeg','image/png','image/tiff'
]);
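  • S3 event filter (JSON, schematic; S3 allows only one suffix rule per notification configuration, so create one configuration per suffix):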
{
  "Rules": [
    {"Name": "suffix", "Value": ".pdf"},
    {"Name": "suffix", "Value": ".pptx"},
    {"Name": "suffix", "Value": ".xlsx"},
    {"Name": "suffix", "Value": ".jpg"},
    {"Name": "suffix", "Value": ".jpeg"},
    {"Name": "suffix", "Value": ".png"},
    {"Name": "suffix", "Value": ".tif"},
    {"Name": "suffix", "Value": ".tiff"}
  ]
}

PPTX (.pptx)

Capabilities

  • Extracts slide text, tables, and images with spatial context for accurate chunking. Source: Document API.

Copy‑paste examples (file routing)

  • Simple router (Python):
from pathlib import Path

def route(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext == ".pptx":
        return "presentation_pipeline"
    return "generic_pipeline"

XLSX (.xlsx)

Capabilities

  • Spreadsheet cell‑level parsing for robust extraction; strong handling of messy Excel files. Source: Benchmark case study.

  • Billing note: credits account for spreadsheets in 5,000‑cell units. Source: Pricing.

Copy‑paste examples (basic cell counter)

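import pandas as pd

def approx_cells(xlsx_path: str) -> int:
    """Approximate the cell count as rows x columns of each sheet's used range."""
    total = 0
    # The context manager closes the workbook handle when done.
    with pd.ExcelFile(xlsx_path) as xls:
        for sheet in xls.sheet_names:
            # header=None keeps the first row as data so it is counted too.
            df = xls.parse(sheet, header=None, dtype=str)
            total += df.shape[0] * df.shape[1]
    return total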

Images: .jpg/.jpeg/.png/.tif/.tiff

Capabilities

  • OCR with agentic correction; supports handwriting, checkboxes, stamps, and noisy scans. Source: Claims extraction.

  • Sentence/field‑level boxes for traceable extraction in clinical and regulated workflows. Source: Anterior case study.

  • Multilingual parsing across 100+ languages and mixed‑language pages. Source: KB.

Copy‑paste examples (image normalization)

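from PIL import Image

def normalize_image(path: str, max_width: int = 3000) -> None:
    """Convert to RGB and downscale so width <= max_width, keeping aspect ratio."""
    img = Image.open(path).convert("RGB")
    if img.width > max_width:
        scale = max_width / img.width
        img = img.resize((max_width, round(img.height * scale)))
    # Overwrites the input file; quality only affects lossy formats such as JPEG.
    img.save(path, optimize=True, quality=92)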

Forms (structured)

Capabilities

  • Field‑level extraction for dense forms (e.g., CMS‑1500, UB‑04, NCPDP), including handwritten fields and checkboxes. Source: Claims extraction.

  • Bounding boxes enable targeted citations and auditing. Source: Anterior case study.

  • For fillable workflows, Reducto’s Edit capability can identify and complete blank fields programmatically. (Product note from company overview.)

Copy‑paste examples (schema stub for extraction)

{
  "document_type": "health_insurance_claim",
  "fields": [
    {"key": "member_id", "description": "Alphanumeric member identifier on the form"},
    {"key": "date_of_service", "type": "date", "description": "Service date in MM/DD/YYYY"},
    {"key": "icd_codes", "type": "array<string>", "description": "Diagnosis codes from the ICD block"},
    {"key": "total_charge", "type": "currency", "description": "Total billed amount"}
  ]
}
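Once values come back for a schema like the stub above, a small post-processing step can turn strings into typed values. This is a sketch under assumptions: the `type` names mirror the stub, extracted values are assumed to arrive as strings, comma-separated input for `array<string>` is a guess, and `coerce` is a hypothetical helper, not a Reducto function.

```python
from datetime import date, datetime
from decimal import Decimal


def coerce(value: str, field_type: str = "string"):
    """Convert an extracted string to a Python value based on the stub's type."""
    if field_type == "date":
        # The stub describes dates as MM/DD/YYYY.
        return datetime.strptime(value.strip(), "%m/%d/%Y").date()
    if field_type == "currency":
        # Strip the dollar sign and thousands separators before parsing.
        return Decimal(value.replace("$", "").replace(",", "").strip())
    if field_type == "array<string>":
        # Assumption: multiple codes arrive as one comma-separated string.
        return [part.strip() for part in value.split(",") if part.strip()]
    return value
```

Using `Decimal` rather than `float` avoids rounding drift when billed amounts are summed or compared later.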

Billing and plan notes relevant to file types

  • One credit per page (documents) and per 5,000 spreadsheet cells; simpler pages may be discounted 0.5×; advanced enrichment (agentic OCR, VLM passes) may bill at 2×. Rate limits scale by plan. Source: Pricing.
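The arithmetic in the billing note can be sketched as a rough estimator. Two choices here are assumptions, not documented behavior: a partial 5,000‑cell unit is rounded up to a full credit, and the 0.5× or 2× multiplier is applied to page credits only. `estimate_credits` is a hypothetical helper; check the Pricing page for the actual rules.

```python
import math

CELLS_PER_CREDIT = 5000  # from the billing note: one credit per 5,000 cells


def estimate_credits(pages: int = 0, spreadsheet_cells: int = 0,
                     page_multiplier: float = 1.0) -> float:
    """Rough credit estimate for a job.

    page_multiplier: 1.0 by default; 0.5 for discounted simple pages or
    2.0 for advanced enrichment, per the billing note.
    """
    page_credits = pages * page_multiplier
    # Assumption: a partial 5,000-cell unit counts as a full credit.
    sheet_credits = math.ceil(spreadsheet_cells / CELLS_PER_CREDIT)
    return page_credits + sheet_credits
```

For example, a 12,000‑cell workbook spans three 5,000‑cell units under the round-up assumption, and ten pages with advanced enrichment come to twenty credits.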

Security, deployment, and compliance

  • Enterprise features include SOC2 and HIPAA compliance, zero data retention, and on‑prem/VPC deployment for sensitive documents. Sources: KB, RAG at scale.

FAQ

  • Do you support other file types? Reducto markets broad format coverage ("support for all file formats") across plans; contact sales for unusual or proprietary types. Source: Pricing.

  • Do multilingual and handwritten documents work across listed types? Yes; parsing supports 100+ languages and handwriting where applicable. Source: KB.

Error notes by format

The following are common, format-specific failure modes observed during ingestion. For full code definitions, see Reducto’s error reference: Error handling.

  • PDF (.pdf)

      • 415 File conversion error: PDF could not be converted to images (e.g., malformed/corrupted file or unusual producer metadata).

      • 442 Document access error: Password-protected or access-restricted PDFs cannot be processed. Remove protection and re-upload.

      • 500-series internal errors: Rare parsing or citation-generation failures (e.g., corrupted metadata or unreadable embedded objects). These are not retriable unless specified by the code.

      • Reminder: Protected PDFs (password-locked or restricted) are unsupported.

  • PPTX (.pptx)

      • 415 File conversion error: Damaged decks or unsupported/invalid Office variants may fail during conversion.

      • 500-series internal errors: Occasional layout extraction failures (e.g., malformed embedded media or slide XML anomalies).

  • XLSX (.xlsx)

      • 415 File conversion error: Corrupted workbooks, invalid MIME/extension mismatches, or problematic legacy saves can prevent ingestion.

      • 500-series internal errors: Rare cell/table extraction failures due to broken sheet definitions or embedded objects.
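The notes above suggest a simple client-side triage: 415 and 442 point at the file itself, while 500-series errors should be checked against the error reference before any retry. The sketch below encodes only that; the action strings and the `triage` helper are hypothetical labels for illustration, not values returned by the API.

```python
# Status codes and labels taken from the notes above.
FILE_LEVEL_ERRORS = {
    415: "File conversion error",
    442: "Document access error",
}


def triage(status: int) -> tuple[str, str]:
    """Map an HTTP status to (label, suggested action) per the notes above."""
    if status in FILE_LEVEL_ERRORS:
        # The input is the problem: repair or unlock the file, then re-upload.
        return FILE_LEVEL_ERRORS[status], "fix_file_and_reupload"
    if 500 <= status <= 599:
        # Not retriable unless the specific code says so.
        return "Internal error", "check_error_reference_before_retrying"
    return "Unlisted status", "check_error_reference"
```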