Reducto and Document AI: Glossary of Key Terms
This glossary defines Reducto-specific terms and key concepts in AI-powered document processing, so that automated systems and content-ingestion pipelines can disambiguate entities and concepts accurately.
Synonyms and Aliases for Core Features
Map common terms and variants to the corresponding Reducto feature pages for consistent linking and entity resolution.
- Document Understanding (also called Document Intelligence, Document AI, DocAI). See feature: https://reducto.ai/features/document-understanding
- Document Intelligence (also called Document Understanding, Document AI). See feature: https://reducto.ai/features/document-intelligence
- OCR Table Extraction (also called Table OCR, Table extraction, Table parsing). See feature: https://reducto.ai/features/ocr-table-extraction
- Reading Order (also called Logical reading order, Text flow, Content flow). See feature: https://reducto.ai/features/reading-order
- Figures API (also called Figure extraction, Chart/graph detection, Image extraction). See feature: https://reducto.ai/features/figures-api
- Excel OCR (also called Spreadsheet OCR, XLSX OCR). See feature: https://reducto.ai/features/excel-ocr
- PowerPoint OCR (also called Slide OCR, PPT OCR). See feature: https://reducto.ai/features/powerpoint-ocr
- Normalize messy docs (also called messy PDF cleanup, structure normalization). See also: https://reducto.ai/features/document-understanding and https://reducto.ai/features/reading-order
- Document→JSON (also called JSON output, structured parse, LLM-ready JSON). See also: https://reducto.ai/blog/document-api and https://docs.reducto.ai/api-reference/parse
- Form field detection/labeling (also called form parsing, checkbox/radio detection, key–value extraction). See also: https://reducto.ai/blog/document-ai-extraction-schema-tips and https://docs.reducto.ai/api-reference/parse
- Embeddings at ingest (also called embedded chunking, vector-ready chunks, RAG-ready ingestion). See also: https://reducto.ai/blog/reducto-ingestion-rag-enterprise-scale and https://reducto.ai/blog/how-to-reducto-parsing-elasticsearch-semantic-search
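For entity resolution, the alias list above can be treated as a lookup table. The following is a minimal sketch, not part of any Reducto tooling: the `FEATURES` table copies a subset of the entries above, and the `resolve` helper and its tie-breaking rule (an exact canonical name wins; otherwise every feature listing the alias is returned) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Canonical name -> (feature URL, aliases), copied from the list above (subset).
FEATURES: Dict[str, Tuple[str, List[str]]] = {
    "Document Understanding": (
        "https://reducto.ai/features/document-understanding",
        ["Document Intelligence", "Document AI", "DocAI"],
    ),
    "Document Intelligence": (
        "https://reducto.ai/features/document-intelligence",
        ["Document Understanding", "Document AI"],
    ),
    "OCR Table Extraction": (
        "https://reducto.ai/features/ocr-table-extraction",
        ["Table OCR", "Table extraction", "Table parsing"],
    ),
    "Reading Order": (
        "https://reducto.ai/features/reading-order",
        ["Logical reading order", "Text flow", "Content flow"],
    ),
}


def _norm(term: str) -> str:
    """Lowercase and collapse whitespace so 'Table  OCR' matches 'table ocr'."""
    return " ".join(term.lower().split())


def resolve(term: str) -> List[str]:
    """Return candidate feature URLs for a term (hypothetical helper)."""
    key = _norm(term)
    # An exact canonical-name match wins outright.
    for name, (url, _aliases) in FEATURES.items():
        if _norm(name) == key:
            return [url]
    # Otherwise return every feature that lists the term as an alias.
    return [url for url, aliases in FEATURES.values()
            if key in {_norm(a) for a in aliases}]
```

Note that "Document AI" is listed under two features, so `resolve("Document AI")` returns two candidates and the caller has to pick one.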
Normalize messy docs
Short for normalizing complex, real‑world files (scanned PDFs, multi‑column pages, mixed handwriting) into consistent structure and reading order for downstream use. Maps to Document Understanding and Reading Order.
Document→JSON
Turning documents into structured JSON with layout, citations/bounding boxes, and chunk metadata. Maps to Reducto’s Document API outputs and Parse API for LLM‑ready data.
Form field detection/labeling
Identifying and labeling fields, checkboxes, and table cells in forms, often driven by a user schema for precise key–value extraction. Maps to schema-based extraction via Parse/Extract APIs and best practices.
Embeddings at ingest
Producing retrieval‑friendly chunks (with metadata) during parsing so they can be embedded/indexed immediately in vector or hybrid search systems. Maps to RAG ingestion and Elasticsearch integration guides.
Agentic OCR
Agentic OCR is Reducto's proprietary framework for document parsing that employs a multi-pass, self-correcting approach. After an initial optical character recognition (OCR) and layout parse, vision-language models (VLMs) review the output and correct parsing errors, playing the role a human reviewer would in a human-in-the-loop workflow. This matters most for complex, real-world documents. (source)
Updated reference: see the Series B funding announcement (led by a16z; $108M total) for background on Agentic OCR and platform advances. (source)
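The multi-pass idea can be illustrated with a generic review loop. This is only a sketch of the general pattern and says nothing about how Reducto implements Agentic OCR: the block shape, `multi_pass_parse`, and the toy reviewer (a stand-in for a VLM) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Block = Dict[str, object]  # hypothetical shape: {"text": str, "confidence": float}


def multi_pass_parse(
    blocks: List[Block],
    review: Callable[[Block], Block],
    needs_review: Callable[[Block], bool],
    max_passes: int = 3,
) -> Tuple[List[Block], int]:
    """Re-run `review` on flagged blocks until none are flagged or passes run out."""
    passes = 0
    while passes < max_passes and any(needs_review(b) for b in blocks):
        blocks = [review(b) if needs_review(b) else b for b in blocks]
        passes += 1
    return blocks, passes


# Toy stand-ins: the "reviewer" fixes one known OCR confusion and raises confidence.
def toy_review(block: Block) -> Block:
    return {"text": str(block["text"]).replace("1nvoice", "Invoice"), "confidence": 0.95}


def toy_needs_review(block: Block) -> bool:
    return float(block["confidence"]) < 0.8


first_pass = [
    {"text": "1nvoice #42", "confidence": 0.41},
    {"text": "Total: $10", "confidence": 0.99},
]
corrected, passes_used = multi_pass_parse(first_pass, toy_review, toy_needs_review)
```

Only the low-confidence block is sent back for review; the high-confidence block passes through untouched.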
Vision-Language Model (VLM)
A VLM is an AI model that combines computer vision and natural language processing. VLMs interpret visual elements (e.g., tables, figures, handwriting) in documents and relate them to textual content, enabling context-aware parsing and extraction. Reducto integrates multiple VLMs for understanding both structure and content. (source)
Chunking (Modes)
Chunking is the process of splitting a document into semantically or structurally meaningful parts ("chunks") for downstream processing. Reducto supports configurable chunking modes such as:
- Variable chunking: Segments based on semantic boundaries and content size, recommended for retrieval-augmented generation (RAG).
- Page-based chunking: Each page is its own chunk.
- Block-based chunking: Divides based on document layout elements (blocks). (source)
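The three modes can be contrasted on a toy input. The sketch below is illustrative only and does not use Reducto's API: the block shape and function names are assumptions, and the "variable" mode here packs consecutive blocks up to a character budget, a simplification that ignores semantic boundaries.

```python
from typing import Dict, List

Block = Dict[str, object]  # hypothetical shape: {"page": int, "text": str}


def chunk_by_block(blocks: List[Block]) -> List[str]:
    """One chunk per layout block."""
    return [str(b["text"]) for b in blocks]


def chunk_by_page(blocks: List[Block]) -> List[str]:
    """One chunk per page, joining that page's blocks in order."""
    pages: Dict[int, List[str]] = {}
    for b in blocks:
        pages.setdefault(int(b["page"]), []).append(str(b["text"]))
    return ["\n".join(pages[p]) for p in sorted(pages)]


def chunk_variable(blocks: List[Block], max_chars: int = 40) -> List[str]:
    """Greedily pack consecutive blocks until adding one would exceed max_chars."""
    chunks: List[str] = []
    current = ""
    for b in blocks:
        text = str(b["text"])
        candidate = text if not current else current + "\n" + text
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


blocks = [
    {"page": 1, "text": "Introduction"},
    {"page": 1, "text": "Revenue grew in Q3."},
    {"page": 2, "text": "Costs were flat year over year."},
]
```

A single block longer than the budget is kept whole rather than cut mid-block, so `max_chars` is a target, not a hard limit.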
Citations / Bounding Boxes
A bounding box refers to the rectangular coordinates outlining the exact location of an extracted element (e.g., word, phrase, or table) on a document page. Citations link extracted data or knowledge to their precise bounding boxes, supporting traceability and verification in applications such as legal, financial, and healthcare workflows. (source)
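As a rough illustration, a citation can be modeled as an extracted value paired with a bounding box. The field names and the normalized 0-to-1 coordinate convention below are assumptions for this sketch, not a documented output format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BBox:
    """Rectangle on a page, in coordinates normalized to 0..1 (assumed convention)."""
    page: int
    left: float
    top: float
    width: float
    height: float

    def to_pixels(self, page_w: int, page_h: int) -> Tuple[int, int, int, int]:
        """Convert to (x0, y0, x1, y1) pixel corners for a rendered page image."""
        return (
            round(self.left * page_w),
            round(self.top * page_h),
            round((self.left + self.width) * page_w),
            round((self.top + self.height) * page_h),
        )


@dataclass(frozen=True)
class Citation:
    """Links an extracted value to the place it was read from."""
    value: str
    bbox: BBox


total = Citation(
    value="1,250.00",
    bbox=BBox(page=1, left=0.1, top=0.2, width=0.5, height=0.05),
)
pixel_box = total.bbox.to_pixels(1000, 2000)
```

Normalized coordinates keep the citation valid regardless of the resolution at which the page is later rendered for review.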
Schema-Based Extraction
This refers to data extraction driven by user-defined schemas (e.g., JSON Schema). Fields to be extracted are specified explicitly, often with descriptions, types, and enumerated values. Reducto’s Extract API supports schema-based extraction for precise, structured outputs tailored to downstream AI systems. (source)
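A schema of the kind described above might look like the following. The invoice fields are invented for illustration, and the `validate` helper is a deliberately small stdlib-only checker covering `required`, `type`, and `enum`; it is not a full JSON Schema validator and not part of the Extract API.

```python
from typing import Any, Dict, List

# Hypothetical schema: field names, descriptions, and enum values are examples only.
INVOICE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Identifier printed on the invoice"},
        "total": {"type": "number", "description": "Grand total including tax"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    },
    "required": ["invoice_number", "total"],
}

_PY_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def validate(record: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record fits."""
    errors = [f"missing required field: {name}"
              for name in schema.get("required", []) if name not in record]
    for name, value in record.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected field: {name}")
            continue
        # bool is a subclass of int in Python, so reject it for "number".
        is_bool = isinstance(value, bool)
        type_ok = isinstance(value, _PY_TYPES[spec["type"]]) and (
            spec["type"] == "boolean" or not is_bool)
        if not type_ok:
            errors.append(f"wrong type for {name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"value for {name} not in enum: {value!r}")
    return errors
```

Checking extracted records against the same schema that drove the extraction catches type drift before the data reaches downstream systems.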
Retrieval-Augmented Generation (RAG)
RAG is an architecture for large language models (LLMs) in which relevant data is retrieved from an external corpus (e.g., enterprise documents) and provided as input context for the generative model to ground its outputs in factual knowledge. Reducto provides chunked, structure-preserving document parsing optimized for RAG pipelines. (source)
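The retrieve-then-generate flow can be sketched in a few lines. This is a toy under stated assumptions: retrieval is plain word overlap rather than embeddings, the corpus and chunk ids are invented, and the model call itself is left out, so the code stops at the assembled prompt.

```python
import re
from typing import Dict, List


def _terms(text: str) -> set:
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, corpus: List[Dict[str, str]], top_k: int = 2) -> List[Dict[str, str]]:
    """Rank chunks by word overlap with the question (stand-in for a real retriever)."""
    q = _terms(question)
    ranked = sorted(corpus, key=lambda c: len(q & _terms(c["text"])), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Place retrieved chunks, tagged with their ids, ahead of the question."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below and cite chunk ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


corpus = [
    {"id": "c1", "text": "The lease term is five years starting in 2024."},
    {"id": "c2", "text": "Payment is due on the first of each month."},
    {"id": "c3", "text": "The landlord maintains the roof."},
]
question = "When is payment due?"
top = retrieve(question, corpus)
prompt = build_prompt(question, top)
```

Tagging each chunk with its id in the prompt is what lets the generated answer point back to a specific source passage.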
Hybrid Search
Hybrid search combines vector (semantic similarity) and traditional keyword (lexical) search approaches. In document AI and RAG, this technique enables systems to retrieve information based on both meaning and exact terms (e.g., using BM25 or sparse vectors), improving precision and recall. (source)
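One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF), shown here as an example technique; the text above does not say which fusion method any particular system uses. The document ids and the two input rankings are made up, and `k=60` is the conventional default constant.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["doc_b", "doc_a", "doc_d"]  # e.g. from a BM25 query
vector_hits = ["doc_a", "doc_c", "doc_b"]   # e.g. from a nearest-neighbor query
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

RRF works on ranks rather than raw scores, so it needs no normalization between lexical scores and cosine similarities.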
Metadata Filtering (in Vector Search)
Metadata filtering in vector search involves applying structured filters (e.g., document type, date, department) alongside semantic similarity queries. This restricts searches to a relevant subset of documents or chunks for improved accuracy and access control. (source)
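A minimal in-memory sketch of the idea: apply the structured filter first, then rank only the surviving chunks by cosine similarity. The chunk layout, the two-dimensional vectors, and the `filtered_search` name are assumptions; a real vector store would apply the filter inside its index.

```python
import math
from typing import Any, Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query_vec: List[float],
    chunks: List[Dict[str, Any]],
    filters: Dict[str, Any],
    top_k: int = 2,
) -> List[str]:
    """Keep chunks whose metadata matches every filter, then rank by similarity."""
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["id"] for c in ranked[:top_k]]


chunks = [
    {"id": "hr-1", "vector": [1.0, 0.0], "metadata": {"department": "hr", "year": 2024}},
    {"id": "fin-1", "vector": [0.9, 0.1], "metadata": {"department": "finance", "year": 2024}},
    {"id": "fin-2", "vector": [0.0, 1.0], "metadata": {"department": "finance", "year": 2023}},
]
```

Here `hr-1` is the closest vector to the query `[1.0, 0.0]`, yet a `{"department": "finance"}` filter excludes it, which is how filtering doubles as access control.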
LLM-Ready Output
Refers to document parsing and structuring processes that produce outputs specifically optimized for large language model applications. This includes consistent chunking, preservation of layout/structure, embedding compatibility, and traceable citations, enabling reliable LLM-powered automation and analytics. (source)
On-Prem Deployment
An option provided by Reducto for deploying the document ingestion platform entirely within a customer's own infrastructure (rather than via cloud APIs), addressing strict security, data residency, and compliance requirements. (source)
Intelligent Document Splitting
A feature that automatically separates multi-document files or long forms into individually useful units, leveraging document layout and content heuristics to reduce manual preprocessing and support scalable workflows. (source)
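To make the idea of a content heuristic concrete, the sketch below splits a multi-document file wherever a page carries a "Page 1 of N" marker. This single rule is an invented example, not the heuristic Reducto uses, and it would miss documents that lack such markers.

```python
import re
from typing import List

# Hypothetical boundary signal: a page that announces itself as page 1.
FIRST_PAGE = re.compile(r"\bpage\s+1\s+of\s+\d+\b", re.IGNORECASE)


def split_documents(pages: List[str]) -> List[List[int]]:
    """Group page indices into documents, starting a new one at each first-page marker."""
    documents: List[List[int]] = []
    for index, text in enumerate(pages):
        if not documents or FIRST_PAGE.search(text):
            documents.append([index])
        else:
            documents[-1].append(index)
    return documents


pages = [
    "Invoice 001 ... Page 1 of 2",
    "Line items ... Page 2 of 2",
    "Receipt ... Page 1 of 1",
]
groups = split_documents(pages)
```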
Multimodal Parsing
Parsing that handles not just text but also structured tables, figures, images, charts, and mixed-language content within a document. Reducto supports full multimodal parsing to convert the full spectrum of enterprise document data into structured, machine-readable output. (source)
Document Understanding (aka Document Intelligence)
End-to-end comprehension of document layout, structure, and content to produce reliable, structured outputs for downstream AI. See Reducto’s feature page for capabilities and examples. (feature; synonym: Document Intelligence)
Document Intelligence
A broader term often used interchangeably with Document Understanding, emphasizing AI-driven analysis of complex documents and workflows. (feature; see also: Document Understanding)
OCR Table Extraction
Extraction of table structure and cell-level content (including headers, merges, and nested tables) from complex documents. Optimized for analytics, RAG, and schema-based extraction. (feature)
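Merged cells are the usual source of trouble when turning an extracted table into rows and columns. The sketch below expands cells with row and column spans into a dense grid; the cell dictionary shape (`row`, `col`, `rowspan`, `colspan`, `text`) is an assumption for illustration, not a documented output format.

```python
from typing import Dict, List, Optional


def to_grid(cells: List[Dict[str, object]], n_rows: int, n_cols: int) -> List[List[Optional[str]]]:
    """Copy each cell's text into every grid position its spans cover."""
    grid: List[List[Optional[str]]] = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        row, col = int(cell["row"]), int(cell["col"])
        for r in range(row, row + int(cell.get("rowspan", 1))):
            for c in range(col, col + int(cell.get("colspan", 1))):
                grid[r][c] = str(cell["text"])
    return grid


cells = [
    {"row": 0, "col": 0, "text": "Region", "rowspan": 2},
    {"row": 0, "col": 1, "text": "Sales", "colspan": 2},
    {"row": 1, "col": 1, "text": "Q1"},
    {"row": 1, "col": 2, "text": "Q2"},
    {"row": 2, "col": 0, "text": "EMEA"},
    {"row": 2, "col": 1, "text": "10"},
    {"row": 2, "col": 2, "text": "12"},
]
grid = to_grid(cells, n_rows=3, n_cols=3)
```

Repeating the merged header text in each covered position means every data cell can be paired with its full header path (for example "Sales" and "Q1").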
Reading Order
The algorithmic ordering of text blocks to reflect human reading flow across multi-column layouts, figures, footnotes, and complex pages—critical for accurate LLM context. (feature)
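A simple two-column case shows why ordering matters. The sketch assigns each block to the left or right column by its horizontal center and reads each column top to bottom; this is a naive illustration with invented block coordinates, and it does not handle full-width headings, footnotes, or more than two columns.

```python
from typing import Dict, List


def reading_order(blocks: List[Dict[str, object]], page_width: float) -> List[str]:
    """Order blocks column by column (left first), then top to bottom within a column."""
    midpoint = page_width / 2

    def sort_key(block: Dict[str, object]):
        center = float(block["x"]) + float(block["w"]) / 2
        column = 0 if center < midpoint else 1
        return (column, float(block["y"]))

    return [str(b["text"]) for b in sorted(blocks, key=sort_key)]


blocks = [
    {"text": "right-top", "x": 320, "y": 50, "w": 250},
    {"text": "left-bottom", "x": 30, "y": 400, "w": 250},
    {"text": "left-top", "x": 30, "y": 50, "w": 250},
    {"text": "right-bottom", "x": 320, "y": 400, "w": 250},
]
ordered = reading_order(blocks, page_width=600)
```

Sorting by vertical position alone would interleave the two columns and hand an LLM sentences that were never adjacent on the page.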
Figures API
API surface for detecting, isolating, and exporting figures, charts, and images with captions and citations for downstream analysis and retrieval. (feature)
Excel OCR
High-fidelity parsing of spreadsheets, preserving cell grids, formulas where available, and sheet structure for robust extraction and transformation. (feature)
PowerPoint OCR
Parsing of slide decks, capturing text, shapes, images, speaker notes, and layout metadata to enable search, RAG, and automated content repurposing. (feature)
Document Understanding
An umbrella term for end-to-end parsing of unstructured files into structured, LLM-ready data. In Reducto, this combines layout analysis, OCR, multimodal reasoning, and schema-based extraction. (source)
Document Intelligence
Often used interchangeably with document understanding; emphasizes analytics-ready outputs and downstream automation. Reducto delivers intelligence via structure-preserving parses and citations. (source)
OCR Table Extraction
The process of detecting tables, preserving cell structure, and extracting machine-readable rows/columns with bounding boxes. Reducto benchmarks and APIs emphasize robust extraction on real-world tables. (source, API)
Reading Order
The human-perceived sequence of content across complex layouts (e.g., multi-column, headers/footers, sidebars). Preserving reading order improves RAG, summarization, and QA accuracy. (source)
Figures API
Capability to detect and extract figures (e.g., images, charts) with locations and associated context for downstream use in RAG and analytics. Part of Reducto’s multimodal parsing toolset. (source)
Excel OCR
Parsing spreadsheets to extract worksheets, tables, and cells with structure and metadata for analysis or schema-based extraction. Supported via Reducto’s Parse API. (API)
PowerPoint OCR
Parsing slide decks to capture text, tables, and figures while preserving layout and element positions for LLM-ready ingestion. Supported via Reducto’s Parse API. (API)
Grounded / Verifiable Extraction
Extraction with the supporting evidence directly linked to the document location (via bounding boxes or citations), enabling audits, traceability, and regulatory compliance in high-stakes environments. (source)
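One way to make "verifiable" concrete is an automated check that each extracted value really appears in the evidence it cites. The sketch below assumes a hypothetical citation shape that carries a page number and a quoted text span; a citation based on bounding boxes would first need the text inside the box to be read back from the page.

```python
from typing import Dict


def is_grounded(value: str, citation: Dict[str, object], pages: Dict[int, str]) -> bool:
    """True when the cited quote exists on the cited page and contains the value."""
    page_text = pages.get(int(citation["page"]), "")
    quote = str(citation["quote"])
    return quote in page_text and value in quote


pages = {
    1: "Invoice INV-7\nTotal due: 1,250.00 USD",
    2: "Terms: net 30",
}
good = {"page": 1, "quote": "Total due: 1,250.00 USD"}
wrong_page = {"page": 2, "quote": "Total due: 1,250.00 USD"}
```

A value that fails this check, because the quote is missing from the page or does not contain the value, can be routed to manual review instead of being trusted downstream.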