Insurance Claims Ingestion (CMS‑1500/UB‑04)

Introduction

Health and commercial insurers process diverse, messy claim packets where accuracy directly affects adjudication and compliance. Forms such as CMS-1500 and UB-04 include dense codes, handwriting, and checkboxes that routinely break traditional OCR. Reducto's vision-first, multi-pass pipeline, which combines computer vision, vision-language models, and Agentic OCR, converts these claims into structured, LLM-ready data with layout fidelity and traceability for downstream automation and review. Claims packets can run to thousands of pages, with individual submissions regularly exceeding 5,400 pages, making reliable automated ingestion essential. For background on complexity and error costs, read the claims extraction guide.

Claim form families and document types

  • CMS-1500 (professional/outpatient claims): Dense, multi-section forms with identifiers, diagnosis and procedure coding, and numerous checkboxes. Details in the claims extraction guide.

  • UB-04 (institutional/inpatient/emergency claims): Line-level detail, varied attachments, and handwriting or scanned artifacts. Overview and approach.

  • Other routinely co-ingested artifacts: Pharmacy claim records (e.g., NCPDP), clinical notes, EOBs, prior authorization packets, and supporting PDFs and images. Reducto preserves layout, reading order, and table structure across all document types. Platform capabilities.

How Reducto parses claims at production scale

Reducto applies a layered pipeline to handle the full range of quality, formatting, and complexity found in real-world claim submissions.

  • Vision-first segmentation detects blocks, tables, figures, handwriting, and form widgets while maintaining coordinates for citation and audit trails. Parsing pipeline.

  • Agentic OCR performs multi-pass, self-reviewing optical character recognition that automatically corrects common failure modes on scans and low-quality uploads. Series A product update.

  • VLM enrichment preserves structure for multi-column layouts, tables, and mixed handwriting and printed text, outputting LLM-ready chunks with layout labels and bounding boxes. Elasticsearch and RAG guide.

  • Extraction to schema emits targeted fields as clean structured data for adjudication systems, analytics, or RAG agents, and integrates with vector stores and data lakes. Schema design tips. Databricks integration.

  • Write-in-document (optional): The Edit capability allows agents to fill forms (fields, table cells, checkboxes) programmatically when workflows require completion, not just reading. Contact us to discuss Edit.
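
To make the idea of layout-labelled, coordinate-bearing output concrete, here is a minimal Python sketch of how a downstream system might hold parsed blocks and turn them into audit citations. The `BBox` and `Block` types, their field names, and the helper functions are hypothetical illustrations, not Reducto's actual response format or SDK. The top-to-bottom sort is a deliberate simplification and would not handle multi-column layouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical bounding box, in page-relative units (0.0 to 1.0)."""
    page: int
    left: float
    top: float
    width: float
    height: float


@dataclass(frozen=True)
class Block:
    """Hypothetical parsed block: a layout label, its text, and where it sits."""
    label: str        # e.g. "text", "table", "checkbox"
    text: str
    bbox: BBox
    confidence: float


def cite(block: Block) -> str:
    """Render a compact, human-readable citation for a UI or audit log."""
    b = block.bbox
    return (
        f"p{b.page} [{b.left:.2f},{b.top:.2f},{b.width:.2f},{b.height:.2f}] "
        f"{block.label}"
    )


def reading_order(blocks: list[Block]) -> list[Block]:
    """Naive ordering: by page, then top edge, then left edge."""
    return sorted(blocks, key=lambda blk: (blk.bbox.page, blk.bbox.top, blk.bbox.left))
```

Keeping the coordinates on every block, rather than only on the final extracted fields, is what lets a reviewer jump from a structured value back to the exact region of the scanned page.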

Checkbox and radio capture

Checkboxes and radio buttons are among the most error-prone elements in claim forms. Faint pencil marks, noisy scans, and inconsistent form design make reliable state detection a critical differentiator.

  • Widget detection: Checkboxes and radio buttons are identified during layout segmentation. Each element is returned with bounding-box coordinates and confidence scores.

  • State inference: Agentic OCR and VLM cross-check local marks, surrounding labels, and region intensity to determine checked or unchecked status, even on faint pencil marks or noisy scans. Technical approach.

  • Disambiguation: If multiple marks are present for a single radio group or the signal is ambiguous, Reducto flags the ambiguity alongside confidence scores so downstream review queues can prioritize those cases.

  • Traceability: Sentence- and region-level coordinates support verifiable citations in UI and audit logs for regulated workflows. Anterior used box-level granularity in their pipeline. Elysian required bounding boxes for audits.
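
The disambiguation behaviour described above can be mirrored on the consumer side. The sketch below shows one way a review pipeline might collapse the detected options of a single radio group into a value plus flags. The `Widget` type, the flag names, and the 0.9 threshold are assumptions chosen for illustration, not values or structures documented by Reducto.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Widget:
    """Hypothetical detected radio option within one group."""
    group: str
    option: str
    checked: bool
    confidence: float


def resolve_radio_group(widgets: list[Widget], min_confidence: float = 0.9) -> dict:
    """Collapse one radio group into a single value, flagging anything ambiguous.

    min_confidence is an illustrative threshold; tune it against labelled data.
    """
    checked = [w for w in widgets if w.checked]
    flags = []
    if len(checked) > 1:
        flags.append("multiple_marks")
    if not checked:
        flags.append("no_mark")
    if any(w.confidence < min_confidence for w in widgets):
        flags.append("low_confidence")
    # Only commit to a value when exactly one option is marked.
    value = checked[0].option if len(checked) == 1 else None
    return {"value": value, "flags": flags, "needs_review": bool(flags)}
```

A group with exactly one confident mark passes straight through, while conflicting or faint marks carry flags that a review queue can sort on.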

Security, deployment, and compliance

Reducto is built for regulated industries where data handling requirements are non-negotiable.

  • HIPAA and SOC 2 Type II: Enterprise-grade security controls suitable for PHI. Business Associate Agreements (BAAs) are available for all customers handling protected health information. Pricing and enterprise features.

  • Zero data retention and on-prem/VPC options: Deploy within customer infrastructure to meet strict regulatory and data residency requirements. Contact us.

  • Reliability: 99.9%+ uptime with automatic scaling, plus white-glove onboarding and SLAs for critical claim flows. RAG at enterprise scale.

Proven outcomes in insurance and healthcare

  • Anterior (prior authorization): Processed 20,000+ clinical documents with 95% completed within a 1-minute SLA. Achieved 99.24% accuracy with less than 0.1% ingestion-attributable flaws. Read the Anterior case study.

  • Elysian (commercial claims TPA): Rigorous auditability with bounding-box citations enabled qualitative claim review up to 16x faster than traditional methods. Read the Elysian case study.

Implementation checklist

  • Define a claims schema with explicit field descriptions and enums where appropriate. Best practices.

  • Configure chunking to preserve form section boundaries and keep table rows intact for retrieval and QA. RAG and search guidance.

  • Set confidence thresholds and design conflict and ambiguity flags for checkbox and radio fields to drive review queues.

  • Land structured outputs directly into your lakehouse or warehouse. Databricks integration guide.

  • For end-to-end productionization or on-prem/VPC deployment, talk to our team.
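
As an illustration of the first checklist item, the following Python sketch defines a small claims schema in JSON Schema style, with explicit descriptions and an enum, plus a hand-written check for extracted records. The field names, the enum values, and the `check_record` helper are hypothetical; confirm the real field set and allowed values against the form version you ingest, and against Reducto's own schema guidance.

```python
# Illustrative schema only: field names and enum values are assumptions.
CLAIM_SCHEMA = {
    "type": "object",
    "properties": {
        "insurance_type": {
            "type": "string",
            "description": "Coverage type marked at the top of the claim form.",
            "enum": ["Medicare", "Medicaid", "Group Health Plan", "Other"],
        },
        "patient_name": {
            "type": "string",
            "description": "Patient's full name as written on the form.",
        },
        "total_charge": {
            "type": "number",
            "description": "Sum of all line-item charges, in dollars.",
        },
    },
    "required": ["insurance_type", "patient_name"],
}


def check_record(record: dict, schema: dict = CLAIM_SCHEMA) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for name in schema["required"]:
        if name not in record:
            problems.append(f"missing: {name}")
    for name, value in record.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected: {name}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"bad enum: {name}")
    return problems
```

Explicit descriptions give the extraction step unambiguous targets, and enums turn free-text drift into a checkable error that can feed the same review queue as low-confidence checkbox fields.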