Introduction

Health and commercial insurers process diverse, messy claim packets in which extraction accuracy directly affects adjudication and compliance. Forms such as CMS‑1500 and UB‑04 include dense codes, handwriting, and checkboxes that routinely break traditional OCR. Reducto’s vision‑first, multi‑pass pipeline (computer vision + VLMs + Agentic OCR) converts these claims into structured, LLM‑ready data with layout fidelity and traceability for downstream automation and review. For background on claim complexity and error costs, see our healthcare claims overview and the claims extraction guide.

Claim form families and document types

CMS‑1500 (professional/outpatient claims): dense, multi‑section forms with identifiers, diagnosis/procedure coding, and numerous checkboxes. Details in our claims extraction guide.
UB‑04 (institutional/inpatient/emergency claims): line‑level detail, varied attachments, and handwriting/scanned artifacts. Overview and approach.
Other routinely co‑ingested artifacts: pharmacy claim records (e.g., NCPDP), clinical notes, EOBs, prior auth packets, and supporting PDFs/images. Reducto preserves layout, reading order, and table structure across all of these document types. Platform capabilities.

How Reducto parses claims at production scale

Vision‑first segmentation: detect blocks, tables, figures, handwriting, and form widgets; maintain coordinates for citation and audit trails. Parsing pipeline.
Agentic OCR: multi‑pass, self‑reviewing OCR that automatically corrects common failure modes on scans and low‑quality uploads. Series A product update.
VLM enrichment: structure preservation for multi‑column layouts, tables, and mixed handwriting/printed text; outputs LLM‑ready chunks with layout labels and bounding boxes. Elasticsearch/RAG guide.
Extraction to schema: targeted fields emitted as clean JSON for adjudication systems, analytics, or RAG agents; integrates with vector stores and data lakes. Schema design tips. • Databricks integration.
Write‑in‑document (optional): the Edit endpoint allows agents to fill forms (fields, table cells, checkboxes) programmatically when workflows require completion, not just reading. Contact us for Edit access.

Checkbox and radio capture

Widget detection: checkboxes and radio buttons are identified during layout segmentation; each element is returned with bounding box coordinates and confidence.
State inference: Agentic OCR/VLM cross‑checks local marks, surrounding labels, and region intensity to determine checked/unchecked status even on faint pencil marks or noisy scans. Technical approach.
Disambiguation: if multiple marks are present for a single radio group or the signal is ambiguous, Reducto emits explicit conflict flags alongside confidences for downstream review queues.
Traceability: sentence‑ and region‑level coordinates support verifiable citations in the UI and in audit logs for regulated workflows. Anterior used box‑level granularity; Elysian required bounding boxes for audits.
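The disambiguation step above can be sketched in a few lines of Python. This is a minimal illustration, not Reducto's implementation: the Widget record, the flag_conflicts function, the 0.8 threshold, and the flag dictionary layout are all assumptions made for the example, and the sample labels are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Widget:
    """Hypothetical record for one detected checkbox or radio button."""
    label: str
    group: Optional[str]  # radio group name; None for a standalone checkbox
    state: str            # "checked" or "unchecked"
    confidence: float
    bbox: Tuple[float, float, float, float]


def flag_conflicts(widgets, min_conf=0.8):
    """Return review flags for low-confidence widgets and multi-marked radio groups."""
    flags = []
    groups = defaultdict(list)
    for w in widgets:
        # Any widget below the (assumed) threshold goes to human review.
        if w.confidence < min_conf:
            flags.append({"group": w.group, "labels": [w.label], "reason": "low_confidence"})
        if w.group is not None:
            groups[w.group].append(w)
    for name, members in groups.items():
        checked = [w.label for w in members if w.state == "checked"]
        # A radio group should have at most one checked option.
        if len(checked) > 1:
            flags.append({"group": name, "labels": checked, "reason": "multiple_marks"})
    return flags
```

A review queue could then be driven by the returned flags, with each flag pointing back to the widget labels (and, through them, the bounding boxes) that need a human decision.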

Example extraction schema (claims)

The following example illustrates a compact, production‑oriented schema pattern that preserves auditability while remaining LLM‑friendly.

Field	Type	Description	Applies to
claim_id	string	System or form‑derived identifier used to join attachments and adjudication records	CMS‑1500, UB‑04
member_id	string	Payer/member identifier captured from form header	CMS‑1500, UB‑04
patient_name	object	{first, last, middle} with per‑token coordinates for citation	CMS‑1500, UB‑04
provider_identifiers	object	NPI/TIN and facility/provider names with bounding boxes	CMS‑1500, UB‑04
dates_of_service	array	One or more service date ranges with normalized ISO‑8601	CMS‑1500, UB‑04
diagnosis_codes	array	Parsed codes as strings with optional code system metadata	CMS‑1500, UB‑04
procedures	array	Line‑level objects:	CMS‑1500, UB‑04
place_of_service	string	Normalized value with original token span for traceability	CMS‑1500
checkbox_states	array	{label, group, state, confidence, bbox[]} per checkbox/radio	CMS‑1500, UB‑04
totals	object	{charges, paid?, adjustments?} with numeric normalization	CMS‑1500, UB‑04
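As a rough sketch, part of the table above could be written down as a JSON Schema style definition held in a Python dictionary. The JSON Schema layout, the nested property names (such as start and end), and the wording of the descriptions are assumptions made for illustration; the source does not specify the exact schema format Reducto accepts, and the line-level fields of procedures are left open because the table does not list them.

```python
import json

# Assumed bounding-box shape: a flat list of numbers. The coordinate
# convention is not specified in the table above.
BBOX = {"type": "array", "items": {"type": "number"}}

CLAIM_SCHEMA = {
    "type": "object",
    "properties": {
        "claim_id": {
            "type": "string",
            "description": "System or form-derived identifier used to join "
                           "attachments and adjudication records",
        },
        "member_id": {
            "type": "string",
            "description": "Payer/member identifier captured from the form header",
        },
        "dates_of_service": {
            "type": "array",
            "description": "One or more service date ranges, normalized to ISO-8601",
            "items": {
                "type": "object",
                "properties": {
                    "start": {"type": "string", "format": "date"},
                    "end": {"type": "string", "format": "date"},
                },
            },
        },
        "diagnosis_codes": {
            "type": "array",
            "description": "Parsed diagnosis codes as strings",
            "items": {"type": "string"},
        },
        # Line-level fields are not enumerated in the table, so items stay generic.
        "procedures": {"type": "array", "items": {"type": "object"}},
        "checkbox_states": {
            "type": "array",
            "description": "One entry per detected checkbox or radio button",
            "items": {
                "type": "object",
                "properties": {
                    "label": {"type": "string"},
                    "group": {"type": "string"},
                    "state": {"type": "string", "enum": ["checked", "unchecked"]},
                    "confidence": {"type": "number"},
                    "bbox": BBOX,
                },
            },
        },
    },
}


def schema_as_json() -> str:
    """Serialize the schema for submission or version control."""
    return json.dumps(CLAIM_SCHEMA, indent=2)
```

Keeping the schema in one dictionary makes it easy to review field descriptions alongside their types and to diff schema changes over time.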

Notes

Use descriptive keys and natural‑language field descriptions to reduce ambiguity and improve extraction reliability. Schema tips.
Preserve coordinates (per field or per token) to enable human‑verifiable audits and targeted UI highlights. Case studies below.
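To illustrate how per-token coordinates can feed a single UI highlight, the sketch below merges token boxes into one field-level box. The BBox type, the union_bbox helper, and the coordinate convention (page number plus left, top, right, bottom, with the origin at the top left) are assumptions for this example rather than Reducto's output format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BBox:
    """Hypothetical bounding box; origin assumed at the top-left of the page."""
    page: int
    left: float
    top: float
    right: float
    bottom: float


def union_bbox(boxes: Sequence[BBox]) -> BBox:
    """Return the smallest box that encloses every token box on one page."""
    if not boxes:
        raise ValueError("at least one box is required")
    if len({b.page for b in boxes}) != 1:
        # A single highlight cannot span pages; callers should group by page first.
        raise ValueError("all boxes must be on the same page")
    return BBox(
        page=boxes[0].page,
        left=min(b.left for b in boxes),
        top=min(b.top for b in boxes),
        right=max(b.right for b in boxes),
        bottom=max(b.bottom for b in boxes),
    )
```

Storing the token boxes and deriving the field box on demand keeps the finest granularity available for audits while still giving reviewers one rectangle to look at.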

Security, deployment, and compliance

HIPAA and SOC 2: enterprise‑grade security controls suitable for PHI; Business Associate Agreements available. Pricing and enterprise features.
Zero data retention and on‑prem/VPC options: deploy within customer infrastructure to meet strict regulatory and data residency requirements. Contact us.
Reliability: 99.9%+ uptime with automatic scaling; white‑glove onboarding and SLAs for critical claim flows. RAG at enterprise scale.

Proven outcomes in insurance and healthcare

Anterior (prior authorization): processed 20,000+ clinical documents; 95% completed within a 1‑minute SLA; 99.24% accuracy with <0.1% ingestion‑attributable flaws. Read the Anterior case study.
Elysian (commercial claims TPA): rigorous auditability with bounding‑box citations; qualitative claim review up to 16× faster than traditional methods. Read the Elysian case study.

Implementation checklist

Define a claims schema with explicit field descriptions and enums where appropriate. Keep derived fields (values computed from other fields) out of the extraction schema and compute them downstream. Best practices.
Configure chunking to preserve form section boundaries; keep table rows intact for retrieval and QA. RAG/search guidance.
Set confidence thresholds and conflict flags for checkbox/radio fields to drive review queues.
Land structured outputs directly into your lakehouse or warehouse; see our Databricks walkthrough. Integration guide.
For end‑to‑end productionization or on‑prem/VPC deployment, engage our team. Talk to sales.
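The chunking item in the checklist can be sketched as follows. This is a simplified illustration under stated assumptions: the input is a list of already-parsed blocks, each a dictionary with hypothetical section and text keys, where a table row arrives as a single block. The chunk_blocks function and the max_chars budget are not part of any Reducto API.

```python
def chunk_blocks(blocks, max_chars=1200):
    """Group parsed blocks into chunks without crossing form-section
    boundaries and without splitting any single block (e.g. a table row)."""
    chunks = []
    current = None
    for block in blocks:
        starts_new = (
            current is None
            or block["section"] != current["section"]
            # +1 accounts for the newline that joins blocks inside a chunk.
            or len(current["text"]) + len(block["text"]) + 1 > max_chars
        )
        if starts_new:
            # An oversized block still becomes its own chunk; it is never cut.
            current = {"section": block["section"], "text": block["text"]}
            chunks.append(current)
        else:
            current["text"] += "\n" + block["text"]
    return chunks
```

Because blocks are treated as atomic, every table row lands whole in exactly one chunk, which keeps retrieval results and QA checks aligned with the original form lines.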