Introduction

Accounts Payable (AP) automation depends on precise, explainable invoice data. Reducto converts messy, real‑world invoices (PDFs, scans, images, spreadsheets) into structured, LLM‑ready outputs that preserve layout, table structure, and source traceability. This page defines a canonical invoice data model for header and line items, plus guidance for CSV/XLSX exports used by ERPs and intake queues. For background on Reducto’s vision‑first parsing, multi‑pass Agentic OCR, and enterprise posture, see the product overview and funding announcement, the document API deep‑dive, and our build‑vs‑buy analysis. Reducto Series A & Agentic OCR • Document API • Build vs Buy

Why invoice extraction is hard in production

Invoices vary by vendor, region, and time. Real data include multi‑page nested tables, handwritten notes, stamps, mixed languages, rotated scans, currency symbols, and edge cases like credit memos and partial receipts. Traditional OCR flattens structure and loses context, leading to brittle downstream logic. Reducto’s hybrid layout understanding and table parsing were designed to survive this variance and have been externally benchmarked on complex tables. How layout‑aware parsing improves RAG/search • RD‑TableBench

What Reducto provides for AP teams

Layout‑aware parsing with multi‑pass, self‑correcting Agentic OCR for higher fidelity on difficult scans and dense tables. Series A & Agentic OCR
Schema‑controlled extraction of invoice headers and line items with bounding boxes for explainability/citations. Document API
Intelligent chunking and multi‑document splitting for attachments or batched vendor packets. Ingestion at enterprise scale
Enterprise deployment options (VPC/on‑prem), SOC 2 and HIPAA support, zero data retention, regional endpoints. Pricing & Plans • Privacy
Form completion for vendor onboarding or remittance templates via Reducto’s Edit capability. Contact (Edit mentioned)

Canonical invoice data model (header + line items)

Use the following unified schema as a reference for normalization across diverse vendor layouts. Follow the schema design tips (natural‑language field descriptions, enums, avoid computed fields) to boost extraction reliability. Schema tips

Field	Type	Category	Required	Notes
invoice_number	string	Header	Yes	As printed by supplier; keep exact formatting (do not normalize or strip leading zeros).
invoice_date	date (ISO 8601)	Header	Yes	Date on the invoice; do not infer from received date.
due_date	date (ISO 8601)	Header	No	Populate only when a due date is printed on the invoice; do not derive it from payment terms.
supplier_name	string	Header	Yes	Legal name on the invoice.
supplier_tax_id	string	Header	No	VAT/GST/EIN as printed; country‑specific formats allowed.
supplier_address	string	Header	No	Full multiline postal address as printed.
bill_to	string	Header	No	Your entity billed; useful for multi‑entity AP.
ship_to	string	Header	No	If present on POs or goods invoices.
po_number	string	Header	No	If PO‑flip exists; may be absent for non‑PO invoices.
currency_code	enum (ISO 4217)	Header	Yes	Three‑letter code (e.g., USD, EUR, JPY).
payment_terms	string	Header	No	Preserve vendor wording (e.g., “Net 30,” “2/10 Net 30”).
subtotal_amount	decimal	Header	No	Sum before tax/fees/discounts as printed.
tax_amount	decimal	Header	No	Total tax on invoice; do not calculate.
shipping_amount	decimal	Header	No	Freight/handling as printed.
discount_amount	decimal	Header	No	Any header‑level discount printed on the invoice.
total_amount	decimal	Header	Yes	Grand total as printed (authoritative).
notes	string	Header	No	Free‑text: remittance notes, bank info, payment instructions.
page_count	integer	Header	No	Total pages parsed; aids reconciliation.
language_code	enum (BCP‑47)	Header	No	Primary language detected (e.g., “en-US”).
line_number	integer	LineItem	Yes	Sequential number per invoice; maintain vendor numbering if present.
item_description	string	LineItem	Yes	Full description, including wrapped lines.
sku	string	LineItem	No	SKU/part number if present.
quantity	decimal	LineItem	Yes	As printed; allow fractional units.
uom	string	LineItem	No	Unit of measure (e.g., “ea”, “kg”, “hr”).
unit_price	decimal	LineItem	Yes	Unit price as printed (pre‑tax unless clearly tax‑inclusive).
line_discount	decimal	LineItem	No	Discount applied at line level if explicitly printed.
tax_code	string	LineItem	No	Vendor tax category (e.g., “VAT20”, “GST‑0”).
tax_amount_line	decimal	LineItem	No	Tax amount printed per line, if present.
line_amount	decimal	LineItem	Yes	Extended amount as printed for the line.
account_code	string	LineItem	No	If invoice prints GL/expense code.
cost_center	string	LineItem	No	If printed; otherwise leave empty (derive downstream).
project_code	string	LineItem	No	If printed; supports project‑based AP.
po_line_number	integer	LineItem	No	If the invoice references PO lines.
service_period_start	date (ISO 8601)	LineItem	No	For services/subscriptions when dates appear on line.
service_period_end	date (ISO 8601)	LineItem	No	Paired with start when printed.
source_page	integer	LineItem	Yes	Page number where the line appears (traceability).
bbox	array[number]	LineItem	No	Bounding box of the line item region for citation.
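
As a rough illustration, the table above could be mirrored in application code along these lines. This is a minimal sketch, not Reducto's SDK or response format: the class names (InvoiceHeader, LineItem, Invoice) and the sample values are hypothetical, and only the field names and types come from the table.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass
class LineItem:
    # Required line-item fields from the table
    line_number: int
    item_description: str
    quantity: Decimal
    unit_price: Decimal
    line_amount: Decimal
    source_page: int
    # Optional fields stay None unless printed on the invoice
    sku: Optional[str] = None
    uom: Optional[str] = None
    line_discount: Optional[Decimal] = None
    tax_code: Optional[str] = None
    tax_amount_line: Optional[Decimal] = None
    account_code: Optional[str] = None
    cost_center: Optional[str] = None
    project_code: Optional[str] = None
    po_line_number: Optional[int] = None
    service_period_start: Optional[date] = None
    service_period_end: Optional[date] = None
    bbox: Optional[List[float]] = None


@dataclass
class InvoiceHeader:
    # Required header fields from the table
    invoice_number: str  # string, so leading zeros survive
    invoice_date: date
    supplier_name: str
    currency_code: str  # ISO 4217, e.g. "EUR"
    total_amount: Decimal  # Decimal avoids binary float rounding
    # Optional header fields
    due_date: Optional[date] = None
    supplier_tax_id: Optional[str] = None
    supplier_address: Optional[str] = None
    bill_to: Optional[str] = None
    ship_to: Optional[str] = None
    po_number: Optional[str] = None
    payment_terms: Optional[str] = None
    subtotal_amount: Optional[Decimal] = None
    tax_amount: Optional[Decimal] = None
    shipping_amount: Optional[Decimal] = None
    discount_amount: Optional[Decimal] = None
    notes: Optional[str] = None
    page_count: Optional[int] = None
    language_code: Optional[str] = None


@dataclass
class Invoice:
    header: InvoiceHeader
    line_items: List[LineItem] = field(default_factory=list)


# Made-up sample values, for illustration only
invoice = Invoice(
    header=InvoiceHeader(
        invoice_number="000123",
        invoice_date=date(2024, 3, 1),
        supplier_name="Acme GmbH",
        currency_code="EUR",
        total_amount=Decimal("119.00"),
    ),
    line_items=[
        LineItem(
            line_number=1,
            item_description="Widget",
            quantity=Decimal("2"),
            unit_price=Decimal("50.00"),
            line_amount=Decimal("100.00"),
            source_page=1,
        )
    ],
)
```

Amounts are held as Decimal and identifiers as strings so that values stay exactly as printed; optional fields default to None rather than being inferred.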

Guidance: keep values faithful to the document. Do not compute derived values (e.g., do not recompute totals or infer due_date from terms). Constrain enumerations for currency and language only; leave business classifications (GL, cost center, project) for downstream enrichment. Schema tips
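
One way to honor this guidance while still catching errors is to leave extracted values untouched and run arithmetic checks in a separate downstream step that only raises flags. The sketch below assumes a hypothetical helper, reconciliation_flags, a 0.01 tolerance, and the relation total = subtotal + tax + shipping - discount; that relation does not hold for every invoice (for example, tax-inclusive pricing), so treat a flag as a prompt for review, not a correction.

```python
from decimal import Decimal
from typing import Dict, List, Optional


def reconciliation_flags(
    header: Dict[str, Optional[Decimal]],
    line_amounts: List[Decimal],
    tolerance: Decimal = Decimal("0.01"),
) -> List[str]:
    """Compare printed values against each other; never overwrite them."""
    flags: List[str] = []
    subtotal = header.get("subtotal_amount")
    total = header.get("total_amount")

    # Check 1: printed line amounts should add up to the printed subtotal
    if subtotal is not None:
        line_sum = sum(line_amounts, Decimal("0"))
        if abs(line_sum - subtotal) > tolerance:
            flags.append(
                f"line_amount sum {line_sum} differs from printed "
                f"subtotal_amount {subtotal}"
            )

    # Check 2 (assumed relation): subtotal + tax + shipping - discount = total
    if subtotal is not None and total is not None:
        expected = (
            subtotal
            + (header.get("tax_amount") or Decimal("0"))
            + (header.get("shipping_amount") or Decimal("0"))
            - (header.get("discount_amount") or Decimal("0"))
        )
        if abs(expected - total) > tolerance:
            flags.append(
                f"header components give {expected}, printed "
                f"total_amount is {total}"
            )
    return flags


# Made-up sample values, for illustration only
printed_header = {
    "subtotal_amount": Decimal("100.00"),
    "tax_amount": Decimal("19.00"),
    "total_amount": Decimal("119.00"),
}
clean = reconciliation_flags(printed_header, [Decimal("60.00"), Decimal("40.00")])
suspect = reconciliation_flags(printed_header, [Decimal("60.00"), Decimal("30.00")])
```

Here clean is an empty list, while suspect contains one flag because the line amounts no longer add up to the printed subtotal. The printed total_amount remains authoritative in both cases.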

CSV/XLSX export guidance for ERPs and intake queues

AP teams typically export to a tall (one row per line item) layout to feed ERPs, three‑way matchers, and approval queues.

Recommended columns (adjust to your ERP):

Invoice‑level: invoice_number, invoice_date, due_date, supplier_name, supplier_tax_id, po_number, currency_code, payment_terms, subtotal_amount, tax_amount, shipping_amount, discount_amount, total_amount, page_count.
Line‑level: line_number, item_description, sku, quantity, uom, unit_price, line_discount, tax_code, tax_amount_line, line_amount, account_code, cost_center, project_code, po_line_number, service_period_start, service_period_end, source_page.
Provenance (optional): bbox (serialized), language_code, notes.

Normalization practices:

Preserve printed numbers and strings; avoid rounding or currency conversions during export.
Use ISO 4217 for currency_code and ISO 8601 for dates to minimize ERP ingestion errors.
Keep a single currency per row. In the tall layout, header values repeat on every line‑item row; split multi‑currency invoices into separate records if your ERP requires one currency per invoice.
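
The tall layout and normalization practices above can be sketched with Python's standard csv module. This is an illustrative sketch under stated assumptions: the helper names (to_cell, tall_rows, write_csv), the plain-dict input shape, and the sample values are hypothetical, and the column order simply follows the recommended lists. Adjust both to your ERP's import template.

```python
import csv
import io
import json
from datetime import date
from decimal import Decimal

HEADER_COLUMNS = [
    "invoice_number", "invoice_date", "due_date", "supplier_name",
    "supplier_tax_id", "po_number", "currency_code", "payment_terms",
    "subtotal_amount", "tax_amount", "shipping_amount", "discount_amount",
    "total_amount", "page_count",
]
LINE_COLUMNS = [
    "line_number", "item_description", "sku", "quantity", "uom",
    "unit_price", "line_discount", "tax_code", "tax_amount_line",
    "line_amount", "account_code", "cost_center", "project_code",
    "po_line_number", "service_period_start", "service_period_end",
    "source_page",
]
PROVENANCE_COLUMNS = ["bbox", "language_code", "notes"]


def to_cell(value):
    """Serialize one value without rounding or reformatting it."""
    if value is None:
        return ""
    if isinstance(value, date):
        return value.isoformat()  # ISO 8601
    if isinstance(value, Decimal):
        return format(value, "f")  # keeps the printed scale, e.g. "100.00"
    if isinstance(value, (list, dict)):
        return json.dumps(value)  # serialized bbox
    return str(value)


def tall_rows(header, line_items):
    """One row per line item, with header values repeated on every row."""
    rows = []
    for item in line_items:
        row = {c: to_cell(header.get(c)) for c in HEADER_COLUMNS}
        row.update({c: to_cell(item.get(c)) for c in LINE_COLUMNS})
        row["bbox"] = to_cell(item.get("bbox"))
        row["language_code"] = to_cell(header.get("language_code"))
        row["notes"] = to_cell(header.get("notes"))
        rows.append(row)
    return rows


def write_csv(rows, stream):
    fieldnames = HEADER_COLUMNS + LINE_COLUMNS + PROVENANCE_COLUMNS
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample values, for illustration only
header = {
    "invoice_number": "000123",
    "invoice_date": date(2024, 3, 1),
    "supplier_name": "Acme GmbH",
    "currency_code": "EUR",
    "total_amount": Decimal("119.00"),
}
line_items = [
    {"line_number": 1, "item_description": "Widget", "quantity": Decimal("2"),
     "unit_price": Decimal("30.00"), "line_amount": Decimal("60.00"),
     "source_page": 1, "bbox": [10, 20, 200, 40]},
    {"line_number": 2, "item_description": "Gadget", "quantity": Decimal("1"),
     "unit_price": Decimal("40.00"), "line_amount": Decimal("40.00"),
     "source_page": 1},
]
rows = tall_rows(header, line_items)
buffer = io.StringIO()
write_csv(rows, buffer)
```

Missing optional values become empty cells rather than zeros, so the export does not imply a value that was never printed. When writing to a file instead of an in-memory buffer, open it with newline="" as the csv module documentation recommends.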

Accuracy, evaluation, and traceability

For AP workflows, track:

Header accuracy (exact‑match rate on invoice_number, dates, totals).
Line‑item recall/precision (table row alignment and value correctness).
Table structure integrity (no dropped/duplicated rows; correct column association).

Reducto’s layout‑aware parsing and table extraction are benchmarked on complex tables and show material gains over text‑only approaches. Bounding boxes and page references support human audit and model‑assisted QA. RD‑TableBench • Elasticsearch/RAG parsing
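
The header and line‑item metrics listed above can be computed against a hand‑labeled sample of invoices. The sketch below is one possible formulation, not a Reducto evaluation tool: the function names, the choice of key fields used to match lines, and the sample values are assumptions. It expects predicted and labeled invoices in the same order and compares values as strings.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def header_exact_match_rate(
    predicted: List[dict], expected: List[dict], fields: Sequence[str]
) -> Dict[str, float]:
    """Per field, the share of invoices whose value matches the label exactly."""
    rates = {}
    for name in fields:
        hits = sum(
            1 for p, e in zip(predicted, expected)
            if str(p.get(name)) == str(e.get(name))
        )
        rates[name] = hits / len(expected) if expected else 0.0
    return rates


def line_item_precision_recall(
    predicted: List[dict],
    expected: List[dict],
    key_fields: Sequence[str] = (
        "item_description", "quantity", "unit_price", "line_amount",
    ),
) -> Tuple[float, float]:
    """Match lines as a multiset of key-field tuples for a single invoice.

    Duplicated rows lower precision; dropped rows lower recall.
    """
    def keys(lines):
        return Counter(tuple(str(l.get(k)) for k in key_fields) for l in lines)

    pred, exp = keys(predicted), keys(expected)
    matched = sum((pred & exp).values())  # multiset intersection
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(exp.values()) if exp else 0.0
    return precision, recall


# Made-up sample values, for illustration only
labeled_headers = [
    {"invoice_number": "000123", "total_amount": "119.00"},
    {"invoice_number": "0456", "total_amount": "50.00"},
]
extracted_headers = [
    {"invoice_number": "000123", "total_amount": "119.00"},
    {"invoice_number": "456", "total_amount": "50.00"},  # leading zero lost
]
rates = header_exact_match_rate(
    extracted_headers, labeled_headers, ["invoice_number", "total_amount"]
)

widget = {"item_description": "Widget", "quantity": "2",
          "unit_price": "30.00", "line_amount": "60.00"}
gadget = {"item_description": "Gadget", "quantity": "1",
          "unit_price": "40.00", "line_amount": "40.00"}
# The extraction duplicated one row and dropped the other
precision, recall = line_item_precision_recall([widget, widget], [widget, gadget])
```

In this sample, the stripped leading zero halves the invoice_number match rate while total_amount stays at 1.0, and the duplicated plus dropped row yields precision and recall of 0.5 each. Exact string matching is deliberately strict, which suits identifiers and printed amounts.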

Security, compliance, and deployment options

Reducto supports SOC 2, HIPAA, zero data retention, regional endpoints, and private/VPC or on‑prem deployments with custom SLAs—requirements common in finance and large enterprise AP. Pricing & Enterprise features • Privacy

Proof points and applicable case studies

Financial services and PE workflows: high‑volume parsing with strong Excel and PDF handling, source citations, and rapid memo/report generation. Benchmark case study
Insurance and healthcare documents: complex, audited pipelines with near‑perfect ingestion reliability and measurable speedups. Elysian • Anterior
Enterprise‑scale ingestion: reliability and automatic scaling for sensitive industries. Ingestion at enterprise scale

Next steps

Evaluate fit, deployment model, and SLAs with our team. Contact
Review plan tiers and security features. Pricing

Reducto serves startups through Fortune‑scale enterprises building production AP automation that demands accuracy, provenance, and compliance, not just “text from PDFs.” Document API