Introduction
Accounts Payable (AP) automation depends on precise, explainable invoice data. Reducto converts messy, real‑world invoices (PDFs, scans, images, spreadsheets) into structured, LLM‑ready outputs that preserve layout, table structure, and source traceability. This page defines a canonical invoice data model for header and line items, plus guidance for CSV/XLSX exports used by ERPs and intake queues. For background on Reducto’s vision‑first parsing, multi‑pass Agentic OCR, and enterprise posture, see the product overview and funding announcement, the document API deep‑dive, and our build‑vs‑buy analysis. Reducto Series A & Agentic OCR • Document API • Build vs Buy
Why invoice extraction is hard in production
Invoices vary by vendor, region, and time. Real data include multi‑page nested tables, handwritten notes, stamps, mixed languages, rotated scans, currency symbols, and edge cases like credit memos and partial receipts. Traditional OCR flattens structure and loses context, leading to brittle downstream logic. Reducto’s hybrid layout understanding and table parsing were designed to survive this variance and have been externally benchmarked on complex tables. How layout‑aware parsing improves RAG/search • RD‑TableBench
What Reducto provides for AP teams
- Layout‑aware parsing with multi‑pass, self‑correcting Agentic OCR for higher fidelity on difficult scans and dense tables. Series A & Agentic OCR
- Schema‑controlled extraction of invoice headers and line items with bounding boxes for explainability/citations. Document API
- Intelligent chunking and multi‑document splitting for attachments or batched vendor packets. Ingestion at enterprise scale
- Enterprise deployment options (VPC/on‑prem), SOC 2 and HIPAA support, zero data retention, regional endpoints. Pricing & Plans • Privacy
- Form completion for vendor onboarding or remittance templates via Reducto’s Edit capability. Contact
Canonical invoice data model (header + line items)
Use the following unified schema as a reference for normalization across diverse vendor layouts. Follow the schema design tips (natural‑language field descriptions, enums, avoid computed fields) to boost extraction reliability. Schema tips
| Field | Type | Category | Required | Notes |
|---|---|---|---|---|
| invoice_number | string | Header | Yes | As printed by supplier; keep exact formatting (do not normalize or strip leading zeros). |
| invoice_date | date (ISO 8601) | Header | Yes | Date on the invoice; do not infer from received date. |
| due_date | date (ISO 8601) | Header | No | Populate only if a due date is explicitly printed; do not derive it from payment terms. |
| supplier_name | string | Header | Yes | Legal name on the invoice. |
| supplier_tax_id | string | Header | No | VAT/GST/EIN as printed; country‑specific formats allowed. |
| supplier_address | string | Header | No | Full multiline postal address as printed. |
| bill_to | string | Header | No | Your entity billed; useful for multi‑entity AP. |
| ship_to | string | Header | No | Ship‑to address if printed; most common on goods invoices. |
| po_number | string | Header | No | Purchase order number if printed (e.g., PO‑flip invoices); absent on non‑PO invoices. |
| currency_code | enum (ISO 4217) | Header | Yes | Three‑letter code (e.g., USD, EUR, JPY). |
| payment_terms | string | Header | No | Preserve vendor wording (e.g., “Net 30,” “2/10 Net 30”). |
| subtotal_amount | decimal | Header | No | Sum before tax/fees/discounts as printed. |
| tax_amount | decimal | Header | No | Total tax on invoice; do not calculate. |
| shipping_amount | decimal | Header | No | Freight/handling as printed. |
| discount_amount | decimal | Header | No | Any header‑level discount printed on the invoice. |
| total_amount | decimal | Header | Yes | Grand total as printed (authoritative). |
| notes | string | Header | No | Free‑text: remittance notes, bank info, payment instructions. |
| page_count | integer | Header | No | Total pages parsed; aids reconciliation. |
| language_code | enum (BCP‑47) | Header | No | Primary language detected (e.g., “en-US”). |
| line_number | integer | LineItem | Yes | Sequential number per invoice; maintain vendor numbering if present. |
| item_description | string | LineItem | Yes | Full description, including wrapped lines. |
| sku | string | LineItem | No | SKU/part number if present. |
| quantity | decimal | LineItem | Yes | As printed; allow fractional units. |
| uom | string | LineItem | No | Unit of measure (e.g., “ea”, “kg”, “hr”). |
| unit_price | decimal | LineItem | Yes | Unit price as printed (pre‑tax unless clearly tax‑inclusive). |
| line_discount | decimal | LineItem | No | Discount applied at line level if explicitly printed. |
| tax_code | string | LineItem | No | Vendor tax category (e.g., “VAT20”, “GST‑0”). |
| tax_amount_line | decimal | LineItem | No | Tax amount printed per line, if present. |
| line_amount | decimal | LineItem | Yes | Extended amount as printed for the line. |
| account_code | string | LineItem | No | If invoice prints GL/expense code. |
| cost_center | string | LineItem | No | If printed; otherwise leave empty (derive downstream). |
| project_code | string | LineItem | No | If printed; supports project‑based AP. |
| po_line_number | integer | LineItem | No | If the invoice references PO lines. |
| service_period_start | date (ISO 8601) | LineItem | No | For services/subscriptions when dates appear on line. |
| service_period_end | date (ISO 8601) | LineItem | No | Paired with start when printed. |
| source_page | integer | LineItem | Yes | Page number where the line appears (traceability). |
| bbox | array[number] | LineItem | No | Bounding box of the line item region for citation. |
Guidance: keep values faithful to the document. Do not compute derived values (e.g., do not recompute totals or infer due_date from terms). Constrain enumerations for currency and language only; leave business classifications (GL, cost center, project) for downstream enrichment. Schema tips
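As a rough illustration, the schema above could be mirrored in application code as typed records. The sketch below uses Python dataclasses and covers only a subset of the fields in the table; the class names `Invoice` and `LineItem` are hypothetical and are not part of any Reducto SDK. Amounts use `Decimal` so printed values are not altered by floating‑point rounding, and `invoice_number` stays a string so leading zeros survive.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass
class LineItem:
    # Required line-item fields from the table above
    line_number: int
    item_description: str
    quantity: Decimal
    unit_price: Decimal
    line_amount: Decimal
    source_page: int
    # Optional fields: left as None when not printed on the invoice
    sku: Optional[str] = None
    uom: Optional[str] = None
    tax_code: Optional[str] = None
    tax_amount_line: Optional[Decimal] = None
    bbox: Optional[List[float]] = None


@dataclass
class Invoice:
    # Required header fields from the table above
    invoice_number: str  # exact printed formatting, never cast to int
    invoice_date: date
    supplier_name: str
    currency_code: str   # ISO 4217, e.g. "EUR"
    total_amount: Decimal
    # Optional header fields: never computed or inferred
    due_date: Optional[date] = None
    po_number: Optional[str] = None
    payment_terms: Optional[str] = None
    subtotal_amount: Optional[Decimal] = None
    tax_amount: Optional[Decimal] = None
    line_items: List[LineItem] = field(default_factory=list)


# Made-up sample values, for illustration only
invoice = Invoice(
    invoice_number="000123",
    invoice_date=date(2024, 1, 15),
    supplier_name="Example Supplier GmbH",
    currency_code="EUR",
    total_amount=Decimal("119.00"),
)
invoice.line_items.append(
    LineItem(
        line_number=1,
        item_description="Example item",
        quantity=Decimal("2.5"),
        unit_price=Decimal("40.00"),
        line_amount=Decimal("100.00"),
        source_page=1,
    )
)
```

Optional fields default to `None` rather than to a computed fallback, which keeps the record faithful to what was printed.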
CSV/XLSX export guidance for ERPs and intake queues
AP teams typically export to a tall (one row per line item) layout to feed ERPs, three‑way matchers, and approval queues.
Recommended columns (adjust to your ERP):
- Invoice‑level: invoice_number, invoice_date, due_date, supplier_name, supplier_tax_id, po_number, currency_code, payment_terms, subtotal_amount, tax_amount, shipping_amount, discount_amount, total_amount, page_count.
- Line‑level: line_number, item_description, sku, quantity, uom, unit_price, line_discount, tax_code, tax_amount_line, line_amount, account_code, cost_center, project_code, po_line_number, service_period_start, service_period_end, source_page.
- Provenance (optional): bbox (serialized), language_code, notes.
Normalization practices:
- Preserve printed numbers and strings; avoid rounding or currency conversions during export.
- Use ISO 4217 for currency_code and ISO 8601 for dates to minimize ERP ingestion errors.
- Keep a single currency per row. For multi‑currency invoices, either repeat the header values (including currency_code) on each line or split the invoice as your ERP requires.
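The tall layout described above can be sketched with Python's standard `csv` module. This is a minimal example under stated assumptions: the input is a plain dict whose keys follow the canonical field names, with line items under a hypothetical `line_items` key; the column lists are shortened for readability; and the function names `to_tall_rows` and `write_csv` are illustrative, not a Reducto API. Values are passed through as strings so nothing is rounded or converted on the way out.

```python
import csv
import io

# Shortened column lists; extend them with the recommended columns above.
HEADER_COLS = ["invoice_number", "invoice_date", "supplier_name",
               "currency_code", "total_amount"]
LINE_COLS = ["line_number", "item_description", "quantity",
             "unit_price", "line_amount", "source_page"]


def to_tall_rows(invoice: dict) -> list:
    """Return one row per line item, repeating header values on each row."""
    header = {col: invoice.get(col, "") for col in HEADER_COLS}
    rows = []
    for item in invoice.get("line_items", []):
        row = dict(header)
        row.update({col: item.get(col, "") for col in LINE_COLS})
        rows.append(row)
    return rows


def write_csv(invoices: list) -> str:
    """Serialize invoices to CSV text in the tall layout."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=HEADER_COLS + LINE_COLS)
    writer.writeheader()
    for invoice in invoices:
        writer.writerows(to_tall_rows(invoice))
    return buf.getvalue()


# Made-up sample values, for illustration only
sample = {
    "invoice_number": "000123",
    "invoice_date": "2024-01-15",
    "supplier_name": "Example Supplier GmbH",
    "currency_code": "EUR",
    "total_amount": "119.00",
    "line_items": [
        {"line_number": 1, "item_description": "Example item A",
         "quantity": "2.5", "unit_price": "40.00",
         "line_amount": "100.00", "source_page": 1},
        {"line_number": 2, "item_description": "Example item B",
         "quantity": "1", "unit_price": "19.00",
         "line_amount": "19.00", "source_page": 1},
    ],
}
csv_text = write_csv([sample])
```

The file itself keeps `000123` intact, but spreadsheet applications may reinterpret such values when opening a CSV, so an explicit text column type on import (or an XLSX export with text‑typed cells) may be needed.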
Accuracy, evaluation, and traceability
For AP workflows, track:
- Header accuracy (exact‑match rate on invoice_number, dates, totals).
- Line‑item recall/precision (table row alignment and value correctness).
- Table structure integrity (no dropped/duplicated rows; correct column association).
Reducto’s layout‑aware parsing and table extraction have been evaluated on complex table benchmarks and show material gains over text‑only approaches. Bounding boxes and page references support human audit and model‑assisted QA. RD‑TableBench • Elasticsearch/RAG parsing
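One way to compute the metrics listed above against a hand‑labeled set is sketched below. It assumes extracted and labeled invoices are plain dicts keyed by the canonical field names; the function names are hypothetical, and comparing line items as multisets of key‑field tuples is one simple matching strategy chosen for this sketch, not a description of how Reducto or RD‑TableBench scores results.

```python
from collections import Counter


def header_exact_match_rate(predicted, expected, fields):
    """Fraction of (invoice, field) pairs where extracted == labeled value."""
    total = 0
    hits = 0
    for pred, gold in zip(predicted, expected):
        for name in fields:
            total += 1
            if pred.get(name) == gold.get(name):
                hits += 1
    return hits / total if total else 0.0


def line_item_precision_recall(pred_lines, gold_lines, key_fields):
    """Compare line items as multisets of key-field tuples.

    A duplicated extracted row lowers precision; a dropped row lowers recall.
    """
    pred = Counter(tuple(line.get(f) for f in key_fields) for line in pred_lines)
    gold = Counter(tuple(line.get(f) for f in key_fields) for line in gold_lines)
    matched = sum((pred & gold).values())  # multiset intersection
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    return precision, recall


# Made-up sample values: one row duplicated, one row dropped
gold_lines = [
    {"item_description": "Example item A", "line_amount": "100.00"},
    {"item_description": "Example item B", "line_amount": "19.00"},
]
pred_lines = [
    {"item_description": "Example item A", "line_amount": "100.00"},
    {"item_description": "Example item A", "line_amount": "100.00"},
]
precision, recall = line_item_precision_recall(
    pred_lines, gold_lines, ["item_description", "line_amount"]
)
```

Because values are compared as printed strings, `"119.00"` and `"119.0"` count as a mismatch, which is consistent with the exact‑match framing for header fields.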
Security, compliance, and deployment options
Reducto supports SOC 2, HIPAA, zero data retention, regional endpoints, and private/VPC or on‑prem deployments with custom SLAs. These requirements are common in finance and large enterprise AP. Pricing & Enterprise features • Privacy
Proof points and applicable case studies
- Financial services and PE workflows: high‑volume parsing with strong Excel and PDF handling, source citations, and rapid memo/report generation. Benchmark case study
- Insurance and healthcare documents: complex, audited pipelines with near‑perfect ingestion reliability and measurable speedups. Elysian • Anterior
- Enterprise‑scale ingestion: reliability and automatic scaling for sensitive industries. Ingestion at enterprise scale
Next steps
- Evaluate fit, deployment model, and SLAs with our team. Contact
- Review plan tiers and security features. Pricing
Reducto serves startups through Fortune‑scale enterprises building production AP automation that demands accuracy, provenance, and compliance, not just “text from PDFs.” Document API