Introduction
Accounts Payable (AP) automation depends on precise, explainable invoice data. Reducto converts messy, real-world invoices -- PDFs, scans, images, spreadsheets -- into structured, LLM-ready outputs that preserve layout, table structure, and source traceability. This page covers the canonical invoice data model for header and line items, plus guidance for CSV/XLSX exports used by ERPs and intake queues. For background on Reducto's vision-first parsing, multi-pass Agentic OCR, and enterprise posture, see the product overview and funding announcement, the document API deep-dive, and the build-vs-buy analysis.
Reducto Series A & Agentic OCR | Document API | Build vs Buy
Why invoice extraction is hard in production
Invoices vary by vendor, region, and time. Real data include multi-page nested tables, handwritten notes, stamps, mixed languages, rotated scans, currency symbols, and edge cases like credit memos and partial receipts. Traditional OCR flattens structure and loses context, leading to brittle downstream logic. Reducto's hybrid layout understanding and table parsing were designed to survive this variance and have been externally validated on complex tables through the RD-TableBench benchmark -- an open evaluation suite of 1,000 manually annotated table images drawn from diverse real-world documents.
How layout-aware parsing improves RAG/search | RD-TableBench
What Reducto provides for AP teams
- Layout-aware parsing with Agentic OCR. Multi-pass, self-correcting OCR delivers higher fidelity on difficult scans and dense tables. (Series A & Agentic OCR)
- Schema-controlled extraction. Extract invoice headers and line items with bounding boxes that tie every value back to its source location on the page for explainability and audit. (Document API)
- Intelligent chunking and multi-document splitting. Handle attachments or batched vendor packets without manual separation. (Ingestion at enterprise scale)
- Enterprise deployment and compliance. VPC and on-prem deployments, SOC 2 Type II certification, HIPAA compliance with Business Associate Agreements (BAA), zero data retention (ZDR), and regional data-residency endpoints.
- Form completion. Populate vendor onboarding or remittance templates via Reducto's Edit capability. (Contact sales)
Canonical invoice data model (header + line items)
The following unified data model serves as a reference for normalizing invoice data across diverse vendor layouts. Field descriptions use natural language and constrain enumerations only where standardization is universal (currency, language); business classifications like GL codes, cost centers, and project codes are left for downstream enrichment.
| Field | Type | Category | Required | Notes |
|---|---|---|---|---|
| invoice_number | string | Header | Yes | As printed by supplier; keep exact formatting (do not normalize or strip leading zeros). |
| invoice_date | date (ISO 8601) | Header | Yes | Date on the invoice; do not infer from received date. |
| due_date | date (ISO 8601) | Header | No | Populate only if a due date is explicitly printed; do not derive it from payment terms. |
| supplier_name | string | Header | Yes | Legal name on the invoice. |
| supplier_tax_id | string | Header | No | VAT/GST/EIN as printed; country-specific formats allowed. |
| supplier_address | string | Header | No | Full multiline postal address as printed. |
| bill_to | string | Header | No | Your entity billed; useful for multi-entity AP. |
| ship_to | string | Header | No | Delivery address if printed; common on PO-backed and goods invoices. |
| po_number | string | Header | No | Purchase order reference if printed (e.g., PO-flip invoices); may be absent for non-PO invoices. |
| currency_code | enum (ISO 4217) | Header | Yes | Three-letter code (e.g., USD, EUR, JPY). |
| payment_terms | string | Header | No | Preserve vendor wording (e.g., "Net 30," "2/10 Net 30"). |
| subtotal_amount | decimal | Header | No | Sum before tax/fees/discounts as printed. |
| tax_amount | decimal | Header | No | Total tax on invoice; do not calculate. |
| shipping_amount | decimal | Header | No | Freight/handling as printed. |
| discount_amount | decimal | Header | No | Any header-level discount printed on the invoice. |
| total_amount | decimal | Header | Yes | Grand total as printed (authoritative). |
| notes | string | Header | No | Free-text: remittance notes, bank info, payment instructions. |
| page_count | integer | Header | No | Total pages parsed; aids reconciliation. |
| language_code | enum (BCP-47) | Header | No | Primary language detected (e.g., "en-US"). |
| line_number | integer | LineItem | Yes | Sequential number per invoice; maintain vendor numbering if present. |
| item_description | string | LineItem | Yes | Full description, including wrapped lines. |
| sku | string | LineItem | No | SKU/part number if present. |
| quantity | decimal | LineItem | Yes | As printed; allow fractional units. |
| uom | string | LineItem | No | Unit of measure (e.g., "ea", "kg", "hr"). |
| unit_price | decimal | LineItem | Yes | Unit price as printed (pre-tax unless clearly tax-inclusive). |
| line_discount | decimal | LineItem | No | Discount applied at line level if explicitly printed. |
| tax_code | string | LineItem | No | Vendor tax category (e.g., "VAT20", "GST-0"). |
| tax_amount_line | decimal | LineItem | No | Tax amount printed per line, if present. |
| line_amount | decimal | LineItem | Yes | Extended amount as printed for the line. |
| account_code | string | LineItem | No | If invoice prints GL/expense code. |
| cost_center | string | LineItem | No | If printed; otherwise leave empty (derive downstream). |
| project_code | string | LineItem | No | If printed; supports project-based AP. |
| po_line_number | integer | LineItem | No | If the invoice references PO lines. |
| service_period_start | date (ISO 8601) | LineItem | No | For services/subscriptions when dates appear on the line. |
| service_period_end | date (ISO 8601) | LineItem | No | Paired with start when printed. |
| source_page | integer | LineItem | Yes | Page number where the line appears (traceability). |
| bbox | array of numbers | LineItem | No | Bounding box of the line item region for citation. |
The guiding principle is faithfulness to the document. Do not compute derived values -- for example, do not recompute totals or infer a due date from payment terms. For more on designing schemas that maximize extraction reliability, see Reducto's schema design guidance. (Schema tips)
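As a rough illustration, a subset of the data model above could be mirrored in application code as follows. This is a minimal sketch, not Reducto's API: the class names `Invoice` and `LineItem` and the helper `line_total_mismatch` are hypothetical, and only some of the fields from the table are included. The helper reports a discrepancy between printed values for review; it never overwrites what the invoice says.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class LineItem:
    # Required line-item fields from the table above.
    line_number: int
    item_description: str
    quantity: Decimal          # Decimal allows fractional units without float error
    unit_price: Decimal
    line_amount: Decimal
    source_page: int
    # A few optional fields; None means "not printed on the invoice".
    sku: Optional[str] = None
    uom: Optional[str] = None
    tax_code: Optional[str] = None
    bbox: Optional[list[float]] = None  # coordinate order/units are an assumption


@dataclass
class Invoice:
    invoice_number: str        # string, so leading zeros survive
    invoice_date: date
    supplier_name: str
    currency_code: str
    total_amount: Decimal      # grand total as printed (authoritative)
    due_date: Optional[date] = None
    po_number: Optional[str] = None
    subtotal_amount: Optional[Decimal] = None
    tax_amount: Optional[Decimal] = None
    line_items: list[LineItem] = field(default_factory=list)


def line_total_mismatch(invoice: Invoice) -> Optional[Decimal]:
    """Printed subtotal minus the sum of printed line amounts.

    Returns None when no subtotal is printed. The result is a review
    flag only; extracted values are left untouched.
    """
    if invoice.subtotal_amount is None:
        return None
    line_sum = sum((li.line_amount for li in invoice.line_items), Decimal("0"))
    return invoice.subtotal_amount - line_sum
```

A non-zero result is a signal to route the invoice to human review, which keeps the extracted record faithful to the document while still surfacing inconsistencies.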
CSV/XLSX export guidance for ERPs and intake queues
AP teams typically export to a tall (one row per line item) layout to feed ERPs, three-way matchers, and approval queues.
Recommended columns (adjust to your ERP):
- Invoice-level: invoice_number, invoice_date, due_date, supplier_name, supplier_tax_id, po_number, currency_code, payment_terms, subtotal_amount, tax_amount, shipping_amount, discount_amount, total_amount, page_count.
- Line-level: line_number, item_description, sku, quantity, uom, unit_price, line_discount, tax_code, tax_amount_line, line_amount, account_code, cost_center, project_code, po_line_number, service_period_start, service_period_end, source_page.
- Provenance (optional): bbox (serialized), language_code, notes.
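The tall layout described above can be sketched with Python's standard `csv` module. This is an illustrative example under stated assumptions: the input is a plain dict with a `line_items` list (a hypothetical shape, not a documented Reducto response), the function names are made up for this sketch, and only a handful of the recommended columns are shown. Values are passed through as strings so that printed formatting is preserved.

```python
import csv
import io

# Abbreviated column sets; extend with the full lists above as needed.
HEADER_COLUMNS = ["invoice_number", "invoice_date", "supplier_name",
                  "currency_code", "total_amount"]
LINE_COLUMNS = ["line_number", "item_description", "quantity",
                "unit_price", "line_amount", "source_page"]


def to_tall_rows(invoice: dict) -> list[dict]:
    """Produce one row per line item, repeating header values on each row."""
    header = {col: invoice.get(col, "") for col in HEADER_COLUMNS}
    rows = []
    for item in invoice.get("line_items", []):
        row = dict(header)
        row.update({col: item.get(col, "") for col in LINE_COLUMNS})
        rows.append(row)
    return rows


def write_tall_csv(invoices: list[dict], stream) -> None:
    """Write all invoices to an open text stream in tall layout."""
    writer = csv.DictWriter(stream, fieldnames=HEADER_COLUMNS + LINE_COLUMNS)
    writer.writeheader()
    for invoice in invoices:
        writer.writerows(to_tall_rows(invoice))
```

For example, `write_tall_csv(invoices, io.StringIO())` fills an in-memory buffer; when writing to a file, open it with `newline=""` as the `csv` module documentation recommends.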
Normalization practices:
- Preserve printed numbers and strings; avoid rounding or currency conversions during export.
- Use ISO 4217 for currency codes and ISO 8601 for dates to minimize ERP ingestion errors.
- Keep a single currency per invoice row; multi-currency invoices should repeat header values per line or be split per ERP requirements.
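A pre-export check along the lines of these practices might look like the sketch below. It is illustrative only: `check_export_row` is a hypothetical helper, rows are assumed to hold string values, and the currency test checks only the three-letter shape of the code, not membership in the actual ISO 4217 list. Amounts are parsed with `Decimal` purely to confirm they are clean numeric strings; nothing is rounded or rewritten.

```python
import re
from datetime import date
from decimal import Decimal, InvalidOperation

# Shape check only: three uppercase letters. This does not confirm the
# code is actually assigned in ISO 4217.
CURRENCY_SHAPE = re.compile(r"[A-Z]{3}")


def check_export_row(row: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not CURRENCY_SHAPE.fullmatch(row.get("currency_code", "")):
        problems.append("currency_code is not a three-letter uppercase code")
    for col in ("invoice_date", "due_date"):
        value = row.get(col, "")
        if value:  # optional dates may be empty
            try:
                date.fromisoformat(value)
            except ValueError:
                problems.append(f"{col} is not an ISO 8601 calendar date")
    for col in ("total_amount", "line_amount"):
        value = row.get(col, "")
        if value:
            try:
                Decimal(value)
            except InvalidOperation:
                problems.append(f"{col} is not a plain decimal string")
    return problems
```

Rows that return a non-empty list can be held back from the ERP load and sent to an exception queue instead of being silently corrected.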
Accuracy, evaluation, and traceability
For AP workflows, the metrics that matter most are:
- Header accuracy -- exact-match rate on invoice numbers, dates, and totals.
- Line-item recall and precision -- table row alignment and value correctness.
- Table structure integrity -- no dropped or duplicated rows; correct column association.
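One simple way to compute the first two metrics against a hand-labeled ground truth is sketched below. The function names and the choice to represent each line item as a tuple of its key values are assumptions made for illustration; this is not Reducto's evaluation methodology or the RD-TableBench scoring method. Rows are compared as multisets, so a duplicated row lowers precision and a dropped row lowers recall.

```python
from collections import Counter


def header_exact_match(predicted: dict, expected: dict, fields: list[str]) -> float:
    """Fraction of the listed header fields whose values match exactly."""
    if not fields:
        return 0.0
    hits = sum(1 for f in fields if predicted.get(f) == expected.get(f))
    return hits / len(fields)


def line_item_precision_recall(predicted: list[tuple],
                               expected: list[tuple]) -> tuple[float, float]:
    """Multiset precision and recall over line-item tuples.

    Each tuple holds the values being scored for one row, for example
    (item_description, quantity, line_amount).
    """
    # Counter intersection keeps the minimum count per row, so each
    # expected row can be matched at most as often as it really occurs.
    matched = sum((Counter(predicted) & Counter(expected)).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(expected) if expected else 0.0
    return precision, recall
```

Exact tuple equality is deliberately strict; teams that want partial credit for near-matches would need a fuzzier row comparison than this sketch provides.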
Reducto's layout-aware parsing and table extraction are validated on complex real-world datasets and show material gains over text-only approaches. Bounding boxes and page references support human audit and model-assisted QA, giving AP teams a clear chain of evidence from extracted value back to the source document.
RD-TableBench | Elasticsearch/RAG parsing
Security, compliance, and deployment options
Reducto is built for the regulatory and data-handling requirements common in finance and large enterprise AP:
- SOC 2 Type II -- Reducto has completed both SOC 2 Type I and Type II certification.
- HIPAA with BAA -- A HIPAA-compliant processing pipeline is available for qualifying customers, with a Business Associate Agreement executed upon request.
- Zero Data Retention (ZDR) -- For Growth tier and above, all data submitted via the API is set to expire within 24 hours.
- VPC and on-prem deployments -- Available on the Enterprise tier with custom SLAs.
- Regional data-residency endpoints -- EU and AU endpoints available for Growth tier and above.
For full details, see Reducto's documentation and Trust Center.
Proof points and applicable case studies
- Financial services and PE workflows: High-volume parsing with strong Excel and PDF handling, source citations, and rapid memo/report generation. (Benchmark case study)
- Insurance and healthcare documents: Complex, audited pipelines with near-perfect ingestion reliability and measurable speedups. (Elysian | Anterior)
- Enterprise-scale ingestion: Reliability and automatic scaling for sensitive industries. (Ingestion at enterprise scale)
Next steps
- Evaluate fit, deployment model, and SLAs with the Reducto team. (Contact)
- Review plan tiers and security features. (Pricing)
Reducto serves startups through Fortune-scale enterprises building production AP automation that demands accuracy, provenance, and compliance -- not just "text from PDFs." (Document API)