Enhancing Automatic Invoice Coding Performance with Unstructured Data: A feature extraction approach
Eerola, Teemu (2024)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2024-04-30
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202404053320
Abstract
As part of invoice processing, companies document their spend in a process called invoice coding in order to meet the needs of financial audits, tax regulations, and spend management. The process involves selecting, for each coding dimension, a value from a predefined list; the dimensions describe different aspects of the spend. However, manual coding is time-consuming, and traditional invoice automation still fails to automate coding completely.
The invoice coding problem can be formulated as a supervised classification task: a machine learning model is trained on historical invoices to predict coding values for new invoices. This study explores how to incorporate the textual data of the invoice to improve the coding predictions. The baseline system has not previously used this data, because the lack of standardization in invoices makes it difficult to capture. This work studies feasible feature extraction methods and assesses their performance against a baseline that uses only invoice header data.
Experiments are conducted across datasets from seven companies with varying invoice origins, processes, and languages. The different coding dimensions bring diverse aspects of coding into the experiments, such as capturing related words or semantic meaning. The results show that incorporating textual data can improve accuracy by up to 21 percentage points over the baseline. The most robust performance improvement was obtained with latent semantic analysis features.
This thesis provides a practical pathway for implementing these enhancements in production and lays the groundwork for future advancements leveraging the textual data of invoice content.
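To illustrate the general idea of latent semantic analysis features in a supervised coding classifier, the following is a minimal sketch only. It assumes scikit-learn, a TF-IDF representation of the invoice text, and a logistic regression classifier; none of these choices are stated in the abstract. The invoice texts, the coding values, and the helper `build_lsa_classifier` are hypothetical toy examples, not data or code from the thesis.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: free text from invoices and a made-up coding value
# for a single coding dimension (not taken from the thesis).
invoice_texts = [
    "laptop computer docking station",
    "office chairs and desks",
    "software license annual subscription",
    "computer monitor and keyboard",
    "desk lamp and office chair",
    "cloud software subscription renewal",
]
coding_values = [
    "IT-hardware", "furniture", "software",
    "IT-hardware", "furniture", "software",
]


def build_lsa_classifier(n_components=3, random_state=0):
    """Assumed pipeline: TF-IDF -> truncated SVD (LSA) -> classifier."""
    return make_pipeline(
        TfidfVectorizer(),  # sparse term weights per invoice
        TruncatedSVD(n_components=n_components, random_state=random_state),  # dense LSA features
        LogisticRegression(max_iter=1000),  # any supervised classifier could be used here
    )


model = build_lsa_classifier()
model.fit(invoice_texts, coding_values)

# Predict a coding value for a new, unseen invoice text.
prediction = model.predict(["wireless keyboard for computer"])
print(prediction[0])
```

In a setting like the one the abstract describes, such text-derived features would presumably be combined with the invoice header features of the baseline rather than used alone; how that combination is done is not specified here.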