Algorithmic extraction of data in tables in PDF documents
Nurminen, Anssi (2013)
Degree Programme in Information Technology
Faculty of Natural Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of approval
2013-05-08
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201305231166
Abstract
Tables are an intuitive and universally used way of presenting large sets of experimental results and research findings, and as such, they are the main source of significant data in scientific publications. As no universal standard exists for the format of the reported data or for table layouts, two highly flexible algorithms are created to (i) detect tables within documents and (ii) recognize table column and row structures. These algorithms enable completely automated extraction of tabular data from PDF documents.
PDF was chosen as the preferred target format for data extraction because of its popularity and because research publications are available as natively digital PDF documents, almost without exception. The extracted data is made available in HTML and XML formats. These two formats were chosen because of their flexibility and ease of use for further processing.
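As an illustration of why HTML and XML are convenient for further processing, the following minimal sketch serializes an already recognized table into both formats. It is not the thesis software: the function names `table_to_html` and `table_to_xml`, the XML element and attribute names (`table`, `row`, `cell`, `index`, `column`), and the sample data are all assumptions made for this example, and the table is assumed to be available as a simple list of rows.

```python
import xml.etree.ElementTree as ET
from html import escape


def table_to_html(rows):
    """Render a list of rows (each a list of cell strings) as an HTML table."""
    lines = ["<table>"]
    for row in rows:
        # Escape cell text so characters such as "<" stay valid HTML.
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def table_to_xml(rows):
    """Render the same rows as XML, using a hypothetical row/cell schema."""
    table = ET.Element("table")
    for r, row in enumerate(rows):
        row_el = ET.SubElement(table, "row", index=str(r))
        for c, cell in enumerate(row):
            cell_el = ET.SubElement(row_el, "cell", column=str(c))
            cell_el.text = cell  # ElementTree escapes text on serialization
    return ET.tostring(table, encoding="unicode")


# Made-up sample data standing in for an extracted table.
rows = [["Sample", "Yield (%)"], ["A", "12.5"], ["B", "<1"]]
html_out = table_to_html(rows)
xml_out = table_to_xml(rows)
```

Both outputs can be read back with standard parsers, for example `ET.fromstring(xml_out)`, which is what makes them practical inputs for later statistical analysis.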
The software application created as part of this thesis work allows future research to take full advantage of existing research and results by gathering large volumes of data from various sources for more thorough statistical analysis.