Information extraction from patent office actions using NLP techniques
Gaur, Vishal (2020)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2020-11-20
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-202010277549
Abstract
Patent processing consists of several important tasks, such as filing applications, searching for prior art, and responding to office actions. Researchers have mostly focused on prior art and patent applications while neglecting the analysis of office actions. This thesis is devoted to classifying and extracting relevant information from office actions. Since office actions have not been part of mainstream research, a new dataset of 2,300 office actions was created for this work. The office actions were processed with the Tesseract OCR engine to produce a raw dataset, which was then cleaned, pre-processed, and manually labelled into two classes: relevant (1) and irrelevant (0). The resulting dataset contained almost 30,000 irrelevant paragraphs and around 6,000 relevant paragraphs. A text classification approach was applied to the processed dataset to separate relevant from irrelevant information.
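The thesis does not publish its preprocessing code, but the cleaning step it describes (turning raw OCR output into labelled paragraphs) can be sketched with the standard library alone. The example text, the hyphenation fix, and the paragraph-splitting rule are illustrative assumptions, not the thesis's actual pipeline:

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Normalise raw OCR output: drop non-printable characters,
    re-join words hyphenated across line breaks, collapse spacing."""
    text = "".join(ch for ch in raw if ch.isprintable() or ch in "\n\t ")
    text = re.sub(r"-\n(\w)", r"\1", text)   # "re-\njected" -> "rejected"
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    return text.strip()

def split_paragraphs(text: str) -> list[str]:
    """Split cleaned text into paragraphs on blank lines."""
    return [p.replace("\n", " ").strip()
            for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical OCR fragment of an office action
raw = "The claim is re-\njected under 35 U.S.C. 103.\n\nMail Stop:  Office\tAddress"
paras = split_paragraphs(clean_ocr_text(raw))
```

Each resulting paragraph would then be manually labelled relevant (1) or irrelevant (0), as described above.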
Since the dataset was created for this research, there were no existing baseline results. To establish a baseline, a Gaussian Naïve Bayes classifier was used. Random Forest, AdaBoost, and XGBoost were chosen as the models to evaluate in the experiments. In text classification, researchers apply traditional algorithms such as SVMs and decision trees alongside state-of-the-art methods such as CNNs and DNNs. The Random Forest algorithm used in this research is an ensemble of decision trees. AdaBoost and XGBoost are boosting techniques: classifiers are trained over multiple iterations, where each iteration re-weights the training examples (AdaBoost) or fits the residual errors of the previous iteration (XGBoost) to improve accuracy, and the final ensemble combines these iterations into a single stronger classifier. Text vectorization also plays an important role in text classification; therefore, this thesis used a Count Vectorizer and a TF-IDF Vectorizer to investigate results on the dataset. The Count Vectorizer considers only the raw frequency of each vocabulary term, whereas the TF-IDF Vectorizer combines term frequency with the inverse frequency of the term across the documents.
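The difference between the two vectorizers can be shown with a small pure-Python sketch. The toy documents are invented, and the smoothed IDF formula below is one common variant (scikit-learn's TfidfVectorizer uses a similar smoothed form, plus normalisation):

```python
import math

# Toy "paragraphs" standing in for office-action text
docs = [
    "claim rejected under section 103",
    "claim objected informalities",
    "applicant reply required",
]

# Count vectorization: raw term frequency per document
vocab = sorted({w for d in docs for w in d.split()})
counts = [[d.split().count(w) for w in vocab] for d in docs]

# TF-IDF: weight each count by how rare the term is across documents
n = len(docs)
df = {w: sum(w in d.split() for d in docs) for w in vocab}
idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
tfidf = [[tf * idf[w] for w, tf in zip(vocab, row)] for row in counts]
```

Here "claim" appears in two of the three documents, so its IDF (and hence its TF-IDF weight) is lower than that of a term unique to one document, such as "applicant".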
The experiments were carried out in the Spyder IDE. The baseline accuracies were 93.17% and 87.78% for the Count Vectorizer and the TF-IDF Vectorizer, respectively. The experiments showed XGBoost to be the best classifier, with the highest accuracy of 96.50% using the TF-IDF Vectorizer; the lowest accuracy, 95.69%, was achieved by AdaBoost with the same vectorizer. XGBoost also had the highest F1-score at 0.866, followed by Random Forest at 0.850 and AdaBoost at 0.842. Although the AdaBoost model's accuracy was higher with the Count Vectorizer, the F1-scores of all models were better with the TF-IDF Vectorizer.
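On a dataset this imbalanced (roughly 30,000 irrelevant vs. 6,000 relevant paragraphs), accuracy alone can look high while the minority class is classified poorly, which is why the F1-score is reported alongside it. A minimal sketch with purely illustrative confusion counts (not the thesis's actual results):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (relevant) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical test split mirroring the 5:1 class imbalance:
# 5880 true negatives, 1020 true positives, 120 false positives, 180 false negatives
tp, fp, fn, tn = 1020, 120, 180, 5880
accuracy = (tp + tn) / (tp + fp + fn + tn)
p, r, f1 = precision_recall_f1(tp, fp, fn)
```

With these counts accuracy is about 0.958 while F1 is only about 0.872, illustrating how the minority (relevant) class drags F1 below the headline accuracy.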
Analysing the classifiers' results made it clear that paragraphs containing contradictory keywords were more prone to misclassification. However, the imbalance of the dataset could also contribute, as the number of irrelevant paragraphs greatly exceeds the number of relevant ones.