Fraudulent Respondent Recognition in Adolescents’ Health Survey Data : Comparison of Data Analysis Methods
Myöhänen, Anna (2021)
Master's Programme in Computational Big Data Analytics
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2021-11-24
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202111238607
Abstract
When responding to a pseudonymous survey, participants are expected to answer truthfully. Nevertheless, a setting with no personal connection to the researchers makes it possible for a respondent to give false answers intentionally. If the research problem requires selecting a small fraction of the participants, even a low rate of false responses may distort the analysis results.
Intentional fraud has not been studied extensively in the context of adolescents’ health surveys. However, when statistical analyses are conducted on data that contains fraudulent respondents, excluding them may be necessary for the study to succeed. This work attempts to find an automated solution for this task. A real research dataset (N=7675) of 15- to 16-year-old respondents was used to test data analysis methods for recognizing such respondents. The goal was to find out which method is best at finding suspicious respondents.
Questionnaire data is a special kind of data for outlier detection. In this thesis, Mahalanobis distance, isolation forest, and the DBSCAN clustering algorithm are tested for finding suspicious respondents in the data. A further method, the simple rules approach, judges the plausibility of the answers according to predefined rules. The classification results of each method are compared to a manual classification that was made for an earlier study. The methods are tested with the same variable groups to obtain comparable results.
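As an illustration of one of the outlier detection methods named above, the following is a minimal sketch of flagging respondents by Mahalanobis distance. It is not the thesis implementation: the synthetic data, the function names `mahalanobis_distances` and `flag_outliers`, the use of the pseudo-inverse, and the 0.99 quantile cut-off are all assumptions made for the example.

```python
import numpy as np


def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Pseudo-inverse (an assumption) so a singular covariance does not fail.
    cov_inv = np.linalg.pinv(cov)
    # d_i^2 = (x_i - mean)^T S^-1 (x_i - mean), computed row by row.
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(np.maximum(d2, 0.0))


def flag_outliers(X, quantile=0.99):
    """Flag rows whose distance exceeds the given quantile (hypothetical cut-off)."""
    d = mahalanobis_distances(X)
    return d > np.quantile(d, quantile)


# Synthetic stand-in for numeric questionnaire answers (not real survey data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0] = [10.0, -10.0, 10.0]  # one deliberately implausible respondent

distances = mahalanobis_distances(X)
flags = flag_outliers(X)
print(int(np.argmax(distances)), bool(flags[0]))
```

A quantile cut-off always flags a fixed share of respondents, so in practice the threshold would need to be chosen with the expected rate of false responses in mind.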
Given the research design, it cannot be said clearly whether the manual classification is dominated by the variables that the simple rules use for classification, or whether those variables are simply the best for recognizing suspicious responses. If the manual classification is not questioned, the simple rules approach provides the highest sensitivity in the "health variables 2" group, which is tested with all methods. Isolation forest offers the second-best sensitivity for that group. For the other variable groups, isolation forest provides higher sensitivity than Mahalanobis distance in every tested group. The results suggest that the more variables are used in an analysis, the higher the achieved sensitivity.
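Sensitivity here is the share of respondents marked as fraudulent in the reference classification that a method also flags, that is, TP / (TP + FN). The sketch below shows the calculation on made-up labels; the function name `sensitivity` and the toy lists are assumptions for illustration and do not come from the thesis data.

```python
def sensitivity(reference, predicted):
    """Share of reference positives that the method also flags: TP / (TP + FN)."""
    if len(reference) != len(predicted):
        raise ValueError("label lists must have the same length")
    true_pos = sum(1 for r, p in zip(reference, predicted) if r and p)
    false_neg = sum(1 for r, p in zip(reference, predicted) if r and not p)
    if true_pos + false_neg == 0:
        raise ValueError("reference classification contains no positives")
    return true_pos / (true_pos + false_neg)


# Made-up labels: True means "classified as fraudulent".
reference = [True, True, True, False, False, True]   # classification made by hand
predicted = [True, False, True, True, False, True]   # output of some method

result = sensitivity(reference, predicted)
print(result)  # 3 of the 4 reference positives are found -> 0.75
```

Sensitivity ignores false positives (the fourth respondent above), so a method that flags everyone would score 1.0; it is only informative alongside the number of respondents flagged.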
Even though simple rules gave higher sensitivity than isolation forest, it is not the most recommendable of these methods. With more variables, isolation forest achieved almost as high a sensitivity as simple rules did with its limited number of variables, and with less implementation effort. The results suggest that increasing the number of variables might improve the sensitivity further. Therefore, isolation forest seems to be the most applicable solution for this task.