Trepo

Synthetic Data for Minority Class Boosting in Money Laundering Transactions : A comparative study with ensemble classifiers

Gallage, Hashan (2026)

 

Master's Programme in Computing Sciences and Electrical Engineering
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Acceptance date
2026-04-21
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202604163934
Abstract
Money laundering, the process of concealing the origins of illicit financial flows, presents significant financial, legal, and reputational risks to banks, making effective anti-money-laundering (AML) practices essential. In addition, to adhere to the risk-based approach mandated under the Bank Secrecy Act (BSA), financial institutions are obligated to maintain robust AML compliance procedures and to report any suspected instances of money laundering, terrorist financing, or other illicit financial activities to the relevant regulatory authorities. The main internal controls that banks use for AML risk assessment are customer risk rating and transaction monitoring. Because financial institutions process millions of transactions each day and every alert generated by the monitoring system requires manual review by compliance analysts, the scale of transaction monitoring makes effective and sustainable AML surveillance all the more necessary.

Although traditional AML surveillance relies heavily on rule-based systems derived from expert knowledge or careful data analysis, such systems are often too rigid to capture complex and evolving laundering behaviors and are prone to generating excessive false alerts. Machine learning (ML) approaches, which leverage statistical patterns and nonlinear relationships in the data, offer a more adaptive and scalable alternative to conventional rule-based systems. By modeling transaction behavior holistically, rather than relying on static if-then rules, ML classifiers have the potential to improve detection accuracy and reduce unnecessary alerts, thus supporting both operational efficiency and regulatory effectiveness. Excessive false positives impose substantial operational burdens, consuming significant investigative time and resources. Hence, the dual objective of minimizing false positives while maximizing true detections is critical to the design of modern transaction monitoring systems.

Access to the granular transactional data needed to develop advanced models is severely restricted by confidentiality, banking secrecy, and data-protection regulations. Consequently, this study utilizes two publicly available synthetic AML datasets. Furthermore, real-world AML datasets inherently suffer from severe class imbalance, since genuinely identified money laundering cases account for only a very small proportion of the overall transaction volume. To address this challenge, this study systematically investigates whether augmenting the minority class with synthetic data can improve the classification performance of tree-based ensembles under severe class imbalance. We further examine how varying the amount of generated synthetic data influences model performance, using three data synthesizers: Gaussian Copula (GC), Conditional Tabular Generative Adversarial Network (CTGAN), and Tabular Variational Autoencoder (TVAE). To assess the realism of the generated data, we also conduct diagnostic tests with structural fidelity checks against the real data, and quality tests based on distributional similarity from a statistical perspective. We then evaluate the downstream effect of augmentation on multiple ensemble-based classifiers using the geometric mean (G-mean). The G-mean is emphasized because it jointly rewards high sensitivity (detecting illicit transactions) and high specificity (controlling false alerts), making it particularly appropriate for highly imbalanced settings. From the results of the experiments carried out with synthetic data augmentation, we found that the XGBoost classifier provided better performance across all synthesizers. In addition, adding increasing amounts of synthetic data resulted in a progressive decline in classifier performance.
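
To illustrate why the G-mean suits imbalanced settings, the following is a minimal Python sketch, not the thesis's own code. It assumes the standard binary definition, G-mean = sqrt(sensitivity × specificity). The function name `g_mean`, the toy labels, and both toy "detectors" are hypothetical and chosen only for illustration.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels.

    Assumes the standard definition sqrt(sensitivity * specificity);
    the function name and signature are illustrative only.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy data: 990 licit (0) and 10 illicit (1) transactions.
y_true = [0] * 990 + [1] * 10

# Trivial classifier that never raises an alert.
always_licit = [0] * 1000

# Hypothetical detector: 99 false alerts, 8 of the 10 illicit cases caught.
some_detector = [1] * 99 + [0] * 891 + [1] * 8 + [0] * 2

# Plain accuracy of the trivial classifier looks excellent despite
# it missing every illicit transaction.
accuracy_trivial = sum(t == p for t, p in zip(y_true, always_licit)) / len(y_true)

print("accuracy (never alerts):", accuracy_trivial)
print("G-mean   (never alerts):", g_mean(y_true, always_licit))
print("G-mean   (toy detector):", round(g_mean(y_true, some_detector), 4))
```

In this toy setup the classifier that never alerts reaches high accuracy yet scores a G-mean of zero, because its sensitivity is zero. The detector that trades some false alerts for detections scores much higher, which is the behavior the abstract attributes to the metric.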
Collections
  • Theses - Master's degree [42258]
Kalevantie 5
P.O. Box 617
33014 Tampere University
oa[@]tuni.fi | Privacy | Accessibility statement
 

 
