Estimating elemental concentration of rocks from spectral data using ML
Khalid, Abdul Rehman (2025)
Master's Programme in Computing Sciences and Electrical Engineering
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2025-11-06
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-2025110510411
Abstract
Accurate and rapid estimation of elemental concentrations is essential for modern geochemical analysis and mineral exploration. Conventional laboratory assays provide precise results but are too slow for operational decision-making. This thesis investigates whether modern machine learning (ML) methods can predict elemental concentrations directly from laser-induced breakdown spectroscopy (LIBS) data collected under realistic industrial conditions.
The study integrates over 141 million spectra from 37 lithologically diverse rock samples with mineralogical and assay information, addressing key challenges such as data volume, matrix variability, and sparse or imbalanced elemental targets. A hierarchical imputation strategy (from box to lithology to global) and mineral-aware stratification were applied to ensure reliable label integration and modeling. Several model families were evaluated, including linear regressors, random forests, XGBoost, and hybrid designs (stacking and cascades), across 21–23 configurations per element. Evaluation used standard metrics (MSE, R²) and distributional diagnostics.
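The box-to-lithology-to-global fallback described above can be illustrated with a minimal sketch. It is not the thesis implementation: the record layout, the field names (`box`, `lithology`, `cu`), the example lithologies and values, and the use of the mean as the fill statistic are all assumptions made for illustration.

```python
from statistics import mean


def hierarchical_impute(records, target):
    """Fill missing `target` values: box mean, else lithology mean, else global mean.

    The mean is an assumed fill statistic; the source does not name one.
    """
    def group_means(key):
        # Collect observed (non-missing) target values per group, then average.
        groups = {}
        for r in records:
            if r[target] is not None:
                groups.setdefault(r[key], []).append(r[target])
        return {k: mean(v) for k, v in groups.items()}

    box_means = group_means("box")
    lith_means = group_means("lithology")
    observed = [r[target] for r in records if r[target] is not None]
    global_mean = mean(observed) if observed else None

    filled = []
    for r in records:
        value = r[target]
        if value is None:
            if r["box"] in box_means:            # level 1: same box
                value = box_means[r["box"]]
            elif r["lithology"] in lith_means:   # level 2: same lithology
                value = lith_means[r["lithology"]]
            else:                                # level 3: whole dataset
                value = global_mean
        filled.append({**r, target: value})
    return filled


# Hypothetical records: three of the five copper labels are missing.
records = [
    {"box": "B1", "lithology": "gabbro", "cu": 0.25},
    {"box": "B1", "lithology": "gabbro", "cu": None},
    {"box": "B2", "lithology": "gabbro", "cu": None},
    {"box": "B3", "lithology": "schist", "cu": None},
    {"box": "B4", "lithology": "peridotite", "cu": 0.75},
]
filled = hierarchical_impute(records, "cu")
```

In this toy data, the second record is filled from its box, the third from its lithology, and the fourth from the global mean, because no narrower level has an observed value.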
Domain-informed feature engineering substantially improved prediction accuracy (reducing MSE by ∼ 37–48%), with compact statistical selection providing additional gains (∼ 4–6%) compared to raw spectra. Hybrid approaches consistently outperformed single-model ensembles. Across all three elements (copper, sulphur, and nickel), the same performance pattern emerged: hybrids achieved the highest accuracy (R² above 0.998), followed by boosting, bagging, and linear models.
However, the results should be interpreted with caution, as within-sample validation may overestimate the performance that can be expected in new geological settings. Overall, the study demonstrates that machine learning provides a reliable and scalable framework for near-real-time quantitative analysis of elemental concentration from LIBS data in industrial geochemical applications.
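The caution about within-sample validation can be made concrete with a small sketch of one common alternative: holding out all spectra from one rock sample at a time, so that no sample contributes to both training and testing. This is an illustrative assumption, not the validation scheme used in the thesis, and the function name and sample identifiers are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx), holding out one whole group per fold."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical sample identifier for each spectrum (one entry per spectrum).
sample_ids = ["S1", "S1", "S2", "S2", "S3"]
folds = list(leave_one_group_out(sample_ids))

for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Scores from such a split are usually lower than within-sample scores, but they reflect performance on samples the model has not seen.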