Forecasting short-term crude oil price using global news
Ylinen, Hanna (2025)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2025-05-27
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202505276204
Abstract
The price of crude oil, particularly the Brent benchmark originating from Europe, fluctuates unpredictably, although it is influenced to some degree by political, economic, and environmental factors. This raises two questions: how well do machine learning methods forecast the Brent price in the short term from historical price data, and can global event information mined from international news improve prediction performance?
Previous research shows that supervised machine learning methods, such as neural networks, provide effective techniques for price forecasting when combined with additional explanatory factors, including news. In this thesis, a local machine-learning environment was set up to forecast the price of Brent oil. The data ranged from 2022 to 2024, and the news was provided by The Guardian. Numerical indicators of probable influence on the Brent price were extracted from news headlines by prompting large language models. Prompt engineering techniques were applied to the pre-trained transformers Phi-4 and Qwen2.5, which demonstrated logical reasoning and contextual understanding, capabilities that require broad general knowledge.
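The headline-scoring step can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the prompt wording, the -5 to 5 scale, and the helper names `build_prompt` and `parse_score` are hypothetical and are not taken from the thesis. The call to the language model itself is left out, so the sketch only shows how a prompt might be assembled and how a numeric indicator might be read back from a free-text reply.

```python
import re


def build_prompt(headline):
    """Assemble a scoring prompt for one headline (hypothetical wording and scale)."""
    return (
        "Rate the probable influence of this news headline on the Brent crude "
        "oil price on a scale from -5 (strongly lowers) to 5 (strongly raises). "
        "Answer with a single integer.\n"
        f"Headline: {headline}"
    )


def parse_score(reply, low=-5, high=5):
    """Read the first integer in the model reply and clamp it to [low, high].

    Returns None when the reply contains no integer at all.
    """
    match = re.search(r"-?\d+", reply)
    if match is None:
        return None
    return max(low, min(high, int(match.group())))


# Made-up replies, standing in for model output.
example_prompt = build_prompt("OPEC+ agrees to cut output")
score_ok = parse_score("Score: 3")
score_clamped = parse_score("-7")
score_missing = parse_score("I cannot tell.")
```

Clamping and the `None` fallback are one possible way to keep malformed replies from leaking into the numerical features; other policies, such as re-prompting, would work as well.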
The optimized time-series-only model was analyzed statistically and found to perform at a level similar to the martingale heuristic, which forecasts that the next price equals the last observed one. This indicated that the historical price data lacked sufficiently learnable patterns.
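A minimal sketch of a last-value (martingale-style) baseline may help make the comparison concrete. The prices below are made up for illustration and are not data from the thesis, and the error metric (RMSE) is an assumption, since the abstract does not say which statistics were used.

```python
from math import sqrt


def martingale_forecast(prices):
    """One-step-ahead forecasts: the forecast for day t is the price on day t-1."""
    return prices[:-1]


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up illustrative prices (USD per barrel), not thesis data.
prices = [80.0, 82.0, 81.0, 85.0, 84.0]

forecasts = martingale_forecast(prices)  # forecasts for days 1..4
actuals = prices[1:]                     # observed prices on days 1..4
baseline_rmse = rmse(actuals, forecasts)
```

A learned model is only useful in this setting if its error on held-out data is clearly below `baseline_rmse`; matching the baseline suggests the model has learned little beyond repeating the last price.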
During training, the machine learning models implemented in this thesis tended to overfit. In regression tasks, the predictions of both the time-series-only models and the models combining price and news exhibited phase delays, defaulting to the last observed value. The models without price data failed to learn any patterns. In classification tasks, the time-series-only model achieved an accuracy of 35.51 %, indicating performance close to random chance. The highest classification accuracy was obtained with time series combined with Phi-4 outputs: 38.37 % accuracy and an F1-score of 35.05.
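As an illustration of how accuracy and F1 can be computed for a price-direction classifier, the sketch below labels each day's move as "up", "down", or "flat" and scores a set of predictions. The three-class labelling, the 0.5 threshold, the macro-averaged F1, and all numbers are assumptions made for the example; the abstract does not specify the class definitions or the F1 variant used in the thesis.

```python
def direction_labels(prices, threshold=0.5):
    """Label each day-to-day change as 'up', 'down', or 'flat' (hypothetical scheme)."""
    labels = []
    for prev, cur in zip(prices, prices[1:]):
        change = cur - prev
        if change > threshold:
            labels.append("up")
        elif change < -threshold:
            labels.append("down")
        else:
            labels.append("flat")
    return labels


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all classes that occur."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up prices and predictions, not thesis data.
prices = [80.0, 82.0, 82.2, 81.0, 83.0]
y_true = direction_labels(prices)
y_pred = ["up", "up", "down", "flat"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

With three roughly balanced classes, random guessing gives an accuracy near one third, which is the kind of reference point against which a figure such as 35.51 % would be read under this assumed setup.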
Overall, this study concludes that the time-series data used in this thesis lacked learnable patterns. The information mined from the news showed potential to be meaningful, but its contribution to performance improvement stayed low under these conditions.
