Generalizing Pareto optimal policies in multi-objective reinforcement learning: An empirical study of hypernetworks
Heiskanen, Santeri (2024)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Acceptance date
2024-10-04
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-202409058563
Abstract
Multi-objective reinforcement learning is a branch of reinforcement learning that considers multiple conflicting objectives simultaneously. Many existing multi-objective approaches suffer from poor sample efficiency, often because they cannot reuse already learned skills when a new trade-off between the objectives is considered. Motivated by the recent success of hypernetworks, this work couples an existing multi-objective reinforcement learning algorithm with a hypernetwork to quantify how soft information sharing affects sample efficiency.
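A trade-off between objectives is commonly expressed as a preference vector that scalarizes the vector-valued reward. The following minimal sketch illustrates linear scalarization, a standard formulation in the multi-objective literature; the weights and reward values are illustrative and not taken from the thesis.

```python
import numpy as np

def scalarize(reward_vec, preference):
    """Collapse a multi-objective reward into a scalar via a preference
    vector with non-negative weights that sum to one."""
    w = np.asarray(preference, dtype=float)
    r = np.asarray(reward_vec, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return float(w @ r)

# Two conflicting objectives, e.g. forward speed vs. energy cost:
r = [1.0, -0.5]
print(scalarize(r, [0.7, 0.3]))  # weights the first objective more heavily
```

Each preference vector induces a different scalar problem, which is why an agent trained for one trade-off cannot, without some form of sharing, reuse what it learned for another.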
The theoretical section of the thesis explains the basics of multi-objective reinforcement learning and hypernetworks. The underlying assumptions and the algorithm used, together with the relevant background theory, are described in detail. An actor-critic model is explored, where the critic is modeled as a contextual bandit using the hypernetwork. The hypernetwork is conditioned on the state and the trade-off between objectives, while two critic network configurations with different inputs are considered to understand the information flow in the system.
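The core idea can be sketched as follows: a hypernetwork maps the state and the preference vector to the weights of a small critic, so that all preferences share the hypernetwork's parameters ("soft" information sharing). This is an assumed, deliberately tiny architecture for illustration only; the layer sizes, the single linear hypernetwork, and the linear critic are not the thesis's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, pref_dim, action_dim = 4, 2, 3

# Hypernetwork: one linear map from (state, preference) to the
# flattened parameters of the critic (a weight per action dim + bias).
n_critic_params = action_dim + 1
H = rng.normal(size=(n_critic_params, state_dim + pref_dim)) * 0.1

def critic_value(state, preference, action):
    """The hypernetwork generates the critic's weights from the context,
    then the generated critic scores the action."""
    ctx = np.concatenate([np.asarray(state, float),
                          np.asarray(preference, float)])
    params = H @ ctx                        # H is shared across all
    w, b = params[:action_dim], params[-1]  # preferences: soft sharing
    return float(w @ action + b)

s = rng.normal(size=state_dim)
a = rng.normal(size=action_dim)
print(critic_value(s, [0.5, 0.5], a))
```

Because `H` is shared, gradient updates made while training on one preference also move the critics generated for every other preference, which is the mechanism the thesis quantifies.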
The proposed method is evaluated in three robot control tasks of varying difficulty. The results indicate that while the hypernetworks can improve the convergence speed of the agents, the final performance is unaffected. A drawback of the approach is increased run-to-run fluctuation, caused by the agent's tendency to concentrate on the easiest objective while disregarding the others. A straightforward proof-of-concept experiment showed that this can be alleviated by accounting for the unbalanced objectives during training. The development of a general algorithm that accounts for variance in objective difficulty is left for future research. Overall, the hypernetworks showed promising performance, with evident flaws that require attention going forward.