Guided Domain Randomization With Meta Reinforcement Learning
Ibrahim, Tarek (2022)
Master's Programme in Information Technology
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Acceptance date
2022-09-11
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tuni-202208316847
Abstract
Reinforcement learning policies often need to be trained in simulations of the real environments, since training directly on real agents can be infeasible or expensive. When such policies are transferred to the real agents, they often either fail completely or degrade significantly in performance, due to modelling deficiencies of the simulators used and/or a mismatch of parameter values between the two domains. Domain randomization methods in the literature have attempted to solve this problem by training policies in environments whose parameters (or a subset of them) are sampled from ranges during training, usually yielding robust policies. Although these policies perform better upon transfer, their robustness may still come at some cost in performance compared to policies that can adapt to the current context. This thesis therefore proposes Meta Guided Domain Randomization methods, in which domain randomization techniques are combined with meta reinforcement learning algorithms instead of standard reinforcement learning ones, in order to train adaptive policies. The thesis also presents an attempt to analyze and improve the Active Domain Randomization algorithm, one of the popular guided domain randomization methods in the literature. The improved Active Domain Randomization algorithm is then compared to its proposed meta guided domain randomization counterpart in sim-to-sim experiments on one of the MuJoCo® environments, and is shown to offer an improvement in average performance over the entire space of the sampled environment parameters.
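As a brief illustration of the domain randomization idea summarized above, the following minimal Python sketch resamples a simulator's parameters from fixed ranges before every training episode. The parameter names, their ranges, and the make_env/update_policy callables are hypothetical placeholders chosen for illustration; they are not taken from the thesis.

    import random

    # Hypothetical parameter ranges for a simulated environment; the names
    # and bounds are illustrative assumptions, not values from the thesis.
    PARAM_RANGES = {
        "friction": (0.5, 1.5),
        "mass": (0.8, 1.2),
        "damping": (0.9, 1.1),
    }

    def sample_env_params(ranges):
        """Uniformly sample one value per parameter from its range."""
        return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

    def train_with_domain_randomization(make_env, update_policy, num_episodes):
        """Uniform domain randomization: reconfigure the simulator with freshly
        sampled parameters before each episode, so the trained policy becomes
        robust across the whole parameter range rather than a single setting."""
        for _ in range(num_episodes):
            params = sample_env_params(PARAM_RANGES)
            env = make_env(params)   # build or reconfigure the simulator
            update_policy(env)       # one episode of standard RL training

Guided variants such as Active Domain Randomization replace the uniform sampling step with a learned sampler that concentrates on the most informative parameter settings, and the meta guided approach proposed in the thesis swaps the standard reinforcement learning update for a meta reinforcement learning one so that the resulting policy can adapt to the current context.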