# Intermediate goal generation for off-policy reinforcement learning methods

##### Heiskanen, Santeri (2022)

Tieto- ja sähkötekniikan kandidaattiohjelma - Bachelor's Programme in Computing and Electrical Engineering

Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences

This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.

##### Acceptance date

2022-05-17

**The permanent address of the publication is**

https://urn.fi/URN:NBN:fi:tuni-202205104688

##### Abstract

Reinforcement learning is a branch of machine learning that has seen increasing interest in recent years. Reinforcement learning aims to teach an agent to complete a task in an optimal way. Optimality is defined through a reward function that signals the desired behaviour to the agent. However, defining a reward function is often difficult, which has led to numerous attempts to automate the process so that no such function needs to be defined by hand. In this work, an existing goal generation algorithm designed for an on-policy reinforcement learning method is applied to an off-policy method. This work examines whether the algorithm works with the off-policy method without large modifications, and whether the algorithm's efficiency improves when used in conjunction with the off-policy method.

The theoretical part explains the basics of generative adversarial networks, which are a key part of the algorithm. Additionally, fundamental concepts related to reinforcement learning are presented. The algorithm used is described in detail. A generic, simple reward function is used in place of a task-specific, complex reward function. To avoid the problem where the agent receives little to no reward, the generative adversarial network is used to generate intermediate goals for the agent. The intermediate goals guide the agent towards completing the original task.
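Goal generation algorithms of this family commonly train the generator to propose goals of intermediate difficulty, i.e. goals the agent sometimes, but not always, reaches under its current policy. The following minimal sketch illustrates only that labelling step; the success-rate bounds `R_MIN` and `R_MAX` and the function name are illustrative assumptions, not values taken from the thesis, and the GAN training itself is omitted.

```python
# Illustrative sketch: label candidate goals as "intermediate difficulty"
# based on the agent's empirical success rate at reaching them.
# The goal-generating network would then be trained to produce
# goals labelled 1. Thresholds below are assumed, not from the thesis.

R_MIN, R_MAX = 0.1, 0.9  # hypothetical success-rate bounds


def label_goals(success_rates):
    """Return 1 for goals whose success rate lies in [R_MIN, R_MAX]
    (neither trivially easy nor currently unreachable), else 0."""
    return [1 if R_MIN <= r <= R_MAX else 0 for r in success_rates]


# Example: measured success rates of five candidate goals
labels = label_goals([0.0, 0.05, 0.5, 0.95, 1.0])
```

Filtering goals this way keeps the agent supplied with a learning signal: too-easy goals yield no new information, while too-hard goals yield no reward at all.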

To evaluate the performance of the algorithm, a simple yet illustrative task is used: a spider robot tries to reach the other end of a U-shaped maze. The results from the experiment show that the algorithm requires modifications before it can be applied to off-policy methods successfully. The study concludes that a more robust initialization procedure should be constructed to ensure that enough rewards are received at the beginning of the algorithm. Additionally, the exploration policy of the off-policy method is found to be insufficient for this task. The efficiency of the algorithm with off-policy methods is left as an open question, as the experiment did not provide results that could answer it directly. However, the experiment showed that the overall efficiency of the algorithm is poor, and thus more work is required to make it suitable for real-world problems.


##### Collections

- Bachelor's theses [8159]