## Reinforcement Q-learning for Model-Free Optimal Control: Real-Time Implementation and Challenges

##### Tiistola, Sini (2019)


Degree Programme in Automation Engineering

Faculty of Engineering and Natural Sciences

This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.

##### Approval date

2019-09-04

**The permanent address of the publication is**

http://urn.fi/URN:NBN:fi:tuni-201908162931

##### Abstract

Traditional feedback control methods are often model-based, and the mathematical system model must be identified before or during control. A reinforcement learning method called Q-learning can be used for model-free state-feedback control. In theory, Q-learning learns the optimal adaptive control online without a system model. This data-driven learning is based on the system's output or state measurements and its input control actions. Theoretical results are promising, but real-time applications remain rare, since real-time implementation is complicated by, e.g., hardware restrictions and stability issues.
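The model-free idea can be sketched in code. The following is a minimal batch Q-learning policy iteration for the linear-quadratic case: the quadratic Q-function kernel is fit by least squares from measured transitions only, then the greedy feedback gain is extracted from it. The double-integrator model, cost weights, and initial gain below are illustrative assumptions, not the thesis's QUBE-Servo 2 setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical marginally stable discrete-time double integrator (illustrative only)
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # system matrix (used only to generate data)
B = np.array([[0.0], [0.1]])             # input matrix
Qc = np.eye(2)                           # state cost weight
Rc = np.array([[1.0]])                   # input cost weight
n, m = 2, 1

def quad_basis(z):
    """Quadratic monomials z_i * z_j (i <= j) of the joint vector z = [x; u]."""
    return np.outer(z, z)[np.triu_indices(len(z))]

K = np.array([[1.0, 1.0]])   # assumed known stabilizing initial policy
for _ in range(10):          # Q-learning policy iteration
    Phi, y = [], []
    for _ in range(200):
        x = rng.normal(size=n)
        u = -K @ x + rng.normal(size=m)   # probing noise for persistent excitation
        x1 = A @ x + B @ u                # measured next state (model not used by learner)
        u1 = -K @ x1                      # greedy action at the next state
        z, z1 = np.concatenate([x, u]), np.concatenate([x1, u1])
        # Bellman equation Q(z) - Q(z1) = stage cost is linear in the parameters
        Phi.append(quad_basis(z) - quad_basis(z1))
        y.append(x @ Qc @ x + u @ Rc @ u)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    # Rebuild the symmetric kernel H of Q(x, u) = [x; u]^T H [x; u]
    H = np.zeros((n + m, n + m))
    H[np.triu_indices(n + m)] = theta
    H = (H + H.T) / 2                     # halves the doubled cross terms
    # Policy improvement: u = -H_uu^{-1} H_ux x
    K = np.linalg.solve(H[n:, n:], H[n:, :n])
```

Under these assumptions the iteration converges to the LQR gain even though the learner never touches `A` or `B` directly, which is the sense in which the control is "model-free": the model appears only as the data-generating plant.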

This research aimed to determine whether a set of existing Q-learning algorithms is capable of learning the optimal control in real-time applications. Both batch offline and adaptive online algorithms were chosen for this study. The selected Q-learning algorithms were implemented on a marginally stable linear system and an unstable nonlinear system using the Quanser QUBE™-Servo 2 experiment with inertia disk and inverted pendulum attachments. The policies learned on the real-time system were compared with the theoretical Q-learning results obtained with a simulated system model.

The results showed that the chosen algorithms solve the Linear Quadratic Regulator (LQR) problem with the theoretical linear system model. The algorithm chosen for the nonlinear system approximated the solution of the Hamilton-Jacobi-Bellman equation with the theoretical model when the inverted pendulum was balanced upright. The results also showed that a proper selection of control noise can mitigate some challenges caused by the real-time system, e.g. constrained input voltage and measurement disturbances such as small measurement noise and quantization due to measurement resolution. In the best, but rare, adaptive test cases, a near-optimal policy was learned online for the linear real-time system; however, learning was reliable only with some batch learning methods. Lastly, some improvements were proposed to make Q-learning more suitable for real-time applications.
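The real-time effects mentioned above can also be reproduced in simulation. A minimal sketch of the two disturbance models (the resolution and voltage-limit values are illustrative assumptions, not the QUBE-Servo 2 specifications):

```python
import numpy as np

def quantize(x, resolution):
    """Round a measurement to the sensor resolution (e.g. one encoder step)."""
    return np.round(x / resolution) * resolution

def saturate(u, u_max):
    """Clip the control action to the actuator's voltage limits."""
    return np.clip(u, -u_max, u_max)

# Illustrative values only: a 0.01 rad measurement step and a 10 V input limit
x_meas = quantize(np.array([0.1234, -0.0567]), 0.01)   # quantized state measurement
u_applied = saturate(np.array([12.5]), 10.0)           # constrained input voltage
```

Injecting these functions between the plant and the learner lets the theoretical algorithms be stress-tested against measurement resolution and input constraints before moving to hardware.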
