Abstract:
This thesis investigates offline quantum reinforcement learning (QRL) with variational quantum circuits (VQCs) and metaheuristic optimization. Offline reinforcement learning (RL) provides a realistic training paradigm in which agents learn entirely from fixed datasets instead of online interaction, making it particularly suited for reproducible studies and controlled comparisons. For offline training, we created a dataset for the CartPole-v1 environment by combining random, medium, and expert policies, resulting in 525,000 transitions with diverse state–action coverage. On this dataset, we trained a DQN agent with a 4-qubit, two-layer VQC and evaluated the effectiveness of four gradient-free metaheuristic optimizers: Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Simulated Annealing (SA), and Tabu Search (TS). Their performance was compared to a gradient descent (GD) baseline using the Adam optimizer. Each optimizer underwent per-factor hyperparameter tuning, followed by a comparison of all optimizers on a single pass over the dataset.
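To make the setup concrete, the sketch below illustrates the kind of pipeline described above: a 4-qubit, two-layer VQC acting as a Q-function for CartPole's two actions, with its parameters tuned by a minimal simulated-annealing loop over a batch of offline transitions. This is a hedged illustration, not the thesis's actual implementation: the choice of PennyLane, the state normalization, the function names (vqc_q_values, offline_td_loss, simulated_annealing), and all hyperparameters are assumptions made for this example.

import pennylane as qml
import numpy as np

N_QUBITS, N_LAYERS = 4, 2  # matches the 4-qubit, two-layer VQC in the thesis

dev = qml.device("default.qubit", wires=N_QUBITS)

@qml.qnode(dev)
def vqc_q_values(state, weights):
    # Encode the 4-dimensional CartPole state as rotation angles
    # (in practice the state would be normalized to a bounded range).
    qml.AngleEmbedding(state, wires=range(N_QUBITS))
    # Two trainable entangling layers; weights shape: (N_LAYERS, N_QUBITS, 3).
    qml.StronglyEntanglingLayers(weights, wires=range(N_QUBITS))
    # Two expectation values serve as Q-values for CartPole's two actions.
    return [qml.expval(qml.PauliZ(w)) for w in range(2)]

def offline_td_loss(weights, batch, gamma=0.99):
    # Mean squared TD error over a batch of fixed (offline) transitions.
    loss = 0.0
    for s, a, r, s_next, done in batch:
        q = vqc_q_values(s, weights)[a]
        target = r + (0.0 if done else gamma * max(vqc_q_values(s_next, weights)))
        loss += (q - target) ** 2
    return loss / len(batch)

def simulated_annealing(batch, steps=200, t0=1.0, cooling=0.98, sigma=0.1):
    # Gradient-free parameter search: accept worse candidates with
    # probability exp(-delta / T), cooling the temperature each step.
    shape = qml.StronglyEntanglingLayers.shape(n_layers=N_LAYERS, n_wires=N_QUBITS)
    current = np.random.uniform(0, 2 * np.pi, size=shape)
    current_loss = offline_td_loss(current, batch)
    best, best_loss, t = current, current_loss, t0
    for _ in range(steps):
        candidate = current + np.random.normal(0, sigma, size=shape)
        cand_loss = offline_td_loss(candidate, batch)
        if cand_loss < current_loss or np.random.rand() < np.exp(-(cand_loss - current_loss) / t):
            current, current_loss = candidate, cand_loss
            if cand_loss < best_loss:
                best, best_loss = candidate, cand_loss
        t *= cooling
    return best, best_loss

The same loss function can be plugged into GA, PSO, or TS; only the proposal-and-acceptance logic changes, which is what makes the single-pass optimizer comparison well controlled.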
Results show that all metaheuristics substantially outperform our GD baseline, with SA achieving the highest final performance, followed by TS, GA, and PSO. These findings demonstrate that gradient-free optimization offers clear advantages over gradient descent for VQCs, especially when learning from offline datasets, where optimization must proceed under limited data access and without environment interaction. By decoupling training from online interaction, the offline setting enables a rigorous comparison of optimizers and provides a practical path toward scaling QRL experiments under realistic resource constraints. This is particularly important in domains where online interactions are costly or safety-critical. Therefore, this study establishes offline QRL with metaheuristic optimization strategies as a promising research direction, while also highlighting limitations such as distribution shift and restricted convergence when training on a single dataset pass.
Author:
Frederik Bickel
Advisors:
Michael Kölle, Julian Hager, Claudia Linnhoff-Popien
Student Thesis | Published September 2025 | Copyright © QAR-Lab
Please direct inquiries about this work to the advisors.