Visual Explanation With Action Query Transformer in Deep Reinforcement Learning and Visual Feedback via Augmented Reality
- Authors
- Hidenori Itaya, Wantao Yin, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura
- Publication
- IEEE Access, 2025
Download: PDF (English)
Deep Reinforcement Learning (DRL) agents possess powerful control capabilities and have potential applications in robotics and other fields. However, the closed-box nature of DRL models still makes their decision-making processes difficult to interpret. One research area addressing this challenge is eXplainable Reinforcement Learning (XRL). Conventional visual explanation methods in XRL focus only on the rationale behind a single selected action and are insufficient for a more comprehensive analysis of an agent's decision making. In addition, visualizing attention only as an image is problematic in real-world settings such as robotics because the attention is not clearly mapped to physical space, which limits user understanding. To overcome these limitations, we propose the Action Query Transformer (AQT), an XRL method that uses a transformer encoder-decoder architecture with action information as queries. In this way, AQT computes an explicit attention map for each action the agent may select, yielding a DRL agent model that is easier to interpret. Furthermore, we introduce a visual feedback method that uses augmented reality (AR) to project these attention maps directly onto the physical environment. Through experiments on an Atari 2600 video game strategy task and a robot control task in an indoor environment, we demonstrate that AQT can analyze agent decision making in detail. Moreover, a user evaluation on the robot task confirms that the AR-based visual feedback effectively improves understanding of the agent's behavior.
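The abstract describes the core architectural idea: action information is fed to a transformer decoder as queries, so cross-attention over the encoded state yields one attention map per selectable action together with that action's value. The sketch below is an illustrative assumption of how such an action-query decoder could look, not the authors' actual implementation; names such as `ActionQueryDecoder`, `num_actions`, and `feat_dim` are hypothetical.

```python
"""Minimal sketch (assumed, not the paper's code) of per-action attention
via learned action queries in a transformer-style decoder."""
import torch
import torch.nn as nn


class ActionQueryDecoder(nn.Module):
    def __init__(self, num_actions: int, feat_dim: int = 128, num_heads: int = 4):
        super().__init__()
        # One learned query embedding per selectable action.
        self.action_queries = nn.Embedding(num_actions, feat_dim)
        # Cross-attention: action queries attend to encoded state tokens.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Per-action value head (e.g. a Q-value per action).
        self.q_head = nn.Linear(feat_dim, 1)

    def forward(self, enc_feats: torch.Tensor):
        """enc_feats: (B, N, feat_dim) encoded state tokens, e.g. image patches."""
        b = enc_feats.size(0)
        queries = self.action_queries.weight.unsqueeze(0).expand(b, -1, -1)  # (B, A, D)
        # attn: (B, A, N) -- one attention distribution per action, which can
        # be reshaped to the image grid and visualized as an explanation.
        out, attn = self.cross_attn(queries, enc_feats, enc_feats,
                                    need_weights=True,
                                    average_attn_weights=True)
        q_values = self.q_head(out).squeeze(-1)  # (B, A)
        return q_values, attn


# Usage: a 7x7 grid of encoded features from one frame, 6 discrete actions.
decoder = ActionQueryDecoder(num_actions=6)
feats = torch.randn(1, 49, 128)
q, attn = decoder(feats)
print(q.shape, attn.shape)  # torch.Size([1, 6]) torch.Size([1, 6, 49])
```

In this reading, the per-action attention maps are what the AR feedback would project into the physical environment, so a user can see where the agent is "looking" for each action it could take.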