Enhancing Navigation Text Generation and Visual Explanation Using Spatio-Temporal Scene Graphs with Graph Attention Networks
- Author: Hayato Suzuki, Kota Shimomura, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi
- Publication: IEEE International Conference on Intelligent Transportation Systems (ITSC), 2025
Download: PDF (English)
Navigation systems are widely used in modern vehicles. However, conventional approaches that rely on static map information struggle to adapt to dynamic changes in the surrounding environment. To address this limitation, human-like guidance, which leverages image recognition to interpret driving scenes and generate natural-language navigation, has attracted increasing attention. Scene graphs, which structurally represent relationships among objects, have proven effective for this task. However, existing methods often rely on high-dimensional visual features, which limits interpretability and scalability. In this study, we propose a novel approach that generates navigation text by constructing a spatio-temporal scene graph using only object positions and class labels as node information, yielding a more compact and interpretable graph representation. The proposed system generates natural-language navigation text with a graph-to-text model based on Graph Attention Networks (GAT). Furthermore, we incorporate vehicle motion information at intersections into the graph and introduce mechanisms that strengthen attention to important nodes, enabling visual interpretation of the model's decision-making process through attention visualization. Experimental results show that the proposed method outperforms existing Convolutional Neural Network (CNN)- and Transformer-based approaches, particularly in integrating long-term temporal information for text generation.
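To make the two core ideas in the abstract concrete, the sketch below shows (1) a scene-graph node embedded from only its class label and position, with no visual features, and (2) a single graph-attention layer in the style of GAT that exposes its attention coefficients for visualization. This is a minimal illustration, not the authors' implementation; all names (`NodeEncoder`, `GATLayer`, `NUM_CLASSES`, `POS_DIM`) and the choice of a bounding-box position encoding are assumptions for the example.

```python
# Minimal sketch (NOT the paper's code) of position+class node embedding
# and a single-head GAT layer whose attention weights can be visualized.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 20  # assumed number of object classes (e.g., car, sign, lane)
POS_DIM = 4       # assumed position encoding, e.g., a (x, y, w, h) box


class NodeEncoder(nn.Module):
    """Embed a node from its class label and position only (no visual features)."""
    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.class_emb = nn.Embedding(NUM_CLASSES, hidden_dim)
        self.pos_proj = nn.Linear(POS_DIM, hidden_dim)

    def forward(self, class_ids: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        return self.class_emb(class_ids) + self.pos_proj(positions)


class GATLayer(nn.Module):
    """Single-head graph attention (Velickovic et al., 2018), returning the
    attention matrix so that important nodes can be inspected visually."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor):
        # h: (N, dim) node features; adj: (N, N) 0/1 adjacency with self-loops.
        z = self.W(h)
        N = z.size(0)
        pairs = torch.cat(
            [z.unsqueeze(1).expand(N, N, -1),            # z_i along rows
             z.unsqueeze(0).expand(N, N, -1)], dim=-1)   # z_j along columns
        e = F.leaky_relu(self.a(pairs).squeeze(-1))      # (N, N) raw scores
        e = e.masked_fill(adj == 0, float("-inf"))       # attend only along edges
        alpha = torch.softmax(e, dim=-1)                 # coefficients to visualize
        return alpha @ z, alpha


# Toy usage: three objects at one time step, fully connected with self-loops.
enc, gat = NodeEncoder(), GATLayer()
class_ids = torch.tensor([0, 3, 7])                      # e.g., ego, car, signal
positions = torch.rand(3, POS_DIM)
h_out, alpha = gat(enc(class_ids, positions), torch.ones(3, 3))
print(alpha)  # each row sums to 1: one node's attention over its neighbors
```

In this style of layer, the `alpha` matrix is exactly what an attention visualization would render over the scene graph; how the paper stacks such layers over time steps and decodes them into navigation text is described in the full PDF.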