Dept. of Robotics Science and Technology,
Chubu University

Embedding Human Knowledge into Spatio-Temporal Attention Branch Network in Video Recognition via Temporal Attention

Author
Saki Noguchi, Yuzhi Shi, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi
Publication
The 34th British Machine Vision Conference, 2023

Download: PDF (English)

When recognizing objects or motions, humans can accurately judge the regions necessary for recognition. In contrast, a deep learning model bases its recognition on its training data and may fail to focus on the correct regions. In image recognition, visualizing the basis of decisions and embedding human knowledge into deep neural networks have been shown to be effective in addressing this issue. However, in video recognition, no visualization method enables us to embed human knowledge. We propose the spatio-temporal attention branch network (ST-ABN) for video recognition, which provides visual explanations for both spatial and temporal attention. A key feature of the ST-ABN is that its attention output can be modified on the basis of human knowledge and used for recognition. However, since a video consists of a large number of frame images, modifying spatial attention as in image recognition is costly. Therefore, we manually modify temporal attention to embed human knowledge into the ST-ABN. Experimental results on Something-Something V2 indicate that the ST-ABN provides visual explanations for both spatial and temporal information and improves recognition performance. The results also indicate the effectiveness of embedding human knowledge into the ST-ABN and the positive changes in spatial attention caused by modifying temporal attention.
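The idea of modifying temporal attention with human knowledge can be sketched in a few lines: per-frame attention scores are normalized into weights, a human-supplied mask suppresses frames judged irrelevant, and the weights are renormalized before pooling frame features. This is a minimal illustrative sketch, not the paper's actual architecture; `temporal_attention_pool` and `manual_mask` are hypothetical names, and the softmax-plus-mask scheme is an assumption about how such a modification could work.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention_pool(frame_feats, scores, manual_mask=None):
    """Aggregate per-frame features with temporal attention weights.

    frame_feats: (T, D) array of per-frame feature vectors.
    scores:      (T,) raw temporal attention scores.
    manual_mask: optional (T,) array of 0/1 values encoding human
        knowledge; 0 suppresses a frame judged irrelevant. This is a
        hypothetical stand-in for manually modified temporal attention.
    Returns the attention weights and the attention-pooled feature.
    """
    att = softmax(scores)
    if manual_mask is not None:
        att = att * manual_mask
        att = att / att.sum()  # renormalize over the kept frames
    pooled = (att[:, None] * frame_feats).sum(axis=0)
    return att, pooled
```

With a mask, the weights of suppressed frames drop to zero and the remaining weights still sum to one, so the pooled feature is computed only from the frames a human marked as relevant.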
