Machine Perception and Robotics Group
Chubu University

Deep Learning: International Conference

Multi-scale Cell-based Layout Representation for Document Understanding

Author
Yuzhi Shi, Mijung Kim, Yeongnam Chae
Publication
IEEE/CVF Winter Conference on Applications of Computer Vision, 2023

Download: PDF (English)

Deep learning techniques have achieved remarkable progress in document understanding. Most models use coordinates to represent the absolute or relative spatial information of components, but coordinates struggle to capture the latent rules of a document layout, which makes layout representations harder to learn. Unlike previous research, which has employed a coordinate system, graph, or grid to represent the document layout, we propose a novel layout representation, the cell-based layout, to provide easy-to-understand spatial information for backbone models. In line with human reading habits, it uses cell information, i.e. row and column indices, to represent the position of components in a document, which makes the document layout easier to understand. Furthermore, we propose the multi-scale layout to represent the hierarchical structure of the layout, and develop a data augmentation method to improve performance. Experimental results show that our method achieves state-of-the-art performance in text-based tasks, including form understanding and receipt understanding, and improves performance in image-based tasks such as document image classification. The code has been released in the repository.
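
To make the idea of row and column indices concrete, here is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that the page is divided into a uniform grid and that each component is assigned to the cell containing the center of its bounding box; the abstract does not say how cells are actually derived. The names `cell_index` and `multi_scale_cells`, and the list of grid sizes standing in for "scales", are hypothetical.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in page coordinates; this box format is an assumption.
Box = Tuple[float, float, float, float]


def cell_index(box: Box, page_w: float, page_h: float,
               rows: int, cols: int) -> Tuple[int, int]:
    """Return the (row, col) of the uniform grid cell holding the box center.

    Assumption: a uniform rows x cols grid over the page. The paper may
    derive cells differently.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    # Scale the center into grid units, then clamp so that points on the
    # right or bottom page edge still fall inside the last cell.
    col = max(0, min(int(cx / page_w * cols), cols - 1))
    row = max(0, min(int(cy / page_h * rows), rows - 1))
    return row, col


def multi_scale_cells(box: Box, page_w: float, page_h: float,
                      scales: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Compute one (row, col) pair per grid size, from coarse to fine.

    Assumption: "multi-scale" is modeled here as several grid resolutions.
    """
    return [cell_index(box, page_w, page_h, r, c) for r, c in scales]


if __name__ == "__main__":
    scales = [(2, 2), (4, 4), (8, 8)]
    # A component near the bottom-right of a 1000 x 1000 page.
    print(multi_scale_cells((600, 700, 1000, 1000), 1000, 1000, scales))
```

Under these assumptions, a backbone model would receive small integer indices at each scale instead of raw coordinates, with the coarse grids describing the region of the page and the fine grids describing the position within it.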
