Academic Paper

FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction

Document Type

Working Paper

Author

Lee, Chen-Yu; Li, Chun-Liang; Zhang, Hao; Dozat, Timothy; Perot, Vincent; Su, Guolong; Zhang, Xiang; Sohn, Kihyuk; Glushnev, Nikolai; Wang, Renshen; Ainslie, Joshua; Long, Shangbang; Qin, Siyang; Fujii, Yasuhisa; Hua, Nan; Pfister, Tomas

Subject

Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Machine Learning

Abstract

The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend masked language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on the FUNSD, CORD, SROIE, and Payment benchmarks with a more compact model size.
Comment: Accepted to ACL 2023
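
Illustrative note (not part of the original record): the two mechanisms named in the abstract can be sketched in a few lines. The sketch below assumes that "the bounding box that joins a pair of tokens" means the smallest axis-aligned box covering both token boxes, and it uses a generic NT-Xent-style contrastive loss as a stand-in for the graph contrastive objective. The function names `union_box`, `cosine`, and `nt_xent`, the `(x0, y0, x1, y1)` box layout, and the temperature value are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def union_box(box_a, box_b):
    """Smallest axis-aligned box (x0, y0, x1, y1) covering both token boxes.

    Assumption: this is one plausible reading of "the bounding box that
    joins a pair of tokens connected by a graph edge".
    """
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.1):
    """Generic NT-Xent-style loss over two views of the same graph nodes.

    view_a[i] and view_b[i] are treated as a positive pair; every other
    representation in the batch acts as a negative. Assumption: this is a
    stand-in objective, not necessarily the exact loss used in the paper.
    """
    n = len(view_a)
    reps = view_a + view_b
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the matching node in the other view
        logits = [
            cosine(reps[i], reps[j]) / temperature
            for j in range(2 * n)
            if j != i
        ]
        pos_logit = cosine(reps[i], reps[pos]) / temperature
        total += math.log(sum(math.exp(x) for x in logits)) - pos_logit
    return total / (2 * n)


# Two neighboring tokens on a form line and the region joining them.
edge_region = union_box((10, 20, 50, 30), (60, 18, 100, 32))

# Matching views agree, so their loss is lower than for mismatched views.
nodes = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = nt_xent(nodes, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = nt_xent(nodes, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this shape pulls the two views of each node together and pushes apart the representations of different nodes, which is the sense in which a contrastive objective "maximizes agreement".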

Online Access

Open Access (arXiv)
