Academic Journal Article

Self-Supervised Hypergraph Learning for Enhanced Multimodal Representation
Document Type
Periodical
Source
IEEE Access, 12:20830-20839, 2024
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Encoding
Road transportation
Web sites
Visualization
Video on demand
Social networking (online)
Robustness
Self-supervised learning
Neural networks
Multisensory integration
Multimodal
micro-video
self-supervised learning
hypergraph neural networks
Language
English
ISSN
2169-3536
Abstract
Hypergraph neural networks have gained substantial popularity for capturing complex correlations among data items in multimodal datasets. In this study, we propose a novel self-supervised hypergraph learning (SHL) framework that extracts hypergraph features to improve multimodal representation. Our method adopts a dual embedding strategy to improve the accuracy and robustness of the model. Specifically, we employ a hypergraph learning framework to extract global context by capturing rich inter-modal dependencies, and we introduce a self-supervised learning (SSL) component that exploits the interaction graph data, further strengthening the model's robustness. By jointly optimizing hypergraph feature extraction and SSL, SHL significantly improves performance on multimodal representation tasks. To validate the effectiveness of our approach, we construct two comprehensive multimodal micro-video recommendation datasets from publicly available data (TikTok and MovieLens-10M). During dataset construction, we carefully remove invalid entries and outliers and complete missing modality information using external auxiliary sources such as YouTube. These datasets are made publicly available to the research community for evaluation purposes. Experimental results on both recommendation datasets demonstrate that the proposed SHL approach outperforms state-of-the-art baselines, confirming its effectiveness for multimodal representation.
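The abstract names the two building blocks of SHL (hypergraph feature extraction and an SSL objective) without giving equations. The sketch below is an illustration only, showing commonly used formulations of those blocks: a hypergraph convolution in the standard HGNN form, X' = sigma(Dv^(-1/2) H W De^(-1) H^T Dv^(-1/2) X Theta), and an InfoNCE-style contrastive loss typical of self-supervised graph learning. All function names, shapes, and the choice of these particular formulations are assumptions, not the paper's exact design.

import numpy as np

def hypergraph_conv(X, H, Theta, edge_weights=None):
    """One hypergraph convolution step in the standard HGNN form
    (an assumed formulation, not necessarily the one used by SHL):
        X' = ReLU(Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta)
    X:     (n_nodes, in_dim) node feature matrix
    H:     (n_nodes, n_edges) incidence matrix, H[v, e] = 1 iff node v is in hyperedge e
    Theta: (in_dim, out_dim) learnable projection
    """
    n_nodes, n_edges = H.shape
    w = np.ones(n_edges) if edge_weights is None else edge_weights
    Dv = (H * w).sum(axis=1)                   # weighted node degrees
    De = H.sum(axis=0)                         # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(Dv, 1e-12)))
    De_inv = np.diag(1.0 / np.maximum(De, 1e-12))
    A = Dv_inv_sqrt @ H @ np.diag(w) @ De_inv @ H.T @ Dv_inv_sqrt
    return np.maximum(A @ X @ Theta, 0.0)      # ReLU non-linearity

def infonce_loss(z1, z2, tau=0.2):
    """InfoNCE contrastive loss between two embedding views of the same
    nodes: row i of z1 and row i of z2 form the positive pair, all other
    rows serve as negatives (a common SSL objective, assumed here)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: 5 nodes, 3 hyperedges, 8-dim features projected to 4 dims.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
H = (rng.random((5, 3)) < 0.5).astype(float)
Theta = rng.normal(size=(8, 4))
Z1 = hypergraph_conv(X, H, Theta)              # hypergraph view
Z2 = Z1 + 0.1 * rng.normal(size=Z1.shape)      # perturbed view for SSL
print(infonce_loss(Z1, Z2))

In a joint-training setting of the kind the abstract describes, a recommendation loss on the hypergraph embeddings and the contrastive loss above would be summed with a weighting coefficient; the specific views, augmentations, and weighting used by SHL are not stated in this record.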