Academic Paper
A Multi-Modal ELMo Model for Image Sentiment Recognition of Consumer Data
Document Type
Periodical
Author
Source
IEEE Transactions on Consumer Electronics (IEEE Trans. Consumer Electron.). 70(1):3697-3708, Feb. 2024
Subject
Language
ISSN
0098-3063
1558-4127
Abstract
Recent advances in consumer electronics and imaging technology have generated abundant multi-modal data for consumer-centric AI applications. Effective analysis and use of such heterogeneous data hold great potential for informing consumption decisions, which makes the analysis of multi-modal consumer-generated content a prominent research topic in customer-centric artificial intelligence (AI). Two key challenges in this task are multi-modal representation and multi-modal fusion. To address them, we propose a decision-making model enhanced by a multi-modal embedding from the language model (MELMo). The main idea is to extend ELMo to a multi-modal scenario by designing a deep contextualized visual embedding from the language model (VELMo) and by modeling multi-modal fusion at the decision level with a cross-modal attention mechanism. We also design a novel multi-task decoder that learns shared knowledge from related tasks. We evaluate our approach on two benchmark datasets, CMU-MOSI and CMU-MOSEI, and show that MELMo outperforms state-of-the-art approaches. The F1 scores on CMU-MOSI and CMU-MOSEI reach 86.1% and 85.2%, respectively, improvements of approximately 1.0% and 1.3% over the state-of-the-art system, providing an effective technique for multi-modal consumer analytics in electronics and beyond.
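The abstract names cross-modal attention as the fusion mechanism but gives no formulation. As a rough illustration only, the sketch below shows generic scaled dot-product attention in which features of one modality (the queries) attend over features of another (the keys and values). It is not the MELMo architecture: the function names, the toy feature vectors, and the absence of learned projection matrices are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Generic scaled dot-product attention across two modalities.

    queries: feature vectors from modality A (e.g. text), each of size d
    keys, values: feature vectors from modality B (e.g. visual)
    Returns one attended vector per query: a weighted average of `values`.
    Hypothetical helper for illustration; no learned projections are used.
    """
    d = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        attended.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return attended


# Toy features (made up): two text positions, three visual positions.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
visual_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

text_attending_to_visual = cross_modal_attention(
    text_feats, visual_feats, visual_feats
)
```

Each output row is a convex combination of the visual vectors, so the text representation at that position is enriched with the visual content most similar to it. A decision-level variant, as the abstract describes, would presumably apply such attention to per-modality predictions or late-stage representations; that detail is not specified in the record.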