Academic Paper

A Hybrid Transfer Learning Architecture Based Image Captioning Model for Assisting Visually Impaired
Document Type
Conference
Source
2023 IEEE 3rd Applied Signal Processing Conference (ASPCON), pp. 211-215, Nov. 2023
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
General Topics for Engineers
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Visualization
Transfer learning
Semantics
Signal processing
Feature extraction
Decoding
Task analysis
Deep Learning
image captioning
natural language processing
transfer learning
visually impaired
Language
English
Abstract
The task of automatically presenting the subject matter of a picture without human involvement is highly intricate. The issue is commonly addressed through computer vision and natural language processing techniques. Firstly, the image content must be analyzed and interpreted. Subsequently, this understanding needs to be transformed into grammatically accurate and contextually meaningful sentences. An issue that arises when employing deep learning (DL) based models is an insufficient understanding of semantic context, along with the encoder's inability to transmit essential visual information to the decoding unit efficiently. To mitigate this issue, a hybrid transfer learning scheme has been proposed that is capable of generating image captions with a high degree of accuracy and a minimal occurrence of errors. In this article, an Inception module has been fused with a residual network to extract significant features from the images for encoding. A gated recurrent unit has been used as the decoder, producing appropriate captions for the images. The novelty lies in the inclusion of an attention layer, situated between the encoder and decoder components, which enables the model to acquire a linguistically relevant and syntactically coherent representation. Experimental study reveals the effectiveness of the proposed hybrid DL framework over existing DL methods. The proposed image captioning model can further be implemented for assisting the visually impaired.
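As an illustration only (this record does not include the authors' code), the following is a minimal Keras sketch of the kind of architecture the abstract describes: a pretrained Inception-ResNet image encoder, an additive attention interface between encoder and decoder, and a GRU decoder. All layer sizes, the vocabulary size, and the caption length are assumed values, not taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Encoder: pretrained Inception-ResNet used as a frozen feature extractor
# (the transfer-learning component). include_top=False keeps the 8x8x1536
# spatial feature grid so attention can weight image regions.
cnn = tf.keras.applications.InceptionResNetV2(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
cnn.trainable = False

image_in = layers.Input(shape=(299, 299, 3))
grid = cnn(image_in)                           # shape: (8, 8, 1536)
regions = layers.Reshape((64, 1536))(grid)     # 64 region vectors
regions = layers.Dense(256, activation="relu")(regions)

# Decoder: embed previous caption tokens and run a GRU.
VOCAB, MAXLEN = 5000, 30                       # assumed vocabulary / caption length
caption_in = layers.Input(shape=(MAXLEN,))
emb = layers.Embedding(VOCAB, 256, mask_zero=True)(caption_in)
hidden = layers.GRU(256, return_sequences=True)(emb)

# Attention interface between encoder and decoder: each decoder step
# attends over the 64 image regions (Bahdanau-style additive attention).
context = layers.AdditiveAttention()([hidden, regions])
merged = layers.Concatenate()([hidden, context])
logits = layers.TimeDistributed(layers.Dense(VOCAB))(merged)

model = tf.keras.Model([image_in, caption_in], logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```

In an actual training setup the decoder input would be the target caption shifted by one token (teacher forcing), and inference would generate captions token by token, feeding each prediction back into the GRU.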