Academic Paper

Evaluate The Image Captioning Technique Using State-of-the-art, Attention And Non-Attention Models To Generate Human Like Captions
Document Type
Conference
Source
2023 16th International Conference on Developments in eSystems Engineering (DeSE), pp. 48-53, Dec. 2023
Subject
Bioengineering
Computing and Processing
Engineering Profession
General Topics for Engineers
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Training
Deep learning
Analytical models
Benchmark testing
Feature extraction
Transformers
Generative adversarial networks
BERT
GAN
WGAN
COCO
Pycocotools
Abstract
By combining computer vision with Natural Language Processing (NLP), image captioning generates descriptive text for an image with the goal of emulating human description. This task has traditionally been addressed with deep learning models, specifically an encoder-decoder architecture in which the encoder analyzes the image and the decoder generates the caption. This study compares Transformer-based attention models with GAN-based networks. The benchmark MS COCO dataset is employed for training and evaluation; because the dataset is large, the "pycocotools" library is used to restrict the train, validation, and test splits to a fixed number of COCO images. The two models compared are one based on BERT and the other on WGAN. Both use ResNet to extract image features and word embeddings to extract caption features. The models were trained on two dataset sizes, one with 1k images and 5k captions and another with 3k images and 15k captions, to observe the impact of training-set size on the results. The models were compared on metrics such as accuracy, loss, and BLEU score. The experiments show that, on the basis of loss, accuracy, and BLEU score, the BERT-based attention model performs much better than the WGAN-based model: it reached an accuracy of up to 96.9% and the higher BLEU score, whereas the WGAN learned very little by the end of training. The captions generated by the BERT-based model were meaningful and comparable to the ground truth, while the WGAN-generated captions made little sense.
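The comparison described above relies on BLEU scoring of generated captions against reference captions. As a minimal sketch of that metric (not the paper's evaluation code, which uses pycocotools), the following pure-Python function computes single-reference BLEU with uniform n-gram weights; the example sentences and the add-one smoothing are illustrative assumptions:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Minimal single-reference BLEU with uniform weights.

    Clipped n-gram precisions are combined by a geometric mean and
    scaled by a brevity penalty, following the standard definition.
    """
    if not candidate:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Add-one smoothing so one empty n-gram order does not
        # zero out the whole score on short captions.
        log_prec_sum += math.log((clipped + 1) / (total + 1))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(
        1 - len(reference) / len(candidate))
    return bp * math.exp(log_prec_sum / max_n)

# Hypothetical generated caption vs. ground-truth caption:
cand = "a man riding a horse on the beach".split()
ref = "a man rides a horse along the beach".split()
print(round(bleu(cand, ref), 3))  # → 0.31
```

A perfect match scores 1.0, and scores fall toward 0 as n-gram overlap drops, which is why the partially overlapping captions above land in between.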