Academic Paper

Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

Document Type

Working Paper

Author

Ohi, Masanari; Kaneko, Masahiro; Okazaki, Naoaki; Inoue, Nakamasa

Source

Subject

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Computer Vision and Pattern Recognition

Language

Abstract

Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning. While the overall evaluation is essential for any task, the criteria prioritized can differ depending on the task, making it challenging for current metrics to adapt to multi-task scenarios. To address this limitation, we propose HarmonicEval, a reference-free comprehensive evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) dataset, which comprises 18,000 expert human judgments across four multi-modal tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion.

Online Access

Open Access (arXiv)
