학술논문
A Stepwise Distillation Learning Strategy for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks
Document Type
Working Paper
Author
Source
Subject
Language
Abstract
Recently, Visual Programming (VProg) has emerged as a significant framework for visual reasoning (VR) tasks due to its interpretability and cross-task generality. However, even with invoking powerful pre-trained Vision-Language models (VLMs) as visual sub-modules, the performance of VProg on specific VR tasks is markedly inferior compared to well-trained task-specific networks. Although invoking task-specific models can further enhance the performance of VProg on specific VR tasks, it greatly diminishes the cross-task generalization ability of VProg. Besides, the non-differentiable nature of VProg prevents direct fine-tuning on specific VR tasks for further performance improvement. Attempt to address these issues, we propose SDVP, a Stepwise Distillation learning strategy for non-differentiable VPorg across various VR tasks. Specifically, our SDVP stepwise distills the capabilities of existing, well-trained small task-specific models for decomposed visual sub-tasks in VProg into the much larger VLMs invoked by corresponding visual sub-modules. Besides, distilling the knowledge of little-size task-specific models into pre-trained larger VLMs rather than replacing them helps keep the cross-task abilities of VProgs. Extensive and comprehensive experimental results on different VProg frameworks demonstrate that our SDVP obtains significant performance gains on specific VR benchmarks, i.e., GQA (+2.4\%) and NLVRv2 (+6.2\%) for VisProg and GQA (+6.5\%) and NLVRv2 (+4.0\%) for ViperGPT, and also maintains a promising performance for VProg on unseen and previous VR tasks.