Academic Paper

CoCV: Heterogeneous Processors Collaboration Mechanism for End-to-End Execution of Intelligent Computer Vision Tasks on Mobile Devices
Document Type
Conference
Source
2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), pp. 2507-2514, Dec. 2023
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Computer vision
Scheduling algorithms
Graphics processing units
Prototypes
Object detection
Parallel processing
Mobile handsets
CPU&GPU Co-execution
Mobile devices
Heterogeneous Computing
Language
English
ISSN
2690-5965
Abstract
Object detection, image classification, and various other computer vision tasks have become prevalent on mobile devices. These tasks are typically executed in three stages: pre-processing, inference, and post-processing. The mobile SoC, serving as the computing unit on mobile devices, typically consists of heterogeneous processors such as the CPU, GPU, and NPU. However, during the execution of a computer vision task, currently available frameworks achieve CPU-GPU parallelism only in the inference stage; during pre- and post-processing, only the CPU is used, leaving the GPU and NPU on the SoC idle. For mobile applications that require low latency, the pre- and post-processing stages often account for more than 50% of the total latency, becoming a performance bottleneck for the entire task. To reduce latency, it is imperative to fully utilize the idle heterogeneous processors (GPU, NPU) on the SoC and achieve heterogeneous-processor parallelism across all three stages of execution. In this paper, we propose CoCV, a heterogeneous-processor parallel computing system for computer vision tasks on mobile devices. In CoCV, we are the first to build an image-processing operator library for heterogeneous parallel computing on mobile devices. In addition, we design a task allocation scheduling algorithm that guides the partitioning of processing tasks during execution, ensuring a relatively balanced workload across the different processors. We also propose a cross-stage operator chaining technique to reduce the data-sharing overhead among processors during execution. We build a prototype system and evaluate it with different computer vision tasks. The results show up to a 33% latency reduction for end-to-end tasks and a 2.32× speedup compared with the current best solution.
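The core idea the abstract describes, splitting pre- and post-processing work between otherwise idle heterogeneous processors in proportion to their capability, can be illustrated with a minimal sketch. This is not CoCV's actual scheduling algorithm (the paper's operator library and chaining technique are not reproduced here); the `partition`, `preprocess`, and `co_execute` names and the throughput-proportional split are illustrative assumptions only, with two Python threads standing in for the CPU and GPU.

```python
# Illustrative sketch only: balance a batch of pre-processing work across
# two processors in proportion to their measured throughput, so both
# finish at roughly the same time (the "balanced workload" goal in CoCV).
from concurrent.futures import ThreadPoolExecutor

def partition(n_items, tput_a, tput_b):
    """Give each processor a share proportional to its throughput."""
    share_a = round(n_items * tput_a / (tput_a + tput_b))
    return share_a, n_items - share_a

def preprocess(item):
    # Stand-in for real per-image work such as resize/normalize;
    # on a real SoC this would run as a CPU or GPU kernel.
    return item * 2

def co_execute(items, tput_cpu, tput_gpu):
    # Split the batch, run both halves concurrently, merge the results.
    k, _ = partition(len(items), tput_cpu, tput_gpu)
    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_part = pool.submit(lambda: [preprocess(x) for x in items[:k]])
        gpu_part = pool.submit(lambda: [preprocess(x) for x in items[k:]])
        return cpu_part.result() + gpu_part.result()
```

For example, if the GPU processes images three times faster than the CPU, `partition(100, 1, 3)` assigns 25 items to the CPU and 75 to the GPU, so neither processor sits idle waiting for the other.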