Journal Article

Light Field Salient Object Detection With Sparse Views via Complementary and Discriminative Interaction Network
Document Type
Periodical
Source
IEEE Transactions on Circuits and Systems for Video Technology, 34(2):1070-1085, Feb. 2024
Subject
Components, Circuits, Devices and Systems
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Light fields
Feature extraction
Arrays
Object detection
Cameras
Streaming media
Three-dimensional displays
Keywords
Light field
salient object detection
sparse views
complementary and discriminative interaction
Language
English
ISSN
1051-8215
1558-2205
Abstract
4D light field data record a scene from multiple views, thus implicitly providing beneficial depth cues for salient object detection in challenging scenes. Existing light field salient object detection (LF SOD) methods usually use a large number of views to improve detection accuracy. However, using so many views complicates practical application. Considering that adjacent views in a light field have very similar content, in this work we propose a more efficient pattern of input views, i.e., key sparse views, and design a network to effectively exploit the depth cue from sparse views for LF SOD. Specifically, we first apply a low rank-based statistical analysis to the existing LF SOD datasets, which allows us to derive a fixed yet universal pattern for our key sparse views, including the number and positions of views. These views retain sufficient depth cues while greatly reducing the number of views to be captured and processed, facilitating practical applications. We then propose an effective solution, named CDINet, built around a key Complementary and Discriminative Interaction Module (CDIM) for LF SOD from key sparse views. CDINet follows a two-stream structure that extracts the depth cue from the light field stream (i.e., the sparse views) and the appearance cue from the RGB stream (i.e., the center view), generating features and initial saliency maps for each stream. The CDIM is tailored for inter-stream interaction of both the features and the saliency maps, using the depth cue to complement salient regions missing from the RGB stream and to discriminate against background distraction, further enhancing the final saliency map. Extensive experiments on three LF multi-view datasets demonstrate that our CDINet not only outperforms state-of-the-art 2D methods but also achieves competitive performance compared with state-of-the-art 3D and 4D methods. The code and results of our method are available at https://github.com/GilbertRC/LFSOD-CDINet.
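For concreteness, the following is a minimal sketch (not the authors' code) of the kind of low rank-based redundancy analysis the abstract describes: each view is flattened into one row of a matrix, and the cumulative singular-value energy of that matrix shows how few views carry most of the scene content. The 9x9 grid of 81 views, the array shapes, and the 99% energy threshold are illustrative assumptions.

```python
# Sketch of a low-rank redundancy analysis over light-field views (assumed
# setup, not the paper's actual procedure): stack flattened views into a
# matrix and inspect its singular-value energy.
import numpy as np

def view_energy_profile(views: np.ndarray) -> np.ndarray:
    """views: (N, H, W) grayscale views. Returns the cumulative
    singular-value energy; fast saturation indicates that a few views
    already capture most of the information (strong redundancy)."""
    M = views.reshape(views.shape[0], -1).astype(np.float64)  # (N, H*W)
    s = np.linalg.svd(M, compute_uv=False)                    # singular values
    return np.cumsum(s**2) / np.sum(s**2)

# Example with 81 synthetic views, as from a hypothetical 9x9 camera grid.
rng = np.random.default_rng(0)
base = rng.random((64, 64))
views = np.stack([base + 0.01 * rng.random((64, 64)) for _ in range(81)])
energy = view_energy_profile(views)
k = int(np.searchsorted(energy, 0.99) + 1)  # views needed for 99% energy
print(f"{k} of {views.shape[0]} views capture 99% of the energy")
```

Because adjacent views differ only slightly, the energy profile saturates after very few components, which is the statistical motivation for selecting a small, fixed set of key sparse views.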
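Likewise, here is a minimal PyTorch sketch of a two-stream interaction in the spirit of the CDIM; it is not the paper's implementation, and the channel sizes, gating form, and module name are assumptions. The light-field (depth) saliency complements salient evidence missing from the RGB features, while the product of both initial saliency maps gates out regions that both streams regard as background.

```python
# Illustrative two-stream interaction block (hypothetical, not the paper's
# CDIM): complement RGB features with light-field evidence, then suppress
# background via joint saliency gating.
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(channels, 1, 1)  # refined saliency logits

    def forward(self, f_rgb, f_lf, s_rgb, s_lf):
        # Complement: inject salient regions seen only by the LF stream.
        comp = f_rgb + torch.sigmoid(s_lf) * f_lf
        # Discriminate: keep LF features only where both streams agree
        # on saliency, down-weighting background distraction.
        disc = f_lf * (torch.sigmoid(s_rgb) * torch.sigmoid(s_lf))
        return self.head(self.fuse(torch.cat([comp, disc], dim=1)))

# Shape check with dummy features and initial saliency maps.
f_rgb = torch.randn(1, 64, 88, 88); f_lf = torch.randn(1, 64, 88, 88)
s_rgb = torch.randn(1, 1, 88, 88);  s_lf = torch.randn(1, 1, 88, 88)
print(InteractionBlock()(f_rgb, f_lf, s_rgb, s_lf).shape)  # [1, 1, 88, 88]
```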