Academic Article

Structural and Pixel Relation Modeling for Semisupervised Instrument Segmentation From Surgical Videos
Document Type
Periodical
Source
IEEE Transactions on Instrumentation and Measurement, 73:1-14, 2024
Subject
Power, Energy and Industry Applications
Components, Circuits, Devices and Systems
Surgery
Instruments
Transformers
Videos
Semantics
Task analysis
Data models
Confusion region
contrastive learning
semisupervised learning
structural and pixel relation
surgical instrument segmentation
Language
English
ISSN
0018-9456
1557-9662
Abstract
Automatic instrument segmentation from surgical videos via deep learning has drawn increasing attention recently. However, interference such as blood or illumination changes induces confusion of targets, which can be further exacerbated by the lack of labeled data, making accurate segmentation of instruments very challenging. Previous methods pay little attention to analyzing confusion regions. In this article, we introduce a semisupervised framework to perform the instrument segmentation task with sparsely annotated surgical videos, where structural and pixel relations are modeled to address the confusion region issue in the surgical scene. For the structural relation, we propose a semantic regularization module to constrain the relative distance of structurewise features, via a spatial-temporal transformer (STTransformer) equipped with a 3-D relative distance regression (RDR) mechanism that can be trained in a self-supervised manner. For the pixel relation, we propose a confusion-aware contrastive learning strategy to build relations between nonconfusion and confusion regions, where the pixel embeddings from confusion regions are matched with high-quality labeled pixel embeddings stored in a classwise memory bank. We extensively validate our method on two public surgical video benchmarks, the EndoVis17 and CaDIS datasets, and experimental results demonstrate the effectiveness of our method in utilizing sparsely annotated frames.
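To make the pixel-relation idea concrete, the following is a minimal, dependency-free sketch of a classwise memory bank combined with an InfoNCE-style contrastive loss. It is not the paper's implementation: the class names (`ClasswiseMemoryBank`), the bank capacity, and the temperature value are illustrative assumptions; the paper's actual loss formulation and embedding pipeline may differ.

```python
import math
from collections import deque

class ClasswiseMemoryBank:
    """Stores a fixed number of high-quality labeled pixel embeddings per class.
    (Capacity and structure are assumptions for illustration.)"""
    def __init__(self, num_classes, capacity=128):
        self.banks = {c: deque(maxlen=capacity) for c in range(num_classes)}

    def push(self, cls, embedding):
        self.banks[cls].append(embedding)

    def sample(self, cls):
        return list(self.banks[cls])

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def info_nce(query, positives, negatives, tau=0.07):
    """InfoNCE-style loss: pull a confusion-region query embedding toward
    same-class bank embeddings, push it away from other-class embeddings."""
    pos = [math.exp(cosine(query, p) / tau) for p in positives]
    neg = [math.exp(cosine(query, n) / tau) for n in negatives]
    denom = sum(pos) + sum(neg)
    # average over positives, as in supervised contrastive formulations
    return -sum(math.log(p / denom) for p in pos) / len(pos)

# Hypothetical usage: a query aligned with its class prototype yields a
# small loss; a query aligned with another class yields a large one.
bank = ClasswiseMemoryBank(num_classes=2)
bank.push(0, [1.0, 0.0])
bank.push(1, [0.0, 1.0])
loss_good = info_nce([1.0, 0.0], bank.sample(0), bank.sample(1))
loss_bad = info_nce([0.0, 1.0], bank.sample(0), bank.sample(1))
```

Here `loss_good` is near zero while `loss_bad` is large, which is the intended gradient signal: ambiguous (confusion-region) pixels are matched against trusted labeled embeddings of their class.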