Journal Article

ModeRNN: Harnessing Spatiotemporal Mode Collapse in Unsupervised Predictive Learning
Document Type
Periodical
Source
IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):13281-13296, Nov. 2023
Subject
Computing and Processing
Bioengineering
Spatiotemporal phenomena
Predictive models
Visualization
Training
Data models
Probabilistic logic
Bars
Predictive learning
mode collapse
spatiotemporal modeling
recurrent neural networks
Language
English
ISSN
0162-8828 (print)
2160-9292 (CD-ROM)
1939-3539 (electronic)
Abstract
Learning predictive models for unlabeled spatiotemporal data is challenging in part because visual dynamics can be highly entangled, especially in real scenes. In this paper, we refer to the multi-modal output distribution of predictive learning as spatiotemporal modes. We identify an experimental phenomenon, named spatiotemporal mode collapse (STMC), in most existing video prediction models: features collapse into invalid representation subspaces due to an ambiguous understanding of mixed physical processes. We propose to quantify STMC and explore its solution, for the first time, in the context of unsupervised predictive learning. To this end, we present ModeRNN, a decoupling-aggregation framework with a strong inductive bias toward discovering the compositional structures of spatiotemporal modes between recurrent states. We first leverage a set of dynamic slots with independent parameters to extract the individual building components of spatiotemporal modes. We then perform a weighted fusion of the slot features, adaptively aggregating them into a unified hidden representation for recurrent updates. Through a series of experiments, we show a high correlation between STMC and fuzzy predictions of future video frames. Moreover, ModeRNN better mitigates STMC and achieves state-of-the-art results on five video prediction datasets.
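To make the decoupling-aggregation idea concrete, below is a minimal sketch of a slot-based recurrent cell in the spirit the abstract describes: per-slot projections with independent parameters extract separate components of the dynamics, and a learned softmax weighting fuses them back into one hidden state for the recurrent update. The class name, dimensions, and the specific fusion mechanism are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of a decoupling-aggregation recurrent cell, loosely following the
# description in the abstract. All names and design details here are
# illustrative assumptions, not the ModeRNN reference code.
import torch
import torch.nn as nn


class SlotRecurrentCell(nn.Module):
    """Decouple the state into per-slot features (independent parameters
    per slot), then adaptively re-aggregate them with learned weights."""

    def __init__(self, input_dim: int, hidden_dim: int, num_slots: int = 4):
        super().__init__()
        # Decoupling: one projection per slot, each with its own parameters.
        self.slot_proj = nn.ModuleList(
            nn.Linear(input_dim + hidden_dim, hidden_dim)
            for _ in range(num_slots)
        )
        # Aggregation: score each slot feature, fuse with softmax weights.
        self.slot_score = nn.Linear(hidden_dim, 1)
        self.update = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        z = torch.cat([x, h], dim=-1)                # (B, input+hidden)
        # Each slot extracts one "building component" of the dynamics.
        slots = torch.stack(
            [torch.tanh(p(z)) for p in self.slot_proj], dim=1
        )                                            # (B, S, hidden)
        weights = torch.softmax(self.slot_score(slots), dim=1)  # (B, S, 1)
        fused = (weights * slots).sum(dim=1)         # weighted fusion
        return torch.tanh(self.update(fused))        # next hidden state


# Usage: roll the cell over a short feature sequence.
cell = SlotRecurrentCell(input_dim=64, hidden_dim=128)
h = torch.zeros(8, 128)
for t in range(10):
    h = cell(torch.randn(8, 64), h)
```

The per-slot parameters keep distinct dynamic modes from being forced through one shared transformation (the failure mode STMC describes), while the softmax fusion lets the cell recompose whichever mixture of components the current input calls for.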