Abstract:
Forecasting the future state of a scene is a key computer vision task for building systems capable of proactive perception and decision-making in changing environments. This work addresses the problem of forecasting future scene graphs: given a video and a sequence of past scene graphs, a model must predict the objects and their relations in subsequent frames. Unlike existing approaches, which are limited to static perception, the proposed method, GraphCast, exploits semantic vision-language features of objects together with their temporal dynamics. We introduce a model architecture that combines object-centric encoding with a transformer foundation model, interaction modeling via a biaffine relation classification head, and a dedicated object presence classifier. In addition, a temporal convolution module extracts temporal features and improves robustness to noise. Experiments on the STAR and Action Genome datasets demonstrate that the proposed architecture outperforms existing baselines.
Keywords: scene graph forecasting, video understanding, spatio-temporal reasoning, neural networks.
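
To make the biaffine relation classification head mentioned in the abstract concrete, the following is a minimal sketch that scores every ordered pair of object embeddings over a set of relation types. The class name, dimensions, and initialization here are illustrative assumptions, not the paper's implementation.

    # Minimal sketch of a biaffine relation scorer; all names and
    # dimensions are hypothetical, not GraphCast's actual code.
    import torch
    import torch.nn as nn

    class BiaffineRelationHead(nn.Module):
        """Scores every (subject, object) pair of node embeddings over R relations."""

        def __init__(self, dim: int, num_relations: int):
            super().__init__()
            # One bilinear form per relation type; the appended bias
            # dimension lets the form also capture unary (per-node) evidence.
            self.bilinear = nn.Parameter(torch.empty(num_relations, dim + 1, dim + 1))
            nn.init.xavier_uniform_(self.bilinear)

        def forward(self, nodes: torch.Tensor) -> torch.Tensor:
            # nodes: (N, dim) object embeddings for one frame.
            ones = torch.ones(nodes.size(0), 1, device=nodes.device)
            h = torch.cat([nodes, ones], dim=-1)   # (N, dim+1), affine trick
            # logits[r, i, j] = h_i^T W_r h_j for relation r and pair (i, j).
            logits = torch.einsum("id,rde,je->rij", h, self.bilinear, h)
            return logits.permute(1, 2, 0)         # (N, N, R) pairwise relation logits

    # Usage: score relations among 5 objects with 64-dim embeddings.
    head = BiaffineRelationHead(dim=64, num_relations=26)
    scores = head(torch.randn(5, 64))              # -> (5, 5, 26)

The appeal of a biaffine head for this task is that it scores all object pairs in one batched bilinear product rather than looping over pairs, which keeps pairwise relation classification cheap as the number of detected objects grows.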