Abstract:
Multi-view stereo (MVS) networks have recently achieved remarkable progress in dense 3D reconstruction, yet they remain fundamentally limited by their reliance on photometric cues. As a result, current methods fail in textureless, reflective, or non-Lambertian regions. At the same time, commodity time-of-flight (ToF) sensors provide geometric depth information that is complementary but low-resolution and noisy. In this work, we study the possibility of using 3D features extracted from depth data to overcome these MVS limitations. To this end, we develop RGB-D MVSNet, an end-to-end architecture that integrates a depth-fusion encoder with a modern learning-based MVS backbone. Our method constructs a unified feature volume from both photometric and geometric features, which is then fused and regularized with a common decoder. We evaluate the approach on the challenging Sk3D dataset containing synchronized RGB, ToF depth, and high-quality structured-light scans. Experiments demonstrate that our method improves accuracy and completeness metrics over the RGB-only baseline and achieves qualitative improvements in reconstructing textureless and glossy regions. Additional experiments with high-quality depth input show that the method can eliminate typical artifacts when the input depth quality improves. These results indicate that integrating geometric cues into MVS pipelines is a promising direction towards more robust, generalizable 3D reconstruction.
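To make the fusion idea in the abstract concrete, the following is a minimal sketch of combining a photometric cost volume with geometric features derived from ToF depth and regularizing them in a shared 3D decoder. It assumes a PyTorch implementation and hypothetical channel counts and layer sizes; it is an illustration of the general technique, not the actual RGB-D MVSNet architecture.

```python
# Illustrative sketch only (assumed PyTorch, made-up shapes), not the authors' code.
import torch
import torch.nn as nn

class FusedVolumeRegularizer(nn.Module):
    """Concatenate photometric and geometric feature volumes, then regularize
    them with a common 3D-convolutional decoder (hypothetical layer sizes)."""
    def __init__(self, photo_ch=32, geom_ch=16, hidden_ch=32):
        super().__init__()
        # Fuse the two modalities into a single feature volume.
        self.fuse = nn.Conv3d(photo_ch + geom_ch, hidden_ch, kernel_size=3, padding=1)
        # Shared decoder producing one score per depth hypothesis.
        self.decoder = nn.Sequential(
            nn.Conv3d(hidden_ch, hidden_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(hidden_ch, 1, 3, padding=1),
        )

    def forward(self, photo_vol, geom_vol):
        # photo_vol: (B, photo_ch, D, H, W) photometric cost volume
        # geom_vol:  (B, geom_ch,  D, H, W) features from ToF depth, resampled
        #            onto the same depth hypotheses
        fused = self.fuse(torch.cat([photo_vol, geom_vol], dim=1))
        scores = self.decoder(fused).squeeze(1)   # (B, D, H, W)
        return torch.softmax(scores, dim=1)       # depth probability volume
```

Under these assumptions, the module takes per-pixel, per-depth-hypothesis features from both the RGB branch and the depth branch and outputs a probability volume from which an expected depth map can be regressed, as is standard in learning-based MVS pipelines.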