GaussFiller: Unleashing VLM-Expert Guidance for 3D Scene Completion with 3D Gaussian Splatting

Yuhan Ping 1   Cheng Lin 1   Yuan Liu 1   Zhiyang Dou 1   Jia Pan 1   Wenping Wang 2  

Abstract


Due to unavoidable view constraints and occlusions during capture, achieving visually complete scene reconstruction has always been a challenging task. We introduce GaussFiller, a novel framework designed for complete 3D scene reconstruction from sparse RGB-D inputs. Unlike existing mesh-based representations, our approach uses 3D Gaussian Splatting (3DGS), a more flexible and efficient representation for scene reconstruction. We argue that the scene completion problem involves complex semantic and visual reasoning processes. Therefore, we effectively leverage the powerful reasoning capabilities of Vision Language Models (VLMs) as expert guidance through two novel mechanisms. First, given an incomplete 3DGS scene, we propose a Global-Local Alignment strategy to select the optimal completion viewpoints. This strategy utilizes the semantic reasoning ability of VLMs to analyze contextual relationships between the target scene and potential candidate views, providing semantic feedback. Second, after completing the selected views, we introduce a VLM-Coherence Test-Time Scaling module, which integrates a context-similarity-based selection strategy to identify the single best-inpainted result that exhibits the highest semantic coherence. Experiments indicate that our method can reconstruct plausible and visually satisfying complete 3D scenes from limited RGB-D data, offering significant benefits for downstream applications such as interior design and decoration.

Pipeline


The overview of our pipeline. Given a set of sparse RGBD images as input, we first reconstruct an initial 3D Gaussian scene. Then we strategically deploy two VLM-Expert guidance mechanisms to achieve scene completion from the initialization. (1) Global-Local Alignment. We leverage LLaVA-OneVision to provide a comprehensive textual prior to filter initial local camera pose candidates that capture hole regions, which are coarsely generated based on geometric rules, selecting informative and semantically aligned poses for afterwards Fooocus inpainting. (2) VLM-Coherence Test-Time Scaling. We utilize LLaVA to stabilize diffusion inpainting by selecting the result with the highest semantic coherence to the scene, ensuring final geometric and appearance fidelity. Finally, we retrain and refine the initial scene with these hole-filled images to get a complete 3D scene (Note that the roof is removed for better visualization).

Results on Replica


Qualitative comparisons on the Replica datasets.

Results on ScanNet++


Qualitative comparisons on the ScanNet++ datasets.

Results on In the wild data


Results on the in the wild data.

}

This page is Zotero translator friendly. Page last updated