Tackling View-Dependent Semantics in 3D Language Gaussian Splatting

¹MoE Key Lab of Artificial Intelligence, AI Institute, SJTU  ²Huawei Inc.
ICML 2025

TL;DR: We present LaGa, a language-driven open-vocabulary 3D scene understanding method built upon 3D Gaussian splatting, designed to effectively handle the view dependency of 3D semantics.

Abstract

Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints—a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. Then, it constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at: https://github.com/SJTU-DeepVisionLab/LaGa.

What is View-Dependent Semantics?

“It's a range viewed in face and peaks viewed from the side,
Assuming different shapes viewed from far and wide.”
— Su Shi, Written on the Wall of West Forest Temple

[Figure: overview]

A passport viewed from the front clearly reveals its title, while from the back or side, it becomes unrecognizable. This phenomenon highlights a fundamental gap between 2D and 3D understanding. Simply projecting 2D semantics onto 3D Gaussians results in incomplete or inaccurate semantic assignments, as each Gaussian inherits semantics visible only from specific viewpoints.

How to Tackle View-Dependent Semantics?

[Figure: overview]

To address the challenge of view-dependent semantics, we propose LaGa, a simple yet effective framework for 3D scene understanding that explicitly considers the multi-view ambiguity inherent in 3D semantics:
  • ✔ 3D Scene Decomposition
    We introduce a contrastive learning framework that decomposes the 3D scene by uncovering multi-view relationships between objects and their semantic representations.
  • ✔ View-Aggregated Semantic Representation
    We adopt an adaptive K-means clustering algorithm to distill a compact set of semantic descriptors from multi-view observations of each object, and reweight them based on their alignment with global object semantics and intra-cluster consistency.
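The paper's exact decomposition objective is not reproduced here; as a minimal illustration of the idea, an InfoNCE-style contrastive loss over per-Gaussian features could treat Gaussians associated with the same object across views (e.g., via multi-view masks) as positive pairs, pulling their features together while pushing apart those of different objects. The function name and the mask-derived `object_ids` labels below are our own illustrative assumptions:

```python
import numpy as np

def decomposition_contrastive_loss(features, object_ids, temperature=0.1):
    """InfoNCE-style loss over per-Gaussian affinity features (sketch).

    features:   (N, D) feature vectors, one per Gaussian.
    object_ids: (N,) integer labels; Gaussians linked to the same object
                across views are treated as positive pairs (assumption:
                labels come from multi-view mask association).
    """
    # L2-normalize so dot products are cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature               # (N, N) similarity logits
    np.fill_diagonal(sim, -np.inf)            # exclude self-pairs
    m = sim.max(axis=1, keepdims=True)        # stable log-sum-exp
    log_z = m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    log_p = sim - log_z                       # row-wise log-softmax
    pos = object_ids[:, None] == object_ids[None, :]
    np.fill_diagonal(pos, False)
    return -log_p[pos].mean()                 # pull positives together
```

With this loss, features of Gaussians labeled as the same object converge while different objects separate, which is what allows the scene to be carved into objects before semantics are aggregated.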
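The second step can be sketched in a few lines of numpy. This is a simplified stand-in, not the paper's implementation: it uses plain Lloyd's k-means with a fixed `k` rather than the adaptive variant, and the reweighting term (alignment with the global object semantics times intra-cluster consistency) is our reading of the description above:

```python
import numpy as np

def kmeans(x, k, iters=25, seed=0):
    """Plain Lloyd's k-means (stand-in for the adaptive variant)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center, then update.
        assign = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(0)
    return centers, assign

def view_aggregated_representation(descriptors, k=3):
    """Distill one object's multi-view semantic descriptors into a
    weighted set of cluster centers (simplified reweighting sketch)."""
    x = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    k = min(k, len(x))
    centers, assign = kmeans(x, k)
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    g = x.mean(0)
    g /= np.linalg.norm(g)                    # global object semantics
    weights = np.zeros(k)
    for j in range(k):
        members = x[assign == j]
        if len(members) == 0:
            continue                          # skip empty clusters
        consistency = float((members @ centers[j]).mean())
        alignment = float(centers[j] @ g)
        weights[j] = max(alignment, 0.0) * max(consistency, 0.0)
    weights /= weights.sum() + 1e-8           # normalize to a distribution
    return centers, weights
```

At query time, the weighted centers can then be compared against a text embedding, so that semantics visible from only a few viewpoints still contribute without dominating the representation.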

Performance

LaGa achieves state-of-the-art performance on the LERF-OVS dataset, outperforming previous methods by a large margin. ‘†’ indicates results from OpenGaussian, ‘‡’ indicates our reimplementation, and ‘*’ denotes unrefereed preprints.

[Figure: overview]

Interactive GUI

We provide an intuitive and user-friendly interactive GUI for LaGa. For installation instructions and usage details, please visit our GitHub repository.

Citation

@inproceedings{laga,
  title     = {Tackling View-Dependent Semantics in 3D Language Gaussian Splatting},
  author    = {Jiazhong Cen and Xudong Zhou and Jiemin Fang and Changsong Wen and Lingxi Xie and Xiaopeng Zhang and Wei Shen and Qi Tian},
  booktitle = {ICML},
  year      = {2025},
}