ReferSplat: Referring Segmentation in 3D Gaussian Splatting

Shuting He1*, Guangquan Jie2*, Changshuo Wang1, Yun Zhou1, Shuming Hu1, Guanbin Li3, Henghui Ding2✉
(* indicates equal contribution, ✉ indicates corresponding author)
1Shanghai University of Finance and Economics, 2Fudan University, 3Sun Yat-sen University
ICML 2025 (Oral)

TL;DR: We introduce R3DGS, the task of segmenting objects in a 3D Gaussian scene from a natural-language referring expression; release the first dataset for it, Ref-LERF; and propose ReferSplat, which explicitly models 3D Gaussians with language in a spatially aware paradigm and reaches state-of-the-art performance on both R3DGS and 3D open-vocabulary segmentation.

R3DGS task overview

Figure 1. R3DGS segments objects in a 3D Gaussian scene from a natural-language description — including objects that may be occluded or invisible in a novel view.

Abstract

We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI.

To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at github.com/heshuting555/ReferSplat.

Motivation

Given a 3D Gaussian scene and a natural-language expression such as “the mug behind the laptop on the left”, R3DGS asks a model to return the 3D Gaussians belonging to the referred object. Open-vocabulary 3D segmentation matches isolated concept words to 3D regions. Referring expressions, by contrast, encode spatial relationships and discriminative attributes, and the target may be occluded or not directly visible in a given novel view. Existing pipelines fall short on both fronts: modeling relational language and localizing targets that the query view does not show.

Pipeline comparison: existing open-vocabulary 3DGS vs. ReferSplat

Figure 2. Comparison of (a) the existing open-vocabulary 3DGS segmentation pipeline and (b) the proposed ReferSplat for R3DGS.

Method

ReferSplat attaches a language-aware referring feature to each 3D Gaussian, building a 3D Gaussian Referring Field. Segmentation masks are obtained by modulating these referring features with the word features f_w, supervised with generated pseudo masks. A Position-aware Cross-Modal Interaction module further grounds the referring expression on the rendered output, jointly reasoning about object semantics and inter-object spatial relations directly in the 3D Gaussian representation.
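The idea of modulating per-Gaussian referring features with word features can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the modulation is a summed dot product between each Gaussian's referring feature and every word feature, and that the resulting per-Gaussian scores are rendered into a pixel with the standard front-to-back alpha blending used in 3D Gaussian Splatting. The function names gaussian_word_scores and composite_along_ray, the feature dimension, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_word_scores(referring_feats, word_feats):
    """Assumed modulation: sum of dot products between each Gaussian's
    referring feature and every word feature.
    referring_feats: (N, D), word_feats: (L, D) -> scores of shape (N,)."""
    similarity = referring_feats @ word_feats.T  # (N, L)
    return similarity.sum(axis=1)

def composite_along_ray(scores, alphas):
    """Front-to-back alpha blending of per-Gaussian scores for one pixel.
    scores, alphas: (K,), with Gaussians sorted from near to far."""
    # Transmittance before each Gaussian: product of (1 - alpha) of those in front.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return float(np.sum(weights * scores))

# Toy data: three Gaussians on one ray, two word features, D = 2.
referring_feats = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
word_feats = np.array([[1.0, 0.0], [1.0, 1.0]])
alphas = np.array([0.5, 0.5, 0.5])

scores = gaussian_word_scores(referring_feats, word_feats)
pixel = composite_along_ray(scores, alphas)

# Squash the rendered response to (0, 1) and threshold it (hypothetical 0.5).
probability = 1.0 / (1.0 + np.exp(-pixel))
in_mask = probability > 0.5
print(scores, pixel, in_mask)
```

In a trainable version of this sketch, the rendered response would be compared against the pseudo mask with a per-pixel loss, so that gradients flow back into the referring features stored on the Gaussians.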

ReferSplat method overview

Figure 3. Overview of ReferSplat — 3D Gaussian Referring Fields followed by Position-aware Cross-Modal Interaction.

Ref-LERF Dataset

We construct Ref-LERF, the first dataset for Referring 3D Gaussian Splatting Segmentation. It extends LERF scenes with natural-language referring expressions and per-Gaussian object annotations, enabling evaluation of language-conditioned 3D segmentation under spatial reasoning.

Ref-LERF dataset analysis

Figure 4. Dataset analysis of Ref-LERF. Expressions are rich in spatial language (placed, near, next, behind, under) and substantially longer than the LERF-OVS expressions used in prior open-vocabulary 3D benchmarks.

Results

ReferSplat establishes state-of-the-art results on the new R3DGS benchmark and also improves over prior work on standard 3D open-vocabulary segmentation. See the paper for full quantitative comparisons and ablations.

Qualitative R3DGS comparison on Ref-LERF

Figure 5. Qualitative R3DGS comparison on Ref-LERF. Blue masks highlight spatial descriptions in the referring expression. Grounded-SAM, LangSplat and GaussianGrouping struggle to localize the correct object, while ReferSplat closely matches the ground truth.

BibTeX

@inproceedings{he2025refersplat,
  title     = {ReferSplat: Referring Segmentation in 3D Gaussian Splatting},
  author    = {He, Shuting and Jie, Guangquan and Wang, Changshuo and Zhou, Yun and Hu, Shuming and Li, Guanbin and Ding, Henghui},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2025}
}