Most recent 3D instance segmentation methods are open-vocabulary, offering greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, e.g., answering "List the objects in the scene."
We introduce the first method to address 3D instance segmentation in a setting that is devoid of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images.
We evaluate our method on ScanNet200 and Replica, outperforming existing methods in both vocabulary-free and open-vocabulary settings.
Given the point cloud \(\mathcal{P}\) of a 3D scene and the corresponding set of \(N\) posed images \(\mathcal{V} = \{I_n\}_{n=1}^N\), our method predicts 3D instance masks with their associated semantic labels without relying on a predefined vocabulary. Our method first utilizes a large vision-language assistant and an open-vocabulary 2D instance segmentation model to identify and ground objects in each posed image \(I_n\), forming the scene vocabulary \(\mathcal{C}\) while mitigating the risk of hallucination by the vision-language assistant.
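As a rough sketch of this discovery step, the snippet below prompts a vision-language assistant for the objects visible in each view and keeps only the categories that an open-vocabulary 2D segmenter can ground. The helpers `vlm_list_objects` and `ground_nouns`, the score threshold, and the multi-view filter are illustrative placeholders, not the exact models, prompts, or criteria used in our pipeline.

```python
# Sketch: build the scene vocabulary C from the posed images (hypothetical helpers).
from collections import Counter

def build_scene_vocabulary(images, vlm_list_objects, ground_nouns,
                           min_views=2, score_thr=0.5):
    """Collect object nouns proposed by a vision-language assistant and keep
    only those that the open-vocabulary 2D segmenter can actually ground,
    which helps filter out hallucinated categories."""
    grounded_counts = Counter()
    per_image_masks = []
    for image in images:
        # 1) Ask the vision-language assistant which objects are visible in this view.
        nouns = vlm_list_objects(image)          # e.g. ["chair", "table", "plant"]
        # 2) Try to ground each noun with the open-vocabulary 2D instance segmenter.
        detections = ground_nouns(image, nouns)  # list of (label, mask, score)
        kept = [(l, m, s) for (l, m, s) in detections if s >= score_thr]
        per_image_masks.append(kept)
        grounded_counts.update({l for (l, _, _) in kept})
    # 3) A category enters the scene vocabulary only if grounded in enough views.
    vocabulary = sorted(l for l, c in grounded_counts.items() if c >= min_views)
    return vocabulary, per_image_masks
```

Requiring that a proposed label be grounded by the 2D segmenter, and optionally in more than one view, is one simple way to discard categories hallucinated by the assistant.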
Meanwhile, we partition the 3D scene \(\mathcal{P}\) into geometrically coherent superpoints \(\mathcal{Q}\) that serve as initial seeds for 3D instance proposals. Then, with the semantic-aware instance masks from multi-view images, we propose a novel procedure for representing superpoints and guiding their merging into 3D instance masks, using both the grounded semantic labels and the corresponding 2D instance masks.
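A minimal sketch of geometric superpoint extraction is given below, assuming a simple normal-based oversegmentation of a kNN graph; the neighborhood size and angle threshold are illustrative and not necessarily the oversegmentation used in our method.

```python
# Sketch: oversegment the point cloud into geometrically coherent superpoints.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def extract_superpoints(points, normals, k=16, angle_thr_deg=15.0):
    """Cut a kNN graph wherever adjacent normals disagree, then take the
    connected components as superpoints Q."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(points)
    _, idx = nbrs.kneighbors(points)              # (P, k+1); column 0 is the point itself
    src = np.repeat(np.arange(len(points)), k)
    dst = idx[:, 1:].reshape(-1)
    # Keep an edge only if the two normals are nearly parallel (coherent surface).
    cos_thr = np.cos(np.deg2rad(angle_thr_deg))
    keep = np.abs(np.sum(normals[src] * normals[dst], axis=1)) >= cos_thr
    adj = coo_matrix((np.ones(keep.sum()), (src[keep], dst[keep])),
                     shape=(len(points), len(points)))
    _, labels = connected_components(adj, directed=False)
    return labels                                 # superpoint id per point
```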
By projecting each 3D superpoint onto the image planes and checking its overlap with the 2D instance masks, we aggregate semantic labels from multiple views within each superpoint. Once each superpoint is associated with a semantic label, we merge superpoints into 3D instance masks via spectral clustering. This involves defining an affinity matrix over superpoints, built from mask-coherence scores computed with the 2D instance masks and semantic-coherence scores computed from the per-superpoint textual embeddings.
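The sketch below illustrates these two steps under simplified assumptions: a pinhole projection per view with no occlusion reasoning, majority voting of 2D mask labels per superpoint, a mask-coherence matrix taken as given, a fixed weight between the two coherence terms, and a known number of clusters. None of these choices should be read as the exact scoring or merging used in our method.

```python
# Sketch: vote semantic labels per superpoint, then merge superpoints by spectral clustering.
import numpy as np
from collections import Counter
from sklearn.cluster import SpectralClustering

def project(points, K, w2c):
    """Pinhole projection of world-space points into pixel coordinates."""
    cam = (w2c @ np.c_[points, np.ones(len(points))].T)[:3].T   # (P, 3) camera space
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3], cam[:, 2]                    # pixels, depth

def vote_superpoint_labels(points, sp_ids, views):
    """views: list of dicts with intrinsics K, world-to-camera w2c, and a list
    of grounded (label, boolean_mask) pairs from the 2D segmenter."""
    votes = [Counter() for _ in range(sp_ids.max() + 1)]
    for v in views:
        if not v["masks"]:
            continue
        h, w = v["masks"][0][1].shape
        uv, depth = project(points, v["K"], v["w2c"])
        px, py = np.round(uv).astype(int).T
        in_view = (depth > 0) & (px >= 0) & (px < w) & (py >= 0) & (py < h)
        for label, mask in v["masks"]:
            hit = in_view.copy()
            hit[in_view] = mask[py[in_view], px[in_view]]
            for sp, count in Counter(sp_ids[hit]).items():
                votes[sp][label] += count
    # Majority vote per superpoint; occlusion checks against depth maps are omitted here.
    return [c.most_common(1)[0][0] if c else None for c in votes]

def merge_superpoints(mask_coherence, label_embeds, n_instances, alpha=0.5):
    """mask_coherence: (S, S) co-occurrence of superpoints inside 2D instance masks;
    label_embeds: (S, D) L2-normalised text embeddings of the voted labels."""
    semantic_coherence = np.clip(label_embeds @ label_embeds.T, 0.0, 1.0)
    affinity = alpha * mask_coherence + (1.0 - alpha) * semantic_coherence
    clustering = SpectralClustering(n_clusters=n_instances, affinity="precomputed")
    return clustering.fit_predict(affinity)                     # instance id per superpoint
```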
Finally, for each 3D instance proposal, we obtain a text-aligned representation by aggregating the CLIP visual representations of multi-scale object crops from multi-view images. We further enrich this vision-based representation with a textual representation derived from the merged superpoints. This text-aligned mask representation enables semantic assignment of the instance masks using the scene vocabulary \(\mathcal{C}\).
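A hedged sketch of this last step follows, using the open-source `clip` package as a stand-in for the vision-language encoder; the crop-extraction pipeline, the fusion weight between visual and textual features, and the prompt template are illustrative assumptions rather than the exact design of our method.

```python
# Sketch: text-aligned mask representation and semantic assignment over the scene vocabulary.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def mask_representation(crops, merged_labels, beta=0.7):
    """crops: PIL crops of the instance at multiple scales and views;
    merged_labels: textual labels carried by the instance's merged superpoints."""
    # Visual part: average CLIP image embeddings of the multi-scale, multi-view crops.
    imgs = torch.stack([preprocess(c) for c in crops]).to(device)
    vis = model.encode_image(imgs).float().mean(dim=0)
    # Textual part: average CLIP text embeddings of the merged superpoint labels.
    txt = model.encode_text(clip.tokenize(merged_labels).to(device)).float().mean(dim=0)
    feat = beta * vis / vis.norm() + (1 - beta) * txt / txt.norm()
    return feat / feat.norm()

@torch.no_grad()
def assign_semantics(instance_feat, vocabulary):
    """Pick the scene-vocabulary category whose text embedding is most similar."""
    tokens = clip.tokenize([f"a {c} in a scene" for c in vocabulary]).to(device)
    text_feats = model.encode_text(tokens).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return vocabulary[int((text_feats @ instance_feat).argmax())]
```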