Most recent 3D instance segmentation methods are open-vocabulary, offering greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, e.g., answering "List the objects in the scene."
We introduce the first method to address 3D instance segmentation in a setting that is devoid of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images.
We evaluate our method on ScanNet200 and Replica, outperforming existing methods in both vocabulary-free and open-vocabulary settings.
Given the point cloud \(\mathcal{P}\) of a 3D scene and the corresponding set of \(N\) posed images \(\mathcal{V} = \{I_n\}_{n=1}^N\), our method predicts 3D instance masks with their associated semantic labels without relying on a predefined vocabulary. Our method first utilizes a large vision-language assistant and an open-vocabulary 2D instance segmentation model to identify and ground objects in each posed image \(I_n\), forming the scene vocabulary \(\mathcal{C}\) while mitigating the risk of hallucination by the vision-language assistant.
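As a rough sketch of this discovery step, the snippet below prompts a vision-language assistant for the objects visible in each view and keeps only the categories that an open-vocabulary 2D segmenter can ground. The helpers `vlm_list_objects` and `ground_nouns`, the score threshold, and the multi-view filter are illustrative placeholders, not the exact models, prompts, or criteria used in our pipeline.

```python
# Sketch: build the scene vocabulary C from the posed images (hypothetical helpers).
from collections import Counter

def build_scene_vocabulary(images, vlm_list_objects, ground_nouns,
                           min_views=2, score_thr=0.5):
    """Collect object nouns proposed by a vision-language assistant and keep
    only those that the open-vocabulary 2D segmenter can actually ground,
    which helps filter out hallucinated categories."""
    grounded_counts = Counter()
    per_image_masks = []
    for image in images:
        # 1) Ask the vision-language assistant which objects are visible in this view.
        nouns = vlm_list_objects(image)          # e.g. ["chair", "table", "plant"]
        # 2) Try to ground each noun with the open-vocabulary 2D instance segmenter.
        detections = ground_nouns(image, nouns)  # list of (label, mask, score)
        kept = [(l, m, s) for (l, m, s) in detections if s >= score_thr]
        per_image_masks.append(kept)
        grounded_counts.update({l for (l, _, _) in kept})
    # 3) A category enters the scene vocabulary only if grounded in enough views.
    vocabulary = sorted(l for l, c in grounded_counts.items() if c >= min_views)
    return vocabulary, per_image_masks
```

Requiring that a proposed label be grounded by the 2D segmenter, and optionally in more than one view, is one simple way to discard categories hallucinated by the assistant.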
Meanwhile, we partition the 3D scene \(\mathcal{P}\) into geometrically coherent superpoints \(\mathcal{Q}\) that serve as initial seeds for 3D instance proposals. Then, with the semantic-aware instance masks from multi-view images, we propose a novel procedure for representing superpoints and guiding their merging into 3D instance masks, using both the grounded semantic labels and the corresponding 2D instance masks.
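A minimal sketch of geometric superpoint extraction is given below, assuming a simple normal-based oversegmentation of a kNN graph; the neighborhood size and angle threshold are illustrative and not necessarily the oversegmentation used in our method.

```python
# Sketch: oversegment the point cloud into geometrically coherent superpoints.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def extract_superpoints(points, normals, k=16, angle_thr_deg=15.0):
    """Cut a kNN graph wherever adjacent normals disagree, then take the
    connected components as superpoints Q."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(points)
    _, idx = nbrs.kneighbors(points)              # (P, k+1); column 0 is the point itself
    src = np.repeat(np.arange(len(points)), k)
    dst = idx[:, 1:].reshape(-1)
    # Keep an edge only if the two normals are nearly parallel (coherent surface).
    cos_thr = np.cos(np.deg2rad(angle_thr_deg))
    keep = np.abs(np.sum(normals[src] * normals[dst], axis=1)) >= cos_thr
    adj = coo_matrix((np.ones(keep.sum()), (src[keep], dst[keep])),
                     shape=(len(points), len(points)))
    _, labels = connected_components(adj, directed=False)
    return labels                                 # superpoint id per point
```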
By projecting each 3D superpoint onto the image planes and checking its overlap with the 2D instance masks, we aggregate semantic labels from multiple views within each superpoint. Once each superpoint is associated with a semantic label, we merge superpoints into 3D instance masks via spectral clustering. This involves defining an affinity matrix over superpoints, built from mask-coherence scores computed with the 2D instance masks and semantic-coherence scores computed from the per-superpoint textual embeddings.
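The sketch below illustrates these two steps under simplified assumptions: a pinhole projection per view with no occlusion reasoning, majority voting of 2D mask labels per superpoint, a mask-coherence matrix taken as given, a fixed weight between the two coherence terms, and a known number of clusters. None of these choices should be read as the exact scoring or merging used in our method.

```python
# Sketch: vote semantic labels per superpoint, then merge superpoints by spectral clustering.
import numpy as np
from collections import Counter
from sklearn.cluster import SpectralClustering

def project(points, K, w2c):
    """Pinhole projection of world-space points into pixel coordinates."""
    cam = (w2c @ np.c_[points, np.ones(len(points))].T)[:3].T   # (P, 3) camera space
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3], cam[:, 2]                    # pixels, depth

def vote_superpoint_labels(points, sp_ids, views):
    """views: list of dicts with intrinsics K, world-to-camera w2c, and a list
    of grounded (label, boolean_mask) pairs from the 2D segmenter."""
    votes = [Counter() for _ in range(sp_ids.max() + 1)]
    for v in views:
        if not v["masks"]:
            continue
        h, w = v["masks"][0][1].shape
        uv, depth = project(points, v["K"], v["w2c"])
        px, py = np.round(uv).astype(int).T
        in_view = (depth > 0) & (px >= 0) & (px < w) & (py >= 0) & (py < h)
        for label, mask in v["masks"]:
            hit = in_view.copy()
            hit[in_view] = mask[py[in_view], px[in_view]]
            for sp, count in Counter(sp_ids[hit]).items():
                votes[sp][label] += count
    # Majority vote per superpoint; occlusion checks against depth maps are omitted here.
    return [c.most_common(1)[0][0] if c else None for c in votes]

def merge_superpoints(mask_coherence, label_embeds, n_instances, alpha=0.5):
    """mask_coherence: (S, S) co-occurrence of superpoints inside 2D instance masks;
    label_embeds: (S, D) L2-normalised text embeddings of the voted labels."""
    semantic_coherence = np.clip(label_embeds @ label_embeds.T, 0.0, 1.0)
    affinity = alpha * mask_coherence + (1.0 - alpha) * semantic_coherence
    clustering = SpectralClustering(n_clusters=n_instances, affinity="precomputed")
    return clustering.fit_predict(affinity)                     # instance id per superpoint
```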
Finally, for each 3D instance proposal, we obtain a text-aligned representation by aggregating the CLIP visual representations of multi-scale object crops from multi-view images. We further enrich this vision-based representation with a textual representation derived from the merged superpoints. This text-aligned mask representation enables semantic assignment of the instance masks using the scene vocabulary \(\mathcal{C}\).
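A hedged sketch of this last step follows, using the open-source `clip` package as a stand-in for the vision-language encoder; the crop-extraction pipeline, the fusion weight between visual and textual features, and the prompt template are illustrative assumptions rather than the exact design of our method.

```python
# Sketch: text-aligned mask representation and semantic assignment over the scene vocabulary.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def mask_representation(crops, merged_labels, beta=0.7):
    """crops: PIL crops of the instance at multiple scales and views;
    merged_labels: textual labels carried by the instance's merged superpoints."""
    # Visual part: average CLIP image embeddings of the multi-scale, multi-view crops.
    imgs = torch.stack([preprocess(c) for c in crops]).to(device)
    vis = model.encode_image(imgs).float().mean(dim=0)
    # Textual part: average CLIP text embeddings of the merged superpoint labels.
    txt = model.encode_text(clip.tokenize(merged_labels).to(device)).float().mean(dim=0)
    feat = beta * vis / vis.norm() + (1 - beta) * txt / txt.norm()
    return feat / feat.norm()

@torch.no_grad()
def assign_semantics(instance_feat, vocabulary):
    """Pick the scene-vocabulary category whose text embedding is most similar."""
    tokens = clip.tokenize([f"a {c} in a scene" for c in vocabulary]).to(device)
    text_feats = model.encode_text(tokens).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return vocabulary[int((text_feats @ instance_feat).argmax())]
```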