OnlineAnySeg: Online Zero-Shot 3D Segmentation by Visual Foundation Model Guided 2D Mask Merging

Yijie Tang1,*    Jiazhao Zhang3,*    Yuqing Lan1    Yulan Guo4    Dezun Dong1    Chenyang Zhu1,†    Kai Xu1,2,†
1National University of Defense Technology    2Xiangjiang Laboratory   
3CFCS, School of CS, Peking University      4Sun Yat-sen University

*Equal Contribution    †Corresponding Authors
CVPR 2025

Real-world Demos

Contributions

  • OnlineAnySeg proposes an efficient data structure for organizing sequential 2D masks, which incrementally maintains the spatial associations among all masks in real time (see the sketch after this list).
  • OnlineAnySeg designs a zero-shot online mask-merging strategy. By leveraging spatial overlap and multimodal similarity through collaborative filtering, this approach eliminates the dependency on training data and maintains strong performance even in incompletely scanned scenes.
  • OnlineAnySeg performs comparably to existing offline methods and achieves notable improvements over the state-of-the-art online method on publicly available benchmarks, running at ~15 FPS.
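The first contribution can be illustrated with a minimal sketch. Everything here (the `VoxelMaskIndex` name, the voxel size, the overlap-ratio definition) is an assumption for illustration, not the authors' actual implementation: each back-projected mask registers its points into a voxel hash table mapping a quantized voxel coordinate to the set of mask IDs touching it, so querying all masks that spatially overlap a new mask costs one lookup per voxel, i.e., O(n) overall instead of O(n^2) pairwise checks.

```python
from collections import defaultdict
import numpy as np

class VoxelMaskIndex:
    """Hypothetical sketch of a voxel-hash index that incrementally
    tracks which masks (by ID) occupy each 3D voxel."""

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        # hash table: quantized voxel coordinate -> set of mask IDs
        self.voxel_to_masks = defaultdict(set)
        # reverse map: mask ID -> set of voxel keys it occupies
        self.mask_to_voxels = {}

    def _voxel_keys(self, points):
        # quantize 3D points of shape (N, 3) to integer voxel coordinates
        keys = np.floor(points / self.voxel_size).astype(np.int64)
        return set(map(tuple, keys))

    def insert_mask(self, mask_id, points):
        """Register a back-projected mask; return IDs of all previously
        inserted masks that share at least one voxel with it."""
        keys = self._voxel_keys(points)
        overlapping = set()
        for k in keys:
            overlapping |= self.voxel_to_masks[k]  # O(1) lookup per voxel
            self.voxel_to_masks[k].add(mask_id)
        self.mask_to_voxels[mask_id] = keys
        return overlapping

    def overlap_ratio(self, a, b):
        """Fraction of the smaller mask's voxels shared with the other."""
        va, vb = self.mask_to_voxels[a], self.mask_to_voxels[b]
        if not va or not vb:
            return 0.0
        return len(va & vb) / min(len(va), len(vb))
```

Keeping the reverse map alongside the hash table is what makes overlap ratios cheap to compute once candidate pairs are found, so the per-frame cost stays linear in the number of voxels a new mask touches.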

Abstract

Online 3D instance segmentation of a progressively reconstructed scene is both a critical and challenging task for embodied applications. With the success of visual foundation models (VFMs) in the image domain, leveraging 2D priors to address 3D online segmentation has become a prominent research focus. Since segmentation results provided by 2D priors often require spatial consistency to be lifted into a final 3D segmentation, an efficient method for identifying spatial overlap among 2D masks is essential; yet existing methods rarely achieve this in real time, largely limiting such lifting to offline approaches. To address this, we propose an efficient method that lifts 2D masks generated by VFMs into unified 3D instances using a hashing technique. By employing voxel hashing for efficient 3D scene querying, our approach reduces the time complexity of costly spatial overlap queries from O(n^2) to O(n). Accurate spatial associations further enable 3D merging of 2D masks through simple similarity-based filtering in a zero-shot manner, making our approach more robust to incomplete and noisy data. Evaluated on the ScanNet and SceneNN benchmarks, our approach achieves state-of-the-art performance in online 3D instance segmentation with leading efficiency.

Method Overview

Overview of OnlineAnySeg. (a) A posed RGB-D stream is fed to our method sequentially. (b) A series of 2D masks is generated by a VFM from each input color image and back-projected into 3D space, establishing associations with the VoxelHashing scene representation. Meanwhile, semantic and geometric features of the masks are extracted by pre-trained feature extractors and, together with the mask overlap associations, serve as the core criteria for the Mask Merging process. (c) The final prediction of 3D instances is then output.
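As a concrete illustration of the Mask Merging step described above, the following is a minimal sketch under stated assumptions: the thresholds, feature handling, and the `MaskMerger` name are hypothetical, not the paper's values. Candidate pairs come from the voxel-hash overlap query (see the sketch above); a pair is merged only when both its spatial overlap and the cosine similarity of its features pass thresholds, with a union-find structure keeping the merging incremental and online.

```python
import numpy as np

def cosine_sim(f1, f2):
    """Cosine similarity between two feature vectors."""
    denom = np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-8
    return float(np.dot(f1, f2) / denom)

class MaskMerger:
    """Hypothetical sketch of similarity-filtered mask merging.
    `index` is a VoxelMaskIndex-like object; thresholds are
    illustrative placeholders, not the paper's settings."""

    def __init__(self, index, overlap_thr=0.3, feat_thr=0.8):
        self.index = index
        self.overlap_thr = overlap_thr
        self.feat_thr = feat_thr
        self.parent = {}    # union-find: mask ID -> parent mask ID
        self.features = {}  # mask ID -> extracted feature vector

    def _find(self, x):
        # union-find root lookup with path halving
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def add_mask(self, mask_id, points, feature):
        """Insert one back-projected mask and merge it with any
        spatially and semantically consistent earlier masks."""
        self.parent[mask_id] = mask_id
        self.features[mask_id] = feature
        # candidates come from the O(n) voxel-hash overlap query
        for cand in self.index.insert_mask(mask_id, points):
            if self.index.overlap_ratio(mask_id, cand) < self.overlap_thr:
                continue
            root = self._find(cand)
            if cosine_sim(feature, self.features[root]) < self.feat_thr:
                continue
            # union the two instance groups
            self.parent[root] = self._find(mask_id)

    def instances(self):
        """Group mask IDs into merged 3D instances."""
        groups = {}
        for m in self.parent:
            groups.setdefault(self._find(m), []).append(m)
        return list(groups.values())
```

Filtering on overlap and feature similarity jointly is what makes the merge decision zero-shot: no learned merging network is involved, so the pipeline works on scenes and categories never seen during any training.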

Experimental Results

Results on Public Datasets

BibTeX

@article{tang2025onlineanyseg,
  title={OnlineAnySeg: Online Zero-Shot 3D Segmentation by Visual Foundation Model Guided 2D Mask Merging},
  author={Tang, Yijie and Zhang, Jiazhao and Lan, Yuqing and Guo, Yulan and Dong, Dezun and Zhu, Chenyang and Xu, Kai},
  journal={arXiv preprint arXiv:2503.01309},
  year={2025}
}