Longterm Wiki

Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding (TSP3D)

paper

Authors

Wenxuan Guo·Xiuwei Xu·Ziwei Wang·Jianjiang Feng·Jie Zhou·Jiwen Lu

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Relevant to discussions of efficient multimodal perception in embodied AI: real-time 3D visual grounding is a prerequisite for language-conditioned robots and AR/VR systems, where inference speed is often the main bottleneck.

Paper Details

Citations
14
0 influential
Year
2025

Metadata

Importance: 45/100 · arXiv preprint · primary source

Abstract

In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet real-time inference requirements due to their two-stage or point-based architectures. Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework along this technical route. However, since the 3D visual grounding task requires the 3D scene representation to interact deeply with text features, a sparse convolution-based architecture is inefficient for this interaction due to the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA), which deeply fuse the 3D scene representation with text features in an efficient way through gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation and thus lets voxel features interact efficiently with text features via cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively repairs over-pruned regions through voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves the top inference speed, surpassing the previous fastest method by 100% in FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with a +1.13 lead in Acc@0.5 on ScanRefer, and +2.6 and +3.2 leads on NR3D and SR3D respectively. The code is available at https://github.com/GWxuan/TSP3D.
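The text-guided pruning idea in the abstract can be illustrated with a minimal NumPy sketch. Everything here (the dot-product scoring rule, the keep ratio, the shapes, the function name) is a hypothetical simplification for intuition, not the paper's implementation:

```python
import numpy as np

def text_guided_prune(voxel_feats, voxel_coords, text_feat, keep_ratio=0.5):
    """One pruning round (hypothetical simplification of TGP).

    Scores each voxel by similarity to a pooled text feature and keeps
    only the top `keep_ratio` fraction, shrinking the voxel set that
    subsequent cross-attention with text features must process.
    """
    # Relevance score: dot product between each voxel feature and the text feature.
    scores = voxel_feats @ text_feat                  # shape (N,)
    n_keep = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-n_keep:]               # indices of top-scoring voxels
    return voxel_feats[keep], voxel_coords[keep]

# Toy example: 8 voxels with 4-d features, pruned twice (iterative sparsification).
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))
coords = rng.integers(0, 16, size=(8, 3))
text = rng.normal(size=4)
for _ in range(2):
    feats, coords = text_guided_prune(feats, coords, text, keep_ratio=0.5)
print(feats.shape)  # → (2, 4) after two halvings
```

In the actual method the scene representation is a sparse voxel grid processed by sparse convolutions, and pruning is applied at multiple levels of the backbone; this toy only shows why shrinking the voxel set makes the voxel-text interaction cheap.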

Summary

TSP3D introduces an efficient single-stage architecture for 3D visual grounding built on multi-level fully sparse convolutions. To make the heavy interaction between voxel features and text features tractable, it pairs text-guided pruning (TGP), which iteratively sparsifies the scene representation before cross-attention, with completion-based addition (CBA), which restores over-pruned regions of the target at negligible cost. The method doubles the FPS of the previous fastest single-stage approach while setting state-of-the-art accuracy on ScanRefer, NR3D, and SR3D.

Key Points

  • Proposes TSP3D, a single-stage 3D visual grounding framework built on a multi-level fully sparse convolutional architecture, inspired by its success in 3D object detection.
  • Introduces text-guided pruning (TGP), which iteratively sparsifies the 3D scene representation so that voxel features can interact efficiently with text features via cross-attention.
  • Adds completion-based addition (CBA), which adaptively repairs over-pruned regions through voxel completion at negligible computational overhead, preserving delicate geometry.
  • Achieves the top inference speed among single-stage methods, surpassing the previous fastest method by 100% in FPS.
  • Reaches state-of-the-art accuracy even against two-stage methods: +1.13 Acc@0.5 on ScanRefer, and +2.6 and +3.2 on NR3D and SR3D respectively.
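The completion-based addition point above can be sketched in a few lines. This is an assumption-laden illustration: the "target head" scores, the threshold, and the index bookkeeping are invented for the example and do not reflect the paper's code:

```python
import numpy as np

def completion_based_add(kept_idx, target_scores, thresh=0.8):
    """Hypothetical sketch of CBA: re-add voxels that a lightweight target
    head scores highly but that pruning removed, so the fine geometry of
    the referred object is preserved at small extra cost."""
    kept = set(kept_idx.tolist())
    # Voxels the target head believes belong to the referred object.
    candidates = np.nonzero(target_scores > thresh)[0]
    restored = [i for i in candidates if i not in kept]
    return np.sort(np.concatenate([kept_idx,
                                   np.array(restored, dtype=kept_idx.dtype)]))

# 8 voxels; pruning kept {0, 2, 5}, but the head flags 1, 3, 6 as target voxels.
kept = np.array([0, 2, 5])
scores = np.array([0.1, 0.9, 0.3, 0.95, 0.2, 0.4, 0.85, 0.05])
print(completion_based_add(kept, scores))  # → [0 1 2 3 5 6]
```

The design point is that completion only touches a handful of high-scoring voxels, which is why the paper can claim negligible computational overhead relative to re-densifying the whole grid.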

Cited by 1 page

Page                   | Type       | Quality
Large Language Models  | Capability | 60.0

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 61 KB
[2502.10392] Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
 Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding

 
 
 Wenxuan Guo¹*, Xiuwei Xu¹*, Ziwei Wang², Jianjiang Feng¹†, Jie Zhou¹, Jiwen Lu¹
 ¹Tsinghua University  ²Nanyang Technological University
 {gwx22,xxw21}@mails.tsinghua.edu.cn  ziwei.wang@ntu.edu.sg
 {jfeng,jzhou,lujiwen}@tsinghua.edu.cn
 
 

 
 Abstract

 In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet real-time inference requirements due to their two-stage or point-based architectures. Inspired by the success of multi-level fully sparse convolutional architectures in 3D object detection, we aim to build a new 3D visual grounding framework along this technical route.
However, since the 3D visual grounding task requires the 3D scene representation to interact deeply with text features, a sparse convolution-based architecture is inefficient for this interaction due to the large number of voxel features.
To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA), which deeply fuse the 3D scene representation with text features in an efficient way through gradual region pruning and target completion.
Specifically, TGP iteratively sparsifies the 3D scene representation and thus lets voxel features interact efficiently with text features via cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively repairs over-pruned regions through voxel completion with negligible computational overhead.
Compared with previous single-stage methods, our method achieves the top inference speed, surpassing the previous fastest method by 100% in FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with a +1.13 lead in Acc@0.5 on ScanRefer, and +2.6 and +3.2 leads on NR3D and SR3D respectively. The code is available at https://github.com/GWxuan/TSP3D .

 
 * Equal contribution. † Corresponding author.
 
 
 1 Introduction

 
 Incorporating multi-modal information to guide 3D visual perception is a promising direction. In recent years, 3D visual grounding (3DVG), also known as 3D instance referencing, has received increasing attention as a fundamental multi-modal 3D perception task. The aim of 3DVG is to locate an object in a scene from a free-form query description. 3DVG is challenging since it requires understanding of both the 3D scene and the language description. Recently, with the development of 3D scene perception and vision-language models, 3DVG methods have shown remarkable progress [16, 22]. However, as 3DVG is increasingly applied in fields like robotics and AR/VR where inference speed is the main bottleneck, constructing an efficient real-time 3DVG model remains a challenging problem.

 
 
 Figure 1: Comparison of 3DVG methods on Sc

... (truncated, 61 KB total)
Resource ID: 0b54496953b7c462 | Stable ID: ODIyNTUzZj