VL-GRiP3: A hierarchical pipeline leveraging vision-language models for autonomous robotic 3D grasping

Autonomous grasping has long been a central topic in robotics, yet deployment in small and medium-sized enterprises (SMEs) is still hindered by low-level robot programming and the lack of natural language interaction. Recent Vision-Language-Action models (VLAs) allow robots to interpret natural language commands for intuitive interaction and control, but they still exhibit output uncertainty and are not yet well suited to directly generating reliable, precise actions in safety-critical industrial contexts. To address this gap, we present VL-GRiP3, a hierarchical Vision-Language model (VLM)-enabled pipeline for autonomous 3D robotic grasping that bridges natural language interaction and accurate, reliable manipulation in SME settings. The framework decomposes language understanding, perception, and action planning in a transparent modular architecture, improving flexibility and interpretability. Within this architecture, a single VLM backbone handles natural language interpretation, target perception, and high-level action planning. CAD-augmented point cloud registration then mitigates occlusions in single RGB-D views while keeping hardware cost low, and an M2T2-based grasp planner predicts accurate 3D grasp poses that explicitly account for complex object geometry from the augmented point cloud, enabling reliable manipulation of irregular industrial parts. Experiments show that our fine-tuned VLM modules achieve segmentation performance comparable to YOLOv8n, and VL-GRiP3 attains a 94.67% success rate over 150 randomized grasping trials. A comparative evaluation against state-of-the-art end-to-end VLAs further indicates that our modular, CAD-augmented design with explicit 3D grasp pose prediction yields more reliable and controllable behavior for SME manufacturing applications.
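The abstract does not say which registration algorithm aligns the CAD model with the single-view RGB-D data. As a rough illustration of the general idea only, the sketch below assumes that a set of point correspondences between the CAD model and the visible part of the object is already known, and uses the Kabsch least-squares rigid alignment. This is a swapped-in, textbook technique and not necessarily the method used in VL-GRiP3. The function names `kabsch` and `augment_with_cad` and all array names are hypothetical.

```python
import numpy as np


def kabsch(source, target):
    """Least-squares rigid transform (R, t) such that R @ source[i] + t ~ target[i].

    Both inputs are (N, 3) arrays with row-wise correspondences (an assumption
    of this sketch; a real pipeline would have to estimate them).
    """
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (source - src_centroid).T @ (target - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t


def augment_with_cad(observed, cad_model, cad_matched):
    """Fill occluded regions by adding the aligned CAD points to the observation.

    observed    : (N, 3) visible object points in the camera frame
    cad_matched : (N, 3) CAD points matched row-wise to `observed` (assumed given)
    cad_model   : (M, 3) full CAD point cloud in the CAD frame
    """
    R, t = kabsch(cad_matched, observed)
    cad_in_camera = cad_model @ R.T + t  # apply the rigid transform to every CAD point
    return np.vstack([observed, cad_in_camera])
```

Under these assumptions, the merged cloud contains both the measured surface and the CAD-derived points for regions the camera could not see, which is the kind of input a downstream grasp planner could consume.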

VL-GRiP3: A hierarchical pipeline leveraging vision-language models for autonomous robotic 3D grasping / Polonara, M., Yang, X., Carbonari, L., Zhang, X. - In: ROBOTICS AND COMPUTER-INTEGRATED MANUFACTURING. - ISSN 0736-5845. - 100:(2026). [10.1016/j.rcim.2026.103244]

Polonara M. (co-first author); Yang X. (co-first author); Carbonari L.; Zhang X.

2026-01-01


Publication year: 2026
Journal: ROBOTICS AND COMPUTER-INTEGRATED MANUFACTURING
DOI: https://dx.doi.org/10.1016/j.rcim.2026.103244
Keywords: 3D grasping pose; Autonomous grasping; Point cloud augmentation; Robotic action; Vision-language model
FAIR research data (URL): https://github.com/AU-DK-Robotics/VL-GRiP3
Appears in types: 1.1 Journal article

Files in this product:

File	Size	Format
Polonara_VL-GRiP3-hierarchical-pipeline-leveraging_2026.pdf (open access; Type: Publisher's version, published with the publisher's layout; License: Creative Commons)	3 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/358013
