VL-GRiP3: A hierarchical pipeline leveraging vision-language models for autonomous robotic 3D grasping / Polonara, M.; Yang, X.; Carbonari, L.; Zhang, X. - In: ROBOTICS AND COMPUTER-INTEGRATED MANUFACTURING. - ISSN 0736-5845. - 100:(2026). [10.1016/j.rcim.2026.103244]

VL-GRiP3: A hierarchical pipeline leveraging vision-language models for autonomous robotic 3D grasping

Polonara M.; Carbonari L.
2026-01-01

Abstract

Autonomous grasping has long been a central topic in robotics, yet deployment in small and medium-sized enterprises (SMEs) is still hindered by low-level robot programming and the lack of natural language interaction. Recent Vision-Language-Action models (VLAs) allow robots to interpret natural language commands for intuitive interaction and control, but they still exhibit output uncertainty and are not yet well suited to directly generating reliable, precise actions in safety-critical industrial contexts. To address this gap, we present VL-GRiP3, a hierarchical Vision-Language model (VLM)-enabled pipeline for autonomous 3D robotic grasping that bridges natural language interaction and accurate, reliable manipulation in SME settings. The framework decomposes language understanding, perception, and action planning in a transparent modular architecture, improving flexibility and interpretability. Within this architecture, a single VLM backbone handles natural language interpretation, target perception, and high-level action planning. CAD-augmented point cloud registration then mitigates occlusions in single RGB-D views while keeping hardware cost low, and an M2T2-based grasp planner predicts accurate 3D grasp poses that explicitly account for complex object geometry from the augmented point cloud, enabling reliable manipulation of irregular industrial parts. Experiments show that our fine-tuned VLM modules achieve segmentation performance comparable to YOLOv8n, and VL-GRiP3 attains a 94.67% success rate over 150 randomized grasping trials. A comparative evaluation against state-of-the-art end-to-end VLAs further indicates that our modular, CAD-augmented design with explicit 3D grasp pose prediction yields more reliable and controllable behavior for SME manufacturing applications.
3D grasping pose; Autonomous grasping; Point cloud augmentation; Robotic action; Vision-language model
Use this identifier to cite or link to this document: https://hdl.handle.net/11566/358013