A VISION-LANGUAGE MODEL APPROACH FOR OBJECT SEGMENTATION AND ROBOTIC GRASPING / Polonara, M.; Frezza, A.; Palmieri, G.; Carbonari, L. - 5:(2025). (ASME 2025 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, IDETC-CIE 2025, Hilton Anaheim, USA, 2025) [10.1115/DETC2025-168923].
A VISION-LANGUAGE MODEL APPROACH FOR OBJECT SEGMENTATION AND ROBOTIC GRASPING
Polonara M.;Frezza A.;Palmieri G.;Carbonari L.
2025-01-01
Abstract
This work presents a novel approach to robotic manipulation that integrates PaliGemma, a vision-language model, with a grasping model to generate grasping poses. The system uses PaliGemma’s segmentation capabilities to extract object masks from images based on natural language prompts, which enables flexible object recognition without extensive manual programming or predefined object models. The extracted masks are combined with depth data from multiple RGB-D cameras to reconstruct point clouds that represent the workspace. The grasping model then processes these point clouds and predicts grasping poses for manipulation. By combining segmentation with grasp planning, the system handles standard manipulation tasks and adapts to objects with challenging geometries or partially occluded surfaces. The pipeline can also benefit from fine-tuning PaliGemma to improve segmentation accuracy, particularly for custom or rarely encountered objects, which extends its applicability to complex industrial environments. Because the system interprets natural language instructions and adapts to novel objects, it is well suited to dynamic or unstructured environments, and it improves the efficiency of manipulation tasks where rapid adaptation and minimal reprogramming are required.
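The step in which segmentation masks are combined with depth data to form a point cloud can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not describe. It assumes a standard pinhole camera model, a mask that has already been decoded into a binary grid aligned with the depth image, and depth values in metres. The function names `masked_depth_to_points` and `transform_points`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the extrinsics `rotation` and `translation` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def masked_depth_to_points(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project masked depth pixels into 3D camera-frame points.

    Assumption: pinhole model, so a pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx and y = (v - cy) * z / fy.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Skip pixels outside the object mask and invalid depth readings.
            if not keep or z <= 0.0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def transform_points(
    points: Sequence[Point],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> List[Point]:
    """Apply a rigid transform p' = R p + t, e.g. camera frame to a shared frame."""
    transformed: List[Point] = []
    for p in points:
        transformed.append(
            tuple(
                sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
                for i in range(3)
            )
        )
    return transformed


# Toy 2x2 example: one pixel is masked out and one has no valid depth.
depth_image = [[1.0, 2.0], [0.0, 4.0]]
object_mask = [[1, 0], [1, 1]]
camera_points = masked_depth_to_points(
    depth_image, object_mask, fx=2.0, fy=2.0, cx=0.5, cy=0.5
)

# Hypothetical extrinsics: identity rotation, 1 m offset along x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
world_points = transform_points(camera_points, identity, [1.0, 0.0, 0.0])
```

With several RGB-D cameras, one plausible design is to run the back-projection per camera, transform each result into a shared workspace frame with that camera's calibrated extrinsics, and concatenate the lists before passing the merged cloud to the grasping model.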


