Toward Embodied Intelligence: An Architecture for Natural Dialogue and Action Execution in Assistive Robots / Omer, K.; Monteriu', A. - (2026), pp. 42-47. (12th International Conference on Automation, Robotics and Applications, ICARA 2026 tur 2026) [10.1109/ICARA69401.2026.11480310].

Toward Embodied Intelligence: An Architecture for Natural Dialogue and Action Execution in Assistive Robots

Omer K.; Monteriu' A.
2026-01-01

Abstract

This work presents a new architecture for an assistive mobile robot designed to support the elderly and individuals with disabilities in performing daily indoor tasks. The proposed framework integrates multimodal perception, language-based reasoning, and safety-aware action planning to enable natural and effective two-way communication between humans and robots. At its core, the system utilizes large language models (LLMs) for dialogue management, contextual understanding, and reasoning over fused sensory inputs, including vision, speech, and proprioceptive data. By combining speech recognition, object detection, and local memory modules, the robot not only interprets explicit user commands but also infers implicit intentions, predicts missing information, and requests clarifications when necessary. A dedicated safety layer filters and validates action sequences before execution, ensuring reliability and user safety. The architecture further incorporates short- and long-term memory structures, enabling the robot to maintain a dialogue history and semantic knowledge of the environment. This bidirectional interaction model allows the robot to generate both natural conversational responses and executable action plans in a context-aware manner. Preliminary implementation and testing demonstrate promising performance, bridging the gap between conversational AI and embodied robotic action in real-life assistive scenarios.
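
The abstract states that a safety layer filters and validates action sequences before execution, but it does not describe the rules used. The following is a minimal sketch of that idea only, not the authors' implementation: the action vocabulary, the speed limit, the `Action` type, and the `validate_plan` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical vocabulary and limit; the abstract does not specify the real rules.
ALLOWED_ACTIONS = {"navigate", "pick", "place", "speak"}
MAX_SPEED_M_S = 0.5


@dataclass(frozen=True)
class Action:
    name: str
    target: str
    speed: float = 0.0  # metres per second, only checked for "navigate"


def validate_plan(plan, known_objects):
    """Return (approved, reasons); a plan is approved only if every step passes."""
    reasons = []
    for i, step in enumerate(plan):
        if step.name not in ALLOWED_ACTIONS:
            reasons.append(f"step {i}: unknown action '{step.name}'")
        elif step.name != "speak" and step.target not in known_objects:
            reasons.append(f"step {i}: target '{step.target}' not in memory")
        elif step.name == "navigate" and step.speed > MAX_SPEED_M_S:
            reasons.append(f"step {i}: speed {step.speed} exceeds limit")
    return (not reasons, reasons)


# Example: a plan proposed by the language model is checked against
# the set of places and objects currently held in the robot's memory.
plan = [Action("navigate", "kitchen", 0.3), Action("pick", "cup")]
approved, reasons = validate_plan(plan, {"kitchen", "cup"})
```

In a design like this, the list of rejection reasons could feed back into the dialogue, for example as a clarification request to the user, which is one plausible way to connect the safety check with the two-way interaction the abstract describes.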

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/357536