From Sensing to Understanding: A Spatial Intelligence Paradigm for 3D Artificial Intelligence / Balloni, Emanuele. - (2026 Mar).
From Sensing to Understanding: A Spatial Intelligence Paradigm for 3D Artificial Intelligence
BALLONI, EMANUELE
2026-03-01
Abstract
Recent advances in artificial intelligence, computer vision, and computer graphics have allowed artificial systems to evolve from passive visual perception toward a deeper, structured understanding of three-dimensional environments. This evolution has transformed the concept of spatial intelligence, shifting it from a notion rooted purely in human reasoning to the computational domain. In this context, spatial intelligence is defined as the capability of an artificial system to perceive, represent, interpret, and act upon three-dimensional environments by integrating visual, spatial, and semantic information across the full pipeline. This progress has been shaped by the convergence of multiple disciplines, including computer vision, computer graphics, robotics, embodied agents, and generative world models. Nevertheless, current systems remain fragmented, excelling in specific tasks but lacking a cohesive, human-centered paradigm that links sensing, modeling, and deployment across domains. To address this fragmentation, this thesis introduces a spatial intelligence paradigm for 3D artificial intelligence, grounded in mature AI technologies and aimed at unifying sensing, neural synthesis, generative modeling, and interaction within a coherent, application-oriented approach. Methodologically, the thesis is structured around three interconnected pillars: multimodal sensing, vision-language modeling, and real-world generalization. Within multimodal sensing, the work investigates how heterogeneous spatial data, ranging from multi-view images to 3D assets, can be transformed into coherent 3D representations using neural rendering and generative AI approaches. Two case studies are presented: an end-to-end neural rendering framework for fashion design based on Neural Radiance Fields and 3D Gaussian Splatting, and a comparative framework for cultural heritage that evaluates generative 3D methods in terms of both 2D visual quality and 3D structural fidelity. The second pillar, vision-language modeling, explores how multimodal large language models and diffusion-based generators can bridge linguistic and spatial representations. This is demonstrated through two systems: an XR platform for context-aware, diffusion-driven 3D content generation, and a novel framework for visual reconstruction from EEG brain activity that combines neural decoding with multimodal generation and introduces a boosted reconstruction stage to enhance image quality. Both systems are validated through quantitative metrics and user-centered evaluations. Finally, the real-world generalization pillar addresses the integration of spatial AI models into interactive, human-in-the-loop environments. Specifically, two systems are proposed: a single-image-to-3D generation system that combines multimodal reasoning, multi-view question answering, and iterative refinement through human feedback within a user interface, and an immersive analytics platform for fashion that incorporates 3D product interaction, visual analytics, and trend visualization within an XR environment. Overall, this work presents a unified, operational paradigm for spatial intelligence, demonstrating how modern AI systems can be integrated, from sensing and representation to generation and interaction, across heterogeneous domains such as cultural heritage, fashion, and neuroscience.
Beyond technical contributions, the thesis emphasizes the importance of human-centered design, interpretability, and interaction, positioning spatial intelligence not only as a computational capability, but as a collaborative interface between artificial systems and human creativity, cognition, and decision-making.