Multiplex Network-Based Representation of Vision Transformers for Visual Explainability / Marchetti, M.; Traini, D.; Ursino, D.; Virgili, L. - In: NEURAL COMPUTING & APPLICATIONS. - ISSN 1433-3058. - 37:29 (2025), pp. 24385-24420. [DOI: 10.1007/s00521-025-11591-x]

Multiplex Network-Based Representation of Vision Transformers for Visual Explainability

M. Marchetti; D. Traini; D. Ursino; L. Virgili
2025-01-01

Abstract

The enormous growth of artificial intelligence (AI), and deep learning (DL) in particular, has led to the widespread use of these systems in a variety of contexts. One DL model capable of addressing complex computer vision tasks is the vision transformer (ViT). Despite its huge success, the reasoning behind the inferences it makes is often unclear, which poses significant challenges in critical scenarios. In this paper, we propose a new approach called MUltiplex Transformer EXplainer (MUTEX), which aims to explain the inferences made by ViTs. MUTEX combines multiplex network-based representations of attention matrices and mask perturbation approaches to provide insight into the inference process of ViTs. By mapping the attention layers of a ViT into a multiplex network, MUTEX is able to analyze the relationships between different parts of the input image and identify the image patches that most influence the inference process. We tested MUTEX on a subset of ImageNet and on BloodMNIST and compared its performance with that of existing visual explainability approaches. In addition, to assess the robustness and adaptability of MUTEX, we conducted a qualitative analysis, along with a hyperparameter and ablation study, which allowed us to further appreciate its potential for the visual explainability of ViTs.
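To make the idea of a multiplex representation of attention more concrete, the sketch below builds one directed, weighted graph per encoder layer over a shared set of token nodes and scores patches by the attention they receive from the [CLS] token. This is only an illustrative toy, not the MUTEX procedure described in the paper: the tensor shapes, the head-averaged random attention stand-in, and the [CLS]-based aggregation are assumptions made for the example.

```python
import numpy as np
import networkx as nx

# Illustrative sketch only: the abstract does not specify the exact construction,
# so the shapes and the aggregation rule below are assumptions.

n_layers, n_tokens = 12, 197          # e.g. ViT-B/16 on 224x224: [CLS] + 14*14 patches
rng = np.random.default_rng(0)

# Stand-in for per-layer, head-averaged attention matrices (one per encoder layer).
attn = rng.random((n_layers, n_tokens, n_tokens))
attn /= attn.sum(axis=-1, keepdims=True)          # rows sum to 1, like softmax attention

# Multiplex network: the same node set (tokens) replicated across layers,
# with one directed weighted graph per attention layer.
multiplex = [nx.from_numpy_array(attn[l], create_using=nx.DiGraph)
             for l in range(n_layers)]

# One simple way to rank patch influence: average the attention that each patch
# receives from the [CLS] token (node 0) across the layers of the multiplex network.
cls_to_patches = attn[:, 0, 1:].mean(axis=0)      # shape: (n_tokens - 1,)
top_patches = np.argsort(cls_to_patches)[::-1][:10]
print("Most influential patch indices (toy example):", top_patches)
```

In the toy example above, each graph in `multiplex` is one layer of the multiplex network, and `top_patches` would correspond to the image patches highlighted in an explanation heatmap; MUTEX additionally combines this kind of representation with mask perturbation, which is not reproduced here.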
2025
Attention Mechanism; Computer Vision; Multiplex Network; Vision Transformer; Visual Explainability
Files in this item:
File: versione pubblicata.pdf (open access)
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 2.86 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/346672
Citations
  • Scopus: 0