Multiplex Network-Based Representation of Vision Transformers for Visual Explainability / Marchetti, M.; Traini, D.; Ursino, D.; Virgili, L. - In: NEURAL COMPUTING & APPLICATIONS. - ISSN 1433-3058. - 37:29(2025), pp. 24385-24420. [10.1007/s00521-025-11591-x]
Multiplex Network-Based Representation of Vision Transformers for Visual Explainability
M. Marchetti; D. Traini; D. Ursino; L. Virgili
2025-01-01
Abstract
The enormous growth of artificial intelligence (AI), and deep learning (DL) in particular, has led to the widespread use of these systems in a variety of contexts. One DL model capable of addressing complex computer vision tasks is the vision transformer (ViT). Despite its huge success, the reasoning behind the inferences it makes is often unclear, which poses significant challenges in critical scenarios. In this paper, we propose a new approach called MUltiplex Transformer EXplainer (MUTEX), which aims to explain the inferences made by ViTs. MUTEX combines multiplex network-based representations of attention matrices and mask perturbation approaches to provide insight into the inference process of ViTs. By mapping the attention layers of a ViT into a multiplex network, MUTEX is able to analyze the relationships between different parts of the input image and identify the image patches that most influence the inference process. We tested MUTEX on a subset of ImageNet and on BloodMNIST and compared its performance with that of existing visual explainability approaches. In addition, to assess the robustness and adaptability of MUTEX, we conducted a qualitative analysis, along with a hyperparameter and ablation study, which allowed us to further appreciate its potential for the visual explainability of ViTs.
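As a rough illustration of the core step described in the abstract, mapping a ViT's attention layers onto a multiplex network, the following Python sketch builds one graph layer per transformer layer from head-averaged attention weights. This is a minimal sketch, not the paper's implementation: the Hugging Face checkpoint name, the head-averaging choice, and the thresholding rule are assumptions made here for illustration only.

```python
import torch
import networkx as nx
from transformers import ViTImageProcessor, ViTModel

# Illustrative checkpoint; the paper's actual models and datasets may differ.
CHECKPOINT = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(CHECKPOINT)
model = ViTModel.from_pretrained(CHECKPOINT, output_attentions=True)
model.eval()

def attentions_to_multiplex(image, threshold=0.05):
    """Map the ViT's attention layers onto a multiplex network: one weighted
    directed graph per transformer layer, whose nodes are the [CLS] token and
    the image patches, with an edge (i, j) whenever the head-averaged
    attention weight exceeds `threshold` (an assumed pruning rule, not
    necessarily the one used by MUTEX)."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    multiplex = {}  # layer index -> graph layer of the multiplex network
    for layer_idx, attn in enumerate(outputs.attentions):
        # attn has shape (batch, heads, tokens, tokens); average over heads.
        weights = attn[0].mean(dim=0)
        g = nx.DiGraph()
        g.add_nodes_from(range(weights.shape[0]))
        rows, cols = (weights >= threshold).nonzero(as_tuple=True)
        for i, j in zip(rows.tolist(), cols.tolist()):
            g.add_edge(i, j, weight=weights[i, j].item())
        multiplex[layer_idx] = g
    return multiplex
```

From such a representation, patch importance could then be estimated, for example, by aggregating node centralities across the graph layers and validating the resulting mask through perturbation (occluding the top-ranked patches and measuring the drop in the predicted class score); this follow-up step is likewise only a plausible reading of the abstract, not the paper's exact procedure.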
| File | Type | License | Size | Format |
|---|---|---|---|---|
| versione pubblicata.pdf (open access) | Publisher's version (published version with the publisher's layout) | Creative Commons | 2.86 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


