Explaining Vision Transformers Through Similarity-based Graphs / Marchetti, M.; Traini, D.; Ursino, D.; Virgili, L. - (2025). 2025 International Joint Conference on Neural Networks (IJCNN'25), Rome, 30 June - 05 July 2025.

Explaining Vision Transformers Through Similarity-based Graphs

M. Marchetti; D. Traini; D. Ursino; L. Virgili
2025-01-01

Abstract

Vision Transformers (ViTs) have gained recognition in computer vision due to their outstanding performance. Despite their success, the explainability of ViT outputs is still a challenging issue. To address it, we propose a novel explainability method that leverages image patch embeddings from each attention layer of a ViT to construct similarity graphs. The latter are used to generate binary masks by exploring paths starting from specific patches. The masks from all layers are then aggregated into a comprehensive heatmap using the coverage bias formula. We tested our method on two Vision Transformer architectures (ViT-Base and DeiT-Base) and a subset of the ImageNet validation set. Using Insertion and Deletion metrics, we demonstrate the effectiveness of our proposed method compared to similar ones in the literature. Finally, we include a qualitative analysis that shows the capabilities of our method to make ViTs more interpretable.
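The pipeline described in the abstract can be illustrated with a minimal sketch. The snippet below builds a similarity graph over patch embeddings and extracts a binary mask by exploring the graph from a seed patch; cosine similarity, the threshold value, and breadth-first exploration are all assumptions for illustration, since the paper's exact path-exploration strategy and coverage bias formula are not detailed here.

```python
import numpy as np
from collections import deque

def similarity_graph(embeddings, threshold=0.9):
    # Cosine-similarity adjacency between patch embeddings.
    # The similarity measure and threshold are illustrative assumptions.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normed = embeddings / np.clip(norms, 1e-12, None)
    sim = normed @ normed.T
    adj = sim >= threshold
    np.fill_diagonal(adj, False)  # no self-loops
    return adj

def mask_from_seed(adj, seed):
    # Breadth-first exploration of paths starting from a specific patch:
    # every patch reachable from the seed is included in the binary mask.
    n = adj.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(adj[u]):
            if not mask[v]:
                mask[v] = True
                queue.append(v)
    return mask

# Toy example: four "patch embeddings" forming two similar pairs.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
mask = mask_from_seed(similarity_graph(emb), seed=0)
```

In the full method, one such mask would be produced per attention layer and the per-layer masks then aggregated into a single heatmap.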
Files in this item:
File: Marchetti_Explaining-Vision-Transformers-Through_2025.pdf (archive administrators only)
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 4.1 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/342492
Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: n/a