Marchetti, M.; Traini, D.; Ursino, D.; Virgili, L. "Explaining Vision Transformers Through Similarity-based Graphs." 2025 International Joint Conference on Neural Networks (IJCNN'25), Rome, 30 June to 5 July 2025.
Explaining Vision Transformers Through Similarity-based Graphs
M. Marchetti, D. Traini, D. Ursino, L. Virgili
2025-01-01
Abstract
Vision Transformers (ViTs) have gained recognition in computer vision due to their outstanding performance. Despite their success, the explainability of ViT outputs is still a challenging issue. To address it, we propose a novel explainability method that leverages image patch embeddings from each attention layer of a ViT to construct similarity graphs. The latter are used to generate binary masks by exploring paths starting from specific patches. The masks from all layers are then aggregated into a comprehensive heatmap using the coverage bias formula. We tested our method on two Vision Transformer architectures (ViT-Base and DeiT-Base) and a subset of the ImageNet validation set. Using Insertion and Deletion metrics, we demonstrate the effectiveness of our proposed method compared to similar ones in the literature. Finally, we include a qualitative analysis that shows the capabilities of our method to make ViTs more interpretable.
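The pipeline the abstract describes (patch embeddings → similarity graph → path-based binary masks) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the similarity measure (cosine), the edge threshold `threshold=0.5`, and the use of BFS reachability to realize "exploring paths starting from specific patches" are all assumptions, and the paper's coverage bias aggregation formula is not reproduced here.

```python
import numpy as np

def similarity_graph(patch_emb, threshold=0.5):
    """Build a similarity graph over patch embeddings.

    patch_emb: (N, D) array, one embedding per image patch from an
    attention layer. Cosine similarity with a fixed threshold is an
    assumption; the paper does not specify its construction here.
    """
    norms = np.linalg.norm(patch_emb, axis=1, keepdims=True)
    unit = patch_emb / norms
    sim = unit @ unit.T              # (N, N) pairwise cosine similarities
    adj = sim >= threshold           # boolean adjacency matrix
    np.fill_diagonal(adj, False)     # drop self-loops
    return sim, adj

def mask_from_seed(adj, seed):
    """Binary mask over patches reachable from a seed patch (BFS),
    one plausible reading of 'exploring paths from specific patches'."""
    n = adj.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[seed] = True
    frontier = [seed]
    while frontier:
        nxt = []
        for u in frontier:
            for v in np.flatnonzero(adj[u]):
                if not mask[v]:
                    mask[v] = True
                    nxt.append(v)
        frontier = nxt
    return mask
```

For a heatmap, masks computed per layer would then be aggregated; the paper uses its coverage bias formula for this step, which is not shown above.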
| File | Type | License | Size | Format | Access |
|---|---|---|---|---|---|
| Marchetti_Explaining-Vision-Transformers-Through_2025.pdf | Publisher's version (published with the publisher's layout) | All rights reserved | 4.1 MB | Adobe PDF | Archive managers only (copy available on request) |
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.


