Explaining Vision Transformers Through Similarity-based Graphs / Marchetti, M.; Traini, D.; Ursino, D.; Virgili, L. - (2025). 2025 International Joint Conference on Neural Networks (IJCNN'25), Rome, 30 June - 05 July 2025.

Explaining Vision Transformers Through Similarity-based Graphs

M. Marchetti; D. Traini; D. Ursino; L. Virgili
2025-01-01

Abstract

Vision Transformers (ViTs) have gained recognition in computer vision due to their outstanding performance. Despite their success, the explainability of ViT outputs is still a challenging issue. To address it, we propose a novel explainability method that leverages image patch embeddings from each attention layer of a ViT to construct similarity graphs. The latter are used to generate binary masks by exploring paths starting from specific patches. The masks from all layers are then aggregated into a comprehensive heatmap using the coverage bias formula. We tested our method on two Vision Transformer architectures (ViT-Base and DeiT-Base) and a subset of the ImageNet validation set. Using Insertion and Deletion metrics, we demonstrate the effectiveness of our proposed method compared to similar ones in the literature. Finally, we include a qualitative analysis that shows the capabilities of our method to make ViTs more interpretable.
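The pipeline described in the abstract can be illustrated with a minimal sketch. The snippet below builds a similarity graph over patch embeddings and extracts a binary mask by exploring the graph from a seed patch; cosine similarity, the threshold value, and breadth-first exploration are all assumptions for illustration, since the paper's exact path-exploration strategy and coverage bias formula are not detailed here.

```python
import numpy as np
from collections import deque

def similarity_graph(embeddings, threshold=0.9):
    # Cosine-similarity adjacency between patch embeddings.
    # The similarity measure and threshold are illustrative assumptions.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normed = embeddings / np.clip(norms, 1e-12, None)
    sim = normed @ normed.T
    adj = sim >= threshold
    np.fill_diagonal(adj, False)  # no self-loops
    return adj

def mask_from_seed(adj, seed):
    # Breadth-first exploration of paths starting from a specific patch:
    # every patch reachable from the seed is included in the binary mask.
    n = adj.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(adj[u]):
            if not mask[v]:
                mask[v] = True
                queue.append(v)
    return mask

# Toy example: four "patch embeddings" forming two similar pairs.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
mask = mask_from_seed(similarity_graph(emb), seed=0)
```

In the full method, one such mask would be produced per attention layer and the per-layer masks then aggregated into a single heatmap.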
Files in this item:
File: Marchetti_Explaining-Vision-Transformers-Through_2025.pdf (archive administrators only)
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 4.1 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/342492
Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: n/a