Marchetti, M.; Traini, D.; Ursino, D.; Virgili, L. (2025). Efficient Token Pruning in Vision Transformers Using an Attention-Based Multilayer Network. Expert Systems with Applications, 279. ISSN 0957-4174. DOI: 10.1016/j.eswa.2025.127449
Efficient Token Pruning in Vision Transformers Using an Attention-Based Multilayer Network
M. Marchetti; D. Traini; D. Ursino; L. Virgili
2025-01-01
Abstract
Vision Transformers (ViTs), although very successful, have a major limitation: they require significant computational resources. Several approaches have been proposed to reduce these requirements by pruning the data provided as input to ViTs. In this paper, we propose Token Reduction via an Attention-based Multilayer network (TRAM), the first approach that achieves this goal using a multilayer network-based representation of the attention matrices. TRAM works with most ViTs without the need for fine-tuning. It makes several contributions to the literature in this research area; in particular, it provides: (i) a new representation of ViTs based on a multilayer network; (ii) a new approach to evaluating the relevance of tokens, based on a new centrality measure computed on the multilayer network; and (iii) an approach to reducing the number of tokens based on this centrality measure. We validated TRAM by comparing it with several state-of-the-art approaches in an extensive experimental campaign carried out on different image datasets. The results obtained demonstrate not only the efficiency but also the effectiveness of TRAM in reducing the computational load of ViTs while still allowing them to return accurate results.
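To make the pipeline described in the abstract concrete, the following is a minimal, illustrative sketch of centrality-based token pruning over a multilayer network of attention matrices. It is not the authors' implementation: the in-strength centrality, its averaging across layers, the `keep_ratio` parameter, and the function names are all assumptions made for illustration; the paper defines its own centrality measure and reduction rule.

```python
import numpy as np

def multilayer_token_centrality(attn_layers):
    """Aggregate a per-token centrality over a multilayer network whose
    layers are the attention matrices of a ViT (one per head or block).

    attn_layers: list of (n_tokens, n_tokens) row-stochastic matrices,
    where entry (i, j) is the attention that token i pays to token j.
    """
    n = attn_layers[0].shape[0]
    centrality = np.zeros(n)
    for A in attn_layers:
        # Per-layer in-strength: total attention each token receives.
        centrality += A.sum(axis=0)
    # Average across the layers of the multilayer network.
    return centrality / len(attn_layers)

def prune_tokens(tokens, attn_layers, keep_ratio=0.7):
    """Keep the keep_ratio most central tokens, always retaining the
    [CLS] token (assumed to sit at index 0)."""
    c = multilayer_token_centrality(attn_layers)
    n_keep = max(1, int(round(keep_ratio * len(c))))
    order = np.argsort(-c)        # most central tokens first
    order = order[order != 0]     # [CLS] is force-kept separately
    keep = np.sort(np.concatenate(([0], order[:n_keep - 1])))
    return tokens[keep], keep

# Toy usage: 10 tokens with 16-dim embeddings and 3 attention layers.
rng = np.random.default_rng(0)
attn = [rng.dirichlet(np.ones(10), size=10) for _ in range(3)]
tokens = rng.standard_normal((10, 16))
pruned, kept = prune_tokens(tokens, attn, keep_ratio=0.5)
print(kept)  # indices of retained tokens; [CLS] always included
```

Because a scheme of this kind only reads the attention matrices the model already computes, it can be applied to a pretrained ViT without fine-tuning, which is consistent with the plug-in property claimed for TRAM.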
| File | Type | License | Size | Format |
|---|---|---|---|---|
| ESWA25.pdf (open access) | Publisher's version (publisher's layout) | Creative Commons | 3.41 MB | Adobe PDF |