Cauteruccio, F.; Marchetti, M.; Traini, D.; Ursino, D.; Virgili, L. Adaptive Patch Selection to Improve Vision Transformers through Reinforcement Learning. Applied Intelligence, 55 (2025). ISSN 1573-7497. DOI: 10.1007/s10489-025-06516-z
Adaptive Patch Selection to Improve Vision Transformers through Reinforcement Learning
F. Cauteruccio; M. Marchetti; D. Traini; D. Ursino; L. Virgili
2025-01-01
Abstract
In recent years, Transformers have revolutionized the management of Natural Language Processing tasks, and Vision Transformers (ViTs) promise to do the same for Computer Vision ones. However, the adoption of ViTs is hampered by their computational cost. Indeed, given an image divided into patches, it is necessary to compute, at each layer, the attention of each patch with respect to all the others. Researchers have proposed many solutions to reduce the computational cost of attention layers by adopting techniques such as quantization, knowledge distillation, and manipulation of input images. In this paper, we aim to contribute to the solution of this problem. In particular, we propose a new framework, called AgentViT, which uses Reinforcement Learning to train an agent that selects the most important patches to improve the learning of a ViT. The goal of AgentViT is to reduce the number of patches processed by a ViT, and thus its computational load, while still maintaining competitive performance. We tested AgentViT on CIFAR10, FashionMNIST, and Imagenette+ (which is a subset of ImageNet) on the image classification task and obtained promising performance when compared to baseline ViTs and other related approaches available in the literature.
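The quadratic cost the abstract refers to can be seen in a minimal sketch: self-attention over N patches builds an N x N score matrix, so keeping only a subset of patches shrinks that matrix quadratically. The norm-based selection below is a hypothetical stand-in for the paper's RL agent, which is not reproduced here.

```python
import numpy as np

def self_attention(x):
    """Plain single-head self-attention over patch embeddings.

    x: (num_patches, dim) array. The (num_patches, num_patches) score
    matrix is what makes the cost quadratic in the number of patches.
    """
    scores = x @ x.T / np.sqrt(x.shape[1])                 # (N, N) scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)          # row-wise softmax
    return weights @ x                                     # (N, dim) output

rng = np.random.default_rng(0)
num_patches, dim = 64, 32
patches = rng.normal(size=(num_patches, dim))

# Hypothetical importance criterion: keep the 16 patches with the largest
# L2 norm (a placeholder for the agent's learned selection policy).
keep = np.argsort(np.linalg.norm(patches, axis=1))[-16:]
selected = patches[keep]

full = self_attention(patches)      # 64x64 score matrix internally
reduced = self_attention(selected)  # 16x16 score matrix: 16x fewer entries
print(full.shape, reduced.shape)    # (64, 32) (16, 32)
```

Dropping 75% of the patches here cuts the attention score matrix from 4096 to 256 entries, which is the kind of saving the framework targets, at the risk of discarding informative patches if the selection policy is poor.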
File | Size | Format
---|---|---
APIN25.pdf (open access; publisher's version, with the publisher's layout; Creative Commons license) | 2.23 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.