Speeding up Vision Transformers Through Reinforcement Learning / Cauteruccio, F.; Marchetti, M.; Traini, D.; Ursino, D.; Virgili, L. - 3741 (2024), pp. 174-184. (Paper presented at the 32nd Italian Symposium on Advanced Database Systems, SEBD 2024, held in Villasimius, Italy, 23-26 June 2024).
Speeding up Vision Transformers Through Reinforcement Learning
F. Cauteruccio; M. Marchetti; D. Traini; D. Ursino; L. Virgili
2024-01-01
Abstract
In recent years, Transformers have led a revolution in Natural Language Processing, and Vision Transformers (ViTs) promise to do the same in Computer Vision. The main obstacle to the widespread use of ViTs is their computational cost: given an image divided into a list of patches, ViTs compute, at each layer, the attention of every patch with respect to all the others. In the literature, many solutions try to reduce the computational cost of attention layers using quantization, knowledge distillation, and input perturbation. In this paper, we contribute to this line of research. In particular, we propose AgentViT, a framework that uses Reinforcement Learning to train an agent whose task is to identify the least important patches during the training of a ViT. Once such patches are identified, AgentViT removes them, thus reducing the number of patches processed by the ViT. Our goal is to reduce the training time of the ViT while maintaining competitive performance.
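To make the idea concrete, the sketch below shows the general mechanism the abstract describes: per-patch importance scores are used to keep only a fraction of the patch tokens before the transformer encoder runs, so attention is computed over fewer tokens. This is a minimal PyTorch illustration, not the authors' AgentViT implementation: the module names, sizes, and the random fallback scoring policy are assumptions; in AgentViT the scores would come from the RL-trained agent.

```python
# Minimal sketch (not the authors' AgentViT code): score each patch token and
# drop the lowest-scoring ones before the encoder, so fewer tokens are attended to.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=32, patch_size=8, dim=64, depth=2, heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.to_patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, keep_ratio=1.0, scores=None):
        # Patchify: (B, 3, H, W) -> (B, N, dim), then add positional embeddings.
        tokens = self.to_patch_embed(x).flatten(2).transpose(1, 2)
        tokens = tokens + self.pos_embed[:, 1:, :]

        if keep_ratio < 1.0:
            # Keep only the top-k patches by importance score. In AgentViT the
            # scores would come from an RL agent; a random policy is used here
            # purely for illustration.
            B, N, _ = tokens.shape
            if scores is None:
                scores = torch.rand(B, N, device=tokens.device)
            k = max(1, int(N * keep_ratio))
            keep_idx = scores.topk(k, dim=1).indices                      # (B, k)
            keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
            tokens = tokens.gather(1, keep_idx)                           # (B, k, dim)

        cls = self.cls_token.expand(tokens.size(0), -1, -1) + self.pos_embed[:, :1, :]
        tokens = torch.cat([cls, tokens], dim=1)
        out = self.encoder(tokens)
        return self.head(out[:, 0])                                       # classify from [CLS]

if __name__ == "__main__":
    model = TinyViT()
    images = torch.randn(4, 3, 32, 32)
    logits = model(images, keep_ratio=0.5)  # the encoder sees only half of the patches
    print(logits.shape)                     # torch.Size([4, 10])
```

With `keep_ratio=0.5`, self-attention runs over roughly half the patch tokens, which is the source of the training-time savings the abstract refers to.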
File | Size | Format
---|---|---
SEBD24.pdf (open access; publisher's version, with the publisher's layout; Creative Commons license) | 900.59 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.