Speeding up Vision Transformers Through Reinforcement Learning

IRIS

In recent years, Transformers have led a revolution in Natural Language Processing, and Vision Transformers (ViTs) promise to do the same in Computer Vision. The main obstacle to the widespread use of ViTs is their computational cost. Indeed, given an image divided into a list of patches, ViTs compute, at each layer, the attention of each patch with respect to all others. In the literature, many solutions try to reduce the computational cost of attention layers using quantization, knowledge distillation, and input perturbation. In this paper, we contribute to this line of work by proposing AgentViT, a framework that uses Reinforcement Learning to train an agent whose task is to identify the least important patches during the training of a ViT. Once such patches are identified, AgentViT removes them, thus reducing the number of patches processed by the ViT. Our goal is to reduce the training time of the ViT while maintaining competitive performance.
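The patch-removal idea in the abstract can be illustrated with a minimal sketch. It is not the AgentViT implementation: the importance scores below are random placeholders standing in for whatever the trained agent would output, and the names `prune_patches`, `attention_pairs`, and `keep_ratio` are hypothetical. The sketch only shows that, because every patch attends to every other patch, dropping patches shrinks the number of attention scores quadratically.

```python
import numpy as np

def prune_patches(patches, scores, keep_ratio):
    """Keep the highest-scoring patches, preserving their original order.

    patches:    array of shape (num_patches, dim)
    scores:     array of shape (num_patches,), higher means more important
                (hypothetical stand-in for the agent's output)
    keep_ratio: fraction of patches to keep, in (0, 1]
    """
    num_patches = patches.shape[0]
    num_keep = max(1, int(round(num_patches * keep_ratio)))
    # Indices of the num_keep largest scores, re-sorted to keep spatial order.
    keep_idx = np.sort(np.argsort(scores)[-num_keep:])
    return patches[keep_idx], keep_idx

def attention_pairs(num_patches):
    """Number of patch-to-patch attention scores per head, per layer."""
    return num_patches * num_patches

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))  # assumed: 14x14 grid of 64-dim patch embeddings
scores = rng.random(196)              # placeholder importance scores, not a trained agent
kept, keep_idx = prune_patches(patches, scores, keep_ratio=0.5)

print(kept.shape, attention_pairs(196), attention_pairs(kept.shape[0]))
# prints: (98, 64) 38416 9604
```

Under these assumptions, keeping half of the patches cuts the attention scores per head and layer to a quarter, which is the kind of saving the abstract targets during training.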

Speeding up Vision Transformers Through Reinforcement Learning / Cauteruccio, F.; Marchetti, M.; Traini, D.; Ursino, D.; Virgili, L. - 3741:(2024), pp. 174-184. (32nd Italian Symposium on Advanced Database Systems, SEBD 2024, Villasimius, Italy, 23-26 June 2024).


F. Cauteruccio; M. Marchetti; D. Traini; D. Ursino; L. Virgili

2024-01-01



Year of publication

2024

Appears in types:

4.1 Contribution in conference proceedings

Files in this item:

File	Size	Format
SEBD24.pdf (open access; type: publisher's version, published with the publisher's layout; license: Creative Commons)	900.59 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/328951
