Several techniques for the automatic detection of violent scenes in videos and security footage have appeared in recent years, for example with the goal of relieving authorities from the need to analyze hours of Closed-Circuit TeleVision (CCTV) clips. In this regard, Deep Learning-based techniques such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) have emerged as effective for violence detection. Nevertheless, most of these techniques require significant computational and memory resources to run the automatic detection of violence. Thus, we propose combining an established CNN designed for use in mobile and embedded devices, MobileNetV2, with a recurrent layer to extract the spatio-temporal features of security videos. A lightweight model can run on embedded devices, in an edge computing fashion, for example to process the videos near the camera recording them and thereby preserve privacy. Specifically, we exploit transfer learning, as we use a pre-trained version of MobileNetV2, and we propose two different models that combine it with a Bidirectional Long Short-Term Memory (Bi-LSTM) and a Convolutional LSTM (ConvLSTM). The paper presents accuracy tests of the two models on the AIRTLab dataset and a comparison with more complex models developed in our previous work, in order to evaluate the accuracy drop required to use a model compatible with limited resources. The network composed of MobileNetV2 and the ConvLSTM scores a 94.1% accuracy, against the 96.1% of a model based on a more complex 3D CNN.
Combining a mobile deep neural network and a recurrent layer for violence detection in videos / Contardo, P.; Tomassini, S.; Falcionelli, N.; Dragoni, A. F.; Sernani, P. - 3402:(2023), pp. 35-43. (Paper presented at the 5th International Conference on Recent Trends and Applications in Computer Science and Information Technology, RTA-CSIT 2023, held in alb in 2023).
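As an illustration, the kind of architecture the abstract describes (a frozen, per-frame MobileNetV2 backbone followed by a recurrent layer over the frame sequence) can be sketched in Keras as below. This is a minimal sketch, not the authors' implementation: the clip length, frame size, LSTM hidden size, and pooling choice are all assumptions, and the paper's pre-trained backbone would use `weights="imagenet"` (set to `None` here only so the sketch builds without downloading weights).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Assumed clip shape: 16 frames of 224x224 RGB (not specified in the abstract).
NUM_FRAMES, H, W = 16, 224, 224

# MobileNetV2 backbone applied to every frame. The paper exploits transfer
# learning with a pre-trained network (weights="imagenet"); None is used here
# so the sketch runs offline.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(H, W, 3), include_top=False, weights=None)
backbone.trainable = False  # transfer learning: keep the CNN frozen

# First of the two proposed variants: MobileNetV2 features + Bi-LSTM.
# (The second variant would replace the pooling + Bi-LSTM with a
# layers.ConvLSTM2D over the per-frame feature maps.)
model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, H, W, 3)),
    layers.TimeDistributed(backbone),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.Bidirectional(layers.LSTM(64)),  # hidden size is an assumption
    layers.Dense(1, activation="sigmoid"),  # violent vs. non-violent clip
])
```

A clip batch of shape `(batch, 16, 224, 224, 3)` thus yields one violence probability per clip; keeping the backbone frozen means only the recurrent head is trained, which suits the limited resources of edge devices.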