Mixed-Embeddings and Deep Learning Ensemble for DGA Classification With Limited Training Data


Recent work on detecting Domain Generation Algorithms (DGAs) shows that performance improves when unsupervised neural vector representations of domain names are introduced into the supervised classification process. In this paper we explore the effectiveness of this approach by proposing a novel mixed pre-trained neural embedding model that integrates different vectorized representations of domain names: n-gram streams and words. We used the embeddings with two classifiers, both based on ensemble architectures: a stacking model and an end-to-end multi-input neural architecture. We trained and tested the classifiers on two datasets that differ both in the proportion of legitimate and DGA-generated domain names and in the number and type of DGAs. The results show that our solution offers considerable advantages over state-of-the-art single classifiers, both in classification accuracy and in the detection of challenging DGAs, such as those based on word dictionaries. The improvement is significant in a particularly relevant operating condition, known as few-shot learning, where only a few examples of DGA-generated domain names are available for training the classifier.
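To make the idea of n-gram streams concrete, here is a minimal sketch of how a domain name could be split into overlapping character n-grams before any embedding is learned. It is an illustration only and not the authors' pipeline: the function names `ngram_stream` and `mixed_streams`, the choice of n-gram sizes, and the decision to keep only the leftmost label of the domain are all assumptions made for this example. The word-level representation mentioned in the abstract would need a dictionary or segmenter and is not shown.

```python
def ngram_stream(domain: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a domain name, in order."""
    # Assumption: only the leftmost label is used (e.g. "example" in "example.com").
    label = domain.lower().split(".")[0]
    if len(label) < n:
        # Labels shorter than n yield the label itself (or nothing if empty).
        return [label] if label else []
    return [label[i:i + n] for i in range(len(label) - n + 1)]


def mixed_streams(domain: str, sizes: tuple[int, ...] = (1, 2, 3)) -> dict[int, list[str]]:
    """Build one n-gram stream per size; each stream could feed its own embedding."""
    return {n: ngram_stream(domain, n) for n in sizes}


streams = mixed_streams("example.com")
for n, grams in streams.items():
    print(n, grams)
```

Each stream is a token sequence, so in principle each could be embedded separately and the resulting vectors combined by an ensemble; how the paper actually does this is described in the full text, not here.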

Mixed-Embeddings and Deep Learning Ensemble for DGA Classification With Limited Training Data / Morbidoni, Christian; Cucchiarelli, Alessandro; Spalazzi, Luca. - In: IEEE ACCESS. - ISSN 2169-3536. - 13:(2025), pp. 81167-81187. [10.1109/access.2025.3565022]


Morbidoni, Christian; Cucchiarelli, Alessandro; Spalazzi, Luca

2025-01-01



Year of publication: 2025
Journal: IEEE ACCESS
DOI: https://dx.doi.org/10.1109/access.2025.3565022
Keywords: botnet; deep learning; DGA; Domain generation algorithms; few-shot learning; LSTM; n-grams; pre-trained embeddings
Appears in types: 1.1 Journal article

Files in this item:

File	Size	Format
Mixed-Embeddings_and_Deep_Learning_Ensemble_for_DGA_Classification_With_Limited_Training_Data.pdf (open access; type: publisher's version, published with the publisher's layout; license: Creative Commons)	7.94 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11566/343666
