Innovative techniques for malware and botnet detection in real networks based on advanced traffic analysis

Principi, Lorenzo

This thesis investigates Network Intrusion Detection Systems (NIDS) based on Machine Learning (ML). Because automated monitoring reduces costs and increases efficiency, the development of new ML-based NIDS has become a key focus in both research and industry. However, developing ML-based cybersecurity tools requires substantial real-world knowledge and experience, an aspect that is often overlooked, as shown by the evident gap between research and actual deployment in this area. The main reason for this gap is that the proposed methods cannot be generalised, given the enormous variability of networks. The main objective of this thesis is to propose both innovative systems and in-depth studies that treat this aspect as crucial. Two innovative NIDS are proposed. The first detects threats by monitoring Domain Name System (DNS) traffic and classifying domain names as benign or malicious, using a Long Short-Term Memory (LSTM) neural network trained to distinguish benign domains from those produced by Domain Generation Algorithms (DGAs), which malware uses to connect to the attacker's server. The second applies graph theory to a local area network: it builds a graph by monitoring connections, where each terminal is a node and each connection is an edge. Graph-theoretic metrics extracted from this graph are used to train classification algorithms, which then distinguish connections generated by infected terminals from those generated by uninfected ones. The prototype showed promising results on the CIC2017 dataset. We then study how detection depends more on traffic variability than on good classification metrics, and how combinations of benign and malicious traffic can lead to poor performance.
To this end, real traffic is used, preserving its temporal characteristics rather than reducing it to data in which temporal order does not matter, and the systems are evaluated with metrics that reflect their ability to detect the threat, not just to classify individual elements correctly. Real traffic is rarely used in this line of research because it is scarce and usually unlabelled. In this work, we adopted traffic captures provided by Stratosphere's Malware Capture Facility Project and the TI-2016 DNS dataset.
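
As a rough illustration of the graph-based approach described above, the following minimal sketch builds a directed graph from observed connections (each host a node, each connection an edge) and derives simple per-node metrics that could feed a classifier. The function names, the choice of in-degree and out-degree as metrics, and the sample addresses are all assumptions made for illustration; the abstract does not state which graph metrics or classifiers the thesis actually uses.

```python
def build_graph(connections):
    """Build a directed graph from (source, destination) pairs.

    Returns two dicts mapping each node to the set of its successors
    and to the set of its predecessors.
    """
    out_edges, in_edges = {}, {}
    for src, dst in connections:
        # Register both endpoints so isolated roles still appear as nodes.
        for node in (src, dst):
            out_edges.setdefault(node, set())
            in_edges.setdefault(node, set())
        out_edges[src].add(dst)
        in_edges[dst].add(src)
    return out_edges, in_edges


def node_features(out_edges, in_edges):
    """Per-node metrics (hypothetical choice): out-degree and in-degree."""
    return {
        node: {
            "out_degree": len(out_edges[node]),
            "in_degree": len(in_edges[node]),
        }
        for node in out_edges
    }


def edge_features(features, src, dst):
    """Feature vector for one connection, built from its two endpoints."""
    return [
        features[src]["out_degree"],
        features[src]["in_degree"],
        features[dst]["out_degree"],
        features[dst]["in_degree"],
    ]


# Illustrative connections only; the addresses are made up.
connections = [
    ("10.0.0.5", "10.0.0.1"),
    ("10.0.0.5", "10.0.0.2"),
    ("10.0.0.5", "10.0.0.3"),
    ("10.0.0.2", "10.0.0.1"),
]
out_edges, in_edges = build_graph(connections)
features = node_features(out_edges, in_edges)
print(edge_features(features, "10.0.0.5", "10.0.0.1"))
```

In this toy graph the host that contacts many peers stands out through its out-degree, which is the kind of structural signal a classifier trained on such vectors might pick up.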

This thesis studies Network Intrusion Detection Systems (NIDS) based on the use of machine learning. The automation of monitoring, which reduces costs and increases efficiency, has made the development of new NIDS based on Machine Learning (ML) a key objective in both research and industry. However, developing ML-based cybersecurity tools requires a high level of real-world knowledge and experience, an aspect that is often ignored given the evident gap between research and actual implementation that affects this research area. The main reason is the impossibility of generalising the proposed methods because of the enormous variability of networks. The main objective of this thesis is to propose both innovative systems and in-depth studies that treat this aspect as crucial. The thesis proposes two innovative NIDS. The first detects a threat by monitoring Domain Name System (DNS) traffic and classifying domain names as benign or malicious, using a Long Short-Term Memory neural network trained to distinguish benign domains from those generated by malicious Domain Generation Algorithms, which malware uses to connect to the attacker's server. The second NIDS uses an innovative technique that applies graph theory to a local network, building a graph by monitoring connections, where each terminal is a node and each connection is an edge. This graph is then used to extract graph-theoretic metrics on which classification algorithms are trained; these are then used to distinguish connections generated by infected terminals from those that are not. On the CIC2017 dataset, the prototype showed promising results.
We then describe two studies showing that detection depends more on traffic variability than on good classification metrics, and that combinations of benign and malicious traffic can lead to poor performance. To achieve this, real traffic is used, preserving its temporal characteristics without reducing it to data in which temporal order does not matter, and requiring metrics that reflect the ability of the systems to detect the threat, not just to classify elements correctly. Real traffic is rarely used in this research because of its scarcity and the lack of labelled datasets. In this work, we adopted traffic Packet CAPture (PCAP) files provided by Stratosphere's Malware Capture Facility Project and the TI-2016 DNS dataset.
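
The distinction drawn above between classifying individual elements and detecting a threat can be sketched as follows. This is only an illustrative assumption of what a detection-oriented metric might look like, not the metric defined in the thesis: events are hypothetical `(timestamp, host, true_label, predicted_label)` tuples, and a host raises an alert when at least `min_alerts` positive predictions fall within `window` seconds, so temporal order is preserved.

```python
from collections import defaultdict


def item_accuracy(events):
    """Fraction of individual events whose prediction matches the truth."""
    if not events:
        return 0.0
    correct = sum(1 for _, _, truth, pred in events if truth == pred)
    return correct / len(events)


def alerted_hosts(events, min_alerts=3, window=60.0):
    """Hosts with at least `min_alerts` positive predictions within
    `window` seconds (both thresholds are hypothetical parameters)."""
    flagged_times = defaultdict(list)
    # Process events in temporal order so each list of times is sorted.
    for ts, host, _, pred in sorted(events, key=lambda e: e[0]):
        if pred:
            flagged_times[host].append(ts)
    alerted = set()
    for host, times in flagged_times.items():
        for i in range(len(times) - min_alerts + 1):
            # Compare the first and last of `min_alerts` consecutive flags.
            if times[i + min_alerts - 1] - times[i] <= window:
                alerted.add(host)
                break
    return alerted


# Made-up events: host "A" is infected, host "B" is clean.
events = [
    (0.0, "A", True, True),
    (10.0, "A", True, False),
    (20.0, "A", True, True),
    (30.0, "A", True, True),
    (5.0, "B", False, False),
    (15.0, "B", False, True),
    (200.0, "B", False, True),
]
print(item_accuracy(events))
print(alerted_hosts(events))
```

In this toy example the per-event accuracy is mediocre, yet the infected host is still flagged and the clean host is not, which illustrates why per-element classification scores and threat detection can diverge.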

Innovative techniques for malware and botnet detection in real networks based on advanced traffic analysis / Principi, Lorenzo. - (2025 Mar 21).