Packet Traffic Learning

INFO 523 - Final Project

Project description: Our project aims to develop a predictive model to detect anomalous network behavior using packet-level and statistical features derived from network traffic. With machine learning models, we aim to accurately classify and predict network anomalies, which is essential for intrusion detection, network security monitoring, and incident response automation.
Author: The Anomalists - Joey Garcia, David Kyle

Affiliation: College of Information Science, University of Arizona

Abstract

This project studies the implications and results of using supervised and unsupervised learning methods for classifying network traffic. For supervised learning, we employ the XGBoost classifier to assess the accuracy of detecting anomalous network traffic using labeled data. In parallel, we explore unsupervised learning methods, such as K-Means clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), to determine their capabilities in identifying anomalous traffic without the use of labeled data. What makes this project distinctive is the direct comparison of the strengths and limitations of these two approaches. This comparison can inform how the two strategies might be used in unison in future hybrid approaches to intrusion detection, network security monitoring, and incident response.

Introduction

Network intrusion threats are continuously evolving in the realm of cybersecurity, challenging the effectiveness of traditional security practices. As attackers develop new techniques, it becomes increasingly important to efficiently analyze network traffic to identify anomalous patterns that may indicate potential threats to cyber infrastructure. Leveraging machine learning, particularly unsupervised learning methods, allows for the detection of previously unseen or novel threats without relying solely on labeled data. This adaptability is crucial in an ever-changing threat landscape, enabling proactive identification and mitigation of security risks.

Research Questions

Q1. Using a supervised learning model such as XGBoost, how accurately can we classify network traffic as normal or anomalous? Which features are most influential in driving the model’s predictions?

Q2. Can unsupervised learning methods such as K-Means clustering and DBSCAN detect anomalous network traffic without labeled data? How do they group the network traffic, and how do the results compare to those of the supervised model?

Dataset

We are using the NSL-KDD network intrusion detection training and testing datasets from Kaggle. The training dataset contains 125,973 rows and 43 columns, while the test dataset contains 22,544 rows and 43 columns.

The records include both normal traffic and a variety of attack samples. The dataset is curated to provide challenging cases for machine learning algorithms, with each sample rated on a difficulty scale from 1 to 21, where 21 represents the most difficult to predict.

The attack field indicates whether each observation is normal or belongs to a specific attack type, allowing for multi-class classification of anomalous network activity. For this project, we add a new binary classification feature, is_anomalous, to indicate whether a network connection is anomalous or not. This will serve as the target variable.
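
As a concrete illustration, the binary target can be derived in pandas. This is a minimal sketch, assuming the training data is already loaded in a DataFrame df with the attack labels in a column named attack:

import pandas as pd

# A connection is anomalous (1) whenever the attack label is anything other than "normal".
df["is_anomalous"] = (df["attack"] != "normal").astype(int)

# Check the class balance of the new target.
counts = df["is_anomalous"].map({0: "Normal", 1: "Attack"}).value_counts()
print(pd.DataFrame({"Count": counts, "Percentage": (counts / len(df) * 100).round(2)}))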

Looking at the various attack values, combining all non-normal traffic into a single composite column makes sense:

Once the data is combined into the composite is_anomalous column, the classes are well balanced, making it a solid choice for a target column.

is_anomalous | Count | Percentage
Normal | 67343 | 53.46
Attack | 58630 | 46.54

We chose this dataset because it provides a rich and realistic representation of network traffic. The presence of labeled data allows us to train and evaluate supervised models, while the diversity and complexity of traffic patterns also make it well suited for exploring unsupervised anomaly detection techniques. This balance between complexity and feature richness aligns with our research questions and modeling goals.

Preprocessing (Scaling, Feature Engineering, Dimension Reduction)

To ensure a fair, consistent comparison between supervised and unsupervised approaches, the same feature engineering pipeline will be applied to both. All features are named according to the NSL-KDD documentation. Categorical variables such as protocol_type, service, and flag will be one-hot encoded, while low-variance columns will be removed. Numerical features will be standardized to normalize their ranges.

For supervised models, the processed features will be paired with the binary target is_anomalous. For unsupervised models, the same feature set will be used without labels, allowing the algorithms to uncover structure or detect anomalies. This consistent preprocessing ensures that performance differences are driven by modeling choices rather than data preparation discrepancies.
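
A minimal sketch of this preprocessing pipeline in scikit-learn is shown below; the column lists are taken from the description above, while the variance threshold value is an assumption rather than the project’s final setting. X is assumed to be the raw feature DataFrame:

from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ["protocol_type", "service", "flag"]
numeric_cols = [c for c in X.columns if c not in categorical_cols]

preprocess = Pipeline([
    ("encode_scale", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("scale", StandardScaler(), numeric_cols),
    ])),
    ("drop_low_variance", VarianceThreshold(threshold=0.01)),  # threshold is an assumption
])

X_processed = preprocess.fit_transform(X)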

Given the dataset’s more than 40 features, we explored dimensionality reduction methods such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA); however, for interpretability we chose to use mutual information to select the most informative features. Dimensionality reduction still aids in visualization, mitigates the curse of dimensionality, and can enhance model performance. We will compare results using reduced dimensions versus the full feature set to determine which yields more interpretable and effective models.
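
A hedged sketch of mutual-information-based feature selection, assuming the processed feature matrix X_processed, the target y, and a list feature_names of column names (all names here are assumptions); keeping the top 17 features mirrors the EDA outcome described below:

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Mutual information between each feature and the binary target.
mi = mutual_info_classif(X_processed, y, random_state=42)
mi_scores = pd.Series(mi, index=feature_names).sort_values(ascending=False)

# Keep the most informative features (17 here, matching the EDA result).
top_features = mi_scores.head(17).index.tolist()
print(mi_scores.head(17))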

Exploratory Data Analysis Steps

After the data analysis and feature selection, 17 features remain.

Final Dataset Features

The final dataset used for modeling contains the following features and target variable:

Column Name | Data Type | Description | Notes
src_bytes | int64 | Number of data bytes sent from source to destination. | Numeric (scaled)
dst_bytes | int64 | Number of data bytes sent from destination to source. | Numeric (scaled)
count | int64 | Number of connections to the same host in the past 2 seconds. | Numeric (scaled)
srv_diff_host_rate | float64 | % of connections to different hosts on the same service. | Numeric (scaled)
serror_rate | float64 | % of connections with SYN errors. | Numeric (scaled)
same_srv_rate | float64 | % of connections to the same service. | Numeric (scaled)
diff_srv_rate | float64 | % of connections to different services. | Numeric (scaled)
dst_host_count | int64 | Number of connections to the destination host. | Numeric (scaled)
dst_host_srv_count | int64 | Number of connections to the destination host and service. | Numeric (scaled)
dst_host_same_srv_rate | float64 | % of connections to the same service on the destination host. | Numeric (scaled)
dst_host_diff_srv_rate | float64 | % of connections to different services on the destination host. | Numeric (scaled)
dst_host_same_src_port_rate | float64 | % of connections from the same source port. | Numeric (scaled)
dst_host_srv_diff_host_rate | float64 | % of connections to the same service from different hosts. | Numeric (scaled)
logged_in | int64 | 1 if successfully logged in; 0 otherwise. | Binary indicator
flag_SF | int64 | One-hot encoded: status flag “SF” of the connection. | One-hot encoded categorical
service_http | int64 | One-hot encoded: network service is HTTP. | One-hot encoded categorical
service_private | int64 | One-hot encoded: network service is “private”. | One-hot encoded categorical
is_anomalous | int64 | Target: 1 if connection is anomalous; 0 if normal. | Target variable

Data Validation Visualizations

A correlation heatmap and a PCA projection validate the data and provide a look at its quality following preprocessing.

The heatmap shows some correlation among features, but the models we’re using are robust to correlated inputs.

PCA Projection

PCA projection of the top features shows partial separation between normal and anomalous traffic. While some overlap exists, the clustering indicates meaningful structure that can support classification models.
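
A sketch of how such a two-component PCA projection can be produced, assuming a dense scaled feature matrix X_scaled and the binary target y (variable names are assumptions):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Project the scaled features onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

plt.figure(figsize=(6, 5))
for label, name, color in [(0, "Normal", "tab:blue"), (1, "Anomalous", "tab:red")]:
    mask = np.asarray(y) == label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], s=4, alpha=0.3, label=name, c=color)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Projection: Normal vs. Anomalous Traffic")
plt.legend()
plt.tight_layout()
plt.show()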

The team is satisfied with the results of the EDA and feels the data is in a good position for modeling.

How We’ll Evaluate Performance: Metrics

For the supervised learning methods, we’ll use the F1 score as the primary metric for model evaluation. The F1 score balances precision (minimizing false positives) and recall (minimizing false negatives). Both types of errors, misclassifying normal traffic (a false positive) and missing a true anomaly (a false negative), carry significant impacts, so a metric that balances them is appropriate. Secondary metrics, such as accuracy, ROC AUC, and confusion matrix analysis, will be monitored for additional insight into the model’s behavior. To ensure robust results, we’ll use cross-validation during hyperparameter search to select the model with the best F1 score.
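
For reference, the F1 score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$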

For unsupervised learning methods, we’ll use the silhouette score and the Adjusted Rand Index (ARI) to measure the performance of our K-Means clustering, Gaussian Mixture Model (GMM), and DBSCAN models. Silhouette scores range from -1 to +1; a higher positive score indicates that a data point fits well within its own cluster. Secondly, since we have labeled data, the ARI will measure the similarity between the predicted clusters and the true labels. ARI scores also range from -1 to +1, and higher values indicate better agreement.
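
For reference, the silhouette score for a point $i$ compares its mean intra-cluster distance $a(i)$ to its mean distance to the nearest other cluster $b(i)$:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$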

Model Predictions

Supervised Learning Model

Test Evaluation

We run the best model from the RandomizedSearchCV against the testing dataset to see how well it predicts unseen data, and provide metrics and a confusion matrix to show where the model performed well or poorly.
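
The search object referenced below is not defined in this excerpt; here is a minimal sketch of how it might be constructed, assuming an XGBClassifier and an illustrative (not the project’s actual) parameter grid, with F1 as the search objective per the evaluation plan:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Hyperparameter distributions here are illustrative assumptions.
param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    estimator=XGBClassifier(eval_metric="logloss", random_state=42),
    param_distributions=param_distributions,
    n_iter=30,
    scoring="f1",  # optimize for F1, per the evaluation plan
    cv=5,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)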

# imports assumed earlier in the notebook; shown here for completeness
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, roc_auc_score)

# the class predictions and the predicted probability of a positive--anomalous--score
y_pred = search.best_estimator_.predict(X_test)
y_proba = search.best_estimator_.predict_proba(X_test)[:, 1]

# scores
print("Test F1 Score:", f1_score(y_test, y_pred))
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test ROC AUC:", roc_auc_score(y_test, y_proba))
print("Classification Report:\n", classification_report(y_test, y_pred))

# confusion matrix to see where the model did well or failed
cm = confusion_matrix(y_test, y_pred)
labels = ['Normal (0)', 'Anomalous (1)']

# heatmap of confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()
Test F1 Score: 0.7639387639387639
Test Accuracy: 0.778388928317956
Test ROC AUC: 0.963783985241748
Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.97      0.79      9711
           1       0.97      0.63      0.76     12833

    accuracy                           0.78     22544
   macro avg       0.82      0.80      0.78     22544
weighted avg       0.84      0.78      0.78     22544

The tuned XGBoost model achieved strong overall performance with a test F1 score of 0.76, accuracy of 0.78, and an excellent ROC AUC of 0.96. While the model is highly precise in detecting anomalous traffic, it only recalls about 63% of true anomalies, meaning some attacks are missed. This suggests the model is conservative in raising alerts, favoring fewer false alarms at the cost of letting certain anomalies go undetected.

The differences between the training and testing performance suggest distributional drift between the two datasets. While the model captured patterns extremely well during training, its performance dropped on the test set.

Feature Influence on the Model

To examine which features most strongly influence the model’s predictions, we present a SHAP violin plot. This visualization shows both the direction and magnitude of feature impacts on the classification, highlighting how variables such as src_bytes can contribute to both anomalous and non-anomalous outcomes.
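
A sketch of how such a SHAP violin plot can be generated for the tuned model, assuming the shap package is installed and the fitted best estimator from the search above:

import shap

# TreeExplainer is suited to gradient-boosted tree models like XGBoost.
explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_test)

# Violin-style summary plot: direction and magnitude of each feature's impact.
shap.summary_plot(shap_values, X_test, plot_type="violin")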

The SHAP analysis shows that traffic volume features such as src_bytes and dst_bytes have the strongest influence on the model’s predictions. Additional features, such as repeated service requests, specific services, and login or flag indicators, provide secondary but meaningful contributions to detecting anomalies.

All 17 EDA-refined features impact the classification with varying degrees of influence, suggesting successful data analysis and pre-processing.

Question 1 Result

The tuned XGBoost model classified network traffic with strong results, achieving 0.78 accuracy, 0.76 F1, and 0.96 ROC AUC. SHAP analysis shows all 17 refined features contributed, with src_bytes, dst_bytes, dst_host_srv_count, count, and same_srv_rate most influential. Overall, supervised learning proved effective for distinguishing normal and anomalous traffic, though recall limitations mean some anomalies were missed.

Unsupervised Learning Model

Unsupervised learning methods are essential for discovering hidden patterns and structures in data without relying on labeled outcomes. In network traffic analysis, these techniques can identify unseen anomalies, making them valuable for proactive security monitoring when labeled data isn’t accessible. For this project, we apply three unsupervised machine learning algorithms: K-Means clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and soft clustering with Gaussian Mixture Models (GMM). By evaluating the clustering results with the silhouette score and Adjusted Rand Index, we assess how effectively these models separate normal and anomalous traffic, and compare their strengths and limitations in detecting network threats.

Finding Optimal Parameters

Selecting optimal parameters is crucial for the unsupervised learning algorithms we’re using in this project. Unlike supervised learning methods, unsupervised models don’t have labeled outcomes to guide their learning, so performance is heavily reliant on the algorithm parameters. For K-Means, determining the ideal number of clusters ensures meaningful groupings that reflect the underlying data structure. DBSCAN requires tuning of the epsilon (distance threshold) and minimum samples parameters to accurately identify dense regions and outliers. Soft clustering methods, such as Gaussian Mixture Models, rely on selecting the appropriate number of components to capture the complexity of the underlying data distribution.

Proper parameter selection improves cluster quality, enhances the ability to detect anomalies, and ensures that the models provide actionable insights for network traffic analysis. To achieve this, we use internal validation metrics like silhouette score and, when labels are available, external metrics such as adjusted rand index to guide our parameter choices.

Below are the optimal parameters identified by these visualizations:

Suggested eps: 2.938

Optimal number of clusters (lowest DBI): 19

Optimal number of clusters (BIC): 29
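
The visualizations themselves are not reproduced here; the sketch below shows one common way to derive each of these values (the k-distance elbow for DBSCAN’s eps, the Davies-Bouldin index for K-Means, and BIC for GMM), with the search ranges and min_samples value chosen as assumptions:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

# DBSCAN eps: plot each point's distance to its k-th nearest neighbor, look for the elbow.
k = 5  # tied to min_samples; the value here is an assumption
distances, _ = NearestNeighbors(n_neighbors=k).fit(X_processed).kneighbors(X_processed)
plt.plot(np.sort(distances[:, -1]))
plt.ylabel(f"Distance to {k}th nearest neighbor")
plt.title("k-distance plot for choosing eps")
plt.show()

# K-Means: choose the cluster count with the lowest Davies-Bouldin index (DBI).
dbi = {
    n: davies_bouldin_score(
        X_processed,
        KMeans(n_clusters=n, n_init=10, random_state=42).fit_predict(X_processed),
    )
    for n in range(2, 31)
}
print("Optimal number of clusters (lowest DBI):", min(dbi, key=dbi.get))

# GMM: choose the component count with the lowest Bayesian Information Criterion (BIC).
bic = {
    n: GaussianMixture(n_components=n, random_state=42).fit(X_processed).bic(X_processed)
    for n in range(2, 31)
}
print("Optimal number of clusters (BIC):", min(bic, key=bic.get))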

Unsupervised Learning Performance Metrics Analysis

To evaluate the effectiveness of our unsupervised models, we compared both silhouette scores and adjusted rand index (ARI) across DBSCAN, K-Means, and GMM.

Silhouette Scores:
- DBSCAN: 0.114
- KMeans clustering: 0.414
- GMM clustering: 0.307

Adjusted Rand Index:
- DBSCAN ARI: 0.242
- KMeans ARI: 0.179
- GMM ARI: 0.143
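
A sketch of how these metrics can be computed, assuming cluster assignments from each fitted model (the label variable names here are hypothetical) and the labels y, used only for evaluation, never for fitting:

from sklearn.metrics import adjusted_rand_score, silhouette_score

for name, cluster_labels in [("DBSCAN", dbscan_labels),
                             ("KMeans", kmeans_labels),
                             ("GMM", gmm_labels)]:
    # Silhouette: internal cohesion/separation, no ground truth needed.
    print(name, "silhouette:", silhouette_score(X_processed, cluster_labels))
    # ARI: agreement between the clusters and the true anomaly labels.
    print(name, "ARI:", adjusted_rand_score(y, cluster_labels))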

These results show that K-Means achieved the highest silhouette score, indicating more cohesive clusters, while DBSCAN had the highest ARI, suggesting its clusters most closely matched the true anomaly labels. The unsupervised methods did not reach the performance of the supervised model; however, they still revealed meaningful structure and potential anomalies within the network traffic data.

Unsupervised Learning Feature Importance

The top features identified by both K-Means and GMM match the most influential features found in the supervised XGBoost model. In supervised learning, SHAP analysis also highlighted src_bytes, dst_bytes, and connection rate features as key drivers for distinguishing normal and anomalous traffic.

The similarity demonstrates that both supervised and unsupervised methods are capturing the same underlying patterns in the network data. The consistency in feature importance across modeling approaches suggests these variables are fundamental indicators of network anomalies, regardless of whether labels are available. This strengthens confidence in the reliability and interpretability of the results, and shows that unsupervised models can surface meaningful insights even without explicit supervision.

Top 5 Feature Importances for K-Means Clustering

Feature | Importance
dst_bytes | 74.87
src_bytes | 51.91
dst_host_srv_diff_host_rate | 2.00
diff_srv_rate | 1.52
dst_host_diff_srv_rate | 1.46

Top 5 Feature Importances for Gaussian Mixture Model (GMM)

Feature | Importance
dst_bytes | 61.53
src_bytes | 42.74
dst_host_srv_diff_host_rate | 1.46
dst_host_diff_srv_rate | 1.33
diff_srv_rate | 1.20
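
The report does not show how these importances were computed; one common proxy, offered here as a sketch and not necessarily the method used above, is the spread of fitted cluster centers along each feature, since features that separate centers widely drive the cluster assignments:

import pandas as pd

# Hypothetical proxy: range of cluster centers per feature (wider spread = more influence).
# `kmeans` is a fitted KMeans model and `feature_names` lists the 17 final features;
# for GMM, the fitted component means (gmm.means_) play the same role.
center_spread = kmeans.cluster_centers_.max(axis=0) - kmeans.cluster_centers_.min(axis=0)
importance = pd.Series(center_spread, index=feature_names).sort_values(ascending=False)
print(importance.head(5))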

Conclusion and Insights

After comparing supervised and unsupervised methods, we found that supervised learning (XGBoost) delivered higher accuracy and F1 scores for detecting anomalous network traffic, benefiting from labeled data and feature engineering. The model was able to precisely identify most anomalies, though some recall limitations remained.

Unsupervised methods (K-Means, DBSCAN, and GMM) proved valuable for discovering hidden patterns and potential anomalies without relying on labels. While their clustering performance was generally lower than the supervised model’s, especially in terms of direct anomaly detection, they offered unique strengths in identifying novel or previously unseen threats and provided insights into the underlying structure of network traffic.

All in all, supervised learning is preferable when high-quality labeled data is available, enabling robust and interpretable anomaly detection; it was able to capture the complexity of this network anomaly dataset. Unsupervised methods are essential for proactive monitoring and can complement supervised approaches, especially in dynamic environments where new attack types may emerge. Combining both strategies can lead to more resilient and adaptive network security solutions.