Packet Traffic Learning
INFO 523 - Final Project
Abstract
This project studies the implications and results of using supervised and unsupervised learning methods for classifying network traffic. For supervised learning, we employ the XGBoost classifier to assess how accurately anomalous network traffic can be detected using labeled data. In parallel, we explore unsupervised learning methods, such as K-Means clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), to determine their ability to identify anomalous traffic without labeled data. The project is distinctive in that it directly compares the strengths and limitations of these two approaches, informing how supervised and unsupervised detection could be used in unison in future hybrid approaches to intrusion detection, network security monitoring, and incident response.
Introduction
Network intrusion threats are continuously evolving in the realm of cybersecurity, challenging the effectiveness of traditional security practices. As attackers develop new techniques, it becomes increasingly important to efficiently analyze network traffic to identify anomalous patterns that may indicate potential threats to cyber infrastructure. Leveraging machine learning, particularly unsupervised learning methods, allows for the detection of previously unseen or novel threats without relying solely on labeled data. This adaptability is crucial in an ever-changing threat landscape, enabling proactive identification and mitigation of security risks.
Research Questions
Q1. Using a supervised learning model such as XGBoost, how accurately can we classify network traffic as normal or anomalous? Which features are most influential in driving the model’s predictions?
Q2. Can unsupervised learning methods such as K-Means clustering and DBSCAN detect anomalous network traffic without labeled data? How do they group the network traffic, and how do the results compare to the supervised model?
Dataset
We are using the NSL-KDD network intrusion detection training and testing datasets from Kaggle. The training dataset contains 125,973 rows and 43 columns, while the test dataset contains 22,544 rows and 43 columns.
The records include both normal traffic and a variety of attack samples. The dataset is curated to provide challenging cases for machine learning algorithms, with each sample rated on a difficulty scale from 1 to 21, where 21 represents the most difficult to predict.
The attack field indicates whether each observation is normal or belongs to a specific attack type, allowing for multi-class classification of anomalous network activity. For this project, we add a new binary classification feature, is_anomalous, to indicate whether a network connection is anomalous or not. This will serve as the target variable.
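A minimal sketch of how this target can be derived from the attack label (a toy DataFrame stands in here for the loaded NSL-KDD training data; names are illustrative, not the project's exact code):

```python
import pandas as pd

# toy stand-in for the loaded NSL-KDD training frame and its 'attack' column
train_df = pd.DataFrame({"attack": ["normal", "neptune", "smurf", "normal"]})

# any record whose attack label is not 'normal' is flagged as anomalous
train_df["is_anomalous"] = (train_df["attack"] != "normal").astype(int)
print(train_df["is_anomalous"].value_counts())
```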
Looking at the various attack values, combining all non-normal traffic into a composite column makes sense. Once the data is combined into the composite is_anomalous column, the classes are well balanced, making this a solid choice for a target column.
is_anomalous | Count | Percentage |
---|---|---|
Normal | 67343 | 53.46 |
Attack | 58630 | 46.54 |
We chose this dataset because it provides a rich and realistic representation of network traffic. The presence of labeled data allows us to train and evaluate supervised models, while the diversity and complexity of traffic patterns also make it well suited for exploring unsupervised anomaly detection techniques. This balance between complexity and feature richness aligns with our research questions and modeling goals.
Preprocessing (Scaling, Feature Engineering, Dimension Reduction)
To ensure a fair, consistent comparison between supervised and unsupervised approaches, the same feature engineering pipeline will be applied to both. All features will be named according to the NSL-KDD documentation. Categorical variables such as protocol_type, service, and flag will be one-hot encoded, while low-variance columns will be removed. Numerical features will be standardized to normalize their ranges.
For supervised models, the processed features will be paired with the binary target is_anomalous. For unsupervised models, the same feature set will be used without labels, allowing the algorithms to uncover structure or detect anomalies. This consistent preprocessing ensures that performance differences are driven by modeling choices rather than data preparation discrepancies.
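A minimal sketch of this shared preprocessing, assuming the raw training and test features are loaded as DataFrames X_train_raw and X_test_raw (names and the zero-variance threshold are illustrative assumptions):

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ["protocol_type", "service", "flag"]
numeric_cols = [c for c in X_train_raw.columns if c not in categorical_cols]

# one-hot encode the categorical columns, standardize the numeric ones,
# then drop any resulting constant (zero-variance) columns
preprocess = Pipeline([
    ("encode_scale", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("scale", StandardScaler(), numeric_cols),
    ])),
    ("drop_constant", VarianceThreshold(threshold=0.0)),
])

# fit on training data only, then apply the same transform to the test set
X_train_prep = preprocess.fit_transform(X_train_raw)
X_test_prep = preprocess.transform(X_test_raw)
```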
Given the dataset’s more than 40 features, we explored dimensionality reduction methods such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA); however, for interpretability we chose to use mutual information to select the most informative features. Dimensionality reduction still aids in visualization, mitigates the curse of dimensionality, and can potentially enhance model performance, so we compare results using reduced dimensions versus the full feature set to determine which yields more interpretable and effective models.
After the data analysis and feature selection, 17 features remain.
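A hedged sketch of the mutual-information selection step, assuming the preprocessed training matrix X_train_prep, the target y_train, and a feature_names list from the step above (the final set of 17 was also shaped by the broader data analysis, not this call alone):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# score each preprocessed feature by mutual information with the binary target
selector = SelectKBest(score_func=mutual_info_classif, k=17)
X_train_sel = selector.fit_transform(X_train_prep, y_train)

mi_scores = pd.Series(selector.scores_, index=feature_names).sort_values(ascending=False)
print(mi_scores.head(17))
```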
Final Dataset Features
The final dataset used for modeling contains the following features and target variable:
Column Name | Data Type | Description | Notes |
---|---|---|---|
src_bytes | int64 | Number of data bytes sent from source to destination. | Numeric (scaled) |
dst_bytes | int64 | Number of data bytes sent from destination to source. | Numeric (scaled) |
count | int64 | Number of connections to the same host in the past 2 seconds. | Numeric (scaled) |
srv_diff_host_rate | float64 | % of connections to different hosts on the same service. | Numeric (scaled) |
serror_rate | float64 | % of connections with SYN errors. | Numeric (scaled) |
same_srv_rate | float64 | % of connections to the same service. | Numeric (scaled) |
diff_srv_rate | float64 | % of connections to different services. | Numeric (scaled) |
dst_host_count | int64 | Number of connections to the destination host. | Numeric (scaled) |
dst_host_srv_count | int64 | Number of connections to the destination host and service. | Numeric (scaled) |
dst_host_same_srv_rate | float64 | % of connections to the same service on the destination host. | Numeric (scaled) |
dst_host_diff_srv_rate | float64 | % of connections to different services on the destination host. | Numeric (scaled) |
dst_host_same_src_port_rate | float64 | % of connections from the same source port. | Numeric (scaled) |
dst_host_srv_diff_host_rate | float64 | % of connections to the same service from different hosts. | Numeric (scaled) |
logged_in | int64 | 1 if successfully logged in; 0 otherwise. | Binary indicator |
flag_SF | int64 | One-hot encoded: Status flag “SF” of the connection. | One-hot encoded categorical |
service_http | int64 | One-hot encoded: Network service is HTTP. | One-hot encoded categorical |
service_private | int64 | One-hot encoded: Network service is “private”. | One-hot encoded categorical |
is_anomalous | int64 | Target: 1 if connection is anomalous; 0 if normal. | Target variable |
Data Validation Visualizations
A correlation heatmap and a PCA projection validate the data and provide a look at its quality following preprocessing.
The heatmap shows some correlation among features, but the models we use, notably tree-based XGBoost, handle correlated inputs well.
PCA Projection
PCA projection of the top features shows partial separation between normal and anomalous traffic. While some overlap exists, the clustering indicates meaningful structure that can support classification models.
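A brief sketch of how such a projection could be drawn, assuming the selected, scaled training features X_train_sel and labels y_train from the earlier steps (variable names are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the selected features onto the first two principal components and
# color the points by the known label purely to validate the structure
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X_train_sel)

plt.figure(figsize=(6, 5))
plt.scatter(coords[:, 0], coords[:, 1], c=y_train, cmap="coolwarm", s=5, alpha=0.4)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Projection of Selected Features")
plt.tight_layout()
plt.show()
```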
The team is satisfied with the results of the EDA and feels the data is in a good position for modeling.
How We’ll Evaluate Performance
For the supervised learning methods, we’ll use the F1 score as the primary metric for model evaluation. The F1 score balances precision (minimizing false positives) and recall (minimizing false negatives). Both error types carry significant impacts here: misclassifying normal traffic raises a false alarm, while missing a true anomaly lets an attack go undetected. Secondary metrics, such as accuracy, ROC AUC, and confusion matrix analysis, will be monitored for additional insight into the model’s behavior. To ensure robust results, we’ll use stratified cross-validation and select hyperparameters that maximize the F1 score.
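For reference, the F1 score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$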
For unsupervised learning methods, we’ll use the silhouette score and the Adjusted Rand Index (ARI) to measure the performance of our K-Means clustering, Gaussian Mixture Model (GMM), and DBSCAN models. Silhouette scores range from -1 to +1; a higher positive score indicates that a data point fits well within its own cluster. Secondly, since we have labeled data, the ARI measures similarity between the predicted clusters and the true labels. ARI scores also range from -1 to +1, and higher values indicate better agreement.
Model Predictions
Supervised Learning Model
XGB Random Search
For the supervised model, we start by performing a randomized hyperparameter search over an XGBoost classifier using stratified 5-fold cross-validation, selecting the best model based on F1 score.
import numpy as np
from pprint import pprint
from xgboost import XGBClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# normal vs. anomalous samples
neg, pos = np.bincount(y_train)

# gives the minority class more weight during training (our balance is pretty good already)
scale_pos_weight = neg / pos
print(f"Computed scale_pos_weight: {scale_pos_weight:.2f}")

# set up classifier for binary classification
# uses histogram-based tree building--tree_method='hist'
# includes class weighting calculated above
xgb = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    tree_method='hist',
    random_state=42,
    n_jobs=-1,
    scale_pos_weight=scale_pos_weight
)

# hyperparameter options for optimization
param_dist = {
    'n_estimators': [330, 340, 350],
    'max_depth': [13, 14, 15],
    'learning_rate': [0.050, 0.055, 0.1],
    'subsample': [0.93, 0.95, 0.97],
    'colsample_bytree': [0.60, 0.625, 0.65],
    'gamma': [0.04, 0.05, 0.06],
    'reg_alpha': [2.5, 2.75, 3.0],
    'reg_lambda': [2.25, 2.5, 2.75]
}

# 5-fold stratified CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# we are looking for the best F1 score
f1 = make_scorer(f1_score, average='binary')

# randomized search with cross-validation
search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_dist,
    n_iter=3,
    scoring=f1,
    cv=cv,
    random_state=42,
    n_jobs=1,
    verbose=0
)
search.fit(X_train, y_train)
print(f"Total fits run: {len(search.cv_results_['mean_test_score']) * cv.get_n_splits()}")
print(f"Best F1 score: {search.best_score_:.4f}")
print("Best params:")
pprint(search.best_params_)
Computed scale_pos_weight: 1.15
Total fits run: 15
Best F1 score: 0.9981
Best params:
{'colsample_bytree': 0.65,
'gamma': 0.05,
'learning_rate': 0.055,
'max_depth': 13,
'n_estimators': 340,
'reg_alpha': 2.75,
'reg_lambda': 2.75,
'subsample': 0.97}
The randomized search evaluated 15 total fits across three parameter combinations and five cross-validation folds. The best model achieved an exceptionally high F1 score of 0.9981, indicating a nearly perfect balance of precision and recall in detecting anomalies. While results of this quality often suggest data leakage, none is present here. More likely, the NSL-KDD training data is a curated benchmark with strong, learnable structure, and the relatively simple pipeline and a well-tuned model captured those relationships.
Test Evaluation
We run the best estimator from the RandomizedSearchCV against the testing dataset to see how well it predicts unseen data, and report metrics and a confusion matrix to show where the model performed well or poorly.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

# the class predictions and the predicted probability of a positive--anomalous--score
y_pred = search.best_estimator_.predict(X_test)
y_proba = search.best_estimator_.predict_proba(X_test)[:, 1]

# scores
print("Test F1 Score:", f1_score(y_test, y_pred))
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test ROC AUC:", roc_auc_score(y_test, y_proba))
print("Classification Report:\n", classification_report(y_test, y_pred))

# confusion matrix to see where the model did well or failed
cm = confusion_matrix(y_test, y_pred)
labels = ['Normal (0)', 'Anomalous (1)']

# heatmap of confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()
Test F1 Score: 0.7639387639387639
Test Accuracy: 0.778388928317956
Test ROC AUC: 0.963783985241748
Classification Report:
precision recall f1-score support
0 0.67 0.97 0.79 9711
1 0.97 0.63 0.76 12833
accuracy 0.78 22544
macro avg 0.82 0.80 0.78 22544
weighted avg 0.84 0.78 0.78 22544
The tuned XGBoost model achieved strong overall performance with a test F1 score of 0.76, accuracy of 0.78, and an excellent ROC AUC of 0.96. While the model is highly precise in detecting anomalous traffic, it only recalls about 63% of true anomalies, meaning some attacks are missed. This suggests the model is conservative in raising alerts, favoring fewer false alarms at the cost of letting certain anomalies go undetected.
The differences between training and testing performance suggest distributional drift between the two datasets: the NSL-KDD test set intentionally includes attack types that do not appear in the training data. While the model captured the training patterns extremely well, its recall dropped on these unseen attacks in the test set.
Feature Influence on the Model
To examine which features most strongly influence the model’s predictions, we present a SHAP violin plot. This visualization shows both the direction and magnitude of feature impacts on the classification, highlighting how variables such as src_bytes can contribute to both anomalous and non-anomalous outcomes.
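As a rough sketch, not the project's exact plotting code, a violin-style SHAP summary for the tuned model could be produced along these lines, reusing the fitted search object and X_test from the evaluation step:

```python
import shap

# explain the tuned XGBoost model's predictions on the test features
explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_test)

# violin-style summary: direction and magnitude of each feature's impact
shap.summary_plot(shap_values, X_test, plot_type="violin")
```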
The SHAP analysis shows that traffic volume features such as src_bytes and dst_bytes have the strongest influence on the model’s predictions. Additional features, such as repeated service requests, specific services, and login or flag indicators, provide secondary but meaningful contributions to detecting anomalies.
All 17 EDA-refined features impact the classification with varying degrees of influence, suggesting successful data analysis and pre-processing.
Question 1 Result
The tuned XGBoost model classified network traffic with strong results, achieving 0.78 accuracy, 0.76 F1, and 0.96 ROC AUC. SHAP analysis shows all 17 refined features contributed, with src_bytes, dst_bytes, dst_host_srv_count, count, and same_srv_rate most influential. Overall, supervised learning proved effective for distinguishing normal and anomalous traffic, though recall limitations mean some anomalies were missed.
Unsupervised Learning Model
Unsupervised learning methods are essential for discovering hidden patterns and structures in data without relying on labeled outcomes. In network traffic analysis, these techniques can identify unseen anomalies, making them valuable for proactive security monitoring when labeled data isn’t accessible. For this project, we apply three unsupervised machine learning algorithms: K-Means clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and soft clustering via Gaussian Mixture Models (GMM). By evaluating the clustering results with the silhouette score and Adjusted Rand Index, we assess how effectively these models separate normal and anomalous traffic, and compare their strengths and limitations in detecting network threats.
Finding Optimal Parameters
Selecting optimal parameters is crucial for the unsupervised learning algorithms used in this project. Unlike supervised methods, unsupervised models have no labeled outcomes to guide their learning, so performance is heavily reliant on the algorithm parameters. For K-Means, determining the ideal number of clusters ensures meaningful groupings that reflect the underlying data structure. DBSCAN requires tuning of the epsilon (distance threshold) and minimum samples parameters to accurately identify dense regions and outliers. Soft clustering methods, such as Gaussian Mixture Models, rely on selecting the appropriate number of components to capture the complexity of the underlying data distribution.
Proper parameter selection improves cluster quality, enhances the ability to detect anomalies, and ensures that the models provide actionable insights for network traffic analysis. To achieve this, we use internal validation metrics like silhouette score and, when labels are available, external metrics such as adjusted rand index to guide our parameter choices.
Below are the visualizations that identify the optimal parameters:
- Suggested eps: 2.938
- Optimal number of clusters (lowest DBI): 19
- Optimal number of clusters (BIC): 29
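As an illustrative sketch of how these diagnostics could be computed (the neighbor count, candidate ranges, and the scaled feature matrix X_scaled are assumptions, not the project's exact settings):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

# k-distance curve for DBSCAN: the "knee" of the sorted k-th neighbor
# distances suggests a value for eps
nn = NearestNeighbors(n_neighbors=10).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])

# Davies-Bouldin index over candidate k for K-Means (lower is better)
dbi = {k: davies_bouldin_score(
          X_scaled,
          KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled))
       for k in range(2, 31)}

# BIC over candidate component counts for the GMM (lower is better)
bic = {k: GaussianMixture(n_components=k, random_state=42).fit(X_scaled).bic(X_scaled)
       for k in range(2, 31)}

print("Best k by DBI:", min(dbi, key=dbi.get))
print("Best components by BIC:", min(bic, key=bic.get))
```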
Unsupervised Learning Performance Metrics Analysis
To evaluate the effectiveness of our unsupervised models, we compared both silhouette scores and adjusted rand index (ARI) across DBSCAN, K-Means, and GMM.
Silhouette Scores:
- DBSCAN: 0.114
- KMeans clustering: 0.414
- GMM clustering: 0.307
Adjusted Rand Index:
- DBSCAN ARI: 0.242
- KMeans ARI: 0.179
- GMM ARI: 0.143
These results show that K-Means achieved the highest silhouette score, indicating more cohesive clusters, while DBSCAN had the highest ARI, suggesting its clusters most closely matched the true anomaly labels. The unsupervised methods did not reach the performance of the supervised model; however, they still revealed meaningful structure and potential anomalies within the network traffic data.
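A condensed sketch of how these three models and scores could be produced, plugging in the parameter values found above (min_samples, the silhouette sampling, and the use of y_train only for ARI evaluation are assumptions):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.mixture import GaussianMixture

models = {
    "KMeans": KMeans(n_clusters=19, n_init=10, random_state=42),
    "DBSCAN": DBSCAN(eps=2.938, min_samples=10),
    "GMM": GaussianMixture(n_components=29, random_state=42),
}

for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    # silhouette on a sample to keep the pairwise-distance cost manageable
    sil = silhouette_score(X_scaled, labels, sample_size=10000, random_state=42)
    # true labels are used only for evaluation, never for fitting
    ari = adjusted_rand_score(y_train, labels)
    print(f"{name}: silhouette={sil:.3f}, ARI={ari:.3f}")
```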
Unsupervised Learning Feature Importance
The top features identified by both K-Means and GMM match the most influential features found in the supervised XGBoost model. In supervised learning, SHAP analysis also highlighted src_bytes, dst_bytes, and connection rate features as key drivers for distinguishing normal and anomalous traffic.
The similarity demonstrates that both supervised and unsupervised methods are capturing the same underlying patterns in the network data. The consistency in feature importance across modeling approaches suggests these variables are fundamental indicators of network anomalies, regardless of whether labels are available. This strengthens confidence in the reliability and interpretability of the results, and shows that unsupervised models can surface meaningful insights even without explicit supervision.
Top 5 Feature Importances for K-Means Clustering
Feature | Importance |
---|---|
dst_bytes | 74.87 |
src_bytes | 51.91 |
dst_host_srv_diff_host_rate | 2.00 |
diff_srv_rate | 1.52 |
dst_host_diff_srv_rate | 1.46 |
Top 5 Feature Importances for Gaussian Mixture Model (GMM)
Feature | Importance |
---|---|
dst_bytes | 61.53 |
src_bytes | 42.74 |
dst_host_srv_diff_host_rate | 1.46 |
dst_host_diff_srv_rate | 1.33 |
diff_srv_rate | 1.20 |
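The exact importance calculation is not reproduced here, but one plausible way to derive per-feature scores like those in the tables above (a sketch, not necessarily the method used in this project) is to measure how far apart the fitted cluster centers sit along each feature, assuming a fitted kmeans model and a feature_names list:

```python
import pandas as pd

# spread of the K-Means cluster centers along each (scaled) feature:
# features whose centers differ the most do the most to separate the clusters
centers = pd.DataFrame(kmeans.cluster_centers_, columns=feature_names)
importance = (centers.max() - centers.min()).sort_values(ascending=False)
print(importance.head(5))
```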
Conclusion and Insights
After comparing supervised and unsupervised methods, we found that supervised learning (XGBoost) delivered higher accuracy and F1 scores for detecting anomalous network traffic, benefiting from labeled data and feature engineering. The model was able to precisely identify most anomalies, though some recall limitations remained.
Unsupervised methods (K-Means, DBSCAN, and GMM) proved valuable for discovering hidden patterns and potential anomalies without relying on labels. While their clustering performance was generally lower than the supervised model’s, especially for direct anomaly detection, they offered unique strengths in identifying novel or previously unseen threats and provided insights into the underlying structure of network traffic.
All in all, supervised learning is preferable when high-quality labeled data is available, enabling robust and interpretable anomaly detection; it was able to capture the complexity of this network anomaly dataset. Unsupervised methods are essential for proactive monitoring and can complement supervised approaches, especially in dynamic environments where new attack types may emerge. Combining both strategies can lead to more resilient and adaptive network security solutions.