Packet Traffic Learning
INFO 523 - Final Project
Abstract
This project studies the implications and results of using supervised and unsupervised learning methods for classifying network traffic. For supervised learning, we employ the XGBoost classifier to assess how accurately anomalous network traffic can be detected using labeled data. In parallel, we explore unsupervised learning methods, such as K-Means clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), to determine their ability to identify anomalous traffic without labeled data. The project is distinctive in that it directly compares the strengths and limitations of these two approaches, informing how supervised and unsupervised detection could be used in unison in future hybrid approaches to intrusion detection, network security monitoring, and incident response.
Introduction
Network intrusion threats are continuously evolving in the realm of cybersecurity, challenging the effectiveness of traditional security practices. As attackers develop new techniques, it becomes increasingly important to efficiently analyze network traffic to identify anomalous patterns that may indicate potential threats to cyber infrastructure. Leveraging machine learning, particularly unsupervised learning methods, allows for the detection of previously unseen or novel threats without relying solely on labeled data. This adaptability is crucial in an ever-changing threat landscape, enabling proactive identification and mitigation of security risks.
Research Questions
Q1. Using a supervised learning model such as XGBoost, how accurately can we classify network traffic as normal or anomalous? Which features are most influential in driving the model’s predictions?
Q2. Can unsupervised learning methods such as K-Means clustering and DBSCAN detect anomalous network traffic without labeled data? How do they group the network traffic, and how do the results compare to the supervised model?
Dataset
We are using the NSL-KDD network intrusion detection training and testing datasets from Kaggle. The training dataset contains 125,973 rows and 43 columns, while the test dataset contains 22,544 rows and 43 columns.
The records include both normal traffic and a variety of attack samples. The dataset is curated to provide challenging cases for machine learning algorithms, with each sample rated on a difficulty scale from 1 to 21, where 21 represents the most difficult to predict.
The attack field indicates whether each observation is normal or belongs to a specific attack type, allowing for multi-class classification of anomalous network activity. For this project, we add a new binary classification feature, is_anomalous, to indicate whether a network connection is anomalous or not. This will serve as the target variable.
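A minimal sketch of how this target can be derived from the attack label (a toy DataFrame stands in here for the loaded NSL-KDD training data; names are illustrative, not the project's exact code):

```python
import pandas as pd

# toy stand-in for the loaded NSL-KDD training frame and its 'attack' column
train_df = pd.DataFrame({"attack": ["normal", "neptune", "smurf", "normal"]})

# any record whose attack label is not 'normal' is flagged as anomalous
train_df["is_anomalous"] = (train_df["attack"] != "normal").astype(int)
print(train_df["is_anomalous"].value_counts())
```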
Looking at the various attack values, combining all non-normal traffic into a composite column makes sense. Once the data is combined into the composite is_anomalous column, the classes are well balanced, making this a solid choice for a target column.
is_anomalous | Count | Percentage |
---|---|---|
Normal | 67343 | 53.46 |
Attack | 58630 | 46.54 |
We chose this dataset because it provides a rich and realistic representation of network traffic. The presence of labeled data allows us to train and evaluate supervised models, while the diversity and complexity of traffic patterns also make it well suited for exploring unsupervised anomaly detection techniques. This balance between complexity and feature richness aligns with our research questions and modeling goals.
Preprocessing (Scaling, Feature Engineering, Dimension Reduction)
To ensure a fair, consistent comparison between supervised and unsupervised approaches, the same feature engineering pipeline will be applied to both. All features will be named according to the NSL-KDD documentation. Categorical variables such as protocol_type, service, and flag will be one-hot encoded, while low-variance columns will be removed. Numerical features will be standardized to normalize their ranges.
For supervised models, the processed features will be paired with the binary target is_anomalous. For unsupervised models, the same feature set will be used without labels, allowing the algorithms to uncover structure or detect anomalies. This consistent preprocessing ensures that performance differences are driven by modeling choices rather than data preparation discrepancies.
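A minimal sketch of this shared preprocessing, assuming the raw training and test features are loaded as DataFrames X_train_raw and X_test_raw (names and the zero-variance threshold are illustrative assumptions):

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ["protocol_type", "service", "flag"]
numeric_cols = [c for c in X_train_raw.columns if c not in categorical_cols]

# one-hot encode the categorical columns, standardize the numeric ones,
# then drop any resulting constant (zero-variance) columns
preprocess = Pipeline([
    ("encode_scale", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("scale", StandardScaler(), numeric_cols),
    ])),
    ("drop_constant", VarianceThreshold(threshold=0.0)),
])

# fit on training data only, then apply the same transform to the test set
X_train_prep = preprocess.fit_transform(X_train_raw)
X_test_prep = preprocess.transform(X_test_raw)
```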
Given the dataset’s more than 40 features, we explored dimensionality reduction methods such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA); however, for interpretability we chose to use mutual information to select the most informative features. Dimensionality reduction still aids in visualization, mitigates the curse of dimensionality, and can potentially enhance model performance, so we compare results using reduced dimensions versus the full feature set to determine which yields more interpretable and effective models.
After the data analysis and feature selection, 17 features remain.
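A hedged sketch of the mutual-information selection step, assuming the preprocessed training matrix X_train_prep, the target y_train, and a feature_names list from the step above (the final set of 17 was also shaped by the broader data analysis, not this call alone):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# score each preprocessed feature by mutual information with the binary target
selector = SelectKBest(score_func=mutual_info_classif, k=17)
X_train_sel = selector.fit_transform(X_train_prep, y_train)

mi_scores = pd.Series(selector.scores_, index=feature_names).sort_values(ascending=False)
print(mi_scores.head(17))
```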
Final Dataset Features
The final dataset used for modeling contains the following features and target variable:
Column Name | Data Type | Description | Notes |
---|---|---|---|
src_bytes | int64 | Number of data bytes sent from source to destination. | Numeric (scaled) |
dst_bytes | int64 | Number of data bytes sent from destination to source. | Numeric (scaled) |
count | int64 | Number of connections to the same host in the past 2 seconds. | Numeric (scaled) |
srv_diff_host_rate | float64 | % of connections to different hosts on the same service. | Numeric (scaled) |
serror_rate | float64 | % of connections with SYN errors. | Numeric (scaled) |
same_srv_rate | float64 | % of connections to the same service. | Numeric (scaled) |
diff_srv_rate | float64 | % of connections to different services. | Numeric (scaled) |
dst_host_count | int64 | Number of connections to the destination host. | Numeric (scaled) |
dst_host_srv_count | int64 | Number of connections to the destination host and service. | Numeric (scaled) |
dst_host_same_srv_rate | float64 | % of connections to the same service on the destination host. | Numeric (scaled) |
dst_host_diff_srv_rate | float64 | % of connections to different services on the destination host. | Numeric (scaled) |
dst_host_same_src_port_rate | float64 | % of connections from the same source port. | Numeric (scaled) |
dst_host_srv_diff_host_rate | float64 | % of connections to the same service from different hosts. | Numeric (scaled) |
logged_in | int64 | 1 if successfully logged in; 0 otherwise. | Binary indicator |
flag_SF | int64 | One-hot encoded: Status flag “SF” of the connection. | One-hot encoded categorical |
service_http | int64 | One-hot encoded: Network service is HTTP. | One-hot encoded categorical |
service_private | int64 | One-hot encoded: Network service is “private”. | One-hot encoded categorical |
is_anomalous | int64 | Target: 1 if connection is anomalous; 0 if normal. | Target variable |
Data Validation Visualizations
A correlation heatmap and a PCA projection validate the data and provide a look at its quality following preprocessing.
The heatmap shows some correlation among features, but the models we use, notably tree-based XGBoost, handle correlated inputs well.
PCA Projection
PCA projection of the top features shows partial separation between normal and anomalous traffic. While some overlap exists, the clustering indicates meaningful structure that can support classification models.
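A brief sketch of how such a projection could be drawn, assuming the selected, scaled training features X_train_sel and labels y_train from the earlier steps (variable names are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the selected features onto the first two principal components and
# color the points by the known label purely to validate the structure
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X_train_sel)

plt.figure(figsize=(6, 5))
plt.scatter(coords[:, 0], coords[:, 1], c=y_train, cmap="coolwarm", s=5, alpha=0.4)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Projection of Selected Features")
plt.tight_layout()
plt.show()
```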
The team is satisfied with the results of the EDA and feels the data is in a good position for modeling.
How We’ll Evaluate Performance
For the supervised learning methods, we’ll use the F1 score as the primary metric for model evaluation. The F1 score balances precision (minimizing false positives) and recall (minimizing false negatives). Both error types carry significant impacts here: misclassifying normal traffic raises a false alarm, while missing a true anomaly lets an attack go undetected. Secondary metrics, such as accuracy, ROC AUC, and confusion matrix analysis, will be monitored for additional insight into the model’s behavior. To ensure robust results, we’ll use stratified cross-validation and select hyperparameters that maximize the F1 score.
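For reference, the F1 score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$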
For unsupervised learning methods, we’ll use the silhouette score and the Adjusted Rand Index (ARI) to measure the performance of our K-Means clustering, Gaussian Mixture Model (GMM), and DBSCAN models. Silhouette scores range from -1 to +1; a higher positive score indicates that a data point fits well within its own cluster. Secondly, since we have labeled data, the ARI measures similarity between the predicted clusters and the true labels. ARI scores also range from -1 to +1, and higher values indicate better agreement.
Model Predictions
Supervised Learning Model
XGB Random Search
For the supervised model, we start by performing a randomized hyperparameter search over an XGBoost classifier using stratified 5-fold cross-validation, selecting the best model based on F1 score.
import numpy as np
from pprint import pprint
from xgboost import XGBClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# normal vs. anomalous samples
neg, pos = np.bincount(y_train)

# gives the minority class more weight during training (our balance is pretty good already)
scale_pos_weight = neg / pos
print(f"Computed scale_pos_weight: {scale_pos_weight:.2f}")

# set up classifier for binary classification
# uses histogram-based tree building--tree_method='hist'
# includes class weighting calculated above
xgb = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    tree_method='hist',
    random_state=42,
    n_jobs=-1,
    scale_pos_weight=scale_pos_weight
)

# hyperparameter options for optimization
param_dist = {
    'n_estimators': [330, 340, 350],
    'max_depth': [13, 14, 15],
    'learning_rate': [0.050, 0.055, 0.1],
    'subsample': [0.93, 0.95, 0.97],
    'colsample_bytree': [0.60, 0.625, 0.65],
    'gamma': [0.04, 0.05, 0.06],
    'reg_alpha': [2.5, 2.75, 3.0],
    'reg_lambda': [2.25, 2.5, 2.75]
}

# 5-fold stratified CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# we are looking for the best F1 score
f1 = make_scorer(f1_score, average='binary')

# randomized search with cross-validation
search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_dist,
    n_iter=3,
    scoring=f1,
    cv=cv,
    random_state=42,
    n_jobs=1,
    verbose=0
)
search.fit(X_train, y_train)
print(f"Total fits run: {len(search.cv_results_['mean_test_score']) * cv.get_n_splits()}")
print(f"Best F1 score: {search.best_score_:.4f}")
print("Best params:")
pprint(search.best_params_)
Computed scale_pos_weight: 1.15
Total fits run: 15
Best F1 score: 0.9981
Best params:
{'colsample_bytree': 0.65,
'gamma': 0.05,
'learning_rate': 0.055,
'max_depth': 13,
'n_estimators': 340,
'reg_alpha': 2.75,
'reg_lambda': 2.75,
'subsample': 0.97}
The randomized search evaluated 15 total fits across three parameter combinations and five cross-validation folds. The best model achieved an exceptionally high F1 score of 0.9981, indicating a nearly perfect balance of precision and recall in detecting anomalies. While results of this quality often suggest data leakage, none is present here. More likely, the NSL-KDD training data is a curated benchmark with strong, learnable structure, and the relatively simple pipeline and a well-tuned model captured those relationships.
Test Evaluation
We run the best estimator from the RandomizedSearchCV against the testing dataset to see how well it predicts unseen data, and report metrics and a confusion matrix to show where the model performed well or poorly.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

# the class predictions and the predicted probability of a positive--anomalous--score
y_pred = search.best_estimator_.predict(X_test)
y_proba = search.best_estimator_.predict_proba(X_test)[:, 1]

# scores
print("Test F1 Score:", f1_score(y_test, y_pred))
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test ROC AUC:", roc_auc_score(y_test, y_proba))
print("Classification Report:\n", classification_report(y_test, y_pred))

# confusion matrix to see where the model did well or failed
cm = confusion_matrix(y_test, y_pred)
labels = ['Normal (0)', 'Anomalous (1)']

# heatmap of confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()
Test F1 Score: 0.7639387639387639
Test Accuracy: 0.778388928317956
Test ROC AUC: 0.963783985241748
Classification Report:
precision recall f1-score support
0 0.67 0.97 0.79 9711
1 0.97 0.63 0.76 12833
accuracy 0.78 22544
macro avg 0.82 0.80 0.78 22544
weighted avg 0.84 0.78 0.78 22544
The tuned XGBoost model achieved strong overall performance with a test F1 score of 0.76, accuracy of 0.78, and an excellent ROC AUC of 0.96. While the model is highly precise in detecting anomalous traffic, it only recalls about 63% of true anomalies, meaning some attacks are missed. This suggests the model is conservative in raising alerts, favoring fewer false alarms at the cost of letting certain anomalies go undetected.
The differences between training and testing performance suggest distributional drift between the two datasets: the NSL-KDD test set intentionally includes attack types that do not appear in the training data. While the model captured the training patterns extremely well, its recall dropped on these unseen attacks in the test set.
Feature Influence on the Model
To examine which features most strongly influence the model’s predictions, we present a SHAP violin plot. This visualization shows both the direction and magnitude of feature impacts on the classification, highlighting how variables such as src_bytes can contribute to both anomalous and non-anomalous outcomes.
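As a rough sketch, not the project's exact plotting code, a violin-style SHAP summary for the tuned model could be produced along these lines, reusing the fitted search object and X_test from the evaluation step:

```python
import shap

# explain the tuned XGBoost model's predictions on the test features
explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_test)

# violin-style summary: direction and magnitude of each feature's impact
shap.summary_plot(shap_values, X_test, plot_type="violin")
```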
The SHAP analysis shows that traffic volume features such as src_bytes and dst_bytes have the strongest influence on the model’s predictions. Additional features, such as repeated service requests, specific services, and login or flag indicators, provide secondary but meaningful contributions to detecting anomalies.
All 17 EDA-refined features impact the classification with varying degrees of influence, suggesting successful data analysis and pre-processing.
Question 1 Result
The tuned XGBoost model classified network traffic with strong results, achieving 0.78 accuracy, 0.76 F1, and 0.96 ROC AUC. SHAP analysis shows all 17 refined features contributed, with src_bytes, dst_bytes, dst_host_srv_count, count, and same_srv_rate most influential. Overall, supervised learning proved effective for distinguishing normal and anomalous traffic, though recall limitations mean some anomalies were missed.
Unsupervised Learning Model
Unsupervised learning methods are essential for discovering hidden patterns and structures in data without relying on labeled outcomes. In network traffic analysis, these techniques can identify unseen anomalies, making them valuable for proactive security monitoring when labeled data isn’t accessible. For this project, we apply three unsupervised machine learning algorithms: K-Means clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and soft clustering via Gaussian Mixture Models (GMM). By evaluating the clustering results with the silhouette score and Adjusted Rand Index, we assess how effectively these models separate normal and anomalous traffic, and compare their strengths and limitations in detecting network threats.
Finding Optimal Parameters
Selecting optimal parameters is crucial for the unsupervised learning algorithms used in this project. Unlike supervised methods, unsupervised models have no labeled outcomes to guide their learning, so performance is heavily reliant on the algorithm parameters. For K-Means, determining the ideal number of clusters ensures meaningful groupings that reflect the underlying data structure. DBSCAN requires tuning of the epsilon (distance threshold) and minimum samples parameters to accurately identify dense regions and outliers. Soft clustering methods, such as Gaussian Mixture Models, rely on selecting the appropriate number of components to capture the complexity of the underlying data distribution.
Proper parameter selection improves cluster quality, enhances the ability to detect anomalies, and ensures that the models provide actionable insights for network traffic analysis. To achieve this, we use internal validation metrics like silhouette score and, when labels are available, external metrics such as adjusted rand index to guide our parameter choices.
Below are the visualizations that identify the optimal parameters:
- Suggested eps: 2.938
- Optimal number of clusters (lowest DBI): 19
- Optimal number of clusters (BIC): 29
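As an illustrative sketch of how these diagnostics could be computed (the neighbor count, candidate ranges, and the scaled feature matrix X_scaled are assumptions, not the project's exact settings):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

# k-distance curve for DBSCAN: the "knee" of the sorted k-th neighbor
# distances suggests a value for eps
nn = NearestNeighbors(n_neighbors=10).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])

# Davies-Bouldin index over candidate k for K-Means (lower is better)
dbi = {k: davies_bouldin_score(
          X_scaled,
          KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled))
       for k in range(2, 31)}

# BIC over candidate component counts for the GMM (lower is better)
bic = {k: GaussianMixture(n_components=k, random_state=42).fit(X_scaled).bic(X_scaled)
       for k in range(2, 31)}

print("Best k by DBI:", min(dbi, key=dbi.get))
print("Best components by BIC:", min(bic, key=bic.get))
```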
Unsupervised Learning Performance Metrics Analysis
To evaluate the effectiveness of our unsupervised models, we compared both silhouette scores and adjusted rand index (ARI) across DBSCAN, K-Means, and GMM.
Silhouette Scores:
- DBSCAN: 0.114
- KMeans clustering: 0.414
- GMM clustering: 0.307
Adjusted Rand Index:
- DBSCAN ARI: 0.242
- KMeans ARI: 0.179
- GMM ARI: 0.143
These results show that K-Means achieved the highest silhouette score, indicating more cohesive clusters, while DBSCAN had the highest ARI, suggesting its clusters most closely matched the true anomaly labels. The unsupervised methods did not reach the performance of the supervised model; however, they still revealed meaningful structure and potential anomalies within the network traffic data.
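A condensed sketch of how these three models and scores could be produced, plugging in the parameter values found above (min_samples, the silhouette sampling, and the use of y_train only for ARI evaluation are assumptions):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.mixture import GaussianMixture

models = {
    "KMeans": KMeans(n_clusters=19, n_init=10, random_state=42),
    "DBSCAN": DBSCAN(eps=2.938, min_samples=10),
    "GMM": GaussianMixture(n_components=29, random_state=42),
}

for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    # silhouette on a sample to keep the pairwise-distance cost manageable
    sil = silhouette_score(X_scaled, labels, sample_size=10000, random_state=42)
    # true labels are used only for evaluation, never for fitting
    ari = adjusted_rand_score(y_train, labels)
    print(f"{name}: silhouette={sil:.3f}, ARI={ari:.3f}")
```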
Unsupervised Learning Feature Importance
The top features identified by both K-Means and GMM match the most influential features found in the supervised XGBoost model. In supervised learning, SHAP analysis also highlighted src_bytes, dst_bytes, and connection rate features as key drivers for distinguishing normal and anomalous traffic.
The similarity demonstrates that both supervised and unsupervised methods are capturing the same underlying patterns in the network data. The consistency in feature importance across modeling approaches suggests these variables are fundamental indicators of network anomalies, regardless of whether labels are available. This strengthens confidence in the reliability and interpretability of the results, and shows that unsupervised models can surface meaningful insights even without explicit supervision.
Top 5 Feature Importances for K-Means Clustering
Feature | Importance |
---|---|
dst_bytes | 74.87 |
src_bytes | 51.91 |
dst_host_srv_diff_host_rate | 2.00 |
diff_srv_rate | 1.52 |
dst_host_diff_srv_rate | 1.46 |
Top 5 Feature Importances for Gaussian Mixture Model (GMM)
Feature | Importance |
---|---|
dst_bytes | 61.53 |
src_bytes | 42.74 |
dst_host_srv_diff_host_rate | 1.46 |
dst_host_diff_srv_rate | 1.33 |
diff_srv_rate | 1.20 |
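The exact importance calculation is not reproduced here, but one plausible way to derive per-feature scores like those in the tables above (a sketch, not necessarily the method used in this project) is to measure how far apart the fitted cluster centers sit along each feature, assuming a fitted kmeans model and a feature_names list:

```python
import pandas as pd

# spread of the K-Means cluster centers along each (scaled) feature:
# features whose centers differ the most do the most to separate the clusters
centers = pd.DataFrame(kmeans.cluster_centers_, columns=feature_names)
importance = (centers.max() - centers.min()).sort_values(ascending=False)
print(importance.head(5))
```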
Conclusion and Insights
After comparing supervised and unsupervised methods, we found that supervised learning (XGBoost) delivered higher accuracy and F1 scores for detecting anomalous network traffic, benefiting from labeled data and feature engineering. The model was able to precisely identify most anomalies, though some recall limitations remained.
Unsupervised methods (K-Means, DBSCAN, and GMM) proved valuable for discovering hidden patterns and potential anomalies without relying on labels. While their clustering performance was generally lower than the supervised model’s, especially for direct anomaly detection, they offered unique strengths in identifying novel or previously unseen threats and provided insights into the underlying structure of network traffic.
All in all, supervised learning is preferable when high-quality labeled data is available, enabling robust and interpretable anomaly detection; it was able to capture the complexity of this network anomaly dataset. Unsupervised methods are essential for proactive monitoring and can complement supervised approaches, especially in dynamic environments where new attack types may emerge. Combining both strategies can lead to more resilient and adaptive network security solutions.