Packet Traffic Learning

Proposal

Project description: Our project aims to develop a predictive model to detect anomalous network behavior using packet-level and statistical features derived from network traffic. With machine learning models, we aim to accurately classify and predict network anomalies, which is essential for intrusion detection, network security monitoring, and incident response automation.

Author: The Anomalists - Joey Garcia, David Kyle

Affiliation: College of Information Science, University of Arizona

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats # for analysis plan

Dataset

Our dataset doesn’t include column names, so we add them to the in-memory dataframes after loading.

df_train = pd.read_csv('data/KDDTrain.csv')
df_test = pd.read_csv('data/KDDTest.csv')

'''
Column names taken from this Kaggle project:
https://www.kaggle.com/code/faizankhandeshmukh/intrusion-detection-system

'''

# Define the list of column names based on the NSL-KDD dataset description
columns = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
    'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins',
    'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root',
    'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds',
    'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate',
    'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
    'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
    'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
    'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
    'dst_host_srv_rerror_rate', 'attack', 'level'
]

# Assign the column names to the dataframe
df_train.columns = columns
df_test.columns = columns


print('Shapes (train, test):', df_train.shape, df_test.shape)

Shapes (train, test): (125973, 43) (22544, 43)

We use the NSL-KDD network intrusion detection dataset, obtained from Kaggle, which ships as separate training and test files. The training set contains 125,973 rows and 43 columns; the test set contains 22,544 rows and 43 columns.

The attack field labels each observation as normal or as one of several attack types (multi-class), which lets us apply supervised learning to classify anomalous network activity. We will add a binary feature, is_anomalous, indicating whether a network connection is anomalous. This will be the target field for the project.

We chose this dataset because it provides a rich and realistic representation of network traffic data. The presence of labeled data allows us to train and evaluate supervised models; the diversity and volume of traffic patterns make it well-suited for exploring unsupervised anomaly detection techniques as well. This balance between complexity and feature richness aligns well with our research questions and modeling goals.

Questions

Q1. Using a supervised learning model such as XGBoost, how accurately can we classify network traffic as normal or anomalous? Which features are most influential in driving the model’s predictions?

Q2. Can unsupervised learning methods such as K-Means clustering and density-based spatial clustering of applications with noise (DBSCAN) detect anomalous network traffic without labeled data? How do they group the network traffic, and how do the results compare to the supervised model?

Summary. In addition to evaluating the performance of supervised and unsupervised models on the task of anomaly detection, we compare the most influential features identified by each approach. This allows us to investigate how different learning paradigms “perceive” and prioritize threat indicators (features) within the same dataset.

Dataset Analysis

Variables

Column Name	Data Type	Description
`duration`	int64	Length (in seconds) of the connection.
`protocol_type`	object	Protocol used (e.g., tcp, udp, icmp).
`service`	object	Network service on the destination (e.g., http, telnet).
`flag`	object	Status flag of the connection.
`src_bytes`	int64	Number of data bytes sent from source to destination.
`dst_bytes`	int64	Number of data bytes sent from destination to source.
`land`	int64	1 if connection is from/to the same host/port; 0 otherwise.
`wrong_fragment`	int64	Number of wrong fragments.
`urgent`	int64	Number of urgent packets.
`hot`	int64	Number of “hot” indicators.
`num_failed_logins`	int64	Number of failed login attempts.
`logged_in`	int64	1 if successfully logged in; 0 otherwise.
`num_compromised`	int64	Number of compromised conditions.
`root_shell`	int64	1 if root shell is obtained; 0 otherwise.
`su_attempted`	int64	1 if “su root” command attempted; 0 otherwise.
`num_root`	int64	Number of “root” accesses.
`num_file_creations`	int64	Number of file creation operations.
`num_shells`	int64	Number of shell prompts invoked.
`num_access_files`	int64	Number of accesses to control files.
`num_outbound_cmds`	int64	Number of outbound commands (always 0 in KDD99).
`is_host_login`	int64	1 if login is to a host account; 0 otherwise.
`is_guest_login`	int64	1 if login is to a guest account; 0 otherwise.
`count`	int64	Number of connections to the same host in the past 2 seconds.
`srv_count`	int64	Number of connections to the same service in the past 2 seconds.
`serror_rate`	float64	% of connections with SYN errors.
`srv_serror_rate`	float64	% of connections to the same service with SYN errors.
`rerror_rate`	float64	% of connections with REJ errors.
`srv_rerror_rate`	float64	% of connections to the same service with REJ errors.
`same_srv_rate`	float64	% of connections to the same service.
`diff_srv_rate`	float64	% of connections to different services.
`srv_diff_host_rate`	float64	% of connections to different hosts on the same service.
`dst_host_count`	int64	Number of connections to the destination host.
`dst_host_srv_count`	int64	Number of connections to the destination host and service.
`dst_host_same_srv_rate`	float64	% of connections to the same service on the destination host.
`dst_host_diff_srv_rate`	float64	% of connections to different services on the destination host.
`dst_host_same_src_port_rate`	float64	% of connections from the same source port.
`dst_host_srv_diff_host_rate`	float64	% of connections to the same service from different hosts.
`dst_host_serror_rate`	float64	% of connections with SYN errors to the destination host.
`dst_host_srv_serror_rate`	float64	% of connections with SYN errors to the destination service.
`dst_host_rerror_rate`	float64	% of connections with REJ errors to the destination host.
`dst_host_srv_rerror_rate`	float64	% of connections with REJ errors to the destination service.
`attack`	object	Label indicating the type of attack or “normal”.
`level`	int64	Difficulty level of the record in NSL-KDD (how many of the 21 baseline learners classified it correctly).

Exploratory Data Analysis

Evaluate training data for any obvious imbalances.

print("Shape:", df_train.shape)
print("Missing values:", df_train.isna().sum().sum())
print("Duplicates:", df_train.duplicated().sum())
print("Unique attack labels:", df_train['attack'].nunique())
print("Attack label distribution:\n", df_train['attack'].value_counts().head(5))

# Show types and non-null counts
df_train.info(verbose=False)

Shape: (125973, 43)
Missing values: 0
Duplicates: 0
Unique attack labels: 23
Attack label distribution:
 attack
normal       67343
neptune      41214
satan         3633
ipsweep       3599
portsweep     2931
Name: count, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125973 entries, 0 to 125972
Columns: 43 entries, duration to level
dtypes: float64(15), int64(24), object(4)
memory usage: 41.3+ MB

This first look at the data is encouraging: the training set is large (125,973 rows, 43 columns) and has no missing values or duplicate rows. Depending on time and model fitting speed, we may reduce the sample size, because hyperparameter tuning with GridSearchCV can be computationally intensive. The attack column has 23 distinct labels dominated by normal and neptune, which is why we want to collapse it into a binary column.
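If fitting time does become a constraint, one option is a stratified subsample that preserves the label proportions. The sketch below is only an illustration of that idea, not part of the current pipeline: the helper name `stratified_subsample` and the toy frame are hypothetical, and it assumes scikit-learn is available.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_subsample(df, label_col, frac, seed=42):
    """Return a random subset of df that keeps the class proportions of label_col."""
    sample, _ = train_test_split(
        df, train_size=frac, stratify=df[label_col], random_state=seed
    )
    return sample


# Toy frame standing in for df_train (60% normal, 40% attack).
toy = pd.DataFrame({
    "src_bytes": range(100),
    "is_anomalous": [0] * 60 + [1] * 40,
})
small = stratified_subsample(toy, "is_anomalous", frac=0.5)
print(small["is_anomalous"].value_counts(normalize=True))
```

Stratifying on the binary target rather than on attack is a deliberate choice here: several attack types have only a handful of records, which can make stratified splitting on the multi-class label fail.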

Normal vs Anomalous Traffic

First, let’s look at the distribution of normal vs. anomalous traffic.

attack
normal             67343
neptune            41214
satan               3633
ipsweep             3599
portsweep           2931
smurf               2646
nmap                1493
back                 956
teardrop             892
warezclient          890
pod                  201
guess_passwd          53
buffer_overflow       30
warezmaster           20
land                  18
imap                  11
rootkit               10
loadmodule             9
ftp_write              8
multihop               7
phf                    4
perl                   3
spy                    2
Name: count, dtype: int64

The plot shows which specific attack types appear in the data and how skewed their frequencies are: a few types account for most attacks, while many have only a handful of records. This is why it makes sense to group all non-normal traffic together.

We engineer a new column, is_anomalous, coded 0 for normal activity and 1 for anomalous activity.

# create binary target column: 1 = attack, 0 = normal

df_train['is_anomalous'] = (df_train['attack'] != 'normal').astype(int)

Examine the new column, is_anomalous, to get an idea of the target frequency.

is_anomalous	Count	Percentage
Normal	67343	53.46
Attack	58630	46.54

The is_anomalous target shows a near-even class distribution (53.46% normal, 46.54% attack), indicating that the training set is well balanced at the binary level. Resampling or class weighting should not be needed, which makes this dataset a good candidate for the planned models.

Analysis plan

Problem Introduction

The project is to build and evaluate models capable of detecting anomalous network traffic based on connection-level features from the NSL-KDD dataset. The problem is framed as a binary classification task, where each record is labeled as either normal or anomalous. This has real-world applications in intrusion detection systems and network security monitoring.

The project will explore both supervised and unsupervised machine learning techniques to assess their effectiveness in identifying attacks from structured network traffic data.

Feature Engineering Strategy

To ensure a fair and consistent comparison, we will apply the same feature engineering pipeline to both supervised and unsupervised models. All features will be assigned appropriate column names based on the NSL-KDD documentation. Categorical variables such as protocol_type, service, and flag will be one-hot encoded, and low-variance or non-informative columns will be removed. Numeric features will be standardized using a scaler to normalize their ranges.

For supervised models, these engineered features will be used alongside the binary target is_anomalous. For unsupervised models, the same processed features will be used without labels, allowing the models to explore underlying structure or detect anomalous patterns. This consistent preprocessing ensures that differences in performance and feature relevance can be attributed to the modeling approaches rather than inconsistencies in data preparation.
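The shared preprocessing described above could be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's final pipeline: the helper `build_preprocessor`, the choice of `VarianceThreshold` for removing constant columns, and the toy rows are all hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

CATEGORICAL = ["protocol_type", "service", "flag"]


def build_preprocessor(numeric_cols):
    """One-hot encode categoricals; drop constant numerics, then standardize the rest."""
    numeric = Pipeline([
        ("drop_constant", VarianceThreshold(threshold=0.0)),
        ("scale", StandardScaler()),
    ])
    return ColumnTransformer(
        [
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
            ("num", numeric, numeric_cols),
        ],
        sparse_threshold=0.0,  # always return a dense array
    )


# Toy rows standing in for df_train; num_outbound_cmds is constant and gets dropped.
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "service": ["http", "domain_u", "http", "ecr_i"],
    "flag": ["SF", "SF", "S0", "SF"],
    "src_bytes": [181, 105, 0, 1032],
    "num_outbound_cmds": [0, 0, 0, 0],
})
pre = build_preprocessor(["src_bytes", "num_outbound_cmds"])
X = pre.fit_transform(toy)
print(X.shape)
```

Fitting the transformer on the training data and reusing it to transform the test data keeps both sets in the same feature space; `handle_unknown="ignore"` encodes category values that appear only in the test set as all zeros instead of raising an error.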

Dimensionality Reduction

Since the dataset contains over 40 features, we will apply dimensionality reduction techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) to aid in visualization, mitigate the curse of dimensionality, and potentially improve model performance. We will evaluate how reduced-dimensional representations affect clustering and classification results and determine whether retaining all features or selecting a subset leads to more interpretable and higher performing models.
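As a rough illustration of the PCA step, the sketch below keeps the fewest principal components that explain at least 95% of the variance. The synthetic matrix and the 95% threshold are assumptions chosen for the example, not values fixed by this proposal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 10 columns driven by 2 latent factors.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components reaching that variance share.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

One caveat for the LDA variant: LDA yields at most (number of classes - 1) discriminant axes, so with the binary is_anomalous target it produces a single component, while using the multi-class attack label would allow more.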

Q1. Supervised Learning

For the supervised learning portion of the project, we will use the labeled NSL-KDD dataset to train a model that classifies network traffic as normal or anomalous. We will implement XGBoost, a gradient-boosted decision tree algorithm known for its high accuracy, scalability, and built-in mechanisms for handling class imbalance. XGBoost is particularly well-suited for structured tabular data like NSL-KDD and provides useful feature importance measures that support model interpretation.

Model performance will be evaluated using standard classification metrics, including accuracy, precision, recall, F1-score, and ROC AUC. In addition to assessing predictive performance, we will analyze the most influential features identified by the model to better understand what drives the classification of anomalous behavior.
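The five metrics listed above can be computed with scikit-learn as in the sketch below. The helper name `evaluate_binary`, the 0.5 decision threshold, and the toy labels and scores are assumptions for illustration; in the project the scores would come from the fitted classifier's predicted probabilities on the test set.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate_binary(y_true, y_score, threshold=0.5):
    """Compute accuracy, precision, recall, F1, and ROC AUC from predicted scores."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),  # uses scores, not hard labels
    }


# Hypothetical labels (1 = anomalous) and predicted probabilities.
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.6, 0.3, 0.8, 0.9]
metrics = evaluate_binary(y_true, y_score)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

ROC AUC is computed from the continuous scores because it measures ranking quality across all thresholds, while the other four metrics depend on the chosen cut-off.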

Q2. Unsupervised Learning

For the unsupervised learning portion of the project, we will use clustering-based approaches to detect anomalies in network traffic without relying on labeled data. We plan to use K-Means, which partitions the data into k clusters by distance to the cluster centroids, and DBSCAN, which identifies clusters of varying shape and isolates outliers as noise. These techniques will be applied to the same feature-engineered data used in the supervised learning tasks.

After clustering, we will evaluate how well the resulting groupings align with the true labels. Because the labels are available, we can use external clustering validation metrics such as the Adjusted Rand Index (ARI) and the Fowlkes-Mallows Index (FMI). The labels are used only for evaluation, never for fitting. We will also examine which features appear most influential in driving cluster separation and anomaly detection, enabling comparison with the supervised model’s learned feature importance.
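A minimal sketch of this evaluation loop, run on synthetic two-group data rather than NSL-KDD, might look like the following. The `eps`, `min_samples`, and `n_clusters` values are assumptions picked for the toy data and would need tuning on the real features.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

# Synthetic stand-in for the scaled feature matrix: two well-separated groups.
X, y_true = make_blobs(
    n_samples=300, centers=[[0, 0], [8, 8]], cluster_std=0.6, random_state=0
)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN labels points it cannot assign to any dense region as -1 (noise).
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

scores = {}
for name, labels in [("K-Means", kmeans_labels), ("DBSCAN", dbscan_labels)]:
    scores[name] = (
        adjusted_rand_score(y_true, labels),
        fowlkes_mallows_score(y_true, labels),
    )
    print(f"{name}: ARI={scores[name][0]:.3f} FMI={scores[name][1]:.3f}")

print("DBSCAN noise points:", int(np.sum(dbscan_labels == -1)))
```

Both metrics are invariant to how cluster ids are numbered, so they can compare cluster assignments against is_anomalous directly without first mapping clusters to classes.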

Summary Comparison

To compare the supervised and unsupervised approaches, we will evaluate their effectiveness in detecting anomalous traffic using metrics appropriate to each task. In addition to quantitative performance, we will compare the features deemed most influential by each modeling approach, revealing how different learning paradigms interpret the same network traffic. We will also consider practical factors such as scalability, interpretability, and the requirement for labeled data, to assess the trade-offs between the two strategies.

Project Timeline

Week	Task Name	Summary	Team Member(s)
1	Dataset exploration	Load the dataset, inspect features, validate schema, sanity-check label balance.	Both
1	Define research questions and proposal	Clarify goals for supervised (XGBoost) and unsupervised (K-Means/DBSCAN) anomaly detection.	Both
2	EDA & Feature engineering	Explore distributions, correlations, missing data, and outliers; then one-hot encode categoricals, standardize numerics, and remove low-variance/redundant columns to build a reusable pipeline.	Both
3	Supervised model development	Train XGBoost to classify traffic as normal vs. anomalous.	David
3	Evaluation of supervised models	Use accuracy, precision, recall, F1-score, and ROC AUC to assess performance.	David
3	Unsupervised model development	Explore K-Means and DBSCAN to cluster traffic and detect anomalies.	Joey
4	Evaluation of unsupervised models	Compare cluster assignments to labels using Adjusted Rand Index; report silhouette where relevant.	Joey
4	Comparative analysis	Contrast performance and feature influence across supervised vs. unsupervised approaches.	Both
5	Final report & presentation	Compile results, figures, and discussion into final deliverables.	Both

File System Organization

There are no changes to the provided template other than adding the .csv files to the data folder.
