Packet Traffic Learning

Proposal

Project description: Our project aims to develop a predictive model to detect anomalous network behavior using packet-level and statistical features derived from network traffic. With machine learning models, we aim to accurately classify and predict network anomalies, which is essential for intrusion detection, network security monitoring, and incident response automation.
Author: The Anomalists - Joey Garcia, David Kyle

Affiliation: College of Information Science, University of Arizona

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats # for analysis plan

Dataset

Our dataset files don't include column names, so we'll add the column names to the in-memory dataframes.

# the files have no header row, so header=None keeps the first data row from being consumed
df_train = pd.read_csv('data/KDDTrain.csv', header=None)
df_test = pd.read_csv('data/KDDTest.csv', header=None)

'''
Column names received from the Kaggle project:
https://www.kaggle.com/code/faizankhandeshmukh/intrusion-detection-system
'''

# Define the list of column names based on the NSL-KDD dataset description
columns = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
    'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins',
    'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root',
    'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds',
    'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate',
    'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
    'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
    'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
    'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
    'dst_host_srv_rerror_rate', 'attack', 'level'
]

# Assign the column names to the dataframe
df_train.columns = columns
df_test.columns = columns


print('Shapes (train, test):', df_train.shape, df_test.shape)
Shapes (train, test): (125973, 43) (22544, 43)

We are using the NSL-KDD network intrusion detection training and testing datasets from Kaggle. The training dataset contains 125,973 rows and 43 columns; the test dataset contains 22,544 rows and 43 columns.

The attack field labels each observation as normal or as one of many attack types (multi-class), which allows us to use learning approaches for classifying anomalous network activity. A new binary classification feature, is_anomalous, will be added to indicate whether the network connection was anomalous. This will be the target field for the project.

We chose this dataset because it provides a rich and realistic representation of network traffic data. The presence of labeled data allows us to train and evaluate supervised models; the diversity and volume of traffic patterns make it well-suited for exploring unsupervised anomaly detection techniques as well. This balance between complexity and feature richness aligns well with our research questions and modeling goals.

Questions

Q1. Using a supervised learning model such as XGBoost, how accurately can we classify network traffic as normal or anomalous? Which features are most influential in driving the model's predictions?

Q2. Can unsupervised learning methods such as K-Means clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) detect anomalous network traffic without labeled data? How do they group the network traffic, and how do the results compare to the supervised models?

Summary. In addition to evaluating the performance of supervised and unsupervised models on the task of anomaly detection, we compare the most influential features identified by each approach. This allows us to investigate how different learning paradigms “perceive” and prioritize threat indicators (features) within the same dataset.

Dataset Analysis

Variables

Column Name Data Type Description
duration int64 Length (in seconds) of the connection.
protocol_type object Protocol used (e.g., tcp, udp, icmp).
service object Network service on the destination (e.g., http, telnet).
flag object Status flag of the connection.
src_bytes int64 Number of data bytes sent from source to destination.
dst_bytes int64 Number of data bytes sent from destination to source.
land int64 1 if connection is from/to the same host/port; 0 otherwise.
wrong_fragment int64 Number of wrong fragments.
urgent int64 Number of urgent packets.
hot int64 Number of “hot” indicators.
num_failed_logins int64 Number of failed login attempts.
logged_in int64 1 if successfully logged in; 0 otherwise.
num_compromised int64 Number of compromised conditions.
root_shell int64 1 if root shell is obtained; 0 otherwise.
su_attempted int64 1 if “su root” command attempted; 0 otherwise.
num_root int64 Number of “root” accesses.
num_file_creations int64 Number of file creation operations.
num_shells int64 Number of shell prompts invoked.
num_access_files int64 Number of accesses to control files.
num_outbound_cmds int64 Number of outbound commands (always 0 in KDD99).
is_host_login int64 1 if login is to a host account; 0 otherwise.
is_guest_login int64 1 if login is to a guest account; 0 otherwise.
count int64 Number of connections to the same host in the past 2 seconds.
srv_count int64 Number of connections to the same service in the past 2 seconds.
serror_rate float64 % of connections with SYN errors.
srv_serror_rate float64 % of connections to the same service with SYN errors.
rerror_rate float64 % of connections with REJ errors.
srv_rerror_rate float64 % of connections to the same service with REJ errors.
same_srv_rate float64 % of connections to the same service.
diff_srv_rate float64 % of connections to different services.
srv_diff_host_rate float64 % of connections to different hosts on the same service.
dst_host_count int64 Number of connections to the destination host.
dst_host_srv_count int64 Number of connections to the destination host and service.
dst_host_same_srv_rate float64 % of connections to the same service on the destination host.
dst_host_diff_srv_rate float64 % of connections to different services on the destination host.
dst_host_same_src_port_rate float64 % of connections from the same source port.
dst_host_srv_diff_host_rate float64 % of connections to the same service from different hosts.
dst_host_serror_rate float64 % of connections with SYN errors to the destination host.
dst_host_srv_serror_rate float64 % of connections with SYN errors to the destination service.
dst_host_rerror_rate float64 % of connections with REJ errors to the destination host.
dst_host_srv_rerror_rate float64 % of connections with REJ errors to the destination service.
attack object Label indicating the type of attack or “normal”.
level int64 Severity or confidence score of the attack (if available).

Exploratory Data Analysis

Evaluate the training data for any obvious imbalances.

print("Shape:", df_train.shape)
print("Missing values:", df_train.isna().sum().sum())
print("Duplicates:", df_train.duplicated().sum())
print("Unique attack labels:", df_train['attack'].nunique())
print("Attack label distribution:\n", df_train['attack'].value_counts().head(5))

# Show types and non-null counts
df_train.info(verbose=False)
Shape: (125973, 43)
Missing values: 0
Duplicates: 0
Unique attack labels: 23
Attack label distribution:
 attack
normal       67343
neptune      41214
satan         3633
ipsweep       3599
portsweep     2931
Name: count, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125973 entries, 0 to 125972
Columns: 43 entries, duration to level
dtypes: float64(15), int64(24), object(4)
memory usage: 41.3+ MB

This first look at the data is encouraging: there are plenty of data points and features. Depending on time and model-fitting speed, we may decrease our sample size, because hyperparameter tuning with GridSearchCV can be computationally intensive. There are no missing values, and the glimpse of the attack column shows why we want to collapse it into a binary column.

Normal vs Anomalous Traffic

First, let's evaluate the distribution of normal vs. anomalous traffic.
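The table below comes from a simple value count over the attack labels:

df_train['attack'].value_counts()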

attack
normal             67343
neptune            41214
satan               3633
ipsweep             3599
portsweep           2931
smurf               2646
nmap                1493
back                 956
teardrop             892
warezclient          890
pod                  201
guess_passwd          53
buffer_overflow       30
warezmaster           20
land                  18
imap                  11
rootkit               10
loadmodule             9
ftp_write              8
multihop               7
phf                    4
perl                   3
spy                    2
Name: count, dtype: int64

The distribution above gives an idea of the specific attack types expressed in the data and communicates why it makes sense to group all non-normal traffic together.

We engineer a new feature, is_anomalous, where 0 identifies normal activity and 1 identifies anomalous activity.

# create binary target column: 1 = attack, 0 = normal
df_train['is_anomalous'] = df_train['attack'].apply(
    lambda x: 0 if x == 'normal' else 1)

# apply the same mapping to the test set so it can be scored later
df_test['is_anomalous'] = df_test['attack'].apply(
    lambda x: 0 if x == 'normal' else 1)

Examine the new column, is_anomalous, to get an idea of the target frequency.
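A summary like the one below can be produced with pandas (a minimal sketch; relabeling the 0/1 codes as Normal/Attack is our presentation choice):

counts = df_train['is_anomalous'].value_counts()
summary = pd.DataFrame({
    'Count': counts,
    'Percentage': (counts / len(df_train) * 100).round(2)
})
summary.index = summary.index.map({0: 'Normal', 1: 'Attack'})
print(summary)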

Count Percentage
is_anomalous
Normal 67343 53.46
Attack 58630 46.54

The is_anomalous classification target shows a near-even class distribution, indicating that the dataset is well balanced. There should be no need for resampling or class weighting. This dataset appears to be a good candidate for learning models.

Analysis plan

Problem Introduction

The goal of this project is to build and evaluate models capable of detecting anomalous network traffic based on connection-level features from the NSL-KDD dataset. The problem is framed as a binary classification task, where each record is labeled as either normal or anomalous. This has real-world applications in intrusion detection systems and network security monitoring.

The project will explore both supervised and unsupervised machine learning techniques to assess their effectiveness in identifying attacks from structured network traffic data.

Feature Engineering Strategy

To ensure a fair and consistent comparison, we will apply the same feature engineering pipeline to both supervised and unsupervised models. All features will be assigned appropriate column names based on the NSL-KDD documentation. Categorical variables such as protocol_type, service, and flag will be one-hot encoded, and low-variance or non-informative columns will be removed. Numeric features will be standardized using a scaler to normalize their ranges.

For supervised models, these engineered features will be used alongside the binary target is_anomalous. For unsupervised models, the same processed features will be used without labels, allowing the models to explore underlying structure or detect anomalous patterns. This consistent preprocessing ensures that differences in performance and feature relevance can be attributed to the modeling approaches rather than inconsistencies in data preparation.
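A minimal sketch of this pipeline using scikit-learn; the specific transformers and the zero-variance threshold are our assumptions, and the final pipeline may differ:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import VarianceThreshold

categorical = ['protocol_type', 'service', 'flag']
numeric = [c for c in df_train.columns
           if c not in categorical + ['attack', 'level', 'is_anomalous']]

preprocess = Pipeline([
    # one-hot encode categoricals (ignoring services unseen in training)
    # and standardize numeric features
    ('transform', ColumnTransformer([
        ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical),
        ('scale', StandardScaler(), numeric),
    ])),
    # drop constant columns such as num_outbound_cmds
    ('drop_constant', VarianceThreshold(threshold=0.0)),
])

X_train = preprocess.fit_transform(df_train[categorical + numeric])
X_test = preprocess.transform(df_test[categorical + numeric])
y_train = df_train['is_anomalous']
y_test = df_test['is_anomalous']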

Dimensionality Reduction

Since the dataset contains over 40 features, we will apply dimensionality reduction techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) to aid in visualization, mitigate the curse of dimensionality, and potentially improve model performance. We will evaluate how reduced-dimensional representations affect clustering and classification results and determine whether retaining all features or selecting a subset leads to more interpretable and higher performing models.
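A brief sketch of both techniques with scikit-learn; the 95% variance target is an arbitrary starting point, and the densify step assumes the preprocessing above may have produced a sparse matrix:

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# densify in case the preprocessing pipeline returned a sparse matrix
X_dense = X_train.toarray() if hasattr(X_train, 'toarray') else X_train

# PCA: keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_dense)
print('PCA components kept:', pca.n_components_)

# LDA is supervised; a binary target yields at most one discriminant axis
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_dense, y_train)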

Q1. Supervised Learning

For the supervised learning portion of the project, we will use the labeled NSL-KDD dataset to train a model that classifies network traffic as normal or anomalous. We will implement XGBoost, a gradient-boosted decision tree algorithm known for its high accuracy, scalability, and built-in mechanisms for handling class imbalance. XGBoost is particularly well-suited for structured tabular data like NSL-KDD and provides useful feature importance measures that support model interpretation.

Model performance will be evaluated using standard classification metrics, including accuracy, precision, recall, F1-score, and ROC AUC. In addition to assessing predictive performance, we will analyze the most influential features identified by the model to better understand what drives the classification of anomalous behavior.
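A sketch of this workflow, reusing the preprocessed matrices and targets from the feature-engineering step; the hyperparameters are illustrative starting points, not tuned values:

from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score

model = XGBClassifier(n_estimators=300, max_depth=6,
                      learning_rate=0.1, eval_metric='logloss')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred, target_names=['normal', 'anomalous']))
print('ROC AUC:', roc_auc_score(y_test, y_prob))

# built-in importances support the feature-interpretation step
# (get_feature_names_out requires a recent scikit-learn)
importances = pd.Series(model.feature_importances_,
                        index=preprocess.get_feature_names_out())
print(importances.sort_values(ascending=False).head(10))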

Q2. Unsupervised Learning

For the unsupervised learning portion of the project, we will use clustering-based approaches to detect anomalies in network traffic without relying on labeled data. We plan to use K-Means, which partitions data into k-clusters based on distance. Also, we’ll use DBSCAN, which identifies clusters of varying shape and isolates outliers as noise. These techniques will be applied to the same feature-engineered data used in the supervised learning tasks.

After clustering, we will evaluate how well the resulting groupings align with the true labels using unsupervised performance metrics such as Adjusted Rand Index (ARI) and Fowlkes-Mallows Index (FMI) since we have the labels. We will also examine which features appear most influential in driving cluster separation and anomaly detection, enabling comparison with the supervised model’s learned feature importance.
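A sketch of the clustering step, reusing the dense matrix from the dimensionality-reduction sketch above; the DBSCAN eps and min_samples values are placeholders to be tuned:

from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

# K-Means baseline with k=2 (normal vs. anomalous)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
km_labels = kmeans.fit_predict(X_dense)

# DBSCAN flags sparse regions as noise (-1); it may require a subsample at this scale
dbscan = DBSCAN(eps=0.5, min_samples=5)
db_labels = dbscan.fit_predict(X_dense)

# labels are used only for evaluation, never for fitting
print('K-Means ARI:', adjusted_rand_score(y_train, km_labels))
print('K-Means FMI:', fowlkes_mallows_score(y_train, km_labels))
print('DBSCAN ARI:', adjusted_rand_score(y_train, db_labels))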

Summary Comparison

To compare the supervised and unsupervised approaches, we will evaluate their effectiveness in detecting anomalous traffic using metrics appropriate to each task. In addition to quantitative performance, we will compare the features deemed most influential by each modeling approach, revealing how different learning paradigms interpret the same network traffic. We will also consider practical factors such as scalability, interpretability, and the requirement for labeled data, to assess the trade-offs between the two strategies.

Project Timeline

Week Task Name Summary Team Member(s)
1 Dataset exploration Load the dataset, inspect features, validate schema, sanity-check label balance. Both
1 Define research questions and proposal Clarify goals for supervised (XGBoost) and unsupervised (K-Means/DBSCAN) anomaly detection. Both
2 EDA & Feature engineering Explore distributions, correlations, missing data, and outliers; then one-hot encode categoricals, standardize numerics, and remove low-variance/redundant columns to build a reusable pipeline. Both
3 Supervised model development Train XGBoost to classify traffic as normal vs. anomalous. David
3 Evaluation of supervised models Use accuracy, precision, recall, F1-score, and ROC-AUC to assess performance. David
3 Unsupervised model development Explore K-Means and DBSCAN to cluster traffic and detect anomalies. Joey
4 Evaluation of unsupervised models Compare cluster assignments to labels using Adjusted Rand Index; report silhouette where relevant. Joey
4 Comparative analysis Contrast performance and feature influence across supervised vs. unsupervised approaches. Both
5 Final report & presentation Compile results, figures, and discussion into final deliverables. Both

File System Organization

There are no changes to the provided template other than adding the .csv files to the data folder.