Packet Traffic Learning
Proposal
Dataset
Our dataset doesn’t include column names, so we add them to the in-memory dataframes.
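import pandas as pd

# The CSV files have no header row, so header=None keeps the first record as data
df_train = pd.read_csv('data/KDDTrain.csv', header=None)
df_test = pd.read_csv('data/KDDTest.csv', header=None)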
'''
Columns received from kaggle project
https://www.kaggle.com/code/faizankhandeshmukh/intrusion-detection-system
'''
# Define the list of column names based on the NSL-KDD dataset description
columns = [
'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins',
'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root',
'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds',
'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate',
'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
'dst_host_srv_rerror_rate', 'attack', 'level'
]
# Assign the column names to the dataframe
df_train.columns = columns
df_test.columns = columns
print('Shapes (train, test):', df_train.shape, df_test.shape)
Shapes (train, test): (125973, 43) (22544, 43)
We use the NSL-KDD network intrusion detection dataset, obtained from Kaggle, which ships as separate training and test files. The training set contains 125,973 rows and 43 columns; the test set contains 22,544 rows and 43 columns.
The `attack` field labels each observation as normal or as a specific attack type (multi-class), which lets us apply learning approaches to classify anomalous network activity. We will add a new binary feature, `is_anomalous`, to indicate whether the network connection was anomalous. This will be the target field for the project.
We chose this dataset because it provides a rich and realistic representation of network traffic data. The presence of labeled data allows us to train and evaluate supervised models; the diversity and volume of traffic patterns make it well-suited for exploring unsupervised anomaly detection techniques as well. This balance between complexity and feature richness aligns well with our research questions and modeling goals.
Questions
Q1. Using a supervised learning model such as XGBoost, how accurately can we classify network traffic as normal or anomalous? What features are most influential in driving the model’s predictions?
Q2. Can unsupervised learning methods such as K-Means Clustering and Density-based spatial clustering of applications with noise (DBSCAN) detect anomalous network traffic without labeled data? How do they group the network traffic, and how do the results compare to the supervised models?
Summary. In addition to evaluating the performance of supervised and unsupervised models on the task of anomaly detection, we compare the most influential features identified by each approach. This allows us to investigate how different learning paradigms “perceive” and prioritize threat indicators (features) within the same dataset.
Dataset Analysis
Variables
Column Name | Data Type | Description |
---|---|---|
duration | int64 | Length (in seconds) of the connection. |
protocol_type | object | Protocol used (e.g., tcp, udp, icmp). |
service | object | Network service on the destination (e.g., http, telnet). |
flag | object | Status flag of the connection. |
src_bytes | int64 | Number of data bytes sent from source to destination. |
dst_bytes | int64 | Number of data bytes sent from destination to source. |
land | int64 | 1 if connection is from/to the same host/port; 0 otherwise. |
wrong_fragment | int64 | Number of wrong fragments. |
urgent | int64 | Number of urgent packets. |
hot | int64 | Number of “hot” indicators. |
num_failed_logins | int64 | Number of failed login attempts. |
logged_in | int64 | 1 if successfully logged in; 0 otherwise. |
num_compromised | int64 | Number of compromised conditions. |
root_shell | int64 | 1 if root shell is obtained; 0 otherwise. |
su_attempted | int64 | 1 if “su root” command attempted; 0 otherwise. |
num_root | int64 | Number of “root” accesses. |
num_file_creations | int64 | Number of file creation operations. |
num_shells | int64 | Number of shell prompts invoked. |
num_access_files | int64 | Number of accesses to control files. |
num_outbound_cmds | int64 | Number of outbound commands (always 0 in KDD99). |
is_host_login | int64 | 1 if login is to a host account; 0 otherwise. |
is_guest_login | int64 | 1 if login is to a guest account; 0 otherwise. |
count | int64 | Number of connections to the same host in the past 2 seconds. |
srv_count | int64 | Number of connections to the same service in the past 2 seconds. |
serror_rate | float64 | % of connections with SYN errors. |
srv_serror_rate | float64 | % of connections to the same service with SYN errors. |
rerror_rate | float64 | % of connections with REJ errors. |
srv_rerror_rate | float64 | % of connections to the same service with REJ errors. |
same_srv_rate | float64 | % of connections to the same service. |
diff_srv_rate | float64 | % of connections to different services. |
srv_diff_host_rate | float64 | % of connections to different hosts on the same service. |
dst_host_count | int64 | Number of connections to the destination host. |
dst_host_srv_count | int64 | Number of connections to the destination host and service. |
dst_host_same_srv_rate | float64 | % of connections to the same service on the destination host. |
dst_host_diff_srv_rate | float64 | % of connections to different services on the destination host. |
dst_host_same_src_port_rate | float64 | % of connections from the same source port. |
dst_host_srv_diff_host_rate | float64 | % of connections to the same service from different hosts. |
dst_host_serror_rate | float64 | % of connections with SYN errors to the destination host. |
dst_host_srv_serror_rate | float64 | % of connections with SYN errors to the destination service. |
dst_host_rerror_rate | float64 | % of connections with REJ errors to the destination host. |
dst_host_srv_rerror_rate | float64 | % of connections with REJ errors to the destination service. |
attack | object | Label indicating the type of attack or “normal”. |
level | int64 | Severity or confidence score of the attack (if available). |
Exploratory Data Analysis
Evaluate training data for any obvious imbalances.
print("Shape:", df_train.shape)
print("Missing values:", df_train.isna().sum().sum())
print("Duplicates:", df_train.duplicated().sum())
print("Unique attack labels:", df_train['attack'].nunique())
print("Attack label distribution:\n", df_train['attack'].value_counts().head(5))
# Show types and non-null counts
df_train.info(verbose=False)
Shape: (125973, 43)
Missing values: 0
Duplicates: 0
Unique attack labels: 23
Attack label distribution:
attack
normal 67343
neptune 41214
satan 3633
ipsweep 3599
portsweep 2931
Name: count, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125973 entries, 0 to 125972
Columns: 43 entries, duration to level
dtypes: float64(15), int64(24), object(4)
memory usage: 41.3+ MB
The brief look at the data is positive. There are plenty of data points and features. Depending on time and model fitting speed, we may decrease our sample size because hyperparameter tuning using GridSearchCV can be computationally intensive. There are no missing values, and the glimpse of the attack column provides insight into why we want to collapse it into a binary column.
Normal vs Anomalous Traffic
First, let’s evaluate the distribution of normal vs. anomalous traffic.
attack
normal 67343
neptune 41214
satan 3633
ipsweep 3599
portsweep 2931
smurf 2646
nmap 1493
back 956
teardrop 892
warezclient 890
pod 201
guess_passwd 53
buffer_overflow 30
warezmaster 20
land 18
imap 11
rootkit 10
loadmodule 9
ftp_write 8
multihop 7
phf 4
perl 3
spy 2
Name: count, dtype: int64
The plot shows the specific attack types present in the data. Most attack types are rare compared with normal traffic, with several appearing fewer than 20 times, which is why it makes sense to group all non-normal traffic together.
We engineer a new column, `is_anomalous`, in which 0 marks `normal` activity and 1 marks anomalous activity (any attack type).
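A minimal sketch of how such a column could be derived, assuming the `attack` column holds the string `normal` for benign traffic (as in the label counts shown earlier). The `toy` frame is a hypothetical stand-in for `df_train`.

```python
import pandas as pd

# Hypothetical stand-in for df_train; only the `attack` column matters here.
toy = pd.DataFrame({"attack": ["normal", "neptune", "normal", "satan", "smurf"]})

# 0 = normal traffic, 1 = any named attack type.
toy["is_anomalous"] = (toy["attack"] != "normal").astype(int)

print(toy)
```

The same expression applied to `df_train` and `df_test` would keep the original `attack` column available for later per-attack analysis.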
Examine the new column, `is_anomalous`, to get an idea of the target frequency.
is_anomalous | Count | Percentage |
---|---|---|
Normal | 67343 | 53.46 |
Attack | 58630 | 46.54 |
The `is_anomalous` classification target shows a near-even class distribution (about 53% normal, 47% attack), indicating that the dataset is well balanced. There should be no need for resampling or class weighting. This dataset appears to be a good candidate for learning models.
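The percentages in the table can be reproduced from the two class counts. The sketch below uses only the counts reported above; the majority/minority ratio is an extra summary added here for illustration, not something the original analysis reports.

```python
import pandas as pd

# Class counts copied from the training-set table above.
counts = pd.Series({"Normal": 67343, "Attack": 58630}, name="Count")

summary = counts.to_frame()
summary["Percentage"] = (counts / counts.sum() * 100).round(2)

# Majority/minority ratio; values close to 1 indicate a balanced target.
imbalance_ratio = counts.max() / counts.min()

print(summary)
print(f"Majority/minority ratio: {imbalance_ratio:.2f}")
```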
Analysis Plan
Problem Introduction
The project is to build and evaluate models capable of detecting anomalous network traffic based on connection-level features from the NSL-KDD dataset. The problem is framed as a binary classification task, where each record is labeled as either normal or anomalous. This has real-world applications in intrusion detection systems and network security monitoring.
The project will explore both supervised and unsupervised machine learning techniques to assess their effectiveness in identifying attacks from structured network traffic data.
Feature Engineering Strategy
To ensure a fair and consistent comparison, we will apply the same feature engineering pipeline to both supervised and unsupervised models. All features will be assigned appropriate column names based on the NSL-KDD documentation. Categorical variables such as `protocol_type`, `service`, and `flag` will be one-hot encoded, and low-variance or non-informative columns will be removed. Numeric features will be standardized using a scaler to normalize their ranges.
For supervised models, these engineered features will be used alongside the binary target `is_anomalous`. For unsupervised models, the same processed features will be used without labels, allowing the models to explore underlying structure or detect anomalous patterns. This consistent preprocessing ensures that differences in performance and feature relevance can be attributed to the modeling approaches rather than inconsistencies in data preparation.
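One way the shared preprocessing could be expressed is as a single scikit-learn transformer that is fitted once and reused by every model. This is a sketch under assumptions, not the project's final pipeline: the toy rows are made up, only a few columns are shown, and "low-variance" is interpreted here as zero variance via `VarianceThreshold`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows standing in for df_train (illustration only).
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "service": ["http", "domain_u", "http", "ecr_i"],
    "flag": ["SF", "SF", "S0", "SF"],
    "src_bytes": [181, 105, 0, 1032],
    "dst_bytes": [5450, 146, 0, 0],
    "num_outbound_cmds": [0, 0, 0, 0],  # constant column, expected to be dropped
})

categorical = ["protocol_type", "service", "flag"]
numeric = [c for c in toy.columns if c not in categorical]

preprocess = ColumnTransformer(
    [
        # One-hot encode categoricals; unseen test categories become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        # Drop zero-variance numeric columns, then standardize the rest.
        ("num", Pipeline([
            ("drop_constant", VarianceThreshold(threshold=0.0)),
            ("scale", StandardScaler()),
        ]), numeric),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocess.fit_transform(toy)
print(X.shape)  # 8 one-hot columns + 2 scaled numeric columns
```

Fitting the transformer on the training set only and calling `transform` on the test set would avoid leaking test statistics into the scaler.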
Dimensionality Reduction
Since the dataset contains over 40 features, we will apply dimensionality reduction techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) to aid in visualization, mitigate the curse of dimensionality, and potentially improve model performance. We will evaluate how reduced-dimensional representations affect clustering and classification results and determine whether retaining all features or selecting a subset leads to more interpretable and higher performing models.
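The two techniques behave differently, which the following sketch illustrates on synthetic data (the matrix `X` and labels `y` are generated here and are not NSL-KDD values). PCA ignores the labels and can keep as many components as needed to reach a variance target, whereas LDA uses the labels and yields at most one component for a binary target such as `is_anomalous`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled feature matrix: 200 rows, 10 features,
# with class 1 shifted away from class 0.
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 10)) + y[:, None] * 2.0

# PCA (unsupervised): keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X)

# LDA (supervised): at most n_classes - 1 components, so 1 for a binary target.
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)

print("PCA components kept:", X_pca.shape[1])
print("Explained variance:", pca.explained_variance_ratio_.sum().round(3))
print("LDA output shape:", X_lda.shape)
```

Because LDA needs labels, it fits the supervised track; for the clustering experiments only the PCA projection avoids using `is_anomalous`.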
Q1. Supervised Learning
For the supervised learning portion of the project, we will use the labeled NSL-KDD dataset to train a model that classifies network traffic as normal or anomalous. We will implement XGBoost, a gradient-boosted decision tree algorithm known for its high accuracy, scalability, and built-in mechanisms for handling class imbalance. XGBoost is particularly well-suited for structured tabular data like NSL-KDD and provides useful feature importance measures that support model interpretation.
Model performance will be evaluated using standard classification metrics, including accuracy, precision, recall, F1-score, and ROC AUC. In addition to assessing predictive performance, we will analyze the most influential features identified by the model to better understand what drives the classification of anomalous behavior.
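The evaluation described above might look like the following sketch. It runs on synthetic data from `make_classification`, the hyperparameters are placeholders rather than tuned values, and the `GradientBoostingClassifier` fallback exists only so the example still runs where `xgboost` is not installed; it is not part of the plan.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBClassifier
    model = XGBClassifier(n_estimators=100, max_depth=4,
                          learning_rate=0.1, random_state=0)
except ImportError:
    # Assumption: scikit-learn stand-in used only when xgboost is unavailable.
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier(n_estimators=100, max_depth=4,
                                       learning_rate=0.1, random_state=0)

# Synthetic binary problem standing in for the preprocessed NSL-KDD features.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Indices of the three features with the highest importance scores.
top_features = np.argsort(model.feature_importances_)[::-1][:3]
print("Top feature indices:", top_features)
```

With NSL-KDD, the provided test file would take the place of the `train_test_split` hold-out used here.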
Q2. Unsupervised Learning
For the unsupervised learning portion of the project, we will use clustering-based approaches to detect anomalies in network traffic without relying on labeled data. We plan to use K-Means, which partitions data into k clusters based on distance to the cluster centroids, and DBSCAN, which identifies density-based clusters of arbitrary shape and isolates outliers as noise. These techniques will be applied to the same feature-engineered data used in the supervised learning tasks.
After clustering, we will evaluate how well the resulting groupings align with the true labels. Because the labels are available, we can use external cluster-validation metrics such as the Adjusted Rand Index (ARI) and the Fowlkes-Mallows Index (FMI); the labels are used only for evaluation, never for fitting. We will also examine which features appear most influential in driving cluster separation and anomaly detection, enabling comparison with the supervised model’s learned feature importance.
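As a rough illustration of this evaluation, the sketch below clusters two synthetic, well-separated blobs and scores the result against known labels. The data, `eps`, and `min_samples` are assumptions chosen for the toy example; real NSL-KDD features would need their own tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

rng = np.random.default_rng(0)

# Two synthetic blobs standing in for "normal" and "attack" traffic.
normal = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
attack = rng.normal(loc=5.0, scale=0.5, size=(100, 2))
X = np.vstack([normal, attack])
y_true = np.repeat([0, 1], 100)  # used only for evaluation, never for fitting

cluster_labels = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    # DBSCAN assigns the label -1 to points it treats as noise.
    "dbscan": DBSCAN(eps=1.0, min_samples=5).fit_predict(X),
}

scores = {}
for name, labels in cluster_labels.items():
    scores[name] = {
        "ari": adjusted_rand_score(y_true, labels),
        "fmi": fowlkes_mallows_score(y_true, labels),
    }
    print(name, {k: round(v, 3) for k, v in scores[name].items()})

print("DBSCAN noise points:", int((cluster_labels["dbscan"] == -1).sum()))
```

Both metrics are invariant to how cluster IDs are numbered, so no mapping from cluster index to class label is needed before scoring.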
Summary Comparison
To compare the supervised and unsupervised approaches, we will evaluate their effectiveness in detecting anomalous traffic using metrics appropriate to each task. In addition to quantitative performance, we will compare the features deemed most influential by each modeling approach, revealing how different learning paradigms interpret the same network traffic. We will also consider practical factors such as scalability, interpretability, and the requirement for labeled data, to assess the trade-offs between the two strategies.
Project Timeline
Week | Task Name | Summary | Team Member(s) |
---|---|---|---|
1 | Dataset exploration | Load the dataset, inspect features, validate schema, sanity-check label balance. | Both |
1 | Define research questions and proposal | Clarify goals for supervised (XGBoost) and unsupervised (K-Means/DBSCAN) anomaly detection. | Both |
2 | EDA & Feature engineering | Explore distributions, correlations, missing data, and outliers; then one-hot encode categoricals, standardize numerics, and remove low-variance/redundant columns to build a reusable pipeline. | Both |
3 | Supervised model development | Train XGBoost to classify traffic as normal vs. anomalous. | David |
3 | Evaluation of supervised models | Use accuracy, precision, recall, F1-score, and ROC AUC to assess performance. | David |
3 | Unsupervised model development | Explore K-Means and DBSCAN to cluster traffic and detect anomalies. | Joey |
4 | Evaluation of unsupervised models | Compare cluster assignments to labels using Adjusted Rand Index; report silhouette where relevant. | Joey |
4 | Comparative analysis | Contrast performance and feature influence across supervised vs. unsupervised approaches. | Both |
5 | Final report & presentation | Compile results, figures, and discussion into final deliverables. | Both |
File System Organization
There are no changes to the provided template other than adding the .csv files to the data folder.