(125973, 43) (22544, 43)
Missing values: 0 0
Attack classes (train, sample): ['normal' 'neptune' 'warezclient' 'ipsweep' 'portsweep' 'teardrop' 'nmap'
'satan' 'smurf' 'pod' 'back' 'guess_passwd' 'ftp_write' 'multihop'
'rootkit' 'buffer_overflow' 'imap' 'warezmaster' 'phf' 'land'
'loadmodule' 'spy' 'perl']
Packet Traffic Learning
INFO 523 - Final Project
Project description: Our project aims to develop a predictive model to detect anomalous network behavior using packet-level and statistical features derived from network traffic. With machine learning models, we aim to accurately classify and predict network anomalies, which is essential for intrusion detection, network security monitoring, and incident response automation.
Explore Data
Four columns have object dtype: protocol_type, service, flag, and attack. Their most frequent values are listed below.
Top protocol_type values:
protocol_type
tcp 102689
udp 14993
icmp 8291
Name: count, dtype: int64
Top service values:
service
http 40338
private 21853
domain_u 9043
smtp 7313
ftp_data 6860
eco_i 4586
other 4359
ecr_i 3077
telnet 2353
finger 1767
Name: count, dtype: int64
Top flag values:
flag
SF 74945
S0 34851
REJ 11233
RSTR 2421
RSTO 1562
S1 365
SH 271
S2 127
RSTOS0 103
S3 49
Name: count, dtype: int64
attack
normal 67343
neptune 41214
satan 3633
ipsweep 3599
portsweep 2931
smurf 2646
nmap 1493
back 956
teardrop 892
warezclient 890
pod 201
guess_passwd 53
buffer_overflow 30
warezmaster 20
land 18
imap 11
rootkit 10
loadmodule 9
ftp_write 8
multihop 7
phf 4
perl 3
spy 2
Name: count, dtype: int64
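The listings above can be produced with a short loop over the categorical columns. This is a minimal sketch on a toy frame; `df_train` and its rows are placeholders rather than the project's data.

```python
import pandas as pd

# Toy stand-in for the training frame (hypothetical rows, not the real data).
df_train = pd.DataFrame({
    "protocol_type": ["tcp", "tcp", "udp", "icmp", "tcp"],
    "service": ["http", "http", "domain_u", "eco_i", "smtp"],
    "flag": ["SF", "SF", "SF", "S0", "REJ"],
    "attack": ["normal", "normal", "normal", "neptune", "satan"],
})

top_values = {}
for col in ["protocol_type", "service", "flag", "attack"]:
    # value_counts sorts by frequency, so head(10) keeps the 10 most common values.
    top_values[col] = df_train[col].value_counts().head(10)
    print(f"Top {col} values:")
    print(top_values[col], "\n")
```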
Feature Engineering
is_anomalous | Count | Percentage |
---|---|---|
Normal | 67343 | 53.46 |
Attack | 58630 | 46.54 |
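The binary target collapses the 23 attack labels into normal versus everything else. A minimal sketch of that step, assuming the label column is named `attack` as in the counts above (the toy rows are hypothetical):

```python
import pandas as pd

# Toy attack labels (hypothetical sample, not the real data).
df = pd.DataFrame({"attack": ["normal", "neptune", "normal", "satan", "normal"]})

# 0 = normal traffic, 1 = any attack class.
df["is_anomalous"] = (df["attack"] != "normal").astype(int)

counts = df["is_anomalous"].value_counts().sort_index()
summary = pd.DataFrame({
    "Count": counts,
    "Percentage": (counts / counts.sum() * 100).round(2),
})
summary.index = summary.index.map({0: "Normal", 1: "Attack"})
print(summary)

# Sanity check on the figures reported in the table.
normal, attack = 67343, 58630
normal_pct = round(normal / (normal + attack) * 100, 2)
print(normal_pct)  # 53.46
```

The attack count of 58630 is the sum of all non-normal rows in the class listing (125973 - 67343).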
Preprocessing
Identify Categorical and Numerical Data
Categorical columns: ['protocol_type', 'service', 'flag']
Numeric columns: ['duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate']
Categorical Length: 3
Numeric Length: 38
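One way to obtain the two column lists is to let pandas pick the numeric columns and treat the remaining feature columns as categorical. The frame below is a hypothetical miniature of the data, and excluding the target before the split is an assumption.

```python
import pandas as pd

# Hypothetical miniature of the training frame.
df = pd.DataFrame({
    "duration": [0, 2, 0],
    "protocol_type": ["tcp", "udp", "tcp"],
    "src_bytes": [491, 146, 0],
    "flag": ["SF", "SF", "S0"],
    "is_anomalous": [0, 0, 1],
})

# Leave the target out so it is neither scaled nor encoded.
features = df.drop(columns=["is_anomalous"])

numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in features.columns if c not in numeric_cols]

print("Categorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)
```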
Scaling and One-Hot Encoding
Train Shape: (125973, 122)
Test Shape: (22544, 122)
Train Set Scaling Check:
Max |mean|: 0.000000
Max |std - 1|: 1.000000
Test Set Scaling Check:
Max |mean|: 0.451471
Max |std - 1|: 6.838182
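The test-set figures are not expected to be 0 and 1: the scaler is fit on the training data only and its statistics are reused on the test data, so any shift in the test distribution shows up in this check. A train-side `Max |std - 1|` of exactly 1 would be consistent with at least one zero-variance numeric column, though the output does not say which. The sketch below uses scikit-learn's `ColumnTransformer` on hypothetical toy data; the column names and the size of the shift are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

def make_frame(n, shift):
    # Hypothetical toy data; `shift` mimics a distribution change in the test set.
    return pd.DataFrame({
        "src_bytes": rng.normal(100 + shift, 20, n),
        "count": rng.normal(10 + shift, 3, n),
        "protocol_type": rng.choice(["tcp", "udp", "icmp"], n),
    })

train, test = make_frame(500, 0), make_frame(200, 15)

pre = ColumnTransformer(
    [
        ("num", StandardScaler(), ["src_bytes", "count"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol_type"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_train = pre.fit_transform(train)  # statistics are learned from train only
X_test = pre.transform(test)        # and reused unchanged on test

for name, X in [("Train", X_train), ("Test", X_test)]:
    num = X[:, :2]  # the first two output columns are the scaled numeric features
    print(f"{name} max |mean|: {np.abs(num.mean(axis=0)).max():.6f}")
    print(f"{name} max |std - 1|: {np.abs(num.std(axis=0) - 1).max():.6f}")
```

Fitting on the training set alone keeps test-set information out of the preprocessing, at the cost of test features that are no longer exactly standardized.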
Feature Selection
Variance Threshold
Variance threshold = 0.01
Features before: 122
Features after: 56
Dropped: 66 features
First 10 kept features:
Index(['duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment',
'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised'],
dtype='object')
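A variance filter at 0.01 mostly removes constant columns and one-hot columns for rare categories, since a 0/1 column with frequency p has variance p(1 - p). A minimal sketch with scikit-learn's `VarianceThreshold`; the three toy columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "scaled_numeric": rng.normal(0, 1, 1000),            # variance near 1, kept
    "rare_dummy": np.r_[np.ones(5), np.zeros(995)],      # variance about 0.005, dropped
    "constant": np.zeros(1000),                          # variance 0, dropped
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(X)

kept = X.columns[selector.get_support()]
dropped = X.columns[~selector.get_support()]
print("Features before:", X.shape[1])
print("Features after:", len(kept))
print("Dropped:", list(dropped))
```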
High-Correlation Feature Pruning
Correlation threshold = 0.95
Features before: 56
Features after: 49
Dropped: 7 features
Dropped features due to high correlation:
['num_root', 'srv_serror_rate', 'srv_rerror_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_srv_rerror_rate', 'flag_S0']
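A common way to prune correlated features is to scan the upper triangle of the absolute correlation matrix and drop the later column of each pair above the threshold. The sketch below assumes that approach; the source does not show the exact implementation, and the toy columns only borrow names from the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = pd.DataFrame({
    "serror_rate": base,
    "srv_serror_rate": base + rng.normal(scale=0.01, size=500),  # near-duplicate
    "count": rng.normal(size=500),                                # unrelated
})

threshold = 0.95
corr = X.corr().abs()

# Keep only the strict upper triangle so each pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it correlates too strongly with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
X_pruned = X.drop(columns=to_drop)

print("Dropped features due to high correlation:", to_drop)
```

With this ordering rule the first column of a correlated pair survives, which matches the output above where `serror_rate` is kept and `srv_serror_rate` is dropped.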
Mutual Information Scores
Top Features by Mutual Information (17 retained):
Feature MI_Score
src_bytes 0.564592
dst_bytes 0.438462
diff_srv_rate 0.360036
same_srv_rate 0.355225
dst_host_srv_count 0.331552
flag_SF 0.330428
dst_host_same_srv_rate 0.304843
dst_host_diff_srv_rate 0.285189
logged_in 0.283532
serror_rate 0.276421
count 0.265388
dst_host_srv_diff_host_rate 0.189679
service_http 0.185479
dst_host_count 0.137209
dst_host_same_src_port_rate 0.132210
service_private 0.119282
srv_diff_host_rate 0.100504
Train shape with selected features (17 features + target): (125973, 18)
Test shape with selected features (17 features + target): (22544, 18)
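Mutual information scores like those in the ranking above can be estimated with scikit-learn's `mutual_info_classif`, which measures how much knowing a feature reduces uncertainty about the class label. This is a sketch on synthetic data; the feature names and the cut-off `k` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)  # binary target, standing in for is_anomalous
X = pd.DataFrame({
    "informative": y + rng.normal(scale=0.3, size=n),  # tracks the label closely
    "noise": rng.normal(size=n),                       # independent of the label
})

# random_state fixes the small noise the estimator adds to continuous features.
mi = mutual_info_classif(X, y, random_state=0)
mi_scores = pd.Series(mi, index=X.columns, name="MI_Score").sort_values(ascending=False)
print(mi_scores)

k = 1  # hypothetical cut-off for this toy example
top_features = mi_scores.head(k).index.tolist()
X_top = X[top_features]
```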
Confirm everything is still scaled.
min max mean std
src_bytes -0.007762 235.067459 1.394409e-10 1.000004
dst_host_diff_srv_rate -0.439078 4.854138 -2.108072e-09 1.000004
dst_host_same_srv_rate -1.161030 1.066401 3.018620e-09 1.000004
dst_host_same_src_port_rate -0.480197 2.756092 1.234607e-08 1.000004
dst_host_count -1.836071 0.734343 8.007745e-09 1.000004
serror_rate -0.637209 1.602664 -2.351370e-08 1.000004
count -0.734511 3.728053 -5.835707e-09 1.000004
same_srv_rate -1.503403 0.771283 -1.697443e-08 1.000004
logged_in -0.809262 1.235694 1.928198e-08 1.000004
diff_srv_rate -0.349683 5.196208 9.552657e-09 1.000004
srv_diff_host_rate -0.374560 3.474118 2.159305e-09 1.000004
dst_host_srv_count -1.044721 1.258754 -7.055719e-09 1.000004
dst_bytes -0.004919 325.748596 2.518628e-11 1.000004
dst_host_srv_diff_host_rate -0.289103 8.594782 2.340235e-09 1.000004
flag_SF 0.000000 1.000000 5.949291e-01 0.490908
service_http 0.000000 1.000000 3.202115e-01 0.466560
service_private 0.000000 1.000000 1.734737e-01 0.378658
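The numeric features are still standardized (mean near 0, standard deviation near 1), and the one-hot columns remain 0/1 indicators. This is the feature set we will use for modeling.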
Proposed Feature Set for Supervised and Unsupervised Learning
DataFrames: df_train_top17 and df_test_top17.
Column Name | Data Type | Description | Notes |
---|---|---|---|
src_bytes | int64 | Number of data bytes sent from source to destination. | Numeric (scaled) |
dst_bytes | int64 | Number of data bytes sent from destination to source. | Numeric (scaled) |
count | int64 | Number of connections to the same host in the past 2 seconds. | Numeric (scaled) |
srv_diff_host_rate | float64 | % of connections to different hosts on the same service. | Numeric (scaled) |
serror_rate | float64 | % of connections with SYN errors. | Numeric (scaled) |
same_srv_rate | float64 | % of connections to the same service. | Numeric (scaled) |
diff_srv_rate | float64 | % of connections to different services. | Numeric (scaled) |
dst_host_count | int64 | Number of connections to the destination host. | Numeric (scaled) |
dst_host_srv_count | int64 | Number of connections to the destination host and service. | Numeric (scaled) |
dst_host_same_srv_rate | float64 | % of connections to the same service on the destination host. | Numeric (scaled) |
dst_host_diff_srv_rate | float64 | % of connections to different services on the destination host. | Numeric (scaled) |
dst_host_same_src_port_rate | float64 | % of connections from the same source port. | Numeric (scaled) |
dst_host_srv_diff_host_rate | float64 | % of connections to the same service from different hosts. | Numeric (scaled) |
logged_in | int64 | 1 if successfully logged in; 0 otherwise. | Binary indicator |
flag_SF | int64 | One-hot encoded: Status flag “SF” of the connection. | One-hot encoded categorical |
service_http | int64 | One-hot encoded: Network service is HTTP. | One-hot encoded categorical |
service_private | int64 | One-hot encoded: Network service is “private”. | One-hot encoded categorical |
is_anomalous | int64 | Target: 1 if connection is anomalous; 0 if normal. | Target variable |
Visualizations
Correlation Heatmap
Mutual Information Bar Plot
PCA Projection
Machine Learning Models
Supervised Model and Tuning
Computed scale_pos_weight: 1.15
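The value 1.15 matches the ratio of normal to attack rows in the training set, which is the usual convention for `scale_pos_weight` in gradient-boosted tree libraries (negative count divided by positive count). Assuming that is how it was computed, the arithmetic is:

```python
# Class counts taken from the is_anomalous table for the training set.
n_normal = 67343  # negative class (is_anomalous == 0)
n_attack = 58630  # positive class (is_anomalous == 1)

# Weight applied to positive examples so both classes contribute equally to the loss.
scale_pos_weight = n_normal / n_attack
print(f"Computed scale_pos_weight: {scale_pos_weight:.2f}")  # 1.15
```

Because the classes are close to balanced, the weight is close to 1 and should have only a mild effect on training.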
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5] END colsample_bytree=0.6, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=3.0, reg_lambda=2.5, subsample=0.97;, score=0.998 total time= 0.6s
[CV 2/5] END colsample_bytree=0.6, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=3.0, reg_lambda=2.5, subsample=0.97;, score=0.998 total time= 0.5s
[CV 3/5] END colsample_bytree=0.6, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=3.0, reg_lambda=2.5, subsample=0.97;, score=0.998 total time= 0.5s
[CV 4/5] END colsample_bytree=0.6, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=3.0, reg_lambda=2.5, subsample=0.97;, score=0.998 total time= 0.5s
[CV 5/5] END colsample_bytree=0.6, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=3.0, reg_lambda=2.5, subsample=0.97;, score=0.998 total time= 0.6s
[CV 1/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.055, max_depth=13, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.97;, score=0.998 total time= 0.5s
[CV 2/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.055, max_depth=13, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.97;, score=0.998 total time= 0.5s
[CV 3/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.055, max_depth=13, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.97;, score=0.998 total time= 0.5s
[CV 4/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.055, max_depth=13, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.97;, score=0.998 total time= 0.5s
[CV 5/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.055, max_depth=13, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.97;, score=0.998 total time= 0.5s
[CV 1/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.93;, score=0.998 total time= 0.6s
[CV 2/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.93;, score=0.998 total time= 0.5s
[CV 3/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.93;, score=0.998 total time= 0.5s
[CV 4/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.93;, score=0.998 total time= 0.5s
[CV 5/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.93;, score=0.998 total time= 0.5s
Best F1 score: 0.998148534307586
Best params: {'subsample': 0.97, 'reg_lambda': 2.75, 'reg_alpha': 2.75, 'n_estimators': 340, 'max_depth': 13, 'learning_rate': 0.055, 'gamma': 0.05, 'colsample_bytree': 0.65}
Predict Against Test Set
Test Accuracy: 0.778388928317956
Test ROC AUC: 0.963783985241748
Classification Report:
precision recall f1-score support
0 0.67 0.97 0.79 9711
1 0.97 0.63 0.76 12833
accuracy 0.78 22544
macro avg 0.82 0.80 0.78 22544
weighted avg 0.84 0.78 0.78 22544
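The drop from a cross-validated F1 of about 0.998 to a test accuracy of about 0.78 is concentrated in the attack class, whose recall is 0.63. Overall accuracy is the support-weighted average of the per-class recalls, which can be checked from the report itself. The figures below use the rounded values in the report, so the result is approximate.

```python
# Per-class support and recall, as printed in the classification report.
support = {0: 9711, 1: 12833}
recall = {0: 0.97, 1: 0.63}  # rounded to two decimals in the report

# Correct predictions per class = recall * support; accuracy is their share of all rows.
correct = sum(recall[c] * support[c] for c in support)
accuracy = correct / sum(support.values())
print(f"Approximate accuracy: {accuracy:.3f}")

# Attacks predicted as normal (false negatives), the costly error for intrusion detection.
missed_attacks = (1 - recall[1]) * support[1]
print(f"Approximate missed attacks: {missed_attacks:.0f}")
```

The high ROC AUC alongside low attack recall suggests the model ranks connections well but that the default decision threshold is too conservative for the test distribution.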
SHAP Visualization of Feature Importance
Missed vs. Accurate Predictions
Missing indexes from df_test: 0
Indexes in same order: True
Unsupervised Learning Model
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Suggested eps: 2.938
DBSCAN Epsilon: 0.8
DBSCAN Clusters: 146 clusters (labels 0 through 145) plus noise (label -1)
DBSCAN Silhouette Score: 0.11440226511971831
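A typical DBSCAN workflow estimates eps from the sorted k-nearest-neighbor distances, fits the model, and then counts clusters and noise points. The sketch below runs on synthetic blobs; the 95th-percentile rule is a crude stand-in for reading the elbow of the k-distance curve, and excluding noise points from the silhouette score is an assumption, since the source does not show either choice.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

# Synthetic, well-separated data (hypothetical, not the project's features).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.3, random_state=0)

min_samples = 5

# k-distance heuristic: distance to the farthest of each point's min_samples
# nearest neighbors (the point itself is included, as in DBSCAN's own count).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])
suggested_eps = float(np.percentile(k_dist, 95))
print(f"Suggested eps: {suggested_eps:.3f}")

labels = DBSCAN(eps=0.8, min_samples=min_samples).fit_predict(X)

# Label -1 marks noise and is not a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(f"Clusters: {n_clusters}, noise points: {n_noise}")

# Silhouette on clustered points only (assumption: noise is excluded).
mask = labels != -1
score = silhouette_score(X[mask], labels[mask])
print(f"Silhouette Score: {score:.3f}")
```

An eps well below the suggested value, as in the run above (0.8 versus 2.938), tends to fragment the data into many small clusters, which is consistent with the 146 clusters and low silhouette score reported.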
K-Means Clustering
Optimal number of clusters (lowest DBI): 19
KMeans Inertia: 249999.30905752297
KMeans Silhouette Score: 0.41426751982683147
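The Davies-Bouldin index (DBI) compares within-cluster scatter to between-cluster separation, so lower is better. A minimal sketch of choosing k by lowest DBI, on synthetic blobs with a hypothetical search range of 2 to 7:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data with four well-separated groups (hypothetical).
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=0.5, random_state=0)

dbi = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    dbi[k] = davies_bouldin_score(X, labels)

best_k = min(dbi, key=dbi.get)
print(f"Optimal number of clusters (lowest DBI): {best_k}")

# Refit at the chosen k and report the same metrics as above.
final = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
print(f"KMeans Inertia: {final.inertia_:.3f}")
print(f"KMeans Silhouette Score: {silhouette_score(X, final.labels_):.3f}")
```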
Soft Clustering (Gaussian Mixture Model)
Optimal number of clusters (BIC): 29
Average silhouette score for GMM clustering is: 0.307
BIC Score: -14494533.421
Converged: True
Number of iterations: 41
Cluster Weights: [3.33901886e-02 8.89795095e-02 9.21080112e-02 2.82282711e-02
1.48841276e-02 5.20136320e-02 7.93820898e-06 1.96855283e-02
1.58764180e-05 5.61327454e-02 1.45664228e-02 5.19924698e-02
3.06276380e-02 5.22593427e-02 2.57048008e-02 9.58318073e-03
2.38146269e-05 1.58764180e-05 4.55624762e-03 8.93401549e-03
1.40351874e-01 2.21067998e-02 1.54294993e-01 1.72366187e-02
2.47055959e-02 6.54053309e-03 2.29213618e-02 4.11953659e-03
2.40130499e-02]
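A Gaussian mixture assigns each point a probability of belonging to every component, and the number of components can be chosen by minimizing the Bayesian information criterion (BIC). The sketch below uses scikit-learn's `GaussianMixture` on synthetic data; the search range of 1 to 6 components is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three groups (hypothetical).
X, _ = make_blobs(n_samples=450, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=0.7, random_state=0)

# Fit one model per candidate component count and score each with BIC.
models = {k: GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 7)}
bic = {k: m.bic(X) for k, m in models.items()}

best_k = min(bic, key=bic.get)  # lower BIC is better
gmm = models[best_k]

print(f"Optimal number of clusters (BIC): {best_k}")
print(f"BIC Score: {bic[best_k]:.3f}")
print(f"Converged: {gmm.converged_}")
print(f"Number of iterations: {gmm.n_iter_}")
print("Cluster Weights:", np.round(gmm.weights_, 4))

# Soft assignment: one membership probability per component for each point.
soft = gmm.predict_proba(X)
hard_labels = soft.argmax(axis=1)
```

The weights sum to 1, so components with weights near 1e-5, like several in the output above, each account for only a handful of the 125973 rows.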
Comparing Unsupervised Learning Results
Comparing silhouette scores:
DBSCAN: 0.114
KMeans clustering: 0.414
GMM clustering: 0.307
Comparing Adjusted Rand Index:
DBSCAN ARI: 0.24221704799207797
KMeans ARI: 0.17869312477322208
GMM ARI: 0.1433135299744891
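The two comparisons rank the models differently because they measure different things: the silhouette score looks only at cluster geometry, while the Adjusted Rand Index (ARI) measures agreement with reference labels (presumably `is_anomalous` here, which is an assumption). The toy example below shows that ARI ignores label names but penalizes splitting a class across several clusters.

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 0, 1, 1, 1]

relabeled = [1, 1, 1, 0, 0, 0]   # same partition, different label names
over_split = [0, 0, 1, 2, 2, 3]  # every cluster is pure, but each class is split in two

ari_relabeled = adjusted_rand_score(y_true, relabeled)
ari_over_split = adjusted_rand_score(y_true, over_split)

print(ari_relabeled)   # 1.0
print(ari_over_split)  # 0.375
```

This is one reason clusterings with many more clusters than classes can still score low on ARI even when individual clusters are mostly pure.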
Visualize Unsupervised Learning Results
Feature Importance for Unsupervised Learning
Top 5 K-Means Feature Importances:
Feature Importance
1 dst_bytes 74.869696
0 src_bytes 51.907870
11 dst_host_srv_diff_host_rate 2.003900
2 diff_srv_rate 1.520453
7 dst_host_diff_srv_rate 1.456444
Top 5 GMM Feature Importances:
Feature Importance
1 dst_bytes 61.527790
0 src_bytes 42.739940
11 dst_host_srv_diff_host_rate 1.459121
7 dst_host_diff_srv_rate 1.325182
2 diff_srv_rate 1.198797
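Clustering models do not expose feature importances directly, and the source does not show how these scores were computed. One common proxy, used here purely as an assumption, is the variance of the cluster centroids along each feature: a feature whose centroid values differ widely between clusters is doing more of the separating. The data and feature names below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 300
group = rng.integers(0, 3, n)
X = pd.DataFrame({
    "dst_bytes": group * 20 + rng.normal(size=n),  # separates the three groups
    "count": rng.normal(size=n),                   # carries no cluster structure
})

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Proxy importance: spread of the centroids along each feature (an assumption,
# not necessarily the method behind the scores reported above).
importance = pd.Series(km.cluster_centers_.var(axis=0), index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Under this kind of measure, features with extreme scaled ranges dominate, which would be consistent with `dst_bytes` and `src_bytes` leading both lists: their scaled maxima (about 326 and 235) are far larger than those of any other feature.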