Packet Traffic Learning

INFO 523 - Final Project

Project description: This project develops a predictive model that detects anomalous network behavior from packet-level and statistical features derived from network traffic. Using machine learning, we aim to classify connections as normal or anomalous, a capability that supports intrusion detection, network security monitoring, and incident response automation.
Author: The Anomalists - Joey Garcia, David Kyle
Affiliation: College of Information Science, University of Arizona

Explore Data

Shapes (train, test): (125973, 43) (22544, 43)
Missing values (train, test): 0 0
Attack classes (train, sample): ['normal' 'neptune' 'warezclient' 'ipsweep' 'portsweep' 'teardrop' 'nmap'
 'satan' 'smurf' 'pod' 'back' 'guess_passwd' 'ftp_write' 'multihop'
 'rootkit' 'buffer_overflow' 'imap' 'warezmaster' 'phf' 'land'
 'loadmodule' 'spy' 'perl']

There are four object columns. Let’s look at their categorical data.


Top protocol_type values:
protocol_type
tcp     102689
udp      14993
icmp      8291
Name: count, dtype: int64

Top service values:
service
http        40338
private     21853
domain_u     9043
smtp         7313
ftp_data     6860
eco_i        4586
other        4359
ecr_i        3077
telnet       2353
finger       1767
Name: count, dtype: int64

Top flag values:
flag
SF        74945
S0        34851
REJ       11233
RSTR       2421
RSTO       1562
S1          365
SH          271
S2          127
RSTOS0      103
S3           49
Name: count, dtype: int64

Attack values:
attack
normal             67343
neptune            41214
satan               3633
ipsweep             3599
portsweep           2931
smurf               2646
nmap                1493
back                 956
teardrop             892
warezclient          890
pod                  201
guess_passwd          53
buffer_overflow       30
warezmaster           20
land                  18
imap                  11
rootkit               10
loadmodule             9
ftp_write              8
multihop               7
phf                    4
perl                   3
spy                    2
Name: count, dtype: int64

Feature Engineering

is_anomalous  Count  Percentage
Normal        67343       53.46
Attack        58630       46.54
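The binary target above can be derived from the multi-class attack column. The following is a minimal sketch on a toy frame, assuming (as the table suggests) that every class other than 'normal' is mapped to 1; the toy rows are illustrative, not the project data.

```python
import pandas as pd

# Toy stand-in for the 'attack' column; the real training set has 23 classes.
df = pd.DataFrame({"attack": ["normal", "neptune", "normal", "smurf", "normal"]})

# Assumption: 1 for any attack class, 0 for 'normal'.
df["is_anomalous"] = (df["attack"] != "normal").astype(int)

# Count and percentage per class, in the same layout as the table above.
counts = df["is_anomalous"].value_counts().sort_index()
summary = pd.DataFrame({
    "Count": counts,
    "Percentage": (counts / counts.sum() * 100).round(2),
})
summary.index = summary.index.map({0: "Normal", 1: "Attack"})
print(summary)
```

With the real counts, 67343 / 125973 gives the 53.46% reported for normal traffic, so the classes are close to balanced.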

Preprocessing

Identify Categorical and Numerical Data

Categorical columns: ['protocol_type', 'service', 'flag'] 
Numeric columns: ['duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate']
Categorical Length: 3 
Numeric Length: 38

Scaling and One-Hot Encoding

Train Shape: (125973, 122) 
Test Shape: (22544, 122)

Train Set Scaling Check:
Max |mean|: 0.000000
Max |std - 1|: 1.000000

Test Set Scaling Check:
Max |mean|: 0.451471
Max |std - 1|: 6.838182
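The test-set check is expected to deviate from mean 0 and std 1 if the scaler was fit on the training set only and then applied to the test set, which is the usual way to avoid leakage. Below is a small sketch of that pattern using scikit-learn's ColumnTransformer; the toy columns and the unseen 'gre' category are assumptions for illustration, and the original pipeline may be built differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "duration": [0.0, 2.0, 4.0, 6.0],
    "src_bytes": [10.0, 20.0, 30.0, 40.0],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
})
test = pd.DataFrame({
    "duration": [10.0, 12.0],
    "src_bytes": [100.0, 5.0],
    "protocol_type": ["tcp", "gre"],  # 'gre' never appears in train
})

num_cols = ["duration", "src_bytes"]
cat_cols = ["protocol_type"]

pre = ColumnTransformer(
    [
        ("num", StandardScaler(), num_cols),
        # Unknown test categories become an all-zero row instead of an error.
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train = pre.fit_transform(train)  # statistics learned from train only
X_test = pre.transform(test)        # same statistics reused on test

print("train max |mean|:", np.abs(X_train[:, :2].mean(axis=0)).max())
print("test  max |mean|:", np.abs(X_test[:, :2].mean(axis=0)).max())
```

Because both sets pass through the same fitted transformer, they end up with the same number of columns, as in the (…, 122) shapes above. One possible explanation for the train value of 1.0 in Max |std - 1| is a zero-variance column, whose standard deviation stays at 0 after scaling.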

Feature Selection

Variance Threshold

Variance threshold = 0.01
Features before: 122
Features after:  56
Dropped: 66 features

First 10 kept features:
Index(['duration', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment',
       'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised'],
      dtype='object')
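A variance filter like the one above can be sketched with scikit-learn's VarianceThreshold. The toy columns below are assumptions chosen to show the two typical casualties: a constant column, and a one-hot column for a very rare category, whose variance p(1 - p) falls below 0.01 once p is under roughly 1%.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "src_bytes": rng.normal(size=n),              # standardized numeric, variance near 1
    "num_outbound_cmds": np.zeros(n),             # constant, variance 0
    "service_rare": np.r_[1.0, np.zeros(n - 1)],  # 1 in 200 rows, variance about 0.005
    "flag_SF": np.tile([0.0, 1.0], n // 2),       # balanced indicator, variance 0.25
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(X)

kept = X.columns[selector.get_support()]
dropped = X.columns[~selector.get_support()]
print("kept:", list(kept))
print("dropped:", list(dropped))
```

This may explain why so many of the 122 columns are removed: most one-hot service and flag indicators cover only a small share of connections.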

High-Correlation Feature Pruning

Correlation threshold = 0.95
Features before: 56
Features after:  49
Dropped: 7 features

Dropped features due to high correlation:
['num_root', 'srv_serror_rate', 'srv_rerror_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_srv_rerror_rate', 'flag_S0']
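One common way to prune at a correlation threshold is to scan the upper triangle of the absolute correlation matrix and drop the later column of each highly correlated pair. The helper name drop_highly_correlated and the toy data are hypothetical; the original code may choose which column of a pair to drop differently.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    """Drop every column whose |corr| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
base = rng.random(500)
toy = pd.DataFrame({
    "serror_rate": base,
    "srv_serror_rate": base + rng.normal(scale=0.01, size=500),  # near-duplicate
    "count": rng.random(500),                                    # unrelated
})

pruned, dropped = drop_highly_correlated(toy, threshold=0.95)
print("dropped:", dropped)
print("kept:", list(pruned.columns))
```

Under this scheme the earlier column survives, which is consistent with serror_rate being kept while srv_serror_rate is dropped in the output above.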

Mutual Information Scores

Top 30 Features by Mutual Information:

                    Feature  MI_Score
                  src_bytes  0.564592
                  dst_bytes  0.438462
              diff_srv_rate  0.360036
              same_srv_rate  0.355225
         dst_host_srv_count  0.331552
                    flag_SF  0.330428
     dst_host_same_srv_rate  0.304843
     dst_host_diff_srv_rate  0.285189
                  logged_in  0.283532
                serror_rate  0.276421
                      count  0.265388
dst_host_srv_diff_host_rate  0.189679
               service_http  0.185479
             dst_host_count  0.137209
dst_host_same_src_port_rate  0.132210
            service_private  0.119282
         srv_diff_host_rate  0.100504

Train shape with top 30 features: (125973, 18)
Test shape with top 30 features:  (22544, 18)
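Mutual information ranks each feature by how much knowing it reduces uncertainty about the target. A minimal sketch with scikit-learn's mutual_info_classif follows; the synthetic columns are assumptions. Note that the output above lists 17 features and the resulting shapes have 18 columns, which would be consistent with 17 features plus the is_anomalous target and possibly an additional score cutoff, though the source does not state this.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)  # toy binary target

X = pd.DataFrame({
    "informative": y + rng.normal(scale=0.3, size=n),  # strongly tied to y
    "weak": y + rng.normal(scale=3.0, size=n),         # mostly noise
    "noise": rng.normal(size=n),                       # unrelated to y
})

# Scores are estimated in nats and are never negative.
mi = mutual_info_classif(X, y, random_state=0)
ranking = pd.Series(mi, index=X.columns, name="MI_Score").sort_values(ascending=False)
print(ranking)

# Keep the k highest-scoring features (k is a free choice).
top_k = ranking.head(2).index.tolist()
X_top = X[top_k]
```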

Confirm everything is still scaled.

                                  min         max          mean       std
src_bytes                   -0.007762  235.067459  1.394409e-10  1.000004
dst_host_diff_srv_rate      -0.439078    4.854138 -2.108072e-09  1.000004
dst_host_same_srv_rate      -1.161030    1.066401  3.018620e-09  1.000004
dst_host_same_src_port_rate -0.480197    2.756092  1.234607e-08  1.000004
dst_host_count              -1.836071    0.734343  8.007745e-09  1.000004
serror_rate                 -0.637209    1.602664 -2.351370e-08  1.000004
count                       -0.734511    3.728053 -5.835707e-09  1.000004
same_srv_rate               -1.503403    0.771283 -1.697443e-08  1.000004
logged_in                   -0.809262    1.235694  1.928198e-08  1.000004
diff_srv_rate               -0.349683    5.196208  9.552657e-09  1.000004
srv_diff_host_rate          -0.374560    3.474118  2.159305e-09  1.000004
dst_host_srv_count          -1.044721    1.258754 -7.055719e-09  1.000004
dst_bytes                   -0.004919  325.748596  2.518628e-11  1.000004
dst_host_srv_diff_host_rate -0.289103    8.594782  2.340235e-09  1.000004
flag_SF                      0.000000    1.000000  5.949291e-01  0.490908
service_http                 0.000000    1.000000  3.202115e-01  0.466560
service_private              0.000000    1.000000  1.734737e-01  0.378658

The numeric features are still standardized (mean ≈ 0, std ≈ 1), while the one-hot columns remain 0/1 indicators. This is the feature set we will use for modeling.

Proposed Feature Set for Supervised and Unsupervised Learning

DataFrames: df_train_top17 and df_test_top17.

Column Name                  Data Type  Description                                                      Notes
src_bytes                    int64      Number of data bytes sent from source to destination.            Numeric (scaled)
dst_bytes                    int64      Number of data bytes sent from destination to source.            Numeric (scaled)
count                        int64      Number of connections to the same host in the past 2 seconds.    Numeric (scaled)
srv_diff_host_rate           float64    % of connections to different hosts on the same service.         Numeric (scaled)
serror_rate                  float64    % of connections with SYN errors.                                Numeric (scaled)
same_srv_rate                float64    % of connections to the same service.                            Numeric (scaled)
diff_srv_rate                float64    % of connections to different services.                          Numeric (scaled)
dst_host_count               int64      Number of connections to the destination host.                   Numeric (scaled)
dst_host_srv_count           int64      Number of connections to the destination host and service.       Numeric (scaled)
dst_host_same_srv_rate       float64    % of connections to the same service on the destination host.    Numeric (scaled)
dst_host_diff_srv_rate       float64    % of connections to different services on the destination host.  Numeric (scaled)
dst_host_same_src_port_rate  float64    % of connections from the same source port.                      Numeric (scaled)
dst_host_srv_diff_host_rate  float64    % of connections to the same service from different hosts.       Numeric (scaled)
logged_in                    int64      1 if successfully logged in; 0 otherwise.                        Binary indicator
flag_SF                      int64      One-hot encoded: Status flag “SF” of the connection.             One-hot encoded categorical
service_http                 int64      One-hot encoded: Network service is HTTP.                        One-hot encoded categorical
service_private              int64      One-hot encoded: Network service is “private”.                   One-hot encoded categorical
is_anomalous                 int64      Target: 1 if connection is anomalous; 0 if normal.               Target variable

Visualizations

Correlation Heatmap

Mutual Information Bar Plot

PCA Projection

Machine Learning Models

Supervised Model and Tuning

Computed scale_pos_weight: 1.15
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5] END colsample_bytree=0.6, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=3.0, reg_lambda=2.5, subsample=0.97;, score=0.998 total time=   0.6s
[CV 2/5] END colsample_bytree=0.6, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=3.0, reg_lambda=2.5, subsample=0.97;, score=0.998 total time=   0.5s
[CV 3/5] END colsample_bytree=0.6, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=3.0, reg_lambda=2.5, subsample=0.97;, score=0.998 total time=   0.5s
[CV 4/5] END colsample_bytree=0.6, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=3.0, reg_lambda=2.5, subsample=0.97;, score=0.998 total time=   0.5s
[CV 5/5] END colsample_bytree=0.6, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=3.0, reg_lambda=2.5, subsample=0.97;, score=0.998 total time=   0.6s
[CV 1/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.055, max_depth=13, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.97;, score=0.998 total time=   0.5s
[CV 2/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.055, max_depth=13, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.97;, score=0.998 total time=   0.5s
[CV 3/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.055, max_depth=13, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.97;, score=0.998 total time=   0.5s
[CV 4/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.055, max_depth=13, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.97;, score=0.998 total time=   0.5s
[CV 5/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.055, max_depth=13, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.97;, score=0.998 total time=   0.5s
[CV 1/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.93;, score=0.998 total time=   0.6s
[CV 2/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.93;, score=0.998 total time=   0.5s
[CV 3/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.93;, score=0.998 total time=   0.5s
[CV 4/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.93;, score=0.998 total time=   0.5s
[CV 5/5] END colsample_bytree=0.65, gamma=0.05, learning_rate=0.05, max_depth=14, n_estimators=340, reg_alpha=2.75, reg_lambda=2.75, subsample=0.93;, score=0.998 total time=   0.5s
Best F1 score: 0.998148534307586
Best params: {'subsample': 0.97, 'reg_lambda': 2.75, 'reg_alpha': 2.75, 'n_estimators': 340, 'max_depth': 13, 'learning_rate': 0.055, 'gamma': 0.05, 'colsample_bytree': 0.65}
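The reported scale_pos_weight of 1.15 matches the ratio of normal to attack rows in the training set, which is the common convention for this parameter (negative count divided by positive count). The hyperparameter names in the log suggest an XGBoost-style classifier, though that is an inference from the output. The arithmetic can be checked directly:

```python
# Class counts from the is_anomalous table in the Feature Engineering section.
n_normal = 67343  # negative class (is_anomalous == 0)
n_attack = 58630  # positive class (is_anomalous == 1)

# Weight applied to positive examples so both classes contribute comparably.
scale_pos_weight = n_normal / n_attack
print(f"scale_pos_weight: {scale_pos_weight:.2f}")

# The search log reports 3 candidate settings evaluated with 5-fold CV.
n_candidates, n_folds = 3, 5
n_fits = n_candidates * n_folds
print(f"total fits: {n_fits}")
```

A weight this close to 1 reflects the near-balanced classes, so it should have only a mild effect on the fitted model.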

Predict Against Test Set

Test Accuracy: 0.778388928317956
Test ROC AUC: 0.963783985241748
Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.97      0.79      9711
           1       0.97      0.63      0.76     12833

    accuracy                           0.78     22544
   macro avg       0.82      0.80      0.78     22544
weighted avg       0.84      0.78      0.78     22544
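The f1-score and accuracy columns in the report follow from precision, recall, and support. The short check below recomputes them from the rounded values printed above; because the inputs are rounded to two decimals, the results agree with the report only to about two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the classification report.
report = {
    0: {"precision": 0.67, "recall": 0.97, "support": 9711},   # normal
    1: {"precision": 0.97, "recall": 0.63, "support": 12833},  # attack
}

f1_scores = {c: f1(r["precision"], r["recall"]) for c, r in report.items()}

# recall * support approximates the number of correctly classified rows per class.
total = sum(r["support"] for r in report.values())
correct = sum(r["recall"] * r["support"] for r in report.values())
accuracy = correct / total

for c, score in f1_scores.items():
    print(f"class {c}: f1 = {score:.2f}")
print(f"accuracy = {accuracy:.2f}")
```

The attack recall of 0.63 is what pulls accuracy down: roughly a third of anomalous test connections are predicted as normal, even though the predictions flagged as attacks are correct 97% of the time.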

SHAP Visualization of Feature Importance

Missed vs. Accurate Predictions

Missing indexes from df_test: 0
Indexes in same order: True

Unsupervised Learning Model

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Suggested eps: 2.938
DBSCAN Epsilon: 0.8
DBSCAN Clusters: labels 0 through 145 (146 clusters), plus -1 (noise)
DBSCAN Silhouette Score: 0.11440226511971831
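The source does not say how the suggested eps was derived or why 0.8 was used instead. One widely used heuristic is to read eps off the sorted k-nearest-neighbor distance curve. The sketch below does that on synthetic blobs; taking the 95th percentile as the cut point and excluding noise points from the silhouette score are both assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 6], [0, 6]], cluster_std=0.5, random_state=0
)

min_samples = 5
# Distance from each point to its farthest of min_samples neighbors
# (the query point itself counts as the first neighbor, at distance 0).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Hypothetical cut point: most points become core points at this radius.
suggested_eps = float(np.percentile(k_dist, 95))

labels = DBSCAN(eps=suggested_eps, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels) - {-1})  # -1 marks noise, not a cluster
n_noise = int((labels == -1).sum())

# Silhouette over clustered points only (a choice; noise could also be kept).
mask = labels != -1
score = silhouette_score(X[mask], labels[mask])
print(f"eps={suggested_eps:.3f} clusters={n_clusters} noise={n_noise} silhouette={score:.3f}")
```

A smaller eps than the heuristic suggests, as in the run above, tends to fragment the data into many small clusters and more noise points, which would be consistent with the 146 clusters reported.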

K-Means Clustering

Optimal number of clusters (lowest DBI): 19
KMeans Inertia: 249999.30905752297
KMeans Silhouette Score: 0.41426751982683147
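Choosing k by the lowest Davies-Bouldin index (DBI) can be sketched as a simple loop over candidate values. The synthetic data and the candidate range are assumptions; the range searched in the original run is not shown.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [8, 8], [0, 8], [8, 0]],
    cluster_std=0.6,
    random_state=0,
)

# Lower DBI means tighter, better-separated clusters.
dbi = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    dbi[k] = davies_bouldin_score(X, labels)

best_k = min(dbi, key=dbi.get)

# Refit at the chosen k and report the same metrics as above.
model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
print("best k:", best_k)
print("inertia:", model.inertia_)
print("silhouette:", silhouette_score(X, model.labels_))
```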

Soft Clustering

Optimal number of clusters (BIC): 29
Average silhouette score for GMM clustering is: 0.307
BIC Score: -14494533.421
Converged: True
Number of iterations: 41
Cluster Weights: [3.33901886e-02 8.89795095e-02 9.21080112e-02 2.82282711e-02
 1.48841276e-02 5.20136320e-02 7.93820898e-06 1.96855283e-02
 1.58764180e-05 5.61327454e-02 1.45664228e-02 5.19924698e-02
 3.06276380e-02 5.22593427e-02 2.57048008e-02 9.58318073e-03
 2.38146269e-05 1.58764180e-05 4.55624762e-03 8.93401549e-03
 1.40351874e-01 2.21067998e-02 1.54294993e-01 1.72366187e-02
 2.47055959e-02 6.54053309e-03 2.29213618e-02 4.11953659e-03
 2.40130499e-02]
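Soft clustering here appears to be a Gaussian mixture model (GMM) selected by the Bayesian information criterion, where a lower BIC is better. The sketch below shows that selection loop with scikit-learn's GaussianMixture on synthetic data; the candidate range and the full covariance type are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 6], [0, 6]], cluster_std=0.5, random_state=0
)

# Fit one mixture per candidate size and record its BIC.
bic = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    bic[k] = gmm.bic(X)

best_k = min(bic, key=bic.get)
best = GaussianMixture(n_components=best_k, covariance_type="full", random_state=0).fit(X)

# Soft assignments: each row holds membership probabilities that sum to 1.
responsibilities = best.predict_proba(X)

print("best k:", best_k)
print("converged:", best.converged_, "iterations:", best.n_iter_)
print("weights:", np.round(best.weights_, 3))
```

The cluster weights are the mixing proportions and sum to 1. Weights on the order of 1e-05, as several are above, correspond to components that account for only a handful of rows.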

Comparing Unsupervised Learning Results

Comparing silhouette scores:
DBSCAN: 0.114
KMeans clustering: 0.414
GMM clustering: 0.307

Comparing Adjusted Rand Index:
DBSCAN ARI: 0.24221704799207797
KMeans ARI: 0.17869312477322208
GMM ARI: 0.1433135299744891
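Silhouette measures how compact and separated the clusters are, while the Adjusted Rand Index (ARI) measures agreement with a reference labeling, so the two can rank the models differently. Assuming the ARI here is computed against the binary is_anomalous label (the source does not say), a clustering with many more clusters than classes is penalized even when each cluster is pure. A toy example:

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 0, 1, 1, 1]

# Same grouping with swapped label names: ARI ignores the names.
renamed = [1, 1, 1, 0, 0, 0]

# Every cluster is pure, but each class is split across two clusters.
over_split = [0, 0, 1, 2, 2, 3]

ari_renamed = adjusted_rand_score(y_true, renamed)
ari_over_split = adjusted_rand_score(y_true, over_split)

print("renamed:", ari_renamed)
print("over-split:", ari_over_split)
```

This may be part of why all three ARI values are low: 146, 19, and 29 clusters are being compared with a two-class label.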

Visualize Unsupervised Learning Results

Feature Importance for Unsupervised Learning

Top 5 K-Means Feature Importances:
                        Feature  Importance
1                     dst_bytes   74.869696
0                     src_bytes   51.907870
11  dst_host_srv_diff_host_rate    2.003900
2                 diff_srv_rate    1.520453
7        dst_host_diff_srv_rate    1.456444
Top 5 GMM Feature Importances:
                        Feature  Importance
1                     dst_bytes   61.527790
0                     src_bytes   42.739940
11  dst_host_srv_diff_host_rate    1.459121
7        dst_host_diff_srv_rate    1.325182
2                 diff_srv_rate    1.198797
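The source does not state how these importances were computed. One simple proxy, shown here purely as an assumption, is the variance of the cluster centroids along each feature: a feature whose centroid values differ widely across clusters contributes more to separating them. The function name centroid_spread_importance is hypothetical.

```python
import numpy as np
import pandas as pd

def centroid_spread_importance(X, labels):
    """Variance of per-cluster means for each feature, highest first."""
    centroids = X.groupby(labels).mean()
    return centroids.var(axis=0).sort_values(ascending=False)

# Toy data: clusters differ strongly in dst_bytes and not at all in logged_in.
X = pd.DataFrame({
    "dst_bytes": [0.0, 0.1, 10.0, 10.1],
    "logged_in": [0.0, 1.0, 0.0, 1.0],
})
labels = np.array([0, 0, 1, 1])

importance = centroid_spread_importance(X, labels)
print(importance)
```

Under a measure like this, features with extreme scaled values dominate. The scaled ranges shown earlier have maxima of about 235 for src_bytes and 326 for dst_bytes, so a few very large transfers could account for their lead over every other feature.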