INFO 523 - Summer 2025 - Final Project
“Life isn’t chess, a game of perfect information, one that can in theory be ‘solved.’ It’s poker, a game where you’re trying to make the best decisions using the limited information you have.”
― Tom Chivers, Everything Is Predictable: How Bayesian Statistics Explain Our World
What are we solving?
Can we detect anomalous network traffic using supervised and unsupervised machine learning?
Why does it matter?
- Network attacks are increasingly sophisticated and harder to detect
- Faster, more accurate anomaly detection strengthens cybersecurity
- Comparing supervised vs. unsupervised methods reveals complementary insights
The target variable
- Original dataset contained multiple attack types
- Collapsed into a single binary column: is_anomalous
- 0 = normal traffic, 1 = attack (any type)
- Provides a clear target for both supervised and unsupervised learning (see the sketch below)
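A minimal sketch of this collapsing step, assuming the raw data carries an attack_type column with "normal" marking benign rows, as in KDD-style intrusion datasets (the file name and column name here are hypothetical):

```python
import pandas as pd

# Hypothetical file and column names; the raw frame is assumed to hold an
# `attack_type` column whose benign rows are labeled "normal".
df = pd.read_csv("network_traffic.csv")

# Collapse every attack category into one binary target:
# 0 = normal traffic, 1 = attack (any type).
df["is_anomalous"] = (df["attack_type"] != "normal").astype(int)
```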
Process to prepare features
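The slides don't spell the steps out; one common pipeline for KDD-style traffic data, offered here only as an assumed sketch (continuing from the loading sketch above, with illustrative column names), is to one-hot encode the categorical fields and standardize the numeric ones:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column lists; the project's real feature set may differ.
categorical = ["protocol_type", "service", "flag"]
numeric = ["duration", "src_bytes", "dst_bytes"]

# One-hot encode categoricals and standardize numeric features; categories
# unseen at training time are ignored rather than raising an error.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

X = preprocess.fit_transform(df[categorical + numeric])
y = df["is_anomalous"].to_numpy()
```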
Supervised learning (XGBoost)
- Recall weaker for anomalies: 0.63, so some attacks are missed
- Conservative model: few false positives at the cost of more false negatives
Training F1-score: ~99.81%
Test F1-score: ~76%
- Train/test gap points to data drift from new cyber attack methods
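A sketch of the supervised step, assuming X and y from the preparation sketch above. The project evaluates on a separate test set containing unseen attack types; a random stratified split stands in here purely for illustration, and the hyperparameters are defaults, not the project's tuned values:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, recall_score
from xgboost import XGBClassifier

# Stand-in split; the real project uses a predefined test set with new attacks.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = XGBClassifier(n_estimators=200, eval_metric="logloss", random_state=42)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("F1:", f1_score(y_test, pred))
print("Anomaly recall:", recall_score(y_test, pred))  # slides report ~0.63 on the real test set
```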
Attack Types:
- Training set contains 22 attack types
- Test set contains 17 of the original types plus 21 new attacks
Unsupervised learning
Models: K-Means, DBSCAN, Gaussian Mixture (GMM)
Goal: detect anomalies without labels
Evaluation Metrics: silhouette score + Adjusted Rand Index (ARI)
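A sketch of how the three models and two metrics can be wired together, assuming the feature matrix X and labels y from the earlier sketches (hyperparameters here are illustrative, not the project's tuned values; y is used only to score the clusterings, never to fit them):

```python
import scipy.sparse as sp
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, adjusted_rand_score

# GMM requires dense input, so densify if the encoder produced a sparse matrix.
X_dense = X.toarray() if sp.issparse(X) else X

models = {
    "K-Means": KMeans(n_clusters=2, random_state=42),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    "GMM": GaussianMixture(n_components=2, random_state=42),
}

for name, model in models.items():
    labels = model.fit_predict(X_dense)
    # Silhouette needs at least two distinct cluster labels.
    sil = silhouette_score(X_dense, labels) if len(set(labels)) > 1 else float("nan")
    # ARI compares cluster assignments against the true anomaly labels.
    ari = adjusted_rand_score(y, labels)
    print(f"{name}: silhouette={sil:.3f}  ARI={ari:.3f}")
```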
Silhouette Scores:
- DBSCAN: 0.114
- K-Means: 0.414 (best cluster cohesion)
- GMM: 0.307
Adjusted Rand Index:
- DBSCAN: 0.242 (best at matching true anomalies)
- K-Means: 0.179
- GMM: 0.143 (moderate on both metrics)
Conclusion
- Unsupervised models are less accurate than supervised methods
- Insights may help uncover new threats and support proactive network defense
Methods:
- Supervised (XGBoost): SHAP values explain feature importance
- Unsupervised (K-Means, GMM): analyzing how feature values vary across cluster centers
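A sketch of both explanation methods, assuming clf, X_test, and the fitted models dict from the earlier sketches:

```python
import numpy as np
import scipy.sparse as sp
import shap

# SHAP attributions for the trained XGBoost model.
X_test_dense = X_test.toarray() if sp.issparse(X_test) else X_test
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test_dense)
global_importance = np.abs(shap_values).mean(axis=0)  # mean |SHAP| per feature

# For K-Means, influence can be read from how the fitted centers differ
# along each feature: a large gap marks a feature that separates clusters.
km = models["K-Means"]
center_gap = np.abs(km.cluster_centers_[0] - km.cluster_centers_[1])
top_features = np.argsort(center_gap)[::-1][:5]  # five most separating features
```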
Takeaway:
- Consistent top features across methods: src_bytes and dst_bytes
- Service-level rate features influence each method differently
Supervised (XGBoost): high F1-score, but some anomalies missed (false negatives)
Unsupervised (K-Means, GMM): weaker overall, but reveal different service-level patterns
Takeaways:
- Feature overlap: traffic volume (src_bytes, dst_bytes) dominant across methods
- Data Drift: the gap between training F1 (~99%) and test F1 (~76%) signals drift from unseen attack types; this remains an area for continued research (a simple drift check is sketched after this list)
- Effective Practices: a well-rounded defense combines supervised and unsupervised methods
- Continued Research: As attackers evolve, models must adapt. Ongoing monitoring and retraining are essential to address data drift and maintain strong network defenses.
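One simple way to monitor for the drift flagged above, offered only as an assumed sketch using the train/test matrices from the earlier sketches: run a two-sample Kolmogorov-Smirnov test per feature and flag distributions that shifted between training and test traffic.

```python
import numpy as np
from scipy.stats import ks_2samp

# Densify if the encoder produced sparse matrices.
X_tr = X_train.toarray() if hasattr(X_train, "toarray") else np.asarray(X_train)
X_te = X_test.toarray() if hasattr(X_test, "toarray") else np.asarray(X_test)

# Flag features whose train and test distributions differ significantly;
# persistent flags suggest retraining is due.
for i in range(X_tr.shape[1]):
    stat, p = ks_2samp(X_tr[:, i], X_te[:, i])
    if p < 0.01:
        print(f"Feature {i}: possible drift (KS={stat:.3f}, p={p:.1e})")
```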