Predicting MLB Hall of Fame Induction

INFO 523 - Summer 2025 - Final Project

Austin Cortopassi, David Pelley, Nathan Harville

Abstract & Motivation

  • Hall of Fame selection is one of the most prestigious yet subjective honors in Major League Baseball.

Project Goal

  • Build a machine learning model to predict Hall of Fame induction using career-level statistics and achievements.

Data Source

  • Lahman Baseball Database — player records spanning 100+ years

Approach

  • Data cleaning & feature engineering
  • Model training with classification algorithms

Outcome

  • Evaluate prediction accuracy
  • Identify key career metrics that influence Hall of Fame voting outcomes

Exploratory Data Analysis

  • Visualized metric distributions and adjusted data for modeling
  • Built correlation matrices to guide feature engineering
  • Split models:
    • Pitchers (ERA, Wins, Strikeouts)
    • Position Players (AVG, HR, Hits)
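A correlation matrix of the kind mentioned above can be sketched in a few lines. This is a minimal illustration on made-up numbers, not the project's actual EDA code; the helper names `pearson` and `correlation_matrix` and the toy columns are assumptions introduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; report 0
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict {a: {b: r}} for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy career totals for four hypothetical players (not real data).
toy_columns = {
    "H": [1, 2, 3, 4],
    "HR": [2, 4, 6, 8],
    "SO": [4, 3, 2, 1],
}
matrix = correlation_matrix(toy_columns)
```

Pairs with a correlation near 1 or -1 carry largely redundant information, which is one way such a matrix can guide which features to keep or combine.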

Feature Engineering

  • Batting Avg, Slugging %, On-Base %
  • WHIP, K/BB
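The engineered features listed above follow standard baseball formulas, sketched here from career totals. The dictionary keys are assumed to follow Lahman-style column names (`AB`, `H`, `2B`, `3B`, `HR`, `BB`, `HBP`, `SF`, `IPouts`, `SO`); the function names and the zero-denominator handling are illustrative assumptions, not the project's actual code.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def batting_features(career):
    """Derive AVG, SLG and OBP from career batting totals."""
    singles = career["H"] - career["2B"] - career["3B"] - career["HR"]
    total_bases = singles + 2 * career["2B"] + 3 * career["3B"] + 4 * career["HR"]
    times_on_base = career["H"] + career["BB"] + career["HBP"]
    obp_chances = career["AB"] + career["BB"] + career["HBP"] + career["SF"]
    return {
        "AVG": safe_div(career["H"], career["AB"]),
        "SLG": safe_div(total_bases, career["AB"]),
        "OBP": safe_div(times_on_base, obp_chances),
    }


def pitching_features(career):
    """Derive WHIP and K/BB from career pitching totals."""
    innings = career["IPouts"] / 3  # assumed: innings are stored as outs recorded
    return {
        "WHIP": safe_div(career["BB"] + career["H"], innings),
        "K/BB": safe_div(career["SO"], career["BB"]),
    }


# Hypothetical career lines, for illustration only.
batter = {"AB": 500, "H": 150, "2B": 30, "3B": 5, "HR": 20, "BB": 50, "HBP": 5, "SF": 5}
pitcher = {"IPouts": 600, "H": 180, "BB": 60, "SO": 180}
bat = batting_features(batter)
pit = pitching_features(pitcher)
```

Rate statistics like these put players with different career lengths on a comparable scale, which raw counting totals do not.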

Modeling Approach

  • Random Forest → interpretable, feature importance
  • XGBoost → handles complex nonlinear patterns
  • GaussianNB → simple baseline model

Hyperparameter Tuning

  • RandomizedSearchCV

Preparation

  • Train/test split
  • Scaling + PCA experiments
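One way the preparation and tuning steps above might fit together with scikit-learn is sketched below. It runs on synthetic data, and the pipeline layout, the search space, the number of PCA components, and the F1 scoring choice are all assumptions made for illustration; the project's actual settings are not stated here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for career features; the positive class is deliberately rare.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)

# Stratify so both splits keep roughly the same share of positive examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling and PCA live inside the pipeline so they are fit on training folds only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=4)),
    ("model", RandomForestClassifier(random_state=0)),
])

# Hypothetical search space; the "model__" prefix targets the pipeline step.
param_distributions = {
    "model__n_estimators": [50, 100, 200],
    "model__max_depth": [3, 5, None],
    "model__min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="f1",
    random_state=0,
)
search.fit(X_train, y_train)

# With scoring="f1", score() reports F1 on the held-out test set.
test_f1 = search.score(X_test, y_test)
```

Swapping `RandomForestClassifier` for another estimator (and adjusting the parameter names) would reuse the same split and search scaffolding.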

Modeling Evaluation

Metrics Used

  • Accuracy → overall correctness
  • Precision & Recall → false positives/negatives
  • F1 Score → balance of precision & recall
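The four metrics above can be computed directly from confusion-matrix counts. The sketch below uses made-up labels (1 = inducted, 0 = not inducted); the function name `classification_metrics` is an assumption introduced for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 3 inductees among 10 players, with one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.8 while precision, recall and F1 are each 2/3, which illustrates why accuracy alone can flatter a model when the positive class is rare.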

Random Forest Results

[Figures: Batting Random Forest Results; Pitching Random Forest Results]

XGBoost Results

[Figures: Batting XGBoost Results; Pitching XGBoost Results]

Naive Bayes Results

[Figures: Batting Naive Bayes Results; Pitching Naive Bayes Results]

Model Accuracy Results

[Figure: Model Accuracy]

Conclusion

  • Career stats can predict Hall of Fame induction with high accuracy
  • Models capture patterns but miss subjective factors (awards, reputation, historical context)
  • Next steps: add advanced metrics & era adjustments for better accuracy
  • Data analytics offers valuable insight into what drives Hall of Fame voting