Predicting MLB Hall of Fame Inductions
INFO 523 - Final Project
Abstract
Major League Baseball’s Hall of Fame selection process remains one of the most prestigious yet subjective honors in professional sports.
In this project, we aim to develop a machine learning model capable of predicting whether a player will be inducted into the Hall of Fame based on career-level statistics and achievements.
Utilizing the Lahman Baseball Database, which includes detailed player records spanning over a century, we will perform data cleaning, feature engineering, and model training using logistic regression and decision trees.
Our goal is not only to achieve high prediction accuracy, but also to identify which career metrics most significantly influence Hall of Fame voting outcomes.
Project Write Up
Data Preparation
We utilized the Lahman Baseball Database, a publicly available dataset containing a wide array of career statistics for MLB players dating all the way back to the late 1800s.
The three primary categories of statistics we used were:
Batting Statistics: includes hits, home runs, batting average, RBIs, and other offensive statistics.
Pitching Statistics: includes earned run average (ERA), strikeouts, wins, saves, and related metrics.
Fielding Statistics: includes putouts, assists, errors, and fielding percentage.
Our goal was to create two separate models: one for position players (batters) and one for pitchers. To do this, we needed to build separate datasets containing the statistics most relevant to each of the two categories of players.
Our target variable for the project was the column “Inducted”, which indicates whether a player was elected to the Hall of Fame. This is a binary variable: 1 for elected, 0 for not elected.
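As an illustration of how this target can be derived from the raw ballot records, here is a minimal sketch using a toy stand-in for the Lahman `HallOfFame` table (the real table uses the same `playerID`, `inducted`, and `category` columns; the player IDs below are hypothetical):

```python
import pandas as pd

# Toy stand-in for the Lahman HallOfFame table: one row per ballot year.
hof = pd.DataFrame({
    "playerID": ["ruthba01", "ruthba01", "doejo01"],
    "inducted": ["N", "Y", "N"],
    "category": ["Player", "Player", "Player"],
})

# A player counts as inducted if ANY ballot-year row has inducted == "Y";
# restrict to players (the table also lists managers, umpires, etc.).
inducted = (
    hof[hof["category"] == "Player"]
    .groupby("playerID")["inducted"]
    .apply(lambda s: int((s == "Y").any()))
    .rename("Inducted")
    .reset_index()
)
print(inducted)  # ruthba01 -> 1, doejo01 -> 0
```

The resulting table can then be merged onto the career-level batting or pitching features by `playerID`.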
Exploratory Data Analysis (EDA)
Before training, we analyzed the data and made adjustments to better suit modeling. We addressed the following areas during the EDA portion of our project.
- Distribution
- Visualized the distributions of relevant metrics and identified the adjustments needed to make the data better suited for modeling.
- Correlations
- In order to engineer better features for modeling, our team created correlation matrices to analyze relationships between data features.
- Separation
- We split the datasets and models between position players and pitchers. Key pitching metrics include ERA, wins, and strikeouts. Position players, which include all fielding positions as well as the designated hitter (DH), were evaluated in terms of batting average, career home runs, and career hits.
- Feature Engineering
- While the Lahman dataset is extremely robust, it does not include several derived metrics that are commonly reported by sports broadcasters and journalists.
- For position players we created batting average; slugging percentage, which represents the total number of bases a player achieves per at bat; and on-base percentage, which is similar to batting average but accounts for walks as well.
- For pitchers we created WHIP, which is equal to walks plus hits per inning pitched, and K/BB, which is the ratio of strikeouts to walks issued by a pitcher.
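These derived metrics follow standard formulas; a minimal sketch with hypothetical career totals (column names mirror the Lahman `Batting`/`Pitching` tables, where innings pitched are stored as outs recorded in `IPouts`):

```python
import pandas as pd

# Hypothetical career totals in Lahman-style columns.
bat = pd.DataFrame({"H": [200], "AB": [600], "2B": [40], "3B": [5],
                    "HR": [30], "BB": [70], "HBP": [5], "SF": [5]})

# Singles are not stored directly; derive them from total hits.
singles = bat["H"] - bat["2B"] - bat["3B"] - bat["HR"]
bat["AVG"] = bat["H"] / bat["AB"]
bat["SLG"] = (singles + 2 * bat["2B"] + 3 * bat["3B"] + 4 * bat["HR"]) / bat["AB"]
bat["OBP"] = (bat["H"] + bat["BB"] + bat["HBP"]) / (
    bat["AB"] + bat["BB"] + bat["HBP"] + bat["SF"])

pit = pd.DataFrame({"H": [180], "BB": [60], "SO": [210], "IPouts": [600]})
ip = pit["IPouts"] / 3                      # Lahman stores innings as outs
pit["WHIP"] = (pit["BB"] + pit["H"]) / ip   # (60 + 180) / 200 = 1.20
pit["K_BB"] = pit["SO"] / pit["BB"]         # 210 / 60 = 3.50

print(bat[["AVG", "SLG", "OBP"]].round(3))
print(pit[["WHIP", "K_BB"]].round(2))
```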
Modeling Approach
To predict Hall of Fame inductions, we tested the following supervised learning models:
- RandomForestClassifier
- An ensemble of decision trees that improves accuracy and helps identify which stats matter most.
- XGBClassifier
- A gradient-boosted tree model that handles complex, nonlinear relationships in the data.
- Gaussian Naive Bayes (GaussianNB)
- A simple probabilistic model used here as a baseline for comparison.
- RandomizedSearchCV
- Applied to tune hyperparameters for Random Forest and XGBoost. This method tests a random set of parameter combinations, which is faster and more efficient than a full grid search.
Before modeling, the datasets were split into training and test sets. We also experimented with scaling the features and applying PCA to improve performance. As discussed below, results varied across models.
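The split/scale/PCA/tuning steps described above can be sketched as follows, using synthetic data as a stand-in for the real feature matrices (the parameter grid and component count here are illustrative, not the values we actually tuned):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in: most players are not inducted.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Scale -> PCA -> Random Forest, tuned with a randomized search over a
# small (illustrative) hyperparameter space.
pipe = make_pipeline(StandardScaler(), PCA(n_components=5),
                     RandomForestClassifier(random_state=0))
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "randomforestclassifier__n_estimators": [100, 200, 400],
        "randomforestclassifier__max_depth": [None, 5, 10],
    },
    n_iter=5, cv=3, random_state=0)
search.fit(X_train, y_train)
print(f"test accuracy: {search.score(X_test, y_test):.3f}")
```

The same pipeline shape was reused for XGBoost by swapping the final estimator.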
Modeling Evaluation
We evaluated model performance using several classification metrics:
- Accuracy
- Overall percentage of correct predictions.
- Precision and Recall
- Useful for checking how well the models avoid false positives and false negatives.
- F1 Score
- Balances precision and recall into a single measure.
- Model Accuracy
- The table below shows the accuracy of each model in predicting Hall of Fame induction for all eligible MLB players.
| Model | Batting Accuracy | Pitching Accuracy |
|---|---|---|
| Random Forest | 0.92 | 0.91 |
| XGBoost | 0.92 | 0.91 |
| Naive Bayes | 0.68 | 0.84 |
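The metrics listed above are all available in scikit-learn; a small illustrative example with hypothetical induction predictions (not our actual model outputs):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth (1 = inducted) vs. model predictions.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("precision:", precision_score(y_true, y_pred))  # 2/3
print("recall   :", recall_score(y_true, y_pred))     # 2/3
print("f1       :", f1_score(y_true, y_pred))
```

Because induction is rare, precision and recall on the positive class are more informative here than raw accuracy, which a model could inflate by predicting "not inducted" for everyone.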
Batting PCA
- As shown in the figure below, 5 components accounted for 90% of the variance in batting statistics.
Pitching PCA
- As shown in the figure below, 4 components accounted for 90% of the variance in pitching statistics.
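The component counts above come from cumulative explained-variance curves; a sketch of how such a count is derived, using synthetic correlated data as a stand-in for the real stat matrices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))            # stand-in for a stat matrix
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(300, 5))  # correlated columns

# Fit PCA on standardized features, then find the smallest number of
# components whose cumulative explained variance reaches 90%.
pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumvar >= 0.90)) + 1
print(f"{n_components} components explain "
      f"{cumvar[n_components - 1]:.0%} of the variance")
```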
Batting Random Forest Results
Batting XGBoost Results
Batting Naive Bayes Results
Pitching Random Forest Results
Pitching XGBoost Results
Pitching Naive Bayes Results
Conclusion
Our project showed that career statistics can be used to build models that predict Hall of Fame induction with a high degree of accuracy.
While the models capture many important patterns, they do not fully reflect the subjective nature of the voting process. Factors like awards, reputation, or historical context are not represented in the data but clearly influence outcomes. A possible next step would be to integrate some of these advanced metrics or account for era adjustments to improve accuracy.
Overall, our project demonstrates that data analytics can provide useful insights into what drives Hall of Fame voting and sets the stage for more detailed models in the future.