Predicting MLB Hall of Fame Inductions
INFO 523 - Final Project
Abstract
Major League Baseball’s Hall of Fame selection process remains one of the most prestigious yet subjective honors in professional sports.
In this project, we aim to develop a machine learning model capable of predicting whether a player will be inducted into the Hall of Fame based on career-level statistics and achievements.
Utilizing the Lahman Baseball Database, which includes detailed player records spanning over a century, we will perform data cleaning, feature engineering, and model training using logistic regression and decision trees.
Our goal is not only to achieve high prediction accuracy, but also to identify which career metrics most significantly influence Hall of Fame voting outcomes.
Project Write Up
Data Preparation
We utilized the Lahman Baseball Database, a publicly available dataset containing a wide array of career statistics for MLB players dating all the way back to the late 1800s.
The three primary categories of statistics we used were:
Batting Statistics: includes hits, home runs, batting average, RBIs, and other offensive statistics.
Pitching Statistics: includes earned run average (ERA), strikeouts, wins, saves, and related metrics.
Fielding Statistics: includes putouts, assists, errors, and fielding percentage.
Our goal was to create two separate models: one for position players (batters) and one for pitchers. To do this, we needed to build separate datasets containing the statistics most relevant to each of the two categories of players.
Our target variable for the project was the column “Inducted”, which indicates whether a player was elected to the Hall of Fame. This is a binary variable: 1 for elected, 0 for not elected.
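As an illustration of how this target can be derived from the raw ballot records, here is a minimal sketch using a toy stand-in for the Lahman `HallOfFame` table (the real table uses the same `playerID`, `inducted`, and `category` columns; the player IDs below are hypothetical):

```python
import pandas as pd

# Toy stand-in for the Lahman HallOfFame table: one row per ballot year.
hof = pd.DataFrame({
    "playerID": ["ruthba01", "ruthba01", "doejo01"],
    "inducted": ["N", "Y", "N"],
    "category": ["Player", "Player", "Player"],
})

# A player counts as inducted if ANY ballot-year row has inducted == "Y";
# restrict to players (the table also lists managers, umpires, etc.).
inducted = (
    hof[hof["category"] == "Player"]
    .groupby("playerID")["inducted"]
    .apply(lambda s: int((s == "Y").any()))
    .rename("Inducted")
    .reset_index()
)
print(inducted)  # ruthba01 -> 1, doejo01 -> 0
```

The resulting table can then be merged onto the career-level batting or pitching features by `playerID`.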
Exploratory Data Analysis (EDA)
Before training, we analyzed the data and made adjustments to better suit modeling. We addressed the following areas during the EDA portion of our project.
- Distribution
- Visualized the distributions of relevant metrics and identified the adjustments needed to make the data better suited for modeling.
- Correlations
- In order to engineer better features for modeling, our team created correlation matrices to analyze relationships between data features.
- Separation
- We split the datasets and models between position players and pitchers. Key pitching metrics include ERA, wins, and strikeouts. Position players, which include all fielding positions as well as the designated hitter (DH), were evaluated in terms of batting average, career home runs, and career hits.
- Feature Engineering
- While the Lahman dataset is extremely robust, it does not include several derived metrics that are commonly reported by sports broadcasters and journalists.
- For position players we created batting average; slugging percentage, which represents the total number of bases a player achieves per at bat; and on-base percentage, which is similar to batting average but accounts for walks as well.
- For pitchers we created WHIP, which is equal to walks plus hits per inning pitched, and K/BB, which is the ratio of strikeouts to walks issued by a pitcher.
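These derived metrics follow standard formulas; a minimal sketch with hypothetical career totals (column names mirror the Lahman `Batting`/`Pitching` tables, where innings pitched are stored as outs recorded in `IPouts`):

```python
import pandas as pd

# Hypothetical career totals in Lahman-style columns.
bat = pd.DataFrame({"H": [200], "AB": [600], "2B": [40], "3B": [5],
                    "HR": [30], "BB": [70], "HBP": [5], "SF": [5]})

# Singles are not stored directly; derive them from total hits.
singles = bat["H"] - bat["2B"] - bat["3B"] - bat["HR"]
bat["AVG"] = bat["H"] / bat["AB"]
bat["SLG"] = (singles + 2 * bat["2B"] + 3 * bat["3B"] + 4 * bat["HR"]) / bat["AB"]
bat["OBP"] = (bat["H"] + bat["BB"] + bat["HBP"]) / (
    bat["AB"] + bat["BB"] + bat["HBP"] + bat["SF"])

pit = pd.DataFrame({"H": [180], "BB": [60], "SO": [210], "IPouts": [600]})
ip = pit["IPouts"] / 3                      # Lahman stores innings as outs
pit["WHIP"] = (pit["BB"] + pit["H"]) / ip   # (60 + 180) / 200 = 1.20
pit["K_BB"] = pit["SO"] / pit["BB"]         # 210 / 60 = 3.50

print(bat[["AVG", "SLG", "OBP"]].round(3))
print(pit[["WHIP", "K_BB"]].round(2))
```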
Modeling Approach
To predict Hall of Fame inductions, we tested the following supervised learning models:
- RandomForestClassifier
- An ensemble of decision trees that improves accuracy and helps identify which stats matter most.
- XGBClassifier
- A gradient-boosted tree model that handles complex, nonlinear relationships in the data.
- Gaussian Naive Bayes (GaussianNB)
- A simple probabilistic model used here as a baseline for comparison.
- RandomizedSearchCV
- Applied to tune hyperparameters for Random Forest and XGBoost. This method tests a random set of parameter combinations, which is faster and more efficient than a full grid search.
Before modeling, the datasets were split into training and test sets. We also experimented with scaling the features and applying PCA to improve performance. As discussed below, results varied across models.
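The split/scale/PCA/tuning steps described above can be sketched as follows, using synthetic data as a stand-in for the real feature matrices (the parameter grid and component count here are illustrative, not the values we actually tuned):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in: most players are not inducted.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Scale -> PCA -> Random Forest, tuned with a randomized search over a
# small (illustrative) hyperparameter space.
pipe = make_pipeline(StandardScaler(), PCA(n_components=5),
                     RandomForestClassifier(random_state=0))
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "randomforestclassifier__n_estimators": [100, 200, 400],
        "randomforestclassifier__max_depth": [None, 5, 10],
    },
    n_iter=5, cv=3, random_state=0)
search.fit(X_train, y_train)
print(f"test accuracy: {search.score(X_test, y_test):.3f}")
```

The same pipeline shape was reused for XGBoost by swapping the final estimator.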
Modeling Evaluation
We evaluated model performance using several classification metrics:
- Accuracy
- Overall percentage of correct predictions.
- Precision and Recall
- Useful for checking how well the models avoid false positives and false negatives.
- F1 Score
- Balances precision and recall into a single measure.
- Model Accuracy
- The table below shows the accuracy of each model in predicting Hall of Fame induction for all eligible MLB players.
| Model | Batting Accuracy | Pitching Accuracy |
|---|---|---|
| Random Forest | 0.92 | 0.91 |
| XGBoost | 0.92 | 0.91 |
| Naive Bayes | 0.68 | 0.84 |
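The metrics listed above are all available in scikit-learn; a small illustrative example with hypothetical induction predictions (not our actual model outputs):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth (1 = inducted) vs. model predictions.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("precision:", precision_score(y_true, y_pred))  # 2/3
print("recall   :", recall_score(y_true, y_pred))     # 2/3
print("f1       :", f1_score(y_true, y_pred))
```

Because induction is rare, precision and recall on the positive class are more informative here than raw accuracy, which a model could inflate by predicting "not inducted" for everyone.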
Batting PCA
- As shown in the figure below, 5 components accounted for 90% of the variance in batting statistics.
Pitching PCA
- As shown in the figure below, 4 components accounted for 90% of the variance in pitching statistics.
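The component counts above come from cumulative explained-variance curves; a sketch of how such a count is derived, using synthetic correlated data as a stand-in for the real stat matrices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))            # stand-in for a stat matrix
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(300, 5))  # correlated columns

# Fit PCA on standardized features, then find the smallest number of
# components whose cumulative explained variance reaches 90%.
pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumvar >= 0.90)) + 1
print(f"{n_components} components explain "
      f"{cumvar[n_components - 1]:.0%} of the variance")
```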
Batting Random Forest Results
Batting XGBoost Results
Batting Naive Bayes Results
Pitching Random Forest Results
Pitching XGBoost Results
Pitching Naive Bayes Results
Conclusion
Our project showed that career statistics can be used to build models that predict Hall of Fame induction with a high degree of accuracy.
While the models capture many important patterns, they do not fully reflect the subjective nature of the voting process. Factors like awards, reputation, or historical context are not represented in the data but clearly influence outcomes. A possible next step would be to integrate some of these advanced metrics or account for era adjustments to improve accuracy.
Overall, our project demonstrates that data analytics can provide useful insights into what drives Hall of Fame voting and sets the stage for more detailed models in the future.