Predicting MLB Hall of Fame Inductions

Proposal

Building a machine learning model to predict whether a Major League Baseball player will be inducted into the Hall of Fame using career statistics and achievements.

Author: Team Fastball - Austin Cortopassi, David Pelley, Nathan Harville

Affiliation: College of Information Science, University of Arizona

import numpy as np
import pandas as pd

Data Example

people = pd.read_csv("data/People.csv")
people.head(5)

	ID	playerID	birthYear	birthMonth	birthDay	birthCity	birthCountry	birthState	deathYear	deathMonth	...	nameLast	nameGiven	weight	height	bats	throws	debut	bbrefID	finalGame	retroID
0	1	aardsda01	1981.0	12.0	27.0	Denver	USA	CO	NaN	NaN	...	Aardsma	David Allan	215.0	75.0	R	R	2004-04-06	aardsda01	2015-08-23	aardd001
1	2	aaronha01	1934.0	2.0	5.0	Mobile	USA	AL	2021.0	1.0	...	Aaron	Henry Louis	180.0	72.0	R	R	1954-04-13	aaronha01	1976-10-03	aaroh101
2	3	aaronto01	1939.0	8.0	5.0	Mobile	USA	AL	1984.0	8.0	...	Aaron	Tommie Lee	190.0	75.0	R	R	1962-04-10	aaronto01	1971-09-26	aarot101
3	4	aasedo01	1954.0	9.0	8.0	Orange	USA	CA	NaN	NaN	...	Aase	Donald William	190.0	75.0	R	R	1977-07-26	aasedo01	1990-10-03	aased001
4	5	abadan01	1972.0	8.0	25.0	Palm Beach	USA	FL	NaN	NaN	...	Abad	Fausto Andres	184.0	73.0	L	L	2001-09-10	abadan01	2006-04-13	abada001

5 rows × 25 columns

We are using the Lahman Baseball Database because it contains rich historical MLB data spanning from the 1800s to the present day.

Source: Society For American Baseball Research - Lahman Baseball Dataset

Our data covers player career statistics and achievements across the following tables:

People: Player biographical data
Batting, Pitching, Fielding: Regular season performance
BattingPost, PitchingPost, FieldingPost: Postseason performance
Teams: Team performance stats
HallOfFame: Hall of Fame voting results

Each player-level table includes a playerID column, allowing us to merge across data sources. The Teams table has no playerID; it joins to the player tables on teamID and yearID.
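As a minimal sketch of the playerID join, the example below attaches a binary induction label to a player table. The two small DataFrames are made-up stand-ins for People and HallOfFame (not real records), and the sketch assumes the HallOfFame table stores one row per ballot appearance with an `inducted` column holding "Y" or "N".

```python
import pandas as pd

# Toy stand-ins for People and HallOfFame; the IDs and names are illustrative only.
people = pd.DataFrame({
    "playerID": ["p01", "p02", "p03"],
    "nameLast": ["Alpha", "Bravo", "Charlie"],
})
hall_of_fame = pd.DataFrame({
    "playerID": ["p01", "p01", "p02"],
    "yearID": [1990, 1991, 1995],
    "inducted": ["N", "Y", "N"],  # assumed Y/N encoding
})

# Collapse the voting history to one row per player: 1 if ever inducted, else 0.
labels = (
    hall_of_fame.assign(inducted=hall_of_fame["inducted"].eq("Y").astype(int))
    .groupby("playerID", as_index=False)["inducted"]
    .max()
)

# A left join keeps players who never appeared on a ballot; they receive label 0.
merged = people.merge(labels, on="playerID", how="left", validate="one_to_one")
merged["inducted"] = merged["inducted"].fillna(0).astype(int)

print(merged)
```

Passing `validate="one_to_one"` makes the merge raise an error if either side contains duplicate playerID values, which guards against silently multiplying rows.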

Questions

Can we predict whether a player will be inducted into the MLB Hall of Fame based on their career statistics and achievements?
What features (e.g., batting stats, pitching metrics, postseason performance) are most predictive of a Hall of Fame induction?

Analysis Plan

Data Preparation
Merge datasets using playerID. Aggregate season-level stats into career totals. Handle missing values and perform EDA.
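The aggregation step could look roughly like the sketch below, which sums season rows into one career row per player. The toy `batting` frame and its values are invented for illustration; the column names (`yearID`, `AB`, `H`, `HR`) are assumed to match the Batting table.

```python
import pandas as pd

# Toy season-level batting rows; a player can have several rows in one year
# (for example after a mid-season trade), so seasons are counted with nunique.
batting = pd.DataFrame({
    "playerID": ["p01", "p01", "p01", "p02"],
    "yearID": [2001, 2002, 2002, 2001],
    "AB": [500, 200, 250, 300],
    "H": [150, 50, 70, 75],
    "HR": [30, 8, 12, 5],
})

# One row per player: counting stats are summed, seasons are counted once per year.
career = (
    batting.groupby("playerID")
    .agg(
        seasons=("yearID", "nunique"),
        AB=("AB", "sum"),
        H=("H", "sum"),
        HR=("HR", "sum"),
    )
    .reset_index()
)

print(career)
```

Rate statistics such as batting average should be recomputed from the summed counts rather than averaged across seasons, since a simple mean would weight a 50 at-bat season the same as a 600 at-bat season.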

Feature Engineering
- Austin: Aggregate batting and batting_post, calculate derived metrics like AVG, OPS, HR/year.
- David: Aggregate pitching and pitching_post, derive metrics such as WHIP, K/BB, quality starts.
- Nate: Aggregate fielding and fielding_post, derive metrics like position flexibility, total chances, defensive WAR.

Each team member owns one area of the feature work: Austin will work on batting, David on pitching, and Nate on fielding. The team will then build a model that evaluates these features against historic inductees to predict future inductees. If needed, the team will adjust how positional statistics are grouped, for example by combining batting and fielding features, or by focusing solely on starting pitchers if relief pitchers add too much noise to the data.
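A few of the derived metrics named above can be computed from career totals using their standard definitions, as in the sketch below. The helper names (`safe_divide`, `add_batting_rates`, `add_pitching_rates`) and the toy numbers are hypothetical; the column names (`2B`, `3B`, `HBP`, `SF`, `IPouts`, `SO`) are assumed to follow the Batting and Pitching tables.

```python
import pandas as pd


def safe_divide(num, den):
    """Element-wise division that yields NaN where the denominator is zero."""
    return num / den.where(den != 0)


def add_batting_rates(career):
    """Add AVG, OBP, SLG, OPS, and HR per season to career batting totals."""
    out = career.copy()
    out["AVG"] = safe_divide(out["H"], out["AB"])
    obp_den = out["AB"] + out["BB"] + out["HBP"] + out["SF"]
    out["OBP"] = safe_divide(out["H"] + out["BB"] + out["HBP"], obp_den)
    # H already counts every hit once, so add the extra bases for 2B, 3B, HR.
    total_bases = out["H"] + out["2B"] + 2 * out["3B"] + 3 * out["HR"]
    out["SLG"] = safe_divide(total_bases, out["AB"])
    out["OPS"] = out["OBP"] + out["SLG"]
    out["HR_per_year"] = safe_divide(out["HR"], out["seasons"])
    return out


def add_pitching_rates(career):
    """Add WHIP and K/BB to career pitching totals."""
    out = career.copy()
    innings = out["IPouts"] / 3  # IPouts counts outs recorded; 3 outs per inning
    out["WHIP"] = safe_divide(out["BB"] + out["H"], innings)
    out["K_per_BB"] = safe_divide(out["SO"], out["BB"])
    return out


# Toy career totals (illustrative values only); the second batter has no at-bats.
batters = pd.DataFrame({
    "playerID": ["p01", "p02"], "seasons": [2, 1],
    "AB": [1000, 0], "H": [300, 0], "2B": [50, 0], "3B": [10, 0],
    "HR": [40, 0], "BB": [100, 0], "HBP": [0, 0], "SF": [0, 0],
})
pitchers = pd.DataFrame({
    "playerID": ["p03"], "IPouts": [600], "H": [150], "BB": [50], "SO": [200],
})

bat = add_batting_rates(batters)
pit = add_pitching_rates(pitchers)
print(bat[["playerID", "AVG", "OPS", "HR_per_year"]])
print(pit[["playerID", "WHIP", "K_per_BB"]])
```

Leaving undefined rates as NaN, rather than filling them with zero, keeps a pitcher with no at-bats from looking like a .000 hitter; how those gaps are handled is then an explicit modeling decision.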

Modeling & Evaluation
- Train baseline models: logistic regression, decision tree
- Tune hyperparameters and evaluate using metrics like accuracy, precision, recall, F1-score, AUC
- Analyze feature importances and model performance
- Address issues with model performance and adjust
- Train different models like random forest and evaluate their performance against our baseline model
- Incorporate new features and evaluate the model performance against the baseline model
- Look into the synthetic minority oversampling technique (SMOTE) to address class imbalance issues
- Tune features based on model performance: standardize, normalize, or drop altogether
- Tools: Python, pandas, scikit-learn, matplotlib, seaborn, Jupyter/Quarto
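A baseline comparison along these lines might be sketched as follows. The data here is synthetic (generated with `make_classification`, roughly 5% positives) and only stands in for the real career-level feature table. Note that this sketch uses scikit-learn's `class_weight="balanced"` option as a simple way to handle class imbalance; it is not SMOTE, which lives in the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real features and induction labels.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)

# Stratify so the rare positive class appears in both splits at the same rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Scaling lives inside the pipeline so it is fit on training data only.
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "decision_tree": DecisionTreeClassifier(
        max_depth=5, class_weight="balanced", random_state=0
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    score = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
        "auc": roc_auc_score(y_test, score),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With so few inductees relative to all players, accuracy alone can look high for a model that never predicts induction, which is why precision, recall, F1, and AUC are reported here.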

Weekly Planning

Week 1:
- Load and clean all 9 datasets
- Merge into a unified DataFrame
- Initial EDA
- Aggregate season stats to career level
- Engineer new features
- Merge with Hall of Fame labels

Week 2:
- Train and tune models (logistic regression, decision tree)
- Evaluate performance
- Incorporate new features as needed
- Adjust cleaning and standardization as needed
- Finalize features/models that show the best performance

Week 3:
- Final model assessment
- Data visualizations and feature analysis
- Write final report and create presentation

Repository Organization

.github: Issue templates and workflows
_extra: Extra tools as needed
_freeze: Site execution packages
_data: Source data and metadata
_src: Source code
_images: Storage for visualizations for paper and presentation