Predicting MLB Hall of Fame Inductions
Proposal
Data Example
 | ID | playerID | birthYear | birthMonth | birthDay | birthCity | birthCountry | birthState | deathYear | deathMonth | ... | nameLast | nameGiven | weight | height | bats | throws | debut | bbrefID | finalGame | retroID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | aardsda01 | 1981.0 | 12.0 | 27.0 | Denver | USA | CO | NaN | NaN | ... | Aardsma | David Allan | 215.0 | 75.0 | R | R | 2004-04-06 | aardsda01 | 2015-08-23 | aardd001 |
1 | 2 | aaronha01 | 1934.0 | 2.0 | 5.0 | Mobile | USA | AL | 2021.0 | 1.0 | ... | Aaron | Henry Louis | 180.0 | 72.0 | R | R | 1954-04-13 | aaronha01 | 1976-10-03 | aaroh101 |
2 | 3 | aaronto01 | 1939.0 | 8.0 | 5.0 | Mobile | USA | AL | 1984.0 | 8.0 | ... | Aaron | Tommie Lee | 190.0 | 75.0 | R | R | 1962-04-10 | aaronto01 | 1971-09-26 | aarot101 |
3 | 4 | aasedo01 | 1954.0 | 9.0 | 8.0 | Orange | USA | CA | NaN | NaN | ... | Aase | Donald William | 190.0 | 75.0 | R | R | 1977-07-26 | aasedo01 | 1990-10-03 | aased001 |
4 | 5 | abadan01 | 1972.0 | 8.0 | 25.0 | Palm Beach | USA | FL | NaN | NaN | ... | Abad | Fausto Andres | 184.0 | 73.0 | L | L | 2001-09-10 | abadan01 | 2006-04-13 | abada001 |
5 rows × 25 columns
We are using the Lahman Baseball Database because it contains rich historical MLB data spanning from the 1800s to the present day.
Source: Society for American Baseball Research - Lahman Baseball Dataset
Our data includes player career statistics and achievements across the following tables:
- People: Player biographical data
- Batting, Pitching, Fielding: Regular season performance
- BattingPost, PitchingPost, FieldingPost: Postseason performance
- Teams: Team performance stats
- HallOfFame: Hall of Fame voting results
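Each player-level table includes a playerID column, allowing us to merge across data sources. The Teams table is the exception: it is keyed by team and season rather than by player.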
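As a minimal sketch of the playerID join, the snippet below attaches a Hall of Fame label to player records. The toy rows and hypothetical player IDs are made up for illustration, and we assume the HallOfFame table marks induction with an `inducted` column holding "Y" or "N", with one row per ballot appearance.

```python
import pandas as pd

# Toy stand-ins for the People and HallOfFame tables (hypothetical IDs and values).
people = pd.DataFrame({
    "playerID": ["player01", "player02", "player03"],
    "nameLast": ["Alpha", "Bravo", "Charlie"],
})
hall_of_fame = pd.DataFrame({
    "playerID": ["player01", "player01", "player02"],
    "yearID": [1990, 1991, 1995],
    "inducted": ["N", "Y", "N"],  # assumed encoding: "Y" means inducted on that ballot
})

# Collapse multiple ballot rows into one label per player:
# a player counts as inducted if any ballot row says "Y".
labels = (
    hall_of_fame.assign(inducted=hall_of_fame["inducted"].eq("Y"))
    .groupby("playerID", as_index=False)["inducted"]
    .max()
)

# Left join keeps every player; those never on a ballot get NaN, which we map to False.
merged = people.merge(labels, on="playerID", how="left", validate="one_to_one")
merged["inducted"] = merged["inducted"].eq(True)

print(merged)
```

Collapsing to one row per player before merging keeps the join one-to-one, so players who appeared on several ballots are not duplicated.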
Questions
Can we predict whether a player will be inducted into the MLB Hall of Fame based on their career statistics and achievements?
What features (e.g., batting stats, pitching metrics, postseason performance) are most predictive of a Hall of Fame induction?
Analysis Plan
Data Preparation
Merge datasets using playerID. Aggregate season-level stats into career totals. Handle missing values and perform EDA.
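The aggregation step could look roughly like the sketch below, which rolls season-level batting rows up to one career row per player. The toy data is invented, and treating a missing counting stat as zero is an assumption we would revisit during EDA, since some statistics were simply not recorded in early seasons.

```python
import pandas as pd

# Toy season-level batting rows (hypothetical players and values).
batting = pd.DataFrame({
    "playerID": ["p1", "p1", "p2"],
    "yearID": [2001, 2002, 2001],
    "AB": [500, 450, 300],
    "H": [150, 120, 60],
    "HR": [30, 20, None],  # missing value to illustrate the fill step
})

count_cols = ["AB", "H", "HR"]

career = (
    batting.fillna({col: 0 for col in count_cols})  # assumption: missing count == 0
    .groupby("playerID")
    .agg(
        seasons=("yearID", "nunique"),
        AB=("AB", "sum"),
        H=("H", "sum"),
        HR=("HR", "sum"),
    )
    .reset_index()
)

print(career)
```

Rate statistics such as AVG should be computed from these career sums rather than by averaging season rates, so that short seasons do not carry the same weight as full ones.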
Feature Engineering
- Austin: Aggregate batting and batting_post, calculate derived metrics like AVG, OPS, HR/year.
- David: Aggregate pitching and pitching_post, derive metrics such as WHIP, K/BB, quality starts.
- Nate: Aggregate fielding and fielding_post, derive metrics like position flexibility, total chances, defensive WAR.
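Several of the derived metrics above follow standard formulas and can be sketched as plain functions over career totals. The function names here are hypothetical, and the inputs assume Lahman-style counting columns (for example `2B`, `3B`, and `IPouts`, where innings pitched is `IPouts / 3`). Metrics such as quality starts and defensive WAR are not covered, since they need game-level or external data.

```python
import math


def safe_div(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else math.nan


def batting_avg(h, ab):
    # AVG = hits / at-bats
    return safe_div(h, ab)


def obp(h, bb, hbp, ab, sf):
    # OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
    return safe_div(h + bb + hbp, ab + bb + hbp + sf)


def slg(h, doubles, triples, hr, ab):
    # Total bases = singles + 2*2B + 3*3B + 4*HR, which simplifies to
    # H + 2B + 2*3B + 3*HR because H already counts every hit once.
    total_bases = h + doubles + 2 * triples + 3 * hr
    return safe_div(total_bases, ab)


def ops(h, bb, hbp, ab, sf, doubles, triples, hr):
    # OPS = OBP + SLG
    return obp(h, bb, hbp, ab, sf) + slg(h, doubles, triples, hr, ab)


def whip(bb, h, ipouts):
    # WHIP = (walks + hits allowed) / innings pitched
    return safe_div(bb + h, ipouts / 3)


def k_per_bb(so, bb):
    # K/BB = strikeouts / walks
    return safe_div(so, bb)


# Hypothetical totals, for illustration only.
print(batting_avg(h=150, ab=500))
print(ops(h=150, bb=50, hbp=5, ab=500, sf=5, doubles=30, triples=5, hr=20))
print(whip(bb=50, h=180, ipouts=600))
print(k_per_bb(so=200, bb=50))
```

Returning NaN for a zero denominator keeps players with no at-bats or no innings pitched in the dataset without raising errors; how to impute or filter those rows is left as a modeling decision.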
Each team member is responsible for the feature engineering in one area: Austin will work on batting, David on pitching, and Nate on fielding. The team will then build a model that evaluates these features against historic inductees in order to predict future inductees. If needed, the team will adjust how players are grouped, for example by combining batting and fielding statistics for position players, or by focusing solely on starting pitchers if relief pitchers add too much noise to the data.
Modeling & Evaluation
- Train baseline models: logistic regression, decision tree
- Tune hyperparameters and evaluate using metrics like accuracy, precision, recall, F1-score, AUC
- Analyze feature importances and model performance
- Address issues with model performance and adjust.
- Train different models, such as random forest, and evaluate their performance against our baseline model.
- Incorporate new features and evaluate the model performance against the baseline model.
- Look into the synthetic minority oversampling technique (SMOTE) to address class imbalance.
- Tune features based on model performance: standardize, normalize, or drop them altogether.
- Tools: Python, pandas, scikit-learn, matplotlib, seaborn, Jupyter/Quarto
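The baseline modeling and evaluation steps could be sketched as follows. This uses synthetic data from scikit-learn's `make_classification` in place of the real career features, with a roughly 5% positive class as an assumed stand-in for how rare inductees are. It handles imbalance with `class_weight="balanced"` rather than SMOTE, which lives in the separate imbalanced-learn package; the hyperparameters are placeholders, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the career-level feature matrix.
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.95, 0.05],  # assumed imbalance, for illustration only
    random_state=0,
)

# Stratify so the rare positive class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "decision_tree": DecisionTreeClassifier(
        max_depth=4, class_weight="balanced", random_state=0
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
        "auc": roc_auc_score(y_test, proba),
    }
    print(name, results[name])
```

With a positive class this rare, accuracy alone is misleading (predicting "not inducted" for everyone already scores about 95%), which is why the sketch reports precision, recall, F1, and AUC instead.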
Weekly Planning
Week 1:
- Load and clean all 9 datasets
- Merge into a unified DataFrame
- Initial EDA
- Aggregate season stats to career-level
- Engineer new features
- Merge with Hall of Fame labels
Week 2:
- Train and tune models (logistic regression, decision tree)
- Evaluate performance
- Incorporate new features as needed
- Adjust cleaning and standardization as needed
- Finalize features/models that show the best performance
Week 3:
- Final model assessment
- Data visualizations and feature analysis
- Write final report and create presentation
Repository Organization
- .github: Issue templates and workflows
- _extra: Extra tools as needed
- _freeze: Site execution packages
- _data: Source data and metadata
- _src: Source code
- _images: Storage for visualizations for paper and presentation