Predicting MLB Hall of Fame Inductions

Proposal

Building a machine learning model to predict whether a Major League Baseball player will be inducted into the Hall of Fame using career statistics and achievements.

Author: Team Fastball (Austin Cortopassi, David Pelley, Nathan Harville)

Affiliation: College of Information Science, University of Arizona

import numpy as np
import pandas as pd

Data Example

people = pd.read_csv("data/People.csv")
people.head(5)
ID playerID birthYear birthMonth birthDay birthCity birthCountry birthState deathYear deathMonth ... nameLast nameGiven weight height bats throws debut bbrefID finalGame retroID
0 1 aardsda01 1981.0 12.0 27.0 Denver USA CO NaN NaN ... Aardsma David Allan 215.0 75.0 R R 2004-04-06 aardsda01 2015-08-23 aardd001
1 2 aaronha01 1934.0 2.0 5.0 Mobile USA AL 2021.0 1.0 ... Aaron Henry Louis 180.0 72.0 R R 1954-04-13 aaronha01 1976-10-03 aaroh101
2 3 aaronto01 1939.0 8.0 5.0 Mobile USA AL 1984.0 8.0 ... Aaron Tommie Lee 190.0 75.0 R R 1962-04-10 aaronto01 1971-09-26 aarot101
3 4 aasedo01 1954.0 9.0 8.0 Orange USA CA NaN NaN ... Aase Donald William 190.0 75.0 R R 1977-07-26 aasedo01 1990-10-03 aased001
4 5 abadan01 1972.0 8.0 25.0 Palm Beach USA FL NaN NaN ... Abad Fausto Andres 184.0 73.0 L L 2001-09-10 abadan01 2006-04-13 abada001

5 rows × 25 columns

We are using the Lahman Baseball Database because it contains rich historical MLB data spanning from the 1800s to the present day.

Source: Society for American Baseball Research - Lahman Baseball Dataset

Our data covers players' career statistics and achievements across the following tables:

  • People: Player biographical data
  • Batting, Pitching, Fielding: Regular season performance
  • BattingPost, PitchingPost, FieldingPost: Postseason performance
  • Teams: Team performance stats
  • HallOfFame: Hall of Fame voting results

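Every player-level table includes a playerID column, allowing us to merge across data sources. Teams is the exception: it is keyed by yearID and teamID, so it joins through the season and team fields of the player tables.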
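As a minimal sketch of the playerID join, the example below merges two toy frames that stand in for People.csv and Batting.csv. The rows are illustrative placeholders, not values read from the dataset, and the choice of a left join with a `validate` check is our assumption about how the merge might be set up.

```python
import pandas as pd

# Toy stand-ins for People.csv and Batting.csv (values are illustrative only).
people = pd.DataFrame({
    "playerID": ["aaronha01", "aardsda01"],
    "nameLast": ["Aaron", "Aardsma"],
})
batting = pd.DataFrame({
    "playerID": ["aaronha01", "aaronha01", "aardsda01"],
    "yearID": [1954, 1955, 2004],
    "AB": [468, 602, 0],
    "H": [131, 189, 0],
})

# A left join keeps every batting row and attaches biographical fields.
# validate="many_to_one" raises an error if playerID is duplicated in `people`.
merged = batting.merge(people, on="playerID", how="left", validate="many_to_one")
print(merged[["playerID", "nameLast", "yearID", "AB", "H"]])
```

The `validate` argument is a cheap guard against accidental row duplication, which is easy to miss when several tables are merged in sequence.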

Questions

Can we predict whether a player will be inducted into the MLB Hall of Fame based on their career statistics and achievements?
What features (e.g., batting stats, pitching metrics, postseason performance) are most predictive of a Hall of Fame induction?

Analysis Plan

Data Preparation
Merge datasets using playerID. Aggregate season-level stats into career totals. Handle missing values and perform EDA.
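The season-to-career aggregation could look like the sketch below. The player IDs and numbers are made up, and the set of counting columns is an assumption; the real Batting table has many more columns.

```python
import pandas as pd

# Hypothetical season-level rows: one row per player per season.
season = pd.DataFrame({
    "playerID": ["p1", "p1", "p2"],
    "yearID": [2001, 2002, 2001],
    "AB": [500, 450, 300],
    "H": [150, 120, 75],
    "HR": [30, 20, 5],
})

counting_cols = ["AB", "H", "HR"]  # assumed subset of counting stats

# Sum counting stats per player and count distinct seasons played.
# pandas' sum skips NaN by default, so missing season values do not
# turn a career total into NaN.
career = (
    season.groupby("playerID")
    .agg(
        **{col: (col, "sum") for col in counting_cols},
        seasons=("yearID", "nunique"),
    )
    .reset_index()
)
print(career)
```

Rate statistics such as AVG should be recomputed from the career totals rather than averaged across seasons, since a plain mean would weight short seasons the same as full ones.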

Feature Engineering
- Austin: Aggregate Batting and BattingPost, calculate derived metrics like AVG, OPS, HR/year.
- David: Aggregate Pitching and PitchingPost, derive metrics such as WHIP, K/BB, quality starts.
- Nate: Aggregate Fielding and FieldingPost, derive metrics like position flexibility, total chances, defensive WAR.
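A few of the derived metrics above can be sketched from career totals as follows. The column names (2B, 3B, HBP, SF, IPouts, SO) follow the Lahman naming as we understand it, the `seasons` column and the helper names are hypothetical, and the input rows are invented. Quality starts and defensive WAR are omitted because they appear to need game-level or external data that this sketch does not assume.

```python
import numpy as np
import pandas as pd

def safe_div(num, den):
    """Divide two Series, returning NaN where the denominator is zero."""
    return num / den.replace(0, np.nan)

def add_batting_rates(df):
    out = df.copy()
    total_bases = out["H"] + out["2B"] + 2 * out["3B"] + 3 * out["HR"]
    out["AVG"] = safe_div(out["H"], out["AB"])
    obp = safe_div(
        out["H"] + out["BB"] + out["HBP"],
        out["AB"] + out["BB"] + out["HBP"] + out["SF"],
    )
    slg = safe_div(total_bases, out["AB"])
    out["OPS"] = obp + slg
    out["HR_per_year"] = safe_div(out["HR"], out["seasons"])
    return out

def add_pitching_rates(df):
    out = df.copy()
    innings = out["IPouts"] / 3  # IPouts counts outs recorded, three per inning
    out["WHIP"] = safe_div(out["BB"] + out["H"], innings)
    out["K_per_BB"] = safe_div(out["SO"], out["BB"])
    return out

# Invented career totals; the second batter has no at-bats.
bat = add_batting_rates(pd.DataFrame({
    "AB": [1000, 0], "H": [300, 0], "2B": [50, 0], "3B": [10, 0],
    "HR": [40, 0], "BB": [100, 0], "HBP": [5, 0], "SF": [5, 0],
    "seasons": [2, 1],
}))
pit = add_pitching_rates(pd.DataFrame({
    "IPouts": [600], "BB": [50], "H": [150], "SO": [200],
}))
print(bat[["AVG", "OPS", "HR_per_year"]])
print(pit[["WHIP", "K_per_BB"]])
```

Returning NaN for zero denominators keeps pitchers with no at-bats, or batters who never pitched, from producing infinite values that would distort later scaling.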

Each team member is responsible for one area of play: Austin covers batting, David covers pitching, and Nate covers fielding. The team will then build a model that evaluates these features against historical inductees in order to predict future inductees. If needed, the team will adjust how positions are handled, for example by combining batting and fielding statistics for position players, or by focusing solely on starting pitchers if relief pitchers add too much noise to the data.

Modeling & Evaluation
- Train baseline models: logistic regression, decision tree
- Tune hyperparameters and evaluate using metrics like accuracy, precision, recall, F1-score, AUC
- Analyze feature importances and model performance
- Address issues with model performance and adjust:
  - Train different models, such as random forest, and evaluate their performance against our baseline model.
  - Incorporate new features and evaluate the model's performance against the baseline model.
  - Look into the synthetic minority oversampling technique (SMOTE) to address class imbalance.
  - Adjust features based on model performance: standardize, normalize, or drop them altogether.
- Tools: Python, pandas, scikit-learn, matplotlib, seaborn, Jupyter/Quarto
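The baseline comparison might be sketched as below. This is not the project's data: it uses a synthetic, imbalanced dataset from scikit-learn as a stand-in for the career features, and the hyperparameters are arbitrary. It also handles imbalance with `class_weight="balanced"` rather than SMOTE, since SMOTE lives in the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for career features: about 5% positives, like a rare label.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)

# Stratify so both splits keep the same share of positives.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "decision_tree": DecisionTreeClassifier(
        class_weight="balanced", max_depth=5, random_state=0
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
        "auc": roc_auc_score(y_test, proba),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Accuracy is left out of the printed metrics on purpose: with so few positives, a model that never predicts induction would still score high on accuracy, so precision, recall, F1, and AUC are more informative here.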

Weekly Planning

Week 1:
- Load and clean all 9 datasets
- Merge into a unified DataFrame
- Initial EDA
- Aggregate season stats to career-level
- Engineer new features
- Merge with Hall of Fame labels
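One way the Hall of Fame label could be attached is sketched below. It assumes the HallOfFame table has `inducted` ("Y"/"N") and `category` columns, as we understand the Lahman schema, and that only rows with category "Player" should count as positives. The frames and IDs are invented.

```python
import pandas as pd

# Invented voting rows: a player can appear on many ballots before induction.
hof = pd.DataFrame({
    "playerID": ["p1", "p1", "p2", "m1"],
    "yearID": [1980, 1981, 1990, 1995],
    "inducted": ["N", "Y", "N", "Y"],
    "category": ["Player", "Player", "Player", "Manager"],
})

# Invented career-level table; p3 never appeared on a ballot.
career = pd.DataFrame({
    "playerID": ["p1", "p2", "p3", "m1"],
    "HR": [400, 150, 10, 0],
})

# IDs with at least one successful ballot as a player.
inducted_players = hof.loc[
    (hof["inducted"] == "Y") & (hof["category"] == "Player"), "playerID"
].unique()

# Binary target: 1 if inducted as a player, else 0 (including never on a ballot).
career["hof"] = career["playerID"].isin(inducted_players).astype(int)
print(career)
```

Using `isin` instead of a merge avoids duplicating career rows for players who appear on several ballots.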

Week 2:
- Train and tune models (logistic regression, decision tree)
- Evaluate performance
- Incorporate new features as needed
- Adjust cleaning and standardization as needed
- Finalize features/models that show the best performance
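The tuning step could be sketched with a small grid search. The data is again synthetic, and the grid values, the F1 scoring choice, and the five stratified folds are assumptions rather than settings fixed by this proposal.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the career feature matrix.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Hypothetical search space for the decision tree baseline.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 20],
}

# Stratified folds keep the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1",
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy steers the search toward settings that actually find the rare positive class.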

Week 3:
- Final model assessment
- Data visualizations and feature analysis
- Write final report and create presentation

Repository Organization

  • .github: Issue templates and workflows
  • _extra: Extra tools as needed
  • _freeze: Site execution packages
  • _data: Source data and metadata
  • _src: Source code
  • _images: Storage for visualizations for paper and presentation