(index) | Unnamed: 0 | batter | player_name | game_date | stand | p_throws | pitch_type | effective_speed | pfx_x | pfx_z | plate_x | plate_z | zone | description | launch_speed | launch_angle | hc_x | hc_y | if_fielding_alignment | events |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | FF | 97.0 | -0.57 | 1.33 | 0.08 | 3.23 | 2.0 | swinging_strike | NaN | NaN | NaN | NaN | Standard | strikeout |
1 | 1 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | SI | 96.0 | -1.18 | 0.97 | -0.04 | 1.74 | 8.0 | foul | 88.8 | -43.0 | NaN | NaN | Standard | NaN |
2 | 2 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | SL | 89.8 | 0.36 | -0.11 | 1.72 | 0.27 | 14.0 | blocked_ball | NaN | NaN | NaN | NaN | Standard | NaN |
3 | 3 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | SL | 88.2 | 0.40 | -0.31 | 0.27 | 1.74 | 8.0 | foul | 47.7 | -38.0 | NaN | NaN | Standard | NaN |
4 | 4 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | SI | 96.4 | -1.20 | 0.93 | 0.16 | 1.02 | 14.0 | ball | NaN | NaN | NaN | NaN | Standard | NaN |
5 | 5 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | SI | 95.3 | -1.23 | 0.92 | -0.03 | 1.64 | 8.0 | foul | 92.6 | -37.0 | NaN | NaN | Standard | NaN |
6 | 6 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | FF | 97.1 | -0.86 | 0.94 | 0.38 | 1.22 | 14.0 | ball | NaN | NaN | NaN | NaN | Standard | NaN |
7 | 7 | 669911 | Ginkel, Kevin | 2024-03-31 | L | R | FF | 97.1 | -0.88 | 1.34 | 0.03 | 4.00 | 12.0 | swinging_strike | NaN | NaN | NaN | NaN | Strategic | strikeout |
8 | 8 | 669911 | Ginkel, Kevin | 2024-03-31 | L | R | SL | 88.7 | 0.35 | -0.20 | 0.66 | 1.45 | 14.0 | foul | 68.9 | -22.0 | NaN | NaN | Standard | NaN |
9 | 9 | 669911 | Ginkel, Kevin | 2024-03-31 | L | R | FF | 96.9 | -0.91 | 1.37 | -0.55 | 3.28 | 1.0 | foul | 78.6 | 31.0 | NaN | NaN | Standard | NaN |
Sweet Spotting: Predicting Pitcher Effectiveness from Pitch Location Zones
Proposal
Dataset
High-Level Goal
I plan to analyze pitch-level Statcast data to model the relationship between pitch location, swing decisions, and contact quality, using spatial visualizations and zone-based metrics for both pitchers and hitters.
Dataset Description
This project proposes to use a dataset containing Major League Baseball Statcast data from the 2024 season. The data was obtained using the Python library pybaseball via the statcast function. It includes 709,511 rows and 119 columns, with each row representing an individual pitch event. The dataset captures detailed information on both the pitcher and hitter involved in each event, as well as contextual variables such as pitch type, exit velocity, launch angle, batted ball outcome, and pitch location.
This dataset was chosen because it enables player-level analysis by aggregating outcomes and tendencies across thousands of in-game events. It supports the exploration of how different pitching and hitting profiles relate to performance, and provides the granularity needed to evaluate behavior across specific zones, counts, and contact types. Its size and richness allow for both exploratory and explanatory analysis at the pitcher and hitter levels.
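The download described above could look roughly like the sketch below. The date window and the helper name load_season are assumptions made for illustration, not details taken from the project; the column list mirrors the Column Definitions section. The pybaseball import happens inside the function, so nothing is downloaded until the function is called.

```python
from datetime import date

# Assumed date window for the 2024 regular season; adjust to the project's needs.
SEASON_START = date(2024, 3, 28)
SEASON_END = date(2024, 9, 30)

# Columns listed under "Column Definitions" in this proposal.
KEEP_COLUMNS = [
    "batter", "game_year", "game_date", "home_team", "stand", "p_throws",
    "pitch_type", "effective_speed", "pfx_x", "pfx_z", "plate_x", "plate_z",
    "zone", "description", "launch_speed", "launch_angle", "hc_x", "hc_y",
    "if_fielding_alignment", "events",
]


def load_season(start=SEASON_START, end=SEASON_END, columns=KEEP_COLUMNS):
    """Download pitch-level Statcast data and keep only the listed columns.

    load_season is a hypothetical helper name, not part of pybaseball.
    """
    # Imported lazily so this module can be read without pybaseball installed.
    from pybaseball import cache, statcast

    cache.enable()  # store downloaded data on disk so re-runs are faster
    raw = statcast(start_dt=start.isoformat(), end_dt=end.isoformat())
    # Keep only the requested columns that the download actually contains.
    present = [c for c in columns if c in raw.columns]
    return raw[present]
```

Calling load_season() fetches a full season and can take several minutes, so saving the result to the data/ folder once and reading it from disk afterwards is likely the more practical workflow.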
Column Definitions
batter – MLB Player Id tied to the play event.
game_year – Year game took place.
game_date – The calendar date of the game (YYYY-MM-DD).
home_team – Abbreviation of home team.
stand – Side of the plate batter is standing.
p_throws – Hand pitcher throws with.
pitch_type – The type of pitch derived from Statcast.
effective_speed – Derived speed based on the extension of the pitcher’s release.
pfx_x – Horizontal movement in feet from the catcher’s perspective.
pfx_z – Vertical movement in feet from the catcher’s perspective.
plate_x – Horizontal position of the ball when it crosses home plate from the catcher’s perspective.
plate_z – Vertical position of the ball when it crosses home plate from the catcher’s perspective.
zone – Zone location of the ball when it crosses the plate from the catcher’s perspective.
description – Description of the resulting pitch.
launch_speed – Exit velocity of the batted ball as tracked by Statcast. For the limited subset of batted balls not tracked directly, estimates are included based on the process described here (http://tangotiger.com/index.php/site/article/statcast-lab-no-nulls-in-batted-balls-launch-parameters).
launch_angle – Launch angle of the batted ball as tracked by Statcast.
hc_x – Hit coordinate X of batted ball.
hc_y – Hit coordinate Y of batted ball.
if_fielding_alignment – Infield fielding alignment at the time of the pitch.
events – Event of the resulting Plate Appearance.
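Several of the columns above can be turned into simple per-pitch flags. The sketch below is one possible encoding and rests on stated assumptions: the set of description values treated as swings (bunt attempts are left out), the 95 mph cutoff for hard contact, and the function names are all illustrative choices rather than definitions from this proposal. It relies on Statcast numbering zones 1 to 9 inside the strike zone and 11 to 14 outside it.

```python
import math

# Assumption: these description values count as swings; bunt attempts are excluded.
SWING_DESCRIPTIONS = {
    "swinging_strike",
    "swinging_strike_blocked",
    "foul",
    "foul_tip",
    "hit_into_play",
}

# Assumption: exit velocity of at least 95 mph counts as hard contact.
HARD_HIT_MPH = 95.0


def _missing(value):
    """True for None or NaN, which is how empty Statcast cells usually arrive."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_swing(description):
    """Did the batter swing at this pitch?"""
    return description in SWING_DESCRIPTIONS


def in_zone(zone):
    """Zones 1-9 tile the strike zone; zones 11-14 lie outside it."""
    return (not _missing(zone)) and 1 <= int(zone) <= 9


def is_hard_hit(description, launch_speed):
    """Hard contact on a ball put in play (fouls also carry a launch_speed)."""
    if description != "hit_into_play" or _missing(launch_speed):
        return False
    return launch_speed >= HARD_HIT_MPH
```

Applied to the first preview row (zone 2.0, swinging_strike), is_swing and in_zone both return True, while the blocked_ball in zone 14.0 is neither a swing nor in the zone.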
Questions
Can we predict the likelihood of a hitter producing a favorable outcome (e.g., hit or hard contact) based on pitch type and location?
How do swing decisions and contact quality vary by pitch characteristics and location, and can these patterns be used to identify hitter tendencies?
Analysis Plan
To address the first question, predicting the likelihood of a favorable outcome, I plan to build classification models that take into account pitch type, pitch location, and hitter swing decisions (e.g., swing/no swing, in-zone vs. out-of-zone) to predict whether a given pitch results in a hit or another favorable result (e.g., hard contact or ball in play). Potential models include logistic regression, random forest classifiers, or gradient-boosted trees. I’ll evaluate model performance using metrics such as accuracy, precision, recall, and AUC.
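A minimal sketch of the classification step, assuming scikit-learn and pandas are used. The feature subset, the binary target column named favorable, and the helper names are hypothetical, and the demo data at the bottom is randomly generated only to show the call pattern, so its scores carry no meaning. Logistic regression is used here because it is the simplest of the candidate models; a linear model on raw plate coordinates cannot capture a strike-zone-shaped effect, which is one reason the tree-based models are also on the list.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature subset; swing-decision features could be appended the same way.
FEATURES = ["pitch_type", "plate_x", "plate_z"]


def build_model():
    """One-hot encode pitch type, scale location, then fit logistic regression."""
    preprocess = ColumnTransformer([
        ("pitch", OneHotEncoder(handle_unknown="ignore"), ["pitch_type"]),
        ("location", StandardScaler(), ["plate_x", "plate_z"]),
    ])
    # class_weight="balanced" because favorable outcomes are rare per pitch.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    return Pipeline([("preprocess", preprocess), ("clf", clf)])


def fit_and_score(df, target="favorable", seed=0):
    """Hold out 25% of pitches and report the four metrics named in the plan."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df[target],
        test_size=0.25, stratify=df[target], random_state=seed,
    )
    model = build_model().fit(X_train, y_train)
    pred = model.predict(X_test)
    prob = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "auc": roc_auc_score(y_test, prob),
    }


# Synthetic stand-in data (NOT Statcast), only to demonstrate the call pattern.
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "pitch_type": rng.choice(["FF", "SI", "SL"], size=n),
    "plate_x": rng.normal(0.0, 0.8, size=n),
    "plate_z": rng.normal(2.5, 0.9, size=n),
})
demo["favorable"] = (rng.random(n) < 0.3).astype(int)
scores = fit_and_score(demo)
```

Because most pitches do not end in a hit, accuracy alone can look strong for a model that never predicts the positive class, so precision, recall, and AUC are probably the more informative numbers to compare across models.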
For the second question, identifying hitter tendencies, I will explore how swing decisions and contact quality vary across different pitch types and locations. I’ll use clustering techniques (e.g., k-means or hierarchical clustering) to group hitters based on their swing/contact profiles, which may reveal meaningful patterns or performance archetypes. Feature engineering may include creating aggregated zone-specific swing/contact rates or composite aggressiveness scores.
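The aggregated swing rates mentioned above could be computed along these lines. This is a sketch using only the standard library and the ten preview rows from the top of this document (batter, zone, description). The swing definition and the two rate names are assumptions, and ten pitches are far too few to say anything about either hitter.

```python
from collections import defaultdict

# Assumption: these description values count as swings; bunt attempts are excluded.
SWING_DESCRIPTIONS = {
    "swinging_strike", "swinging_strike_blocked", "foul", "foul_tip", "hit_into_play",
}

# (batter, zone, description) copied from the ten preview rows of the dataset.
PREVIEW = [
    (686668, 2, "swinging_strike"),
    (686668, 8, "foul"),
    (686668, 14, "blocked_ball"),
    (686668, 8, "foul"),
    (686668, 14, "ball"),
    (686668, 8, "foul"),
    (686668, 14, "ball"),
    (669911, 12, "swinging_strike"),
    (669911, 14, "foul"),
    (669911, 1, "foul"),
]


def _rate(seen_swung):
    """Swings divided by pitches seen, or None when no pitches were seen."""
    seen, swung = seen_swung
    return swung / seen if seen else None


def swing_profiles(pitches):
    """Per-batter swing rate inside the zone (1-9) and outside it (chase rate)."""
    counts = defaultdict(lambda: {"in_zone": [0, 0], "out_zone": [0, 0]})
    for batter, zone, description in pitches:
        key = "in_zone" if 1 <= zone <= 9 else "out_zone"
        counts[batter][key][0] += 1                                  # pitches seen
        counts[batter][key][1] += description in SWING_DESCRIPTIONS  # swings taken
    return {
        batter: {
            "zone_swing_rate": _rate(c["in_zone"]),
            "chase_rate": _rate(c["out_zone"]),
        }
        for batter, c in counts.items()
    }


profiles = swing_profiles(PREVIEW)
# profiles[686668] -> {"zone_swing_rate": 1.0, "chase_rate": 0.0}
# profiles[669911] -> {"zone_swing_rate": 1.0, "chase_rate": 1.0}
```

Computed over a full season, each hitter's rates form a feature vector that could be standardized and passed to k-means or hierarchical clustering; splitting the rates further by pitch type or by individual zone would give the finer-grained profiles described above.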
Plan of Attack
Week 1 (Aug 5–11):
Clean and explore the dataset
Create key visualizations to understand trends and relationships
Finalize research questions and modeling strategy
Week 2 (Aug 12–18):
Build and evaluate predictive models
Tune model parameters and assess performance
Begin drafting report and assembling visualizations
Week 3 (Aug 19–20):
Finalize code and polish documentation
Complete report and visualizations
Organize GitHub repository and submit the project
Repository Organization
The project repository is organized into clearly labeled folders for data, scripts, and outputs. The data/ folder stores raw and cleaned datasets, scripts/ contains all Python and Quarto code used for analysis, and outputs/ holds visualizations and final results. Each folder includes a README.md file to briefly explain its contents and purpose.