Sweet Spotting: Predicting Pitcher Effectiveness from Pitch Location Zones

Proposal

Model and visualize how pitch location and swing behavior influence the likelihood of quality contact, incorporating zone-based tendencies at both the pitcher and hitter levels.

Author: Trevor Abshire

Affiliation: College of Information Science, University of Arizona

Dataset

   Unnamed: 0  batter  player_name    game_date   stand  p_throws  pitch_type  effective_speed  pfx_x  pfx_z  plate_x  plate_z  zone  description      launch_speed  launch_angle  hc_x  hc_y  if_fielding_alignment  events
0  0           686668  Ginkel, Kevin  2024-03-31  R      R         FF          97.0             -0.57  1.33   0.08     3.23     2.0   swinging_strike  NaN           NaN           NaN   NaN   Standard               strikeout
1  1           686668  Ginkel, Kevin  2024-03-31  R      R         SI          96.0             -1.18  0.97   -0.04    1.74     8.0   foul             88.8          -43.0         NaN   NaN   Standard               NaN
2  2           686668  Ginkel, Kevin  2024-03-31  R      R         SL          89.8             0.36   -0.11  1.72     0.27     14.0  blocked_ball     NaN           NaN           NaN   NaN   Standard               NaN
3  3           686668  Ginkel, Kevin  2024-03-31  R      R         SL          88.2             0.40   -0.31  0.27     1.74     8.0   foul             47.7          -38.0         NaN   NaN   Standard               NaN
4  4           686668  Ginkel, Kevin  2024-03-31  R      R         SI          96.4             -1.20  0.93   0.16     1.02     14.0  ball             NaN           NaN           NaN   NaN   Standard               NaN
5  5           686668  Ginkel, Kevin  2024-03-31  R      R         SI          95.3             -1.23  0.92   -0.03    1.64     8.0   foul             92.6          -37.0         NaN   NaN   Standard               NaN
6  6           686668  Ginkel, Kevin  2024-03-31  R      R         FF          97.1             -0.86  0.94   0.38     1.22     14.0  ball             NaN           NaN           NaN   NaN   Standard               NaN
7  7           669911  Ginkel, Kevin  2024-03-31  L      R         FF          97.1             -0.88  1.34   0.03     4.00     12.0  swinging_strike  NaN           NaN           NaN   NaN   Strategic              strikeout
8  8           669911  Ginkel, Kevin  2024-03-31  L      R         SL          88.7             0.35   -0.20  0.66     1.45     14.0  foul             68.9          -22.0         NaN   NaN   Standard               NaN
9  9           669911  Ginkel, Kevin  2024-03-31  L      R         FF          96.9             -0.91  1.37   -0.55    3.28     1.0   foul             78.6          31.0          NaN   NaN   Standard               NaN

[Figure: Pitches in Each Zone]

[Figure: x and z’s]

High-Level Goal

I plan to analyze pitch-level Statcast data to model the relationship between pitch location, swing decisions, and quality contact, using spatial visualizations and zone-based metrics for both pitchers and hitters.

Dataset Description

This project uses Major League Baseball Statcast data from the 2024 season, obtained with the statcast function from the Python library pybaseball. The dataset has 709,511 rows and 119 columns, and each row represents a single pitch. It records the pitcher and hitter involved in each pitch, along with contextual variables such as pitch type, pitch location, exit velocity, launch angle, and batted ball outcome.
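As a rough illustration of the retrieval step, the sketch below wraps pybaseball's statcast function and keeps only the columns defined in this proposal. The helper name load_statcast_subset and the date window are assumptions made for the example, not details taken from the project; the window should be set to whatever span was actually pulled.

```python
from datetime import date

# Columns listed under "Column Definitions" in this proposal.
KEEP_COLUMNS = [
    "batter", "game_year", "game_date", "home_team", "stand", "p_throws",
    "pitch_type", "effective_speed", "pfx_x", "pfx_z", "plate_x", "plate_z",
    "zone", "description", "launch_speed", "launch_angle", "hc_x", "hc_y",
    "if_fielding_alignment", "events",
]

# Assumed date window (hypothetical; adjust to the span actually downloaded).
SEASON_START = date(2024, 3, 28)
SEASON_END = date(2024, 9, 30)


def load_statcast_subset(start=SEASON_START, end=SEASON_END, columns=KEEP_COLUMNS):
    """Download pitch-level Statcast rows and keep only the listed columns.

    Hypothetical helper: requires network access and the pybaseball package.
    """
    # Imported lazily so this module can be loaded without pybaseball installed.
    from pybaseball import statcast

    raw = statcast(start_dt=start.isoformat(), end_dt=end.isoformat())
    # Keep only columns that are actually present, in case the schema changes.
    present = [c for c in columns if c in raw.columns]
    return raw[present]
```

Calling load_statcast_subset() for a full season can take several minutes, so saving the result to disk once and reloading it for analysis is likely the more practical workflow.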

This dataset was chosen because it enables player-level analysis by aggregating outcomes and tendencies across thousands of in-game events. It supports the exploration of how different pitching and hitting profiles relate to performance, and provides the granularity needed to evaluate behavior across specific zones, counts, and contact types. Its size and richness allow for both exploratory and explanatory analysis at the pitcher and hitter levels.

Column Definitions

batter – MLB player ID of the batter tied to the play event.

game_year – Year game took place.

game_date – The calendar date of the game (YYYY-MM-DD).

home_team – Abbreviation of home team.

stand – Side of the plate the batter stands on.

p_throws – Hand pitcher throws with.

pitch_type – The type of pitch derived from Statcast.

effective_speed – Derived speed based on the extension of the pitcher’s release.

pfx_x – Horizontal movement in feet from the catcher’s perspective.

pfx_z – Vertical movement in feet from the catcher’s perspective.

plate_x – Horizontal position of the ball when it crosses home plate from the catcher’s perspective.

plate_z – Vertical position of the ball when it crosses home plate from the catcher’s perspective.

zone – Zone location of the ball when it crosses the plate from the catcher’s perspective.

description – Description of the resulting pitch.

launch_speed – Exit velocity of the batted ball as tracked by Statcast. For the limited subset of batted balls not tracked directly, estimates are included based on the process described here (http://tangotiger.com/index.php/site/article/statcast-lab-no-nulls-in-batted-balls-launch-parameters).

launch_angle – Launch angle of the batted ball as tracked by Statcast.

hc_x – Hit coordinate X of batted ball.

hc_y – Hit coordinate Y of batted ball.

if_fielding_alignment – Infield fielding alignment at the time of the pitch.

events – Event of the resulting Plate Appearance.
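Several of the columns defined above can be combined into the swing and contact flags that the analysis depends on. The sketch below is one possible encoding, not the project's final definitions: the set of description values treated as swings, the 95 mph hard-contact threshold, and the function name add_pitch_flags are all assumptions, as is the reading of zones 1 through 9 as the strike zone and 11 through 14 as outside it.

```python
import pandas as pd

# Assumed `description` values that count as a swing (not taken from the proposal).
SWING_DESCRIPTIONS = {
    "swinging_strike", "swinging_strike_blocked", "foul", "foul_tip", "hit_into_play",
}
HARD_HIT_MPH = 95.0  # assumed hard-contact threshold on exit velocity


def add_pitch_flags(pitches: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `pitches` with swing, in_zone, in_play, and hard_contact flags."""
    out = pitches.copy()
    out["swing"] = out["description"].isin(SWING_DESCRIPTIONS)
    # Assumption: zones 1-9 are the strike zone grid, 11-14 lie outside it.
    out["in_zone"] = out["zone"].between(1, 9)
    out["in_play"] = out["description"].eq("hit_into_play")
    # Fouls can also carry a launch_speed, so hard contact is limited to balls in play.
    out["hard_contact"] = out["in_play"] & out["launch_speed"].ge(HARD_HIT_MPH)
    return out


# First four rows mirror the dataset preview; the last row is made up for illustration.
sample = pd.DataFrame({
    "zone": [2.0, 8.0, 14.0, 8.0, 5.0],
    "description": ["swinging_strike", "foul", "blocked_ball", "foul", "hit_into_play"],
    "launch_speed": [float("nan"), 88.8, float("nan"), 47.7, 101.3],
})
flagged = add_pitch_flags(sample)
```

Missing values in zone or launch_speed evaluate to False in these comparisons, so untracked pitches are never counted as in-zone or hard contact.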

Questions

Can we predict the likelihood of a hitter producing a favorable outcome (e.g., hit or hard contact) based on pitch type and location?

How do swing decisions and contact quality vary by pitch characteristics and location, and can these patterns be used to identify hitter tendencies?

Analysis Plan

To address the first question, predicting the likelihood of a favorable outcome, I plan to build classification models that take into account pitch type, pitch location, and hitter swing decisions (e.g., swing/no swing, in-zone vs. out-of-zone) to predict whether a given pitch results in a hit or another favorable result (e.g., hard contact or ball in play). Potential models include logistic regression, random forest classifiers, or gradient-boosted trees. I’ll evaluate model performance using metrics such as accuracy, precision, recall, and AUC.
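A minimal sketch of the classification workflow follows, using scikit-learn and a random forest as one of the candidate models. The data here is synthetic and stands in for the real pitch-level table; the assumed signal (favorable outcomes are likelier near the middle of the zone), the feature list, and the hyperparameters are illustrative choices only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 4000

# Synthetic stand-in for pitch-level data (NOT real Statcast rows).
pitches = pd.DataFrame({
    "pitch_type": rng.choice(["FF", "SI", "SL"], size=n),
    "plate_x": rng.normal(0.0, 0.8, size=n),
    "plate_z": rng.normal(2.5, 0.8, size=n),
})
# Assumed signal: favorable outcomes are likelier near the middle of the zone.
dist_sq = pitches["plate_x"] ** 2 + (pitches["plate_z"] - 2.5) ** 2
prob = 1.0 / (1.0 + np.exp(-(1.0 - 3.0 * dist_sq)))
pitches["favorable"] = (rng.random(n) < prob).astype(int)

features = ["pitch_type", "plate_x", "plate_z"]
X_train, X_test, y_train, y_test = train_test_split(
    pitches[features], pitches["favorable"],
    test_size=0.25, stratify=pitches["favorable"], random_state=0,
)

# One-hot encode pitch type; pass the location columns through unchanged.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("pitch_type", OneHotEncoder(handle_unknown="ignore"), ["pitch_type"])],
        remainder="passthrough",
    )),
    ("forest", RandomForestClassifier(
        n_estimators=200, min_samples_leaf=20, random_state=0,
    )),
])
model.fit(X_train, y_train)

pred = model.predict(X_test)
score = model.predict_proba(X_test)[:, 1]  # probability of the favorable class
metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred, zero_division=0),
    "recall": recall_score(y_test, pred),
    "auc": roc_auc_score(y_test, score),
}
```

Because favorable outcomes are a minority of pitches, accuracy alone can look strong for a model that rarely predicts the positive class, which is why precision, recall, and AUC are reported alongside it.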

For the second question, identifying hitter tendencies, I will explore how swing decisions and contact quality vary across different pitch types and locations. I’ll use clustering techniques (e.g., k-means or hierarchical clustering) to group hitters based on their swing/contact profiles, which may reveal meaningful patterns or performance archetypes. Feature engineering may include creating aggregated zone-specific swing/contact rates or composite aggressiveness scores.
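The clustering step could look roughly like the sketch below: pitch-level swing flags are aggregated into per-hitter, per-zone swing rates, standardized, and grouped with k-means. The hitters, the two swing archetypes, and the choice of two clusters are all synthetic assumptions for illustration; the real analysis would derive the swing flag from the description column and pick the number of clusters empirically.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
ZONES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14]

# Synthetic pitches for 30 hypothetical hitters (NOT real Statcast rows).
rows = []
for batter in range(30):
    aggressive = batter % 2 == 0  # assumed archetype, used only to generate data
    for zone in ZONES:
        if zone <= 9:
            p_swing = 0.75 if aggressive else 0.60
        else:
            p_swing = 0.45 if aggressive else 0.15
        swings = rng.random(40) < p_swing
        rows.extend({"batter": batter, "zone": zone, "swing": int(s)} for s in swings)
pitches = pd.DataFrame(rows)

# One row per hitter, one column per zone, values are swing rates.
swing_rates = pitches.pivot_table(
    index="batter", columns="zone", values="swing", aggfunc="mean",
)

# Standardize so each zone contributes on the same scale, then cluster.
scaled = StandardScaler().fit_transform(swing_rates.to_numpy())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
profiles = swing_rates.assign(cluster=labels)
```

Averaging the zone columns within each value of profiles["cluster"] gives a per-cluster swing profile that can be plotted against the zone grid to interpret each archetype.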

Plan of Attack

Week 1 (Aug 5–11):

Clean and explore the dataset

Create key visualizations to understand trends and relationships

Finalize research questions and modeling strategy

Week 2 (Aug 12–18):

Build and evaluate predictive models

Tune model parameters and assess performance

Begin drafting report and assembling visualizations

Week 3 (Aug 19–20):

Finalize code and polish documentation

Complete report and visualizations

Organize GitHub repository and submit the project

Repository Organization

The project repository is organized into clearly labeled folders for data, scripts, and outputs. The data/ folder stores raw and cleaned datasets, scripts/ contains all Python and Quarto code used for analysis, and outputs/ holds visualizations and final results. Each folder includes a README.md file to briefly explain its contents and purpose.