Sweet Spotting: Predicting Pitcher Effectiveness from Pitch Location Zones

Proposal

Model and visualize how pitch location and swing behavior influence the likelihood of quality contact, incorporating zone-based tendencies at both the pitcher and hitter level.

Author: Trevor Abshire

Affiliation: College of Information Science, University of Arizona

Dataset

	Unnamed: 0	batter	player_name	game_date	stand	p_throws	pitch_type	effective_speed	pfx_x	pfx_z	plate_x	plate_z	zone	description	launch_speed	launch_angle	hc_x	hc_y	if_fielding_alignment	events
0	0	686668	Ginkel, Kevin	2024-03-31	R	R	FF	97.0	-0.57	1.33	0.08	3.23	2.0	swinging_strike	NaN	NaN	NaN	NaN	Standard	strikeout
1	1	686668	Ginkel, Kevin	2024-03-31	R	R	SI	96.0	-1.18	0.97	-0.04	1.74	8.0	foul	88.8	-43.0	NaN	NaN	Standard	NaN
2	2	686668	Ginkel, Kevin	2024-03-31	R	R	SL	89.8	0.36	-0.11	1.72	0.27	14.0	blocked_ball	NaN	NaN	NaN	NaN	Standard	NaN
3	3	686668	Ginkel, Kevin	2024-03-31	R	R	SL	88.2	0.40	-0.31	0.27	1.74	8.0	foul	47.7	-38.0	NaN	NaN	Standard	NaN
4	4	686668	Ginkel, Kevin	2024-03-31	R	R	SI	96.4	-1.20	0.93	0.16	1.02	14.0	ball	NaN	NaN	NaN	NaN	Standard	NaN
5	5	686668	Ginkel, Kevin	2024-03-31	R	R	SI	95.3	-1.23	0.92	-0.03	1.64	8.0	foul	92.6	-37.0	NaN	NaN	Standard	NaN
6	6	686668	Ginkel, Kevin	2024-03-31	R	R	FF	97.1	-0.86	0.94	0.38	1.22	14.0	ball	NaN	NaN	NaN	NaN	Standard	NaN
7	7	669911	Ginkel, Kevin	2024-03-31	L	R	FF	97.1	-0.88	1.34	0.03	4.00	12.0	swinging_strike	NaN	NaN	NaN	NaN	Strategic	strikeout
8	8	669911	Ginkel, Kevin	2024-03-31	L	R	SL	88.7	0.35	-0.20	0.66	1.45	14.0	foul	68.9	-22.0	NaN	NaN	Standard	NaN
9	9	669911	Ginkel, Kevin	2024-03-31	L	R	FF	96.9	-0.91	1.37	-0.55	3.28	1.0	foul	78.6	31.0	NaN	NaN	Standard	NaN

High-Level Goal

I plan to analyze pitch-level Statcast data to model the relationship between pitch location, swing decisions, and quality contact, using spatial visualizations and zone-based metrics for both pitchers and hitters.

Dataset Description

This project proposes to use a dataset containing Major League Baseball Statcast data from the 2024 season. The data was obtained using the Python library pybaseball via the statcast function. It includes 709,511 rows and 119 columns, with each row representing an individual pitch event. The dataset captures detailed information on both the pitcher and hitter involved in each event, as well as contextual variables such as pitch type, exit velocity, launch angle, batted ball outcome, and pitch location.
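A minimal sketch of how a season-long pull with `pybaseball` might look. The `statcast` function takes `start_dt` and `end_dt` date strings; requesting one month at a time is an assumption here, as are the helper names `month_ranges` and `fetch_season` and the March to October span. The result would still need filtering (for example by game type) to match the row count quoted above.

```python
from calendar import monthrange
from datetime import date


def month_ranges(year, first_month, last_month):
    """Return (start, end) ISO date strings covering each month in the span."""
    spans = []
    for month in range(first_month, last_month + 1):
        last_day = monthrange(year, month)[1]  # number of days in the month
        spans.append((date(year, month, 1).isoformat(),
                      date(year, month, last_day).isoformat()))
    return spans


def fetch_season(year, first_month=3, last_month=10):
    """Download pitch-level data one month per request.

    Hypothetical helper: the month span is an assumption, not the
    proposal's documented procedure.
    """
    import pandas as pd
    from pybaseball import statcast  # imported lazily; needs network access

    frames = [statcast(start_dt=start, end_dt=end)
              for start, end in month_ranges(year, first_month, last_month)]
    return pd.concat(frames, ignore_index=True)


# Only the date arithmetic runs here; fetch_season(2024) would hit the network.
print(month_ranges(2024, 3, 4))
```

Splitting the request by month keeps each download small enough to retry on its own if a request fails.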

This dataset was chosen because it enables player-level analysis by aggregating outcomes and tendencies across thousands of in-game events. It supports the exploration of how different pitching and hitting profiles relate to performance, and provides the granularity needed to evaluate behavior across specific zones, counts, and contact types. Its size and richness allow for both exploratory and explanatory analysis at the pitcher and hitter levels.

Column Definitions

batter – MLB Player Id tied to the play event.

game_year – Year game took place.

game_date – The calendar date of the game (YYYY-MM-DD).

home_team – Abbreviation of home team.

stand – Side of the plate batter is standing.

p_throws – Hand pitcher throws with.

pitch_type – The type of pitch derived from Statcast.

effective_speed – Derived speed based on the extension of the pitcher’s release.

pfx_x – Horizontal movement in feet from the catcher’s perspective.

pfx_z – Vertical movement in feet from the catcher’s perspective.

plate_x – Horizontal position of the ball when it crosses home plate from the catcher’s perspective.

plate_z – Vertical position of the ball when it crosses home plate from the catcher’s perspective.

zone – Zone location of the ball when it crosses the plate from the catcher’s perspective.

description – Description of the resulting pitch.

launch_speed – Exit velocity of the batted ball as tracked by Statcast. For the limited subset of batted balls not tracked directly, estimates are included based on the process described here (http://tangotiger.com/index.php/site/article/statcast-lab-no-nulls-in-batted-balls-launch-parameters).

launch_angle – Launch angle of the batted ball as tracked by Statcast.

hc_x – Hit coordinate X of batted ball.

hc_y – Hit coordinate Y of batted ball.

if_fielding_alignment – Infield fielding alignment at the time of the pitch.

events – Event of the resulting Plate Appearance.
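The columns above can be combined into the pitch-level flags the analysis relies on. The sketch below is one possible encoding, not a settled definition: the set of `description` values counted as swings (bunt attempts excluded), the use of zones 1–9 as "in zone", and restricting hard contact to balls in play are all assumptions. The 95 mph cutoff follows Statcast's hard-hit definition. The helper name `add_pitch_flags` is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: these description values count as swings (bunts excluded).
SWING_DESCRIPTIONS = {
    "swinging_strike", "swinging_strike_blocked",
    "foul", "foul_tip", "hit_into_play",
}
HARD_HIT_MPH = 95.0  # Statcast's hard-hit threshold for exit velocity


def add_pitch_flags(pitches: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `pitches` with swing, in_zone, and hard_hit flags."""
    out = pitches.copy()
    out["swing"] = out["description"].isin(SWING_DESCRIPTIONS)
    # Zones 1-9 form the 3x3 strike-zone grid; 11-14 lie outside it.
    out["in_zone"] = out["zone"].between(1, 9)
    # Assumption: only balls put in play can count as hard contact.
    in_play = out["description"].eq("hit_into_play")
    out["hard_hit"] = in_play & out["launch_speed"].ge(HARD_HIT_MPH)
    return out


# Tiny illustrative frame (values are made up, not taken from the dataset).
sample = pd.DataFrame({
    "description": ["swinging_strike", "foul", "blocked_ball", "hit_into_play"],
    "zone": [2.0, 8.0, 14.0, 5.0],
    "launch_speed": [np.nan, 88.8, np.nan, 101.3],
})
flags = add_pitch_flags(sample)
print(flags[["swing", "in_zone", "hard_hit"]])
```

Missing `zone` or `launch_speed` values evaluate to `False` in these comparisons, so rows with missing values are never flagged.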

Questions

Can we predict the likelihood of a hitter producing a favorable outcome (e.g., hit or hard contact) based on pitch type and location?

How do swing decisions and contact quality vary by pitch characteristics and location, and can these patterns be used to identify hitter tendencies?

Analysis Plan

To address the first question, predicting the likelihood of a favorable outcome, I plan to build classification models that take into account pitch type, pitch location, and hitter swing decisions (e.g., swing/no swing, in-zone vs. out-of-zone) to predict whether a given pitch results in a hit or another favorable result (e.g., hard contact or ball in play). Potential models include logistic regression, random forest classifiers, or gradient-boosted trees. I’ll evaluate model performance using metrics such as accuracy, precision, recall, and AUC.
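As a rough illustration of the first modeling step, the sketch below fits a logistic regression with scikit-learn and reports AUC. It runs on synthetic data rather than the Statcast extract: the label `favorable`, the engineered feature `dist_center`, and the assumed zone center at (0, 2.5) feet are placeholders chosen for the example. A distance feature is included because a linear model on raw `plate_x` and `plate_z` cannot represent "closer to the middle is better".

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for the pitch table (not real Statcast data).
toy = pd.DataFrame({
    "pitch_type": rng.choice(["FF", "SI", "SL"], size=n),
    "plate_x": rng.normal(0.0, 0.8, size=n),
    "plate_z": rng.normal(2.5, 0.9, size=n),
})
# Assumed zone center at (0, 2.5) ft; distance from it is the key feature.
toy["dist_center"] = np.hypot(toy["plate_x"], toy["plate_z"] - 2.5)
# Simulated label: favorable outcomes are likelier near the center.
p_favorable = 1.0 / (1.0 + np.exp(2.0 * toy["dist_center"] - 1.0))
toy["favorable"] = (rng.random(n) < p_favorable).astype(int)

X = toy[["pitch_type", "dist_center"]]
y = toy["favorable"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# One-hot encode pitch type, scale the numeric feature, then fit the model.
preprocess = ColumnTransformer([
    ("pitch", OneHotEncoder(handle_unknown="ignore"), ["pitch_type"]),
    ("location", StandardScaler(), ["dist_center"]),
])
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
print(f"Held-out AUC: {auc:.3f}")
```

Because favorable outcomes are likely to be a minority of pitches, AUC, precision, and recall will probably be more informative than raw accuracy. The same pipeline shape can take a random forest or gradient-boosted model in place of `LogisticRegression`.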

For the second question, identifying hitter tendencies, I will explore how swing decisions and contact quality vary across different pitch types and locations. I’ll use clustering techniques (e.g., k-means or hierarchical clustering) to group hitters based on their swing/contact profiles, which may reveal meaningful patterns or performance archetypes. Feature engineering may include creating aggregated zone-specific swing/contact rates or composite aggressiveness scores.
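The zone-specific swing rates and clustering described above could be sketched as follows. The data is simulated (two made-up hitter groups that differ only in how often they chase pitches in zones 11–14), and the choice of k = 2 is purely illustrative; on the real data the number of clusters would need to be selected, for example with silhouette scores.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_batters, n_pitches = 30, 1000
zones = np.array(list(range(1, 10)) + [11, 12, 13, 14])

# Simulated pitches: the first 15 hitters chase less than the last 15.
batter = np.repeat(np.arange(n_batters), n_pitches)
zone = rng.choice(zones, size=batter.size)
chase_rate = np.where(batter < 15, 0.20, 0.45)
swing_prob = np.where(zone <= 9, 0.70, chase_rate)
pitches = pd.DataFrame({
    "batter": batter,
    "zone": zone,
    "swing": (rng.random(batter.size) < swing_prob).astype(float),
})

# One row per hitter, one column per zone, values = swing rate in that zone.
swing_rates = pitches.pivot_table(
    index="batter", columns="zone", values="swing", aggfunc="mean"
).fillna(0.0)

# Group hitters by their zone-level swing profile.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(swing_rates.to_numpy())

print(swing_rates.round(2).head())
print(pd.Series(labels, index=swing_rates.index).value_counts())
```

On the real data, hitters with few pitches in a zone would produce noisy rates, so a minimum sample size per hitter and zone is probably worth enforcing before clustering.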

Plan of Attack

Week 1 (Aug 5–11):

Clean and explore the dataset

Create key visualizations to understand trends and relationships

Finalize research questions and modeling strategy

Week 2 (Aug 12–18):

Build and evaluate predictive models

Tune model parameters and assess performance

Begin drafting report and assembling visualizations

Week 3 (Aug 19–20):

Finalize code and polish documentation

Complete report and visualizations

Organize GitHub repository and submit the project

Repository Organization

The project repository is organized into clearly labeled folders for data, scripts, and outputs. The data/ folder stores raw and cleaned datasets, scripts/ contains all Python and Quarto code used for analysis, and outputs/ holds visualizations and final results. Each folder includes a README.md file to briefly explain its contents and purpose.

---
title: "Sweet Spotting: Predicting Pitcher Effectiveness from Pitch Location Zones"
subtitle: "Proposal"
author:
  - name: "Trevor Abshire"
    affiliations:
      - name: "College of Information Science, University of Arizona"
description: "Model and visualize how pitch location and swing behavior influence the likelihood of quality contact, incorporating zone-based tendencies at both the pitcher and hitter level."
format:
  html:
    code-tools: true
    code-overflow: wrap
    code-line-numbers: true
    embed-resources: true
editor: visual
code-annotations: hover
execute:
  warning: false
jupyter: python3
---

```{python}
#| label: load-pkgs
#| message: false
#| echo: false
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pybaseball import statcast
from calendar import monthrange
```

## Dataset

```{python}
#| label: load-dataset
#| message: false
#| echo: false
# Load the pitch-level extract once; later chunks reuse `data`.
data = pd.read_csv("data/pitch_2024_relevant.csv")
data.head(10)
```

```{python}
#| label: hist-targets
#| fig-cap: "Pitches in Each Zone"
#| fig-align: center
#| echo: false
# Count pitches in each zone
zone_counts = data['zone'].value_counts().sort_index()

# Initialize a 5x3 grid (rows: top to bottom, cols: left to right).
# Cells with no zone assigned stay NaN so the heatmap leaves them blank
# instead of reporting a misleading count of 0.
zone_matrix = np.full((5, 3), np.nan)

# Top row: zones 11 (left), 12 (right)
zone_matrix[0, 0] = zone_counts.get(11, 0)
zone_matrix[0, 2] = zone_counts.get(12, 0)

# Zones 1–9: standard 3x3 grid
zone_matrix[1:4, :] = np.reshape(
    zone_counts.reindex(range(1, 10), fill_value=0).values, (3, 3)
)

# Bottom row: zones 13 (left), 14 (right)
zone_matrix[4, 0] = zone_counts.get(13, 0)
zone_matrix[4, 2] = zone_counts.get(14, 0)

# Plot heatmap
plt.figure(figsize=(6, 6))
sns.heatmap(
    zone_matrix,
    annot=True,
    fmt='.0f',
    cmap='Reds',
    cbar=True,
    linewidths=0.5,
    xticklabels=['Left', 'Middle', 'Right'],
    yticklabels=['Above', 'Top', 'Middle', 'Bottom', 'Below'],
)
plt.title('Pitch Count by Zone')
plt.xlabel('Horizontal Location')
plt.ylabel('Vertical Location')
plt.show()
```

```{python}
#| fig-cap: "Pitch locations at the plate (plate_x vs. plate_z)"
#| fig-align: center
#| echo: false
# Filter out any missing values
pitch_locations = data[['plate_x', 'plate_z']].dropna()

# Create scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(pitch_locations['plate_x'], pitch_locations['plate_z'], alpha=0.3, s=10)

# Mark the strike zone boundaries with dashed lines
# (approximate MLB standard: 17 inches wide, 1.5 to 3.5 feet tall)
plt.axhline(1.5, color='red', linestyle='--')
plt.axhline(3.5, color='red', linestyle='--')
plt.axvline(-0.7083, color='red', linestyle='--')
plt.axvline(0.7083, color='red', linestyle='--')

# Set axis limits
plt.xlim(-4, 4)
plt.ylim(-2.5, 7.5)

# Labels and formatting
plt.title('Pitch Locations: plate_x vs plate_z')
plt.xlabel('plate_x (horizontal position in feet)')
plt.ylabel('plate_z (vertical position in feet)')
plt.grid(True)
plt.show()
```

## High-Level Goal

I plan to analyze pitch-level Statcast data to model the relationship between pitch location, swing decisions, and quality contact, using spatial visualizations and zone-based metrics for both pitchers and hitters.

## Dataset Description

This project proposes to use a dataset containing Major League Baseball Statcast data from the 2024 season. The data was obtained using the Python library `pybaseball` via the `statcast` function. It includes 709,511 rows and 119 columns, with each row representing an individual pitch event. The dataset captures detailed information on both the pitcher and hitter involved in each event, as well as contextual variables such as pitch type, exit velocity, launch angle, batted ball outcome, and pitch location.

This dataset was chosen because it enables player-level analysis by aggregating outcomes and tendencies across thousands of in-game events. It supports the exploration of how different pitching and hitting profiles relate to performance, and provides the granularity needed to evaluate behavior across specific zones, counts, and contact types. Its size and richness allow for both exploratory and explanatory analysis at the pitcher and hitter levels.

## Column Definitions

batter – MLB Player Id tied to the play event.

game_year – Year game took place.

game_date – The calendar date of the game (YYYY-MM-DD).

home_team – Abbreviation of home team.

stand – Side of the plate batter is standing.

p_throws – Hand pitcher throws with.

pitch_type – The type of pitch derived from Statcast.

effective_speed – Derived speed based on the extension of the pitcher's release.

pfx_x – Horizontal movement in feet from the catcher's perspective.

pfx_z – Vertical movement in feet from the catcher's perspective.

plate_x – Horizontal position of the ball when it crosses home plate from the catcher's perspective.

plate_z – Vertical position of the ball when it crosses home plate from the catcher's perspective.

zone – Zone location of the ball when it crosses the plate from the catcher's perspective.

description – Description of the resulting pitch.

launch_speed – Exit velocity of the batted ball as tracked by Statcast. For the limited subset of batted balls not tracked directly, estimates are included based on the process described here (http://tangotiger.com/index.php/site/article/statcast-lab-no-nulls-in-batted-balls-launch-parameters).

launch_angle – Launch angle of the batted ball as tracked by Statcast.

hc_x – Hit coordinate X of batted ball.

hc_y – Hit coordinate Y of batted ball.

if_fielding_alignment – Infield fielding alignment at the time of the pitch.

events – Event of the resulting Plate Appearance.

## Questions

Can we predict the likelihood of a hitter producing a favorable outcome (e.g., hit or hard contact) based on pitch type and location?

How do swing decisions and contact quality vary by pitch characteristics and location, and can these patterns be used to identify hitter tendencies?

## Analysis Plan

To address the first question, predicting the likelihood of a favorable outcome, I plan to build classification models that take into account pitch type, pitch location, and hitter swing decisions (e.g., swing/no swing, in-zone vs. out-of-zone) to predict whether a given pitch results in a hit or another favorable result (e.g., hard contact or ball in play). Potential models include logistic regression, random forest classifiers, or gradient-boosted trees. I’ll evaluate model performance using metrics such as accuracy, precision, recall, and AUC.

For the second question, identifying hitter tendencies, I will explore how swing decisions and contact quality vary across different pitch types and locations. I’ll use clustering techniques (e.g., k-means or hierarchical clustering) to group hitters based on their swing/contact profiles, which may reveal meaningful patterns or performance archetypes. Feature engineering may include creating aggregated zone-specific swing/contact rates or composite aggressiveness scores.

## Plan of Attack

Week 1 (Aug 5–11):

Clean and explore the dataset

Create key visualizations to understand trends and relationships

Finalize research questions and modeling strategy

Week 2 (Aug 12–18):

Build and evaluate predictive models

Tune model parameters and assess performance

Begin drafting report and assembling visualizations

Week 3 (Aug 19–20):

Finalize code and polish documentation

Complete report and visualizations

Organize GitHub repository and submit the project

## Repository Organization

The project repository is organized into clearly labeled folders for data, scripts, and outputs. The data/ folder stores raw and cleaned datasets, scripts/ contains all Python and Quarto code used for analysis, and outputs/ holds visualizations and final results. Each folder includes a README.md file to briefly explain its contents and purpose.