Sweet Spotting: Predicting Baseball Hitting Success from Swing Science

INFO 523 - Final Project

This project uses a Random Forest model to predict whether a batted ball will result in a hit based on key pitch and contact features, including launch angle, launch speed, effective pitch speed, and zone.

Author: Trevor Abshire

Affiliation: College of Information Science, University of Arizona

Introduction

The primary objective of this project was to analyze pitch-by-pitch data from Major League Baseball’s Statcast system for the 2024 season and develop a model to predict the probability that a batted ball would result in a hit. The original dataset contained approximately 1 million pitches across the league, so for feasibility, the analysis was restricted to data from the top 100 hitters by average exit velocity.

Although the initial plan was to perform the analysis at the player level, exploration showed that the relationship between swing characteristics and outcomes generalizes across players. The model was therefore built on pooled pitch-level data from the top 100 hitters. Pitch type had minimal influence on hit outcomes, whereas launch angle, launch speed, effective pitch speed, and zone were the more informative predictors. Because the outcome is binary (hit or no hit), a Random Forest classifier was used to model the probability of a hit.

Note: Some portions of the analysis and code were developed with assistance from AI tools, including OpenAI’s ChatGPT and Google’s Gemini. The author reviewed and edited the final content.

Abstract

This project uses machine learning to predict the likelihood that a batted ball in Major League Baseball will result in a hit, based on pitch-level features such as launch angle, launch speed, effective pitch speed, and zone location. A Random Forest classifier trained on data from the top 100 hitters in the 2024 season provides a probabilistic prediction for each batted ball. An interactive interface is intended to let users input custom values and see the predicted outcome, providing insights into contact quality and offensive performance.

Question

How do launch angle, launch speed, pitch zone, and pitch type influence whether a batted ball results in a hit?

Dataset

The dataset was collected using Python’s pybaseball library (Statcast) with credit to GitHub user stephen1694 for their query of the 2024 season data. The raw dataset contains every pitch and its outcome from March through September 2024.

Batter names were merged in from an MLBAM ID lookup table (https://razzball.com/mlbamids/), and the data was filtered to the top 100 hitters by average exit velocity to reduce computational overhead and stay within GitHub’s 100 MB file size limit. Finally, only pitches put into play were retained for modeling, as contact quality cannot be assessed for pitches that were not hit.
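The filtering steps described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's actual pipeline: the helper name `top_hitters_balls_in_play` and the toy rows are hypothetical, and ranking batters by the mean of every recorded `launch_speed` is an assumption about how "average exit velocity" was computed.

```python
from collections import defaultdict
from statistics import mean


def top_hitters_balls_in_play(rows, n_top=100):
    """Keep balls in play from the n_top batters ranked by mean launch_speed."""
    speeds = defaultdict(list)
    for row in rows:
        # Pitches without contact have no launch speed and are skipped here
        if row["launch_speed"] is not None:
            speeds[row["batter"]].append(row["launch_speed"])

    # Rank batters from highest to lowest average exit velocity
    ranked = sorted(speeds, key=lambda b: mean(speeds[b]), reverse=True)
    top = set(ranked[:n_top])

    # Retain only pitches put into play by the selected batters
    return [
        r for r in rows
        if r["batter"] in top and r["description"] == "hit_into_play"
    ]


# Toy rows (made up) using the same column names as the dataset
rows = [
    {"batter": 1, "launch_speed": 101.0, "description": "hit_into_play"},
    {"batter": 1, "launch_speed": None, "description": "swinging_strike"},
    {"batter": 2, "launch_speed": 85.0, "description": "hit_into_play"},
    {"batter": 3, "launch_speed": 95.0, "description": "foul"},
    {"batter": 3, "launch_speed": 97.0, "description": "hit_into_play"},
]
kept = top_hitters_balls_in_play(rows, n_top=2)
print([r["batter"] for r in kept])
```

With `n_top=2`, batters 1 and 3 have the highest averages, so only their two balls in play remain.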

Rows, Columns: (827825, 20)

	Unnamed: 0	batter	player_name	game_date	stand	p_throws	pitch_type	effective_speed	pfx_x	pfx_z	plate_x	plate_z	zone	description	launch_speed	launch_angle	hc_x	hc_y	if_fielding_alignment	events
0	0	686668	Ginkel, Kevin	2024-03-31	R	R	FF	97.0	-0.57	1.33	0.08	3.23	2.0	swinging_strike	NaN	NaN	NaN	NaN	Standard	strikeout
1	1	686668	Ginkel, Kevin	2024-03-31	R	R	SI	96.0	-1.18	0.97	-0.04	1.74	8.0	foul	88.8	-43.0	NaN	NaN	Standard	NaN
2	2	686668	Ginkel, Kevin	2024-03-31	R	R	SL	89.8	0.36	-0.11	1.72	0.27	14.0	blocked_ball	NaN	NaN	NaN	NaN	Standard	NaN
3	3	686668	Ginkel, Kevin	2024-03-31	R	R	SL	88.2	0.40	-0.31	0.27	1.74	8.0	foul	47.7	-38.0	NaN	NaN	Standard	NaN
4	4	686668	Ginkel, Kevin	2024-03-31	R	R	SI	96.4	-1.20	0.93	0.16	1.02	14.0	ball	NaN	NaN	NaN	NaN	Standard	NaN
5	5	686668	Ginkel, Kevin	2024-03-31	R	R	SI	95.3	-1.23	0.92	-0.03	1.64	8.0	foul	92.6	-37.0	NaN	NaN	Standard	NaN
6	6	686668	Ginkel, Kevin	2024-03-31	R	R	FF	97.1	-0.86	0.94	0.38	1.22	14.0	ball	NaN	NaN	NaN	NaN	Standard	NaN
7	7	669911	Ginkel, Kevin	2024-03-31	L	R	FF	97.1	-0.88	1.34	0.03	4.00	12.0	swinging_strike	NaN	NaN	NaN	NaN	Strategic	strikeout
8	8	669911	Ginkel, Kevin	2024-03-31	L	R	SL	88.7	0.35	-0.20	0.66	1.45	14.0	foul	68.9	-22.0	NaN	NaN	Standard	NaN
9	9	669911	Ginkel, Kevin	2024-03-31	L	R	FF	96.9	-0.91	1.37	-0.55	3.28	1.0	foul	78.6	31.0	NaN	NaN	Standard	NaN

Column Definitions

batter – MLB Player ID tied to the play event.
player_name – Name of the pitcher (renamed to pitcher_name in the analysis code).
game_year – Year the game took place.
game_date – The calendar date of the game (YYYY-MM-DD).
home_team – Abbreviation of the home team.
stand – Side of the plate the batter stands on (L or R).
p_throws – Hand the pitcher throws with (L or R).
pitch_type – The type of pitch derived from Statcast.
effective_speed – Speed adjusted based on the pitcher’s release extension.
pfx_x – Horizontal movement in feet from the catcher’s perspective.
pfx_z – Vertical movement in feet from the catcher’s perspective.
plate_x – Horizontal position of the ball when it crosses home plate.
plate_z – Vertical position of the ball when it crosses home plate.
zone – Zone location of the ball when it crosses the plate.
description – Description of the resulting pitch.
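launch_speed – Exit velocity of the batted ball as tracked by Statcast. Estimates are included for batted balls not tracked directly (source: http://tangotiger.com/index.php/site/article/statcast-lab-no-nulls-in-batted-balls-launch-parameters).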
launch_angle – Launch angle of the batted ball as tracked by Statcast.
hc_x – Hit coordinate X of batted ball.
hc_y – Hit coordinate Y of batted ball.
if_fielding_alignment – Infield fielding alignment at the time of the pitch.
events – Event of the resulting plate appearance.

EDA + Visualization

Launch speed (mph) by zone

zone	count	mean	std	median
1.0	2131	91.52	14.63	96.0
2.0	3310	96.43	12.65	99.8
3.0	2023	92.24	15.22	97.4
4.0	5170	94.05	13.83	98.0
5.0	8054	98.12	11.71	101.2
6.0	5085	94.48	13.76	98.8
7.0	3368	94.39	13.18	97.8
8.0	5885	97.44	12.14	100.8
9.0	3747	92.73	13.88	96.5
11.0	1371	84.30	17.11	88.1
12.0	1354	85.24	17.17	89.4
13.0	2255	85.43	15.41	86.9
14.0	2799	83.39	15.66	83.8

Using the mean launch angle and launch speed for each hit type (singles, doubles, triples, and home runs), this scatter plot highlights the typical “windows” in which each type of hit occurs. The shaded regions indicate the ranges of launch speeds and angles where each hit type is most likely:

Singles: Blue shaded area (80.4–101.4 mph launch speed, 1–15° launch angle)
Doubles: Green shaded area (93.7–104.7 mph launch speed, 12–22° launch angle)
Triples: Orange shaded area (94.8–103.6 mph launch speed, 13.75–25° launch angle)
Home Runs: Red shaded area (101.5–107.2 mph launch speed, 25–32° launch angle)

These visual ranges are useful for understanding batted ball outcomes and for creating predictive models of hit success.
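To make the windows concrete, the sketch below checks which shaded regions contain a given batted ball. The bounds are copied from the list above; the names `LAUNCH_WINDOWS` and `matching_windows` are illustrative and not part of the project code. Because the windows overlap, one batted ball can fall into several of them.

```python
# Windows copied from the list above: (min mph, max mph, min deg, max deg)
LAUNCH_WINDOWS = {
    "single": (80.4, 101.4, 1.0, 15.0),
    "double": (93.7, 104.7, 12.0, 22.0),
    "triple": (94.8, 103.6, 13.75, 25.0),
    "home_run": (101.5, 107.2, 25.0, 32.0),
}


def matching_windows(launch_speed, launch_angle):
    """Return every hit type whose shaded window contains the batted ball."""
    return [
        hit_type
        for hit_type, (s_lo, s_hi, a_lo, a_hi) in LAUNCH_WINDOWS.items()
        if s_lo <= launch_speed <= s_hi and a_lo <= launch_angle <= a_hi
    ]


print(matching_windows(98.0, 14.0))   # inside the single, double, and triple windows
print(matching_windows(105.0, 28.0))  # inside the home run window only
print(matching_windows(70.0, 50.0))   # outside every window
```

The overlap between the single, double, and triple windows is one reason a simple rule-based lookup is not enough and a model that outputs probabilities is a better fit.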

Pitch location (zone) matters alongside swing characteristics. Zones 1–9 lie inside the strike zone, while zones 11–14 lie outside it. In the data, the hardest-hit balls come from zones 2, 5, and 8 (mean launch speeds of 96.4, 98.1, and 97.4 mph), the middle column of the strike zone, which suggests these zones also offer the best chance of success.

Boxplots show the distribution of launch speed and launch angle by zone. The most variation in launch speed occurs for pitches outside the strike zone, while launch angle appears more consistent across all zones.

Zones 2 and 5 have the highest share of doubles and home runs, with zone 8 third. This matches the exit-velocity pattern and supports including zone as a model feature.

These heatmaps show that hitters generate the hardest contact in zones 2, 5, and 8 across all pitch types, with Fastballs generally producing the highest launch speeds. Pitch type was originally expected to have a strong influence on launch speed, but the heatmaps suggest its effect is small compared with pitch location. This informs feature selection.

Modeling

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.88      0.85      4252
           1       0.75      0.66      0.70      2275

    accuracy                           0.80      6527
   macro avg       0.79      0.77      0.78      6527
weighted avg       0.80      0.80      0.80      6527

ROC AUC Score: 0.8605127515945954

Top 10 Important Features:
              feature  importance
8       launch_angle    0.280981
7       launch_speed    0.212495
0    effective_speed    0.086560
1              pfx_x    0.086249
2              pfx_z    0.086008
4            plate_z    0.085883
3            plate_x    0.084925
5               zone    0.032631
9            stand_R    0.010265
6  is_same_hit_pitch    0.009954

The model reaches 80% accuracy and an ROC AUC of 0.86 on the held-out test set of 6,527 balls in play, indicating that it separates hits from non-hits well. A Random Forest combines many decision trees, which lets it learn non-linear relationships between the pitch and swing features and the outcome of a batted ball. Launch angle (importance 0.28) and launch speed (0.21) dominate the predictions. Effective pitch speed, pitch movement (pfx_x, pfx_z), and plate location (plate_x, plate_z) follow at roughly 0.085 to 0.087 each, so contact quality matters most, with pitch speed, movement, and location each contributing a smaller, similar amount.

From a practical standpoint, the model can inform coaches and analysts who evaluate talent or tailor swing mechanics. Knowing which factors most influence successful contact lets player development staff give more targeted feedback. A probabilistic output also suits baseball better than a hard hit/no-hit label, because some well-hit balls still become outs due to defensive positioning; the ROC AUC of 0.86 measures how well the predicted probabilities rank hits above non-hits.
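As a quick consistency check on the first classification report, overall accuracy can be recovered from the per-class recall and support. The sketch below only redoes arithmetic on the rounded figures printed in the report, so the recovered counts are approximate; the variable names are illustrative.

```python
# Per-class recall and support, copied from the first classification report
recall = {0: 0.88, 1: 0.66}
support = {0: 4252, 1: 2275}

# recall = correctly predicted rows of a class / number of rows in that class
correct = {label: recall[label] * support[label] for label in recall}
total = sum(support.values())

accuracy = sum(correct.values()) / total
hit_share = support[1] / total          # share of balls in play that were hits
majority_baseline = support[0] / total  # accuracy of always predicting "no hit"

print(f"approx. correct non-hits: {correct[0]:.0f}")
print(f"approx. correct hits:     {correct[1]:.0f}")
print(f"accuracy:                 {accuracy:.2f}")
print(f"hit share in test set:    {hit_share:.3f}")
print(f"majority-class baseline:  {majority_baseline:.2f}")
```

The recovered accuracy of about 0.80 matches the report. Always predicting "no hit" would already score about 0.65, so the model's gain should be read against that baseline rather than against 50%.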

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.87      0.84      4252
           1       0.72      0.64      0.68      2275

    accuracy                           0.79      6527
   macro avg       0.77      0.76      0.76      6527
weighted avg       0.79      0.79      0.79      6527

ROC AUC Score: 0.8566956984689816

The updated model uses only four inputs (launch angle, launch speed, effective pitch speed, and zone) and reaches 79% accuracy with an ROC AUC of 0.857, slightly below the previous model but still strong. Precision and recall are higher for non-hits (0.82 and 0.87) than for hits (0.72 and 0.64), which reflects both the difficulty of predicting hits and the fact that non-hits outnumber hits in the test set (4,252 to 2,275). Pitch movement (pfx_x, pfx_z) and exact plate location (plate_x, plate_z) are no longer inputs, but the user can still select the zone, which captures the general location of the pitch and approximates its influence on hitting success.

Almost Interactive Slider

The interactive slider (currently not functional in the HTML output) is designed to let users input launch angle, launch speed, effective pitch speed, and zone of the pitch to see the model’s predicted probability of a hit versus a non-hit.
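One way the slider inputs could be wired to the model is sketched below. It assumes a fitted classifier that exposes scikit-learn's `predict_proba` method and was trained on the four features in the order shown. The names `predict_hit_probability`, `FEATURE_ORDER`, and `ToyModel` are hypothetical, and the toy model's rule is invented purely so the example runs without the trained Random Forest.

```python
FEATURE_ORDER = ["launch_angle", "launch_speed", "effective_speed", "zone"]
VALID_ZONES = {1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14}


def predict_hit_probability(model, launch_angle, launch_speed, effective_speed, zone):
    """Return the model's probability of a hit for one batted ball."""
    if zone not in VALID_ZONES:
        raise ValueError(f"zone must be one of {sorted(VALID_ZONES)}")
    if not -90 <= launch_angle <= 90:
        raise ValueError("launch_angle must be between -90 and 90 degrees")

    values = {
        "launch_angle": launch_angle,
        "launch_speed": launch_speed,
        "effective_speed": effective_speed,
        "zone": zone,
    }
    # One row, with columns in the same order the model was trained on
    row = [[values[name] for name in FEATURE_ORDER]]
    # predict_proba returns one [P(no hit), P(hit)] pair per input row
    return float(model.predict_proba(row)[0][1])


class ToyModel:
    """Stand-in for the fitted classifier; the rule below is made up."""

    def predict_proba(self, rows):
        out = []
        for angle, speed, _effective_speed, _zone in rows:
            p_hit = 0.7 if (speed >= 95 and 8 <= angle <= 32) else 0.2
            out.append([1 - p_hit, p_hit])
        return out


p = predict_hit_probability(
    ToyModel(), launch_angle=20, launch_speed=102, effective_speed=94, zone=5
)
print(f"P(hit) = {p:.2f}, P(no hit) = {1 - p:.2f}")
```

Keeping the prediction logic in a plain function like this separates it from the widget layer, so it can be tested on its own before any slider is attached.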

Conclusion

This project shows that batted-ball outcomes can be predicted reasonably well from a few swing and pitch metrics. Using launch angle, launch speed, effective pitch speed, and zone, the Random Forest model reached approximately 79–80% accuracy and an ROC AUC of ~0.86. Contact quality (launch angle and launch speed) carries most of the predictive weight, with pitch speed and location contributing less. The heatmaps and distribution plots show that pitches in the middle of the strike zone (zones 2, 5, and 8) produce the hardest contact, while pitch type has little effect. The interactive prediction tool, once functional, would allow scenario testing with custom inputs. Overall, this work shows the value of combining Statcast data with machine learning to better understand hitting in baseball.

In practice, a model like this can support coaches and analysts in guiding swing adjustments, planning player development, and identifying undervalued talent in recruitment or trades. Because it outputs probabilities rather than hard labels, it acknowledges that even well-hit balls can become outs due to defensive positioning. As teams continue to invest in data-driven decision-making, predictive tools like this can help forecast performance.

---
title: "Sweet Spotting: Predicting Baseball Hitting Success from Swing Science"
subtitle: "INFO 523 - Final Project"
author:
  - name: "Trevor Abshire"
    affiliations:
      - name: "College of Information Science, University of Arizona"
description: "This project uses a Random Forest model to predict whether a batted ball will result in a hit based on key pitch and contact features, including launch angle, launch speed, effective pitch speed, and zone."
format:
  html:
    code-tools: true
    code-overflow: wrap
    embed-resources: true
editor: visual
execute:
  warning: false
  echo: false
jupyter: python3
---

## Introduction

The primary objective of this project was to analyze pitch-by-pitch data from Major League Baseball's Statcast system for the 2024 season and develop a model to predict the probability that a batted ball would result in a hit. The original dataset contained approximately 1 million pitches across the league, so for feasibility, the analysis was restricted to data from the top 100 hitters by average exit velocity.

Although the initial plan was to perform the analysis at the player level, exploration revealed that swing characteristics are largely general across players. As a result, the model was built using aggregated pitch-level data from the top 100 hitters. Pitch type was found to have minimal influence on hit outcomes, whereas effective pitch speed, launch angle, launch speed, and zone were significant predictors. Due to the binary nature of hit outcomes, a Random Forest classifier was used to model the probability of a hit.

To note: Some portions of this analysis and code suggestions were assisted by AI tools including OpenAI's ChatGPT and Google's Gemini. Final content was reviewed and edited by the author.

## Abstract

This project leverages machine learning to predict the likelihood that a batted ball in Major League Baseball will result in a hit, based on pitch-level features such as launch angle, launch speed, effective pitch speed, and zone location. Using a Random Forest classifier trained on data from the top 100 hitters in the 2024 season, the model provides probabilistic predictions for each pitch. An interactive interface allows users to input custom pitch values and instantly see the predicted outcome, providing insights into contact quality and offensive performance.

## Question

How do launch angle, launch speed, pitch zone, and pitch type influence whether a batted ball results in a hit?

## Dataset

The dataset was collected using Python’s `pybaseball` library (Statcast) with credit to GitHub user `stephen1694` for their query of the 2024 season data. The raw dataset contains every pitch and its outcome from March through September 2024.

Player IDs were merged from [MLBAM IDs](https://razzball.com/mlbamids/), and the data was filtered to focus on the top 100 hitters by average exit velocity to reduce computational overhead and remain within GitHub’s 100 MB limit. Finally, only pitches that resulted in balls put into play were retained, as contact quality cannot be assessed for pitches that were not hit.

```{python}
#| label: basic-checks
#| echo: false
#| results: hide
#| message: false
import pandas as pd

df = pd.read_csv("data/pitch_2024_relevant.csv")

# Percentage of missing values per column
missing_percent = df.isnull().mean().sort_values(ascending=False) * 100
```

```{python}
#| label: load-dataset
#| message: false
#| echo: false
import pandas as pd
from IPython.display import display

data = pd.read_csv("data/pitch_2024_relevant.csv")

# Print the shape
print(f"Rows, Columns: {data.shape}\n")

# Display the first 10 rows
display(data.head(10))
```

## Column Definitions

- **batter** – MLB Player ID tied to the play event.\
- **game_year** – Year the game took place.\
- **game_date** – The calendar date of the game (YYYY-MM-DD).\
- **home_team** – Abbreviation of the home team.\
- **stand** – Side of the plate the batter is standing.\
- **p_throws** – Hand the pitcher throws with.\
- **pitch_type** – The type of pitch derived from Statcast.\
- **effective_speed** – Speed adjusted based on the pitcher's release extension.\
- **pfx_x** – Horizontal movement in feet from the catcher's perspective.\
- **pfx_z** – Vertical movement in feet from the catcher's perspective.\
- **plate_x** – Horizontal position of the ball when it crosses home plate.\
- **plate_z** – Vertical position of the ball when it crosses home plate.\
- **zone** – Zone location of the ball when it crosses the plate.\
- **description** – Description of the resulting pitch.\
- **launch_speed** – Exit velocity of the batted ball as tracked by Statcast.
Estimates are included for batted balls not tracked directly [source](http://tangotiger.com/index.php/site/article/statcast-lab-no-nulls-in-batted-balls-launch-parameters).\
- **launch_angle** – Launch angle of the batted ball as tracked by Statcast.\
- **hc_x** – Hit coordinate X of batted ball.\
- **hc_y** – Hit coordinate Y of batted ball.\
- **if_fielding_alignment** – Infield fielding alignment at the time of the pitch.\
- **events** – Event of the resulting plate appearance.

```{python}
import pandas as pd

# Load data
pitch_df = pd.read_csv('data/pitch_2024_relevant.csv')
player_ids_df = pd.read_csv('data/player_ids.csv')

# Prepare player IDs
player_ids_sub = player_ids_df[['MLBAMID', 'Name']].rename(columns={'Name': 'batter_name'})

# Merge on batter and MLBAMID
df_hits = pitch_df.merge(player_ids_sub, left_on='batter', right_on='MLBAMID', how='left')
df_hits = df_hits.drop(columns=['MLBAMID'])
df_hits = df_hits.rename(columns={'player_name': 'pitcher_name'})

# Filter to hitting events only
hitting_events = ['single', 'double', 'triple', 'home_run']
df_hits = df_hits[df_hits['events'].isin(hitting_events)].copy()

# Now define your pitch_type_map
pitch_type_map = {
    'FF': 'Fastball',     # Four-Seam
    'FA': 'Fastball',     # General Fastball
    'FT': 'Fastball',     # Two-Seam
    'SI': 'Fastball',     # Sinker
    'FS': 'Fastball',     # Split Finger
    'SL': 'Slider',       # Slider
    'SV': 'Slider',       # Slurve
    'ST': 'Curveball',    # Sweeper
    'CU': 'Curveball',    # Curveball
    'KC': 'Curveball',    # Knuckle Curve
    'CH': 'Changeup',     # Change Up
    'FO': 'Changeup',     # Fork Ball
    'EP': 'Changeup',     # Eephus
    'FC': 'Fastball',     # Cutter
    'KN': 'Knuckleball',  # Knuckleball
    'SC': 'Other',        # Screwball
    'CS': 'Other',        # Slow Curve
    'PO': 'Other',        # Pitch Out?
    None: 'Unknown',      # or np.nan mapped to 'Unknown'
    'NaN': 'Unknown'      # if string 'NaN'
}

# Map pitch_type to pitch_group column
df_hits['pitch_group'] = df_hits['pitch_type'].map(pitch_type_map).fillna('Unknown')
```

## EDA + Visualization

```{python}
#| label: distribution and scatter
#| message: false
#| echo: false
import seaborn as sns
import matplotlib.pyplot as plt

# Summary stats of launch speed and launch angle by hit type
summary_stats = df_hits.groupby('events')[['launch_speed', 'launch_angle']].describe()
# print(summary_stats)

# Plot distribution of launch speed by hit type
plt.figure(figsize=(10, 6))
sns.histplot(data=df_hits, x='launch_speed', hue='events', bins=30, kde=True, palette='viridis')
plt.title('Launch Speed Distribution by Hit Type')
plt.xlabel('Launch Speed (mph)')
plt.ylabel('Count')
plt.show()

# Scatterplot of launch speed vs launch angle by hit type
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_hits, x='launch_speed', y='launch_angle', hue='events', alpha=0.6, palette='viridis')
plt.title('Launch Speed vs Launch Angle by Hit Type')
plt.xlabel('Launch Speed (mph)')
plt.ylabel('Launch Angle (degrees)')
plt.show()

plt.figure(figsize=(12, 8))

# Scatterplot of launch speed vs launch angle by hit type
sns.scatterplot(
    data=df_hits,
    x='launch_speed',
    y='launch_angle',
    hue='events',
    alpha=0.6,
    palette='viridis',
    edgecolor=None,
    s=30
)

# axvspan takes ymin/ymax as fractions of the axes height, so pin the
# y-limits to [-90, 90] degrees; (angle + 90) / 180 then matches the data
plt.ylim(-90, 90)

# Add shaded boxes for typical launch windows
# Singles
plt.axvspan(80.4, 101.4, ymin=(1 + 90)/180, ymax=(15 + 90)/180,
            color='blue', alpha=0.6, label='Typical Single Window')
# Doubles
plt.axvspan(93.7, 104.7, ymin=(12 + 90)/180, ymax=(22 + 90)/180,
            color='green', alpha=0.6, label='Typical Double Window')
# Triples
plt.axvspan(94.8, 103.6, ymin=(13.75 + 90)/180, ymax=(25 + 90)/180,
            color='orange', alpha=0.6, label='Typical Triple Window')
# Home Runs
plt.axvspan(101.5, 107.2, ymin=(25 + 90)/180, ymax=(32 + 90)/180,
            color='red', alpha=0.6, label='Typical HR Window')

plt.title('Launch Speed vs Launch Angle by Hit Type (2024)')
plt.xlabel('Launch Speed (mph)')
plt.ylabel('Launch Angle (degrees)')
plt.legend(title='Hit Type')
plt.grid(True)
plt.show()

zone_summary = (
    df_hits
    .groupby('zone')[['launch_speed']]
    .agg(['count', 'mean', 'std', 'median'])
    .round(2)
)
zone_summary
```

Using the mean launch angle and launch speed for each hit type (singles, doubles, triples, and home runs), this scatter plot highlights the typical “windows” in which each type of hit occurs. The shaded regions indicate the ranges of launch speeds and angles where each hit type is most likely:

- **Singles:** Blue shaded area (`80.4–101.4 mph` launch speed, `1–15°` launch angle)\
- **Doubles:** Green shaded area (`93.7–104.7 mph` launch speed, `12–22°` launch angle)\
- **Triples:** Orange shaded area (`94.8–103.6 mph` launch speed, `13.75–25°` launch angle)\
- **Home Runs:** Red shaded area (`101.5–107.2 mph` launch speed, `25–32°` launch angle)

These visual ranges are useful for understanding batted ball outcomes and for creating predictive models of hit success.

```{python}
#| label: strike-zone-heatmap
#| message: false
#| warning: false
#| echo: false
#| results: hide
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Summarized data in a DataFrame
zone_data = {
    'zone': [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14],
    'mean_launch_speed': [91.52, 96.43, 92.24, 94.05, 98.12, 94.48, 94.39,
                          97.44, 92.73, 84.30, 85.24, 85.43, 83.39]
}
df_zone = pd.DataFrame(zone_data)

# Strike zone layout mapping (top row of the grid = top of the zone)
zone_grid = pd.DataFrame([
    [11, None, 12],
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [13, None, 14]
])

# Map mean launch speed to the grid
heatmap_values = zone_grid.replace(
    {z: df_zone.set_index('zone')['mean_launch_speed'].get(z) for z in df_zone['zone']}
)

# Plot
plt.figure(figsize=(6, 8))
sns.heatmap(
    heatmap_values.astype(float),
    annot=True,
    fmt=".1f",
    cmap="RdYlGn",
    linewidths=1,
    cbar_kws={'label': 'Mean Launch Speed (mph)'},
    vmin=80,
    vmax=100
)
plt.title('Mean Launch Speed by Statcast Zone', fontsize=14)
plt.ylabel('Pitch Height in Zone')
plt.xlabel('Horizontal Location in Zone')
# sns.heatmap already draws the first grid row at the top, so the y-axis is
# not inverted again here; a second inversion would flip the zone upside down
plt.show()
```

Pitch location (zone) matters alongside swing characteristics. Zones 1–9 lie inside the strike zone, while zones 11–14 lie outside it. In the data, the hardest-hit balls come from zones **2, 5, and 8**, the middle column of the strike zone, which suggests these zones also offer the best chance of success.

```{python}
#| label: boxplot with angles and velo
#| message: false
#| echo: false
#| engine: jupyter
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Boxplot for launch speed by zone
sns.boxplot(data=df_hits, x='zone', y='launch_speed', ax=axes[0], palette='viridis')
axes[0].set_title('Launch Speed by Zone')
axes[0].set_xlabel('Pitch Zone')
axes[0].set_ylabel('Launch Speed (mph)')

# Boxplot for launch angle by zone
sns.boxplot(data=df_hits, x='zone', y='launch_angle', ax=axes[1], palette='magma')
axes[1].set_title('Launch Angle by Zone')
axes[1].set_xlabel('Pitch Zone')
axes[1].set_ylabel('Launch Angle (°)')

plt.tight_layout()
```

Boxplots show the distribution of launch speed and launch angle by zone. The most variation in launch speed occurs for pitches outside the strike zone, while launch angle appears more consistent across all zones.

```{python}
#| label: hit rates by zone
#| message: false
#| echo: false
# Count total hits per zone
total_per_zone = df_hits.groupby('zone').size()

# Count each hit type per zone
hits_per_zone = df_hits.groupby(['zone', 'events']).size().unstack(fill_value=0)

# Convert to percentage of hits in that zone
hit_rates = hits_per_zone.div(total_per_zone, axis=0) * 100

# Order the columns consistently
hit_rates = hit_rates[['single', 'double', 'triple', 'home_run']]
#print(hit_rates.round(2))

# Plot stacked bar chart
hit_rates.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.ylabel("Percentage of Hits")
plt.title("Hit Outcome Rates by Zone")
plt.legend(title="Hit Type")
plt.show()
```

The zones with the highest frequency of doubles and home runs are 2 and 5, with zone 8 as a third. This pattern aligns with exit velocity tendencies and is important for model considerations.

```{python}
#| label: launch speed by pitch group
#| message: false
#| echo: false
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Define strike zone layout (top row of the grid = top of the zone)
zone_grid = pd.DataFrame([
    [11, None, 12],
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [13, None, 14]
])

# List of pitch groups to plot
pitch_groups = ['Fastball', 'Slider', 'Curveball', 'Changeup']

# Set up plot grid
n = len(pitch_groups)
cols = 2
rows = (n + 1) // cols
fig, axes = plt.subplots(rows, cols, figsize=(cols * 6, rows * 8), constrained_layout=True)

for ax, pg in zip(axes.flatten(), pitch_groups):
    # Filter data for the pitch group
    subset = df_hits[df_hits['pitch_group'] == pg]

    # Compute mean launch speed by zone
    df_zone = (
        subset.groupby('zone')['launch_speed']
        .mean()
        .reset_index()
    )

    # Map values to strike zone grid
    heatmap_values = zone_grid.replace(
        {z: df_zone.set_index('zone')['launch_speed'].get(z) for z in df_zone['zone']}
    )

    # Plot heatmap; sns.heatmap already draws the first grid row at the top,
    # so the y-axis is not inverted again
    sns.heatmap(
        heatmap_values.astype(float),
        annot=True,
        fmt=".1f",
        cmap="RdYlGn",
        linewidths=1,
        cbar_kws={'label': 'Mean Launch Speed (mph)'},
        vmin=80,
        vmax=105,
        ax=ax
    )
    ax.set_title(f'Mean Launch Speed - {pg}', fontsize=14)

# Remove any unused subplots
for ax in axes.flatten()[n:]:
    fig.delaxes(ax)

plt.show()
```

These heatmaps show that hitters generate the hardest contact in zones 2, 5, and 8 across all pitch types, with Fastballs generally producing the highest launch speeds. Pitch type was originally expected to have a strong influence on launch speed, but the heatmaps suggest its effect is small compared with pitch location. This informs feature selection.

```{python}
import pandas as pd

# Load data
pitch_df = pd.read_csv('data/pitch_2024_top100_reduced.csv')
player_ids_df = pd.read_csv('data/player_ids.csv')

# Prepare player IDs
player_ids_sub = player_ids_df[['MLBAMID', 'Name']].rename(columns={'Name': 'batter_name'})

# Merge on batter and MLBAMID
df_top100 = pitch_df.merge(player_ids_sub, left_on='batter', right_on='MLBAMID', how='left')
df_top100 = df_top100.drop(columns=['MLBAMID'])
df_top100 = df_top100.rename(columns={'player_name': 'pitcher_name'})

# Filter to hitting events only
hitting_events = ['single', 'double', 'triple', 'home_run']
df_hits = df_top100[df_top100['events'].isin(hitting_events)].copy()

# Now define your pitch_type_map
pitch_type_map = {
    'FF': 'Fastball',     # Four-Seam
    'FA': 'Fastball',     # General Fastball
    'FT': 'Fastball',     # Two-Seam
    'SI': 'Fastball',     # Sinker
    'FS': 'Fastball',     # Split Finger
    'SL': 'Slider',       # Slider
    'SV': 'Slider',       # Slurve
    'ST': 'Curveball',    # Sweeper
    'CU': 'Curveball',    # Curveball
    'KC': 'Curveball',    # Knuckle Curve
    'CH': 'Changeup',     # Change Up
    'FO': 'Changeup',     # Fork Ball
    'EP': 'Changeup',     # Eephus
    'FC': 'Fastball',     # Cutter
    'KN': 'Knuckleball',  # Knuckleball
    'SC': 'Other',        # Screwball
    'CS': 'Other',        # Slow Curve
    'PO': 'Other',        # Pitch Out?
    None: 'Unknown',      # or np.nan mapped to 'Unknown'
    'NaN': 'Unknown'      # if string 'NaN'
}

# Map pitch_type to pitch_group column
df_top100['pitch_group'] = df_top100['pitch_type'].map(pitch_type_map).fillna('Unknown')
```

```{python}
import pandas as pd

# Load data
bip_df = pd.read_csv('data/BIP_2024_top100.csv')
player_ids_df = pd.read_csv('data/player_ids.csv')

# Prepare player IDs
player_ids_sub = player_ids_df[['MLBAMID', 'Name']].rename(columns={'Name': 'batter_name'})

# Merge on batter and MLBAMID
df_bip_top100 = bip_df.merge(player_ids_sub, left_on='batter', right_on='MLBAMID', how='left')
df_bip_top100 = df_bip_top100.drop(columns=['MLBAMID'])

# Filter to hitting events
hitting_events = ['single', 'double', 'triple', 'home_run']
df_hits_top100 = df_bip_top100[df_bip_top100['events'].isin(hitting_events)].copy()

# Map pitch_type to pitch_group column
pitch_type_map = {
    'FF': 'Fastball', 'FA': 'Fastball', 'FT': 'Fastball', 'SI': 'Fastball',
    'FS': 'Fastball', 'SL': 'Slider', 'SV': 'Slider', 'ST': 'Curveball',
    'CU': 'Curveball', 'KC': 'Curveball', 'CH': 'Changeup', 'FO': 'Changeup',
    'EP': 'Changeup', 'FC': 'Fastball', 'KN': 'Knuckleball', 'SC': 'Other',
    'CS': 'Other', 'PO': 'Other', None: 'Unknown', 'NaN': 'Unknown'
}
df_bip_top100['pitch_group'] = df_bip_top100['pitch_type'].map(pitch_type_map).fillna('Unknown')
df_hits_top100['pitch_group'] = df_hits_top100['pitch_type'].map(pitch_type_map).fillna('Unknown')
```

```{python}
# 1 when batter and pitcher share the same handedness, else 0
df_top100["is_same_hit_pitch"] = (df_top100["stand"] == df_top100["p_throws"]).astype(int)

# Indicator columns for each hit type
df_top100["is_single"] = (df_top100["events"] == "single").astype(int)
df_top100["is_double"] = (df_top100["events"] == "double").astype(int)
df_top100["is_triple"] = (df_top100["events"] == "triple").astype(int)
df_top100["is_home_run"] = (df_top100["events"] == "home_run").astype(int)

# One-hot encode the pitch description (desc_foul, desc_hit_into_play, ...)
df_top100 = pd.get_dummies(df_top100, columns=['description'], prefix='desc')
#print(df_top100.columns)
```

```{python}
# Define the columns to keep
columns_to_keep = [
    'pitcher_name', 'stand', 'p_throws', 'effective_speed', 'pfx_x', 'pfx_z',
    'plate_x', 'plate_z', 'zone', 'launch_speed', 'launch_angle', 'hc_x', 'hc_y',
    'events', 'batter_name', 'pitch_group', 'is_same_hit_pitch',
    'is_single', 'is_double', 'is_triple', 'is_home_run',
    'desc_foul', 'desc_hit_into_play', 'desc_swinging_strike',
    'desc_swinging_strike_blocked'
]

# Create refined DataFrame
df_top100_refined = df_top100[columns_to_keep].copy()

# A pitch counts as a hit if any of the four hit-type indicators is 1
df_top100_refined['is_hit'] = df_top100_refined[['is_single', 'is_double', 'is_triple', 'is_home_run']].max(axis=1)

# Preview
#print(df_top100_refined.head())
```

## Modeling

```{python}
#| label: Model + ROC Output
#| message: false
#| echo: false
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# 1. Filter to balls in play (contact quality)
df_bip_top100 = df_top100_refined[df_top100_refined['desc_hit_into_play'] == 1].copy()

# 2. Define features and target
y_hit = df_bip_top100['is_hit']
features = [
    'stand', 'p_throws', 'pitch_group', 'effective_speed',
    'pfx_x', 'pfx_z', 'plate_x', 'plate_z', 'zone',
    'is_same_hit_pitch', 'launch_speed', 'launch_angle'
]
X_hit = df_bip_top100[features]

# 3. One-hot encode categorical features
X_hit_encoded = pd.get_dummies(
    X_hit,
    columns=['stand', 'p_throws', 'pitch_group'],
    drop_first=True
)

# 4. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_hit_encoded, y_hit, test_size=0.2, random_state=42, stratify=y_hit
)

# 5. Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# 6. Make predictions and evaluate
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]

print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_pred_proba))

# 7. Feature importance
feature_importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values(by='importance', ascending=False)

print("\nTop 10 Important Features:\n", feature_importance_df.head(10))

# 8. Visualization of top features
top_features = feature_importance_df.head(10)
plt.figure(figsize=(8, 5))
sns.barplot(
    data=top_features,
    x="importance",
    y="feature",
    palette="viridis"
)
plt.title("Top 10 Important Features", fontsize=14, weight="bold")
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Feature", fontsize=12)
plt.tight_layout()
plt.show()
```

The model reaches 80% accuracy and an ROC AUC of 0.86 on the held-out test set of 6,527 balls in play, indicating that it separates hits from non-hits well. A Random Forest combines many decision trees, which lets it learn non-linear relationships between the pitch and swing features and the outcome of a batted ball. Launch angle (importance 0.28) and launch speed (0.21) dominate the predictions. Effective pitch speed, pitch movement (`pfx_x`, `pfx_z`), and plate location (`plate_x`, `plate_z`) follow at roughly 0.085 to 0.087 each, so contact quality matters most, with pitch speed, movement, and location each contributing a smaller, similar amount.

From a practical standpoint, the model can inform coaches and analysts who evaluate talent or tailor swing mechanics. Knowing which factors most influence successful contact lets player development staff give more targeted feedback. Additionally, baseball outcomes are noisy: some well-hit balls still result in outs due to defensive positioning.
This makes a probabilistic model more applicable than a strictly binary outcome model, as it accounts for the nuanced nature of the game and supports more informed, evidence-based decisions.

```{python}
#| label: New Model
#| message: false
#| echo: false
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# 1. Filter to balls in play (contact quality)
df_bip_top100 = df_top100_refined[df_top100_refined['desc_hit_into_play'] == 1].copy()

# 2. Select features (numeric only)
features = ['launch_angle', 'launch_speed', 'effective_speed', 'zone']
X_hit = df_bip_top100[features]
y_hit = df_bip_top100['is_hit']

# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_hit, y_hit, test_size=0.2, random_state=42, stratify=y_hit
)

# 4. Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# 5. Evaluate
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]

print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_pred_proba))

# 6. Feature importance visualization
feature_importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values(by='importance', ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(
    data=feature_importance_df,
    x="importance",
    y="feature",
    palette="viridis"
)
plt.title("Feature Importances (4 Feature Model)", fontsize=14, weight="bold")
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Feature", fontsize=12)
plt.tight_layout()
plt.show()

# 7. Prediction function
def predict_hit(launch_angle, launch_speed, effective_speed, zone):
    new_pitch = pd.DataFrame([{
        'launch_angle': launch_angle,
        'launch_speed': launch_speed,
        'effective_speed': effective_speed,
        'zone': zone
    }])
    pred_class = rf_model.predict(new_pitch)[0]
    pred_prob = rf_model.predict_proba(new_pitch)[0][1]
    return {
        'predicted_class': int(pred_class),
        'prediction_label': 'Hit' if pred_class == 1 else 'Not Hit',
        'hit_probability': round(pred_prob, 3)
    }
```

The updated model achieves 79% accuracy with an ROC AUC of 0.857, slightly below the previous model but still strong overall. It effectively distinguishes hits from non-hits, though precision and recall are higher for non-hit events, reflecting the challenge of predicting hits. Key features driving the predictions include launch angle, launch speed, and effective pitch speed. While the pitch movement metrics (pfx_x, pfx_z) are no longer used as inputs, the model still takes the zone as an input, which captures the general location of the pitch and helps approximate its influence on hitting success.

## Almost Interactive Slider

```{python}
# 8. Example prediction
example_result = predict_hit(
    launch_angle=26,
    launch_speed=45.3,
    effective_speed=98.2,
    zone=14
)
#print(example_result)

import ipywidgets as widgets
from ipywidgets import interact, FloatSlider, IntSlider
from IPython.display import display, HTML
import pandas as pd

out = widgets.Output()

def predict_hit_ui(launch_angle, launch_speed, effective_speed, zone):
    # Render the prediction inside the shared Output widget
    with out:
        out.clear_output()
        new_pitch = pd.DataFrame([{
            'launch_angle': launch_angle,
            'launch_speed': launch_speed,
            'effective_speed': effective_speed,
            'zone': zone
        }])
        pred_class = rf_model.predict(new_pitch)[0]
        pred_prob = rf_model.predict_proba(new_pitch)[0][1]
        color = 'green' if pred_class == 1 else 'red'
        display(HTML(f"<h3 style='color:{color};'>{'Hit' if pred_class==1 else 'Not Hit'}</h3>"))
        display(HTML(f"<p>Probability: {pred_prob:.2f}</p>"))

interact(
    predict_hit_ui,
    launch_angle=FloatSlider(value=20.0, min=-30, max=50, step=0.5, description='Launch Angle'),
    launch_speed=FloatSlider(value=95.0, min=60, max=120, step=0.5, description='Launch Speed'),
    effective_speed=FloatSlider(value=90.0, min=60, max=110, step=0.5, description='Effective Speed'),
    zone=IntSlider(value=5, min=1, max=9, step=1, description='Zone')
)

display(out)
```

The interactive slider (currently not functional in the HTML output) is designed to let users input the launch angle, launch speed, effective pitch speed, and zone of a pitch and see the model’s predicted probability of a hit versus a non-hit.

## Conclusion

This project demonstrates that batted-ball outcomes can be reasonably predicted using key swing and pitch metrics. By analyzing launch angle, launch speed, effective pitch speed, and zone location, the Random Forest model achieved strong predictive performance with approximately 80% accuracy and an ROC AUC of \~0.86. The analysis highlights that both batter swing mechanics and pitch characteristics, including speed and location, play critical roles in determining whether a ball becomes a hit.
Heatmaps and distribution plots further emphasize that certain zones and pitch types lead to more successful batted balls, providing actionable insights for player development and strategy. The interactive prediction tool demonstrates the potential for personalized scenario testing. Overall, this work shows the value of combining Statcast data with machine learning to better understand the dynamics of hitting in baseball.

In practice, this type of model can support coaches and analysts in guiding swing adjustments, optimizing player development, and identifying undervalued talent during recruitment or trades. Because the model outputs a hit probability rather than only a hard label, it can reflect the fact that even well-hit balls sometimes result in outs due to defensive positioning, and an AUC of \~0.86 suggests those probabilities separate hits from non-hits well. As teams continue to invest heavily in data-driven decision-making, predictive tools like this can offer a competitive edge in forecasting performance and enhancing on-field outcomes.
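
To make the ROC AUC figure above more concrete: an AUC can be read as the probability that a randomly chosen hit receives a higher predicted probability than a randomly chosen non-hit. The sketch below computes that pairwise quantity directly in plain Python. The labels and probabilities are made-up toy values for illustration only (they are not output from this project's model), and `pairwise_auc` is a hypothetical helper name, not a scikit-learn function.

```python
def pairwise_auc(y_true, y_prob):
    """Share of (hit, non-hit) pairs in which the hit gets the higher score.

    Ties count as half a win, which is the usual convention for ROC AUC.
    """
    hit_scores = [p for y, p in zip(y_true, y_prob) if y == 1]
    out_scores = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not hit_scores or not out_scores:
        raise ValueError("need at least one hit and one non-hit")

    wins = 0.0
    for h in hit_scores:
        for o in out_scores:
            if h > o:
                wins += 1.0
            elif h == o:
                wins += 0.5
    return wins / (len(hit_scores) * len(out_scores))


# Toy values for illustration only, NOT predictions from the Random Forest.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_probs = [0.81, 0.35, 0.62, 0.70, 0.12, 0.55, 0.20, 0.48]

toy_auc = pairwise_auc(toy_labels, toy_probs)
print(f"Toy pairwise AUC: {toy_auc:.3f}")
```

In this toy set, 13 of the 15 hit/non-hit pairs are ordered correctly, so the value is about 0.867. The double loop is only meant to expose the definition. On real data, `roc_auc_score` from scikit-learn (as used in the modeling chunks) should give the same number far more efficiently.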