Sweet Spotting: Predicting Baseball Hitting Success from Swing Science
INFO 523 - Final Project
This project uses a Random Forest model to predict whether a batted ball will result in a hit based on key pitch and contact features, including launch angle, launch speed, effective pitch speed, and zone.
Author: Trevor Abshire
Affiliation: College of Information Science, University of Arizona
Introduction
The primary objective of this project was to analyze pitch-by-pitch data from Major League Baseball’s Statcast system for the 2024 season and develop a model to predict the probability that a batted ball would result in a hit. The original dataset contained approximately 1 million pitches across the league, so for feasibility, the analysis was restricted to data from the top 100 hitters by average exit velocity.
Although the initial plan was to perform the analysis at the player level, exploration revealed that swing characteristics generalize well across players. As a result, the model was built on pooled pitch-level data from the top 100 hitters. Pitch type was found to have minimal influence on hit outcomes, whereas effective pitch speed, launch angle, launch speed, and zone were significant predictors. Because the outcome is binary (hit or no hit), a Random Forest classifier was used to model the probability of a hit.
Note: Some portions of this analysis and code suggestions were assisted by AI tools, including OpenAI’s ChatGPT and Google’s Gemini. Final content was reviewed and edited by the author.
Abstract
This project leverages machine learning to predict the likelihood that a batted ball in Major League Baseball will result in a hit, based on pitch-level features such as launch angle, launch speed, effective pitch speed, and zone location. Using a Random Forest classifier trained on data from the top 100 hitters in the 2024 season, the model provides probabilistic predictions for each pitch. An interactive interface allows users to input custom pitch values and instantly see the predicted outcome, providing insights into contact quality and offensive performance.
Question
How do launch angle, launch speed, pitch zone, and pitch type influence whether a batted ball results in a hit?
Dataset
The dataset was collected using Python’s pybaseball library (Statcast) with credit to GitHub user stephen1694 for their query of the 2024 season data. The raw dataset contains every pitch and its outcome from March through September 2024.
Player IDs were merged from MLBAM IDs, and the data was filtered to focus on the top 100 hitters by average exit velocity to reduce computational overhead and remain within GitHub’s 100 MB limit. Finally, only pitches that resulted in balls put into play were retained, as contact quality cannot be assessed for pitches that were not hit.
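The filtering described above can be sketched in pandas as follows. This is a minimal illustration, not the author's actual preprocessing: the helper name `top_hitters_balls_in_play` and the toy values are made up, and it assumes average exit velocity is the per-batter mean of `launch_speed` and that balls in play carry the `description` value `hit_into_play`.

```python
import pandas as pd

def top_hitters_balls_in_play(pitches: pd.DataFrame, n: int = 100) -> pd.DataFrame:
    """Keep balls in play from the n batters with the highest mean exit velocity."""
    # Mean launch_speed per batter; rows with NaN launch_speed are skipped by mean()
    avg_ev = pitches.groupby("batter")["launch_speed"].mean()
    top_ids = avg_ev.nlargest(n).index
    # Assumption: balls in play are flagged by this description value
    in_play = pitches["description"] == "hit_into_play"
    return pitches[pitches["batter"].isin(top_ids) & in_play].copy()

# Tiny illustrative frame (made-up values, not Statcast data)
toy = pd.DataFrame({
    "batter": [1, 1, 2, 2, 3, 3],
    "description": ["hit_into_play", "foul", "hit_into_play", "ball",
                    "hit_into_play", "hit_into_play"],
    "launch_speed": [101.0, 88.0, 85.0, None, 95.0, 99.0],
})
subset = top_hitters_balls_in_play(toy, n=2)
print(subset)
```

Ranking batters before dropping non-contact rows means fouls with a tracked exit velocity still count toward each batter's average; ranking on balls in play only would be an equally defensible choice.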
Rows, Columns: (827825, 20)
|   | Unnamed: 0 | batter | player_name | game_date | stand | p_throws | pitch_type | effective_speed | pfx_x | pfx_z | plate_x | plate_z | zone | description | launch_speed | launch_angle | hc_x | hc_y | if_fielding_alignment | events |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | FF | 97.0 | -0.57 | 1.33 | 0.08 | 3.23 | 2.0 | swinging_strike | NaN | NaN | NaN | NaN | Standard | strikeout |
| 1 | 1 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | SI | 96.0 | -1.18 | 0.97 | -0.04 | 1.74 | 8.0 | foul | 88.8 | -43.0 | NaN | NaN | Standard | NaN |
| 2 | 2 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | SL | 89.8 | 0.36 | -0.11 | 1.72 | 0.27 | 14.0 | blocked_ball | NaN | NaN | NaN | NaN | Standard | NaN |
| 3 | 3 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | SL | 88.2 | 0.40 | -0.31 | 0.27 | 1.74 | 8.0 | foul | 47.7 | -38.0 | NaN | NaN | Standard | NaN |
| 4 | 4 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | SI | 96.4 | -1.20 | 0.93 | 0.16 | 1.02 | 14.0 | ball | NaN | NaN | NaN | NaN | Standard | NaN |
| 5 | 5 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | SI | 95.3 | -1.23 | 0.92 | -0.03 | 1.64 | 8.0 | foul | 92.6 | -37.0 | NaN | NaN | Standard | NaN |
| 6 | 6 | 686668 | Ginkel, Kevin | 2024-03-31 | R | R | FF | 97.1 | -0.86 | 0.94 | 0.38 | 1.22 | 14.0 | ball | NaN | NaN | NaN | NaN | Standard | NaN |
| 7 | 7 | 669911 | Ginkel, Kevin | 2024-03-31 | L | R | FF | 97.1 | -0.88 | 1.34 | 0.03 | 4.00 | 12.0 | swinging_strike | NaN | NaN | NaN | NaN | Strategic | strikeout |
| 8 | 8 | 669911 | Ginkel, Kevin | 2024-03-31 | L | R | SL | 88.7 | 0.35 | -0.20 | 0.66 | 1.45 | 14.0 | foul | 68.9 | -22.0 | NaN | NaN | Standard | NaN |
| 9 | 9 | 669911 | Ginkel, Kevin | 2024-03-31 | L | R | FF | 96.9 | -0.91 | 1.37 | -0.55 | 3.28 | 1.0 | foul | 78.6 | 31.0 | NaN | NaN | Standard | NaN |
Column Definitions
batter – MLB Player ID tied to the play event.
game_year – Year the game took place.
game_date – The calendar date of the game (YYYY-MM-DD).
home_team – Abbreviation of the home team.
stand – Side of the plate the batter is standing.
p_throws – Hand the pitcher throws with.
pitch_type – The type of pitch derived from Statcast.
effective_speed – Speed adjusted based on the pitcher’s release extension.
pfx_x – Horizontal movement in feet from the catcher’s perspective.
pfx_z – Vertical movement in feet from the catcher’s perspective.
plate_x – Horizontal position of the ball when it crosses home plate.
plate_z – Vertical position of the ball when it crosses home plate.
zone – Zone location of the ball when it crosses the plate.
description – Description of the resulting pitch.
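launch_speed – Exit velocity of the batted ball as tracked by Statcast. Estimates are included for batted balls not tracked directly (source: http://tangotiger.com/index.php/site/article/statcast-lab-no-nulls-in-batted-balls-launch-parameters).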
launch_angle – Launch angle of the batted ball as tracked by Statcast.
hc_x – Hit coordinate X of batted ball.
hc_y – Hit coordinate Y of batted ball.
if_fielding_alignment – Infield fielding alignment at the time of the pitch.
events – Event of the resulting plate appearance.
EDA + Visualization
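launch_speed (mph) summary by zone:

| zone | count | mean | std | median |
|---|---|---|---|---|
| 1.0 | 2131 | 91.52 | 14.63 | 96.0 |
| 2.0 | 3310 | 96.43 | 12.65 | 99.8 |
| 3.0 | 2023 | 92.24 | 15.22 | 97.4 |
| 4.0 | 5170 | 94.05 | 13.83 | 98.0 |
| 5.0 | 8054 | 98.12 | 11.71 | 101.2 |
| 6.0 | 5085 | 94.48 | 13.76 | 98.8 |
| 7.0 | 3368 | 94.39 | 13.18 | 97.8 |
| 8.0 | 5885 | 97.44 | 12.14 | 100.8 |
| 9.0 | 3747 | 92.73 | 13.88 | 96.5 |
| 11.0 | 1371 | 84.30 | 17.11 | 88.1 |
| 12.0 | 1354 | 85.24 | 17.17 | 89.4 |
| 13.0 | 2255 | 85.43 | 15.41 | 86.9 |
| 14.0 | 2799 | 83.39 | 15.66 | 83.8 |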
Using the mean launch angle and launch speed for different hit types—singles, doubles, triples, and home runs—this scatter plot highlights the typical “windows” in which each type of hit occurs. The shaded regions indicate the ranges of launch speeds and angles where each hit type is most likely:
Singles: Blue shaded area (80.4–101.4 mph launch speed, 1–15° launch angle)
Doubles: Green shaded area (93.7–104.7 mph launch speed, 12–22° launch angle)
Triples: Orange shaded area (94.8–103.6 mph launch speed, 13.75–25° launch angle)
Home Runs: Red shaded area (101.5–107.2 mph launch speed, 25–32° launch angle)
These visual ranges are useful for understanding batted ball outcomes and for creating predictive models of hit success.
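As a quick illustration of how these windows could be used programmatically, the sketch below checks which windows contain a given batted ball. The bounds are the ones used for the shaded regions in the plotting code (including the triples window); the names `LAUNCH_WINDOWS` and `matching_windows` are hypothetical helpers, not part of the project code. Because the windows overlap, a ball can match more than one hit type.

```python
# Typical windows per hit type: (min mph, max mph, min degrees, max degrees)
LAUNCH_WINDOWS = {
    "single": (80.4, 101.4, 1.0, 15.0),
    "double": (93.7, 104.7, 12.0, 22.0),
    "triple": (94.8, 103.6, 13.75, 25.0),
    "home_run": (101.5, 107.2, 25.0, 32.0),
}

def matching_windows(launch_speed, launch_angle):
    """Return the hit types whose typical window contains this batted ball."""
    return [
        hit_type
        for hit_type, (v_lo, v_hi, a_lo, a_hi) in LAUNCH_WINDOWS.items()
        if v_lo <= launch_speed <= v_hi and a_lo <= launch_angle <= a_hi
    ]

print(matching_windows(98.0, 14.0))   # overlapping region
print(matching_windows(105.0, 28.0))  # hard, elevated contact
print(matching_windows(70.0, 50.0))   # weak pop-up, outside every window
```

The overlap between windows is one reason a rule-based lookup like this is only descriptive, and why a learned classifier is used for the actual prediction.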
In conjunction with swing characteristics, it is also important to consider pitch location (zone). Zones 1–9 correspond to strikes, while zones 11–14 are outside of the strike zone. From the data, we can see that the hardest hits, and the highest likelihood of success, occur in zones 2, 5, and 8.
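To make the in-zone versus out-of-zone contrast concrete, the per-zone counts and means from the summary table can be pooled into two count-weighted averages. This is a small sketch; `ZONE_STATS`, `in_strike_zone`, and `weighted_mean_speed` are illustrative names introduced here.

```python
# (count, mean launch speed in mph) per zone, copied from the summary table
ZONE_STATS = {
    1: (2131, 91.52), 2: (3310, 96.43), 3: (2023, 92.24),
    4: (5170, 94.05), 5: (8054, 98.12), 6: (5085, 94.48),
    7: (3368, 94.39), 8: (5885, 97.44), 9: (3747, 92.73),
    11: (1371, 84.30), 12: (1354, 85.24), 13: (2255, 85.43), 14: (2799, 83.39),
}

def in_strike_zone(zone):
    """Zones 1-9 are inside the strike zone; 11-14 are outside it."""
    return 1 <= zone <= 9

def weighted_mean_speed(zones):
    """Count-weighted mean launch speed over the given zones."""
    total = sum(ZONE_STATS[z][0] for z in zones)
    return sum(ZONE_STATS[z][0] * ZONE_STATS[z][1] for z in zones) / total

inside = weighted_mean_speed([z for z in ZONE_STATS if in_strike_zone(z)])
outside = weighted_mean_speed([z for z in ZONE_STATS if not in_strike_zone(z)])
print(f"in zone: {inside:.2f} mph, out of zone: {outside:.2f} mph")
```

Weighting by count matters because zone 5 contributes far more batted balls than the corner zones, so a plain average of the per-zone means would understate the middle of the plate.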
[Figure: boxplots of launch speed by zone (left) and launch angle by zone (right)]
Boxplots show the distribution of launch speed and launch angle by zone. The most variation in launch speed occurs for pitches outside the strike zone, while launch angle appears more consistent across all zones.
[Figure: stacked bar chart of hit outcome rates by zone]
The zones with the highest frequency of doubles and home runs are 2 and 5, with zone 8 third. This pattern aligns with exit velocity tendencies and is important for model considerations.
[Figure: heatmaps of mean launch speed by zone for each pitch group (Fastball, Slider, Curveball, Changeup)]
These heatmaps show that hitters generate the hardest contact in zones 2, 5, and 8 across all pitch types, with fastballs generally producing the highest launch speeds. Pitch type was originally expected to have a strong influence on launch speed, but the zone pattern is much the same for every pitch group. This informed feature selection.
Modeling
The model achieves 80% accuracy with an ROC AUC of 0.86, indicating strong ability to distinguish hits from non-hits. The Random Forest uses an ensemble of decision trees to learn complex, non-linear relationships between the pitch and swing features and the outcome of a batted ball. The top features driving the prediction are launch angle, launch speed, and effective pitch speed, followed by pitch movement and plate location metrics. Both the batter’s swing characteristics and the pitch’s movement and location therefore matter in determining batted-ball success.
From a practical standpoint, this model provides valuable insights for coaches and analysts looking to recruit talent or tailor swing mechanics. By identifying the key factors that influence successful contact, player development staff can offer more targeted feedback to optimize individual performance. A probabilistic output is also better suited to baseball than a hard hit/no-hit label, because some well-hit balls still result in outs due to defensive positioning. The ROC AUC of 0.86 measures how well the predicted probabilities rank hits above non-hits, which supports more informed, evidence-based decisions.
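To illustrate why ROC AUC and accuracy answer different questions, the sketch below computes AUC directly from its ranking definition: the fraction of (hit, non-hit) pairs in which the hit receives the higher predicted probability. The function `pairwise_auc` and the five toy predictions are illustrative only and are unrelated to the model's actual outputs.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the share of (hit, non-hit) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one hit and one non-hit")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = hit, 0 = non-hit, with made-up predicted hit probabilities
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]

auc = pairwise_auc(labels, scores)
# Accuracy depends on a threshold (0.5 here); AUC does not
accuracy = sum((s >= 0.5) == bool(y) for y, s in zip(labels, scores)) / len(labels)
print(f"AUC = {auc:.3f}, accuracy at 0.5 = {accuracy:.2f}")
```

This quadratic-time version is only for intuition; on real data, `sklearn.metrics.roc_auc_score` (already used in the project) computes the same quantity efficiently.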
[Output: classification report and feature-importance plot for the four-feature model]
The updated model achieves 79% accuracy with an ROC AUC of 0.857, slightly below the previous model but still strong overall. It distinguishes hits from non-hits effectively, though precision and recall are higher for non-hit events, reflecting the difficulty of predicting hits. The features driving the predictions are launch angle, launch speed, and effective pitch speed. Although pitch movement metrics (pfx_x, pfx_z) are no longer inputs, zone is retained as a feature, which captures the general location of the pitch and approximates its influence on hitting success.
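The per-class comparison of precision and recall can be reproduced by treating each class in turn as the "positive" one. The sketch below uses ten made-up labels to show the pattern described above (better scores for non-hits than for hits); `precision_recall` is a hypothetical helper and the numbers are not the model's results.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating `positive` as the target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: 1 = hit, 0 = non-hit (non-hits are the majority class)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

hit_precision, hit_recall = precision_recall(y_true, y_pred, positive=1)
out_precision, out_recall = precision_recall(y_true, y_pred, positive=0)
print(f"hits:     precision={hit_precision:.2f}, recall={hit_recall:.2f}")
print(f"non-hits: precision={out_precision:.2f}, recall={out_recall:.2f}")
```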
Almost Interactive Slider
The interactive slider (currently not functional in the HTML output) is designed to let users input launch angle, launch speed, effective pitch speed, and zone of the pitch to see the model’s predicted probability of a hit versus a non-hit.
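Because widgets do not render in static HTML, one possible fallback is to tabulate predictions over a grid of inputs. The sketch below shows the idea under stated assumptions: `sweep` accepts any callable that takes the four inputs and returns a hit probability, and `toy_predict` is a hypothetical stand-in scorer used only so the example runs. It is not the trained Random Forest.

```python
from itertools import product

def sweep(predict, launch_angles, launch_speeds, effective_speed, zone):
    """Evaluate predict() over every (angle, speed) pair, holding the pitch fixed."""
    return [
        {"launch_angle": a, "launch_speed": s,
         "hit_probability": predict(a, s, effective_speed, zone)}
        for a, s in product(launch_angles, launch_speeds)
    ]

def toy_predict(launch_angle, launch_speed, effective_speed, zone):
    """Hypothetical stand-in scorer, NOT the trained model; ignores pitch inputs."""
    angle_term = max(0.0, 1.0 - abs(launch_angle - 15.0) / 45.0)
    speed_term = min(1.0, max(0.0, (launch_speed - 60.0) / 50.0))
    return round(angle_term * speed_term, 3)

rows = sweep(toy_predict, launch_angles=[0, 15, 30], launch_speeds=[85, 100],
             effective_speed=94.0, zone=5)
for row in rows:
    print(row)
```

In the project itself, a thin wrapper around the fitted model's `predict_proba` would take the place of `toy_predict`.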
Conclusion
This project demonstrates that batted-ball outcomes can be reasonably predicted using key swing and pitch metrics. Using launch angle, launch speed, effective pitch speed, and zone location, the Random Forest model reached approximately 80% accuracy and an ROC AUC of ~0.86. The analysis highlights that both batter swing mechanics and pitch characteristics, including speed and location, play critical roles in determining whether a ball becomes a hit. Heatmaps and distribution plots further show that certain zones (2, 5, and 8) produce more successful batted balls, while pitch type has comparatively little effect, providing actionable insights for player development and strategy. The interactive prediction tool, once functional, demonstrates the potential for personalized scenario testing. Overall, this work shows the value of combining Statcast data with machine learning to better understand the dynamics of hitting in baseball.
In practice, this type of model can directly support coaches and analysts in guiding swing adjustments, optimizing player development, and identifying undervalued talent during recruitment or trades. With an AUC of ~0.86, the model ranks hits above non-hits well, and because it outputs probabilities rather than hard labels, it accommodates the fact that even well-hit balls can result in outs due to defensive positioning. As teams continue to invest heavily in data-driven decision-making, predictive tools like this offer a competitive edge in forecasting performance and enhancing on-field outcomes.
Source Code
---title: "Sweet Spotting: Predicting Baseball Hitting Success from Swing Science"subtitle: "INFO 523 - Final Project"author: - name: "Trevor Abshire" affiliations: - name: "College of Information Science, University of Arizona"description: "This project uses a Random Forest model to predict whether a batted ball will result in a hit based on key pitch and contact features, including launch angle, launch speed, effective pitch speed, and zone."format: html: code-tools: true code-overflow: wrap embed-resources: trueeditor: visualexecute: warning: false echo: falsejupyter: python3---## IntroductionThe primary objective of this project was to analyze pitch-by-pitch data from Major League Baseball's Statcast system for the 2024 season and develop a model to predict the probability that a batted ball would result in a hit. The original dataset contained approximately 1 million pitches across the league, so for feasibility, the analysis was restricted to data from the top 100 hitters by average exit velocity.Although the initial plan was to perform the analysis at the player level, exploration revealed that swing characteristics are largely general across players. As a result, the model was built using aggregated pitch-level data from the top 100 hitters. Pitch type was found to have minimal influence on hit outcomes, whereas effective pitch speed, launch angle, launch speed, and zone were significant predictors. Due to the binary nature of hit outcomes, a Random Forest classifier was used to model the probability of a hit.To note: Some portions of this analysis and code suggestions were assisted by AI tools including OpenAI's ChatGPT and Google's Gemini. Final content was reviewed and edited by the author.## AbstractThis project leverages machine learning to predict the likelihood that a batted ball in Major League Baseball will result in a hit, based on pitch-level features such as launch angle, launch speed, effective pitch speed, and zone location. 
Using a Random Forest classifier trained on data from the top 100 hitters in the 2024 season, the model provides probabilistic predictions for each pitch. An interactive interface allows users to input custom pitch values and instantly see the predicted outcome, providing insights into contact quality and offensive performance.## QuestionHow do launch angle, launch speed, pitch zone, and pitch type influence whether a batted ball results in a hit?## DatasetThe dataset was collected using Python’s `pybaseball` library (Statcast) with credit to GitHub user `stephen1694` for their query of the 2024 season data. The raw dataset contains every pitch and its outcome from March through September 2024.Player IDs were merged from [MLBAM IDs](https://razzball.com/mlbamids/), and the data was filtered to focus on the top 100 hitters by average exit velocity to reduce computational overhead and remain within GitHub’s 100 MB limit. Finally, only pitches that resulted in balls put into play were retained, as contact quality cannot be assessed for pitches that were not hit.```{python}#| label: basic-checks#| echo: false#| results: hide#| message: falseimport pandas as pddf = pd.read_csv("data/pitch_2024_relevant.csv")# Percentage of missing values per columnmissing_percent = df.isnull().mean().sort_values(ascending=False) *100``````{python}#| label: load-dataset#| message: false#| echo: falseimport pandas as pdfrom IPython.display import displaydata = pd.read_csv("data/pitch_2024_relevant.csv")# Print the shapeprint(f"Rows, Columns: {data.shape}\n")# Display the first 10 rowsdisplay(data.head(10))```## Column Definitions- **batter** – MLB Player ID tied to the play event.\- **game_year** – Year the game took place.\- **game_date** – The calendar date of the game (YYYY-MM-DD).\- **home_team** – Abbreviation of the home team.\- **stand** – Side of the plate the batter is standing.\- **p_throws** – Hand the pitcher throws with.\- **pitch_type** – The type of pitch derived from 
Statcast.\- **effective_speed** – Speed adjusted based on the pitcher's release extension.\- **pfx_x** – Horizontal movement in feet from the catcher's perspective.\- **pfx_z** – Vertical movement in feet from the catcher's perspective.\- **plate_x** – Horizontal position of the ball when it crosses home plate.\- **plate_z** – Vertical position of the ball when it crosses home plate.\- **zone** – Zone location of the ball when it crosses the plate.\- **description** – Description of the resulting pitch.\- **launch_speed** – Exit velocity of the batted ball as tracked by Statcast. Estimates are included for batted balls not tracked directly [source](http://tangotiger.com/index.php/site/article/statcast-lab-no-nulls-in-batted-balls-launch-parameters).\- **launch_angle** – Launch angle of the batted ball as tracked by Statcast.\- **hc_x** – Hit coordinate X of batted ball.\- **hc_y** – Hit coordinate Y of batted ball.\- **if_fielding_alignment** – Infield fielding alignment at the time of the pitch.\- **events** – Event of the resulting plate appearance.```{python}import pandas as pd# Load datapitch_df = pd.read_csv('data/pitch_2024_relevant.csv')player_ids_df = pd.read_csv('data/player_ids.csv')# Prepare player IDsplayer_ids_sub = player_ids_df[['MLBAMID', 'Name']].rename(columns={'Name': 'batter_name'})# Merge on batter and MLBAMIDdf_hits = pitch_df.merge(player_ids_sub, left_on='batter', right_on='MLBAMID', how='left')df_hits = df_hits.drop(columns=['MLBAMID'])df_hits = df_hits.rename(columns={'player_name': 'pitcher_name'})# Filter to hitting events onlyhitting_events = ['single', 'double', 'triple', 'home_run']df_hits = df_hits[df_hits['events'].isin(hitting_events)].copy()# Now define your pitch_type_mappitch_type_map = {'FF': 'Fastball', # Four-Seam'FA': 'Fastball', # General Fastball'FT': 'Fastball', # Two-Seam'SI': 'Fastball', # Sinker'FS': 'Fastball', # Split Finger'SL': 'Slider', # Slider'SV': 'Slider', # Slurve'ST': 'Curveball', # Sweeper'CU': 'Curveball', 
# Curveball'KC': 'Curveball', # Knuckle Curve'CH': 'Changeup', # Change Up'FO': 'Changeup', # Fork Ball'EP': 'Changeup', # Eephus'FC': 'Fastball', # Cutter'KN': 'Knuckleball', # Knuckleball'SC': 'Other', # Screwball'CS': 'Other', # Slow Curve'PO': 'Other', # Pitch Out?None: 'Unknown', # or np.nan mapped to 'Unknown''NaN': 'Unknown'# if string 'NaN'}# Map pitch_type to pitch_group columndf_hits['pitch_group'] = df_hits['pitch_type'].map(pitch_type_map).fillna('Unknown')```## EDA + Visualization```{python}#| label: distribution and scatter#| message: false#| echo: falseimport seaborn as snsimport matplotlib.pyplot as plt# Summary stats of launch speed and launch angle by hit typesummary_stats = df_hits.groupby('events')[['launch_speed', 'launch_angle']].describe()# print(summary_stats)# Plot distribution of launch speed by hit typeplt.figure(figsize=(10, 6))sns.histplot(data=df_hits, x='launch_speed', hue='events', bins=30, kde=True, palette='viridis')plt.title('Launch Speed Distribution by Hit Type')plt.xlabel('Launch Speed (mph)')plt.ylabel('Count')plt.show()# Scatterplot of launch speed vs launch angle by hit typeplt.figure(figsize=(10, 6))sns.scatterplot(data=df_hits, x='launch_speed', y='launch_angle', hue='events', alpha=0.6, palette='viridis')plt.title('Launch Speed vs Launch Angle by Hit Type')plt.xlabel('Launch Speed (mph)')plt.ylabel('Launch Angle (degrees)')plt.show()import seaborn as snsimport matplotlib.pyplot as pltplt.figure(figsize=(12,8))# Scatterplot of launch speed vs launch angle by hit typesns.scatterplot( data=df_hits, x='launch_speed', y='launch_angle', hue='events', alpha=0.6, palette='viridis', edgecolor=None, s=30)# Add shaded boxes for typical launch windows# Singlesplt.axvspan(80.4, 101.4, ymin=(1+90)/180, ymax=(15+90)/180, color='blue', alpha=0.6, label='Typical Single Window')# Doublesplt.axvspan(93.7, 104.7, ymin=(12+90)/180, ymax=(22+90)/180, color='green', alpha=0.6, label='Typical Double Window')# Triplesplt.axvspan(94.8, 103.6, 
ymin=(13.75+90)/180, ymax=(25+90)/180, color='orange', alpha=0.6, label='Typical Triple Window')# Home Runsplt.axvspan(101.5, 107.2, ymin=(25+90)/180, ymax=(32+90)/180, color='red', alpha=0.6, label='Typical HR Window')plt.title('Launch Speed vs Launch Angle by Hit Type (2024)')plt.xlabel('Launch Speed (mph)')plt.ylabel('Launch Angle (degrees)')plt.legend(title='Hit Type')plt.grid(True)plt.show()zone_summary = ( df_hits .groupby('zone')[['launch_speed']] .agg(['count', 'mean', 'std', 'median']) .round(2))zone_summary```Using the mean launch angle and launch speed for different hit types—singles, doubles, triples, and home runs—this scatter plot highlights the typical “windows” in which each type of hit occurs. The shaded regions indicate the ranges of launch speeds and angles where each hit type is most likely:- **Singles:** Blue shaded area (`80.4–101.4 mph` launch speed, `1–15°` launch angle)\- **Doubles:** Green shaded area (`93.7–104.7 mph` launch speed, `12–22°` launch angle)\- **Triples:** Orange shaded area (`94.8–103.6 mph` launch speed, `13.75–25°` launch angle)\- **Home Runs:** Red shaded area (`101.5–107.2 mph` launch speed, `25–32°` launch angle)These visual ranges are useful for understanding batted ball outcomes and for creating predictive models of hit success.```{python}#| label: strike-zone-heatmap#| message: false#| warning: false#| echo: false#| results: hideimport matplotlib.pyplot as pltimport seaborn as snsimport pandas as pd# Summarized data in a DataFramezone_data = {'zone': [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14],'mean_launch_speed': [91.52, 96.43, 92.24, 94.05, 98.12, 94.48, 94.39, 97.44, 92.73, 84.30, 85.24, 85.43, 83.39]}df_zone = pd.DataFrame(zone_data)# Strike zone layout mappingzone_grid = pd.DataFrame([ [11, None, 12], [1, 2, 3], [4, 5, 6], [7, 8, 9], [13, None, 14] ])# Map mean launch speed to the gridheatmap_values = zone_grid.replace( {z: df_zone.set_index('zone')['mean_launch_speed'].get(z) for z in df_zone['zone']})# 
Plotplt.figure(figsize=(6,8))sns.heatmap( heatmap_values.astype(float), annot=True, fmt=".1f", cmap="RdYlGn", linewidths=1, cbar_kws={'label': 'Mean Launch Speed (mph)'}, vmin=80, vmax=100)plt.title('Mean Launch Speed by Statcast Zone', fontsize=14)plt.ylabel('Pitch Height in Zone')plt.xlabel('Horizontal Location in Zone')plt.gca().invert_yaxis()plt.show()```In conjunction with swing characteristics, it is also important to consider pitch location (zone). Zones 1–9 correspond to strikes, while zones 11–14 are outside of the strike zone. From the data, we can see that the hardest hits, and the highest likelihood of success, occur in zones **2, 5, and 8**.I was not able to remove these output lines for the life of me, apologies!```{python}#| label: boxplot with angles and velo#| message: false#| echo: false#| engine: jupyterimport seaborn as snsimport matplotlib.pyplot as pltfig, axes = plt.subplots(1, 2, figsize=(14, 6))# Boxplot for launch speed by zonesns.boxplot(data=df_hits, x='zone', y='launch_speed', ax=axes[0], palette='viridis')axes[0].set_title('Launch Speed by Zone')axes[0].set_xlabel('Pitch Zone')axes[0].set_ylabel('Launch Speed (mph)')# Boxplot for launch angle by zonesns.boxplot(data=df_hits, x='zone', y='launch_angle', ax=axes[1], palette='magma')axes[1].set_title('Launch Angle by Zone')axes[1].set_xlabel('Pitch Zone')axes[1].set_ylabel('Launch Angle (°)')plt.tight_layout()```Boxplots show the distribution of launch speed and launch angle by zone. 
The most variation in launch speed occurs for pitches outside the strike zone, while launch angle appears more consistent across all zones.```{python}#| label: hit rates by zone#| message: false#| echo: false# Count total hits per zonetotal_per_zone = df_hits.groupby('zone').size()# Count each hit type per zonehits_per_zone = df_hits.groupby(['zone', 'events']).size().unstack(fill_value=0)# Convert to percentage of hits in that zonehit_rates = hits_per_zone.div(total_per_zone, axis=0) *100# Order the columns consistentlyhit_rates = hit_rates[['single', 'double', 'triple', 'home_run']]#print(hit_rates.round(2))# Plot stacked bar charthit_rates.plot(kind='bar', stacked=True, figsize=(10,6))plt.ylabel("Percentage of Hits")plt.title("Hit Outcome Rates by Zone")plt.legend(title="Hit Type")plt.show()```The zones with the highest frequency of doubles and home runs are 2 and 5, with zone 8 as a third. This pattern aligns with exit velocity tendencies and is important for model considerations.```{python}#| label: launch speed by pitch group#| message: false#| echo: falseimport matplotlib.pyplot as pltimport seaborn as snsimport pandas as pd# Define strike zone layoutzone_grid = pd.DataFrame([ [11, None, 12], [1, 2, 3], [4, 5, 6], [7, 8, 9], [13, None, 14]])# List of pitch groups to plotpitch_groups = ['Fastball', 'Slider', 'Curveball', 'Changeup']# Set up plot gridn =len(pitch_groups)cols =2rows = (n +1) // colsfig, axes = plt.subplots(rows, cols, figsize=(cols *6, rows *8), constrained_layout=True)for ax, pg inzip(axes.flatten(), pitch_groups):# Filter data for the pitch group subset = df_hits[df_hits['pitch_group'] == pg]# Compute mean launch speed by zone df_zone = ( subset.groupby('zone')['launch_speed'] .mean() .reset_index() )# Map values to strike zone grid heatmap_values = zone_grid.replace( {z: df_zone.set_index('zone')['launch_speed'].get(z) for z in df_zone['zone']} )# Plot heatmap sns.heatmap( heatmap_values.astype(float), annot=True, fmt=".1f", cmap="RdYlGn", 
linewidths=1, cbar_kws={'label': 'Mean Launch Speed (mph)'}, vmin=80, vmax=105, ax=ax ) ax.set_title(f'Mean Launch Speed - {pg}', fontsize=14) ax.invert_yaxis()# Remove any unused subplotsfor ax in axes.flatten()[n:]: fig.delaxes(ax)plt.show()```These heatmaps show that hitters generate the hardest contact in zones 2, 5, and 9 across all pitch types, with Fastballs generally producing the highest launch speeds. Originally, the pitch type was expected to influence launch speed, but this does not seem to be the case. This helps inform feature selection.```{python}import pandas as pd# Load datapitch_df = pd.read_csv('data/pitch_2024_top100_reduced.csv')player_ids_df = pd.read_csv('data/player_ids.csv')# Prepare player IDsplayer_ids_sub = player_ids_df[['MLBAMID', 'Name']].rename(columns={'Name': 'batter_name'})# Merge on batter and MLBAMIDdf_top100 = pitch_df.merge(player_ids_sub, left_on='batter', right_on='MLBAMID', how='left')df_top100 = df_top100.drop(columns=['MLBAMID'])df_top100 = df_top100.rename(columns={'player_name': 'pitcher_name'})# Filter to hitting events onlyhitting_events = ['single', 'double', 'triple', 'home_run']df_hits = df_top100[df_top100['events'].isin(hitting_events)].copy()# Now define your pitch_type_mappitch_type_map = {'FF': 'Fastball', # Four-Seam'FA': 'Fastball', # General Fastball'FT': 'Fastball', # Two-Seam'SI': 'Fastball', # Sinker'FS': 'Fastball', # Split Finger'SL': 'Slider', # Slider'SV': 'Slider', # Slurve'ST': 'Curveball', # Sweeper'CU': 'Curveball', # Curveball'KC': 'Curveball', # Knuckle Curve'CH': 'Changeup', # Change Up'FO': 'Changeup', # Fork Ball'EP': 'Changeup', # Eephus'FC': 'Fastball', # Cutter'KN': 'Knuckleball', # Knuckleball'SC': 'Other', # Screwball'CS': 'Other', # Slow Curve'PO': 'Other', # Pitch Out?None: 'Unknown', # or np.nan mapped to 'Unknown''NaN': 'Unknown'# if string 'NaN'}# Map pitch_type to pitch_group columndf_top100['pitch_group'] = 
df_top100['pitch_type'].map(pitch_type_map).fillna('Unknown')``````{python}import pandas as pd# Load databip_df = pd.read_csv('data/BIP_2024_top100.csv')player_ids_df = pd.read_csv('data/player_ids.csv')# Prepare player IDsplayer_ids_sub = player_ids_df[['MLBAMID', 'Name']].rename(columns={'Name': 'batter_name'})# Merge on batter and MLBAMIDdf_bip_top100 = bip_df.merge(player_ids_sub, left_on='batter', right_on='MLBAMID', how='left')df_bip_top100 = df_bip_top100.drop(columns=['MLBAMID'])# Filter to hitting eventshitting_events = ['single', 'double', 'triple', 'home_run']df_hits_top100 = df_bip_top100[df_bip_top100['events'].isin(hitting_events)].copy()# Map pitch_type to pitch_group columnpitch_type_map = {'FF': 'Fastball', 'FA': 'Fastball', 'FT': 'Fastball', 'SI': 'Fastball', 'FS': 'Fastball','SL': 'Slider', 'SV': 'Slider', 'ST': 'Curveball', 'CU': 'Curveball', 'KC': 'Curveball','CH': 'Changeup', 'FO': 'Changeup', 'EP': 'Changeup', 'FC': 'Fastball', 'KN': 'Knuckleball','SC': 'Other', 'CS': 'Other', 'PO': 'Other', None: 'Unknown', 'NaN': 'Unknown'}df_bip_top100['pitch_group'] = df_bip_top100['pitch_type'].map(pitch_type_map).fillna('Unknown')df_hits_top100['pitch_group'] = df_hits_top100['pitch_type'].map(pitch_type_map).fillna('Unknown')``````{python}df_top100["is_same_hit_pitch"] = (df_top100["stand"] == df_top100["p_throws"]).astype(int)df_top100["is_single"] = (df_top100["events"] =="single").astype(int)df_top100["is_double"] = (df_top100["events"] =="double").astype(int)df_top100["is_triple"] = (df_top100["events"] =="triple").astype(int)df_top100["is_home_run"] = (df_top100["events"] =="home_run").astype(int)df_top100 = pd.get_dummies(df_top100, columns=['description'], prefix='desc')#print(df_top100.columns)``````{python}# Define the columns to keepcolumns_to_keep = ['pitcher_name', 'stand', 'p_throws', 'effective_speed','pfx_x', 'pfx_z', 'plate_x', 'plate_z', 'zone', 'launch_speed','launch_angle', 'hc_x', 'hc_y', 'events', 'batter_name', 
'pitch_group','is_same_hit_pitch', 'is_single', 'is_double', 'is_triple', 'is_home_run','desc_foul', 'desc_hit_into_play', 'desc_swinging_strike', 'desc_swinging_strike_blocked']# Create refined DataFramedf_top100_refined = df_top100[columns_to_keep].copy()df_top100_refined['is_hit'] = df_top100_refined[['is_single','is_double','is_triple','is_home_run']].max(axis=1)# Preview#print(df_top100_refined.head())```## Modeling```{python}#| label: Model + ROC Output#| message: false#| echo: falseimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as snsfrom sklearn.model_selection import train_test_splitfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.metrics import classification_report, roc_auc_score# 1. Filter to balls in play (contact quality)df_bip_top100 = df_top100_refined[df_top100_refined['desc_hit_into_play'] ==1].copy()# 2. Define features and targety_hit = df_bip_top100['is_hit']features = ['stand', 'p_throws', 'pitch_group','effective_speed', 'pfx_x', 'pfx_z','plate_x', 'plate_z', 'zone','is_same_hit_pitch', 'launch_speed', 'launch_angle']X_hit = df_bip_top100[features]# 3. One-hot encode categorical featuresX_hit_encoded = pd.get_dummies( X_hit, columns=['stand', 'p_throws', 'pitch_group'], drop_first=True)# 4. Train/test splitX_train, X_test, y_train, y_test = train_test_split( X_hit_encoded, y_hit, test_size=0.2, random_state=42, stratify=y_hit)# 5. Train Random Forest modelrf_model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)rf_model.fit(X_train, y_train)# 6. Make predictions and evaluatey_pred = rf_model.predict(X_test)y_pred_proba = rf_model.predict_proba(X_test)[:, 1]print("Classification Report:\n", classification_report(y_test, y_pred))print("ROC AUC Score:", roc_auc_score(y_test, y_pred_proba))# 7. 
Feature importancefeature_importance_df = pd.DataFrame({'feature': X_train.columns,'importance': rf_model.feature_importances_}).sort_values(by='importance', ascending=False)print("\nTop 10 Important Features:\n", feature_importance_df.head(10))# 8. Visualization of top featurestop_features = feature_importance_df.head(10)plt.figure(figsize=(8, 5))sns.barplot( data=top_features, x="importance", y="feature", palette="viridis")plt.title("Top 10 Important Features", fontsize=14, weight="bold")plt.xlabel("Importance", fontsize=12)plt.ylabel("Feature", fontsize=12)plt.tight_layout()plt.show()```The model achieves 80% accuracy with an ROC AUC of 0.86, indicating strong predictive ability for distinguishing hits from non-hits. In this scenario, the Random Forest model uses an ensemble of decision trees to learn complex, non-linear relationships between the pitch and swing features and the outcome of a batted ball. The top features driving this prediction are launch angle, launch speed, and effective pitch speed, followed by pitch movement and plate location metrics, highlighting that both the batter’s swing characteristics and the pitch’s movement/location are critical factors in determining batted-ball success.From a practical standpoint, this model provides valuable insights for coaches and analysts looking to recruit talent or tailor swing mechanics. By identifying the key factors that influence successful contact, player development staff can offer more targeted feedback to optimize individual performance. Additionally, the ROC AUC of 0.86 is particularly applicable in the context of baseball, where some well-hit balls still result in outs due to defensive positioning. 
A probabilistic model is therefore more applicable here than a binary outcome model, as it reflects the nuanced nature of the game and supports more informed, evidence-based decisions.

```{python}
#| label: New Model
#| message: false
#| echo: false
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# 1. Filter to balls in play (contact quality)
df_bip_top100 = df_top100_refined[df_top100_refined['desc_hit_into_play'] == 1].copy()

# 2. Select features (numeric only)
features = ['launch_angle', 'launch_speed', 'effective_speed', 'zone']
X_hit = df_bip_top100[features]
y_hit = df_bip_top100['is_hit']

# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_hit, y_hit, test_size=0.2, random_state=42, stratify=y_hit
)

# 4. Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# 5. Evaluate
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]

print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_pred_proba))

# 6. Feature importance visualization
feature_importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values(by='importance', ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(
    data=feature_importance_df,
    x="importance",
    y="feature",
    palette="viridis"
)
plt.title("Feature Importances (4 Feature Model)", fontsize=14, weight="bold")
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Feature", fontsize=12)
plt.tight_layout()
plt.show()

# 7. Prediction function
def predict_hit(launch_angle, launch_speed, effective_speed, zone):
    # Column names and order must match the training features
    new_pitch = pd.DataFrame([{
        'launch_angle': launch_angle,
        'launch_speed': launch_speed,
        'effective_speed': effective_speed,
        'zone': zone
    }])

    pred_class = rf_model.predict(new_pitch)[0]
    pred_prob = rf_model.predict_proba(new_pitch)[0][1]

    return {
        'predicted_class': int(pred_class),
        'prediction_label': 'Hit' if pred_class == 1 else 'Not Hit',
        'hit_probability': round(float(pred_prob), 3)
    }
```

The updated model achieves 79% accuracy with an ROC AUC of 0.857, slightly below the previous model but still strong overall. It effectively distinguishes hits from non-hits, though precision and recall are higher for non-hit events, reflecting the challenge of predicting hits. Key features driving the predictions include launch angle, launch speed, and effective pitch speed. While the pitch movement metrics (`pfx_x`, `pfx_z`) are not used as inputs, the model still takes the zone, which captures the general location of the pitch and helps approximate its influence on hitting success.

## Almost Interactive Slider

```{python}
# 8. Example prediction
example_result = predict_hit(
    launch_angle=26,
    launch_speed=45.3,
    effective_speed=98.2,
    zone=14
)
#print(example_result)

import ipywidgets as widgets
from ipywidgets import interact, FloatSlider, IntSlider
from IPython.display import display, HTML
import pandas as pd

out = widgets.Output()

def predict_hit_ui(launch_angle, launch_speed, effective_speed, zone):
    # Redraw the prediction inside the output widget on every slider change
    with out:
        out.clear_output()
        new_pitch = pd.DataFrame([{
            'launch_angle': launch_angle,
            'launch_speed': launch_speed,
            'effective_speed': effective_speed,
            'zone': zone
        }])
        pred_class = rf_model.predict(new_pitch)[0]
        pred_prob = rf_model.predict_proba(new_pitch)[0][1]

        color = 'green' if pred_class == 1 else 'red'
        display(HTML(f"<h3 style='color:{color};'>{'Hit' if pred_class == 1 else 'Not Hit'}</h3>"))
        display(HTML(f"<p>Probability: {pred_prob:.2f}</p>"))

interact(
    predict_hit_ui,
    launch_angle=FloatSlider(value=20.0, min=-30, max=50, step=0.5, description='Launch Angle'),
    launch_speed=FloatSlider(value=95.0, min=60, max=120, step=0.5, description='Launch Speed'),
    effective_speed=FloatSlider(value=90.0, min=60, max=110, step=0.5, description='Effective Speed'),
    zone=IntSlider(value=5, min=1, max=9, step=1, description='Zone')
)

display(out)
```

The interactive slider is designed to let users input the launch angle, launch speed, effective pitch speed, and zone of a pitch and see the model’s predicted probability of a hit versus a non-hit. It is currently not functional in the HTML output, since the widget callbacks need a running Python kernel that a static page does not have.

## Conclusion

This project demonstrates that batted-ball outcomes can be reasonably predicted using key swing and pitch metrics. By analyzing launch angle, launch speed, effective pitch speed, and zone location, the Random Forest model achieved strong predictive performance with approximately 80% accuracy and an ROC AUC of \~0.86. The analysis highlights that both batter swing mechanics and pitch characteristics, including speed and location, play critical roles in determining whether a ball becomes a hit.
Heatmaps and distribution plots further emphasize that certain zones and pitch types lead to more successful batted balls, providing actionable insights for player development and strategy. The interactive prediction tool demonstrates the potential for personalized scenario testing. Overall, this work shows the value of combining Statcast data with machine learning to better understand the dynamics of hitting in baseball.

In practice, this type of model can directly support coaches and analysts in guiding swing adjustments, optimizing player development, and identifying undervalued talent during recruitment or trades. With an AUC of \~0.86, the model ranks batted balls well, and its probabilistic output suits a game in which even well-hit balls can result in outs due to defensive positioning. As teams continue to invest heavily in data-driven decision-making, predictive tools like this offer a competitive edge in forecasting performance and enhancing on-field outcomes.
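
The accuracy and ROC AUC figures quoted in this report are easier to interpret next to a trivial baseline that always predicts the majority class. The sketch below is illustrative only: it runs on synthetic data from `make_classification` (the one-third positive rate is an assumption, not a figure from the Statcast data), and the Brier score and the `summarize` helper are additions that are not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balls-in-play table (assumption: about one
# third of rows are positives; the real class balance may differ).
X, y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.67, 0.33], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline that always predicts the most frequent class ("not hit")
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

def summarize(clf, X_eval, y_eval):
    """Hypothetical helper: accuracy, ROC AUC, and Brier score for one classifier."""
    proba = clf.predict_proba(X_eval)[:, 1]
    return {
        "accuracy": accuracy_score(y_eval, clf.predict(X_eval)),
        "roc_auc": roc_auc_score(y_eval, proba),
        "brier": brier_score_loss(y_eval, proba),  # lower is better
    }

baseline_scores = summarize(baseline, X_test, y_test)
model_scores = summarize(model, X_test, y_test)
print("Baseline:     ", baseline_scores)
print("Random Forest:", model_scores)
```

On imbalanced data the baseline's accuracy can look respectable while its ROC AUC stays at 0.5, which is why the report's AUC is the more informative of the two headline numbers. To apply this to the real model, the same helper could be called with `rf_model`, `X_test`, and `y_test` from the modeling cells.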