Sweet Spotting: Predicting Baseball Hitting Success from Swing Science

INFO 523 - Final Project

This project uses a Random Forest model to predict whether a batted ball will result in a hit based on key pitch and contact features, including launch angle, launch speed, effective pitch speed, and zone.

Author: Trevor Abshire
Affiliation: College of Information Science, University of Arizona

Introduction

The primary objective of this project was to analyze pitch-by-pitch data from Major League Baseball’s Statcast system for the 2024 season and develop a model to predict the probability that a batted ball would result in a hit. The original dataset contained approximately 1 million pitches across the league, so for feasibility, the analysis was restricted to data from the top 100 hitters by average exit velocity.

Although the initial plan was to perform the analysis at the player level, exploration revealed that swing characteristics largely generalize across players. As a result, the model was built on pooled pitch-level data from the top 100 hitters. Pitch type was found to have minimal influence on hit outcomes, whereas launch angle, launch speed, effective pitch speed, and zone were more informative predictors. Because the outcome is binary (hit or no hit), a Random Forest classifier was used to model the probability of a hit.

Note: Some portions of this analysis and code suggestions were assisted by AI tools, including OpenAI’s ChatGPT and Google’s Gemini. Final content was reviewed and edited by the author.

Abstract

This project uses machine learning to predict the likelihood that a batted ball in Major League Baseball will result in a hit, based on pitch-level features such as launch angle, launch speed, effective pitch speed, and zone location. Using a Random Forest classifier trained on data from the top 100 hitters in the 2024 season, the model produces a predicted hit probability for each batted ball. An interactive interface is designed to let users input custom values and see the predicted outcome, offering insight into contact quality and offensive performance.

Question

How do launch angle, launch speed, pitch zone, and pitch type influence whether a batted ball results in a hit?

Dataset

The dataset was collected using Python’s pybaseball library (Statcast) with credit to GitHub user stephen1694 for their query of the 2024 season data. The raw dataset contains every pitch and its outcome from March through September 2024.

Player IDs were merged from MLBAM IDs, and the data was filtered to focus on the top 100 hitters by average exit velocity to reduce computational overhead and remain within GitHub’s 100 MB limit. Finally, only pitches that resulted in balls put into play were retained, as contact quality cannot be assessed for pitches that were not hit.
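
The filtering steps described above could be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual code: the function name `top_hitters_in_play` is hypothetical, and it assumes that balls put into play carry the Statcast `description` value `hit_into_play` and that average exit velocity is computed from balls in play only.

```python
import pandas as pd


def top_hitters_in_play(pitches: pd.DataFrame, n_hitters: int = 100) -> pd.DataFrame:
    """Keep balls in play from the n hitters with the highest mean exit velocity."""
    # Assumption: balls put into play are marked "hit_into_play" in `description`.
    in_play = pitches[pitches["description"] == "hit_into_play"]
    # Mean exit velocity (launch_speed) per batter, highest first.
    mean_exit_velo = in_play.groupby("batter")["launch_speed"].mean()
    top_ids = mean_exit_velo.nlargest(n_hitters).index
    return in_play[in_play["batter"].isin(top_ids)]


# Toy data with three hypothetical batter IDs (not real Statcast rows).
toy = pd.DataFrame(
    {
        "batter": [1, 1, 2, 2, 3, 3],
        "description": [
            "hit_into_play", "foul", "hit_into_play",
            "hit_into_play", "hit_into_play", "ball",
        ],
        "launch_speed": [100.0, 70.0, 90.0, 94.0, 85.0, None],
    }
)
result = top_hitters_in_play(toy, n_hitters=2)
print(result)
```

With `n_hitters=2`, the toy example keeps the three balls in play hit by batters 1 and 2 and drops batter 3, whose mean exit velocity is lowest.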

Rows, Columns: (827825, 20)

   Unnamed: 0  batter  player_name    game_date   stand  p_throws  pitch_type  effective_speed  pfx_x  pfx_z  plate_x  plate_z  zone  description      launch_speed  launch_angle  hc_x  hc_y  if_fielding_alignment  events
0  0           686668  Ginkel, Kevin  2024-03-31  R      R         FF          97.0             -0.57  1.33   0.08     3.23     2.0   swinging_strike  NaN           NaN           NaN   NaN   Standard               strikeout
1  1           686668  Ginkel, Kevin  2024-03-31  R      R         SI          96.0             -1.18  0.97   -0.04    1.74     8.0   foul             88.8          -43.0         NaN   NaN   Standard               NaN
2  2           686668  Ginkel, Kevin  2024-03-31  R      R         SL          89.8             0.36   -0.11  1.72     0.27     14.0  blocked_ball     NaN           NaN           NaN   NaN   Standard               NaN
3  3           686668  Ginkel, Kevin  2024-03-31  R      R         SL          88.2             0.40   -0.31  0.27     1.74     8.0   foul             47.7          -38.0         NaN   NaN   Standard               NaN
4  4           686668  Ginkel, Kevin  2024-03-31  R      R         SI          96.4             -1.20  0.93   0.16     1.02     14.0  ball             NaN           NaN           NaN   NaN   Standard               NaN
5  5           686668  Ginkel, Kevin  2024-03-31  R      R         SI          95.3             -1.23  0.92   -0.03    1.64     8.0   foul             92.6          -37.0         NaN   NaN   Standard               NaN
6  6           686668  Ginkel, Kevin  2024-03-31  R      R         FF          97.1             -0.86  0.94   0.38     1.22     14.0  ball             NaN           NaN           NaN   NaN   Standard               NaN
7  7           669911  Ginkel, Kevin  2024-03-31  L      R         FF          97.1             -0.88  1.34   0.03     4.00     12.0  swinging_strike  NaN           NaN           NaN   NaN   Strategic              strikeout
8  8           669911  Ginkel, Kevin  2024-03-31  L      R         SL          88.7             0.35   -0.20  0.66     1.45     14.0  foul             68.9          -22.0         NaN   NaN   Standard               NaN
9  9           669911  Ginkel, Kevin  2024-03-31  L      R         FF          96.9             -0.91  1.37   -0.55    3.28     1.0   foul             78.6          31.0          NaN   NaN   Standard               NaN

Column Definitions

  • batter – MLB Player ID tied to the play event.
  • game_year – Year the game took place.
  • game_date – The calendar date of the game (YYYY-MM-DD).
  • home_team – Abbreviation of the home team.
  • stand – Side of the plate from which the batter hits (L or R).
  • p_throws – Hand the pitcher throws with.
  • pitch_type – The type of pitch derived from Statcast.
  • effective_speed – Speed adjusted based on the pitcher’s release extension.
  • pfx_x – Horizontal movement in feet from the catcher’s perspective.
  • pfx_z – Vertical movement in feet from the catcher’s perspective.
  • plate_x – Horizontal position of the ball when it crosses home plate.
  • plate_z – Vertical position of the ball when it crosses home plate.
  • zone – Zone location of the ball when it crosses the plate.
  • description – Description of the pitch result.
  • launch_speed – Exit velocity of the batted ball as tracked by Statcast. Estimates are included for batted balls not tracked directly.
  • launch_angle – Launch angle of the batted ball as tracked by Statcast.
  • hc_x – Hit coordinate X of batted ball.
  • hc_y – Hit coordinate Y of batted ball.
  • if_fielding_alignment – Infield fielding alignment at the time of the pitch.
  • events – Event of the resulting plate appearance.

EDA + Visualization

Launch speed (mph) by zone

zone  count   mean    std  median
 1.0   2131  91.52  14.63    96.0
 2.0   3310  96.43  12.65    99.8
 3.0   2023  92.24  15.22    97.4
 4.0   5170  94.05  13.83    98.0
 5.0   8054  98.12  11.71   101.2
 6.0   5085  94.48  13.76    98.8
 7.0   3368  94.39  13.18    97.8
 8.0   5885  97.44  12.14   100.8
 9.0   3747  92.73  13.88    96.5
11.0   1371  84.30  17.11    88.1
12.0   1354  85.24  17.17    89.4
13.0   2255  85.43  15.41    86.9
14.0   2799  83.39  15.66    83.8
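
A per-zone summary like the table above can be produced with a pandas group-by. The sketch below uses a four-row toy frame rather than the project data, and `launch_speed_by_zone` is a hypothetical helper name.

```python
import pandas as pd


def launch_speed_by_zone(balls_in_play: pd.DataFrame) -> pd.DataFrame:
    """Count, mean, standard deviation, and median of launch_speed for each zone."""
    summary = balls_in_play.groupby("zone")["launch_speed"].agg(
        ["count", "mean", "std", "median"]
    )
    return summary.round(2)


# Toy values only: two balls in zone 5 (middle-middle) and two in zone 14.
toy = pd.DataFrame(
    {
        "zone": [5.0, 5.0, 14.0, 14.0],
        "launch_speed": [100.0, 104.0, 80.0, 90.0],
    }
)
zone_summary = launch_speed_by_zone(toy)
print(zone_summary)
```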

Using the mean launch angle and launch speed for different hit types—singles, doubles, triples, and home runs—this scatter plot highlights the typical “windows” in which each type of hit occurs. The shaded regions indicate the ranges of launch speeds and angles where each hit type is most likely:

  • Singles: Blue shaded area (80.4–101.4 mph launch speed, 1–15° launch angle)
  • Doubles: Green shaded area (93.7–104.7 mph launch speed, 12–22° launch angle)
  • Triples: Orange shaded area (94.8–103.6 mph launch speed, 13.75–25° launch angle)
  • Home Runs: Red shaded area (101.5–107.2 mph launch speed, 25–32° launch angle)

These visual ranges are useful for understanding batted ball outcomes and for creating predictive models of hit success.
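
One way to make these windows concrete is a small lookup that reports which shaded regions a given batted ball falls into. The sketch below simply encodes the four ranges listed above; the names `Window`, `WINDOWS`, and `matching_windows` are illustrative, and because the windows overlap, a single ball can match more than one hit type.

```python
from typing import List, NamedTuple


class Window(NamedTuple):
    hit_type: str
    speed_min: float  # mph
    speed_max: float  # mph
    angle_min: float  # degrees
    angle_max: float  # degrees


# Ranges copied from the shaded regions described in the text.
WINDOWS = [
    Window("single", 80.4, 101.4, 1.0, 15.0),
    Window("double", 93.7, 104.7, 12.0, 22.0),
    Window("triple", 94.8, 103.6, 13.75, 25.0),
    Window("home_run", 101.5, 107.2, 25.0, 32.0),
]


def matching_windows(launch_speed: float, launch_angle: float) -> List[str]:
    """Return every hit type whose window contains the given speed and angle."""
    return [
        w.hit_type
        for w in WINDOWS
        if w.speed_min <= launch_speed <= w.speed_max
        and w.angle_min <= launch_angle <= w.angle_max
    ]


print(matching_windows(98.0, 14.0))   # falls in the overlap of three windows
print(matching_windows(105.0, 28.0))  # hard-hit and elevated
print(matching_windows(70.0, 50.0))   # weak pop-up, outside every window
```

The overlap between the single, double, and triple windows is one reason a probabilistic model is more informative than fixed thresholds.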

In conjunction with swing characteristics, it is also important to consider pitch location (zone). Zones 1–9 correspond to strikes, while zones 11–14 are outside of the strike zone. From the data, we can see that the hardest hits, and the highest likelihood of success, occur in zones 2, 5, and 8.

[Figure: boxplots of launch speed and launch angle by zone]

Boxplots show the distribution of launch speed and launch angle by zone. The most variation in launch speed occurs for pitches outside the strike zone, while launch angle appears more consistent across all zones.

[Figure: hit rates by zone]

Doubles and home runs occur most often in zones 2 and 5, followed by zone 8. This pattern matches the exit velocity tendencies by zone and informs which features the model should include.

[Figure: heatmaps of launch speed by zone and pitch group]

These heatmaps show that hitters generate the hardest contact in zones 2, 5, and 8 across all pitch types, with fastballs generally producing the highest launch speeds. Pitch type was originally expected to have a strong influence on launch speed, but its effect appears to be minimal. This helps inform feature selection.

Modeling

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.88      0.85      4252
           1       0.75      0.66      0.70      2275

    accuracy                           0.80      6527
   macro avg       0.79      0.77      0.78      6527
weighted avg       0.80      0.80      0.80      6527

ROC AUC Score: 0.8605127515945954

Top 10 Important Features:
              feature  importance
8       launch_angle    0.280981
7       launch_speed    0.212495
0    effective_speed    0.086560
1              pfx_x    0.086249
2              pfx_z    0.086008
4            plate_z    0.085883
3            plate_x    0.084925
5               zone    0.032631
9            stand_R    0.010265
6  is_same_hit_pitch    0.009954

The model achieves 80% accuracy with an ROC AUC of 0.86, indicating a strong ability to distinguish hits from non-hits. The Random Forest uses an ensemble of decision trees to learn complex, non-linear relationships between the pitch and swing features and the outcome of a batted ball. Launch angle (importance 0.28) and launch speed (0.21) dominate, while effective pitch speed, pitch movement, and plate location each contribute roughly 0.085. Both the batter’s swing characteristics and the pitch’s movement and location therefore play a role in batted-ball success, with contact quality carrying the most weight.
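
The general workflow (fit a Random Forest, score hit probabilities, inspect feature importances) might look like the sketch below. It is an illustration under stated assumptions rather than the project's code: the data are synthetic, the labelling rule is invented for the example, only four of the features are used for brevity, and the hyperparameters (`n_estimators=200`, an 80/20 split) are guesses. It will not reproduce the metrics reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

FEATURES = ["launch_angle", "launch_speed", "effective_speed", "zone"]

# Synthetic stand-in for the batted-ball data (NOT Statcast values).
rng = np.random.default_rng(0)
n = 2000
launch_angle = rng.uniform(-40, 60, n)
launch_speed = rng.uniform(60, 115, n)
effective_speed = rng.uniform(75, 100, n)
zone = rng.choice([1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14], n)
X = np.column_stack([launch_angle, launch_speed, effective_speed, zone])

# Hypothetical labelling rule: hard contact at line-drive angles is a hit,
# with 5% of labels flipped to mimic well-hit outs and lucky bloops.
rule = (launch_speed > 90) & (launch_angle > 0) & (launch_angle < 35)
flip = rng.random(n) < 0.05
y = np.where(flip, ~rule, rule).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Column 1 of predict_proba is P(hit) because the classes are sorted as [0, 1].
hit_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, hit_prob)
importances = dict(zip(FEATURES, model.feature_importances_))

print(f"ROC AUC on synthetic data: {auc:.3f}")
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name:>16}: {value:.3f}")
```

Scoring with `predict_proba` rather than `predict` is what makes an ROC AUC meaningful here, since the AUC measures how well the predicted probabilities rank hits above non-hits.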

From a practical standpoint, this model offers useful insights for coaches and analysts looking to recruit talent or tailor swing mechanics. By identifying the key factors that influence successful contact, player development staff can give more targeted feedback to improve individual performance. A probabilistic output is also well suited to baseball, where some well-hit balls still result in outs because of defensive positioning. Rather than forcing a hard hit/no-hit label, the model reports how likely a given batted ball is to become a hit, and the ROC AUC of 0.86 indicates that these probabilities rank hits above non-hits reliably. This supports more informed, evidence-based decisions.

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.87      0.84      4252
           1       0.72      0.64      0.68      2275

    accuracy                           0.79      6527
   macro avg       0.77      0.76      0.76      6527
weighted avg       0.79      0.79      0.79      6527

ROC AUC Score: 0.8566956984689816

The updated model achieves 79% accuracy with an ROC AUC of 0.857, slightly below the previous model but still strong overall. It distinguishes hits from non-hits effectively, though precision and recall are higher for non-hits (0.82 and 0.87) than for hits (0.72 and 0.64), reflecting the difficulty of predicting hits. Key features driving the predictions include launch angle, launch speed, and effective pitch speed. Pitch movement metrics (pfx_x, pfx_z) were not used as inputs, but the user can still select the zone, which captures the general location of the pitch and approximates its influence on hitting success.
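
For readers less familiar with the classification report, the per-class precision, recall, and F1 score follow directly from counts of true positives, false positives, and false negatives. The sketch below uses hypothetical counts (64 correctly predicted hits, 25 false alarms, 36 missed hits) chosen only so that the rounded values resemble the hit-class row above; they are not the model's actual confusion matrix.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp)  # of the predicted hits, how many were hits
    recall = tp / (tp + fn)     # of the actual hits, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the "hit" class (illustrative only).
precision, recall, f1 = precision_recall_f1(tp=64, fp=25, fn=36)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A recall below precision for hits means the model misses more real hits than it raises false alarms, which fits the observation that hits are the harder class to predict.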

Almost Interactive Slider

The interactive slider (currently not functional in the HTML output) is designed to let users input launch angle, launch speed, effective pitch speed, and zone of the pitch to see the model’s predicted probability of a hit versus a non-hit.
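
The function that such a slider would call might look like the sketch below. It is a hypothetical helper, not the project's implementation: `predict_hit_probability` is an invented name, the model is trained on synthetic data with a made-up labelling rule, and the feature order (launch angle, launch speed, effective speed, zone) is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def predict_hit_probability(model, launch_angle, launch_speed, effective_speed, zone):
    """Return the model's estimated P(hit) for one set of user-chosen inputs."""
    # One row, with columns in the same order used when the model was fitted.
    row = np.array([[launch_angle, launch_speed, effective_speed, zone]], dtype=float)
    # Look up which predict_proba column corresponds to the label 1 (hit).
    hit_column = list(model.classes_).index(1)
    return float(model.predict_proba(row)[0, hit_column])


# Toy model on synthetic data (NOT Statcast values) so the helper can be run.
rng = np.random.default_rng(1)
n = 1000
X = np.column_stack(
    [
        rng.uniform(-40, 60, n),   # launch_angle (degrees)
        rng.uniform(60, 115, n),   # launch_speed (mph)
        rng.uniform(75, 100, n),   # effective_speed (mph)
        rng.choice([1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14], n),  # zone
    ]
)
# Hypothetical rule: hard contact at line-drive angles counts as a hit.
y = ((X[:, 1] > 90) & (X[:, 0] > 0) & (X[:, 0] < 35)).astype(int)
toy_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

line_drive = predict_hit_probability(toy_model, 20, 105, 95, 5)
weak_grounder = predict_hit_probability(toy_model, -30, 65, 95, 14)
print(f"line drive: {line_drive:.2f}, weak grounder: {weak_grounder:.2f}")
```

The probability of a non-hit is simply one minus the returned value, so a single call is enough to display both outcomes.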

Conclusion

This project demonstrates that batted-ball outcomes can be reasonably predicted using key swing and pitch metrics. By analyzing launch angle, launch speed, effective pitch speed, and zone location, the Random Forest model achieved strong predictive performance, with approximately 80% accuracy and an ROC AUC of about 0.86. The analysis highlights that both batter swing mechanics and pitch characteristics, including speed and location, help determine whether a ball becomes a hit. Heatmaps and distribution plots further show that certain zones, particularly 2, 5, and 8, produce harder contact and more successful batted balls, while pitch type has comparatively little effect. The interactive prediction tool, once functional, would allow personalized scenario testing. Overall, this work shows the value of combining Statcast data with machine learning to better understand the dynamics of hitting in baseball.

In practice, this type of model can support coaches and analysts in guiding swing adjustments, optimizing player development, and identifying undervalued talent during recruitment or trades. Because it outputs probabilities rather than hard labels, the model accommodates the fact that even well-hit balls can result in outs due to defensive positioning, and its ROC AUC of about 0.86 suggests those probabilities are informative. As teams continue to invest heavily in data-driven decision-making, predictive tools like this could offer a competitive edge in forecasting performance and improving on-field outcomes.