Whistle Bias?

Investigating Referee Influence on WNBA Home Game Outcomes Using Data Mining

This project aims to investigate potential officiating bias in WNBA games by analyzing referee crew assignments, foul distributions, and game outcomes. The primary objective is to determine whether certain referee combinations disproportionately favor home teams or exhibit consistent patterns of foul disparities.
Author: Amy Esplain
Affiliation: College of Information Science, University of Arizona

# For data handling
import pandas as pd
import numpy as np

# For clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

Dataset

This project requires a granular, play-by-play dataset so that each foul called in a game can be tied to an identifiable referee. The chosen dataset comes from Kaggle and was created by Vladislav Shufinskiy (dataset link), who combined several sources into a set of publicly available datasets. I am using this third-party source because of the granularity the project requires. If I were to collect this data myself, it would require excessive effort due to limits on API requests for per-game play-by-play details.

The dataset used in this analysis covers the 2022, 2023, and 2024 WNBA seasons and was web-scraped from CDN.NBA.COM by Vladislav Shufinskiy. The analysis will use all available games, including in-season games, playoffs, and finals, to increase the sample size.

# Import WNBA data from the data folder
import os

# Define the data folder path
data_folder = "data"

# Load individual season data
wnba_2022 = pd.read_csv(os.path.join(data_folder, "wnba_2022.csv"))
wnba_2023 = pd.read_csv(os.path.join(data_folder, "wnba_2023.csv"))
wnba_2024 = pd.read_csv(os.path.join(data_folder, "wnba_2024.csv"))

# Combine all seasons into one dataset
wnba_data = pd.concat([wnba_2022, wnba_2023, wnba_2024], ignore_index=True)

# Display basic information about the dataset
print(f"Total records across all seasons: {len(wnba_data):,}")
print(f"Columns in dataset: {wnba_data.shape[1]}")
print(f"Dataset shape: {wnba_data.shape}")

# List the column names
print("\nDataset columns:")
print(wnba_data.columns.tolist())

print("\nExample of the dataset:")
# Keep only rows where officialId is not null and show the first 10
official_data = wnba_data[wnba_data['officialId'].notnull()]

official_data.head(10)
Total records across all seasons: 28,103
Columns in dataset: 57
Dataset shape: (28103, 57)

Dataset columns:
['actionNumber', 'clock', 'timeActual', 'period', 'periodType', 'actionType', 'subType', 'qualifiers', 'personId', 'x', 'y', 'possession', 'scoreHome', 'scoreAway', 'edited', 'orderNumber', 'xLegacy', 'yLegacy', 'isFieldGoal', 'side', 'description', 'personIdsFilter', 'teamId', 'teamTricode', 'descriptor', 'jumpBallRecoveredName', 'jumpBallRecoverdPersonId', 'playerName', 'playerNameI', 'jumpBallWonPlayerName', 'jumpBallWonPersonId', 'jumpBallLostPlayerName', 'jumpBallLostPersonId', 'shotDistance', 'shotResult', 'shotActionNumber', 'reboundTotal', 'reboundDefensiveTotal', 'reboundOffensiveTotal', 'pointsTotal', 'assistPlayerNameInitial', 'assistPersonId', 'assistTotal', 'turnoverTotal', 'stealPlayerName', 'stealPersonId', 'officialId', 'foulPersonalTotal', 'foulTechnicalTotal', 'foulDrawnPlayerName', 'foulDrawnPersonId', 'blockPlayerName', 'blockPersonId', 'gameId', 'isTargetScoreLastPeriod', 'area', 'areaDetail']

Example of the dataset:
actionNumber clock timeActual period periodType actionType subType qualifiers personId x ... foulPersonalTotal foulTechnicalTotal foulDrawnPlayerName foulDrawnPersonId blockPlayerName blockPersonId gameId isTargetScoreLastPeriod area areaDetail
38 50 PT05M11.00S 2022-08-18T02:15:47Z 1 REGULAR foul personal NaN 1628287 NaN ... 1.0 0.0 Wilson 1628932.0 NaN NaN 1042200101 NaN NaN NaN
46 60 PT04M38.00S 2022-08-18T02:16:46.900Z 1 REGULAR turnover out-of-bounds NaN 1628276 NaN ... NaN NaN NaN NaN NaN NaN 1042200101 NaN NaN NaN
56 68 PT03M58.00S 2022-08-18T02:20:32.100Z 1 REGULAR foul personal NaN 1629498 NaN ... 1.0 0.0 Cunningham 1629482.0 NaN NaN 1042200101 NaN NaN NaN
68 85 PT02M23.00S 2022-08-18T02:22:29.500Z 1 REGULAR foul personal 2freethrow 1629488 NaN ... 1.0 0.0 Wilson 1628932.0 NaN NaN 1042200101 NaN NaN NaN
77 95 PT01M59.00S 2022-08-18T02:23:44Z 1 REGULAR foul personal NaN 1629482 NaN ... 1.0 0.0 Wilson 1628932.0 NaN NaN 1042200101 NaN NaN NaN
90 117 PT00M09.40S 2022-08-18T02:26:14.700Z 1 REGULAR foul personal inpenalty, 2freethrow 1629484 NaN ... 1.0 0.0 Wilson 1628932.0 NaN NaN 1042200101 NaN NaN NaN
113 148 PT08M39.00S 2022-08-18T02:32:02.900Z 2 REGULAR foul personal NaN 203833 NaN ... 1.0 0.0 Gray 204334.0 NaN NaN 1042200101 NaN NaN NaN
116 153 PT08M31.00S 2022-08-18T02:32:47.900Z 2 REGULAR turnover out-of-bounds NaN 203833 NaN ... NaN NaN NaN NaN NaN NaN 1042200101 NaN NaN NaN
120 157 PT08M01.00S 2022-08-18T02:33:33.400Z 2 REGULAR foul offensive NaN 1630387 NaN ... 1.0 0.0 Peddy 203035.0 NaN NaN 1042200101 NaN NaN NaN
121 159 PT08M01.00S 2022-08-18T02:33:33.400Z 2 REGULAR turnover offensive foul NaN 1630387 NaN ... NaN NaN NaN NaN NaN NaN 1042200101 NaN NaN NaN

10 rows × 57 columns

Research Questions

The research questions guiding this project are designed to uncover latent patterns in officiating behavior within the WNBA using unsupervised data mining techniques. Rather than testing predefined hypotheses, the goal is to explore underlying structures and trends in referee decision-making that may indicate systemic tendencies or inconsistencies.

1. Do home teams have significantly higher win rates under specific referee crews?

This question aims to identify clusters of referee crews associated with elevated home team win rates. The project will explore whether specific officiating crews are consistently linked to favorable home outcomes. Patterns that emerge may reflect officiating tendencies that unintentionally reinforce home-court advantage.

2. When games are officiated by certain referee combinations, do they have higher or lower foul disparity?

This question focuses on foul differential as a key indicator of officiating style. The analysis seeks to reveal groups of crews with similar behavioral patterns. Identifying outliers or consistently imbalanced combinations may point to structural officiating trends.

3. Do certain referees call more fouls on away teams?

This question narrows the scope to individual referees to examine whether certain officials consistently contribute to foul imbalances. The objective is to detect underlying officiating bias and identify individuals whose patterns deviate significantly from the norm.
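As a rough illustration of the third question, the sketch below computes each referee's share of fouls called on away teams and compares it with the league-wide share. The `called_on_home` flag is a hypothetical column that would have to be derived during preprocessing, the data is a toy example, and the z-score is only a normal approximation for screening, not a formal test.

```python
import numpy as np
import pandas as pd

# Toy foul-level data. `called_on_home` is a hypothetical flag (True when the
# foul was called on the home team); it is not a column in the raw dataset.
fouls = pd.DataFrame({
    "officialId": [10] * 6 + [20] * 5 + [30] * 4,
    "called_on_home": [
        False, False, True, False, False, True,   # referee 10
        True, False, True, True, False,           # referee 20
        False, True, False, True,                 # referee 30
    ],
})

# Count total and home-team fouls per referee.
per_ref = fouls.groupby("officialId")["called_on_home"].agg(
    fouls="size", home_fouls="sum"
)
per_ref["away_fouls"] = per_ref["fouls"] - per_ref["home_fouls"]
per_ref["away_share"] = per_ref["away_fouls"] / per_ref["fouls"]

# League-wide share of fouls called on away teams, used as the baseline.
league_share = 1 - fouls["called_on_home"].mean()

# Approximate z-score of each referee's away share against the baseline.
std_err = np.sqrt(league_share * (1 - league_share) / per_ref["fouls"])
per_ref["z_vs_league"] = (per_ref["away_share"] - league_share) / std_err

print(per_ref)
```

With real data, referees with few recorded fouls would have very wide standard errors, so a minimum foul count per referee is likely needed before comparing them.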

Analysis Plan

Problem Introduction

The Women’s National Basketball Association (WNBA) has experienced significant growth in recent years, accompanied by an increasing emphasis on data analytics to enhance forecasting and anomaly detection capabilities. This project seeks to evaluate the fairness of officiating in the WNBA by applying machine learning techniques to referee assignment data, foul differentials, and game outcomes. The primary objective is to identify potential officiating bias and assess the extent to which individual referees may contribute to a home-court advantage.

Problem Formulation

This study will examine potential officiating bias in WNBA games by analyzing referee assignment data, foul differentials, and game outcomes across multiple seasons. The analysis will proceed in the following stages:

1. Data Collection and Preprocessing

  • Data Sources: Data will be collected from https://www.kaggle.com/datasets/brains14482/nba-playbyplay-and-shotdetails-data-19962021/data, a play-by-play dataset that includes referee assignments.

  • Variables of Interest:

    • Game metadata: date, teams, location (home/away), final scores
    • Referee assignments (names or IDs, crew combinations)
    • Team foul counts
    • Game outcomes (win/loss, point differential)
  • Data Cleaning:

    • Normalize referee names across games
    • Merge datasets to associate referee crews with game-level statistics
    • Handle missing or inconsistent values
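The merge step above could be sketched as follows. The column names (`gameId`, `orderNumber`, `actionType`, `teamTricode`, `scoreHome`, `scoreAway`, `officialId`) come from the dataset's column list, but the rows are toy data. Two points are assumptions rather than documented facts: that `teamTricode` on a foul row is the team charged with the foul, and that the home team can be inferred as the team whose plays increase `scoreHome`. The names `infer_home_team`, `foul_side`, and `crew_id` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy play-by-play rows; values are made up for illustration.
pbp = pd.DataFrame({
    "gameId":      [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "orderNumber": [1, 2, 3, 4, 5, 1, 2, 3, 4],
    "actionType":  ["2pt", "foul", "foul", "3pt", "foul",
                    "foul", "2pt", "foul", "foul"],
    "teamTricode": ["LVA", "LVA", "PHX", "PHX", "PHX",
                    "SEA", "SEA", "LVA", "LVA"],
    "scoreHome":   [2, 2, 2, 2, 2, 0, 2, 2, 2],
    "scoreAway":   [0, 0, 0, 3, 3, 0, 0, 0, 0],
    "officialId":  [None, 30, 10, None, 20, 10, None, 40, 20],
})

def infer_home_team(game: pd.DataFrame) -> str:
    """Assumption: the team on rows where scoreHome increases is the home team."""
    game = game.sort_values("orderNumber")
    gained = game["scoreHome"].diff().fillna(game["scoreHome"]) > 0
    return game.loc[gained, "teamTricode"].mode().iloc[0]

home_team = pd.Series(
    {gid: infer_home_team(g) for gid, g in pbp.groupby("gameId")},
    name="home_team",
).rename_axis("gameId")

# Keep foul rows that carry a referee ID.
fouls = pbp[(pbp["actionType"] == "foul") & pbp["officialId"].notna()].copy()
fouls["officialId"] = fouls["officialId"].astype(int)
fouls = fouls.join(home_team, on="gameId")
fouls["foul_side"] = np.where(
    fouls["teamTricode"] == fouls["home_team"], "home_fouls", "away_fouls"
)

# Fouls per side per game, and an order-independent crew key per game.
foul_counts = pd.crosstab(fouls["gameId"], fouls["foul_side"])
crew_id = (
    fouls.groupby("gameId")["officialId"]
    .agg(lambda ids: "-".join(str(i) for i in sorted(ids.unique())))
    .rename("crew_id")
)

game_table = foul_counts.join(crew_id).join(home_team)
print(game_table)
```

One caveat of building the crew key this way is that a referee who is not credited with any foul in a game would be missing from that game's crew, so crews with fewer than three IDs should be checked against another source.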

2. Exploratory Data Analysis

To validate that the data supports the research questions, key features will be explored using distributions and summary statistics:

  • Summarize foul counts by team and referee to understand the distribution of foul differential.
  • Visualize average foul differential by referee and referee crew to check for skewness, which could suggest that certain crews favor one side.
  • Compute home vs. away win rates across different referee combinations. Crews with consistently higher home win rates may indicate officiating bias regardless of foul disparities.
  • Generate pairwise correlations to identify potentially relevant feature groupings. Specific investigation will look at whether foul differential has a meaningful impact on scoring.
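A minimal sketch of these exploratory checks is shown below, assuming a game-level table already exists. The column names (`crew_id`, `foul_diff`, `point_diff`, `home_win`) are hypothetical and the values are toy data, so the numbers say nothing about real WNBA crews.

```python
import pandas as pd

# Toy game-level table; all column names are hypothetical.
games = pd.DataFrame({
    "crew_id":    ["A", "A", "A", "B", "B", "B"],
    "foul_diff":  [4, -1, 6, -2, -3, 1],   # away fouls minus home fouls
    "point_diff": [8, -4, 7, -7, -6, 3],   # home points minus away points
})
games["home_win"] = (games["point_diff"] > 0).astype(int)

# Distribution of foul differential: a mean far from zero or strong skew
# would be worth a closer look.
foul_diff_summary = games["foul_diff"].describe()
foul_diff_skew = games["foul_diff"].skew()

# Home win rate and mean foul differential per crew, with game counts so
# that small samples are visible.
crew_summary = games.groupby("crew_id").agg(
    games=("home_win", "size"),
    home_win_rate=("home_win", "mean"),
    mean_foul_diff=("foul_diff", "mean"),
)

# Pairwise correlation between foul differential and scoring margin.
corr = games["foul_diff"].corr(games["point_diff"])

print(foul_diff_summary)
print(crew_summary)
print(f"skew = {foul_diff_skew:.2f}, corr = {corr:.2f}")
```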

3. Feature Engineering

The goal is to construct a comprehensive set of quantitative features that can characterize officiating tendencies and enable meaningful clustering. Below are several features that I plan to create:

  • Foul Differential: The difference in total fouls called on the away team versus the home team, calculated as fouls called on the away team minus fouls called on the home team. This feature serves as a proxy for imbalance in foul calls and is a central indicator of potential bias.

  • Total Fouls Per Game:
    The sum of fouls called on both teams, capturing referees’ strictness or leniency in foul calling.

  • Average Fouls Per Team:
    Computed as total fouls divided by the number of games officiated by a referee or crew. This helps normalize foul volume across varying sample sizes.

  • Free Throw Differential:
    The difference in free throw attempts between the home and away teams, which may reflect the practical consequences of foul calls.

  • Home Win Indicator:
    A binary variable denoting whether the home team won the game (1 = home team won, 0 = home team lost). Used to explore relationships between crew assignments and home-court advantage.

  • Referee and Crew Identifiers:
    Encoded categorical identifiers for individual referees and three-person crew combinations to enable grouping and aggregation across games. Referee names will be normalized and mapped to a consistent ID to support aggregation. A composite key will be created for the three referees per game to analyze group effects.

  • Normalization and Scaling:

    To prepare for K-means clustering, all continuous features will be standardized (e.g., z-score normalization) to ensure that variables are on the same scale and contribute equally to distance calculations in the clustering algorithm.

  • Aggregation Strategy:

    Referee-level features will be aggregated across games to create a per-referee profile. Similarly, crew-level statistics will be generated by grouping games officiated by the same three-referee combinations. This aggregation supports both individual and crew-based cluster analysis.
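The features and aggregation strategy described above might look like the following sketch. It assumes a game-level table with one row per game and a list of the three referee IDs; the column names (`officials`, `home_fouls`, `away_fta`, and so on) are hypothetical and the values are toy data. Foul differential is taken as away minus home, and free throw differential as home minus away.

```python
import pandas as pd

# Toy game-level table; column names and values are hypothetical.
games = pd.DataFrame({
    "gameId":     [1, 2, 3, 4],
    "officials":  [[10, 20, 30], [30, 10, 20], [10, 40, 50], [20, 40, 50]],
    "home_fouls": [15, 18, 20, 16],
    "away_fouls": [19, 17, 18, 21],
    "home_fta":   [22, 18, 15, 25],
    "away_fta":   [16, 20, 19, 14],
    "home_pts":   [88, 75, 70, 90],
    "away_pts":   [80, 79, 77, 81],
})

# Game-level features.
games["foul_diff"] = games["away_fouls"] - games["home_fouls"]
games["total_fouls"] = games["away_fouls"] + games["home_fouls"]
games["ft_diff"] = games["home_fta"] - games["away_fta"]
games["home_win"] = (games["home_pts"] > games["away_pts"]).astype(int)

# Composite crew key: sorting makes it independent of listing order.
games["crew_id"] = games["officials"].apply(
    lambda ids: "-".join(str(i) for i in sorted(ids))
)

feature_cols = ["foul_diff", "total_fouls", "ft_diff", "home_win"]

# Crew-level profiles: average each feature over the crew's games.
crew_profiles = games.groupby("crew_id")[feature_cols].mean()
crew_profiles["games"] = games.groupby("crew_id").size()

# Referee-level profiles: one row per (game, referee), then average.
ref_profiles = (
    games.explode("officials")
    .rename(columns={"officials": "officialId"})
    .astype({"officialId": int})
    .groupby("officialId")[feature_cols]
    .mean()
)

print(crew_profiles)
print(ref_profiles)
```

Because each game contributes to three referee profiles, the referee-level rows are not independent of one another, which is worth keeping in mind when interpreting clusters.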

4. Unsupervised Learning (Pattern Discovery)

  • K-Means Clustering: Use K-means clustering to group:

    • Referee crews based on game-level foul and outcome patterns

    • Individual referees based on their aggregated officiating behavior across multiple games

  • Dimensionality Reduction: Apply Principal Component Analysis (PCA) to identify latent components in officiating behavior (such as home bias, foul volume, crew consistency).
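The clustering and dimensionality reduction steps could be sketched as below. The referee profiles are randomly generated stand-ins with hypothetical feature names, and choosing the number of clusters by silhouette score is one common option, not necessarily the approach the final analysis will use.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic referee profiles; feature names and values are hypothetical.
rng = np.random.default_rng(0)
n_refs = 30
profiles = pd.DataFrame({
    "foul_diff":     rng.normal(0.5, 1.5, n_refs),
    "total_fouls":   rng.normal(36.0, 4.0, n_refs),
    "ft_diff":       rng.normal(1.0, 3.0, n_refs),
    "home_win_rate": rng.uniform(0.3, 0.8, n_refs),
})

# Z-score standardization so every feature contributes equally to distances.
X = StandardScaler().fit_transform(profiles)

# Try a few values of k and keep the one with the best silhouette score.
scores = {}
for k in range(2, 6):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels_k)
best_k = max(scores, key=scores.get)

kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project onto two principal components and inspect the loadings to see
# which features drive each component.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
loadings = pd.DataFrame(
    pca.components_.T, index=profiles.columns, columns=["PC1", "PC2"]
)

print(f"best k = {best_k}")
print(loadings)
print("explained variance ratio:", pca.explained_variance_ratio_)
```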

5. Cluster Interpretation

  • Analyze each cluster’s centroid to identify distinguishing features (such as high foul disparity, frequent home wins)
  • Label clusters based on behavioral tendencies (such as “neutral crews”, “home-favoring referees”, “high-caller crews”)
  • Identify any outlier referees or crews with extreme values
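One possible way to carry out the interpretation steps is sketched below: centroids are mapped back to the original feature units so they can be read directly, and profiles far from their assigned centroid are flagged as candidate outliers. The data is synthetic, the feature names are hypothetical, and the cutoff of mean plus two standard deviations is an arbitrary illustrative threshold.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic profiles; feature names and values are hypothetical.
rng = np.random.default_rng(1)
features = ["foul_diff", "total_fouls", "home_win_rate"]
profiles = pd.DataFrame(
    rng.normal(size=(24, 3)) * np.array([1.5, 4.0, 0.1])
    + np.array([0.5, 36.0, 0.55]),
    columns=features,
)

scaler = StandardScaler()
X = scaler.fit_transform(profiles[features])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Centroids in original units (fouls, win rate) rather than z-scores.
centroids = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=features
)
centroids["n_members"] = np.bincount(kmeans.labels_, minlength=3)

# Distance of each profile to its own centroid in standardized space.
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Illustrative cutoff: flag profiles unusually far from their centroid.
threshold = dist.mean() + 2 * dist.std()
outliers = profiles.loc[dist > threshold]

print(centroids.round(2))
print(f"{len(outliers)} candidate outlier(s)")
```

Reading the centroid table row by row is what would support labels such as "home-favoring" or "high-caller": for example, a cluster whose centroid has both a high `foul_diff` and a high `home_win_rate`.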

6. Reporting and Visualization

  • Create visualizations including heatmaps, bar charts, and PCA plots to articulate findings
  • Plot clustered referee data to visualize separation and cohesion
  • Discuss implications of findings in the context of WNBA officiating policy and fairness
  • Provide future recommendations to improve the analysis

Plan of Attack

| Milestone | Task | Due Date |
| --- | --- | --- |
| Submit Proposal | Finalize and submit initial proposal for others to provide feedback | 8/3/2025 |
| Revise Proposal | Address all peer feedback as needed | 8/6/2025 |
| Submit Revised Proposal | Incorporate and address all feedback for instructor review | 8/8/2025 |
| Data Collection & Cleaning | Gather and clean WNBA game logs, referee assignments, and team stats; merge datasets and ensure consistent formatting | 8/10/2025 |
| Feature Engineering & EDA | Create variables such as foul differential, crew IDs, and home/away indicators; visualize trends and explore feature distributions | 8/12/2025 |
| Pattern Discovery with K-means | Standardize features and apply K-means clustering | 8/14/2025 |
| Cluster Interpretation | Interpret each cluster’s characteristics (e.g., home bias, foul disparity); identify and analyze outlier referees or crews | 8/16/2025 |
| Visual & Storytelling | Create and finalize visuals that showcase the clustering results, such as PCA plots and cluster heatmaps | 8/17/2025 |
| Final Write-Up & Presentation | Create, refine, and finalize the report, code, and presentation; ensure all results are well-documented and reproducible | 8/19/2025 |
| Final Submission | Submit final report, code, and presentation; back up to GitHub (or drive) | 8/20/2025 |

Repo Organization

| Path / File | Description |
| --- | --- |
| .github/ | Contains GitHub-specific files, including workflows, actions, and issue management templates. |
| _extra/ | Stores miscellaneous files that do not fit into other project categories; serves as a repository for supplementary documents. |
| _freeze/ | Houses frozen environment files detailing the project’s setup and dependencies. |
| data/ | Directory for all essential data files, including input datasets and resources required for analysis. |
| images/ | Central repository for visual assets such as diagrams, charts, and screenshots used for documentation and presentations. |
| .gitignore | Specifies files and directories to exclude from Git version control. |
| README.md | Central documentation file providing project overview, setup instructions, and usage guidelines. |
| _quarto.yml | Configuration file for Quarto, specifying rendering options and document settings. |
| about.qmd | Provides contextual information about the project and introduces team members and their roles. |
| index.qmd | Main page of the project write-up, including code, visualizations, and final results. |
| presentation.qmd | Quarto file used to create a slideshow of the final project presentation. |
| project-final.Rproj | RStudio project file that defines project-level settings for R-based workflows. |
| proposal.qmd | Contains the project proposal, including dataset descriptions, metadata, research questions, and a weekly progress plan. |
| requirements.txt | Lists required Python packages and versions necessary for reproducing the project environment. |

References

[1] WNBA dataset used in this project, compiled by Vladislav Shufinskiy (Kaggle): https://www.kaggle.com/datasets/brains14482/nba-playbyplay-and-shotdetails-data-19962021/data