Whistle Bias?

Investigating Referee Influence on WNBA Home Game Outcomes Using Data Mining

This project aims to investigate potential officiating bias in WNBA games by analyzing referee crew assignments, foul distributions, and game outcomes. The primary objective is to determine whether certain referee combinations disproportionately favor home teams or exhibit consistent patterns of foul disparities.

Author	Affiliation
Amy Esplain	College of Information Science, University of Arizona

# For data handling
import pandas as pd
import numpy as np

# For clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

Dataset

This project requires a granular play-by-play dataset that captures each foul called in a game along with an identifiable referee. The chosen dataset is from Kaggle and was created by Vladislav Shufinskiy (dataset link), who combined several sources into a set of publicly available datasets. I am using this third-party source because of the granular nature of the project: collecting the data myself would require considerable effort due to API limits on per-game play-by-play requests.

The dataset used in this analysis covers the 2022, 2023, and 2024 WNBA seasons and was web-scraped from CDN.NBA.COM by Vladislav Shufinskiy. The analysis will use all available games, including in-season games, playoffs, and finals, to increase the sample size.

# Import WNBA data from the data folder
import os

# Define the data folder path
data_folder = "data"

# Load individual season data
wnba_2022 = pd.read_csv(os.path.join(data_folder, "wnba_2022.csv"))
wnba_2023 = pd.read_csv(os.path.join(data_folder, "wnba_2023.csv"))
wnba_2024 = pd.read_csv(os.path.join(data_folder, "wnba_2024.csv"))

# Combine all seasons into one dataset
wnba_data = pd.concat([wnba_2022, wnba_2023, wnba_2024], ignore_index=True)

# Display basic information about the dataset
print(f"Total records across all seasons: {len(wnba_data):,}")
print(f"Columns in dataset: {wnba_data.shape[1]}")
print(f"Dataset shape: {wnba_data.shape}")

# List the column names
print("\nDataset columns:")
print(wnba_data.columns.tolist())

print("\nExample of the dataset:")
# Filter for rows where officialId is not null and show the first 10
official_data = wnba_data[wnba_data['officialId'].notnull()]

official_data.head(10)

Total records across all seasons: 28,103
Columns in dataset: 57
Dataset shape: (28103, 57)

Dataset columns:
['actionNumber', 'clock', 'timeActual', 'period', 'periodType', 'actionType', 'subType', 'qualifiers', 'personId', 'x', 'y', 'possession', 'scoreHome', 'scoreAway', 'edited', 'orderNumber', 'xLegacy', 'yLegacy', 'isFieldGoal', 'side', 'description', 'personIdsFilter', 'teamId', 'teamTricode', 'descriptor', 'jumpBallRecoveredName', 'jumpBallRecoverdPersonId', 'playerName', 'playerNameI', 'jumpBallWonPlayerName', 'jumpBallWonPersonId', 'jumpBallLostPlayerName', 'jumpBallLostPersonId', 'shotDistance', 'shotResult', 'shotActionNumber', 'reboundTotal', 'reboundDefensiveTotal', 'reboundOffensiveTotal', 'pointsTotal', 'assistPlayerNameInitial', 'assistPersonId', 'assistTotal', 'turnoverTotal', 'stealPlayerName', 'stealPersonId', 'officialId', 'foulPersonalTotal', 'foulTechnicalTotal', 'foulDrawnPlayerName', 'foulDrawnPersonId', 'blockPlayerName', 'blockPersonId', 'gameId', 'isTargetScoreLastPeriod', 'area', 'areaDetail']

Example of the dataset:

	actionNumber	clock	timeActual	period	periodType	actionType	subType	qualifiers	personId	x	...	foulPersonalTotal	foulTechnicalTotal	foulDrawnPlayerName	foulDrawnPersonId	blockPlayerName	blockPersonId	gameId	isTargetScoreLastPeriod	area	areaDetail
38	50	PT05M11.00S	2022-08-18T02:15:47Z	1	REGULAR	foul	personal	NaN	1628287	NaN	...	1.0	0.0	Wilson	1628932.0	NaN	NaN	1042200101	NaN	NaN	NaN
46	60	PT04M38.00S	2022-08-18T02:16:46.900Z	1	REGULAR	turnover	out-of-bounds	NaN	1628276	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	1042200101	NaN	NaN	NaN
56	68	PT03M58.00S	2022-08-18T02:20:32.100Z	1	REGULAR	foul	personal	NaN	1629498	NaN	...	1.0	0.0	Cunningham	1629482.0	NaN	NaN	1042200101	NaN	NaN	NaN
68	85	PT02M23.00S	2022-08-18T02:22:29.500Z	1	REGULAR	foul	personal	2freethrow	1629488	NaN	...	1.0	0.0	Wilson	1628932.0	NaN	NaN	1042200101	NaN	NaN	NaN
77	95	PT01M59.00S	2022-08-18T02:23:44Z	1	REGULAR	foul	personal	NaN	1629482	NaN	...	1.0	0.0	Wilson	1628932.0	NaN	NaN	1042200101	NaN	NaN	NaN
90	117	PT00M09.40S	2022-08-18T02:26:14.700Z	1	REGULAR	foul	personal	inpenalty, 2freethrow	1629484	NaN	...	1.0	0.0	Wilson	1628932.0	NaN	NaN	1042200101	NaN	NaN	NaN
113	148	PT08M39.00S	2022-08-18T02:32:02.900Z	2	REGULAR	foul	personal	NaN	203833	NaN	...	1.0	0.0	Gray	204334.0	NaN	NaN	1042200101	NaN	NaN	NaN
116	153	PT08M31.00S	2022-08-18T02:32:47.900Z	2	REGULAR	turnover	out-of-bounds	NaN	203833	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	1042200101	NaN	NaN	NaN
120	157	PT08M01.00S	2022-08-18T02:33:33.400Z	2	REGULAR	foul	offensive	NaN	1630387	NaN	...	1.0	0.0	Peddy	203035.0	NaN	NaN	1042200101	NaN	NaN	NaN
121	159	PT08M01.00S	2022-08-18T02:33:33.400Z	2	REGULAR	turnover	offensive foul	NaN	1630387	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	1042200101	NaN	NaN	NaN

10 rows × 57 columns

Research Questions

The research questions guiding this project are designed to uncover latent patterns in officiating behavior within the WNBA using unsupervised data mining techniques. Rather than testing predefined hypotheses, the goal is to explore underlying structures and trends in referee decision-making that may indicate systemic tendencies or inconsistencies.

1. Do home teams have significantly higher win rates under specific referee crews?

This question aims to identify clusters of referee crews associated with elevated home team win rates. The project will explore whether specific officiating crews are consistently linked to favorable home outcomes. Patterns that emerge may reflect officiating tendencies that unintentionally reinforce home-court advantage.

2. When games are officiated by certain referee combinations, do they have higher or lower foul disparity?

This question focuses on foul differential as a key indicator of officiating style. The analysis seeks to reveal groups of crews with similar behavioral patterns. Identifying outliers or consistently imbalanced combinations may point to structural officiating trends.

3. Do certain referees call more fouls on away teams?

This question narrows the scope to individual referees to examine whether certain officials consistently contribute to foul imbalances. The objective is to detect underlying officiating bias and identify individuals whose patterns deviate significantly from the norm.
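As a rough illustration of how the third question could be approached, the sketch below computes each official's share of fouls charged to the away team and compares it with the overall share. The `actionType` and `officialId` columns mirror the dataset. `team_is_home` is a hypothetical flag (whether the team charged with the action is the home team) that would have to be derived, and the rows are toy values rather than real data.

```python
import pandas as pd

# Toy play-by-play rows (not real data). `team_is_home` is a hypothetical
# flag that the raw dataset does not provide directly.
plays = pd.DataFrame({
    "actionType":   ["foul", "foul", "turnover", "foul",
                     "foul", "foul", "turnover", "foul"],
    "officialId":   [11, 11, 11, 11, 22, 22, 22, 22],
    "team_is_home": [False, False, True, True, True, False, False, True],
})

# Keep only foul calls; other actions can also carry an officialId
fouls = plays[plays["actionType"] == "foul"]

# Per-official foul count and share of fouls charged to the away team
per_ref = (
    fouls.assign(on_away=~fouls["team_is_home"])
    .groupby("officialId")
    .agg(fouls_called=("on_away", "size"), away_share=("on_away", "mean"))
)

# Deviation from the overall away share across all fouls in the sample
league_share = (~fouls["team_is_home"]).mean()
per_ref["deviation"] = per_ref["away_share"] - league_share
print(per_ref)
```

A positive `deviation` means the official charged a larger share of fouls to away teams than the sample as a whole. With real data, a minimum foul count per official would likely be needed before reading anything into the gap.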

Analysis plan

Problem Introduction

The Women’s National Basketball Association (WNBA) has experienced significant growth in recent years, accompanied by an increasing emphasis on data analytics to enhance forecasting and anomaly detection capabilities. This project seeks to evaluate the fairness of officiating in the WNBA by applying machine learning techniques to referee assignment data, foul differentials, and game outcomes. The primary objective is to identify potential officiating bias and assess the extent to which individual referees may contribute to a home-court advantage.

Problem Formulation

This study will examine potential officiating bias in WNBA games by analyzing referee assignment data, foul differentials, and game outcomes across multiple seasons. The analysis will proceed in the following stages:

1. Data Collection and Preprocessing

Data Sources: The data come from https://www.kaggle.com/datasets/brains14482/nba-playbyplay-and-shotdetails-data-19962021/data, a collection of play-by-play datasets that includes referee assignments.
Variables of Interest:
- Game metadata: date, teams, location (home/away), final scores
- Referee assignments (names or IDs, crew combinations)
- Team foul counts
- Game outcomes (win/loss, point differential)
Data Cleaning:
- Normalize referee names across games
- Merge datasets to associate referee crews with game-level statistics
- Handle missing or inconsistent values
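The name-normalization step could look something like the sketch below. It assumes referee names arrive as free-text strings with inconsistent casing, spacing, or accents. The helper `normalize_referee_name`, the `referee` column, and the names themselves are placeholders for illustration, not values from the dataset.

```python
import re
import unicodedata

import pandas as pd


def normalize_referee_name(name: str) -> str:
    """Return a canonical form of a referee name for matching across games."""
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of whitespace, trim the ends, and lowercase
    return re.sub(r"\s+", " ", text).strip().lower()


# Placeholder names, not real officials
raw = pd.DataFrame({
    "gameId": [1, 1, 2, 2],
    "referee": ["  Jane  Doe", "Alex Roe", "JANE DOE", "alex roe "],
})
raw["referee_clean"] = raw["referee"].map(normalize_referee_name)

# factorize assigns one integer code per distinct cleaned name
raw["referee_id"] = pd.factorize(raw["referee_clean"])[0]
print(raw)
```

If the source already provides a stable numeric identifier such as `officialId`, that identifier can be used directly and this step reduces to a consistency check.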

2. Exploratory Data Analysis

To validate that the data supports the research questions, key features will be explored using distributions and summary statistics:

Summarize foul counts by team and referee to understand Foul Differential Distribution.
Visualize average foul differential by referee and referee crew to see if there is skewness which could suggest certain crews favor one side.
Compute home vs. away win rates across different referee combinations. Crews with consistently higher home win rates may indicate officiating bias regardless of foul disparities.
Generate pairwise correlations to identify potentially relevant feature groupings. Specific investigation will look at whether foul differential has a meaningful impact on scoring.
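The crew-level summaries described above can be sketched with a single grouped aggregation. The game-level table here is a toy example with assumed column names (`crew_id`, `home_win`, `foul_diff`), and the `MIN_GAMES` cutoff is an illustrative choice rather than a value taken from the project.

```python
import pandas as pd

# Toy game-level table (not real results); column names are assumptions
games = pd.DataFrame({
    "crew_id":   ["A", "A", "A", "B", "B", "C"],
    "home_win":  [1, 1, 0, 0, 1, 1],
    "foul_diff": [2, 4, -1, 0, -2, 5],
})

# One row per crew: games worked, home win rate, and mean foul differential
crew_summary = (
    games.groupby("crew_id")
    .agg(
        n_games=("home_win", "size"),
        home_win_rate=("home_win", "mean"),
        mean_foul_diff=("foul_diff", "mean"),
    )
    .reset_index()
)

# Drop crews with too few games to say anything meaningful
MIN_GAMES = 2
reliable = crew_summary[crew_summary["n_games"] >= MIN_GAMES]
print(reliable.sort_values("home_win_rate", ascending=False))
```

Because exact three-person crews may repeat only a few times in a season, the sample-size filter matters: a crew with one game has a home win rate of either 0 or 1 by construction.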

3. Feature Engineering

The goal is to construct a comprehensive set of quantitative features that can characterize officiating tendencies and enable meaningful clustering. Below are several features that I plan to create:

Foul Differential: The difference between fouls called on the away team and fouls called on the home team, calculated as away-team fouls minus home-team fouls. This feature serves as a proxy for imbalance in foul calls and is a central indicator of potential bias.
Total Fouls Per Game:
The sum of fouls called on both teams, capturing referees’ strictness or leniency in foul calling.
Average Fouls Per Team:
Computed as total fouls divided by the number of games officiated by a referee or crew. This helps normalize foul volume across varying sample sizes.
Free Throw Differential:
The difference in free throw attempts between the home and away teams, which may reflect the practical consequences of foul calls.
Home Win Indicator:
A binary variable denoting whether the home team won the game (1 = home team won, 0 = home team lost). Used to explore relationships between crew assignments and home-court advantage.
Referee and Crew Identifiers:
Encoded categorical identifiers for individual referees and three-person crew combinations to enable grouping and aggregation across games. Referee names will be normalized and mapped to a consistent ID to support aggregation. A composite key will be created for the three referees per game to analyze group effects.
Normalization and Scaling:

To prepare for K-means clustering, all continuous features will be standardized (e.g., z-score normalization) to ensure that variables are on the same scale and contribute equally to distance calculations in the clustering algorithm.
Aggregation Strategy:

Referee-level features will be aggregated across games to create a per-referee profile. Similarly, crew-level statistics will be generated by grouping games officiated by the same three-referee combinations. This aggregation supports both individual and crew-based cluster analysis.
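A minimal sketch of how several of these features might be derived from foul-level rows follows. `gameId` and `officialId` mirror dataset columns. `is_home` (the foul was charged to the home team) is a hypothetical flag that would need to be derived, and the final scores are toy values. The foul differential uses the away-minus-home convention, so positive values mean more fouls on the away team.

```python
import pandas as pd

# Toy foul events (not real data). `is_home` is a hypothetical flag.
fouls = pd.DataFrame({
    "gameId":     [101, 101, 101, 101, 102, 102, 102],
    "officialId": [7, 3, 9, 7, 3, 9, 5],
    "is_home":    [True, False, False, False, True, True, False],
})
# Hypothetical final scores, one row per game
finals = pd.DataFrame({
    "gameId":    [101, 102],
    "scoreHome": [88, 70],
    "scoreAway": [80, 75],
}).set_index("gameId")

# Count fouls per game; summing a boolean column counts the True values
features = fouls.groupby("gameId").agg(
    home_fouls=("is_home", "sum"),
    total_fouls=("is_home", "size"),
)
features["away_fouls"] = features["total_fouls"] - features["home_fouls"]
features["foul_diff"] = features["away_fouls"] - features["home_fouls"]

# Composite crew key: sorted distinct official IDs, so the order in which
# officials appear within a game does not change the key
features["crew_id"] = fouls.groupby("gameId")["officialId"].agg(
    lambda ids: "-".join(str(i) for i in sorted(ids.unique()))
)

# Attach final scores and derive the home win indicator
features = features.join(finals)
features["home_win"] = (features["scoreHome"] > features["scoreAway"]).astype(int)
print(features)
```

Building the crew key from the officials seen on foul rows assumes every official in a game is credited with at least one call. A crew member with no recorded calls would be missing from the key, so a separate officials table would be safer if one is available.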

4. Unsupervised Learning (Pattern Discovery)

K-Means Clustering: Use K-means clustering to group:
- Referee crews based on game-level foul and outcome patterns
- Individual referees based on their aggregated officiating behavior across multiple games
Dimensionality Reduction: Apply Principal Component Analysis (PCA) to identify latent components in officiating behavior (such as home bias, foul volume, crew consistency)
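The standardize, cluster, and project pipeline could be sketched as below. The input is synthetic (two artificial groups of crew profiles generated with NumPy), so the output says nothing about real referees. The feature order and the choice of `n_clusters=2` are assumptions made only for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic crew profiles (not real results). Columns:
# home_win_rate, mean_foul_diff, mean_total_fouls
group_a = rng.normal([0.50, 0.0, 36.0], [0.02, 0.3, 1.0], size=(10, 3))
group_b = rng.normal([0.75, 4.0, 42.0], [0.02, 0.3, 1.0], size=(10, 3))
X = np.vstack([group_a, group_b])

# Z-score each feature so none dominates the Euclidean distances
X_scaled = StandardScaler().fit_transform(X)

# Cluster in the standardized space; k=2 is an assumption for this toy data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Project to two principal components for plotting and inspection
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

print(labels)
print(pca.explained_variance_ratio_)
```

On real data the number of clusters would need to be chosen rather than assumed, and the component loadings in `pca.components_` would indicate which original features drive each axis.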

5. Cluster Interpretation

Analyze each cluster’s centroid to identify distinguishing features (such as high foul disparity, frequent home wins)
Label clusters based on behavioral tendencies (such as “neutral crews”, “home-favoring referees”, “high-caller crews”)
Identify any outlier referees or crews with extreme values
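One way to carry out these interpretation steps is sketched here, again on synthetic data with a single planted outlier. Centroids are mapped back to original units so they can be read and labeled. Outliers are flagged by distance to their own centroid. The mean plus two standard deviations cutoff is an illustrative assumption, not a rule from the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic referee profiles (not real results). Columns:
# mean_foul_diff, home_win_rate. The last row is a planted outlier.
group_a = rng.normal([0.0, 0.50], [0.1, 0.005], size=(10, 2))
group_b = rng.normal([4.0, 0.70], [0.1, 0.005], size=(10, 2))
outlier = np.array([[1.2, 0.56]])
X = np.vstack([group_a, group_b, outlier])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Centroids converted back to original units, which are easier to label
centroids = scaler.inverse_transform(kmeans.cluster_centers_)

# Distance from each point to its assigned centroid, in standardized units
own_center = kmeans.cluster_centers_[kmeans.labels_]
dist = np.linalg.norm(X_scaled - own_center, axis=1)

# Flag points unusually far from their centroid (illustrative cutoff)
threshold = dist.mean() + 2 * dist.std()
flagged = np.flatnonzero(dist > threshold)

print(centroids)
print(flagged)
```

A centroid with a high mean foul differential and a high home win rate might earn a label such as "home-favoring", while flagged rows are candidates for closer manual review rather than conclusions in themselves.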

6. Reporting and Visualization

Create visualizations including heatmaps, bar charts, and PCA plots to articulate findings
Plot clustered referee data to visualize separation and cohesion
Discuss implications of findings in the context of WNBA officiating policy and fairness
Provide future recommendations to improve the analysis

Plan of Attack

Milestone	Task	Due Date
Submit Proposal	Finalize and submit initial proposal for others to provide feedback	8/3/2025
Revise Proposal	Address all peer feedback as needed	8/6/2025
Submit Revised Proposal	Incorporate and address all feedback for instructor review	8/8/2025
Data Collection & Cleaning	- Gather and clean WNBA game logs, referee assignments, and team stats - Merge datasets and ensure consistent formatting	8/10/2025
Feature Engineering & EDA	- Create variables such as foul differential, crew IDs, and home/away indicators - Visualize trends and explore feature distributions	8/12/2025
Pattern Discovery with K-means	- Standardize features and apply K-means clustering	8/14/2025
Cluster Interpretation	- Interpret each cluster’s characteristics (e.g., home bias, foul disparity) - Identify and analyze outlier referees or crews	8/16/2025
Visual & Storytelling	Create and finalize visuals that showcase the clustering results, such as PCA plots and cluster heatmaps	8/17/2025
Final Write-Up & Presentation	- Create, refine, and finalize the report, code, and presentation - Ensure all results are well-documented and reproducible	8/19/2025
Final Submission	- Submit final report, code, and presentation - Back up to GitHub (or drive)	8/20/2025

Repo Organization

Path / File	Description
`.github/`	Contains GitHub-specific files, including workflows, actions, and issue management templates.
`_extra/`	Stores miscellaneous files that do not fit into other project categories; serves as a repository for supplementary documents.
`_freeze/`	Houses frozen environment files detailing the project’s setup and dependencies.
`data/`	Directory for all essential data files, including input datasets and resources required for analysis.
`images/`	Central repository for visual assets such as diagrams, charts, and screenshots used for documentation and presentations.
`.gitignore`	Specifies files and directories to exclude from Git version control.
`README.md`	Central documentation file providing project overview, setup instructions, and usage guidelines.
`_quarto.yml`	Configuration file for Quarto, specifying rendering options and document settings.
`about.qmd`	Provides contextual information about the project and introduces team members and their roles.
`index.qmd`	Main page of the project write-up, including code, visualizations, and final results.
`presentation.qmd`	Quarto file used to create a slideshow of the final project presentation.
`project-final.Rproj`	RStudio project file that defines project-level settings for R-based workflows.
`proposal.qmd`	Contains the project proposal, including dataset descriptions, metadata, research questions, and a weekly progress plan.
`requirements.txt`	Lists required Python packages and versions necessary for reproducing the project environment.

References

[1] WNBA Dataset (used in this project): https://www.kaggle.com/datasets/brains14482/nba-playbyplay-and-shotdetails-data-19962021/data