Urbanization and Environmental Quality in Arizona, USA
Project Final: Workspace
Final Project for course INFO 523: Data Mining and Discovery
Author: JKP (Vera Jackson, Molly Kerwick, Brooke Pacheco)
Affiliation: College of Information Science, University of Arizona
Introduction
The goal of our project is to analyze the relationship between urbanization and environmental quality across regions within Arizona, identifying patterns and anomalies in how urban growth impacts climate metrics such as storms, temperature, and rainfall. As we have seen with the uproar among the Tucson population over the proposed environmental impacts of the Project Blue data center, environmental quality is important to local populations. We hope to use this study to identify the regional relationship between urbanization and climate indicators. We will be using traffic count data as our urbanization metric, and National Oceanic and Atmospheric Administration (NOAA) climate and storm event data as our environmental quality data.
Research Questions
Can we predict the environmental quality of a region based on urbanization indicators for that region?
The reason we chose this question is that if we can accurately predict how urbanization affects climate, policymakers can proactively work to preserve or improve the environmental quality of that region.
Are storm event data and traffic data successful environmental health and urbanization indicators, respectively?
The infrastructure for traffic monitoring is affordable to implement and already has a framework for deployment. If traffic volume is an indicator of environmental quality, we could use it to better study areas of climatological interest.
Is the relationship between urbanization level and climate indicators better described by a linear or quadratic model?
This will indicate how well the functional forms within our scope capture the relationship, relative to other environmental impact studies.
Can PCA analysis be used to compare regions throughout Arizona based on traffic, climate, and storm event data?
If our hypothesis that high urbanization level leads to low environmental quality is correct, we would expect to see similarities between metropolitan regions. Additionally, we might be able to identify regions with unknown similarities.
Proposed Procedure
This project will consist of two studies:
Study 1: A regional regression model of environmental quality features (storm events and climate data) as a function of our urbanization target feature (traffic volume). We will do this with multivariate regression models, with a hypothesis that there is a negative linear relationship between urbanization and environmental quality. We will train a linear regression model and a quadratic regression model and compare the two.
Study 2: A PCA analysis of the state of Arizona comparing both climate data and traffic data to identify similar regions.
Data
Traffic Data
Traffic congestion is a major source of air pollution in and around traffic areas (Transportation Research Board). Consequently, high traffic congestion has been shown to have a negative health impact on drivers, pedestrians, and residents in areas near roadways (World Health Organization). We selected traffic data as our urbanization metric because the monitoring was fairly robust across areas of different types (e.g., rural, urban, agricultural). We measured traffic volume as a gross count of vehicles in a given year. The traffic data were sourced from the Department of Transportation (DOT) Federal Highway Administration (FHWA) and included traffic count data and monitoring station data. The traffic count data were separated into files for each month spanning our five years of interest (2019-2023). Each line was a single day of counts at a single traffic station, with a column for each hour of that day. To aggregate the traffic count data, we summed the hourly counts into a gross daily count and then grouped the dataset by station_id, summing the newly created daily counts. This led to very high counts, on the order of 10^7 vehicles per station per year. The traffic station data held the geographic information for the traffic stations. Each row was a different station location deployed by FHWA that year and included the features station_id, latitude, longitude, and county_code. We omitted features such as the number of lanes and the direction of traffic flow. The traffic data were merged on station_id, which linked the geographic locations with the traffic counts.
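A minimal sketch of this aggregation is shown below. The hourly column names (hour_00 through hour_23) are assumed for illustration; only station_id is named in the text.

```python
import pandas as pd

def aggregate_traffic_counts(df_month: pd.DataFrame) -> pd.DataFrame:
    """Collapse hourly counts into a gross count per station."""
    # Assumed hourly column names: hour_00 ... hour_23.
    hour_cols = [f"hour_{h:02d}" for h in range(24)]
    # One row per station per day: sum the 24 hourly columns into a daily count.
    df_month = df_month.assign(daily_count=df_month[hour_cols].sum(axis=1))
    # Sum daily counts into a gross count for each station.
    return df_month.groupby("station_id", as_index=False)["daily_count"].sum()
```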
A potential issue in our approach is that gross counts alone are not an indicator of traffic congestion. Traffic congestion takes into account how long a car is on the road in a specific area, which is affected by gross count, number of lanes, and car speed. As Interstate-8, Interstate-10, and Interstate-40 are major shipping lanes that are represented in our data, they may be associated with high gross count but a low urbanization factor as trucks travel at high speed through rural areas.
Climate Data (from NOAA)
The climate (temperature and precipitation) data from NOAA come from the Global Historical Climatology Network, which combines daily climate observations from a variety of data sources, primarily land-based stations. Many of these stations are updated daily.
The features from the NOAA data included the following:
| Feature Name | Description | Use |
|--------------|-------------|-----|
| STATION | name of land-based station for collection | aggregating climate data |
| LATITUDE | latitude of station position | combining datasets regionally |
| LONGITUDE | longitude of station position | combining datasets regionally |
| DATE | date (year-month-day) of data collection | aggregating climate data temporally |
| PRCP | precipitation (inches) | rainfall_2022 and rainfall_below_5yavg |
| TAVG | average temperature (Fahrenheit) | avg_temp_2022 and avg_temp_above_5yavg |
| TMAX | maximum temperature (Fahrenheit) | max_temp_2022 and max_temp_above_5yavg |
TMAX and TAVG were used to calculate the maximum and average temperatures (respectively) for 2022 and for the 5-year average over 2019-2023; these results were then used to determine the difference between the 2022 maximum or average temperature and the 2019-2023 average. PRCP was used to determine the average rainfall (in inches) for 2022 and for the 2019-2023 5-year average; likewise, the difference between the 2022 value and the 2019-2023 average was calculated from these engineered features.
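As a rough illustration of this feature engineering, the sketch below computes max_temp_above_5yavg from daily observations. It assumes a daily DataFrame with STATION, DATE, and TMAX columns and uses the mean of the annual maxima as the 5-year baseline; the other temperature and rainfall features follow the same pattern.

```python
import pandas as pd

def climate_features(noaa: pd.DataFrame) -> pd.Series:
    """Compute max_temp_above_5yavg per station from daily TMAX observations."""
    noaa = noaa.assign(YEAR=noaa["DATE"].dt.year)
    # Maximum temperature per station for 2022.
    max_2022 = noaa.loc[noaa["YEAR"] == 2022].groupby("STATION")["TMAX"].max()
    # Annual maxima for 2019-2023, averaged per station to form the 5-year baseline.
    annual_max = (noaa.loc[noaa["YEAR"].between(2019, 2023)]
                  .groupby(["STATION", "YEAR"])["TMAX"].max())
    max_5yavg = annual_max.groupby("STATION").mean()
    # Engineered difference feature.
    return (max_2022 - max_5yavg).rename("max_temp_above_5yavg")
```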
The cyclical nature of our climate measurements is easy to spot in the temperature data, where each period spans one year. However, the pattern in precipitation and snowfall (in inches) over time is more difficult to identify.
Storm Event Data (from NOAA)
The storm event data are sourced from NOAA. Each row represented a weather event tied to a specific time and geographic zone, with end dates and times recorded, some down to the exact minute. The data also incorporated multiple observation sources, including the public, broadcast media, trained spotters, and forest or park services. Additionally, it contained fields detailing property and crop damage, injuries, and event magnitude. However, much of this metadata contained empty cells, indicating data reliability issues for certain fields.
Relevant Variables in the dataset (these are the storm event variables we used for modeling and visualizations):
BEGIN_DATE – date the storm event began
YEAR – extracted from BEGIN_DATE, used to filter and group data
MAGNITUDE – numeric severity/intensity of the storm event
CZ_FIPS – county FIPS code identifier used to group data by county
From these base variables, we created several engineered features:
df_2022 – subset of data for storm events that occurred in 2022
lowmag_2022 – number of low magnitude storm events in 2022
lowmag_5yavg – 5-year average (2019–2023) of low magnitude storm events per year
highmag_2022 – number of high magnitude storm events in 2022
highmag_5yavg – 5-year average (2019–2023) of high magnitude storm events per year
avg_mag_2022 – average storm magnitude for each county/zone in 2022
The engineered features (lowmag, highmag, and avg_mag) were used to compare storm activity in 2022 against the 5-year averages and to evaluate how storm intensity varied across counties.
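A minimal sketch of these storm features is shown below, assuming a DataFrame with BEGIN_DATE, MAGNITUDE, and CZ_FIPS columns; the low/high magnitude threshold of 1.0 is an illustrative assumption, and the high-magnitude counts are computed analogously.

```python
import pandas as pd

def storm_features(storms: pd.DataFrame, low_threshold: float = 1.0) -> pd.DataFrame:
    """Count low-magnitude events per county for 2022 and the 2019-2023 yearly average."""
    storms = storms.assign(YEAR=storms["BEGIN_DATE"].dt.year)
    df_2022 = storms[storms["YEAR"] == 2022]
    recent = storms[storms["YEAR"].between(2019, 2023)]
    # Low-magnitude event counts per county in 2022.
    low_2022 = (df_2022[df_2022["MAGNITUDE"] < low_threshold]
                .groupby("CZ_FIPS").size().rename("lowmag_2022"))
    # Low-magnitude events per county per year over 2019-2023, averaged per county.
    low_5yavg = (recent[recent["MAGNITUDE"] < low_threshold]
                 .groupby(["YEAR", "CZ_FIPS"]).size()
                 .groupby("CZ_FIPS").mean().rename("lowmag_5yavg"))
    # Average storm magnitude per county in 2022.
    avg_mag_2022 = df_2022.groupby("CZ_FIPS")["MAGNITUDE"].mean().rename("avg_mag_2022")
    return pd.concat([low_2022, low_5yavg, avg_mag_2022], axis=1)
```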
Data Integrity
The datasets we had access to had integrity issues: collection methods were inconsistent, which led to many missing values. A leading contributor to this problem is the large geographic extent of the sampling area. Climate stations are often designed to be operated remotely, meaning any hardware issues that arise take longer to repair, leading to frequent gaps in collection. Another integrity issue relates to the storm event data, which were uniquely granular in both space and time. While the traffic and climate stations were measured relatively consistently over time, storm event data were highly qualitative in nature and only included measurements from when storms were present.
Engineered Features
Climate Features
| Feature | Description |
|---------|-------------|
| max_temp_2022 | maximum temperature for 2022 |
| max_temp_above_5yavg | maximum temperature for 2022 minus maximum temperature averaged 2019-2023 |
| avg_temp_2022 | average temperature for 2022 |
| avg_temp_above_5yavg | average temperature for 2022 minus average temperature averaged 2019-2023 |
| rainfall_2022 | total inches of rainfall in 2022 |
| rainfall_below_5yavg | inches of annual rainfall averaged 2019-2023 minus inches of annual rainfall in 2022 |
| lowmagstorm_events_2022 | sum of low magnitude storm events in 2022 |
| lowmagstorm_events_above_5yavg | sum of low magnitude storm events for 2022 minus sum of low magnitude storm events averaged 2019-2023 |
| highmagstorm_events_2022 | sum of high magnitude storm events in 2022 |
| highmagstorm_events_above_5yavg | sum of high magnitude storm events for 2022 minus sum of high magnitude storm events averaged 2019-2023 |
| average_storm_mag_2022 | average storm event magnitude for 2022 |
Urbanization Features
| Feature | Description |
|---------|-------------|
| traffic_counts | total number of vehicles detected in 2022 |
| traffic_counts_above_5yavg | traffic counts in 2022 minus traffic counts averaged 2019-2023 |
Preprocessing
Geospatial Normalization
Since our research question is about regional relationships, we had to combine our data regionally. Because each dataset had unique geographic qualities, this was not a straightforward process. The two main forms of location information in our datasets were latitude/longitude pairs and county names. Because latitude/longitude coordinates identify exact points, we were unable to merge on coordinate pairs alone: merging on exact points would require the traffic station, the climate measurement, and the storm identification to have occurred at exactly the same location. To account for this, we implemented geospatial normalization in the form of a gridspace.
Our gridspace is a 2-dimensional grid overlaid on the state of Arizona that encompasses pre-defined ranges of latitude and longitude. We constructed a grid of 336 spaces, 16 cells horizontally and 21 cells vertically. Any data with a station location inside a given gridspace were merged into that gridspace and assigned a gridspace number and a gridspace geometry. In cartography, geometries describe the shape of a region; generally speaking, geometries include Points, Lines, and Polygons (https://geopandas.org/en/stable/docs/user_guide/data_structures.html). The traffic and NOAA station data include latitude and longitude information, which would be classified as a Point. The gridspace geometry, however, is a rectangle, which would be considered a Polygon. This form of normalization is helpful because it ensures all data share the same projection information and that the regions are the same size. We used the GeoPandas spatial join (geopandas.sjoin()) function to spatially join our data as we merged our datasets. Spatial joining works by joining GeoDataFrames on geometries that intersect, so traffic and climate data would be grouped together if they fell within the same gridspace.
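A minimal sketch of this gridspace binning is shown below, assuming point data with latitude and longitude columns; the grid bounds are approximate for Arizona and the cell counts mirror the 16 x 21 layout described above.

```python
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

def make_grid(xmin=-114.5, ymin=31.2, xmax=-109.0, ymax=37.0, nx=16, ny=21):
    """Build a rectangular grid of Polygon cells covering Arizona."""
    xs = np.linspace(xmin, xmax, nx + 1)
    ys = np.linspace(ymin, ymax, ny + 1)
    cells = [box(x0, y0, x1, y1)
             for x0, x1 in zip(xs[:-1], xs[1:])
             for y0, y1 in zip(ys[:-1], ys[1:])]
    grid = gpd.GeoDataFrame({"geometry": cells}, crs="EPSG:4326")
    grid["gridspace"] = range(len(grid))
    return grid

def assign_gridspace(df: pd.DataFrame, grid: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Attach a gridspace id to each station via a spatial join on intersecting geometries."""
    points = gpd.GeoDataFrame(
        df, geometry=gpd.points_from_xy(df["longitude"], df["latitude"]), crs=grid.crs)
    return gpd.sjoin(points, grid, how="left", predicate="within")
```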
First, we combined the NOAA climate data and the traffic data. When multiple station locations fell within the same gridspace, we assigned the mean of those measurements as the feature value for that gridspace. The spatial join only maintains gridspaces that are represented in both datasets, so our initial 336 gridspaces were reduced to roughly 80. An interesting problem arose with the storm event data, which did not include a single latitude/longitude point per event: because storms travel across regions, the dataset instead included a county variable along with starting and ending latitude and longitude values. We decided to preserve the county feature from the raw traffic data to merge on. This made the storm values unique in that a single storm event is represented in every gridspace within the county in which it occurred.
For our regression analyses, our final DataFrame was structured such that each row corresponded to a gridspace and each column was a traffic, climate, or storm event value.
Temporal Normalization
While we addressed regional normalization using our gridspace feature, we also had to normalize our data in the time dimension. The most common approach to seeing how features change over time is to perform regression where the independent variable is time; common examples are tracking how a stock's value changes over time, or how the temperature in one city changes day to day. Since our independent variable changes region by region, we instead accounted for temporal variability by comparing a year of interest (2022) to averages of the same features over a 5-year span (2019-2023). We did this by engineering features that are the differences between values measured in 2022 and the corresponding 5-year averages.
Data Preprocessing
To prepare our data for regression, we removed outliers using a 5%/95% tolerance range. We then filled missing values and scaled our features using StandardScaler. The low integrity of the dataset meant that we had issues with valid data being flagged as outliers: the overabundance of NaNs led to the fill values being over-represented, which caused the rest of the dataset to be classified as outliers. We believe this to be a leading contributor to our poor model performance.
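A minimal sketch of these preprocessing steps (mean imputation, 5%/95% capping, and standard scaling), assuming a merged numeric DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    num_cols = out.select_dtypes(include=np.number).columns
    # Fill missing values with the column mean.
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    # Cap values outside the 5th-95th percentile range (winsorization).
    for col in num_cols:
        lo, hi = np.percentile(out[col], [5, 95])
        out[col] = out[col].clip(lo, hi)
    # Standardize to zero mean and unit variance.
    out[num_cols] = StandardScaler().fit_transform(out[num_cols])
    return out
```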
We grouped our data into climate and urbanization (target) categories and split each category into 2022 values and 5-year average values. We performed PCA to reduce the dimensionality of our dataset, preserving only the most prominent components, and then performed one-hot encoding on categorical values to convert object types to numerical types.
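A minimal sketch of the PCA step, assuming the standardized DataFrame produced by the preprocessing above; the 0.95 variance target mirrors the approach of keeping only the most prominent components.

```python
import pandas as pd
from sklearn.decomposition import PCA

def pca_reduce(df_scaled: pd.DataFrame, var_target: float = 0.95):
    """Project scaled numeric features onto the components explaining `var_target` of the variance."""
    numeric = df_scaled.select_dtypes(include="number")
    pca = PCA(n_components=var_target)
    components = pca.fit_transform(numeric)
    # Loadings show how strongly each original feature contributes to each component.
    loadings = pd.DataFrame(pca.components_, columns=numeric.columns)
    return components, loadings
```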
We utilized Ordinary Least Squares (OLS) Regression, Ridge Regression, and Lasso Regression. We used both Ridge and Lasso to see whether there were any prominently predictive features we could investigate further: if Lasso performed significantly better than Ridge, that would suggest a few dominant features, because Lasso shrinks the coefficients of low-influence features to zero while Ridge preserves them. We ran the regression analysis on every combination of the 2022 data and the 5-year average (2019-2023) data to investigate any underlying relationships.
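A minimal sketch of this regularized regression comparison, with alpha chosen by cross-validated MSE; the feature matrix X and traffic target y stand in for the engineered columns, and the alpha grid is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

def fit_and_score(model, X, y):
    """Train on an 80/20 split and return test MSE and R-squared."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return mean_squared_error(y_te, pred), r2_score(y_te, pred)

def best_alpha(estimator_cls, X, y, alphas=(0.01, 0.1, 1, 10, 100, 1000)):
    """Pick the alpha with the lowest cross-validated MSE."""
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = {a: -cross_val_score(estimator_cls(alpha=a), X, y, cv=kf,
                                  scoring="neg_mean_squared_error").mean()
              for a in alphas}
    return min(scores, key=scores.get)

# Usage (X = environmental features, y = traffic target):
# mse, r2 = fit_and_score(Ridge(alpha=best_alpha(Ridge, X, y)), X, y)
# mse, r2 = fit_and_score(Lasso(alpha=best_alpha(Lasso, X, y)), X, y)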
Linear model results for each combination of environmental predictor set and traffic target (metrics rounded to four decimal places; alpha values selected by cross-validated MSE):

| Environmental data | Traffic data (target) | Model | Best alpha (CV MSE) | MSE | RMSE | R-squared |
|---|---|---|---|---|---|---|
| 5-year averages | 5-year averages | OLS | n/a | 0.5444 | 0.7378 | -0.2141 |
| 5-year averages | 5-year averages | Ridge | 1000 (1.1750) | 0.5596 | 0.7481 | -0.2480 |
| 5-year averages | 5-year averages | Lasso | 1000 (1.1750) | 0.5602 | 0.7485 | -0.2493 |
| 2022 | 5-year averages | OLS | n/a | 0.5553 | 0.7452 | -0.2385 |
| 2022 | 5-year averages | Ridge | 10000 (1.1760) | 0.5591 | 0.7477 | -0.2469 |
| 2022 | 5-year averages | Lasso | 1 (1.1760) | 0.5602 | 0.7485 | -0.2493 |
| 5-year averages | 2022 | OLS | n/a | 0.4478 | 0.6692 | -0.4855 |
| 5-year averages | 2022 | Ridge | 100 (1.1570) | 0.4109 | 0.6410 | -0.3631 |
| 5-year averages | 2022 | Lasso | 100 (1.1570) | 0.4316 | 0.6570 | -0.4317 |
Polynomial regression
We fit a second-order polynomial regression to see whether it would perform better than the linear regression models. The quadratic regression also performed worse than simply predicting the mean of the target.
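A minimal sketch of the quadratic fit, using scikit-learn's PolynomialFeatures with a linear regression; X and y stand in for the engineered environmental features and traffic target.

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

def fit_quadratic(X, y):
    """Fit a degree-2 polynomial regression and report test MSE and R-squared."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    quad = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         LinearRegression())
    quad.fit(X_tr, y_tr)
    pred = quad.predict(X_te)
    return mean_squared_error(y_te, pred), r2_score(y_te, pred)
```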
Polynomial (second-order) regression results for each combination (rounded to four decimal places):

| Environmental data | Traffic data (target) | MSE | R-squared |
|---|---|---|---|
| 5-year averages | 5-year averages | 8.2168 | -17.3250 |
| 2022 | 2022 | 0.7943 | -1.6347 |
| 2022 | 5-year averages | 1.2166 | -1.7133 |
| 5-year averages | 2022 | 6.0051 | -18.9196 |
Discussion
The overall low predictive power across all of our models suggests that traffic counts alone are not an accurate metric for urbanization. There is also a concern that these datasets are not appropriate for our research question. Narrowing our scope, either to a single county or to a single season, may help control for natural variability and data integrity issues in our dataset.
Our PCA analysis demonstrates that extreme storm events and temperature have the most significant interaction among features in our dataset.
The polynomial regression model did not perform better than our linear regression models. Both model types produced negative R-squared values, meaning that simply predicting the mean of the target would have had more predictive power than our models.
Future improvements
Some ideas for improvement when investigating our research question include:
Using more complete datasets. The NOAA dataset includes a use advisory noting that the data are intended mainly for hydrological and agricultural purposes and that, because of its incompleteness, it is not sufficient for climate change research.
Perform traffic congestion normalization. Gross counts alone do not capture congestion, which also depends on how long vehicles remain in an area, the number of lanes, and vehicle speed; normalizing for these factors would prevent high-speed shipping corridors such as Interstate-8, Interstate-10, and Interstate-40 from registering as highly urbanized areas.
Perform modeling using PCA clusters with variables of interest. Reducing the number of features may lead to better R-squared values because the regression fits will be less affected by extraneous features.
---title: "Urbanization and Environmental Quality in Arizona, USA"subtitle: "Project Final: Workspace"author: - name: "JKP (Vera Jackson, Molly Kerwick, Brooke Pacheco)" affiliations: - name: "College of Information Science, University of Arizona"description: "Final Project for course INFO 523: Data Mining and Discovery"format: html: code-tools: true code-overflow: wrap code-line-numbers: true embed-resources: trueeditor: visualcode-annotations: hoverexecute: warning: false echo: falsejupyter: python3---# IntroductionThe goal of our project is to analyze the relationship between urbanization and environmental quality across regions within Arizona, identifying patterns and anomalies in how urban growth impacts climate metrics such as storms, temperature, and rainfall. As we have seen with the uproar by the Tucson population regarding the proposed environmental impacts of the Project Blue data center, environmental quality is important to local populations. We hope to use this study to identify the regional relationship between urbanization and climate indicators. We will be using traffic count data as our urbanization metric, and National Oceanographic and Atmospheric Administration (NOAA) climate and storm event data as our environmental quality data.Research Questions``` Can we predict the environmental quality of a region based on urbanization indicators for that region?The reason we chose this question is that if we can accurately predict how urbanization affects climate, policymakers can proactively work to preserve or improve the environmental quality of that region.Are storm event data and traffic data successful environmental health and urbanization indicators, respectively?The infrastructure for traffic monitoring is affordable to implement and already has a framework for deployment. If traffic volume is an indicator of environmental quality, we could use it to better study areas of climatological interest.Is the relationship between urbanization level and climate indicators better described by a linear or quadratic model?This will indicate how accurate our scope is compared to other environmental impact studies.Can PCA analysis be used to compare regions throughout Arizona based on traffic, climate, and storm event data?If our hypothesis that high urbanization level leads to low environmental quality is correct, we would expect to see similarities between metropolitan regions. Additionally, we might be able to identify regions with unknown similarities.```Proposed ProcedureThis project will consist of 2 studies:``` Study 1: A regional regression model of environmental quality features (storm events and climate data) as a function of our urbanization target feature (traffic volume). We will do this with multivariate regression models, with a hypothesis that there is a negative linear relationship between urbanization and environmental quality. 
We will train a linear regression model and a quadratic regression model and compare the two.Study 2: A PCA analysis of the state of Arizona comparing both climate data and traffic data to identify similar regions.``````{python}# Import Libraries # Reading files and plottingimport seaborn as snsimport globimport osfrom matplotlib.dates import date2numimport geodatasetsimport geopandas as gpdimport pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport shapelyfrom shapely.geometry import Point# Statistical modelingfrom sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_scorefrom sklearn.preprocessing import StandardScalerfrom sklearn.linear_model import LinearRegressionfrom sklearn.preprocessing import PolynomialFeaturesimport numbersfrom sklearn.decomposition import PCAfrom sklearn.discriminant_analysis import LinearDiscriminantAnalysisfrom sklearn.neighbors import KNeighborsClassifierfrom sklearn.naive_bayes import GaussianNBfrom sklearn.tree import DecisionTreeClassifier, plot_treefrom sklearn.ensemble import RandomForestClassifier, RandomForestRegressorfrom sklearn.feature_selection import SelectKBest, f_classif, f_regressionfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import ( accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, auc, make_scorer, mean_squared_error, r2_score)import statsmodels.api as smfrom sklearn.linear_model import LogisticRegression, Ridge, Lassofrom scipy import stats``````{python}#| label: gdf-functiondef get_grid(xmin=-114.5, ymin=31.2, xmax=-109, ymax=37, num_x=15, num_y=20, proj='EPSG:4326'):""" Initialized grid used in gridspace normalization with grid geometry for mapping """# Calculate width and height of each grid cell cell_size_x = (xmax - xmin) / num_x cell_size_y = (ymax - ymin) / num_y# Define coordinate reference system (CRS) for the grid crs = proj# Initialize a list to hold grid cell geometries grid_cells = []# Create the grid by looping through x and y rangesfor x0 in np.arange(xmin, xmax + cell_size_x, cell_size_x):for y0 in np.arange(ymin, ymax + cell_size_y, cell_size_y):# Define opposite corner of the box (left and upper side) x1 = x0 - cell_size_x y1 = y0 + cell_size_y# Create a rectangular polygon and add it to the list grid_cells.append(shapely.geometry.box(x0, y0, x1, y1))# Create a GeoDataFrame from the list of polygons with specified CRS grid = gpd.GeoDataFrame(grid_cells, columns=['geometry'], crs=crs)# Assign a unique ID to each grid cell grid['gridspace'] = [i for i inrange(336)] # Assumes 336 cellsreturn griddef add_grid_geometry(df):""" Given a DataFrame that already contains a 'gridspace' column, add corresponding grid cell geometries by merging with the generated grid. """# Generate the grid grid = get_grid()# Merge input df with grid to attach geometry based on gridspace ID df = pd.merge(df, grid, on='gridspace', how='left')return dfdef grid_bin(df, proj='EPSG:4326', xmin=-114.5, ymin=31.2, xmax=-109, ymax=37, num_x=15, num_y=20):""" Assign a grid cell ('gridspace') to each point in the input DataFrame based on its latitude and longitude. 
"""# Create point geometries from latitude and longitude df['geometry'] = gpd.points_from_xy(df.longitude, df.latitude, crs=proj)# Convert the DataFrame into a GeoDataFrame gdf = gpd.GeoDataFrame(data=df, geometry='geometry')# Generate the spatial grid with given bounding box and cell count grid = get_grid(xmin=xmin, ymin=ymin, xmax=xmax, ymax=ymax, num_x=num_x, num_y=num_y, proj=proj)# Perform a spatial join to assign each point to a grid cell gdf_joined = gpd.sjoin(left_df=gdf, right_df=grid, how='left')# Rename the resulting 'index_right' column to 'gridspace' for clarity gdf_joined['gridspace'] = gdf_joined['index_right'] gdf_joined = gdf_joined.drop(['index_right'], axis=1)return gdf_joined# Future shape grid informationnum_x =15num_y =20grid_size = (num_x+1) * (num_y+1)```# Data### Traffic DataTraffic congestion is a major source of air pollution in and around traffic areas (Transportation Research Board). Consequently, high traffic congestion has been shown to have a negative health impact on drivers, pedestrians, and residents in areas near roadways (World Health Organization). We selected traffic data as our urbanization metric as the monitoring was fairly robust throughout areas of different types (eg. rural, urban, agricultural). We measured traffic volume as a value of gross counts in a given year. The traffic data were sourced from the Department of Transportation (DOT) Federal Highway Administration (FHWA) and included traffic count data and monitoring station data. The traffic count data were separated into files for each month spanning our five years of interest (2019-2023). Each line was a single day of counts at a single traffic station. The columns included a count for each hour of that day. The aggregate our traffic count data, we summed the hourly counts into a gross daily count, and then grouped the dataset by `station_id` summing the newly created daily traffic count. This led to very high counts on the order of magnitude of 10e7. The traffic station held the geographic information for the traffic stations. Each row was a different station location deployed by FHWA that year. It included the features `station_id`, `latitude`, `longitude`, and `county_code`. We omitted features like the number of lanes and the direction of traffic flow. The traffic data were merged on `station_id` which linked the geographic location with the traffic counts.A potential issue in our approach is that gross counts alone are not an indicator of traffic congestion. Traffic congestion takes into account how long a car is on the road in a specific area, which is affected by gross count, number of lanes, and car speed. 
As Interstate-8, Interstate-10, and Interstate-40 are major shipping lanes that are represented in our data, they may be associated with high gross count but a low urbanization factor as trucks travel at high speed through rural areas.```{python}# Iterating through traffic station and traffic count files to# engineer DataFrames of traffic datadf_traffic_5yavg = pd.DataFrame()for i inrange(19, 24): # Iterate through years of interest# format values for files/features year ="{:02}".format(i) yearnum =int(f'20{year}')# read in station data for the year df_year = pd.read_csv(f'data/Traffic/stations/df_20{year}')# define data types for merging later df_year['station_id'] = df_year['station_id'].astype(str)for j inrange(1,13): # Iterate through months of interest# format strings for files/features monthyear ="{:02}".format(j) + year month ="{:02}".format(j)# Accessing traffic files from data directory# Note: the format of data changed in 2022 so we have to account for the delimiter changingif (i <22): df_month = pd.read_csv(f"data/Traffic/counts/AZ{monthyear}.VOL")else: df_month = pd.read_csv(f"data/Traffic/counts/AZ{monthyear}.VOL", delimiter='|')# converting all text to lowercase so merging works later df_month.columns = [col.lower() for col in df_month.columns]# sum hour counts to daily count for every line (one line is measurements per station per day) df_month[f'daily_volume_{month}'] = df_month[["hour_{:02}".format(i) for i inrange(24)]].sum(axis=1) # sum daily counts to monthly count per station df_month = df_month.groupby('station_id', as_index=False)[[f'daily_volume_{month}']].sum()# defining data types so merging later works df_month['station_id'] = df_month['station_id'].astype(str)# adding each month's total count to an annual df df_year = pd.merge(df_year, df_month, how='left', on='station_id')# summing monthly counts so we have an annual count df_year['traffic_counts'] = df_year[["daily_volume_{:02}".format(i) for i inrange(1,13)]].sum(axis=1)# maintaining features used in our research questions and dropping the rest for handleability df_year = df_year[['station_id','latitude','longitude', 'traffic_counts', 'county_code']]# initialize df for 5 year averagesif (year =='19'): df_traffic_5yavg = df_year.copy() df_traffic_5yavg[f'traffic_counts_{year}'] = df_year['traffic_counts']``````{python}# create averages and clean dfs####### 1. Calculate averages for 5 year average traffic df# remove column that was duplicate added during loopingdf_traffic_5yavg = df_traffic_5yavg.drop('traffic_counts',axis=1)# create average column from 2019-2023df_traffic_5yavg['traffic_counts_5yavg'] = df_traffic_5yavg[['traffic_counts_{:02}'.format(year) for year inrange(19,24)]].mean(axis=1)####### 2. 
make traffic_counts_above_5yavg feature# remove data from years not of interestdf_traffic_2022 = df_traffic_5yavg.drop(['traffic_counts_{:02}'.format(year) for year inrange(19,22)], axis=1)df_traffic = df_traffic_2022.drop('traffic_counts_23', axis=1)# add counts above 5yavg feature (Ref: https://saturncloud.io/blog/how-to-sum-two-columns-in-a-pandas-dataframe/)df_traffic['traffic_counts_above_5yavg'] = df_traffic_2022['traffic_counts_22'] - df_traffic_2022['traffic_counts_5yavg']# rename for consistency with environmental dfdf_traffic = df_traffic.rename({'traffic_counts_22': 'traffic_counts_2022'}, axis=1)# remove intermediary 5yavg featuredf_traffic = df_traffic.drop('traffic_counts_5yavg', axis=1)``````{python}# initiate gridspacegdf_traffic = grid_bin(df_traffic, num_x=num_x, num_y=num_y)# average values over gridspacegdf_traffic = grid_bin(df_traffic)gdf_traffic = gdf_traffic.groupby('gridspace', as_index=False)[['traffic_counts_2022', 'traffic_counts_above_5yavg', 'latitude', 'longitude', 'county_code']].mean() # important note: lat and lon averages are not very useful, but they will be replaced later. Just wanted to preserve the feature for rudimentary mappinggdf_traffic = gdf_traffic.rename(columns={'county_code': "fips_code"})# Add County Info for storm df mergingdf_county = pd.DataFrame({"fips_code": [1,3,5,7,9,11,12,13,15,17,19,21,23,25,27],"county": ["Apache", "Cochise", "Coconino", "Gila", "Graham", "Greenlee", "La Paz", "Maricopa", "Mohave", "Navajo", "Pima", "Pinal", "Santa Cruz", "Yavapai", "Yuma"]})gdf_traffic = pd.merge(gdf_traffic, df_county, on='fips_code', how='left')# Need to add geometry of gridspace to gdf_trafficgdf_traffic = add_grid_geometry(gdf_traffic)# Plot traffic data with grid geometry# US state and arizona shapefile for plottingstates = gpd.read_file("data/state_outline/cb_2018_us_state_20m/cb_2018_us_state_20m.shp") # path to downloaded shapefile for state outlinesstates = states.to_crs('EPSG:4326')arizona = states[states['NAME'] =='Arizona'] # filter for arizona onlyax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")# gdf was reverted to df during process, so converting to gdf againgdf_traffic = gpd.GeoDataFrame(gdf_traffic)gdf_traffic.plot('traffic_counts_2022', ax=ax, legend =True, legend_kwds={'label': "Traffic Count (1e7)", 'orientation': "vertical"})plt.title("Traffic in Arizona (2022)")plt.xlabel("Latitude")plt.ylabel("Longitude")plt.autoscale(False)plt.show()ax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")gdf_traffic.plot('traffic_counts_above_5yavg', ax=ax, legend =True, legend_kwds={'label': "Traffic Count (1e7)", 'orientation': "vertical"})plt.title("Traffic Above the 5-Year Average in Arizona (2019-2023)")plt.xlabel("Latitude")plt.ylabel("Longitude")plt.autoscale(False)plt.show()```### Climate Data (from NOAA)The climate (temperature and precipitation) data from NOAA comes from the Global Historical Climatology Network, combining daily climate observations from a variety of data sources, but primarily land-based stations. 
Many of these stations are updated daily.```{python}# load dataall_files = glob.glob("data/NOAA/"+"*.csv")list_of_dfs = []forfilein all_files: df = pd.read_csv(file) list_of_dfs.append(df)# merging data together into one data framenoaa = pd.concat(list_of_dfs, ignore_index =True)# make copynoaa1 = noaa# dropping columns that have too low non-null count (cannot extract valuable estimates from to fill null values)noaa1 = noaa1.drop(['DAPR', 'DASF', 'MDSF', 'PGTM', 'WDF2', 'WDF5', 'WESD', 'WESF'], axis =1)# noaa1.info()``````{python}# investigate missing values# print("Missing values (NaN): \n" + str(noaa1.isnull().sum()))unique_stations = noaa1['STATION'].unique()station_count = noaa1['STATION'].value_counts()# print("Stations: " + str(unique_stations))# print("Station count: " + str(station_count))# group by stationnoaaNum = noaa1.select_dtypes(include = np.number)# print(noaaNum)# For all values in the numerical column list from abovefor col in noaaNum.columns: noaa1[col] = noaa1[col].fillna( noaa1.groupby('STATION')[col].transform('median') )# print("Missing values NOW (NaN): \n" + str(noaa1.isnull().sum()))```The features from the NOAA data included the following:| Feature Name | Description | Use ||----------------------------|--------------------------|------------------|| STATION | name of land-based station for collection | aggregating climate data || LATITUDE | latitude of station position | combining datasets regionally || LONGITUDE | longitude of station position | combining datasets regionally || DATE | date (year-month-day) of data collection | aggregating climate data temporally || PRCP | precipitation (inches) | `rainfall_2022` and `rainfall_below_5yavg` || TAVG | average temperature (Fahrenheit) | `avg_temp_2022` and `avg_temp_above_5yavg` || TMAX | maximum temperature (Fahrenheit) | `max_temp_2022` and `max_temp_above_5yavg` |TMAX and TAVG were used to calculate the maximum and average temperatures (respectively) for 2022 and the 5-year average 2019-2023; then these results were used to determine the difference between the max or average temperature in 2022 and 2019-2023 average. PRCP was used to determine the average rainfall (in inches) in 2022 and 2019-2023 5-year average; likewise, the difference between the 2022 and 2019-2023 averages were calculated from these engineered features.```{python}# Date is in correct time formatnoaa1['DATE'] = pd.to_datetime(noaa1['DATE'])# Selecting variables to plotnoaa1_scatter_eda = noaa1[['DATE', 'PRCP', 'TAVG', 'TMAX', 'SNOW']]# noaa1_scatter_eda = noaa1[['DATE', 'AWND', 'EVAP', 'PRCP', 'SNOW', 'SNWD', 'TAVG', 'TMAX', 'TMIN', 'TOBS', 'WSF2', 'WSF5']]# Numeric columnsnum_cols = noaa1_scatter_eda.select_dtypes(include=np.number).columns# Create stacked subplotsfig, axes = plt.subplots(len(num_cols), 1, figsize=(12, 3*len(num_cols)), sharex=True)# for loop to plot all figures into gridfor i, col inenumerate(num_cols):# Scatter plot sns.scatterplot( data=noaa1_scatter_eda, x="DATE", y=col, ax=axes[i], s=5, alpha=0.5, color="blue", edgecolor=None) axes[i].set_ylabel(col) axes[i].grid(True, linestyle='--', alpha=0.5)axes[-1].set_xlabel("Date")fig.suptitle('Climate Features from NOAA')fig.tight_layout()plt.show()```It is easy to spot the cyclical nature of our climate measurements in the temperature data where each period spans one year. 
However, the pattern in precipitation and snow in inches over time is more difficult to identify.```{python}# Extract year from datenoaa1['YEAR'] = noaa1['DATE'].dt.year# Maximum temperature for 2022/stationmax_temp_2022 = noaa1[noaa1['YEAR'] ==2022].groupby('STATION')['TMAX'].max().rename('max_temp_2022')# Max temp in a 5-year average per stationmax_temp_5yavg = noaa1.groupby('STATION')['TMAX'].max().rename('max_temp_5yavg')# Average temperature for 2022 per stationavg_temp_2022 = noaa1[noaa1['YEAR'] ==2022].groupby('STATION')['TAVG'].mean().rename('avg_temp_2022')# Average temperature for 5-years/stationavg_temp_5yavg = noaa1.groupby('STATION')['TAVG'].mean().rename('avg_temp_5yavg')# Total inches of rainfall in 2022rainfall_2022 = noaa1[noaa1['YEAR'] ==2022].groupby('STATION')['PRCP'].sum().rename('rainfall_2022')# Total inches average for 5-year periodrainfall_5yavg = rainfall_5yavg = noaa1.groupby('STATION')['PRCP'].sum().rename('rainfall_5yavg')# Latitude per stationlatitude = noaa1.groupby('STATION')['LATITUDE'].first().rename('latitude')# Longitude per stationlongitude = noaa1.groupby('STATION')['LONGITUDE'].first().rename('longitude')features_2022 = pd.concat([ max_temp_2022, max_temp_5yavg, avg_temp_2022, avg_temp_5yavg, rainfall_2022, rainfall_5yavg, latitude, longitude], axis=1).reset_index()# Max temp above 5-year averagefeatures_2022['max_temp_above_5yavg'] = features_2022['max_temp_2022'] - features_2022['max_temp_5yavg']# Average temperature for 2022 - average temperaturefeatures_2022['avg_temp_above_5yavg'] = features_2022['avg_temp_2022'] - features_2022['avg_temp_5yavg']# Annual rainfall average minus annual rainfall in 2022features_2022['rainfall_below_5yavg'] = features_2022['rainfall_2022'] - features_2022['rainfall_5yavg']noaa_features = features_2022# noaa_features.info()``````{python}# create grid_bin dataframesgdf_noaa = grid_bin(noaa_features)gdf_noaa1 = gdf_noaa.groupby('gridspace', as_index=False)[['max_temp_2022', 'max_temp_above_5yavg', 'avg_temp_2022', 'avg_temp_above_5yavg', 'rainfall_2022', 'rainfall_below_5yavg', 'latitude', 'longitude']].mean() # important note: lat and lon averages are not very useful, but they will be replaced later. 
Just wanted to preserve the feature for rudimentary mapping# need to get geometry points backgeometry_noaa = gdf_noaa.groupby('gridspace')['geometry'].first().reset_index() gdf_noaa1 = gdf_noaa1.merge(geometry_noaa, on ='gridspace')# gdf_noaa1# convert back from df into gdfgdf_noaa1 = gpd.GeoDataFrame(gdf_noaa1, geometry="geometry", crs=gdf_noaa.crs)# US state and arizona shapefile for plottingstates = gpd.read_file("data/state_outline/cb_2018_us_state_20m/cb_2018_us_state_20m.shp") # path to downloaded shapefile for state outlinesstates = states.to_crs(gdf_noaa1.crs)arizona = states[states['NAME'] =='Arizona'] # filter for arizona only# example: plotting noaa data with arizona shapefileax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")gdf_noaa1.plot(column="max_temp_2022", ax=ax, cmap="viridis", markersize=30, legend=True, legend_kwds={'label': "Temperature (F)", 'orientation': "vertical"})plt.title("Average Temperature in Arizona (2022)")plt.xlabel('Latitude')plt.ylabel('Longitude')plt.autoscale(False)plt.show()ax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")gdf_noaa1.plot(column="rainfall_2022", ax=ax, cmap="viridis", markersize=30, legend=True, legend_kwds={'label': "Average Precipitation (inches)", 'orientation': "vertical"})plt.title("Average Precipitation in Arizona (2022)")plt.xlabel('Latitude')plt.ylabel('Longitude')plt.autoscale(False)plt.show()```### Storm Event Data (from NOAA)The storm event data is sourced from NOAA. Each row represented a weather event tied to a specific time and geographic zone, with end dates and times recorded some down to the exact minute. The data also incorporated multiple observation sources, including the public, broadcast media, trained spotters, and forest or park services. Additionally, it contained fields detailing property and crop damage, injuries, and event magnitude. 
However, a lot of this meta data contained empty cells which indicated some data reliability issues for certain fields.Relevant Variables in the dataset (these are the storm event variables we used for modeling and visualizations):``` BEGIN_DATE – date the storm event beganYEAR – extracted from BEGIN_DATE, used to filter and group dataMAGNITUDE – numeric severity/intensity of the storm eventCZ_FIPS – county FIPS code identifier used to group data by county```From these base variables, we created several engineered features:``` df_2022 – subset of data for storm events that occurred in 2022lowmag_2022 – number of low magnitude storm events in 2022lowmag_5yavg – 5-year average (2019–2023) of low magnitude storm events per yearhighmag_2022 – number of high magnitude storm events in 2022highmag_5yavg – 5-year average (2019–2023) of high magnitude storm events per yearavg_mag_2022 – average storm magnitude for each county/zone in 2022```The engineered features (lowmag, highmag, and avg_mag) were used to compare storm activity in 2022 against the 5-year averages and to evaluate how storm intensity varied across counties.```{python}# Path to your folderfolder_path ="data/StormEvents/"# Create a list of all matching files (2019–2023)csv_files =sorted(glob.glob(os.path.join(folder_path, "storm_data_AZ_*.csv")))# Load and concatenate all CSVsdf_list = []forfilein csv_files: df_year = pd.read_csv(file) df_list.append(df_year)# Merge into one dataframedf_all = pd.concat(df_list, ignore_index=True)# Check the result# print("Shape:", df_all.shape)# print("Years combined:", [os.path.basename(f) for f in csv_files])# print(df_all.head())# Count frequency of each storm event typeevent_counts = df_all['EVENT_TYPE'].value_counts().reset_index()event_counts.columns = ['EVENT_TYPE', 'FREQUENCY']# Create scatter plotplt.figure(figsize=(10,6))plt.scatter(event_counts['EVENT_TYPE'], event_counts['FREQUENCY'], color='blue')# Add titles and labelsplt.title('Frequency of Storm Event Types')plt.xlabel('Storm Event Type')plt.ylabel('Frequency')plt.xticks(rotation=45, ha='right')plt.tight_layout()plt.show()# Rename data framestorm_df = df_all# storm_df.info()``````{python}# Make sure dates and magnitude are correct typestorm_df['BEGIN_DATE'] = pd.to_datetime(storm_df['BEGIN_DATE'])storm_df['YEAR'] = storm_df['BEGIN_DATE'].dt.yearstorm_df['MAGNITUDE'] = pd.to_numeric(storm_df['MAGNITUDE'], errors='coerce')# Define threshold for low/high magnitudelow_mag_threshold =1.0# Filter for recent yearsrecent_df = storm_df[storm_df['YEAR'].between(2019, 2023)]df_2022 = storm_df[storm_df['YEAR'] ==2022]# Group by county (using CZ_FIPS here)# Low magnitude storm events in 2022 per countylowmag_2022 = df_2022[df_2022['MAGNITUDE'] < low_mag_threshold].groupby('CZ_FIPS').size().rename('lowmagstorm_events_2022')# 5-year average low magnitude storms per countylowmag_5yavg = recent_df[recent_df['MAGNITUDE'] < low_mag_threshold].groupby(['YEAR', 'CZ_FIPS']).size().groupby('CZ_FIPS').mean().rename('lowmagstorm_5yavg')# High magnitude storm events in 2022 per countyhighmag_2022 = df_2022[df_2022['MAGNITUDE'] >= low_mag_threshold].groupby('CZ_FIPS').size().rename('highmagstorm_events_2022')# 5-year average high magnitude storms per countyhighmag_5yavg = recent_df[recent_df['MAGNITUDE'] >= low_mag_threshold].groupby(['YEAR', 'CZ_FIPS']).size().groupby('CZ_FIPS').mean().rename('highmagstorm_5yavg')# Average storm magnitude in 2022 per countyavg_mag_2022 = df_2022.groupby('CZ_FIPS')['MAGNITUDE'].mean().rename('average_storm_mag_2022')# Combine 
all of the featuresstorm_features_2022 = pd.concat([ lowmag_2022, lowmag_5yavg, highmag_2022, highmag_5yavg, avg_mag_2022], axis=1).reset_index()# Calculate how many low-magnitude storm events in 2022 exceeded the 5-year average# This is done by subtracting the 5-year average number of low-magnitude storms# from the actual count in 2022, per county.storm_features_2022['lowmagstorm_events_above_5yavg'] = ( storm_features_2022['lowmagstorm_events_2022'] - storm_features_2022['lowmagstorm_5yavg'])# Calculate how many high-magnitude storm events in 2022 exceeded the 5-year average# This is the difference between the 2022 count and the 5-year average for high-magnitude storms,# computed per county.storm_features_2022['highmagstorm_events_above_5yavg'] = ( storm_features_2022['highmagstorm_events_2022'] - storm_features_2022['highmagstorm_5yavg'])# storm_features_2022.info()```### Data IntegrityThe datasets we had access to had integrity issues. Meaning the collection methods were inconsistent which led to many missing values. A leading contributor to this problem is the large size of the sampling range. Climate stations are often designed to be operated remotely, meaning any hardware issues that arise have delays in repair time, leading to frequent gaps in collection. Another integrity issue relates to the storm event data which was uniquely granular in both space and time. While the traffic and climate stations were measured relatively consistently over time, storm event data were highly qualitative in nature and only included measurements from when storms were present.## Engineered Features### Climate Features| Feature | Description ||------------------------------|------------------------------------------|| max_temp_2022 | maximum temperature for 2022 || max_temp_above_5yavg | maximum temperature for 2022 minus maximum temperature averaged 2019-2023 || avg_temp_2022 | average temperature for 2022 || avg_temp_above_5yavg | average temperature for 2022 minus average temperature averaged 2019-2023 || rainfall_2022 | total inches of rainfall in 2022 || rainfall_below_5yavg | inches of annual rainfall averaged 2019-2023 minus inches of annual rainfall in 2022 || lowmagstorm_events_2022 | sum of low magnitude storm events in 2022 || lowmagstorm_events_above_5yavg | sum of low magnitude storm events for 2022 minus sum of low magnitude storm events averaged 2019-2023 || highmagstorm_events_2022 | sum of high magnitude storm events in 2022 || highmagstorm_events_above_5yavg | sum of high magnitude storm events for 2022 minus sum of high magnitude storm events averaged 2019-2023 || average_storm_mag_2022 | average storm event magnitude for 2022 |### Urbanization Features| Feature | Description ||------------------------------|------------------------------------------|| traffic_counts | total number of vehicles detected in 2022 || traffic_counts_above_5yavg | traffic counts in 2022 minus traffic counts averaged 2019-2023 |# Preprocessing### Geospatial NormalizationSince our research question is about regional relationships, we had to combine our data regionally. Since each dataset had unique geographic qualities, it was not a straightforward process. The two main forms of location information given in our datasets were latitude/longitude pairs and county names. Since latitude and longitude are specific and discrete, we were unable to merge on latitude and longitude pair points alone. 
In order to merge on specific points, it would mean that the traffic station, the climate measurement, and the storm identification would have had to occur at the exact same location. To account for this, we implemented geospatial normalization in the form of a gridspace.Our gridspace is a 2-dimensional grid overlayed on the state of Arizona that encompasses pre-defined ranges of latitude and longitude. We manufactured a grid of 336 spaces that is 16 cells horizontally and 21 cells vertically. Any data that had a station location that was within a given gridspace would be merged into that gridspace and assigned a gridspace number and a gridspace geometry. In cartography, geometries describe the shape of a region. Generally speaking, geometries include Points, Lines,and Polygons (https://geopandas.org/en/stable/docs/user_guide/data_structures.html). The traffic and NOAA station data include latitude and longitude information, which would be classified as a Point. The gridspace geometry however is a rectangle which would be considered a Polygon. This form of normalization is helpful because it ensures all data share the same projection information and it assures the regions are the same size. We utilized the GeoPandas Spatial Join (`geopandas.sjoin()`) function to spatially join our data which we implemented as we merged our datasets. Spatial joining works by joining GeoDataFrames on geometries that intersect. So traffic and climate data would be grouped together in the same gridspace if they fell within the same region.First, we combined the NOAA climate data and the traffic data. When we grouped data that had multiple station locations within the same gridspace, we assigned the mean of those measurements to be the feature value for that gridspace. The spatial join function only maintains gridspaces that are represented in both datasets. This means that our initial gridspace of `{python}grid_size` was decreased to \~80 gridspaces. An interesting problem that arose was with the storm event data which did not include latitude and longitude values. Since storms travel across regions, the dataset included a county variable and starting and ending latitude and longitude values. We decided to preserve the county feature from the raw traffic data to merge on. This made the storm values unique in that a single storm event would be represented in every gridspace within the county in which it occurred.For our regression analyses, our final DataFrame was structured such that each row corresponded to a gridspace and each column was a traffic, climate, or storm event value.### Temporal NormalizationWhile we have addressed regional normalization using our `gridspace` feature, we had to find a way to normalize our data in the time dimension. The most common approach to see how features change over time is to perform regression where the independent variable is time. Common examples of this are seeing how a stock's value changes over time, or seeing how temperature for one city changes day to day. Since our independent variable changes region by region, we accounted for temporal variability by comparing a year of interest (2022) to a dataset of averages for the same features over a 5 year time span (2019-2023). 
We did this through engineering features that were differences between values measured in 2022 and values that were measured over the 5 year average.```{python}# US state and arizona shapefile for plottingstates = gpd.read_file("data/state_outline/cb_2018_us_state_20m/cb_2018_us_state_20m.shp") # path to downloaded shapefile for state outlinesstates = states.to_crs('EPSG:4326')arizona = states[states['NAME'] =='Arizona'] # filter for arizona onlysns.set_theme(style='white')ax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")# gdf was reverted to df during process, so converting to gdf againgdf_traffic = gpd.GeoDataFrame(gdf_traffic)gdf_traffic.plot('traffic_counts_2022', ax=ax, legend =True, legend_kwds={'label': "Traffic Count (1e7)", 'orientation': "vertical"})plt.title("Traffic in Arizona (2022)")plt.xlabel("Latitude")plt.ylabel("Longitude")plt.autoscale(False)plt.show()``````{python}ax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")gdf_traffic.plot('traffic_counts_above_5yavg', ax=ax, legend =True, legend_kwds={'label': "Traffic Count (1e7)", 'orientation': "vertical"})plt.title("Traffic Above the 5-Year Average in Arizona (2019-2023)")plt.xlabel("Latitude")plt.ylabel("Longitude")plt.autoscale(False)plt.show()``````{python}### Merge Traffic and NOAA Datasets on `gridspace`gdf_traffic_premerge = gdf_traffic.drop(['latitude', 'longitude'], axis=1)# remove Point geometry to preserve gridspace geometry from traffic dfgdf_noaa_premerge = gdf_noaa1.drop(['latitude', 'longitude', 'geometry'], axis=1)# Merging removes gridspaces that are not in the left dataset (traffic)gdf_traffic_noaa = pd.merge(gdf_traffic_premerge, gdf_noaa_premerge, on='gridspace')# gdf_traffic_noaa.head()``````{python}### Merge Storm and Traffic/NOAA Dataset on `fips_code`df_storm_premerge = storm_features_2022.rename({'CZ_FIPS': 'fips_code'}, axis=1)df_storm_premerge.columnsmerged_df = pd.merge(gdf_traffic_noaa, df_storm_premerge)# merged_df.isnull().sum()```### Data PreprocessingThe prepare our data for regression, we removed outliers with a 5% and 95% tolerance range. Then we filled missing values scaled our features using StandardScaler. The low integrity of the dataset meant that we had issues with data being flagged as outliers. This was because the overabundance of NaNs led to the fill values being over-represented, classifying the rest of the dataset as outliers. 
We believe this to be a leading contributor to our poor model performance.```{python}# For numerical columns, fill nulls with meannumerical_cols = ['lowmagstorm_events_2022', 'lowmagstorm_5yavg', 'highmagstorm_events_2022','highmagstorm_5yavg', 'average_storm_mag_2022', 'lowmagstorm_events_above_5yavg','highmagstorm_events_above_5yavg', 'max_temp_above_5yavg', 'avg_temp_above_5yavg','rainfall_below_5yavg', 'traffic_counts_2022', 'traffic_counts_above_5yavg', 'max_temp_2022', 'avg_temp_2022', 'rainfall_2022']for col in numerical_cols: merged_df[col].fillna(merged_df[col].mean(), inplace=True)# print(merged_df.isnull().sum())# Remove outliers - capping (winsorizing) imputationnumeric_cols = merged_df.select_dtypes(include=[np.number]).columnsfor col in numeric_cols: lower_limit = np.percentile(merged_df[col], 5) upper_limit = np.percentile(merged_df[col], 95)``````{python}# Test# applying log normalization to traffic counts for merged dataset# log_df = merged_df.copy()# log_df = np.log(log_df.select_dtypes(include=[np.number]))# # log_df[['traffic_counts_2022', 'traffic_counts_above_5yavg', 'avg_temp_2022']] = np.log(merged_df[['traffic_counts_2022', 'traffic_counts_above_5yavg', 'avg_temp_2022']])# # log_df = log_df.drop(['gridspace'], axis=1)# log_df.boxplot()# plt.xticks(rotation=45)``````{python}# Initialize the scalerscaler = StandardScaler()# Select numeric columnsnumeric_cols = merged_df.select_dtypes(include='number').columns# Fit and transformmerged_df_scaled = merged_df.copy()merged_df_scaled[numeric_cols] = scaler.fit_transform(merged_df[numeric_cols])# print(merged_df_scaled)merged_df_scaled.boxplot()plt.xticks(rotation=60, fontsize=8)plt.title('Box and Whisker Plot of Engineered Features after StandardScaler')plt.show()# sns.pairplot(merged_df_scaled)``````{python}# 2022 urbanization metricsurban_2022 = ['traffic_counts_2022',# 'gridspace' # This seems static/spatial, so keep with 2022 data]# Urbanization metrics relative to 5-year averageurban_above_5yavg = ['traffic_counts_above_5yavg']# 2022 environmental metricsenv_2022 = ['max_temp_2022','avg_temp_2022','rainfall_2022','lowmagstorm_events_2022','highmagstorm_events_2022','average_storm_mag_2022']# Environmental metrics relative to 5-year average (anomalies)env_above_below_5yavg = ['max_temp_above_5yavg','avg_temp_above_5yavg','rainfall_below_5yavg', # This one is "below", but similar anomaly concept'lowmagstorm_events_above_5yavg','highmagstorm_events_above_5yavg']df_5yavg = merged_df_scaled[urban_above_5yavg + env_above_below_5yavg]df_2022 = merged_df_scaled[urban_2022 + env_2022]sns.pairplot(df_5yavg)plt.show()sns.pairplot(df_2022)plt.show()```We grouped our data into climate and urbanization (target) categories. We split each set of categories into 2022 values and 5 year average values. We performed PCA analysis to decrease our dataset preserving only the most prominent features, and then performed one hot encoding on categorical values to convert object types to numerical types.### PCA Results```{python}# Keep only numeric columnsnumeric_cols = merged_df_scaled.select_dtypes(include='number').columnsdf_numeric = merged_df_scaled[numeric_cols]# Run PCA to get 95% variancepca = PCA(n_components=0.95)reduced_data = pca.fit_transform(df_numeric)# Loading scoresloading_scores = pd.DataFrame(pca.components_, columns=numeric_cols)threshold =0.3top_features = {}for i inrange(loading_scores.shape[0]):# Name the component (e.g., "Component 1", "Component 2", etc.) 
component =f"Component {i+1}"# Get the loading scores for this principal component scores = loading_scores.iloc[i]# Identify features with absolute loading scores above the threshold top_features_for_component = scores[abs(scores) > threshold].index.tolist()# Store the top features for this component in the dictionary top_features[component] = top_features_for_component# Print resultsfor component, features in top_features.items():print(f"{component}: {features}")``````{python}# Transform Catagorical Variables into Numeric Labels# One hot encoding for catagorical variables with multiple categoriescategorical_cols = merged_df_scaled.select_dtypes(include='object').columnsmerged_df_scaled = pd.get_dummies(merged_df_scaled, columns= categorical_cols, drop_first=True)# Convert all bool columns to int (0/1) bool_cols = merged_df_scaled.select_dtypes(include='bool').columnsmerged_df_scaled[bool_cols] = merged_df_scaled[bool_cols].astype(int)# print(merged_df_scaled.head())# print(merged_df_scaled.dtypes)```# Regression Analysis### Linear Regression ModelsWe utilized Ordinary Least Squares Regression, Ridge Regression, and Lasso Regression. We used both Ridge Regression and Lasso Regression to see if there were any prominently predictive features that we could further investigate. We could determine this if the Lasso Regression performed significantly better than the Ridge Regression, because the Lasso Regression omits features of low influence while Ridge Regression preserves them. We grouped the 2022 data and the 5 year average (2019-2023) data in every combination during regression analysis to investigate any underlying relationships.```{python}def run_regression_models(X_val, y_val): X = merged_df_scaled[X_val] # Independent variable y = merged_df_scaled[y_val] # Dependent variable X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.2, random_state =42)# _______________ OLS Regression _______________ X_train_with_const = sm.add_constant(X_train) model = sm.OLS(y_train, X_train_with_const).fit() model_summary = model.summary2()# print(model_summary)# Add constant to X_test X_test_const = sm.add_constant(X_test)# Predict using trained model y_pred = model.predict(X_test_const) mse = mean_squared_error(y_test, y_pred) # mean squared error rmse = np.sqrt(mse) # root mse r2 = r2_score(y_test, y_pred) # r^2 on test set# Print test performanceprint("OLS Regression Model:")print("\tMean-squared error:", mse)print("\tRoot mean-squared error:", rmse)print("\tR-squared value:", r2)# print(f"MSE: {mse:.4f}")# print(f"RMSE: {rmse:.4f}")# print(f"R²: {r2:.4f}")# Define the alpha values to try alpha_values = [0.01, 0.1, 1, 10, 100,1000,10000]# Initialize variables to store the best alpha and its corresponding MSE best_alpha =None best_mse = np.inf# Cross-validation to select the best alpha kf = KFold(n_splits=5, shuffle=True, random_state=42)# _______________ Ridge Regression _______________for alpha in alpha_values:# Create Ridge regression model with current alpha ridge = Ridge(alpha=alpha)# Perform cross-validation and get negative MSE scores cv_scores = cross_val_score(ridge, X_train, y_train, cv=kf, scoring='neg_mean_squared_error') mean_cv_mse =-np.mean(cv_scores)# Update best alpha if current mean CV MSE is lowerif mean_cv_mse < best_mse: best_mse = mean_cv_mse best_alpha = alphaprint(f"\nRidge best alpha selected: {best_alpha} with CV MSE: {best_mse}")# _______________ Ridge Regression _______________# Initialize Ridge regression with the best alpha ridge_reg = 
```{python}
def run_regression_models(X_val, y_val):
    X = merged_df_scaled[X_val]  # Independent variables
    y = merged_df_scaled[y_val]  # Dependent variable
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # _______________ OLS Regression _______________
    X_train_with_const = sm.add_constant(X_train)
    model = sm.OLS(y_train, X_train_with_const).fit()
    model_summary = model.summary2()
    # print(model_summary)

    # Add constant to X_test
    X_test_const = sm.add_constant(X_test)
    # Predict using the trained model
    y_pred = model.predict(X_test_const)

    mse = mean_squared_error(y_test, y_pred)  # mean squared error
    rmse = np.sqrt(mse)                       # root mean squared error
    r2 = r2_score(y_test, y_pred)             # R^2 on the test set

    # Print test performance
    print("OLS Regression Model:")
    print("\tMean-squared error:", mse)
    print("\tRoot mean-squared error:", rmse)
    print("\tR-squared value:", r2)
    # print(f"MSE: {mse:.4f}")
    # print(f"RMSE: {rmse:.4f}")
    # print(f"R²: {r2:.4f}")

    # Define the alpha values to try
    alpha_values = [0.01, 0.1, 1, 10, 100, 1000, 10000]

    # Initialize variables to store the best alpha and its corresponding MSE
    best_alpha = None
    best_mse = np.inf

    # Cross-validation to select the best alpha
    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    # _______________ Ridge Regression _______________
    for alpha in alpha_values:
        # Create a Ridge regression model with the current alpha
        ridge = Ridge(alpha=alpha)
        # Perform cross-validation and get negative MSE scores
        cv_scores = cross_val_score(ridge, X_train, y_train, cv=kf, scoring='neg_mean_squared_error')
        mean_cv_mse = -np.mean(cv_scores)
        # Update the best alpha if the current mean CV MSE is lower
        if mean_cv_mse < best_mse:
            best_mse = mean_cv_mse
            best_alpha = alpha
    print(f"\nRidge best alpha selected: {best_alpha} with CV MSE: {best_mse}")

    # Refit Ridge regression with the best alpha
    ridge_reg = Ridge(alpha=best_alpha)
    ridge_reg.fit(X_train, y_train)
    # Predict on the testing set
    y_pred = ridge_reg.predict(X_test)
    # Evaluate performance
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    print("Ridge Regression Model:")
    print("\tMean-squared error:", mse)
    print("\tRoot mean-squared error:", rmse)
    print("\tR-squared value:", r2)

    # _______________ Lasso Regression _______________
    # Reset the search so the Ridge results do not carry over
    best_alpha = None
    best_mse = np.inf
    for alpha in alpha_values:
        # Create a Lasso regression model with the current alpha
        lasso = Lasso(alpha=alpha, max_iter=10000)
        # Perform cross-validation and get negative MSE scores
        cv_scores = cross_val_score(lasso, X_train, y_train, cv=kf, scoring='neg_mean_squared_error')
        mean_cv_mse = -np.mean(cv_scores)
        # Update the best alpha if the current mean CV MSE is lower
        if mean_cv_mse < best_mse:
            best_mse = mean_cv_mse
            best_alpha = alpha
    print(f"\nLasso best alpha selected: {best_alpha} with CV MSE: {best_mse}")

    # Refit Lasso regression with the best alpha
    lasso_reg = Lasso(alpha=best_alpha, max_iter=10000)
    lasso_reg.fit(X_train, y_train)
    # Predict on the testing set
    y_pred = lasso_reg.predict(X_test)
    # Evaluate performance
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    print("Lasso Regression Model:")
    print("\tMean-squared error:", mse)
    print("\tRoot mean-squared error:", rmse)
    print("\tR-squared value:", r2)
```

```{python}
print('Environmental Data (5 Year Averages)')
print('Traffic Data (5 Year Averages)')
run_regression_models(env_above_below_5yavg, 'traffic_counts_above_5yavg')
```

```{python}
print('Environmental Data (2022)')
print('Traffic Data (2022)')
run_regression_models(env_2022, 'traffic_counts_2022')
```

```{python}
print('Environmental Data (2022)')
print('Traffic Data (5 Year Averages)')
run_regression_models(env_2022, 'traffic_counts_above_5yavg')
```

```{python}
print('Environmental Data (5 Year Averages)')
print('Traffic Data (2022)')
run_regression_models(env_above_below_5yavg, 'traffic_counts_2022')
```

### Polynomial regression

We fit a second-order polynomial regression to see whether it would perform better than the linear regression models.
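Concretely, with two illustrative predictors $x_1$ and $x_2$, the degree-2 expansion produced by `PolynomialFeatures(degree=2)` corresponds to fitting

$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2,
$$

i.e., the linear terms plus all squared and pairwise interaction terms.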
We found that the quadratic regression also performed worse than simply predicting the mean.

```{python}
def poly_reg(X_val, y_val, degree=2):
    X = merged_df_scaled[X_val].values
    y = merged_df_scaled[y_val].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Transform the data for polynomial regression
    poly_features = PolynomialFeatures(degree=degree)  # Adjust degree as necessary
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)

    # Initialize the Linear Regression model
    linear_reg = LinearRegression()
    # Fit the model with polynomial features
    linear_reg.fit(X_train_poly, y_train)
    # Predict on the testing set
    y_pred = linear_reg.predict(X_test_poly)

    # Calculate and print the MSE and R-squared
    mse_initial = mean_squared_error(y_test, y_pred)
    r2_initial = r2_score(y_test, y_pred)
    print(f'Mean Squared Error: {mse_initial}')
    print(f'R-squared: {r2_initial}')
```

```{python}
print('Environmental Data (5 Year Averages)')
print('Traffic Data (5 Year Averages)')
poly_reg(env_above_below_5yavg, 'traffic_counts_above_5yavg', degree=2)
```

```{python}
print('Environmental Data (2022)')
print('Traffic Data (2022)')
poly_reg(env_2022, 'traffic_counts_2022', degree=2)
```

```{python}
print('Environmental Data (2022)')
print('Traffic Data (5 Year Averages)')
poly_reg(env_2022, 'traffic_counts_above_5yavg', degree=2)
```

```{python}
print('Environmental Data (5 Year Averages)')
print('Traffic Data (2022)')
poly_reg(env_above_below_5yavg, 'traffic_counts_2022', degree=2)
```

# Discussion

- The overall low predictive power across all of our models suggests that traffic counts alone are not an accurate urbanization metric. There is also a concern that these datasets are not well suited to our research question. Narrowing our scope, either to a single county or to a single season, may help control for natural variability and data-integrity issues in our dataset.
- Our PCA analysis indicates that extreme storm events and temperature show the strongest interactions among the features in our dataset.
- The polynomial regression model did not perform better than our linear regression models. Both model types produced negative R-squared values, meaning that simply predicting the numerical average has more predictive power than our models (see the note on $R^2$ after this list).
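For context, the test-set $R^2$ reported by `r2_score` is

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
$$

so a negative value means the model's squared error on the test set exceeds that of a constant prediction at the mean $\bar{y}$ of the observed test values.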
### Future improvements

Some ideas for improvement when investigating our research question include:

- Using more complete datasets. The NOAA dataset carries a use advisory stating that the data are intended mainly for hydrological and agricultural purposes and that, because of their incompleteness, they are not sufficient for climate change research.
- Normalizing for traffic congestion. Congestion accounts for how long a vehicle stays on the road in a specific area, which depends on gross count, number of lanes, and vehicle speed. Because Interstate 8, Interstate 10, and Interstate 40 are major shipping corridors represented in our data, they may show high gross counts yet low urbanization, as trucks pass through rural areas at high speed.
- Modeling with PCA clusters of variables of interest. Reducing the number of features may lead to better R-squared values because the regression fits would be less affected by extraneous features.

# References

1. Grid Construction Reference
    1. Brennan, J. (2020). Fast and easy gridding of point data with geopandas. <https://james-brennan.github.io/posts/fast_gridding_geopandas/>
2. Traffic Data
    1. U.S. Department of Transportation, Federal Highway Administration. (n.d.). Travel Monitoring. <https://www.fhwa.dot.gov/policyinformation/travel_monitoring/tvt.cfm>
    2. Transportation Research Board. The Congestion Mitigation and Air Quality Improvement Program. <https://onlinepubs.trb.org/onlinepubs/sr/sr264.pdf>
    3. World Health Organization. Health Effects of Transport-Related Air Pollution. <https://books.google.com/books?hl=en&lr=&id=b2G3k51rd0oC&oi=fnd&pg=PR1&ots=O94u9zCs4y&sig=v1JEfno-HjkbKf3im4iDQLJJUQ0#v=onepage&q&f=false>
3. Storm Data
    1. NOAA National Centers for Environmental Information. (n.d.). Storm Events Database. <https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/>
4. Climate Data
    1. NOAA National Centers for Environmental Information. (n.d.). Climate Data. <https://www.ncdc.noaa.gov/cdo-web/search;jsessionid=2851FE4A6791F4BF429988727483E450>
    2. Global Historical Climatology Network. (n.d.). Daily Documentation. <https://www.ncei.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf>
5. Trenberth, K. E. (2007). The Impact of Climate Change and Variability on Heavy Precipitation, Floods and Droughts. <https://web.archive.org/web/20170809113035id_/http://www.cgd.ucar.edu/cas/Trenberth/website-archive/trenberth.papers-moved/The%20Impact%20of%20Climate%20Change%20on%20the%20Water%20Cycle%20v%204_ss.pdf>
6. Zhang, M., Gao, Y., & Ge, J. (2025). Different responses of extreme and mean precipitation to land use and land cover changes. <https://www.nature.com/articles/s41612-025-01049-1>
7. Saturn Cloud. (2023). How to Sum Two Columns in a Pandas DataFrame. <https://saturncloud.io/blog/how-to-sum-two-columns-in-a-pandas-dataframe/>