Urbanization and Environmental Quality in Arizona, USA
Project Final: Workspace
Final Project for course INFO 523: Data Mining and Discovery
Author: JKP (Vera Jackson, Molly Kerwick, Brooke Pacheco)
Affiliation: College of Information Science, University of Arizona
Introduction
The goal of our project is to analyze the relationship between urbanization and environmental quality across regions within Arizona, identifying patterns and anomalies in how urban growth impacts climate metrics such as storms, temperature, and rainfall. As we have seen with the uproar among the Tucson population over the proposed environmental impacts of the Project Blue data center, environmental quality is important to local populations. We hope to use this study to identify the regional relationship between urbanization and climate indicators. We will be using traffic count data as our urbanization metric, and National Oceanic and Atmospheric Administration (NOAA) climate and storm event data as our environmental quality data.
Research Questions
Can we predict the environmental quality of a region based on urbanization indicators for that region?
The reason we chose this question is that if we can accurately predict how urbanization affects climate, policymakers can proactively work to preserve or improve the environmental quality of that region.
Are storm event data and traffic data successful environmental health and urbanization indicators, respectively?
The infrastructure for traffic monitoring is affordable to implement and already has a framework for deployment. If traffic volume is an indicator of environmental quality, we could use it to better study areas of climatological interest.
Is the relationship between urbanization level and climate indicators better described by a linear or quadratic model?
This will indicate how well the functional forms within our scope capture the relationship, relative to other environmental impact studies.
Can PCA analysis be used to compare regions throughout Arizona based on traffic, climate, and storm event data?
If our hypothesis that high urbanization level leads to low environmental quality is correct, we would expect to see similarities between metropolitan regions. Additionally, we might be able to identify regions with unknown similarities.
Proposed Procedure
This project will consist of two studies:
Study 1: A regional regression model of environmental quality features (storm events and climate data) as a function of our urbanization target feature (traffic volume). We will do this with multivariate regression models, with a hypothesis that there is a negative linear relationship between urbanization and environmental quality. We will train a linear regression model and a quadratic regression model and compare the two.
Study 2: A PCA analysis of the state of Arizona comparing both climate data and traffic data to identify similar regions.
Data
Traffic Data
Traffic congestion is a major source of air pollution in and around traffic areas (Transportation Research Board). Consequently, high traffic congestion has been shown to have a negative health impact on drivers, pedestrians, and residents in areas near roadways (World Health Organization). We selected traffic data as our urbanization metric because the monitoring was fairly robust across areas of different types (e.g., rural, urban, agricultural). We measured traffic volume as a gross count of vehicles in a given year. The traffic data were sourced from the Department of Transportation (DOT) Federal Highway Administration (FHWA) and included traffic count data and monitoring station data. The traffic count data were separated into files for each month spanning our five years of interest (2019-2023). Each line was a single day of counts at a single traffic station, with a column for each hour of that day. To aggregate the traffic count data, we summed the hourly counts into a gross daily count and then grouped the dataset by station_id, summing the newly created daily counts. This led to very high counts, on the order of 10^7 vehicles per station per year. The traffic station data held the geographic information for the traffic stations. Each row was a different station location deployed by FHWA that year and included the features station_id, latitude, longitude, and county_code. We omitted features such as the number of lanes and the direction of traffic flow. The traffic data were merged on station_id, which linked the geographic locations with the traffic counts.
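A minimal sketch of this aggregation is shown below. The hourly column names (hour_00 through hour_23) are assumed for illustration; only station_id is named in the text.

```python
import pandas as pd

def aggregate_traffic_counts(df_month: pd.DataFrame) -> pd.DataFrame:
    """Collapse hourly counts into a gross count per station."""
    # Assumed hourly column names: hour_00 ... hour_23.
    hour_cols = [f"hour_{h:02d}" for h in range(24)]
    # One row per station per day: sum the 24 hourly columns into a daily count.
    df_month = df_month.assign(daily_count=df_month[hour_cols].sum(axis=1))
    # Sum daily counts into a gross count for each station.
    return df_month.groupby("station_id", as_index=False)["daily_count"].sum()
```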
A potential issue in our approach is that gross counts alone are not an indicator of traffic congestion. Traffic congestion takes into account how long a car is on the road in a specific area, which is affected by gross count, number of lanes, and car speed. As Interstate-8, Interstate-10, and Interstate-40 are major shipping lanes that are represented in our data, they may be associated with high gross count but a low urbanization factor as trucks travel at high speed through rural areas.
Climate Data (from NOAA)
The climate (temperature and precipitation) data from NOAA come from the Global Historical Climatology Network, which combines daily climate observations from a variety of data sources, primarily land-based stations. Many of these stations are updated daily.
The features from the NOAA data included the following:
| Feature Name | Description | Use |
|--------------|-------------|-----|
| STATION | name of land-based station for collection | aggregating climate data |
| LATITUDE | latitude of station position | combining datasets regionally |
| LONGITUDE | longitude of station position | combining datasets regionally |
| DATE | date (year-month-day) of data collection | aggregating climate data temporally |
| PRCP | precipitation (inches) | rainfall_2022 and rainfall_below_5yavg |
| TAVG | average temperature (Fahrenheit) | avg_temp_2022 and avg_temp_above_5yavg |
| TMAX | maximum temperature (Fahrenheit) | max_temp_2022 and max_temp_above_5yavg |
TMAX and TAVG were used to calculate the maximum and average temperatures (respectively) for 2022 and for the 5-year average over 2019-2023; these results were then used to determine the difference between the 2022 maximum or average temperature and the 2019-2023 average. PRCP was used to determine the average rainfall (in inches) for 2022 and for the 2019-2023 5-year average; likewise, the difference between the 2022 value and the 2019-2023 average was calculated from these engineered features.
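As a rough illustration of this feature engineering, the sketch below computes max_temp_above_5yavg from daily observations. It assumes a daily DataFrame with STATION, DATE, and TMAX columns and uses the mean of the annual maxima as the 5-year baseline; the other temperature and rainfall features follow the same pattern.

```python
import pandas as pd

def climate_features(noaa: pd.DataFrame) -> pd.Series:
    """Compute max_temp_above_5yavg per station from daily TMAX observations."""
    noaa = noaa.assign(YEAR=noaa["DATE"].dt.year)
    # Maximum temperature per station for 2022.
    max_2022 = noaa.loc[noaa["YEAR"] == 2022].groupby("STATION")["TMAX"].max()
    # Annual maxima for 2019-2023, averaged per station to form the 5-year baseline.
    annual_max = (noaa.loc[noaa["YEAR"].between(2019, 2023)]
                  .groupby(["STATION", "YEAR"])["TMAX"].max())
    max_5yavg = annual_max.groupby("STATION").mean()
    # Engineered difference feature.
    return (max_2022 - max_5yavg).rename("max_temp_above_5yavg")
```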
The cyclical nature of our climate measurements is easy to spot in the temperature data, where each period spans one year. However, the pattern in precipitation and snowfall (in inches) over time is more difficult to identify.
Storm Event Data (from NOAA)
The storm event data are sourced from NOAA. Each row represented a weather event tied to a specific time and geographic zone, with end dates and times recorded, some down to the exact minute. The data also incorporated multiple observation sources, including the public, broadcast media, trained spotters, and forest or park services. Additionally, it contained fields detailing property and crop damage, injuries, and event magnitude. However, much of this metadata contained empty cells, indicating data reliability issues for certain fields.
Relevant Variables in the dataset (these are the storm event variables we used for modeling and visualizations):
BEGIN_DATE – date the storm event began
YEAR – extracted from BEGIN_DATE, used to filter and group data
MAGNITUDE – numeric severity/intensity of the storm event
CZ_FIPS – county FIPS code identifier used to group data by county
From these base variables, we created several engineered features:
df_2022 – subset of data for storm events that occurred in 2022
lowmag_2022 – number of low magnitude storm events in 2022
lowmag_5yavg – 5-year average (2019–2023) of low magnitude storm events per year
highmag_2022 – number of high magnitude storm events in 2022
highmag_5yavg – 5-year average (2019–2023) of high magnitude storm events per year
avg_mag_2022 – average storm magnitude for each county/zone in 2022
The engineered features (lowmag, highmag, and avg_mag) were used to compare storm activity in 2022 against the 5-year averages and to evaluate how storm intensity varied across counties.
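A minimal sketch of these storm features is shown below, assuming a DataFrame with BEGIN_DATE, MAGNITUDE, and CZ_FIPS columns; the low/high magnitude threshold of 1.0 is an illustrative assumption, and the high-magnitude counts are computed analogously.

```python
import pandas as pd

def storm_features(storms: pd.DataFrame, low_threshold: float = 1.0) -> pd.DataFrame:
    """Count low-magnitude events per county for 2022 and the 2019-2023 yearly average."""
    storms = storms.assign(YEAR=storms["BEGIN_DATE"].dt.year)
    df_2022 = storms[storms["YEAR"] == 2022]
    recent = storms[storms["YEAR"].between(2019, 2023)]
    # Low-magnitude event counts per county in 2022.
    low_2022 = (df_2022[df_2022["MAGNITUDE"] < low_threshold]
                .groupby("CZ_FIPS").size().rename("lowmag_2022"))
    # Low-magnitude events per county per year over 2019-2023, averaged per county.
    low_5yavg = (recent[recent["MAGNITUDE"] < low_threshold]
                 .groupby(["YEAR", "CZ_FIPS"]).size()
                 .groupby("CZ_FIPS").mean().rename("lowmag_5yavg"))
    # Average storm magnitude per county in 2022.
    avg_mag_2022 = df_2022.groupby("CZ_FIPS")["MAGNITUDE"].mean().rename("avg_mag_2022")
    return pd.concat([low_2022, low_5yavg, avg_mag_2022], axis=1)
```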
Data Integrity
The datasets we had access to had integrity issues: collection methods were inconsistent, which led to many missing values. A leading contributor to this problem is the large geographic extent of the sampling area. Climate stations are often designed to be operated remotely, meaning any hardware issues that arise take longer to repair, leading to frequent gaps in collection. Another integrity issue relates to the storm event data, which were uniquely granular in both space and time. While the traffic and climate stations were measured relatively consistently over time, storm event data were highly qualitative in nature and only included measurements from when storms were present.
Engineered Features
Climate Features
| Feature | Description |
|---------|-------------|
| max_temp_2022 | maximum temperature for 2022 |
| max_temp_above_5yavg | maximum temperature for 2022 minus maximum temperature averaged 2019-2023 |
| avg_temp_2022 | average temperature for 2022 |
| avg_temp_above_5yavg | average temperature for 2022 minus average temperature averaged 2019-2023 |
| rainfall_2022 | total inches of rainfall in 2022 |
| rainfall_below_5yavg | inches of annual rainfall averaged 2019-2023 minus inches of annual rainfall in 2022 |
| lowmagstorm_events_2022 | sum of low magnitude storm events in 2022 |
| lowmagstorm_events_above_5yavg | sum of low magnitude storm events for 2022 minus sum of low magnitude storm events averaged 2019-2023 |
| highmagstorm_events_2022 | sum of high magnitude storm events in 2022 |
| highmagstorm_events_above_5yavg | sum of high magnitude storm events for 2022 minus sum of high magnitude storm events averaged 2019-2023 |
| average_storm_mag_2022 | average storm event magnitude for 2022 |
Urbanization Features
| Feature | Description |
|---------|-------------|
| traffic_counts | total number of vehicles detected in 2022 |
| traffic_counts_above_5yavg | traffic counts in 2022 minus traffic counts averaged 2019-2023 |
Preprocessing
Geospatial Normalization
Since our research question is about regional relationships, we had to combine our data regionally. Because each dataset had unique geographic qualities, this was not a straightforward process. The two main forms of location information in our datasets were latitude/longitude pairs and county names. Because latitude/longitude coordinates identify exact points, we were unable to merge on coordinate pairs alone: merging on exact points would require the traffic station, the climate measurement, and the storm identification to have occurred at exactly the same location. To account for this, we implemented geospatial normalization in the form of a gridspace.
Our gridspace is a 2-dimensional grid overlaid on the state of Arizona that encompasses pre-defined ranges of latitude and longitude. We constructed a grid of 336 spaces, 16 cells horizontally and 21 cells vertically. Any data with a station location inside a given gridspace were merged into that gridspace and assigned a gridspace number and a gridspace geometry. In cartography, geometries describe the shape of a region; generally speaking, geometries include Points, Lines, and Polygons (https://geopandas.org/en/stable/docs/user_guide/data_structures.html). The traffic and NOAA station data include latitude and longitude information, which would be classified as a Point. The gridspace geometry, however, is a rectangle, which would be considered a Polygon. This form of normalization is helpful because it ensures all data share the same projection information and that the regions are the same size. We used the GeoPandas spatial join (geopandas.sjoin()) function to spatially join our data as we merged our datasets. Spatial joining works by joining GeoDataFrames on geometries that intersect, so traffic and climate data would be grouped together if they fell within the same gridspace.
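A minimal sketch of this gridspace binning is shown below, assuming point data with latitude and longitude columns; the grid bounds are approximate for Arizona and the cell counts mirror the 16 x 21 layout described above.

```python
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

def make_grid(xmin=-114.5, ymin=31.2, xmax=-109.0, ymax=37.0, nx=16, ny=21):
    """Build a rectangular grid of Polygon cells covering Arizona."""
    xs = np.linspace(xmin, xmax, nx + 1)
    ys = np.linspace(ymin, ymax, ny + 1)
    cells = [box(x0, y0, x1, y1)
             for x0, x1 in zip(xs[:-1], xs[1:])
             for y0, y1 in zip(ys[:-1], ys[1:])]
    grid = gpd.GeoDataFrame({"geometry": cells}, crs="EPSG:4326")
    grid["gridspace"] = range(len(grid))
    return grid

def assign_gridspace(df: pd.DataFrame, grid: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Attach a gridspace id to each station via a spatial join on intersecting geometries."""
    points = gpd.GeoDataFrame(
        df, geometry=gpd.points_from_xy(df["longitude"], df["latitude"]), crs=grid.crs)
    return gpd.sjoin(points, grid, how="left", predicate="within")
```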
First, we combined the NOAA climate data and the traffic data. When multiple station locations fell within the same gridspace, we assigned the mean of those measurements as the feature value for that gridspace. The spatial join only maintains gridspaces that are represented in both datasets, so our initial 336 gridspaces were reduced to roughly 80. An interesting problem arose with the storm event data, which did not include a single latitude/longitude point per event: because storms travel across regions, the dataset instead included a county variable along with starting and ending latitude and longitude values. We decided to preserve the county feature from the raw traffic data to merge on. This made the storm values unique in that a single storm event is represented in every gridspace within the county in which it occurred.
For our regression analyses, our final DataFrame was structured such that each row corresponded to a gridspace and each column was a traffic, climate, or storm event value.
Temporal Normalization
While we addressed regional normalization using our gridspace feature, we also had to normalize our data in the time dimension. The most common approach to seeing how features change over time is to perform regression where the independent variable is time; common examples are tracking how a stock's value changes over time, or how the temperature in one city changes day to day. Since our independent variable changes region by region, we instead accounted for temporal variability by comparing a year of interest (2022) to averages of the same features over a 5-year span (2019-2023). We did this by engineering features that are the differences between values measured in 2022 and the corresponding 5-year averages.
Data Preprocessing
To prepare our data for regression, we removed outliers using a 5%/95% tolerance range. We then filled missing values and scaled our features using StandardScaler. The low integrity of the dataset meant that we had issues with valid data being flagged as outliers: the overabundance of NaNs led to the fill values being over-represented, which caused the rest of the dataset to be classified as outliers. We believe this to be a leading contributor to our poor model performance.
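A minimal sketch of these preprocessing steps (mean imputation, 5%/95% capping, and standard scaling), assuming a merged numeric DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    num_cols = out.select_dtypes(include=np.number).columns
    # Fill missing values with the column mean.
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    # Cap values outside the 5th-95th percentile range (winsorization).
    for col in num_cols:
        lo, hi = np.percentile(out[col], [5, 95])
        out[col] = out[col].clip(lo, hi)
    # Standardize to zero mean and unit variance.
    out[num_cols] = StandardScaler().fit_transform(out[num_cols])
    return out
```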
We grouped our data into climate and urbanization (target) categories and split each category into 2022 values and 5-year average values. We performed PCA to reduce the dimensionality of our dataset, preserving only the most prominent components, and then performed one-hot encoding on categorical values to convert object types to numerical types.
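A minimal sketch of the PCA step, assuming the standardized DataFrame produced by the preprocessing above; the 0.95 variance target mirrors the approach of keeping only the most prominent components.

```python
import pandas as pd
from sklearn.decomposition import PCA

def pca_reduce(df_scaled: pd.DataFrame, var_target: float = 0.95):
    """Project scaled numeric features onto the components explaining `var_target` of the variance."""
    numeric = df_scaled.select_dtypes(include="number")
    pca = PCA(n_components=var_target)
    components = pca.fit_transform(numeric)
    # Loadings show how strongly each original feature contributes to each component.
    loadings = pd.DataFrame(pca.components_, columns=numeric.columns)
    return components, loadings
```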
We utilized Ordinary Least Squares (OLS) Regression, Ridge Regression, and Lasso Regression. We used both Ridge and Lasso to see whether there were any prominently predictive features we could investigate further: if Lasso performed significantly better than Ridge, that would suggest a few dominant features, because Lasso shrinks the coefficients of low-influence features to zero while Ridge preserves them. We ran the regression analysis on every combination of the 2022 data and the 5-year average (2019-2023) data to investigate any underlying relationships.
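A minimal sketch of this regularized regression comparison, with alpha chosen by cross-validated MSE; the feature matrix X and traffic target y stand in for the engineered columns, and the alpha grid is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

def fit_and_score(model, X, y):
    """Train on an 80/20 split and return test MSE and R-squared."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return mean_squared_error(y_te, pred), r2_score(y_te, pred)

def best_alpha(estimator_cls, X, y, alphas=(0.01, 0.1, 1, 10, 100, 1000)):
    """Pick the alpha with the lowest cross-validated MSE."""
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = {a: -cross_val_score(estimator_cls(alpha=a), X, y, cv=kf,
                                  scoring="neg_mean_squared_error").mean()
              for a in alphas}
    return min(scores, key=scores.get)

# Usage (X = environmental features, y = traffic target):
# mse, r2 = fit_and_score(Ridge(alpha=best_alpha(Ridge, X, y)), X, y)
# mse, r2 = fit_and_score(Lasso(alpha=best_alpha(Lasso, X, y)), X, y)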
Linear model results for each combination of environmental predictor set and traffic target (metrics rounded to four decimal places; alpha values selected by cross-validated MSE):

| Environmental data | Traffic data (target) | Model | Best alpha (CV MSE) | MSE | RMSE | R-squared |
|---|---|---|---|---|---|---|
| 5-year averages | 5-year averages | OLS | n/a | 0.5444 | 0.7378 | -0.2141 |
| 5-year averages | 5-year averages | Ridge | 1000 (1.1750) | 0.5596 | 0.7481 | -0.2480 |
| 5-year averages | 5-year averages | Lasso | 1000 (1.1750) | 0.5602 | 0.7485 | -0.2493 |
| 2022 | 5-year averages | OLS | n/a | 0.5553 | 0.7452 | -0.2385 |
| 2022 | 5-year averages | Ridge | 10000 (1.1760) | 0.5591 | 0.7477 | -0.2469 |
| 2022 | 5-year averages | Lasso | 1 (1.1760) | 0.5602 | 0.7485 | -0.2493 |
| 5-year averages | 2022 | OLS | n/a | 0.4478 | 0.6692 | -0.4855 |
| 5-year averages | 2022 | Ridge | 100 (1.1570) | 0.4109 | 0.6410 | -0.3631 |
| 5-year averages | 2022 | Lasso | 100 (1.1570) | 0.4316 | 0.6570 | -0.4317 |
Polynomial regression
We fit a second-order polynomial regression to see whether it would perform better than the linear regression models. The quadratic regression also performed worse than simply predicting the mean of the target.
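A minimal sketch of the quadratic fit, using scikit-learn's PolynomialFeatures with a linear regression; X and y stand in for the engineered environmental features and traffic target.

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

def fit_quadratic(X, y):
    """Fit a degree-2 polynomial regression and report test MSE and R-squared."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    quad = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         LinearRegression())
    quad.fit(X_tr, y_tr)
    pred = quad.predict(X_te)
    return mean_squared_error(y_te, pred), r2_score(y_te, pred)
```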
Polynomial (second-order) regression results for each combination (rounded to four decimal places):

| Environmental data | Traffic data (target) | MSE | R-squared |
|---|---|---|---|
| 5-year averages | 5-year averages | 8.2168 | -17.3250 |
| 2022 | 2022 | 0.7943 | -1.6347 |
| 2022 | 5-year averages | 1.2166 | -1.7133 |
| 5-year averages | 2022 | 6.0051 | -18.9196 |
Discussion
The overall low predictive power across all of our models suggests that traffic counts alone are not an accurate metric for urbanization. There is also a concern that these datasets are not appropriate for our research question. Narrowing our scope, either to a single county or to a single season, may help control for natural variability and data integrity issues in our dataset.
Our PCA analysis demonstrates that extreme storm events and temperature have the most significant interaction among features in our dataset.
The polynomial regression model did not perform better than our linear regression models. Both model types produced negative R-squared values, meaning that simply predicting the mean of the target would have had more predictive power than our models.
Future improvements
Some ideas for improvement when investigating our research question include:
Using more complete datasets. The NOAA dataset includes a use advisory noting that the data are intended mainly for hydrological and agricultural purposes and that, because of its incompleteness, it is not sufficient for climate change research.
Perform traffic congestion normalization. Gross counts alone do not capture congestion, which also depends on how long vehicles remain in an area, the number of lanes, and vehicle speed; normalizing for these factors would prevent high-speed shipping corridors such as Interstate-8, Interstate-10, and Interstate-40 from registering as highly urbanized areas.
Perform modeling using PCA clusters with variables of interest. Reducing the number of features may lead to better R-squared values because the regression fits will be less affected by extraneous features.
---title: "Urbanization and Environmental Quality in Arizona, USA"subtitle: "Project Final: Workspace"author: - name: "JKP (Vera Jackson, Molly Kerwick, Brooke Pacheco)" affiliations: - name: "College of Information Science, University of Arizona"description: "Final Project for course INFO 523: Data Mining and Discovery"format: html: code-tools: true code-overflow: wrap code-line-numbers: true embed-resources: trueeditor: visualcode-annotations: hoverexecute: warning: false echo: falsejupyter: python3---# IntroductionThe goal of our project is to analyze the relationship between urbanization and environmental quality across regions within Arizona, identifying patterns and anomalies in how urban growth impacts climate metrics such as storms, temperature, and rainfall. As we have seen with the uproar by the Tucson population regarding the proposed environmental impacts of the Project Blue data center, environmental quality is important to local populations. We hope to use this study to identify the regional relationship between urbanization and climate indicators. We will be using traffic count data as our urbanization metric, and National Oceanographic and Atmospheric Administration (NOAA) climate and storm event data as our environmental quality data.Research Questions``` Can we predict the environmental quality of a region based on urbanization indicators for that region?The reason we chose this question is that if we can accurately predict how urbanization affects climate, policymakers can proactively work to preserve or improve the environmental quality of that region.Are storm event data and traffic data successful environmental health and urbanization indicators, respectively?The infrastructure for traffic monitoring is affordable to implement and already has a framework for deployment. If traffic volume is an indicator of environmental quality, we could use it to better study areas of climatological interest.Is the relationship between urbanization level and climate indicators better described by a linear or quadratic model?This will indicate how accurate our scope is compared to other environmental impact studies.Can PCA analysis be used to compare regions throughout Arizona based on traffic, climate, and storm event data?If our hypothesis that high urbanization level leads to low environmental quality is correct, we would expect to see similarities between metropolitan regions. Additionally, we might be able to identify regions with unknown similarities.```Proposed ProcedureThis project will consist of 2 studies:``` Study 1: A regional regression model of environmental quality features (storm events and climate data) as a function of our urbanization target feature (traffic volume). We will do this with multivariate regression models, with a hypothesis that there is a negative linear relationship between urbanization and environmental quality. 
We will train a linear regression model and a quadratic regression model and compare the two.Study 2: A PCA analysis of the state of Arizona comparing both climate data and traffic data to identify similar regions.``````{python}# Import Libraries # Reading files and plottingimport seaborn as snsimport globimport osfrom matplotlib.dates import date2numimport geodatasetsimport geopandas as gpdimport pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport shapelyfrom shapely.geometry import Point# Statistical modelingfrom sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_scorefrom sklearn.preprocessing import StandardScalerfrom sklearn.linear_model import LinearRegressionfrom sklearn.preprocessing import PolynomialFeaturesimport numbersfrom sklearn.decomposition import PCAfrom sklearn.discriminant_analysis import LinearDiscriminantAnalysisfrom sklearn.neighbors import KNeighborsClassifierfrom sklearn.naive_bayes import GaussianNBfrom sklearn.tree import DecisionTreeClassifier, plot_treefrom sklearn.ensemble import RandomForestClassifier, RandomForestRegressorfrom sklearn.feature_selection import SelectKBest, f_classif, f_regressionfrom sklearn.model_selection import train_test_splitfrom sklearn.metrics import ( accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, auc, make_scorer, mean_squared_error, r2_score)import statsmodels.api as smfrom sklearn.linear_model import LogisticRegression, Ridge, Lassofrom scipy import stats``````{python}#| label: gdf-functiondef get_grid(xmin=-114.5, ymin=31.2, xmax=-109, ymax=37, num_x=15, num_y=20, proj='EPSG:4326'):""" Initialized grid used in gridspace normalization with grid geometry for mapping """# Calculate width and height of each grid cell cell_size_x = (xmax - xmin) / num_x cell_size_y = (ymax - ymin) / num_y# Define coordinate reference system (CRS) for the grid crs = proj# Initialize a list to hold grid cell geometries grid_cells = []# Create the grid by looping through x and y rangesfor x0 in np.arange(xmin, xmax + cell_size_x, cell_size_x):for y0 in np.arange(ymin, ymax + cell_size_y, cell_size_y):# Define opposite corner of the box (left and upper side) x1 = x0 - cell_size_x y1 = y0 + cell_size_y# Create a rectangular polygon and add it to the list grid_cells.append(shapely.geometry.box(x0, y0, x1, y1))# Create a GeoDataFrame from the list of polygons with specified CRS grid = gpd.GeoDataFrame(grid_cells, columns=['geometry'], crs=crs)# Assign a unique ID to each grid cell grid['gridspace'] = [i for i inrange(336)] # Assumes 336 cellsreturn griddef add_grid_geometry(df):""" Given a DataFrame that already contains a 'gridspace' column, add corresponding grid cell geometries by merging with the generated grid. """# Generate the grid grid = get_grid()# Merge input df with grid to attach geometry based on gridspace ID df = pd.merge(df, grid, on='gridspace', how='left')return dfdef grid_bin(df, proj='EPSG:4326', xmin=-114.5, ymin=31.2, xmax=-109, ymax=37, num_x=15, num_y=20):""" Assign a grid cell ('gridspace') to each point in the input DataFrame based on its latitude and longitude. 
"""# Create point geometries from latitude and longitude df['geometry'] = gpd.points_from_xy(df.longitude, df.latitude, crs=proj)# Convert the DataFrame into a GeoDataFrame gdf = gpd.GeoDataFrame(data=df, geometry='geometry')# Generate the spatial grid with given bounding box and cell count grid = get_grid(xmin=xmin, ymin=ymin, xmax=xmax, ymax=ymax, num_x=num_x, num_y=num_y, proj=proj)# Perform a spatial join to assign each point to a grid cell gdf_joined = gpd.sjoin(left_df=gdf, right_df=grid, how='left')# Rename the resulting 'index_right' column to 'gridspace' for clarity gdf_joined['gridspace'] = gdf_joined['index_right'] gdf_joined = gdf_joined.drop(['index_right'], axis=1)return gdf_joined# Future shape grid informationnum_x =15num_y =20grid_size = (num_x+1) * (num_y+1)```# Data### Traffic DataTraffic congestion is a major source of air pollution in and around traffic areas (Transportation Research Board). Consequently, high traffic congestion has been shown to have a negative health impact on drivers, pedestrians, and residents in areas near roadways (World Health Organization). We selected traffic data as our urbanization metric as the monitoring was fairly robust throughout areas of different types (eg. rural, urban, agricultural). We measured traffic volume as a value of gross counts in a given year. The traffic data were sourced from the Department of Transportation (DOT) Federal Highway Administration (FHWA) and included traffic count data and monitoring station data. The traffic count data were separated into files for each month spanning our five years of interest (2019-2023). Each line was a single day of counts at a single traffic station. The columns included a count for each hour of that day. The aggregate our traffic count data, we summed the hourly counts into a gross daily count, and then grouped the dataset by `station_id` summing the newly created daily traffic count. This led to very high counts on the order of magnitude of 10e7. The traffic station held the geographic information for the traffic stations. Each row was a different station location deployed by FHWA that year. It included the features `station_id`, `latitude`, `longitude`, and `county_code`. We omitted features like the number of lanes and the direction of traffic flow. The traffic data were merged on `station_id` which linked the geographic location with the traffic counts.A potential issue in our approach is that gross counts alone are not an indicator of traffic congestion. Traffic congestion takes into account how long a car is on the road in a specific area, which is affected by gross count, number of lanes, and car speed. 
As Interstate-8, Interstate-10, and Interstate-40 are major shipping lanes that are represented in our data, they may be associated with high gross count but a low urbanization factor as trucks travel at high speed through rural areas.```{python}# Iterating through traffic station and traffic count files to# engineer DataFrames of traffic datadf_traffic_5yavg = pd.DataFrame()for i inrange(19, 24): # Iterate through years of interest# format values for files/features year ="{:02}".format(i) yearnum =int(f'20{year}')# read in station data for the year df_year = pd.read_csv(f'data/Traffic/stations/df_20{year}')# define data types for merging later df_year['station_id'] = df_year['station_id'].astype(str)for j inrange(1,13): # Iterate through months of interest# format strings for files/features monthyear ="{:02}".format(j) + year month ="{:02}".format(j)# Accessing traffic files from data directory# Note: the format of data changed in 2022 so we have to account for the delimiter changingif (i <22): df_month = pd.read_csv(f"data/Traffic/counts/AZ{monthyear}.VOL")else: df_month = pd.read_csv(f"data/Traffic/counts/AZ{monthyear}.VOL", delimiter='|')# converting all text to lowercase so merging works later df_month.columns = [col.lower() for col in df_month.columns]# sum hour counts to daily count for every line (one line is measurements per station per day) df_month[f'daily_volume_{month}'] = df_month[["hour_{:02}".format(i) for i inrange(24)]].sum(axis=1) # sum daily counts to monthly count per station df_month = df_month.groupby('station_id', as_index=False)[[f'daily_volume_{month}']].sum()# defining data types so merging later works df_month['station_id'] = df_month['station_id'].astype(str)# adding each month's total count to an annual df df_year = pd.merge(df_year, df_month, how='left', on='station_id')# summing monthly counts so we have an annual count df_year['traffic_counts'] = df_year[["daily_volume_{:02}".format(i) for i inrange(1,13)]].sum(axis=1)# maintaining features used in our research questions and dropping the rest for handleability df_year = df_year[['station_id','latitude','longitude', 'traffic_counts', 'county_code']]# initialize df for 5 year averagesif (year =='19'): df_traffic_5yavg = df_year.copy() df_traffic_5yavg[f'traffic_counts_{year}'] = df_year['traffic_counts']``````{python}# create averages and clean dfs####### 1. Calculate averages for 5 year average traffic df# remove column that was duplicate added during loopingdf_traffic_5yavg = df_traffic_5yavg.drop('traffic_counts',axis=1)# create average column from 2019-2023df_traffic_5yavg['traffic_counts_5yavg'] = df_traffic_5yavg[['traffic_counts_{:02}'.format(year) for year inrange(19,24)]].mean(axis=1)####### 2. 
make traffic_counts_above_5yavg feature# remove data from years not of interestdf_traffic_2022 = df_traffic_5yavg.drop(['traffic_counts_{:02}'.format(year) for year inrange(19,22)], axis=1)df_traffic = df_traffic_2022.drop('traffic_counts_23', axis=1)# add counts above 5yavg feature (Ref: https://saturncloud.io/blog/how-to-sum-two-columns-in-a-pandas-dataframe/)df_traffic['traffic_counts_above_5yavg'] = df_traffic_2022['traffic_counts_22'] - df_traffic_2022['traffic_counts_5yavg']# rename for consistency with environmental dfdf_traffic = df_traffic.rename({'traffic_counts_22': 'traffic_counts_2022'}, axis=1)# remove intermediary 5yavg featuredf_traffic = df_traffic.drop('traffic_counts_5yavg', axis=1)``````{python}# initiate gridspacegdf_traffic = grid_bin(df_traffic, num_x=num_x, num_y=num_y)# average values over gridspacegdf_traffic = grid_bin(df_traffic)gdf_traffic = gdf_traffic.groupby('gridspace', as_index=False)[['traffic_counts_2022', 'traffic_counts_above_5yavg', 'latitude', 'longitude', 'county_code']].mean() # important note: lat and lon averages are not very useful, but they will be replaced later. Just wanted to preserve the feature for rudimentary mappinggdf_traffic = gdf_traffic.rename(columns={'county_code': "fips_code"})# Add County Info for storm df mergingdf_county = pd.DataFrame({"fips_code": [1,3,5,7,9,11,12,13,15,17,19,21,23,25,27],"county": ["Apache", "Cochise", "Coconino", "Gila", "Graham", "Greenlee", "La Paz", "Maricopa", "Mohave", "Navajo", "Pima", "Pinal", "Santa Cruz", "Yavapai", "Yuma"]})gdf_traffic = pd.merge(gdf_traffic, df_county, on='fips_code', how='left')# Need to add geometry of gridspace to gdf_trafficgdf_traffic = add_grid_geometry(gdf_traffic)# Plot traffic data with grid geometry# US state and arizona shapefile for plottingstates = gpd.read_file("data/state_outline/cb_2018_us_state_20m/cb_2018_us_state_20m.shp") # path to downloaded shapefile for state outlinesstates = states.to_crs('EPSG:4326')arizona = states[states['NAME'] =='Arizona'] # filter for arizona onlyax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")# gdf was reverted to df during process, so converting to gdf againgdf_traffic = gpd.GeoDataFrame(gdf_traffic)gdf_traffic.plot('traffic_counts_2022', ax=ax, legend =True, legend_kwds={'label': "Traffic Count (1e7)", 'orientation': "vertical"})plt.title("Traffic in Arizona (2022)")plt.xlabel("Latitude")plt.ylabel("Longitude")plt.autoscale(False)plt.show()ax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")gdf_traffic.plot('traffic_counts_above_5yavg', ax=ax, legend =True, legend_kwds={'label': "Traffic Count (1e7)", 'orientation': "vertical"})plt.title("Traffic Above the 5-Year Average in Arizona (2019-2023)")plt.xlabel("Latitude")plt.ylabel("Longitude")plt.autoscale(False)plt.show()```### Climate Data (from NOAA)The climate (temperature and precipitation) data from NOAA comes from the Global Historical Climatology Network, combining daily climate observations from a variety of data sources, but primarily land-based stations. 
Many of these stations are updated daily.```{python}# load dataall_files = glob.glob("data/NOAA/"+"*.csv")list_of_dfs = []forfilein all_files: df = pd.read_csv(file) list_of_dfs.append(df)# merging data together into one data framenoaa = pd.concat(list_of_dfs, ignore_index =True)# make copynoaa1 = noaa# dropping columns that have too low non-null count (cannot extract valuable estimates from to fill null values)noaa1 = noaa1.drop(['DAPR', 'DASF', 'MDSF', 'PGTM', 'WDF2', 'WDF5', 'WESD', 'WESF'], axis =1)# noaa1.info()``````{python}# investigate missing values# print("Missing values (NaN): \n" + str(noaa1.isnull().sum()))unique_stations = noaa1['STATION'].unique()station_count = noaa1['STATION'].value_counts()# print("Stations: " + str(unique_stations))# print("Station count: " + str(station_count))# group by stationnoaaNum = noaa1.select_dtypes(include = np.number)# print(noaaNum)# For all values in the numerical column list from abovefor col in noaaNum.columns: noaa1[col] = noaa1[col].fillna( noaa1.groupby('STATION')[col].transform('median') )# print("Missing values NOW (NaN): \n" + str(noaa1.isnull().sum()))```The features from the NOAA data included the following:| Feature Name | Description | Use ||----------------------------|--------------------------|------------------|| STATION | name of land-based station for collection | aggregating climate data || LATITUDE | latitude of station position | combining datasets regionally || LONGITUDE | longitude of station position | combining datasets regionally || DATE | date (year-month-day) of data collection | aggregating climate data temporally || PRCP | precipitation (inches) | `rainfall_2022` and `rainfall_below_5yavg` || TAVG | average temperature (Fahrenheit) | `avg_temp_2022` and `avg_temp_above_5yavg` || TMAX | maximum temperature (Fahrenheit) | `max_temp_2022` and `max_temp_above_5yavg` |TMAX and TAVG were used to calculate the maximum and average temperatures (respectively) for 2022 and the 5-year average 2019-2023; then these results were used to determine the difference between the max or average temperature in 2022 and 2019-2023 average. PRCP was used to determine the average rainfall (in inches) in 2022 and 2019-2023 5-year average; likewise, the difference between the 2022 and 2019-2023 averages were calculated from these engineered features.```{python}# Date is in correct time formatnoaa1['DATE'] = pd.to_datetime(noaa1['DATE'])# Selecting variables to plotnoaa1_scatter_eda = noaa1[['DATE', 'PRCP', 'TAVG', 'TMAX', 'SNOW']]# noaa1_scatter_eda = noaa1[['DATE', 'AWND', 'EVAP', 'PRCP', 'SNOW', 'SNWD', 'TAVG', 'TMAX', 'TMIN', 'TOBS', 'WSF2', 'WSF5']]# Numeric columnsnum_cols = noaa1_scatter_eda.select_dtypes(include=np.number).columns# Create stacked subplotsfig, axes = plt.subplots(len(num_cols), 1, figsize=(12, 3*len(num_cols)), sharex=True)# for loop to plot all figures into gridfor i, col inenumerate(num_cols):# Scatter plot sns.scatterplot( data=noaa1_scatter_eda, x="DATE", y=col, ax=axes[i], s=5, alpha=0.5, color="blue", edgecolor=None) axes[i].set_ylabel(col) axes[i].grid(True, linestyle='--', alpha=0.5)axes[-1].set_xlabel("Date")fig.suptitle('Climate Features from NOAA')fig.tight_layout()plt.show()```It is easy to spot the cyclical nature of our climate measurements in the temperature data where each period spans one year. 
However, the pattern in precipitation and snow in inches over time is more difficult to identify.```{python}# Extract year from datenoaa1['YEAR'] = noaa1['DATE'].dt.year# Maximum temperature for 2022/stationmax_temp_2022 = noaa1[noaa1['YEAR'] ==2022].groupby('STATION')['TMAX'].max().rename('max_temp_2022')# Max temp in a 5-year average per stationmax_temp_5yavg = noaa1.groupby('STATION')['TMAX'].max().rename('max_temp_5yavg')# Average temperature for 2022 per stationavg_temp_2022 = noaa1[noaa1['YEAR'] ==2022].groupby('STATION')['TAVG'].mean().rename('avg_temp_2022')# Average temperature for 5-years/stationavg_temp_5yavg = noaa1.groupby('STATION')['TAVG'].mean().rename('avg_temp_5yavg')# Total inches of rainfall in 2022rainfall_2022 = noaa1[noaa1['YEAR'] ==2022].groupby('STATION')['PRCP'].sum().rename('rainfall_2022')# Total inches average for 5-year periodrainfall_5yavg = rainfall_5yavg = noaa1.groupby('STATION')['PRCP'].sum().rename('rainfall_5yavg')# Latitude per stationlatitude = noaa1.groupby('STATION')['LATITUDE'].first().rename('latitude')# Longitude per stationlongitude = noaa1.groupby('STATION')['LONGITUDE'].first().rename('longitude')features_2022 = pd.concat([ max_temp_2022, max_temp_5yavg, avg_temp_2022, avg_temp_5yavg, rainfall_2022, rainfall_5yavg, latitude, longitude], axis=1).reset_index()# Max temp above 5-year averagefeatures_2022['max_temp_above_5yavg'] = features_2022['max_temp_2022'] - features_2022['max_temp_5yavg']# Average temperature for 2022 - average temperaturefeatures_2022['avg_temp_above_5yavg'] = features_2022['avg_temp_2022'] - features_2022['avg_temp_5yavg']# Annual rainfall average minus annual rainfall in 2022features_2022['rainfall_below_5yavg'] = features_2022['rainfall_2022'] - features_2022['rainfall_5yavg']noaa_features = features_2022# noaa_features.info()``````{python}# create grid_bin dataframesgdf_noaa = grid_bin(noaa_features)gdf_noaa1 = gdf_noaa.groupby('gridspace', as_index=False)[['max_temp_2022', 'max_temp_above_5yavg', 'avg_temp_2022', 'avg_temp_above_5yavg', 'rainfall_2022', 'rainfall_below_5yavg', 'latitude', 'longitude']].mean() # important note: lat and lon averages are not very useful, but they will be replaced later. 
Just wanted to preserve the feature for rudimentary mapping# need to get geometry points backgeometry_noaa = gdf_noaa.groupby('gridspace')['geometry'].first().reset_index() gdf_noaa1 = gdf_noaa1.merge(geometry_noaa, on ='gridspace')# gdf_noaa1# convert back from df into gdfgdf_noaa1 = gpd.GeoDataFrame(gdf_noaa1, geometry="geometry", crs=gdf_noaa.crs)# US state and arizona shapefile for plottingstates = gpd.read_file("data/state_outline/cb_2018_us_state_20m/cb_2018_us_state_20m.shp") # path to downloaded shapefile for state outlinesstates = states.to_crs(gdf_noaa1.crs)arizona = states[states['NAME'] =='Arizona'] # filter for arizona only# example: plotting noaa data with arizona shapefileax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")gdf_noaa1.plot(column="max_temp_2022", ax=ax, cmap="viridis", markersize=30, legend=True, legend_kwds={'label': "Temperature (F)", 'orientation': "vertical"})plt.title("Average Temperature in Arizona (2022)")plt.xlabel('Latitude')plt.ylabel('Longitude')plt.autoscale(False)plt.show()ax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")gdf_noaa1.plot(column="rainfall_2022", ax=ax, cmap="viridis", markersize=30, legend=True, legend_kwds={'label': "Average Precipitation (inches)", 'orientation': "vertical"})plt.title("Average Precipitation in Arizona (2022)")plt.xlabel('Latitude')plt.ylabel('Longitude')plt.autoscale(False)plt.show()```### Storm Event Data (from NOAA)The storm event data is sourced from NOAA. Each row represented a weather event tied to a specific time and geographic zone, with end dates and times recorded some down to the exact minute. The data also incorporated multiple observation sources, including the public, broadcast media, trained spotters, and forest or park services. Additionally, it contained fields detailing property and crop damage, injuries, and event magnitude. 
However, a lot of this meta data contained empty cells which indicated some data reliability issues for certain fields.Relevant Variables in the dataset (these are the storm event variables we used for modeling and visualizations):``` BEGIN_DATE – date the storm event beganYEAR – extracted from BEGIN_DATE, used to filter and group dataMAGNITUDE – numeric severity/intensity of the storm eventCZ_FIPS – county FIPS code identifier used to group data by county```From these base variables, we created several engineered features:``` df_2022 – subset of data for storm events that occurred in 2022lowmag_2022 – number of low magnitude storm events in 2022lowmag_5yavg – 5-year average (2019–2023) of low magnitude storm events per yearhighmag_2022 – number of high magnitude storm events in 2022highmag_5yavg – 5-year average (2019–2023) of high magnitude storm events per yearavg_mag_2022 – average storm magnitude for each county/zone in 2022```The engineered features (lowmag, highmag, and avg_mag) were used to compare storm activity in 2022 against the 5-year averages and to evaluate how storm intensity varied across counties.```{python}# Path to your folderfolder_path ="data/StormEvents/"# Create a list of all matching files (2019–2023)csv_files =sorted(glob.glob(os.path.join(folder_path, "storm_data_AZ_*.csv")))# Load and concatenate all CSVsdf_list = []forfilein csv_files: df_year = pd.read_csv(file) df_list.append(df_year)# Merge into one dataframedf_all = pd.concat(df_list, ignore_index=True)# Check the result# print("Shape:", df_all.shape)# print("Years combined:", [os.path.basename(f) for f in csv_files])# print(df_all.head())# Count frequency of each storm event typeevent_counts = df_all['EVENT_TYPE'].value_counts().reset_index()event_counts.columns = ['EVENT_TYPE', 'FREQUENCY']# Create scatter plotplt.figure(figsize=(10,6))plt.scatter(event_counts['EVENT_TYPE'], event_counts['FREQUENCY'], color='blue')# Add titles and labelsplt.title('Frequency of Storm Event Types')plt.xlabel('Storm Event Type')plt.ylabel('Frequency')plt.xticks(rotation=45, ha='right')plt.tight_layout()plt.show()# Rename data framestorm_df = df_all# storm_df.info()``````{python}# Make sure dates and magnitude are correct typestorm_df['BEGIN_DATE'] = pd.to_datetime(storm_df['BEGIN_DATE'])storm_df['YEAR'] = storm_df['BEGIN_DATE'].dt.yearstorm_df['MAGNITUDE'] = pd.to_numeric(storm_df['MAGNITUDE'], errors='coerce')# Define threshold for low/high magnitudelow_mag_threshold =1.0# Filter for recent yearsrecent_df = storm_df[storm_df['YEAR'].between(2019, 2023)]df_2022 = storm_df[storm_df['YEAR'] ==2022]# Group by county (using CZ_FIPS here)# Low magnitude storm events in 2022 per countylowmag_2022 = df_2022[df_2022['MAGNITUDE'] < low_mag_threshold].groupby('CZ_FIPS').size().rename('lowmagstorm_events_2022')# 5-year average low magnitude storms per countylowmag_5yavg = recent_df[recent_df['MAGNITUDE'] < low_mag_threshold].groupby(['YEAR', 'CZ_FIPS']).size().groupby('CZ_FIPS').mean().rename('lowmagstorm_5yavg')# High magnitude storm events in 2022 per countyhighmag_2022 = df_2022[df_2022['MAGNITUDE'] >= low_mag_threshold].groupby('CZ_FIPS').size().rename('highmagstorm_events_2022')# 5-year average high magnitude storms per countyhighmag_5yavg = recent_df[recent_df['MAGNITUDE'] >= low_mag_threshold].groupby(['YEAR', 'CZ_FIPS']).size().groupby('CZ_FIPS').mean().rename('highmagstorm_5yavg')# Average storm magnitude in 2022 per countyavg_mag_2022 = df_2022.groupby('CZ_FIPS')['MAGNITUDE'].mean().rename('average_storm_mag_2022')# Combine 
all of the featuresstorm_features_2022 = pd.concat([ lowmag_2022, lowmag_5yavg, highmag_2022, highmag_5yavg, avg_mag_2022], axis=1).reset_index()# Calculate how many low-magnitude storm events in 2022 exceeded the 5-year average# This is done by subtracting the 5-year average number of low-magnitude storms# from the actual count in 2022, per county.storm_features_2022['lowmagstorm_events_above_5yavg'] = ( storm_features_2022['lowmagstorm_events_2022'] - storm_features_2022['lowmagstorm_5yavg'])# Calculate how many high-magnitude storm events in 2022 exceeded the 5-year average# This is the difference between the 2022 count and the 5-year average for high-magnitude storms,# computed per county.storm_features_2022['highmagstorm_events_above_5yavg'] = ( storm_features_2022['highmagstorm_events_2022'] - storm_features_2022['highmagstorm_5yavg'])# storm_features_2022.info()```### Data IntegrityThe datasets we had access to had integrity issues. Meaning the collection methods were inconsistent which led to many missing values. A leading contributor to this problem is the large size of the sampling range. Climate stations are often designed to be operated remotely, meaning any hardware issues that arise have delays in repair time, leading to frequent gaps in collection. Another integrity issue relates to the storm event data which was uniquely granular in both space and time. While the traffic and climate stations were measured relatively consistently over time, storm event data were highly qualitative in nature and only included measurements from when storms were present.## Engineered Features### Climate Features| Feature | Description ||------------------------------|------------------------------------------|| max_temp_2022 | maximum temperature for 2022 || max_temp_above_5yavg | maximum temperature for 2022 minus maximum temperature averaged 2019-2023 || avg_temp_2022 | average temperature for 2022 || avg_temp_above_5yavg | average temperature for 2022 minus average temperature averaged 2019-2023 || rainfall_2022 | total inches of rainfall in 2022 || rainfall_below_5yavg | inches of annual rainfall averaged 2019-2023 minus inches of annual rainfall in 2022 || lowmagstorm_events_2022 | sum of low magnitude storm events in 2022 || lowmagstorm_events_above_5yavg | sum of low magnitude storm events for 2022 minus sum of low magnitude storm events averaged 2019-2023 || highmagstorm_events_2022 | sum of high magnitude storm events in 2022 || highmagstorm_events_above_5yavg | sum of high magnitude storm events for 2022 minus sum of high magnitude storm events averaged 2019-2023 || average_storm_mag_2022 | average storm event magnitude for 2022 |### Urbanization Features| Feature | Description ||------------------------------|------------------------------------------|| traffic_counts | total number of vehicles detected in 2022 || traffic_counts_above_5yavg | traffic counts in 2022 minus traffic counts averaged 2019-2023 |# Preprocessing### Geospatial NormalizationSince our research question is about regional relationships, we had to combine our data regionally. Since each dataset had unique geographic qualities, it was not a straightforward process. The two main forms of location information given in our datasets were latitude/longitude pairs and county names. Since latitude and longitude are specific and discrete, we were unable to merge on latitude and longitude pair points alone. 
In order to merge on specific points, it would mean that the traffic station, the climate measurement, and the storm identification would have had to occur at the exact same location. To account for this, we implemented geospatial normalization in the form of a gridspace.Our gridspace is a 2-dimensional grid overlayed on the state of Arizona that encompasses pre-defined ranges of latitude and longitude. We manufactured a grid of 336 spaces that is 16 cells horizontally and 21 cells vertically. Any data that had a station location that was within a given gridspace would be merged into that gridspace and assigned a gridspace number and a gridspace geometry. In cartography, geometries describe the shape of a region. Generally speaking, geometries include Points, Lines,and Polygons (https://geopandas.org/en/stable/docs/user_guide/data_structures.html). The traffic and NOAA station data include latitude and longitude information, which would be classified as a Point. The gridspace geometry however is a rectangle which would be considered a Polygon. This form of normalization is helpful because it ensures all data share the same projection information and it assures the regions are the same size. We utilized the GeoPandas Spatial Join (`geopandas.sjoin()`) function to spatially join our data which we implemented as we merged our datasets. Spatial joining works by joining GeoDataFrames on geometries that intersect. So traffic and climate data would be grouped together in the same gridspace if they fell within the same region.First, we combined the NOAA climate data and the traffic data. When we grouped data that had multiple station locations within the same gridspace, we assigned the mean of those measurements to be the feature value for that gridspace. The spatial join function only maintains gridspaces that are represented in both datasets. This means that our initial gridspace of `{python}grid_size` was decreased to \~80 gridspaces. An interesting problem that arose was with the storm event data which did not include latitude and longitude values. Since storms travel across regions, the dataset included a county variable and starting and ending latitude and longitude values. We decided to preserve the county feature from the raw traffic data to merge on. This made the storm values unique in that a single storm event would be represented in every gridspace within the county in which it occurred.For our regression analyses, our final DataFrame was structured such that each row corresponded to a gridspace and each column was a traffic, climate, or storm event value.### Temporal NormalizationWhile we have addressed regional normalization using our `gridspace` feature, we had to find a way to normalize our data in the time dimension. The most common approach to see how features change over time is to perform regression where the independent variable is time. Common examples of this are seeing how a stock's value changes over time, or seeing how temperature for one city changes day to day. Since our independent variable changes region by region, we accounted for temporal variability by comparing a year of interest (2022) to a dataset of averages for the same features over a 5 year time span (2019-2023). 
We did this through engineering features that were differences between values measured in 2022 and values that were measured over the 5 year average.```{python}# US state and arizona shapefile for plottingstates = gpd.read_file("data/state_outline/cb_2018_us_state_20m/cb_2018_us_state_20m.shp") # path to downloaded shapefile for state outlinesstates = states.to_crs('EPSG:4326')arizona = states[states['NAME'] =='Arizona'] # filter for arizona onlysns.set_theme(style='white')ax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")# gdf was reverted to df during process, so converting to gdf againgdf_traffic = gpd.GeoDataFrame(gdf_traffic)gdf_traffic.plot('traffic_counts_2022', ax=ax, legend =True, legend_kwds={'label': "Traffic Count (1e7)", 'orientation': "vertical"})plt.title("Traffic in Arizona (2022)")plt.xlabel("Latitude")plt.ylabel("Longitude")plt.autoscale(False)plt.show()``````{python}ax = arizona.plot(figsize=(10, 8), color="white", edgecolor="black")gdf_traffic.plot('traffic_counts_above_5yavg', ax=ax, legend =True, legend_kwds={'label': "Traffic Count (1e7)", 'orientation': "vertical"})plt.title("Traffic Above the 5-Year Average in Arizona (2019-2023)")plt.xlabel("Latitude")plt.ylabel("Longitude")plt.autoscale(False)plt.show()``````{python}### Merge Traffic and NOAA Datasets on `gridspace`gdf_traffic_premerge = gdf_traffic.drop(['latitude', 'longitude'], axis=1)# remove Point geometry to preserve gridspace geometry from traffic dfgdf_noaa_premerge = gdf_noaa1.drop(['latitude', 'longitude', 'geometry'], axis=1)# Merging removes gridspaces that are not in the left dataset (traffic)gdf_traffic_noaa = pd.merge(gdf_traffic_premerge, gdf_noaa_premerge, on='gridspace')# gdf_traffic_noaa.head()``````{python}### Merge Storm and Traffic/NOAA Dataset on `fips_code`df_storm_premerge = storm_features_2022.rename({'CZ_FIPS': 'fips_code'}, axis=1)df_storm_premerge.columnsmerged_df = pd.merge(gdf_traffic_noaa, df_storm_premerge)# merged_df.isnull().sum()```### Data PreprocessingThe prepare our data for regression, we removed outliers with a 5% and 95% tolerance range. Then we filled missing values scaled our features using StandardScaler. The low integrity of the dataset meant that we had issues with data being flagged as outliers. This was because the overabundance of NaNs led to the fill values being over-represented, classifying the rest of the dataset as outliers. 
We believe this to be a leading contributor to our poor model performance.```{python}# For numerical columns, fill nulls with meannumerical_cols = ['lowmagstorm_events_2022', 'lowmagstorm_5yavg', 'highmagstorm_events_2022','highmagstorm_5yavg', 'average_storm_mag_2022', 'lowmagstorm_events_above_5yavg','highmagstorm_events_above_5yavg', 'max_temp_above_5yavg', 'avg_temp_above_5yavg','rainfall_below_5yavg', 'traffic_counts_2022', 'traffic_counts_above_5yavg', 'max_temp_2022', 'avg_temp_2022', 'rainfall_2022']for col in numerical_cols: merged_df[col].fillna(merged_df[col].mean(), inplace=True)# print(merged_df.isnull().sum())# Remove outliers - capping (winsorizing) imputationnumeric_cols = merged_df.select_dtypes(include=[np.number]).columnsfor col in numeric_cols: lower_limit = np.percentile(merged_df[col], 5) upper_limit = np.percentile(merged_df[col], 95)``````{python}# Test# applying log normalization to traffic counts for merged dataset# log_df = merged_df.copy()# log_df = np.log(log_df.select_dtypes(include=[np.number]))# # log_df[['traffic_counts_2022', 'traffic_counts_above_5yavg', 'avg_temp_2022']] = np.log(merged_df[['traffic_counts_2022', 'traffic_counts_above_5yavg', 'avg_temp_2022']])# # log_df = log_df.drop(['gridspace'], axis=1)# log_df.boxplot()# plt.xticks(rotation=45)``````{python}# Initialize the scalerscaler = StandardScaler()# Select numeric columnsnumeric_cols = merged_df.select_dtypes(include='number').columns# Fit and transformmerged_df_scaled = merged_df.copy()merged_df_scaled[numeric_cols] = scaler.fit_transform(merged_df[numeric_cols])# print(merged_df_scaled)merged_df_scaled.boxplot()plt.xticks(rotation=60, fontsize=8)plt.title('Box and Whisker Plot of Engineered Features after StandardScaler')plt.show()# sns.pairplot(merged_df_scaled)``````{python}# 2022 urbanization metricsurban_2022 = ['traffic_counts_2022',# 'gridspace' # This seems static/spatial, so keep with 2022 data]# Urbanization metrics relative to 5-year averageurban_above_5yavg = ['traffic_counts_above_5yavg']# 2022 environmental metricsenv_2022 = ['max_temp_2022','avg_temp_2022','rainfall_2022','lowmagstorm_events_2022','highmagstorm_events_2022','average_storm_mag_2022']# Environmental metrics relative to 5-year average (anomalies)env_above_below_5yavg = ['max_temp_above_5yavg','avg_temp_above_5yavg','rainfall_below_5yavg', # This one is "below", but similar anomaly concept'lowmagstorm_events_above_5yavg','highmagstorm_events_above_5yavg']df_5yavg = merged_df_scaled[urban_above_5yavg + env_above_below_5yavg]df_2022 = merged_df_scaled[urban_2022 + env_2022]sns.pairplot(df_5yavg)plt.show()sns.pairplot(df_2022)plt.show()```We grouped our data into climate and urbanization (target) categories. We split each set of categories into 2022 values and 5 year average values. We performed PCA analysis to decrease our dataset preserving only the most prominent features, and then performed one hot encoding on categorical values to convert object types to numerical types.### PCA Results```{python}# Keep only numeric columnsnumeric_cols = merged_df_scaled.select_dtypes(include='number').columnsdf_numeric = merged_df_scaled[numeric_cols]# Run PCA to get 95% variancepca = PCA(n_components=0.95)reduced_data = pca.fit_transform(df_numeric)# Loading scoresloading_scores = pd.DataFrame(pca.components_, columns=numeric_cols)threshold =0.3top_features = {}for i inrange(loading_scores.shape[0]):# Name the component (e.g., "Component 1", "Component 2", etc.) 
component =f"Component {i+1}"# Get the loading scores for this principal component scores = loading_scores.iloc[i]# Identify features with absolute loading scores above the threshold top_features_for_component = scores[abs(scores) > threshold].index.tolist()# Store the top features for this component in the dictionary top_features[component] = top_features_for_component# Print resultsfor component, features in top_features.items():print(f"{component}: {features}")``````{python}# Transform Catagorical Variables into Numeric Labels# One hot encoding for catagorical variables with multiple categoriescategorical_cols = merged_df_scaled.select_dtypes(include='object').columnsmerged_df_scaled = pd.get_dummies(merged_df_scaled, columns= categorical_cols, drop_first=True)# Convert all bool columns to int (0/1) bool_cols = merged_df_scaled.select_dtypes(include='bool').columnsmerged_df_scaled[bool_cols] = merged_df_scaled[bool_cols].astype(int)# print(merged_df_scaled.head())# print(merged_df_scaled.dtypes)```# Regression Analysis### Linear Regression ModelsWe utilized Ordinary Least Squares Regression, Ridge Regression, and Lasso Regression. We used both Ridge Regression and Lasso Regression to see if there were any prominently predictive features that we could further investigate. We could determine this if the Lasso Regression performed significantly better than the Ridge Regression, because the Lasso Regression omits features of low influence while Ridge Regression preserves them. We grouped the 2022 data and the 5 year average (2019-2023) data in every combination during regression analysis to investigate any underlying relationships.```{python}def run_regression_models(X_val, y_val): X = merged_df_scaled[X_val] # Independent variable y = merged_df_scaled[y_val] # Dependent variable X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.2, random_state =42)# _______________ OLS Regression _______________ X_train_with_const = sm.add_constant(X_train) model = sm.OLS(y_train, X_train_with_const).fit() model_summary = model.summary2()# print(model_summary)# Add constant to X_test X_test_const = sm.add_constant(X_test)# Predict using trained model y_pred = model.predict(X_test_const) mse = mean_squared_error(y_test, y_pred) # mean squared error rmse = np.sqrt(mse) # root mse r2 = r2_score(y_test, y_pred) # r^2 on test set# Print test performanceprint("OLS Regression Model:")print("\tMean-squared error:", mse)print("\tRoot mean-squared error:", rmse)print("\tR-squared value:", r2)# print(f"MSE: {mse:.4f}")# print(f"RMSE: {rmse:.4f}")# print(f"R²: {r2:.4f}")# Define the alpha values to try alpha_values = [0.01, 0.1, 1, 10, 100,1000,10000]# Initialize variables to store the best alpha and its corresponding MSE best_alpha =None best_mse = np.inf# Cross-validation to select the best alpha kf = KFold(n_splits=5, shuffle=True, random_state=42)# _______________ Ridge Regression _______________for alpha in alpha_values:# Create Ridge regression model with current alpha ridge = Ridge(alpha=alpha)# Perform cross-validation and get negative MSE scores cv_scores = cross_val_score(ridge, X_train, y_train, cv=kf, scoring='neg_mean_squared_error') mean_cv_mse =-np.mean(cv_scores)# Update best alpha if current mean CV MSE is lowerif mean_cv_mse < best_mse: best_mse = mean_cv_mse best_alpha = alphaprint(f"\nRidge best alpha selected: {best_alpha} with CV MSE: {best_mse}")# _______________ Ridge Regression _______________# Initialize Ridge regression with the best alpha ridge_reg = 
```{python}
def run_regression_models(X_val, y_val):
    X = merged_df_scaled[X_val]  # Independent variables
    y = merged_df_scaled[y_val]  # Dependent variable
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # _______________ OLS Regression _______________
    X_train_with_const = sm.add_constant(X_train)
    model = sm.OLS(y_train, X_train_with_const).fit()
    model_summary = model.summary2()
    # print(model_summary)

    # Add constant to X_test
    X_test_const = sm.add_constant(X_test)
    # Predict using the trained model
    y_pred = model.predict(X_test_const)

    mse = mean_squared_error(y_test, y_pred)  # mean squared error
    rmse = np.sqrt(mse)                       # root mean squared error
    r2 = r2_score(y_test, y_pred)             # R^2 on the test set

    # Print test performance
    print("OLS Regression Model:")
    print("\tMean-squared error:", mse)
    print("\tRoot mean-squared error:", rmse)
    print("\tR-squared value:", r2)
    # print(f"MSE: {mse:.4f}")
    # print(f"RMSE: {rmse:.4f}")
    # print(f"R²: {r2:.4f}")

    # Define the alpha values to try
    alpha_values = [0.01, 0.1, 1, 10, 100, 1000, 10000]

    # Initialize variables to store the best alpha and its corresponding MSE
    best_alpha = None
    best_mse = np.inf

    # Cross-validation to select the best alpha
    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    # _______________ Ridge Regression _______________
    for alpha in alpha_values:
        # Create a Ridge regression model with the current alpha
        ridge = Ridge(alpha=alpha)
        # Perform cross-validation and get negative MSE scores
        cv_scores = cross_val_score(ridge, X_train, y_train, cv=kf, scoring='neg_mean_squared_error')
        mean_cv_mse = -np.mean(cv_scores)
        # Update the best alpha if the current mean CV MSE is lower
        if mean_cv_mse < best_mse:
            best_mse = mean_cv_mse
            best_alpha = alpha
    print(f"\nRidge best alpha selected: {best_alpha} with CV MSE: {best_mse}")

    # Refit Ridge regression with the best alpha
    ridge_reg = Ridge(alpha=best_alpha)
    ridge_reg.fit(X_train, y_train)
    # Predict on the testing set
    y_pred = ridge_reg.predict(X_test)
    # Evaluate performance
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    print("Ridge Regression Model:")
    print("\tMean-squared error:", mse)
    print("\tRoot mean-squared error:", rmse)
    print("\tR-squared value:", r2)

    # _______________ Lasso Regression _______________
    # Reset the search so the Ridge results do not carry over
    best_alpha = None
    best_mse = np.inf
    for alpha in alpha_values:
        # Create a Lasso regression model with the current alpha
        lasso = Lasso(alpha=alpha, max_iter=10000)
        # Perform cross-validation and get negative MSE scores
        cv_scores = cross_val_score(lasso, X_train, y_train, cv=kf, scoring='neg_mean_squared_error')
        mean_cv_mse = -np.mean(cv_scores)
        # Update the best alpha if the current mean CV MSE is lower
        if mean_cv_mse < best_mse:
            best_mse = mean_cv_mse
            best_alpha = alpha
    print(f"\nLasso best alpha selected: {best_alpha} with CV MSE: {best_mse}")

    # Refit Lasso regression with the best alpha
    lasso_reg = Lasso(alpha=best_alpha, max_iter=10000)
    lasso_reg.fit(X_train, y_train)
    # Predict on the testing set
    y_pred = lasso_reg.predict(X_test)
    # Evaluate performance
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    print("Lasso Regression Model:")
    print("\tMean-squared error:", mse)
    print("\tRoot mean-squared error:", rmse)
    print("\tR-squared value:", r2)
```

```{python}
print('Environmental Data (5 Year Averages)')
print('Traffic Data (5 Year Averages)')
run_regression_models(env_above_below_5yavg, 'traffic_counts_above_5yavg')
```

```{python}
print('Environmental Data (2022)')
print('Traffic Data (2022)')
run_regression_models(env_2022, 'traffic_counts_2022')
```

```{python}
print('Environmental Data (2022)')
print('Traffic Data (5 Year Averages)')
run_regression_models(env_2022, 'traffic_counts_above_5yavg')
```

```{python}
print('Environmental Data (5 Year Averages)')
print('Traffic Data (2022)')
run_regression_models(env_above_below_5yavg, 'traffic_counts_2022')
```

### Polynomial regression

We fit a second-order polynomial regression to see whether it would perform better than the linear regression models.
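Concretely, with two illustrative predictors $x_1$ and $x_2$, the degree-2 expansion produced by `PolynomialFeatures(degree=2)` corresponds to fitting

$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2,
$$

i.e., the linear terms plus all squared and pairwise interaction terms.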
We found that the quadratic regression also performed worse than simply predicting the mean.

```{python}
def poly_reg(X_val, y_val, degree=2):
    X = merged_df_scaled[X_val].values
    y = merged_df_scaled[y_val].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Transform the data for polynomial regression
    poly_features = PolynomialFeatures(degree=degree)  # Adjust degree as necessary
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)

    # Initialize the Linear Regression model
    linear_reg = LinearRegression()
    # Fit the model with polynomial features
    linear_reg.fit(X_train_poly, y_train)
    # Predict on the testing set
    y_pred = linear_reg.predict(X_test_poly)

    # Calculate and print the MSE and R-squared
    mse_initial = mean_squared_error(y_test, y_pred)
    r2_initial = r2_score(y_test, y_pred)
    print(f'Mean Squared Error: {mse_initial}')
    print(f'R-squared: {r2_initial}')
```

```{python}
print('Environmental Data (5 Year Averages)')
print('Traffic Data (5 Year Averages)')
poly_reg(env_above_below_5yavg, 'traffic_counts_above_5yavg', degree=2)
```

```{python}
print('Environmental Data (2022)')
print('Traffic Data (2022)')
poly_reg(env_2022, 'traffic_counts_2022', degree=2)
```

```{python}
print('Environmental Data (2022)')
print('Traffic Data (5 Year Averages)')
poly_reg(env_2022, 'traffic_counts_above_5yavg', degree=2)
```

```{python}
print('Environmental Data (5 Year Averages)')
print('Traffic Data (2022)')
poly_reg(env_above_below_5yavg, 'traffic_counts_2022', degree=2)
```

# Discussion

- The overall low predictive power across all of our models suggests that traffic counts alone are not an accurate urbanization metric. There is also a concern that these datasets are not well suited to our research question. Narrowing our scope, either to a single county or to a single season, may help control for natural variability and data-integrity issues in our dataset.
- Our PCA analysis indicates that extreme storm events and temperature show the strongest interactions among the features in our dataset.
- The polynomial regression model did not perform better than our linear regression models. Both model types produced negative R-squared values, meaning that simply predicting the numerical average has more predictive power than our models (see the note on $R^2$ after this list).
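For context, the test-set $R^2$ reported by `r2_score` is

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
$$

so a negative value means the model's squared error on the test set exceeds that of a constant prediction at the mean $\bar{y}$ of the observed test values.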
### Future improvements

Some ideas for improvement when investigating our research question include:

- Using more complete datasets. The NOAA dataset carries a use advisory stating that the data are intended mainly for hydrological and agricultural purposes and that, because of their incompleteness, they are not sufficient for climate change research.
- Normalizing for traffic congestion. Congestion accounts for how long a vehicle stays on the road in a specific area, which depends on gross count, number of lanes, and vehicle speed. Because Interstate 8, Interstate 10, and Interstate 40 are major shipping corridors represented in our data, they may show high gross counts yet low urbanization, as trucks pass through rural areas at high speed.
- Modeling with PCA clusters of variables of interest. Reducing the number of features may lead to better R-squared values because the regression fits would be less affected by extraneous features.

# References

1. Grid Construction Reference
    1. Brennan, J. (2020). Fast and easy gridding of point data with geopandas. <https://james-brennan.github.io/posts/fast_gridding_geopandas/>
2. Traffic Data
    1. U.S. Department of Transportation, Federal Highway Administration. (n.d.). Travel Monitoring. <https://www.fhwa.dot.gov/policyinformation/travel_monitoring/tvt.cfm>
    2. Transportation Research Board. The Congestion Mitigation and Air Quality Improvement Program. <https://onlinepubs.trb.org/onlinepubs/sr/sr264.pdf>
    3. World Health Organization. Health Effects of Transport-Related Air Pollution. <https://books.google.com/books?hl=en&lr=&id=b2G3k51rd0oC&oi=fnd&pg=PR1&ots=O94u9zCs4y&sig=v1JEfno-HjkbKf3im4iDQLJJUQ0#v=onepage&q&f=false>
3. Storm Data
    1. NOAA National Centers for Environmental Information. (n.d.). Storm Events Database. <https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/>
4. Climate Data
    1. NOAA National Centers for Environmental Information. (n.d.). Climate Data. <https://www.ncdc.noaa.gov/cdo-web/search;jsessionid=2851FE4A6791F4BF429988727483E450>
    2. Global Historical Climatology Network. (n.d.). Daily Documentation. <https://www.ncei.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf>
5. Trenberth, K. E. (2007). The Impact of Climate Change and Variability on Heavy Precipitation, Floods and Droughts. <https://web.archive.org/web/20170809113035id_/http://www.cgd.ucar.edu/cas/Trenberth/website-archive/trenberth.papers-moved/The%20Impact%20of%20Climate%20Change%20on%20the%20Water%20Cycle%20v%204_ss.pdf>
6. Zhang, M., Gao, Y., & Ge, J. (2025). Different responses of extreme and mean precipitation to land use and land cover changes. <https://www.nature.com/articles/s41612-025-01049-1>
7. Saturn Cloud. (2023). How to Sum Two Columns in a Pandas DataFrame. <https://saturncloud.io/blog/how-to-sum-two-columns-in-a-pandas-dataframe/>