Urbanization and Environmental Quality in Arizona, USA

Project Final: Workspace

Final Project for course INFO 523: Data Mining and Discovery
Author: JKP (Vera Jackson, Molly Kerwick, Brooke Pacheco)

Affiliation: College of Information Science, University of Arizona

Introduction

The goal of our project is to analyze the relationship between urbanization and environmental quality across regions within Arizona, identifying patterns and anomalies in how urban growth impacts climate metrics such as storms, temperature, and rainfall. As the uproar among Tucson residents over the proposed environmental impacts of the Project Blue data center has shown, environmental quality is important to local populations. We hope to use this study to identify the regional relationship between urbanization and climate indicators. We use traffic count data as our urbanization metric, and National Oceanic and Atmospheric Administration (NOAA) climate and storm event data as our environmental quality data.

Research Questions

Can we predict the environmental quality of a region based on urbanization indicators for that region?
The reason we chose this question is that if we can accurately predict how urbanization affects climate, policymakers can proactively work to preserve or improve the environmental quality of that region.

Are storm event data and traffic data successful environmental health and urbanization indicators, respectively?
The infrastructure for traffic monitoring is affordable to implement and already has a framework for deployment. If traffic volume is an indicator of environmental quality, we could use it to better study areas of climatological interest.

Is the relationship between urbanization level and climate indicators better described by a linear or quadratic model?
This will indicate how accurate our scope is compared to other environmental impact studies.

Can principal component analysis (PCA) be used to compare regions throughout Arizona based on traffic, climate, and storm event data?
If our hypothesis that high urbanization level leads to low environmental quality is correct, we would expect to see similarities between metropolitan regions. Additionally, we might be able to identify regions with unknown similarities.

Proposed Procedure

This project will consist of 2 studies:

Study 1: A regional regression model of environmental quality features (storm events and climate data) as a function of our urbanization target feature (traffic volume). We will do this with multivariate regression models, with a hypothesis that there is a negative linear relationship between urbanization and environmental quality. We will train a linear regression model and a quadratic regression model and compare the two.

Study 2: A PCA analysis of the state of Arizona comparing both climate data and traffic data to identify similar regions.

Data

Traffic Data

Traffic congestion is a major source of air pollution in and around traffic areas (Transportation Research Board). Consequently, high traffic congestion has been shown to have a negative health impact on drivers, pedestrians, and residents in areas near roadways (World Health Organization). We selected traffic data as our urbanization metric because monitoring was fairly robust across areas of different types (e.g., rural, urban, agricultural). We measured traffic volume as the gross count of vehicles in a given year. The traffic data were sourced from the Department of Transportation (DOT) Federal Highway Administration (FHWA) and included traffic count data and monitoring station data.

The traffic count data were separated into files for each month spanning our five years of interest (2019-2023). Each row was a single day of counts at a single traffic station, with one column for each hour of that day. To aggregate our traffic count data, we summed the hourly counts into a gross daily count, and then grouped the dataset by station_id, summing the newly created daily traffic counts. This led to very high counts, on the order of 10^7.

The traffic station data held the geographic information for the traffic stations. Each row was a different station location deployed by FHWA that year. It included the features station_id, latitude, longitude, and county_code. We omitted features like the number of lanes and the direction of traffic flow. The traffic count data and station data were merged on station_id, which linked the geographic location with the traffic counts.
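The aggregation described above can be sketched roughly as follows. This is a minimal illustration on toy data, not our exact pipeline: the hourly column names (hour_00 to hour_23) and all values are assumptions made for the example, while station_id, latitude, longitude, and county_code follow the station file described above.

```python
import pandas as pd

# Toy stand-in for the FHWA count files: one row per station per day,
# one column per hour. The hour_XX column names are hypothetical.
hour_cols = [f"hour_{h:02d}" for h in range(24)]
counts = pd.DataFrame(
    [
        ["A1", "2022-01-01"] + [10] * 24,
        ["A1", "2022-01-02"] + [20] * 24,
        ["B2", "2022-01-01"] + [5] * 24,
    ],
    columns=["station_id", "date"] + hour_cols,
)

# Toy stand-in for the station file (invented coordinates and codes).
stations = pd.DataFrame(
    {
        "station_id": ["A1", "B2"],
        "latitude": [32.22, 33.45],
        "longitude": [-110.97, -112.07],
        "county_code": [19, 13],
    }
)

# Step 1: sum the 24 hourly columns into a gross daily count.
counts["daily_count"] = counts[hour_cols].sum(axis=1)

# Step 2: sum the daily counts per station to get the gross count.
gross = (
    counts.groupby("station_id", as_index=False)["daily_count"]
    .sum()
    .rename(columns={"daily_count": "traffic_counts"})
)

# Step 3: attach each station's location by merging on station_id.
traffic = gross.merge(stations, on="station_id", how="inner")
print(traffic)
```

Because every hour of every day is summed, a station that records a few thousand vehicles per hour quickly reaches totals in the tens of millions over a year.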

A potential issue in our approach is that gross counts alone are not an indicator of traffic congestion. Traffic congestion takes into account how long a car is on the road in a specific area, which is affected by gross count, number of lanes, and car speed. As Interstate-8, Interstate-10, and Interstate-40 are major shipping lanes that are represented in our data, they may be associated with high gross count but a low urbanization factor as trucks travel at high speed through rural areas.

Climate Data (from NOAA)

The climate (temperature and precipitation) data from NOAA comes from the Global Historical Climatology Network, combining daily climate observations from a variety of data sources, but primarily land-based stations. Many of these stations are updated daily.

The features from the NOAA data included the following:

| Feature Name | Description | Use |
| --- | --- | --- |
| STATION | name of land-based station for collection | aggregating climate data |
| LATITUDE | latitude of station position | combining datasets regionally |
| LONGITUDE | longitude of station position | combining datasets regionally |
| DATE | date (year-month-day) of data collection | aggregating climate data temporally |
| PRCP | precipitation (inches) | rainfall_2022 and rainfall_below_5yavg |
| TAVG | average temperature (Fahrenheit) | avg_temp_2022 and avg_temp_above_5yavg |
| TMAX | maximum temperature (Fahrenheit) | max_temp_2022 and max_temp_above_5yavg |

TMAX and TAVG were used to calculate the maximum and average temperatures, respectively, for 2022 and for the 5-year average (2019-2023); these results were then used to determine the difference between the maximum or average temperature in 2022 and the 2019-2023 average. Likewise, PRCP was used to determine the average rainfall (in inches) in 2022 and the 2019-2023 5-year average, and the difference between the two was calculated from these engineered features.
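As a rough sketch of how such features can be derived with pandas, the example below computes a 2022 value, a 2019-2023 average, and their difference for one toy station. The observations are invented, and taking the 5-year average as the mean of the yearly summaries is an assumption made for illustration.

```python
import pandas as pd

# Toy daily observations for one station; all values are invented.
obs = pd.DataFrame(
    {
        "STATION": ["S1"] * 6,
        "DATE": pd.to_datetime(
            ["2019-07-01", "2020-07-01", "2021-07-01",
             "2022-07-01", "2022-07-02", "2023-07-01"]
        ),
        "TMAX": [100.0, 102.0, 104.0, 108.0, 110.0, 106.0],
        "PRCP": [0.1, 0.0, 0.3, 0.0, 0.2, 0.4],
    }
)
obs["year"] = obs["DATE"].dt.year

# One summary row per station and year.
yearly = obs.groupby(["STATION", "year"]).agg(
    max_temp=("TMAX", "max"),
    rainfall=("PRCP", "sum"),
)

# 2022 values, and the average of the yearly summaries over 2019-2023
# (assumed definition of the 5-year average).
val_2022 = yearly.xs(2022, level="year")
avg_5y = yearly.groupby(level="STATION").mean()

features = pd.DataFrame(
    {
        "max_temp_2022": val_2022["max_temp"],
        "max_temp_above_5yavg": val_2022["max_temp"] - avg_5y["max_temp"],
        "rainfall_2022": val_2022["rainfall"],
        "rainfall_below_5yavg": avg_5y["rainfall"] - val_2022["rainfall"],
    }
)
print(features)
```

Note the sign convention: the temperature difference is 2022 minus the average, while the rainfall difference is the average minus 2022, so positive values indicate a hotter or drier year, respectively.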

It is easy to spot the cyclical nature of our climate measurements in the temperature data where each period spans one year. However, the pattern in precipitation and snow in inches over time is more difficult to identify.

Storm Event Data (from NOAA)

The storm event data were sourced from NOAA. Each row represented a weather event tied to a specific time and geographic zone, with end dates and times recorded, in some cases, down to the exact minute. The data also incorporated multiple observation sources, including the public, broadcast media, trained spotters, and forest or park services. Additionally, the dataset contained fields detailing property and crop damage, injuries, and event magnitude. However, much of this metadata contained empty cells, which indicated data reliability issues for certain fields.

Relevant Variables in the dataset (these are the storm event variables we used for modeling and visualizations):

BEGIN_DATE – date the storm event began
YEAR – extracted from BEGIN_DATE, used to filter and group data
MAGNITUDE – numeric severity/intensity of the storm event
CZ_FIPS – county FIPS code identifier used to group data by county

From these base variables, we created several engineered features:

df_2022 – subset of data for storm events that occurred in 2022
lowmag_2022 – number of low magnitude storm events in 2022
lowmag_5yavg – 5-year average (2019–2023) of low magnitude storm events per year
highmag_2022 – number of high magnitude storm events in 2022
highmag_5yavg – 5-year average (2019–2023) of high magnitude storm events per year
avg_mag_2022 – average storm magnitude for each county/zone in 2022

The engineered features (lowmag, highmag, and avg_mag) were used to compare storm activity in 2022 against the 5-year averages and to evaluate how storm intensity varied across counties.
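The storm features listed above could be computed along the following lines. This is only a sketch on invented events: the cutoff between low and high magnitude (HIGH_MAG_THRESHOLD) is a hypothetical value chosen for the example, since the threshold is not stated in this report.

```python
import pandas as pd

# Hypothetical cutoff between "low" and "high" magnitude events.
HIGH_MAG_THRESHOLD = 50.0
YEARS = list(range(2019, 2024))

# Toy storm events; all values are invented.
storms = pd.DataFrame(
    {
        "CZ_FIPS": [19, 19, 19, 13, 13, 19],
        "BEGIN_DATE": pd.to_datetime(
            ["2019-08-01", "2022-07-15", "2022-08-02",
             "2022-07-20", "2021-09-01", "2023-07-04"]
        ),
        "MAGNITUDE": [60.0, 40.0, 70.0, 30.0, 55.0, 45.0],
    }
)
storms["YEAR"] = storms["BEGIN_DATE"].dt.year
counties = sorted(storms["CZ_FIPS"].unique())


def yearly_counts(events):
    """Events per county (rows) and year (columns); empty years count as 0."""
    table = events.groupby(["CZ_FIPS", "YEAR"]).size().unstack("YEAR", fill_value=0)
    return table.reindex(index=counties, columns=YEARS, fill_value=0)


is_high = storms["MAGNITUDE"] >= HIGH_MAG_THRESHOLD
high = yearly_counts(storms[is_high])
low = yearly_counts(storms[~is_high])
df_2022 = storms[storms["YEAR"] == 2022]

storm_features = pd.DataFrame(
    {
        "lowmag_2022": low[2022],
        "lowmag_5yavg": low.mean(axis=1),
        "highmag_2022": high[2022],
        "highmag_5yavg": high.mean(axis=1),
        "avg_mag_2022": df_2022.groupby("CZ_FIPS")["MAGNITUDE"].mean(),
    }
)
print(storm_features)
```

Reindexing over all five years matters here: a county with no events in a given year should contribute a zero to the 5-year average rather than being skipped.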

Data Integrity

The datasets we had access to had integrity issues: the collection methods were inconsistent, which led to many missing values. A leading contributor to this problem is the large size of the sampling range. Climate stations are often designed to be operated remotely, so any hardware issues that arise are slow to be repaired, leading to frequent gaps in collection. Another integrity issue relates to the storm event data, which were uniquely granular in both space and time. While the traffic and climate stations recorded measurements relatively consistently over time, the storm event data were highly qualitative in nature and only included measurements from when storms were present.

Engineered Features

Climate Features

| Feature | Description |
| --- | --- |
| max_temp_2022 | maximum temperature for 2022 |
| max_temp_above_5yavg | maximum temperature for 2022 minus maximum temperature averaged 2019-2023 |
| avg_temp_2022 | average temperature for 2022 |
| avg_temp_above_5yavg | average temperature for 2022 minus average temperature averaged 2019-2023 |
| rainfall_2022 | total inches of rainfall in 2022 |
| rainfall_below_5yavg | inches of annual rainfall averaged 2019-2023 minus inches of annual rainfall in 2022 |
| lowmagstorm_events_2022 | sum of low magnitude storm events in 2022 |
| lowmagstorm_events_above_5yavg | sum of low magnitude storm events for 2022 minus sum of low magnitude storm events averaged 2019-2023 |
| highmagstorm_events_2022 | sum of high magnitude storm events in 2022 |
| highmagstorm_events_above_5yavg | sum of high magnitude storm events for 2022 minus sum of high magnitude storm events averaged 2019-2023 |
| average_storm_mag_2022 | average storm event magnitude for 2022 |

Urbanization Features

| Feature | Description |
| --- | --- |
| traffic_counts | total number of vehicles detected in 2022 |
| traffic_counts_above_5yavg | traffic counts in 2022 minus traffic counts averaged 2019-2023 |

Preprocessing

Geospatial Normalization

Since our research question is about regional relationships, we had to combine our data regionally. Because each dataset had unique geographic qualities, this was not a straightforward process. The two main forms of location information in our datasets were latitude/longitude pairs and county names. Latitude/longitude pairs identify exact points, so we could not merge on them directly: doing so would require the traffic station, the climate measurement, and the storm identification to have occurred at the exact same location. To account for this, we implemented geospatial normalization in the form of a gridspace.

Our gridspace is a 2-dimensional grid overlaid on the state of Arizona, in which each cell encompasses a pre-defined range of latitude and longitude. We constructed a grid of 336 cells, 16 cells horizontally by 21 cells vertically. Any data with a station location within a given cell were merged into that cell and assigned a gridspace number and a gridspace geometry. In cartography, geometries describe the shape of a region. Generally speaking, geometries include Points, Lines, and Polygons (https://geopandas.org/en/stable/docs/user_guide/data_structures.html). The traffic and NOAA station data include latitude and longitude information, which is classified as a Point. The gridspace geometry, however, is a rectangle, which is considered a Polygon. This form of normalization is helpful because it ensures all data share the same projection information and that the regions are the same size. We used the GeoPandas spatial join function (geopandas.sjoin()) as we merged our datasets. A spatial join combines GeoDataFrames based on geometries that intersect, so traffic and climate data were grouped together in the same gridspace if they fell within the same region.
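A minimal sketch of this kind of grid construction and spatial join is shown below. The bounding box is an approximate, assumed extent for Arizona, the station coordinates are invented, and the numbering of gridspaces (row by row from the southwest corner) is a choice made for the example rather than a description of our exact implementation.

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import box

# Approximate bounding box for Arizona in degrees (assumed values).
MIN_LON, MAX_LON = -114.9, -109.0
MIN_LAT, MAX_LAT = 31.3, 37.0
N_COLS, N_ROWS = 16, 21  # 16 x 21 = 336 gridspaces

lon_edges = np.linspace(MIN_LON, MAX_LON, N_COLS + 1)
lat_edges = np.linspace(MIN_LAT, MAX_LAT, N_ROWS + 1)

# One rectangular Polygon per gridspace, numbered row by row.
cells = [
    box(lon_edges[i], lat_edges[j], lon_edges[i + 1], lat_edges[j + 1])
    for j in range(N_ROWS)
    for i in range(N_COLS)
]
grid = gpd.GeoDataFrame(
    {"gridspace": list(range(len(cells)))}, geometry=cells, crs="EPSG:4326"
)

# Toy station table with invented coordinates.
stations = pd.DataFrame(
    {
        "station_id": ["A1", "B2"],
        "latitude": [32.22, 33.45],
        "longitude": [-110.97, -112.07],
    }
)

# Convert latitude/longitude columns into Point geometries.
points = gpd.GeoDataFrame(
    stations,
    geometry=gpd.points_from_xy(stations["longitude"], stations["latitude"]),
    crs="EPSG:4326",
)

# Pair each station Point with the gridspace Polygon that contains it.
joined = gpd.sjoin(points, grid, how="inner", predicate="within")
print(joined[["station_id", "gridspace"]])
```

Giving the grid and the points the same coordinate reference system before joining is what allows stations from different datasets to be assigned to a common set of regions.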

First, we combined the NOAA climate data and the traffic data. When multiple station locations fell within the same gridspace, we assigned the mean of their measurements as the feature value for that gridspace. The spatial join only retains gridspaces that are represented in both datasets, so our initial grid of 336 gridspaces was reduced to ~80 gridspaces. An interesting problem arose with the storm event data, which did not include a single latitude and longitude location for each event. Since storms travel across regions, the dataset instead included a county variable and starting and ending latitude and longitude values. We decided to preserve the county feature from the raw traffic data and merge the storm data on it. As a result, a single storm event is represented in every gridspace within the county in which it occurred.
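The merging logic can be illustrated with plain pandas once each station has a gridspace number. The sketch below uses invented values, and it assumes each gridspace maps to a single county (taking the first county seen), which is a simplification for gridspaces that straddle a county line.

```python
import pandas as pd

# Toy station-level tables after gridspace assignment (invented values).
traffic = pd.DataFrame(
    {
        "gridspace": [58, 58, 119],
        "county_code": [19, 19, 13],
        "traffic_counts": [1000.0, 3000.0, 500.0],
    }
)
climate = pd.DataFrame(
    {
        "gridspace": [58, 119, 119, 200],
        "max_temp_2022": [110.0, 114.0, 116.0, 95.0],
    }
)
# Storm features are only available per county.
storms = pd.DataFrame({"county_code": [19, 13], "highmag_2022": [1, 0]})

# Multiple stations in one gridspace collapse to their mean.
traffic_grid = traffic.groupby("gridspace", as_index=False).agg(
    county_code=("county_code", "first"),  # simplifying assumption
    traffic_counts=("traffic_counts", "mean"),
)
climate_grid = climate.groupby("gridspace", as_index=False).mean()

# An inner merge keeps only gridspaces present in both datasets,
# which is why the number of usable gridspaces shrinks.
combined = traffic_grid.merge(climate_grid, on="gridspace", how="inner")

# Merging on county copies each county's storm values into every
# gridspace that belongs to that county.
combined = combined.merge(storms, on="county_code", how="left")
print(combined)
```

In this toy example gridspace 200 has climate data but no traffic station, so it is dropped, mirroring the reduction from the full grid to the subset of gridspaces covered by both datasets.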

For our regression analyses, our final DataFrame was structured such that each row corresponded to a gridspace and each column was a traffic, climate, or storm event value.

Temporal Normalization

While we have addressed regional normalization using our gridspace feature, we had to find a way to normalize our data in the time dimension. The most common approach to see how features change over time is to perform regression where the independent variable is time. Common examples of this are seeing how a stock’s value changes over time, or seeing how temperature for one city changes day to day. Since our independent variable changes region by region, we accounted for temporal variability by comparing a year of interest (2022) to a dataset of averages for the same features over a 5 year time span (2019-2023). We did this through engineering features that were differences between values measured in 2022 and values that were measured over the 5 year average.

Data Preprocessing

To prepare our data for regression, we removed outliers using a 5% and 95% tolerance range. Then we filled missing values and scaled our features using StandardScaler. The low integrity of the dataset meant that we had issues with data being flagged as outliers: the overabundance of NaNs led to the fill values being over-represented, so that the rest of the dataset was classified as outliers. We believe this to be a leading contributor to our poor model performance.
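One way to express these preprocessing steps is sketched below on synthetic data. Several details are assumptions made for the example: outliers are masked (set to missing) rather than dropped as rows, the tolerance range is read as the 5th to 95th percentile band, and the fill value is the column median.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two of our features.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "traffic_counts": rng.normal(1e7, 2e6, size=100),
        "max_temp_2022": rng.normal(105, 5, size=100),
    }
)
df.loc[::10, "max_temp_2022"] = np.nan  # simulate missing readings

# Mask values outside the 5th-95th percentile band of each column.
low, high = df.quantile(0.05), df.quantile(0.95)
trimmed = df.where((df >= low) & (df <= high))

# Fill the gaps with the column median (assumed fill strategy).
filled = trimmed.fillna(trimmed.median())

# Standardize each feature to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(filled), columns=filled.columns
)
print(scaled.describe().loc[["mean", "std"]])
```

The ordering matters: if missing values are filled before the percentile band is computed, a column dominated by fill values can have a very narrow band, and most genuine measurements end up flagged as outliers.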

We grouped our data into climate and urbanization (target) categories. We split each set of categories into 2022 values and 5-year average values. We performed PCA to reduce the dimensionality of our dataset, preserving only the most prominent features, and then performed one-hot encoding on categorical values to convert object types to numerical types.

PCA Results

Component 1: ['max_temp_2022', 'highmagstorm_events_2022', 'highmagstorm_5yavg', 'highmagstorm_events_above_5yavg']
Component 2: ['traffic_counts_2022', 'fips_code', 'highmagstorm_events_2022', 'highmagstorm_5yavg', 'highmagstorm_events_above_5yavg']
Component 3: ['traffic_counts_2022', 'traffic_counts_above_5yavg', 'max_temp_above_5yavg', 'avg_temp_above_5yavg', 'rainfall_2022']
Component 4: ['gridspace', 'traffic_counts_2022', 'traffic_counts_above_5yavg', 'rainfall_2022', 'average_storm_mag_2022']
Component 5: ['gridspace', 'avg_temp_2022', 'rainfall_below_5yavg', 'lowmagstorm_5yavg']
Component 6: ['fips_code', 'avg_temp_above_5yavg', 'average_storm_mag_2022']
Component 7: ['max_temp_2022', 'rainfall_2022', 'lowmagstorm_5yavg', 'average_storm_mag_2022']
Component 8: ['max_temp_above_5yavg', 'avg_temp_2022', 'avg_temp_above_5yavg']
Component 9: ['avg_temp_2022', 'rainfall_below_5yavg', 'lowmagstorm_5yavg']
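Lists like the one above can be produced by inspecting the loadings of each principal component. The sketch below shows one possible approach on synthetic data; the cutoff for deciding which features count as prominent (LOADING_CUTOFF) is a hypothetical value and is not necessarily the rule used to generate the results above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features share a common signal, two are noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = pd.DataFrame(
    {
        "max_temp_2022": base + rng.normal(scale=0.1, size=200),
        "highmagstorm_events_2022": base + rng.normal(scale=0.1, size=200),
        "traffic_counts_2022": rng.normal(size=200),
        "rainfall_2022": rng.normal(size=200),
    }
)

pca = PCA(n_components=3)
pca.fit(StandardScaler().fit_transform(X))

# Rows are components, columns are the original features.
loadings = pd.DataFrame(pca.components_, columns=X.columns)

LOADING_CUTOFF = 0.5  # hypothetical threshold for a "prominent" feature
top_features = {}
for i, row in loadings.iterrows():
    top_features[i + 1] = row[row.abs() >= LOADING_CUTOFF].index.tolist()
    print(f"Component {i + 1}: {top_features[i + 1]}")
```

Features that load heavily on the same component vary together in the data, which is the sense in which a component groups related features.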

Regression Analysis

Linear Regression Models

We utilized Ordinary Least Squares Regression, Ridge Regression, and Lasso Regression. We used both Ridge Regression and Lasso Regression to see if there were any prominently predictive features that we could further investigate. We could determine this if the Lasso Regression performed significantly better than the Ridge Regression, because the Lasso Regression omits features of low influence while Ridge Regression preserves them. We grouped the 2022 data and the 5 year average (2019-2023) data in every combination during regression analysis to investigate any underlying relationships.
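The comparison of the three linear models can be sketched with scikit-learn as follows. The data are synthetic (one informative feature out of five), and the alpha grid, train/test split, and number of folds are assumptions for the example rather than our exact settings.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: 80 rows (roughly one per gridspace), 5 features,
# of which only the first actually drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]  # assumed search grid
models = {
    "OLS": LinearRegression(),
    "Ridge": RidgeCV(alphas=alphas),        # shrinks all coefficients
    "Lasso": LassoCV(alphas=alphas, cv=5),  # can set coefficients to zero
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {"mse": mse, "rmse": np.sqrt(mse), "r2": r2_score(y_test, pred)}
    print(f"{name}: MSE={mse:.3f} RMSE={np.sqrt(mse):.3f} R2={results[name]['r2']:.3f}")

print("Lasso coefficients:", models["Lasso"].coef_)
```

When the selected alpha is very large, Lasso can shrink every coefficient to zero, in which case the model predicts the training mean regardless of the inputs.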

Environmental Data (5 Year Averages)
Traffic Data (5 Year Averages)
OLS Regression Model:
    Mean-squared error: 0.5443879147049658
    Root mean-squared error: 0.7378264800784571
    R-squared value: -0.2140888231489706

Ridge best alpha selected: 1000 with CV MSE: 1.174985534201412
Ridge Regression Model:
    Mean-squared error: 0.5596105056182925
    Root mean-squared error: 0.7480711902073843
    R-squared value: -0.24803810267560222

Lasso best alpha selected: 1000 with CV MSE: 1.174985534201412
Lasso Regression Model:
    Mean-squared error: 0.5601840743214328
    Root mean-squared error: 0.7484544570790082
    R-squared value: -0.2493172702195181

Environmental Data (2022)
Traffic Data (2022)
OLS Regression Model:
    Mean-squared error: 0.5310351617487781
    Root mean-squared error: 0.7287215941282227
    R-squared value: -0.7615209941768197

Ridge best alpha selected: 100 with CV MSE: 1.1758120989820233
Ridge Regression Model:
    Mean-squared error: 0.41491285307157283
    Root mean-squared error: 0.6441372936506415
    R-squared value: -0.37632637927870594

Lasso best alpha selected: 100 with CV MSE: 1.1758120989820233
Lasso Regression Model:
    Mean-squared error: 0.4315963501853747
    Root mean-squared error: 0.6569599304260304
    R-squared value: -0.43166796970271903

Environmental Data (2022)
Traffic Data (5 Year Averages)
OLS Regression Model:
    Mean-squared error: 0.5553423826213788
    Root mean-squared error: 0.7452129780280123
    R-squared value: -0.23851937478615914

Ridge best alpha selected: 10000 with CV MSE: 1.1760355262786755
Ridge Regression Model:
    Mean-squared error: 0.5590780965942956
    Root mean-squared error: 0.7477152510109016
    R-squared value: -0.2468507290621953

Lasso best alpha selected: 1 with CV MSE: 1.1759802832153639
Lasso Regression Model:
    Mean-squared error: 0.5601840743214328
    Root mean-squared error: 0.7484544570790082
    R-squared value: -0.2493172702195181

Environmental Data (5 Year Averages)
Traffic Data (2022)
OLS Regression Model:
    Mean-squared error: 0.44783955314134777
    Root mean-squared error: 0.669208153821625
    R-squared value: -0.4855490402619558

Ridge best alpha selected: 100 with CV MSE: 1.1569723795267826
Ridge Regression Model:
    Mean-squared error: 0.4109393466705784
    Root mean-squared error: 0.6410455106079275
    R-squared value: -0.3631456796753172

Lasso best alpha selected: 100 with CV MSE: 1.1569723795267826
Lasso Regression Model:
    Mean-squared error: 0.4315963501853747
    Root mean-squared error: 0.6569599304260304
    R-squared value: -0.43166796970271903

Polynomial regression

We utilized a second order polynomial regression to see if it would perform better than the linear regression models. We discovered that the quadratic regression also performed worse than a prediction made by averages.
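For reference, a second-order polynomial regression can be set up as in the sketch below. The data are synthetic and deliberately curved, so the quadratic model wins here; this illustrates the technique only and does not reflect the behavior of our dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a curved relationship between x and y.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))
y = 1.5 * x[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

# Linear model on the raw feature.
linear = LinearRegression().fit(x_train, y_train)

# Quadratic model: expand to [1, x, x^2], then fit a linear model.
quadratic = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression()
).fit(x_train, y_train)

r2_linear = r2_score(y_test, linear.predict(x_test))
r2_quadratic = r2_score(y_test, quadratic.predict(x_test))
print(f"linear R2={r2_linear:.3f}, quadratic R2={r2_quadratic:.3f}")
```

With several input features, the degree-2 expansion also adds every pairwise product, so the number of model terms grows quickly relative to a dataset of roughly 80 rows, which makes overfitting more likely.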

Environmental Data (5 Year Averages)
Traffic Data (5 Year Averages)
Mean Squared Error: 8.216793889457124
R-squared: -17.325016690929225

Environmental Data (2022)
Traffic Data (2022)
Mean Squared Error: 0.7942820147920161
R-squared: -1.6347491562434548

Environmental Data (2022)
Traffic Data (5 Year Averages)
Mean Squared Error: 1.216637646936872
R-squared: -1.7133338729035632

Environmental Data (5 Year Averages)
Traffic Data (2022)
Mean Squared Error: 6.005052083350384
R-squared: -18.91963259290013

Discussion

  • The overall low predictive power across all of our models implies that traffic counts alone are not an accurate metric for urbanization. There is also a concern that these datasets are not appropriate for our research question. Narrowing our scope, by restricting our regions of interest to a single county or our time frame to a single season, may help control for natural variability and data integrity issues in our dataset.
  • Our PCA analysis demonstrates that extreme storm events and temperature have the most significant interaction among features in our dataset.
  • The polynomial regression model did not perform better than our linear regression models. Both model types produced negative R-squared values, meaning a simple numerical average has more predictive power than our models.
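To make the interpretation of a negative R-squared concrete, the small example below computes it directly from its definition, R^2 = 1 - SS_res / SS_tot, on invented numbers. A model that always predicts the mean scores exactly 0, and any model with larger squared error than that baseline scores below 0.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
mean_pred = [2.5, 2.5, 2.5, 2.5]  # always predict the average
bad_pred = [4.0, 3.0, 2.0, 1.0]   # predictions in the wrong direction

r2_mean = r_squared(y_true, mean_pred)
r2_bad = r_squared(y_true, bad_pred)
print(r2_mean, r2_bad)
```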

Future improvements

Some ideas for improvement when investigating our research question include:

  • Use more complete datasets. The NOAA dataset includes a use advisory stating that the data are meant mainly for hydrological and agricultural purposes and warning that the dataset's incompleteness makes it insufficient for climate change research.

  • Perform traffic congestion normalization. Traffic congestion takes into account how long a car is on the road in a specific area, which is affected by gross count, number of lanes, and car speed. As Interstate-8, Interstate-10, and Interstate-40 are major shipping lanes that are represented in our data, they may be associated with high gross count but a low urbanization factor as trucks travel at high speed through rural areas.

  • Perform modeling using PCA clusters with variables of interest. Reducing the number of features may lead to better R-squared values because the regression fits will be less affected by extraneous features.

References

  1. Grid Construction Reference

    1. Brennan, J. (2020). Fast and easy gridding of point data with geopandas. https://james-brennan.github.io/posts/fast_gridding_geopandas/
  2. Traffic Data

    1. U.S. Department of Transportation Federal Highway Administration. (n.d.). Travel Monitoring. https://www.fhwa.dot.gov/policyinformation/travel_monitoring/tvt.cfm

    2. Transportation Research Board. The Congestion Mitigation and Air Quality Improvement Programs. https://onlinepubs.trb.org/onlinepubs/sr/sr264.pdf

    3. World Health Organization. Health Effects of Transport-Related Air Pollution. https://books.google.com/books?hl=en&lr=&id=b2G3k51rd0oC&oi=fnd&pg=PR1&ots=O94u9zCs4y&sig=v1JEfno-HjkbKf3im4iDQLJJUQ0#v=onepage&q&f=false

  3. Storm Data

    1. NOAA National Centers for Environmental Information. (n.d.). Storm Events Database. https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/
  4. Climate Data

    1. NOAA National Centers for Environmental Information. (n.d.). Climate Data. https://www.ncdc.noaa.gov/cdo-web/search;jsessionid=2851FE4A6791F4BF429988727483E450

    2. Global Historical Climatology Network. (n.d.). Daily Documentation. https://www.ncei.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf

  5. Trenberth, K.E. (2007). The Impact of Climate Change and Variability on Heavy Precipitation, Floods and Droughts. https://web.archive.org/web/20170809113035id_/http://www.cgd.ucar.edu/cas/Trenberth/website-archive/trenberth.papers-moved/The%20Impact%20of%20Climate%20Change%20on%20the%20Water%20Cycle%20v%204_ss.pdf

  6. Zhang M., Gao, Y., & Ge, J. (2025). Different responses of extreme and mean precipitation to land use and land cover changes. https://www.nature.com/articles/s41612-025-01049-1

  7. Saturn Cloud. (2023). How to Sum Two Columns in a Pandas DataFrame. https://saturncloud.io/blog/how-to-sum-two-columns-in-a-pandas-dataframe/