Urbanization and Environmental Quality in Arizona, USA

Proposal

Final Project for course INFO 523: Data Mining and Discovery
Author: JKP (Vera Jackson, Molly Kerwick, Brooke Pacheco)
Affiliation: College of Information Science, University of Arizona

High Level Goal

The goal of our project is to analyze the relationship between urbanization and environmental quality across regions within Arizona, identifying patterns and anomalies in how urban growth relates to climate metrics such as storms, temperature, and rainfall. As the public outcry in Tucson over the potential environmental impacts of the proposed Project Blue data center has shown, environmental quality matters to local populations. We hope to use this study to identify the regional relationship between urbanization and climate indicators. We will use traffic count data as our urbanization metric, and National Oceanic and Atmospheric Administration (NOAA) climate and storm event data as our measure of environmental quality.

Research Questions

  1. Can we predict the environmental quality of a region based on urbanization indicators for that region?
    We chose this question because, if we can accurately predict how urbanization affects climate, policymakers can work proactively to preserve or improve the environmental quality of a region.

  2. Are storm event data and traffic data successful environmental health and urbanization indicators, respectively?
    The infrastructure for traffic monitoring is affordable to implement and already has a framework for deployment. If traffic volume is an indicator of environmental quality, we could use it to better study areas of climatological interest.

  3. Is the relationship between urbanization level and climate indicators better described by a linear or quadratic model?
    The better-fitting form will indicate whether a simple linear assumption is adequate for the scope of our study, and will let us compare our results with other environmental impact studies.

  4. Can principal component analysis (PCA) be used to compare regions throughout Arizona based on traffic, climate, and storm event data?
    If our hypothesis that high urbanization level leads to low environmental quality is correct, we would expect to see similarities between metropolitan regions. Additionally, we might be able to identify regions with unknown similarities.
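As a rough illustration of research question 4, the sketch below runs PCA on a small regions-by-features matrix using NumPy's SVD. The region labels and all numeric values are invented for illustration, and standardizing each column before the decomposition is our assumption about how features on very different scales would be handled; it is not a finalized method.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components.

    Columns are standardized first so that features on different scales
    (vehicle counts, degrees, inches) contribute comparably.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized matrix: the rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, Vt[:n_components], explained[:n_components]

# Hypothetical regions (rows) and features (columns); values are made up.
# Columns: traffic_counts, avg_temp_2022, rainfall_2022
regions = ["A", "B", "C", "D", "E"]
X = np.array([
    [9.0e6, 75.0, 7.0],
    [8.5e6, 74.0, 8.0],
    [1.0e6, 55.0, 20.0],
    [1.2e6, 57.0, 19.0],
    [4.0e6, 65.0, 12.0],
])

scores, components, explained = pca(X, n_components=2)
for name, (pc1, pc2) in zip(regions, scores):
    print(f"{name}: PC1={pc1:+.2f}  PC2={pc2:+.2f}")
```

Regions that land close together in the PC1/PC2 plane would be candidates for "similar" regions; the loadings in `components` show which original features drive each axis.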

Proposed Procedure

This project will consist of 2 studies:

  • Study 1: A regional regression model of environmental quality features (storm events and climate data) as a function of our urbanization predictor (traffic volume). We will use multivariate regression models, with the hypothesis that there is a negative linear relationship between urbanization and environmental quality. We will train a linear regression model and a quadratic regression model and compare the two.

  • Study 2: A PCA of regions across Arizona that combines climate data and traffic data to identify similar regions.
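The linear versus quadratic comparison in Study 1 could be sketched as follows. This is a simplified, single-predictor illustration on synthetic data (the real study is multivariate); the data-generating formula, the hold-out scheme, and the use of test-set RMSE as the comparison metric are all assumptions made for the example.

```python
import numpy as np

def fit_and_score(x_train, y_train, x_test, y_test, degree):
    """Fit a polynomial of the given degree by least squares; return test RMSE."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    pred = np.polyval(coeffs, x_test)
    return float(np.sqrt(np.mean((y_test - pred) ** 2)))

# Synthetic stand-ins: x plays the role of a scaled urbanization feature,
# y the role of one environmental quality feature. Not real data.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=60)
y = 2.0 - 0.5 * x + 0.3 * x**2 + rng.normal(0.0, 0.5, size=60)

# Hold out every third observation as a test set.
test_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

rmse_linear = fit_and_score(x_train, y_train, x_test, y_test, degree=1)
rmse_quadratic = fit_and_score(x_train, y_train, x_test, y_test, degree=2)
print(f"linear RMSE:    {rmse_linear:.3f}")
print(f"quadratic RMSE: {rmse_quadratic:.3f}")
```

Comparing the two models on held-out data rather than on the training fit matters here, because a quadratic model can never fit the training data worse than a linear one.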

Temporal vs. Regional Geospatial Analysis

When thinking of climate models, people are probably most familiar with a temporal frame: one region is compared with itself at a different point in time. Because our study asks how one region compares to another, we have to approach the problem differently. To compare regions, we plan to build a dataset with climate, storm event, and traffic features in which each instance is a different region.

Since these measurements are dependent on time, however, we will have to add features that normalize the temporal aspect of our dataset. We will do this by engineering features that compare our primary dataset values to a dataset of averaged values for the same regions and features over a 5-year period. We have selected data from 2022 as our primary dataset, and we have selected the timeframe 2019-2023 to construct our averaged values. Comparing storm patterns across regions provides valuable insights into climate health, as variations in storm frequency and intensity can indicate underlying changes in regional climate conditions. This approach enables us to analyze spatial differences in climate quality while accounting for temporal variability.
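A minimal sketch of this temporal normalization, assuming a long-format table with one row per region and year. The column names `region`, `year`, and `max_temp`, the helper `above_5yavg`, and every value in the table are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical annual values per region; numbers are invented.
annual = pd.DataFrame({
    "region": ["Region A"] * 5 + ["Region B"] * 5,
    "year": [2019, 2020, 2021, 2022, 2023] * 2,
    "max_temp": [108.0, 110.0, 109.0, 112.0, 111.0,
                 115.0, 118.0, 116.0, 117.0, 119.0],
})

def above_5yavg(df, value_col, primary_year=2022, years=range(2019, 2024)):
    """Return the primary-year value and its difference from the window mean."""
    window = df[df["year"].isin(list(years))]
    baseline = window.groupby("region")[value_col].mean()
    primary = df[df["year"] == primary_year].set_index("region")[value_col]
    return pd.DataFrame({
        f"{value_col}_{primary_year}": primary,
        f"{value_col}_above_5yavg": primary - baseline,  # aligned on region
    })

features = above_5yavg(annual, "max_temp")
print(features)
```

Note that 2022 is itself part of the 2019-2023 window, so the baseline mean includes the primary-year value; the sketch follows the definition in the text as written.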

Engineered Features

Climate Features

| Feature | Description |
| --- | --- |
| max_temp_2022 | maximum temperature for 2022 |
| max_temp_above_5yavg | maximum temperature for 2022 minus maximum temperature averaged 2019-2023 |
| avg_temp_2022 | average temperature for 2022 |
| avg_temp_above_5yavg | average temperature for 2022 minus average temperature averaged 2019-2023 |
| rainfall_2022 | total inches of rainfall in 2022 |
| rainfall_below_5yavg | inches of annual rainfall averaged 2019-2023 minus inches of annual rainfall in 2022 |
| lowmagstorm_events_2022 | sum of low magnitude storm events in 2022 |
| lowmagstorm_events_above_5yavg | sum of low magnitude storm events for 2022 minus sum of low magnitude storm events averaged 2019-2023 |
| highmagstorm_events_2022 | sum of high magnitude storm events in 2022 |
| highmagstorm_events_above_5yavg | sum of high magnitude storm events for 2022 minus sum of high magnitude storm events averaged 2019-2023 |
| average_storm_mag_2022 | average storm event magnitude for 2022 |

Urbanization Features

| Feature | Description |
| --- | --- |
| traffic_counts | total number of vehicles detected in 2022 |
| traffic_counts_above_5yavg | traffic counts in 2022 minus traffic counts averaged 2019-2023 |

Datasets

Dataset 1 - Storm Data in Arizona

Shape: (2500, 39)
Years combined: ['storm_data_AZ_2019.csv', 'storm_data_AZ_2020.csv', 'storm_data_AZ_2021.csv', 'storm_data_AZ_2022.csv', 'storm_data_AZ_2023.csv']
   EVENT_ID                                        CZ_NAME_STR BEGIN_LOCATION  \
0    796730  WHITE MOUNTAINS OF GRAHAM AND GREENLEE COUNTIE...                  
1    792354  LITTLE COLORADO RIVER VALLEY IN NAVAJO COUNTY ...                  
2    792442                             WHITE MOUNTAINS (ZONE)                  
3    796731              GALIURO AND PINALENO MOUNTAINS (ZONE)                  
4    792444  LITTLE COLORADO RIVER VALLEY IN APACHE COUNTY ...                  

   BEGIN_DATE  BEGIN_TIME    EVENT_TYPE MAGNITUDE TOR_F_SCALE  DEATHS_DIRECT  \
0  01/01/2019           0  Winter Storm                                    0   
1  01/01/2019           0    Heavy Snow                                    0   
2  01/01/2019           0    Heavy Snow                                    0   
3  01/01/2019           0  Winter Storm                                    0   
4  01/01/2019           0    Heavy Snow                                    0   

   INJURIES_DIRECT  ...  END_LOCATION    END_DATE END_TIME BEGIN_LAT  \
0                0  ...                01/01/2019     1330             
1                0  ...                01/01/2019      700             
2                0  ...                01/01/2019      700             
3                0  ...                01/01/2019     1330             
4                0  ...                01/01/2019     1100             

  BEGIN_LON  END_LAT END_LON  \
0                              
1                              
2                              
3                              
4                              

                                     EVENT_NARRATIVE  \
0  Accumulating snow began on the afternoon of De...   
1  Snow began falling in the Taylor area on New Y...   
2  Eight inches of snow fell in Pinetop-Lakeside ...   
3  Accumulating snow began on the afternoon of De...   
4  Snow began to fall New Year's Eve during the e...   

                                   EPISODE_NARRATIVE  ABSOLUTE_ROWNUMBER  
0  A relatively strong and cold weather system im...                   1  
1  A third storm system in a week crossed norther...                   2  
2  A third storm system in a week crossed norther...                   3  
3  A relatively strong and cold weather system im...                   4  
4  A third storm system in a week crossed norther...                   5  

[5 rows x 39 columns]

This dataset contains storm event records in Arizona, sourced from the NOAA Storm Events Database (https://www.ncdc.noaa.gov/stormevents/choosedates.jsp?statefips=4%2CARIZONA). It includes information about various weather events such as floods, tornadoes, and severe storms, along with details like location, date, event type, magnitude, fatalities, injuries, and property damage. It also includes metadata such as time zones, county information, and narrative descriptions of each event.

The dataset consists of a mix of numerical and categorical values, specifically, 12 columns with integer types and 27 with object types. It was chosen for its relevance to climate and environmental analysis in Arizona. The data enables the exploration of temporal and spatial patterns in extreme weather events and supports investigations into trends related to climate change, urbanization, and risk assessment.
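To make the spatial and temporal aggregation concrete, here is a hedged sketch that tallies storm events per zone and year. The column names and the `MM/DD/YYYY` date format come from the preview above; the first two sample rows echo that preview, while the two 2022 rows (including their `EVENT_ID` values) are invented. Splitting events into low and high magnitude is not shown, because the threshold for that split has not been defined yet.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the combined storm CSVs.
sample_csv = io.StringIO(
    "EVENT_ID,CZ_NAME_STR,BEGIN_DATE,EVENT_TYPE\n"
    "792442,WHITE MOUNTAINS (ZONE),01/01/2019,Heavy Snow\n"
    "796731,GALIURO AND PINALENO MOUNTAINS (ZONE),01/01/2019,Winter Storm\n"
    "900001,WHITE MOUNTAINS (ZONE),02/15/2022,Heavy Snow\n"   # invented row
    "900002,WHITE MOUNTAINS (ZONE),07/20/2022,Flash Flood\n"  # invented row
)
storms = pd.read_csv(sample_csv)

def events_per_zone_year(df):
    """Count distinct storm events for each zone (rows) and year (columns)."""
    df = df.copy()
    df["year"] = pd.to_datetime(df["BEGIN_DATE"], format="%m/%d/%Y").dt.year
    return (
        df.groupby(["CZ_NAME_STR", "year"])["EVENT_ID"]
        .nunique()
        .unstack(fill_value=0)
    )

counts = events_per_zone_year(storms)
print(counts)
```

Zones with no recorded events in a year appear as 0 rather than missing, which keeps the later "2022 minus five-year average" subtraction well defined.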

Dataset 2 - Weather Data in Arizona

Shape: (1093110, 37)
Columns: ['STATION', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE', 'DAPR', 'DASF', 'EVAP', 'MDPR', 'MDSF', 'PRCP', 'SNOW', 'SNWD', 'TAVG', 'TMAX', 'TMIN', 'TOBS', 'WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT07', 'WT08', 'WT09', 'WT10', 'WT11', 'NAME', 'AWND', 'PGTM', 'WDF2', 'WDF5', 'WESD', 'WESF', 'WSF2', 'WSF5']
Data types: STATION       object
LATITUDE     float64
LONGITUDE    float64
ELEVATION    float64
DATE          object
DAPR         float64
DASF         float64
EVAP         float64
MDPR         float64
MDSF         float64
PRCP         float64
SNOW         float64
SNWD         float64
TAVG         float64
TMAX         float64
TMIN         float64
TOBS         float64
WT01         float64
WT02         float64
WT03         float64
WT04         float64
WT05         float64
WT06         float64
WT07         float64
WT08         float64
WT09         float64
WT10         float64
WT11         float64
NAME          object
AWND         float64
PGTM         float64
WDF2         float64
WDF5         float64
WESD         float64
WESF         float64
WSF2         float64
WSF5         float64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093110 entries, 0 to 1093109
Data columns (total 37 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   STATION    1093110 non-null  object 
 1   LATITUDE   1093110 non-null  float64
 2   LONGITUDE  1093110 non-null  float64
 3   ELEVATION  1091276 non-null  float64
 4   DATE       1093110 non-null  object 
 5   DAPR       4668 non-null     float64
 6   DASF       1 non-null        float64
 7   EVAP       1685 non-null     float64
 8   MDPR       4621 non-null     float64
 9   MDSF       1 non-null        float64
 10  PRCP       925826 non-null   float64
 11  SNOW       647733 non-null   float64
 12  SNWD       186831 non-null   float64
 13  TAVG       169447 non-null   float64
 14  TMAX       374425 non-null   float64
 15  TMIN       373480 non-null   float64
 16  TOBS       185400 non-null   float64
 17  WT01       3111 non-null     float64
 18  WT02       453 non-null      float64
 19  WT03       4333 non-null     float64
 20  WT04       193 non-null      float64
 21  WT05       192 non-null      float64
 22  WT06       38 non-null       float64
 23  WT07       68 non-null       float64
 24  WT08       2465 non-null     float64
 25  WT09       7 non-null        float64
 26  WT10       1 non-null        float64
 27  WT11       290 non-null      float64
 28  NAME       990861 non-null   object 
 29  AWND       27965 non-null    float64
 30  PGTM       1728 non-null     float64
 31  WDF2       27995 non-null    float64
 32  WDF5       27907 non-null    float64
 33  WESD       40807 non-null    float64
 34  WESF       7919 non-null     float64
 35  WSF2       27996 non-null    float64
 36  WSF5       27908 non-null    float64
dtypes: float64(34), object(3)
memory usage: 308.6+ MB

This data from the NOAA National Centers for Environmental Information will serve as a measure of environmental quality. The dataset, noaadata, is a compilation of daily land surface observations within Arizona from 2018 to 2023. Variables of importance include station latitude and longitude, temperature, precipitation, and snowfall.

This data contains primarily numerical values. The only non-numerical columns are the identifier and name of the station where the weather data was collected, and the date of collection.
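One way the daily observations might be rolled up into annual station-level values is sketched below. The column names match the listing above; the ISO date strings, the station identifiers `S1` and `S2`, and all measurements are assumptions for illustration. Because columns such as `TAVG` and `TMAX` are sparsely populated (see the non-null counts above), the sketch also reports how many precipitation observations contributed to each total.

```python
import pandas as pd

# Hypothetical daily records; None marks a missing observation.
daily = pd.DataFrame({
    "STATION": ["S1", "S1", "S1", "S2", "S2"],
    "DATE": ["2022-01-01", "2022-07-01", "2021-07-01", "2022-01-01", "2022-07-01"],
    "TMAX": [60.0, 105.0, 108.0, 45.0, None],
    "TAVG": [50.0, 90.0, 92.0, None, None],
    "PRCP": [0.0, 0.5, 0.1, 0.2, None],
})

def annual_station_summary(df, year):
    """Aggregate daily rows for one year into per-station summary values."""
    df = df.copy()
    df["DATE"] = pd.to_datetime(df["DATE"])
    df = df[df["DATE"].dt.year == year]
    return df.groupby("STATION").agg(
        max_temp=("TMAX", "max"),
        avg_temp=("TAVG", "mean"),
        rainfall=("PRCP", "sum"),       # NaN values are skipped by default
        n_prcp_obs=("PRCP", "count"),   # non-missing observations only
    )

summary_2022 = annual_station_summary(daily, 2022)
print(summary_2022)
```

A station with no `TAVG` readings in the year gets `NaN` for `avg_temp`, which flags it for exclusion or imputation instead of silently reporting a value.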

Dataset 3 - Traffic Data in Arizona

Geographic locations of each station are formatted as follows:
Index(['index', 'record_type', 'state_code', 'station_id', 'travel_dir',
       'travel_lane', 'year_record', 'f_system', 'num_lanes',
       'sample_type_volume', 'num_lanes_volume', 'method_volume',
       'sample_type_class', 'num_lanes_class', 'method_class',
       'algorithm_volume', 'num_classes', 'sample_type_truck',
       'num_lanes_truck', 'method_truck', 'calibration', 'data_retrieval',
       'type_sensor_1', 'type_sensor_2', 'primary_purpose', 'lrs_id',
       'lrs_point', 'latitude', 'longitude', 'shrp_id', 'prev_station_id',
       'year_established', 'year_discontinued', 'county_code', 'is_sample',
       'sample_id', 'nhs', 'posted_route_signing', 'posted_signed_route',
       'con_route_signing', 'con_signed_route', 'station_location'],
      dtype='object')

Counts of traffic stations will be read in from the following files:
['data/Traffic/counts/AZ0119.VOL', 'data/Traffic/counts/AZ0219.VOL', 'data/Traffic/counts/AZ0319.VOL', 'data/Traffic/counts/AZ0419.VOL', 'data/Traffic/counts/AZ0519.VOL', 'data/Traffic/counts/AZ0619.VOL', 'data/Traffic/counts/AZ0719.VOL', 'data/Traffic/counts/AZ0819.VOL', 'data/Traffic/counts/AZ0919.VOL', 'data/Traffic/counts/AZ1019.VOL', 'data/Traffic/counts/AZ1119.VOL', 'data/Traffic/counts/AZ1219.VOL', 'data/Traffic/counts/AZ0120.VOL', 'data/Traffic/counts/AZ0220.VOL', 'data/Traffic/counts/AZ0320.VOL', 'data/Traffic/counts/AZ0420.VOL', 'data/Traffic/counts/AZ0520.VOL', 'data/Traffic/counts/AZ0620.VOL', 'data/Traffic/counts/AZ0720.VOL', 'data/Traffic/counts/AZ0820.VOL', 'data/Traffic/counts/AZ0920.VOL', 'data/Traffic/counts/AZ1020.VOL', 'data/Traffic/counts/AZ1120.VOL', 'data/Traffic/counts/AZ1220.VOL', 'data/Traffic/counts/AZ0121.VOL', 'data/Traffic/counts/AZ0221.VOL', 'data/Traffic/counts/AZ0321.VOL', 'data/Traffic/counts/AZ0421.VOL', 'data/Traffic/counts/AZ0521.VOL', 'data/Traffic/counts/AZ0621.VOL', 'data/Traffic/counts/AZ0721.VOL', 'data/Traffic/counts/AZ0821.VOL', 'data/Traffic/counts/AZ0921.VOL', 'data/Traffic/counts/AZ1021.VOL', 'data/Traffic/counts/AZ1121.VOL', 'data/Traffic/counts/AZ1221.VOL', 'data/Traffic/counts/AZ0122.VOL', 'data/Traffic/counts/AZ0222.VOL', 'data/Traffic/counts/AZ0322.VOL', 'data/Traffic/counts/AZ0422.VOL', 'data/Traffic/counts/AZ0522.VOL', 'data/Traffic/counts/AZ0622.VOL', 'data/Traffic/counts/AZ0722.VOL', 'data/Traffic/counts/AZ0822.VOL', 'data/Traffic/counts/AZ0922.VOL', 'data/Traffic/counts/AZ1022.VOL', 'data/Traffic/counts/AZ1122.VOL', 'data/Traffic/counts/AZ1222.VOL']
Index(['Record_Type', 'State_Code', 'F_System', 'Station_Id', 'Travel_Dir',
       'Travel_Lane', 'Year_Record', 'Month_Record', 'Day_Record',
       'Day_of_Week', 'Hour_00', 'Hour_01', 'Hour_02', 'Hour_03', 'Hour_04',
       'Hour_05', 'Hour_06', 'Hour_07', 'Hour_08', 'Hour_09', 'Hour_10',
       'Hour_11', 'Hour_12', 'Hour_13', 'Hour_14', 'Hour_15', 'Hour_16',
       'Hour_17', 'Hour_18', 'Hour_19', 'Hour_20', 'Hour_21', 'Hour_22',
       'Hour_23', 'Restrictions'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21866 entries, 0 to 21865
Data columns (total 35 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Record_Type   21866 non-null  int64 
 1   State_Code    21866 non-null  int64 
 2   F_System      21866 non-null  object
 3   Station_Id    21866 non-null  int64 
 4   Travel_Dir    21866 non-null  int64 
 5   Travel_Lane   21866 non-null  int64 
 6   Year_Record   21866 non-null  int64 
 7   Month_Record  21866 non-null  int64 
 8   Day_Record    21866 non-null  int64 
 9   Day_of_Week   21866 non-null  int64 
 10  Hour_00       21866 non-null  int64 
 11  Hour_01       21866 non-null  int64 
 12  Hour_02       21866 non-null  int64 
 13  Hour_03       21866 non-null  int64 
 14  Hour_04       21866 non-null  int64 
 15  Hour_05       21866 non-null  int64 
 16  Hour_06       21866 non-null  int64 
 17  Hour_07       21866 non-null  int64 
 18  Hour_08       21866 non-null  int64 
 19  Hour_09       21866 non-null  int64 
 20  Hour_10       21866 non-null  int64 
 21  Hour_11       21866 non-null  int64 
 22  Hour_12       21866 non-null  int64 
 23  Hour_13       21866 non-null  int64 
 24  Hour_14       21866 non-null  int64 
 25  Hour_15       21866 non-null  int64 
 26  Hour_16       21866 non-null  int64 
 27  Hour_17       21866 non-null  int64 
 28  Hour_18       21866 non-null  int64 
 29  Hour_19       21866 non-null  int64 
 30  Hour_20       21866 non-null  int64 
 31  Hour_21       21866 non-null  int64 
 32  Hour_22       21866 non-null  int64 
 33  Hour_23       21866 non-null  int64 
 34  Restrictions  21866 non-null  int64 
dtypes: int64(34), object(1)
memory usage: 5.8+ MB

The traffic data above is sourced from the Federal Highway Administration (FHWA) of the U.S. Department of Transportation (https://www.fhwa.dot.gov/policyinformation/tables/tmasdata/). It will serve as a metric for urbanization level of regions throughout Arizona. It is presumed that counties with higher traffic flow have greater urbanization than counties with low traffic flow.

The traffic data come in two different file structures: count data and station data.

  • The station data contains information about sampling locations, including latitude and longitude, number of lanes on the road, and type of sensor. Our main variables of interest will be latitude (numerical), longitude (numerical), and county (categorical).

  • The count data is nearly all numerical. Each row is a station location on a specific day of the given file month/year. Each row includes counts of passing vehicles at the listed station location for every hour of that day. We will engineer a feature variable aggregating total counts over a year at each station location. Variables of interest include the hourly counts (e.g., Hour_05) and the station number.
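The yearly aggregation described in the bullets above can be sketched as follows. Column names follow the two listings above; the helper `make_row`, the station identifiers, the `county_code` values, the `Year_Record` value, and all counts are hypothetical. The sketch also assumes `Station_Id` in the count files and `station_id` in the station file share a type and can be joined directly, which still needs to be verified.

```python
import pandas as pd

HOUR_COLS = [f"Hour_{h:02d}" for h in range(24)]

def annual_station_totals(counts):
    """Sum the 24 hourly counts per row, then total them per station and year."""
    daily_total = counts[HOUR_COLS].sum(axis=1)
    return (
        counts.assign(traffic_counts=daily_total)
        .groupby(["Station_Id", "Year_Record"], as_index=False)["traffic_counts"]
        .sum()
    )

def make_row(station, day, per_hour):
    """Build one made-up daily record with the same count in every hour."""
    return {"Station_Id": station, "Year_Record": 22, "Month_Record": 1,
            "Day_Record": day, **{col: per_hour for col in HOUR_COLS}}

counts = pd.DataFrame([make_row(1, 1, 10), make_row(1, 2, 20), make_row(2, 1, 5)])
stations = pd.DataFrame({"station_id": [1, 2], "county_code": [19, 13]})

totals = annual_station_totals(counts)
by_station = totals.merge(stations, left_on="Station_Id",
                          right_on="station_id", how="left")
by_county = by_station.groupby("county_code")["traffic_counts"].sum()
print(by_county)
```

Because the count files hold separate rows per travel direction and lane, this sum combines all of them for a station. Stations that report on fewer days will show lower totals, so a per-day average may be a fairer comparison if coverage turns out to be uneven.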

Schedule

| Date | Task | Details | Completed? |
| --- | --- | --- | --- |
| Aug 9 | Final Proposal Submission | Submit this project proposal to meet course requirements and include peer edits. Ensure all datasets are identified and accessible. | |
| Aug 10 | Data Validation & Cleaning | Verify all NOAA, storm event, and traffic datasets are complete and free of formatting issues. Handle missing data, unify column formats, and confirm geospatial coordinate accuracy. | |
| Aug 11 | Exploratory Data Analysis (EDA) | Perform initial descriptive statistics and visualizations for each dataset (storm events, climate data, traffic counts). Identify trends, anomalies, and data distribution shapes. | |
| Aug 12 | Dataset Integration | Merge storm, climate, and traffic data into a single regional dataset for 2022. Create temporal normalization dataset for 2019–2023 averages. | |
| Aug 13 | Feature Engineering | Create engineered features outlined in the proposal (e.g., max_temp_above_5yavg, traffic_counts_above_5yavg). Ensure proper units and scaling for regression. | |
| Aug 14 | Regression Analysis Setup | Split data into training/testing sets. Implement baseline linear and quadratic regression models to predict environmental quality from urbanization indicators. | |
| Aug 15 | Model Tuning & Evaluation | Perform hyperparameter tuning for regression models. Compare linear vs. quadratic fits. Assess model performance. | |
| Aug 16 | PCA & Regional Comparison | Run PCA on the combined dataset. Identify clusters of similar regions and interpret principal components in relation to urbanization and environmental quality. | |
| Aug 17 | Visualization Development | Create regression plots, PCA scatterplots, and geospatial maps of Arizona showing urbanization and climate impact patterns. | |
| Aug 18 | Presentation Recording | Develop and record project presentation, including visualizations and key findings. Ensure explanation of model results and their policy implications. | |
| Aug 19 | Final Write-Up | Complete detailed report including methodology, results, discussion, and limitations. Prepare references and appendices with code snippets. | |
| Aug 20 (11:59 PM) | Submission Deadline | Submit final write-up and presentation. | |