Shape: (2500, 39)
Years combined: ['storm_data_AZ_2019.csv', 'storm_data_AZ_2020.csv', 'storm_data_AZ_2021.csv', 'storm_data_AZ_2022.csv', 'storm_data_AZ_2023.csv']
EVENT_ID CZ_NAME_STR BEGIN_LOCATION \
0 796730 WHITE MOUNTAINS OF GRAHAM AND GREENLEE COUNTIE...
1 792354 LITTLE COLORADO RIVER VALLEY IN NAVAJO COUNTY ...
2 792442 WHITE MOUNTAINS (ZONE)
3 796731 GALIURO AND PINALENO MOUNTAINS (ZONE)
4 792444 LITTLE COLORADO RIVER VALLEY IN APACHE COUNTY ...
BEGIN_DATE BEGIN_TIME EVENT_TYPE MAGNITUDE TOR_F_SCALE DEATHS_DIRECT \
0 01/01/2019 0 Winter Storm 0
1 01/01/2019 0 Heavy Snow 0
2 01/01/2019 0 Heavy Snow 0
3 01/01/2019 0 Winter Storm 0
4 01/01/2019 0 Heavy Snow 0
INJURIES_DIRECT ... END_LOCATION END_DATE END_TIME BEGIN_LAT \
0 0 ... 01/01/2019 1330
1 0 ... 01/01/2019 700
2 0 ... 01/01/2019 700
3 0 ... 01/01/2019 1330
4 0 ... 01/01/2019 1100
BEGIN_LON END_LAT END_LON \
0
1
2
3
4
EVENT_NARRATIVE \
0 Accumulating snow began on the afternoon of De...
1 Snow began falling in the Taylor area on New Y...
2 Eight inches of snow fell in Pinetop-Lakeside ...
3 Accumulating snow began on the afternoon of De...
4 Snow began to fall New Year's Eve during the e...
EPISODE_NARRATIVE ABSOLUTE_ROWNUMBER
0 A relatively strong and cold weather system im... 1
1 A third storm system in a week crossed norther... 2
2 A third storm system in a week crossed norther... 3
3 A relatively strong and cold weather system im... 4
4 A third storm system in a week crossed norther... 5
[5 rows x 39 columns]
Urbanization and Environmental Quality in Arizona, USA
Proposal
High Level Goal
The goal of our project is to analyze the relationship between urbanization and environmental quality across regions within Arizona, identifying patterns and anomalies in how urban growth impacts climate metrics such as storms, temperature, and rainfall. As the public outcry in Tucson over the anticipated environmental impacts of the Project Blue data center shows, environmental quality is important to local populations. We hope to use this study to identify the regional relationship between urbanization and climate indicators. We will use traffic count data as our urbanization metric, and National Oceanic and Atmospheric Administration (NOAA) climate and storm event data as our environmental quality data.
Research Questions
Can we predict the environmental quality of a region based on urbanization indicators for that region?
The reason we chose this question is that if we can accurately predict how urbanization affects climate, policymakers can proactively work to preserve or improve the environmental quality of that region.
Are storm event data and traffic data successful environmental health and urbanization indicators, respectively?
The infrastructure for traffic monitoring is affordable to implement and already has a framework for deployment. If traffic volume is an indicator of environmental quality, we could use it to better study areas of climatological interest.
Is the relationship between urbanization level and climate indicators better described by a linear or quadratic model?
This will indicate how accurate our scope is compared to other environmental impact studies.
Can PCA be used to compare regions throughout Arizona based on traffic, climate, and storm event data?
If our hypothesis that high urbanization level leads to low environmental quality is correct, we would expect to see similarities between metropolitan regions. Additionally, we might be able to identify regions with unknown similarities.
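To make the PCA question more concrete, the following is a minimal sketch of how regions could be projected onto principal components using NumPy only. The feature matrix is made up for illustration, its three columns merely stand in for features such as traffic_counts, avg_temp_2022, and rainfall_2022, and the helper name `pca` is our own; the project itself may well use a library implementation instead.

```python
import numpy as np

def pca(features: np.ndarray, n_components: int = 2):
    """Project standardized rows of `features` onto the top principal components."""
    # Standardize each column so features with large units do not dominate.
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    z = (features - mean) / std

    # SVD of the standardized matrix: rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    components = vt[:n_components]
    scores = z @ components.T
    # Fraction of total variance captured by each component.
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, components, explained[:n_components]

# Hypothetical regions-by-features matrix (made-up values, illustration only).
X = np.array([
    [9.0, 75.0, 8.0],
    [8.5, 74.0, 9.0],
    [1.0, 55.0, 20.0],
    [1.5, 57.0, 22.0],
    [5.0, 65.0, 14.0],
])
scores, components, explained = pca(X, n_components=2)
```

In this toy matrix the first two rows and the next two rows land on opposite sides of the first component, which is the kind of grouping we would look for between metropolitan and rural regions.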
Proposed Procedure
This project will consist of 2 studies:
Study 1: A regional regression model of environmental quality features (storm events and climate data) as a function of our urbanization target feature (traffic volume). We will do this with multivariate regression models, with a hypothesis that there is a negative linear relationship between urbanization and environmental quality. We will train a linear regression model and a quadratic regression model and compare the two.
Study 2: A principal component analysis (PCA) of regions across Arizona, combining climate data and traffic data to identify similar regions.
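The linear versus quadratic comparison in Study 1 could be sketched as follows. This is a simplified single-predictor illustration on synthetic data, not project data: `x` plays the role of a scaled traffic volume, `y` a climate feature, and the holdout split, the RMSE metric, and the helper `fit_and_score` are our own assumptions.

```python
import numpy as np

def fit_and_score(x_train, y_train, x_test, y_test, degree):
    """Fit a polynomial of the given degree and return its test-set RMSE."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    pred = np.polyval(coeffs, x_test)
    return float(np.sqrt(np.mean((y_test - pred) ** 2)))

# Synthetic stand-in data with a mild curved dependence plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 - 0.5 * x + 0.2 * x ** 2 + rng.normal(0.0, 0.5, size=200)

# Simple holdout split: first 150 rows for training, last 50 for testing.
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

rmse_linear = fit_and_score(x_train, y_train, x_test, y_test, degree=1)
rmse_quadratic = fit_and_score(x_train, y_train, x_test, y_test, degree=2)
print(f"linear RMSE: {rmse_linear:.3f}, quadratic RMSE: {rmse_quadratic:.3f}")
```

Because the synthetic target is curved by construction, the quadratic fit scores better here; on the real regional data either model could win, which is what the comparison is meant to reveal.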
Temporal vs. Regional Geospatial Analysis
When thinking of climate models, people are probably most familiar with a temporal frame of thinking, in which one region is compared to itself at different points in time. Because our study asks how one region compares to another, we have to approach the problem differently. To compare regions, we plan to build a dataset of climate, storm event, and traffic features in which each instance is a different region.
Since these measurements are dependent on time, however, we will have to add features that normalize the temporal aspect of our dataset. We will do this by engineering features that compare our primary dataset values to a dataset of averaged values for the same regions and features over a 5-year period. We have selected data from 2022 as our primary dataset, and we have selected the timeframe 2019-2023 to construct our averaged values. Comparing storm patterns across regions provides valuable insights into climate health, as variations in storm frequency and intensity can indicate underlying changes in regional climate conditions. This approach enables us to analyze spatial differences in climate quality while accounting for temporal variability.
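As a rough illustration of this normalization, the sketch below computes a 2022 value and its difference from the per-region mean over all available years. The column names `region` and `year`, the function name, and the sample values are hypothetical placeholders; only the output naming pattern (for example max_temp_2022 and max_temp_above_5yavg) follows the feature tables in this proposal.

```python
import pandas as pd

def add_baseline_feature(yearly: pd.DataFrame, value_col: str,
                         target_year: int = 2022) -> pd.DataFrame:
    """Return one row per region with the target-year value and its
    difference from that region's mean over all years in `yearly`."""
    # Per-region mean over every year present (2019-2023 in the demo below).
    baseline = yearly.groupby("region")[value_col].mean().rename("baseline")
    # Target-year value per region, renamed to e.g. max_temp_2022.
    target = (
        yearly.loc[yearly["year"] == target_year]
        .set_index("region")[[value_col]]
        .rename(columns={value_col: f"{value_col}_{target_year}"})
    )
    out = target.join(baseline)
    out[f"{value_col}_above_5yavg"] = (
        out[f"{value_col}_{target_year}"] - out["baseline"]
    )
    return out.drop(columns="baseline").reset_index()

# Made-up values for two hypothetical regions, illustration only.
yearly = pd.DataFrame({
    "region": ["A"] * 5 + ["B"] * 5,
    "year": [2019, 2020, 2021, 2022, 2023] * 2,
    "max_temp": [10.0, 12.0, 11.0, 15.0, 12.0, 20.0, 20.0, 20.0, 18.0, 22.0],
})
features = add_baseline_feature(yearly, "max_temp")
```

A "below average" feature such as rainfall_below_5yavg would use the same pattern with the subtraction reversed.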
Engineered Features
Climate Features
Feature | Description |
---|---|
max_temp_2022 | maximum temperature for 2022 |
max_temp_above_5yavg | maximum temperature for 2022 minus maximum temperature averaged 2019-2023 |
avg_temp_2022 | average temperature for 2022 |
avg_temp_above_5yavg | average temperature for 2022 minus average temperature averaged 2019-2023 |
rainfall_2022 | total inches of rainfall in 2022 |
rainfall_below_5yavg | inches of annual rainfall averaged 2019-2023 minus inches of annual rainfall in 2022 |
lowmagstorm_events_2022 | sum of low magnitude storm events in 2022 |
lowmagstorm_events_above_5yavg | sum of low magnitude storm events for 2022 minus sum of low magnitude storm events averaged 2019-2023 |
highmagstorm_events_2022 | sum of high magnitude storm events in 2022 |
highmagstorm_events_above_5yavg | sum of high magnitude storm events for 2022 minus sum of high magnitude storm events averaged 2019-2023 |
average_storm_mag_2022 | average storm event magnitude for 2022 |
Urbanization Features
Feature | Description |
---|---|
traffic_counts | total number of vehicles detected in 2022 |
traffic_counts_above_5yavg | traffic counts in 2022 minus traffic counts averaged 2019-2023 |
Datasets
Dataset 1 - Storm Data in Arizona
This dataset contains storm event records in Arizona, sourced from the NOAA Storm Events Database found at https://www.ncdc.noaa.gov/stormevents/choosedates.jsp?statefips=4%2CARIZONA. It includes information about various weather events such as floods, tornadoes, and severe storms, along with details like location, date, event type, magnitude, fatalities, injuries, and property damage. It also includes metadata such as time zones, county information, and narrative descriptions of each event.
The dataset consists of a mix of numerical and categorical values, specifically, 12 columns with integer types and 27 with object types. It was chosen for its relevance to climate and environmental analysis in Arizona. The data enables the exploration of temporal and spatial patterns in extreme weather events and supports investigations into trends related to climate change, urbanization, and risk assessment.
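One way the yearly storm files might be combined is sketched below. The in-memory CSVs hold made-up rows and only three of the 39 columns; the function name `combine_storm_years` and the derived `YEAR` column are our own additions, and the date format is inferred from the preview at the top of this document (for example 01/01/2019).

```python
import io
import pandas as pd

def combine_storm_years(sources) -> pd.DataFrame:
    """Read each yearly CSV (path or file-like object) and stack the rows."""
    frames = [pd.read_csv(src) for src in sources]
    combined = pd.concat(frames, ignore_index=True)
    # Parse MM/DD/YYYY dates so events can be grouped by year.
    combined["BEGIN_DATE"] = pd.to_datetime(
        combined["BEGIN_DATE"], format="%m/%d/%Y"
    )
    combined["YEAR"] = combined["BEGIN_DATE"].dt.year
    return combined

# Tiny stand-ins for two yearly files (made-up rows, illustration only).
year_2019 = io.StringIO(
    "EVENT_ID,BEGIN_DATE,EVENT_TYPE\n"
    "1,01/01/2019,Winter Storm\n"
    "2,07/15/2019,Flash Flood\n"
)
year_2020 = io.StringIO(
    "EVENT_ID,BEGIN_DATE,EVENT_TYPE\n"
    "3,08/02/2020,Flash Flood\n"
)
combined = combine_storm_years([year_2019, year_2020])
events_per_year = combined.groupby("YEAR").size()
```

With the real data, the same function could be called with the list of file names such as storm_data_AZ_2019.csv through storm_data_AZ_2023.csv.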
Dataset 2 - Weather Data in Arizona
Shape: (1093110, 37)
Columns: ['STATION', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE', 'DAPR', 'DASF', 'EVAP', 'MDPR', 'MDSF', 'PRCP', 'SNOW', 'SNWD', 'TAVG', 'TMAX', 'TMIN', 'TOBS', 'WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT07', 'WT08', 'WT09', 'WT10', 'WT11', 'NAME', 'AWND', 'PGTM', 'WDF2', 'WDF5', 'WESD', 'WESF', 'WSF2', 'WSF5']
Data types: STATION object
LATITUDE float64
LONGITUDE float64
ELEVATION float64
DATE object
DAPR float64
DASF float64
EVAP float64
MDPR float64
MDSF float64
PRCP float64
SNOW float64
SNWD float64
TAVG float64
TMAX float64
TMIN float64
TOBS float64
WT01 float64
WT02 float64
WT03 float64
WT04 float64
WT05 float64
WT06 float64
WT07 float64
WT08 float64
WT09 float64
WT10 float64
WT11 float64
NAME object
AWND float64
PGTM float64
WDF2 float64
WDF5 float64
WESD float64
WESF float64
WSF2 float64
WSF5 float64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093110 entries, 0 to 1093109
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 STATION 1093110 non-null object
1 LATITUDE 1093110 non-null float64
2 LONGITUDE 1093110 non-null float64
3 ELEVATION 1091276 non-null float64
4 DATE 1093110 non-null object
5 DAPR 4668 non-null float64
6 DASF 1 non-null float64
7 EVAP 1685 non-null float64
8 MDPR 4621 non-null float64
9 MDSF 1 non-null float64
10 PRCP 925826 non-null float64
11 SNOW 647733 non-null float64
12 SNWD 186831 non-null float64
13 TAVG 169447 non-null float64
14 TMAX 374425 non-null float64
15 TMIN 373480 non-null float64
16 TOBS 185400 non-null float64
17 WT01 3111 non-null float64
18 WT02 453 non-null float64
19 WT03 4333 non-null float64
20 WT04 193 non-null float64
21 WT05 192 non-null float64
22 WT06 38 non-null float64
23 WT07 68 non-null float64
24 WT08 2465 non-null float64
25 WT09 7 non-null float64
26 WT10 1 non-null float64
27 WT11 290 non-null float64
28 NAME 990861 non-null object
29 AWND 27965 non-null float64
30 PGTM 1728 non-null float64
31 WDF2 27995 non-null float64
32 WDF5 27907 non-null float64
33 WESD 40807 non-null float64
34 WESF 7919 non-null float64
35 WSF2 27996 non-null float64
36 WSF5 27908 non-null float64
dtypes: float64(34), object(3)
memory usage: 308.6+ MB
This data from NOAA National Centers for Environmental Information will serve as a measure of environmental quality. The dataset noaadata is a compilation of daily land surface observations within Arizona from 2018 to 2023. Some variables of importance include the latitude and longitude of each station, temperatures, precipitation, and snowfall.
This data contains primarily numerical values, with the only categorical variables being the name of the station where weather data is collected, and date the data was collected.
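Since the observations are daily, they need to be rolled up before they can feed the yearly climate features. The sketch below shows one possible aggregation per station and year using the TMAX, TAVG, and PRCP columns listed above. The station identifiers and values are made up, the output column names are placeholders, and we assume DATE parses as an ISO-style date string.

```python
import pandas as pd

def annual_station_summary(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily observations to one row per station and year."""
    daily = daily.copy()
    daily["YEAR"] = pd.to_datetime(daily["DATE"]).dt.year
    return (
        daily.groupby(["STATION", "YEAR"])
        .agg(
            max_temp=("TMAX", "max"),   # hottest daily maximum of the year
            avg_temp=("TAVG", "mean"),  # mean of available daily averages
            rainfall=("PRCP", "sum"),   # missing days are skipped, not imputed
        )
        .reset_index()
    )

# Made-up rows for two hypothetical stations, illustration only.
nan = float("nan")
daily = pd.DataFrame({
    "STATION": ["STN_A", "STN_A", "STN_A", "STN_B"],
    "DATE": ["2022-01-01", "2022-07-01", "2021-07-01", "2022-01-01"],
    "TMAX": [60.0, 110.0, 105.0, nan],
    "TAVG": [50.0, nan, 90.0, nan],
    "PRCP": [0.1, 0.0, 0.5, 0.2],
})
summary = annual_station_summary(daily)
```

Given how sparse TAVG, TMAX, and TMIN are in the non-null counts above, stations with too few observations in a year would probably need to be filtered out before these aggregates are trusted.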
Dataset 3 - Traffic Data in Arizona
Geographic locations of each station are formatted as follows:
Index(['index', 'record_type', 'state_code', 'station_id', 'travel_dir',
'travel_lane', 'year_record', 'f_system', 'num_lanes',
'sample_type_volume', 'num_lanes_volume', 'method_volume',
'sample_type_class', 'num_lanes_class', 'method_class',
'algorithm_volume', 'num_classes', 'sample_type_truck',
'num_lanes_truck', 'method_truck', 'calibration', 'data_retrieval',
'type_sensor_1', 'type_sensor_2', 'primary_purpose', 'lrs_id',
'lrs_point', 'latitude', 'longitude', 'shrp_id', 'prev_station_id',
'year_established', 'year_discontinued', 'county_code', 'is_sample',
'sample_id', 'nhs', 'posted_route_signing', 'posted_signed_route',
'con_route_signing', 'con_signed_route', 'station_location'],
dtype='object')
Counts of traffic stations will be read in from the following files:
['data/Traffic/counts/AZ0119.VOL', 'data/Traffic/counts/AZ0219.VOL', 'data/Traffic/counts/AZ0319.VOL', 'data/Traffic/counts/AZ0419.VOL', 'data/Traffic/counts/AZ0519.VOL', 'data/Traffic/counts/AZ0619.VOL', 'data/Traffic/counts/AZ0719.VOL', 'data/Traffic/counts/AZ0819.VOL', 'data/Traffic/counts/AZ0919.VOL', 'data/Traffic/counts/AZ1019.VOL', 'data/Traffic/counts/AZ1119.VOL', 'data/Traffic/counts/AZ1219.VOL', 'data/Traffic/counts/AZ0120.VOL', 'data/Traffic/counts/AZ0220.VOL', 'data/Traffic/counts/AZ0320.VOL', 'data/Traffic/counts/AZ0420.VOL', 'data/Traffic/counts/AZ0520.VOL', 'data/Traffic/counts/AZ0620.VOL', 'data/Traffic/counts/AZ0720.VOL', 'data/Traffic/counts/AZ0820.VOL', 'data/Traffic/counts/AZ0920.VOL', 'data/Traffic/counts/AZ1020.VOL', 'data/Traffic/counts/AZ1120.VOL', 'data/Traffic/counts/AZ1220.VOL', 'data/Traffic/counts/AZ0121.VOL', 'data/Traffic/counts/AZ0221.VOL', 'data/Traffic/counts/AZ0321.VOL', 'data/Traffic/counts/AZ0421.VOL', 'data/Traffic/counts/AZ0521.VOL', 'data/Traffic/counts/AZ0621.VOL', 'data/Traffic/counts/AZ0721.VOL', 'data/Traffic/counts/AZ0821.VOL', 'data/Traffic/counts/AZ0921.VOL', 'data/Traffic/counts/AZ1021.VOL', 'data/Traffic/counts/AZ1121.VOL', 'data/Traffic/counts/AZ1221.VOL', 'data/Traffic/counts/AZ0122.VOL', 'data/Traffic/counts/AZ0222.VOL', 'data/Traffic/counts/AZ0322.VOL', 'data/Traffic/counts/AZ0422.VOL', 'data/Traffic/counts/AZ0522.VOL', 'data/Traffic/counts/AZ0622.VOL', 'data/Traffic/counts/AZ0722.VOL', 'data/Traffic/counts/AZ0822.VOL', 'data/Traffic/counts/AZ0922.VOL', 'data/Traffic/counts/AZ1022.VOL', 'data/Traffic/counts/AZ1122.VOL', 'data/Traffic/counts/AZ1222.VOL']
Index(['Record_Type', 'State_Code', 'F_System', 'Station_Id', 'Travel_Dir',
'Travel_Lane', 'Year_Record', 'Month_Record', 'Day_Record',
'Day_of_Week', 'Hour_00', 'Hour_01', 'Hour_02', 'Hour_03', 'Hour_04',
'Hour_05', 'Hour_06', 'Hour_07', 'Hour_08', 'Hour_09', 'Hour_10',
'Hour_11', 'Hour_12', 'Hour_13', 'Hour_14', 'Hour_15', 'Hour_16',
'Hour_17', 'Hour_18', 'Hour_19', 'Hour_20', 'Hour_21', 'Hour_22',
'Hour_23', 'Restrictions'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21866 entries, 0 to 21865
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Record_Type 21866 non-null int64
1 State_Code 21866 non-null int64
2 F_System 21866 non-null object
3 Station_Id 21866 non-null int64
4 Travel_Dir 21866 non-null int64
5 Travel_Lane 21866 non-null int64
6 Year_Record 21866 non-null int64
7 Month_Record 21866 non-null int64
8 Day_Record 21866 non-null int64
9 Day_of_Week 21866 non-null int64
10 Hour_00 21866 non-null int64
11 Hour_01 21866 non-null int64
12 Hour_02 21866 non-null int64
13 Hour_03 21866 non-null int64
14 Hour_04 21866 non-null int64
15 Hour_05 21866 non-null int64
16 Hour_06 21866 non-null int64
17 Hour_07 21866 non-null int64
18 Hour_08 21866 non-null int64
19 Hour_09 21866 non-null int64
20 Hour_10 21866 non-null int64
21 Hour_11 21866 non-null int64
22 Hour_12 21866 non-null int64
23 Hour_13 21866 non-null int64
24 Hour_14 21866 non-null int64
25 Hour_15 21866 non-null int64
26 Hour_16 21866 non-null int64
27 Hour_17 21866 non-null int64
28 Hour_18 21866 non-null int64
29 Hour_19 21866 non-null int64
30 Hour_20 21866 non-null int64
31 Hour_21 21866 non-null int64
32 Hour_22 21866 non-null int64
33 Hour_23 21866 non-null int64
34 Restrictions 21866 non-null int64
dtypes: int64(34), object(1)
memory usage: 5.8+ MB
The traffic data above is sourced from the Federal Highway Administration (FHWA) of the U.S. Department of Transportation (https://www.fhwa.dot.gov/policyinformation/tables/tmasdata/). It will serve as a metric for urbanization level of regions throughout Arizona. It is presumed that counties with higher traffic flow have greater urbanization than counties with low traffic flow.
The traffic data are sourced from two different file structures, counts data and station data.
The station data contains information about sampling locations, including latitude and longitude, the number of lanes on the road, and the type of sensor. Our main variables of interest will be latitude (numerical), longitude (numerical), and county (categorical).
The count data is nearly all numerical. Each row is a station location on a specific day of the given file month/year. Each row includes counts of passing vehicles at the listed station location for every hour of that day. We will engineer a feature variable aggregating total counts over a year at each station location. Variables of interest include the hourly counts (e.g., Hour_05) and the station number.
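The yearly aggregation described above could look roughly like the following sketch, which sums the 24 hourly columns per row and then totals them per station and year. The demo rows are made up, the output column name traffic_counts follows the feature table, and summing across all directions and lanes is our assumption about how the total should be defined.

```python
import pandas as pd

# Hour_00 ... Hour_23, matching the count file columns listed above.
HOUR_COLS = [f"Hour_{h:02d}" for h in range(24)]

def yearly_station_counts(counts: pd.DataFrame) -> pd.DataFrame:
    """Total vehicles per station and year, summed over hours, days,
    travel directions, and lanes."""
    daily_total = counts[HOUR_COLS].sum(axis=1)
    return (
        counts.assign(daily_total=daily_total)
        .groupby(["Station_Id", "Year_Record"], as_index=False)["daily_total"]
        .sum()
        .rename(columns={"daily_total": "traffic_counts"})
    )

# Made-up rows: station 1 has two days at 10 vehicles per hour,
# station 2 has one day at 1 vehicle per hour.
rows = []
for station, per_hour in [(1, 10), (1, 10), (2, 1)]:
    row = {"Station_Id": station, "Year_Record": 22}
    row.update({col: per_hour for col in HOUR_COLS})
    rows.append(row)
demo = pd.DataFrame(rows)
totals = yearly_station_counts(demo)
```

A raw sum favors stations that reported on more days, so a per-day average may be a fairer comparison if station coverage turns out to be uneven.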
Schedule
Date | Task | Details | Completed? |
---|---|---|---|
Aug 9 | Final Proposal Submission | Submit this project proposal to meet course requirements and include peer edits. Ensure all datasets are identified and accessible. | ✅ |
Aug 10 | Data Validation & Cleaning | Verify all NOAA, storm event, and traffic datasets are complete and free of formatting issues. Handle missing data, unify column formats, and confirm geospatial coordinate accuracy. | ✅ |
Aug 11 | Exploratory Data Analysis (EDA) | Perform initial descriptive statistics and visualizations for each dataset (storm events, climate data, traffic counts). Identify trends, anomalies, and data distribution shapes. | ✅ |
Aug 12 | Dataset Integration | Merge storm, climate, and traffic data into a single regional dataset for 2022. Create temporal normalization dataset for 2019–2023 averages. | ✅ |
Aug 13 | Feature Engineering | Create engineered features outlined in the proposal (e.g., max_temp_above_5yavg, traffic_counts_above_5yavg). Ensure proper units and scaling for regression. | ✅ |
Aug 14 | Regression Analysis Setup | Split data into training/testing sets. Implement baseline linear and quadratic regression models to predict environmental quality from urbanization indicators. | ✅ |
Aug 15 | Model Tuning & Evaluation | Perform hyperparameter tuning for regression models. Compare linear vs. quadratic fits. Assess model performance. | ✅ |
Aug 16 | PCA & Regional Comparison | Run PCA on the combined dataset. Identify clusters of similar regions and interpret principal components in relation to urbanization and environmental quality. | ✅ |
Aug 17 | Visualization Development | Create regression plots, PCA scatterplots, and geospatial maps of Arizona showing urbanization and climate impact patterns. | ✅ |
Aug 18 | Presentation Recording | Develop and record project presentation, including visualizations and key findings. Ensure explanation of model results and their policy implications. | ✅ |
Aug 19 | Final Write-Up | Complete detailed report including methodology, results, discussion, and limitations. Prepare references and appendices with code snippets. | ✅ |
Aug 20 (11:59 PM) | Submission Deadline | Submit final write-up and presentation. | ✅ |