Proposal for Prediction of Tornado Occurrence in Minnesota
Proposal
An analysis of historical weather/tornado patterns in the Twin Cities area of Minnesota (MN) from 2020 to the present (2025) to predict tornado occurrence
Author: The Crengineers - Tyler Hart
Affiliation: College of Information Science, College of Systems and Industrial Engineering, University of Arizona
Dataset
First and foremost, I chose this project because some of my friends were recently affected by tornadoes in the state of MN. Since a large amount of weather data is publicly available, I expected the data to be easy to source, and baseline models already exist to compare against. I think this will let me focus on the ML part of the course rather than on data preprocessing, which is a new area for me. Furthermore, there are some good classification questions about how to classify weather from basic daily data, and some good regression questions about whether tornado occurrence can be predicted from that same daily weather data. As of now I have two datasets. The first is the historical archive of tornadoes in MN going back as far as the 1950s; it contains the day of each tornado as well as the relative size/damage along its path. It has 38 columns, which can be seen in the codebook subsections.
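As a quick sanity check, the archive can be loaded and inspected like this (a minimal sketch; the file path matches the one used in the Source Code section):

```python
import pandas as pd

# Load the NWS storm-events export for MN (path as used in the proposal's source code)
HistoricRaw = pd.read_csv("data/storm_data_search_results.csv")

print(HistoricRaw.shape)    # expect roughly (number of events, 38)
print(HistoricRaw.columns)  # the 38 columns described in the codebook below
HistoricRaw.info()          # dtypes and non-null counts reveal the sparsely filled fields
```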
Looking at the data, many fields are not filled in or are defaulted to zeros. This is due to the many sources feeding the archive. In my case I think this is acceptable, since I am really using it as a lookup for the conditions on a given day. An important note is that this archive contains all of the tornado data for MN. Since I only have weather data for the Twin Cities, I will later filter it down to the surrounding counties.
An image of the counties I am going to pick can be seen below:
[Map of the selected counties surrounding the Twin Cities]
A full data dictionary for the storm data export is located at data_dictionaries/Storm-Data-Export.pdf. These will eventually be transcribed into the README.md files for each of the datasets.
The other data I will be using is queried from a weather API; the code used to do this can be seen in the Source Code section of the proposal. This set contains daily weather data for the last 5 years (2020-current). It comes out to 1,827 rows × 20 columns of daily weather observations from MSP International Airport, the major weather hub for the Twin Cities. The data is almost entirely numerical and appears to be filled in accurately, although I do see some zero values that may need to be addressed later (the first rows and column info can be checked with WeatherRaw.head() and WeatherRaw.info()). My plan is to use this data to get the conditions on the days when tornadoes occurred; this is where most of the model data will be derived from. A data dictionary can be seen in data/README.md or in the codebook below; a minimal sketch of the API query comes first.
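This is roughly how the Open-Meteo archive is queried (the full version, with all 19 daily variables, is in the Source Code section; the shortened variable list here is only for illustration):

```python
import openmeteo_requests
import requests_cache
from retry_requests import retry
import pandas as pd

# Cached, retrying client, as in the full proposal code
cache_session = requests_cache.CachedSession(".cache", expire_after=-1)
openmeteo = openmeteo_requests.Client(session=retry(cache_session, retries=5, backoff_factor=0.2))

params = {
    "latitude": 44.98, "longitude": -93.2638,            # Minneapolis-St. Paul
    "start_date": "2020-01-01", "end_date": "2024-12-31",
    "daily": ["weather_code", "temperature_2m_mean", "precipitation_sum"],  # shortened list
    "timezone": "America/Chicago", "temperature_unit": "fahrenheit",
    "wind_speed_unit": "mph", "precipitation_unit": "inch",
}
response = openmeteo.weather_api("https://archive-api.open-meteo.com/v1/archive", params=params)[0]
daily = response.Daily()

# Build a date index, then pull each variable back in the order it was requested
daily_data = {"date": pd.date_range(
    start=pd.to_datetime(daily.Time(), unit="s", utc=True),
    end=pd.to_datetime(daily.TimeEnd(), unit="s", utc=True),
    freq=pd.Timedelta(seconds=daily.Interval()), inclusive="left")}
for i, name in enumerate(params["daily"]):
    daily_data[name] = daily.Variables(i).ValuesAsNumpy()

WeatherRaw = pd.DataFrame(daily_data)
```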
Codebook for [mn_weather_data] Dataset
| Variable Names | Data Types | Description | Unit |
|----------------|------------|-------------|------|
| date | TimeDate | Date and time of weather collection | (YYYY-MM-DD hh-mm-ss) |
| weather_code | Float | The most severe weather condition on a given day | (WMO Code) |
| temperature_2m_mean | Float | Mean daily air temperature at 2 meters above ground | (°F) |
| temperature_2m_max | Float | Maximum daily air temperature at 2 meters above ground | (°F) |
| temperature_2m_min | Float | Minimum daily air temperature at 2 meters above ground | (°F) |
| precipitation_sum | Float | Sum of daily precipitation (including rain, showers and snowfall) | (inch) |
| rain_sum | Float | Sum of daily rain | (inch) |
| snowfall_sum | Float | Sum of daily snowfall | (inch) |
| wind_speed_10m_max | Float | Maximum wind speed on a day | (mph) |
| wind_gusts_10m_max | Float | Maximum wind gusts on a day | (mph) |
| wind_direction_10m_dominant | Float | Dominant wind direction | (°) |
| dew_point_2m_mean | Float | Mean daily dew point at 2 meters above ground | (°F) |
| dew_point_2m_max | Float | Maximum daily dew point at 2 meters above ground | (°F) |
| dew_point_2m_min | Float | Minimum daily dew point at 2 meters above ground | (°F) |
| relative_humidity_2m_mean | Float | Mean daily relative humidity at 2 meters above ground | (%) |
| relative_humidity_2m_max | Float | Maximum daily relative humidity at 2 meters above ground | (%) |
| relative_humidity_2m_min | Float | Minimum daily relative humidity at 2 meters above ground | (%) |
| surface_pressure_mean | Float | Mean daily pressure at surface | (hPa) |
| surface_pressure_max | Float | Maximum daily pressure at surface | (hPa) |
| surface_pressure_min | Float | Minimum daily pressure at surface | (hPa) |
Codebook for [storm_data_search_results] Dataset
| Variable Names | Data Types | Description | Unit |
|----------------|------------|-------------|------|
| EVENT_ID | Int | ID assigned by NWS to note a single, small part that goes into a specific storm episode | Database ID |
| CZ_NAME_STR | String | County/Parish, Zone or Marine Name assigned to the county FIPS number or NWS Forecast Zone | None |
| BEGIN_LOCATION | String | Location the event began | None |
| BEGIN_DATE | Date | Date the event was reported | MM-DD-YYYY |
| BEGIN_TIME | Time | Time the event was reported | hh:mm:ss |
| EVENT_TYPE | String | Type of storm event (e.g., Tornado, Hail) | None |
| TOR_F_SCALE | String | Enhanced Fujita Scale describing the strength of the tornado based on the amount and type of damage it caused | Fujita Scale |
| DEATHS_DIRECT | Int | The number of deaths directly related to the weather event | Deaths |
| INJURIES_DIRECT | Int | The number of injuries directly related to the weather event | Injuries |
| DAMAGE_PROPERTY_NUM | Float | Estimated monetary damage to property in the affected areas | $ |
| DAMAGE_CROPS_NUM | Float | Estimated monetary damage to crops in the affected areas | $ |
| STATE_ABBR | String | State code of the affected area | State Code |
| CZ_TIMEZONE | String | Time zone for the County/Parish, Zone or Marine Name | Time Zone |
| EPISODE_ID | Int | ID assigned by NWS to denote the storm episode | None |
| CZ_TYPE | String | Type of jurisdiction (County/Parish, Zone or Marine Name) | None |
| CZ_FIPS | Int | FIPS ID number given to the County/Parish, Zone or Marine Name | FIPS Code |
| WFO | String | National Weather Service Forecast Office's area of responsibility in which the event occurred | WFO Code |
| INJURIES_INDIRECT | Int | The number of injuries indirectly related to the weather event | Injuries |
| DEATHS_INDIRECT | Int | The number of deaths indirectly related to the weather event | Deaths |
| SOURCE | String | Source of the report (e.g., Weather Radar, Storm Chaser, Sighting) | None |
| FLOOD_CAUSE | String | Reported or estimated cause of the flood | None |
| TOR_LENGTH | Float | Length of the tornado or tornado segment while on the ground | Miles |
| TOR_WIDTH | Int | Width of the tornado or tornado segment while on the ground | Yards |
| END_LOCATION | String | Location the event ended | None |
| END_DATE | Date | Date the event ended | MM-DD-YYYY |
| END_TIME | Time | Time the event ended | hh:mm:ss |
| BEGIN_LAT | Float | The latitude where the event began | (°) |
| BEGIN_LON | Float | The longitude where the event began | (°) |
| END_LAT | Float | The latitude where the event ended | (°) |
| END_LON | Float | The longitude where the event ended | (°) |
| EPISODE_NARRATIVE | Sentence | The episode narrative depicting the general nature and overall activity of the episode; created by NWS | None |
| EVENT_NARRATIVE | Sentence | The event narrative providing more specific details of the individual event; provided by NWS | None |
Data Clean Up
The tornado data has holes in many of its categories. For this analysis I really only need the date, county, and tornado statistics (Fujita scale, width, length, etc.) so that they can be joined to the daily weather data. The daily weather data will be left mostly untouched, since it has no null values and contains all of the relevant columns. I will therefore append the relevant tornado data to the weather data by date, as sketched below.
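A condensed sketch of that merge and the new indicator features (the full clean-data code is in the Source Code section; the column choices here mirror it):

```python
import pandas as pd

WeatherRaw = pd.read_csv("data/mn_weather_data.csv")
HistoricRaw = pd.read_csv("data/storm_data_search_results.csv")

# Tornado events: keep only the columns needed, keyed by date
torn = HistoricRaw[["CZ_NAME_STR", "BEGIN_DATE", "TOR_F_SCALE", "TOR_LENGTH", "TOR_WIDTH"]].copy()
torn["date"] = pd.to_datetime(torn["BEGIN_DATE"])
torn = torn.drop(columns=["BEGIN_DATE"])

# Daily weather: normalize the timestamp to a plain date so the two tables join cleanly
weather = WeatherRaw.copy()
weather["date"] = pd.to_datetime(pd.to_datetime(weather["date"], utc=True).dt.date)

# Outer merge keeps every weather day; tornado columns are filled only on event days
merged = pd.merge(torn, weather, on="date", how="outer")
merged["tornado_occurred"] = merged["TOR_F_SCALE"].notna().astype(int)  # 1 on tornado days
merged["time_of_year"] = merged["date"].dt.month                        # month as a seasonal feature
```

The first few rows of the resulting merged table: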
|   | CZ_NAME_STR | date | TOR_F_SCALE | TOR_LENGTH | TOR_WIDTH | weather_code | temperature_2m_mean | temperature_2m_max | temperature_2m_min | precipitation_sum | ... | dew_point_2m_max | dew_point_2m_min | relative_humidity_2m_mean | relative_humidity_2m_max | relative_humidity_2m_min | surface_pressure_mean | surface_pressure_max | surface_pressure_min | tornado_occurred | time_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 2000-01-02 | None | 0.0 | 0.0 | 73.0 | 29.362250 | 33.4085 | 26.838501 | 0.039370 | ... | 26.6585 | 20.088500 | 80.965065 | 90.481580 | 67.150380 | 977.19560 | 984.67926 | 972.81570 | 0 | 1 |
| 1 | None | 2000-01-03 | None | 0.0 | 0.0 | 3.0 | 24.491000 | 26.3885 | 22.788500 | 0.000000 | ... | 19.5485 | 15.768499 | 74.725880 | 82.030160 | 66.527940 | 986.42120 | 988.42944 | 984.91630 | 0 | 1 |
| 2 | None | 2000-01-04 | None | 0.0 | 0.0 | 71.0 | 16.728498 | 24.2285 | 10.638500 | 0.000000 | ... | 18.3785 | -1.871502 | 60.066280 | 78.316124 | 44.035545 | 987.85956 | 990.71370 | 984.34186 | 0 | 1 |
| 3 | None | 2000-01-05 | None | 0.0 | 0.0 | 73.0 | 17.831000 | 28.0085 | 10.368500 | 0.090551 | ... | 25.5785 | -1.781502 | 69.617760 | 91.112190 | 55.900590 | 983.71606 | 990.35310 | 978.67990 | 0 | 1 |
| 4 | None | 2000-01-06 | None | 0.0 | 0.0 | 3.0 | 22.987251 | 29.3585 | 13.248501 | 0.000000 | ... | 25.7585 | 3.348501 | 69.387764 | 89.471180 | 56.031500 | 986.35736 | 992.35600 | 979.56040 | 0 | 1 |

5 rows × 26 columns
Exploratory Data Analysis
Questions
Can I develop a model that successfully classifies which days tornado formation will occur using daily weather data for the Twin Cities area?
Can I create a GUI/dashboard that allows users to quickly tune and view predictions?
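To make the first question concrete, here is a minimal baseline sketch (it assumes the data/tornado_days.csv file produced in the clean-up step; the feature list, split, and class_weight choice are illustrative, not final):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Assumed output of the data clean-up step above
df = pd.read_csv("data/tornado_days.csv")

features = ["temperature_2m_mean", "dew_point_2m_mean", "precipitation_sum",
            "wind_gusts_10m_max", "surface_pressure_mean", "relative_humidity_2m_mean",
            "time_of_year"]
X, y = df[features], df["tornado_occurred"]

# Tornado days are rare, so stratify the split and weight the classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```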
Analysis plan
| Week | Dates | Tasks | Status | Date Finished |
|------|-------|-------|--------|---------------|
| 1 | 7/28->8/1 | Search and Find Data | Complete | 8/1 |
| 2 | 8/4->8/8 | Create Proposal and Load and Make Data Dict and Clean Data | Complete | 8/11 |
| 2.5 | 8/7->8/10 | Perform EDA and Create Engineering Features and Start Write Up | Complete | 8/15 |
| 3 | 8/11->8/13 | Create Models and Examine Measures of Accuracy and Add to Write Up | Complete | 8/15 |
| 3.5 | 8/14->8/17 | Tune and Validate Classification and Regression and Add to Write Up | Complete | 8/16 |
| 4 | 8/18->8/20 | Finish Write Up and Finish Presentation | In Progress | 8/17 |
Course of Action
Search and Find Data:
Description:
The goal of this task is to find relevant data and to define the relevant questions.
In this task I want to find data on tornadoes in the state of MN and data on daily weather patterns in the Twin Cities.
Once I find this data, I want to add it to the repo, begin planning the proposal, and start the data dictionary.
Acceptance Criteria:
Data sets sourced and README Data Dictionary drafted out.
Data added to repo with relevant supporting data.
(Something New) Use API to source daily weather data.
Create Proposal and Load and Make Data Dict and Clean Data:
Description:
The goal of this task is to create the drafted proposal and make the corresponding data dictionary for the README and proposal respectively.
I want to load the data and clean it down to only the relevant parameters that I need.
The columns I want specifically are all of the daily weather data and the tornadoes from 2020-2025 in the counties surrounding the Twin Cities.
Define new features (I want a feature for whether a tornado occurred, and a feature for major storms as well).
Acceptance Criteria:
Data Loaded and Merged
Data columns scrubbed for relevant data and missing data resolved
Data Dictionary finished
Data Features Defined
Perform EDA and Create Engineering Features and Start Write Up:
Description:
The goal of this task is to create the features listed previously.
Once the features are created, I want to do a primary EDA using seaborn/matplotlib plots (a sketch follows the acceptance criteria below).
Document my findings from this EDA so that I can put those findings in my write up.
Acceptance Criteria:
Features created
EDA performed and documented (~200 words of observations)
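A short sketch of the kind of seaborn EDA intended here (column choices are illustrative; the monthly count of tornado days is an added example, not taken from the proposal code):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

NewFeatureRaw = pd.read_csv("data/tornado_days.csv")

# Compare a few candidate predictors against the tornado indicator
sns.pairplot(data=NewFeatureRaw,
             x_vars=["temperature_2m_mean", "dew_point_2m_mean", "precipitation_sum",
                     "wind_gusts_10m_max", "surface_pressure_mean"],
             y_vars=["tornado_occurred"])
plt.show()

# How are tornado days distributed across the year?
sns.countplot(data=NewFeatureRaw[NewFeatureRaw["tornado_occurred"] == 1], x="time_of_year")
plt.title("Tornado days by month")
plt.tight_layout()
plt.show()
```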
Create Models and Examine Measures of Accuracy and Add to Write Up:
Description:
The goal of this task is to create a simple classification model, probably logistic regression or Naive Bayes.
From there I want to examine how accurate the model is by looking at the accuracy and precision scores.
I also want to look at other scores, such as the F1 score or ROC-AUC (see the sketch after this list).
I also want to take a preliminary look at more complex classification models such as Random Forest and Support Vector Machines (SVM).
Acceptance Criteria:
Simple model(s) created, fit examined, and documented for the write up.
Implement scores like F1 and ROC-AUC
Start Implementation of Random Forest and SVM
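A hedged sketch of this evaluation step (Naive Bayes as one of the simple baselines; the feature list and split are assumptions carried over from the earlier sketch):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, f1_score, roc_auc_score

df = pd.read_csv("data/tornado_days.csv")   # output of the data clean-up step
features = ["temperature_2m_mean", "dew_point_2m_mean", "precipitation_sum",
            "wind_gusts_10m_max", "surface_pressure_mean", "time_of_year"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["tornado_occurred"], test_size=0.25,
    stratify=df["tornado_occurred"], random_state=42)

# Fit a simple baseline and report the scores named in this task
nb = GaussianNB().fit(X_train, y_train)
y_pred = nb.predict(X_test)
y_prob = nb.predict_proba(X_test)[:, 1]

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("F1       :", f1_score(y_test, y_pred, zero_division=0))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
```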
Tune and Validate Classification and Regression and Add to Write Up:
Description:
The goal of this task is to finish off the Random Forest and SVM from the previous task.
From there I want to examine measures of fit and see how well each model fits, using scoring similar to the previous tasks (a tuning sketch follows this list).
I want to add all of the measures of fit to the write up.
Start the app using Tkinter.
The app will use the best-fitting model's coefficients to estimate whether a tornado is likely given inputs like precipitation, CAPE, surface pressure, temperature, etc.
Acceptance Criteria:
Random Forest and SVM models finished and examined.
Documentation for all the models is finished for the write up.
Tkinter app is set up and ready to begin use.
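A minimal sketch of the tuning described above (the parameter grid and ROC-AUC scoring are assumptions, not final choices):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("data/tornado_days.csv")   # output of the data clean-up step
features = ["temperature_2m_mean", "dew_point_2m_mean", "precipitation_sum",
            "wind_gusts_10m_max", "surface_pressure_mean", "time_of_year"]

# Small illustrative grid; ROC-AUC copes with the heavy class imbalance better than accuracy
param_grid = {"n_estimators": [200, 500], "max_depth": [None, 10, 20],
              "class_weight": ["balanced", None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(df[features], df["tornado_occurred"])
print(search.best_params_, round(search.best_score_, 3))
```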
Finish Write Up and Finish Presentation:
Description:
The goal of this task is to finish up the Tkinter app (a minimal sketch of the idea follows the acceptance criteria below).
The second goal of this task is to finish the write up.
The other major goal is to start and finish the presentation based on the final write up.
Acceptance Criteria:
Write up, App, and presentation finished
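For the planned GUI, a minimal Tkinter sketch of the idea (the models/best_model.joblib path and the input fields are hypothetical; the real app would load whichever tuned model performs best):

```python
import tkinter as tk
import joblib

# Assumed: the best tuned classifier was saved earlier, e.g. joblib.dump(best_model, "models/best_model.joblib")
clf = joblib.load("models/best_model.joblib")   # hypothetical path

FEATURES = ["temperature_2m_mean", "dew_point_2m_mean", "precipitation_sum",
            "wind_gusts_10m_max", "surface_pressure_mean", "time_of_year"]

root = tk.Tk()
root.title("Tornado likelihood (sketch)")

# One entry box per feature the model expects
entries = {}
for row, name in enumerate(FEATURES):
    tk.Label(root, text=name).grid(row=row, column=0, sticky="w")
    entries[name] = tk.Entry(root)
    entries[name].grid(row=row, column=1)

result = tk.Label(root, text="Probability: ?")
result.grid(row=len(FEATURES) + 1, column=0, columnspan=2)

def predict():
    # Read the typed values in feature order and ask the fitted model for a probability
    values = [[float(entries[name].get()) for name in FEATURES]]
    prob = clf.predict_proba(values)[0, 1]
    result.config(text=f"Probability: {prob:.2%}")

tk.Button(root, text="Predict", command=predict).grid(row=len(FEATURES), column=0, columnspan=2)
root.mainloop()
```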
Repo Organization
.github/: specifically designated for GitHub-related files, including workflows, actions, and templates customized for managing issues.
_extra/: dedicated to storing miscellaneous files that do not categorically align with other project sections, serving as a versatile repository for various supplementary documents.
_freeze/: houses frozen environment files that contain detailed information about the project’s environment setup and dependencies.
data/: designated for storing essential data files crucial for the project’s operations, including input files, datasets, and other vital data resources.
data_dictionary/: designated as a place for supplemental documentation that I have used or found helpful.
images/: functioning as a central repository for visual assets utilized across the project, such as diagrams, charts, and screenshots, this directory houses the visual components essential for project documentation and presentation purposes.
.gitignore: designed to define exclusions from version control, ensuring that specific files and directories are not tracked by Git, thereby simplifying the versioning process.
README.md: functioning as the central repository of project information, this README document provides vital details covering project setup, usage instructions, and an overarching summary of project objectives and scope.
about.qmd: a Quarto file that supplements project documentation by providing additional context, describing the project purpose as well as the names and backgrounds of individual team members.
index.qmd: serves as the main page for our project, where our write up will eventually be. This Quarto file offers in-depth descriptions of our project, encompassing all code and visualizations, as well as eventually our results.
presentation.qmd: serves as a Quarto file that will present our slideshow of our final results of our project.
proposal.qmd: designed as the Quarto file responsible for our project proposal, housing our dataset, metadata, description, and questions, as well as our weekly plan of attack that will be updated weekly.
requirements.txt: specifies the dependencies and their respective versions required for the project to run successfully.
Source: a shared Kaggle competition project has a similar layout; for consistency, I checked that all of these files serve a similar purpose.
Source Code
---
title: "Proposal for Prediction of Tornado Occurrence in Minnesota"
subtitle: "Proposal"
author:
  - name: "The Crengineers - Tyler Hart"
    affiliations:
      - name: "College of Information Science, College of Systems and Industrial Engineering, University of Arizona"
description: "An analysis of historical weather/tornado patterns in the Twin Cities area of Minnesota (MN) from 2020 to the present (2025) to predict tornado occurrence"
format:
  html:
    code-tools: true
    code-overflow: wrap
    code-line-numbers: true
    embed-resources: true
editor: visual
code-annotations: hover
execute:
  warning: false
  echo: false
  freeze: auto # re-render only when source changes
jupyter: python3
---

```{python}
#| label: load-pkgs
#| message: false

# Packages needed to query the API and do basic dataset operations
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import openmeteo_requests
import requests_cache
from retry_requests import retry

# Boolean to regenerate the call to the API in case more data needs to be queried
callAPIbool = False
```

```{python}
#| label: load-dataset
#| message: false

# Function to call the Open-Meteo archive API for daily weather data
def callDailyWeatherAPI():
    # Set up the Open-Meteo API client with cache and retry on error
    cache_session = requests_cache.CachedSession('.cache', expire_after=-1)
    retry_session = retry(cache_session, retries=5, backoff_factor=0.2)
    openmeteo = openmeteo_requests.Client(session=retry_session)

    # The order of variables in "daily" is important to assign them correctly below
    url = "https://archive-api.open-meteo.com/v1/archive"
    params = {
        "latitude": 44.98,
        "longitude": -93.2638,
        "start_date": "2000-01-01",
        "end_date": "2024-12-31",
        "daily": ["weather_code", "temperature_2m_mean", "temperature_2m_max", "temperature_2m_min",
                  "precipitation_sum", "rain_sum", "snowfall_sum", "wind_speed_10m_max",
                  "wind_gusts_10m_max", "wind_direction_10m_dominant", "dew_point_2m_mean",
                  "dew_point_2m_max", "dew_point_2m_min", "relative_humidity_2m_mean",
                  "relative_humidity_2m_max", "relative_humidity_2m_min", "surface_pressure_mean",
                  "surface_pressure_max", "surface_pressure_min"],
        "timezone": "America/Chicago",
        "temperature_unit": "fahrenheit",
        "wind_speed_unit": "mph",
        "precipitation_unit": "inch",
    }
    responses = openmeteo.weather_api(url, params=params)
    response = responses[0]
    print(f"Coordinates: {response.Latitude()}°N {response.Longitude()}°E")
    print(f"Elevation: {response.Elevation()} m asl")
    print(f"Timezone: {response.Timezone()} {response.TimezoneAbbreviation()}")
    print(f"Timezone difference to GMT+0: {response.UtcOffsetSeconds()}s")

    # Process daily data; variables come back in the same order they were requested
    daily = response.Daily()
    daily_data = {"date": pd.date_range(
        start=pd.to_datetime(daily.Time(), unit="s", utc=True),
        end=pd.to_datetime(daily.TimeEnd(), unit="s", utc=True),
        freq=pd.Timedelta(seconds=daily.Interval()),
        inclusive="left"
    )}
    for i, name in enumerate(params["daily"]):
        daily_data[name] = daily.Variables(i).ValuesAsNumpy()

    WeatherRaw = pd.DataFrame(data=daily_data)
    WeatherRaw.to_csv('data/mn_weather_data.csv')
    # print("\nDaily data\n", WeatherRaw)
    return WeatherRaw

if callAPIbool:
    WeatherRaw = callDailyWeatherAPI()
else:
    WeatherRaw = pd.read_csv("data/mn_weather_data.csv")
# WeatherRaw.head()
# WeatherRaw.info()
# WeatherRaw.isna().sum()

HistoricRaw = pd.read_csv("data/storm_data_search_results.csv")
# HistoricRaw.head()
# HistoricRaw.info()
# HistoricRaw.isna().sum()
```

```{python}
#| label: clean-data
#| message: false

# Keep only the tornado columns needed, restricted to the Twin Cities area counties
CleanRaw = HistoricRaw[['CZ_NAME_STR', 'BEGIN_DATE', 'TOR_F_SCALE', 'TOR_LENGTH', 'TOR_WIDTH']].copy()
CleanRaw['CZ_NAME_STR'] = CleanRaw['CZ_NAME_STR'].astype(str)
CleanRaw['BEGIN_DATE'] = pd.to_datetime(CleanRaw['BEGIN_DATE'])
FilteredRaw = CleanRaw.loc[CleanRaw['BEGIN_DATE'] > '2000-01-01'].copy()
# removesuffix only drops the trailing " CO.", unlike str.strip which also eats leading characters
FilteredRaw['CZ_NAME_STR'] = FilteredRaw['CZ_NAME_STR'].apply(lambda x: x.removesuffix(' CO.'))
FilteredRaw = FilteredRaw.rename(columns={'BEGIN_DATE': 'date'})
counties_list = ['ANOKA', 'CARVER', 'CHISAGO', 'DAKOTA', 'HENNEPIN', 'ISANTI', 'LE SUEUR', 'MCLEOD',
                 'MILLE LACS', 'RAMSEY', 'RICE', 'SCOTT', 'SHERBURNE', 'SIBLEY', 'WASHINGTON', 'WRIGHT']
TornFin = FilteredRaw[FilteredRaw['CZ_NAME_STR'].isin(counties_list)]
TornFin = TornFin.drop_duplicates(subset=['date', 'CZ_NAME_STR'], keep='first')
# TornFin.info()
# TornFin.head()

# Normalize the weather timestamps to plain dates so the two tables join cleanly
Clean2Raw = WeatherRaw.copy()
Clean2Raw['date'] = pd.to_datetime(pd.to_datetime(Clean2Raw['date']).dt.date)
Filtered2Raw = Clean2Raw.loc[Clean2Raw['date'] > '2000-01-01']
DailyFin = Filtered2Raw.drop_duplicates(subset=['date'], keep='first')
# Filtered2Raw.info()
# Filtered2Raw.head()

MergeRaw = pd.merge(TornFin, DailyFin, on='date', how='outer')
MergeRaw = MergeRaw.drop('Unnamed: 0', axis=1)
# MergeRaw.info()
# MergeRaw.head()

# New features: tornado indicator, month of year, and filled tornado columns
NewFeatureRaw = MergeRaw.copy()
NewFeatureRaw['tornado_occurred'] = NewFeatureRaw['TOR_F_SCALE'].apply(lambda x: 1 if pd.notna(x) else 0)
NewFeatureRaw['time_of_year'] = NewFeatureRaw['date'].dt.month
NewFeatureRaw['CZ_NAME_STR'] = NewFeatureRaw['CZ_NAME_STR'].fillna('None')
NewFeatureRaw['TOR_F_SCALE'] = NewFeatureRaw['TOR_F_SCALE'].fillna('None')
NewFeatureRaw['TOR_LENGTH'] = NewFeatureRaw['TOR_LENGTH'].fillna(0)
NewFeatureRaw['TOR_WIDTH'] = NewFeatureRaw['TOR_WIDTH'].fillna(0)
# NewFeatureRaw.info()
# NewFeatureRaw.isnull().sum()
NewFeatureRaw.to_csv('data/tornado_days.csv')
NewFeatureRaw.head()
```

```{python}
#| label: eda
#| message: false

def UnivariateAnalysis(df, column=None):
    """
    Perform univariate analysis on the dataset. Displays descriptive statistics and creates
    the appropriate plot for both numerical and categorical variables.
    """
    numeric_columns = column if column is not None else df.select_dtypes(include=[np.number]).columns.tolist()
    for data in numeric_columns:
        print(df[data].describe())  # descriptive statistics for each numeric column
        sns.histplot(data=df, x=data, bins=30).set_title(f'Distribution of {data}')
        plt.show()                  # histogram for each numeric column

    categorical_columns = column if column is not None else df.select_dtypes(exclude=[np.number]).columns.tolist()
    for data in categorical_columns:
        print(df[data].describe())  # descriptive statistics for each categorical column
        sns.countplot(data=df, x=data).set_title(f'Count of {data}')
        plt.xticks(rotation=45)     # rotate x-axis labels for better readability
        plt.tight_layout()          # adjust layout to prevent label overlap
        plt.show()                  # count plot for each categorical column

# Down-selected to a few interesting columns
sns.pairplot(data=NewFeatureRaw,
             x_vars=['temperature_2m_mean', 'precipitation_sum', 'dew_point_2m_mean',
                     'wind_direction_10m_dominant', 'surface_pressure_mean', 'wind_gusts_10m_max',
                     'wind_speed_10m_max', 'date', 'relative_humidity_2m_mean', 'time_of_year'],
             y_vars=['tornado_occurred'])
plt.show()
sns.pairplot(data=NewFeatureRaw,
             x_vars=['temperature_2m_mean', 'precipitation_sum', 'dew_point_2m_mean',
                     'wind_direction_10m_dominant', 'surface_pressure_mean', 'wind_gusts_10m_max',
                     'wind_speed_10m_max', 'relative_humidity_2m_mean', 'tornado_occurred', 'time_of_year'],
             y_vars=['date'])
plt.show()
sns.pairplot(data=NewFeatureRaw,
             x_vars=['temperature_2m_mean', 'precipitation_sum', 'dew_point_2m_mean',
                     'wind_direction_10m_dominant', 'surface_pressure_mean', 'wind_gusts_10m_max',
                     'wind_speed_10m_max', 'relative_humidity_2m_mean', 'tornado_occurred', 'date'],
             y_vars=['time_of_year'])
plt.show()

# UnivariateAnalysis(NewFeatureRaw)
```