Proposal for Prediction of Tornado Occurrence in Minnesota
Proposal
An analysis of historical weather/tornado patterns in the Twin Cities area of Minnesota (MN) from 2020 to the present (2025) to predict tornado occurrence
Author: The Crengineers - Tyler Hart
Affiliation: College of Information Science, College of Systems and Industrial Engineering, University of Arizona
Dataset
First and foremost, I chose this project because some of my friends were recently affected by tornadoes in the state of MN. Since a large amount of weather data is publicly available, I expected the data to be easy to source, and baseline models already exist to compare against. I think this will let me focus on the ML part of the course rather than on data preprocessing, which is a new area for me. Furthermore, there are some good classification questions about how to classify weather from basic daily data, and some good regression questions about whether tornado occurrence can be predicted from that same daily weather data. As of now I have two datasets. The first is the historical archive of tornadoes in MN going back as far as the 1950s; it contains the day of each tornado as well as the relative size/damage along its path. It has 38 columns, which can be seen in the codebook subsections.
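As a quick sanity check, the archive can be loaded and inspected like this (a minimal sketch; the file path matches the one used in the Source Code section):

```python
import pandas as pd

# Load the NWS storm-events export for MN (path as used in the proposal's source code)
HistoricRaw = pd.read_csv("data/storm_data_search_results.csv")

print(HistoricRaw.shape)    # expect roughly (number of events, 38)
print(HistoricRaw.columns)  # the 38 columns described in the codebook below
HistoricRaw.info()          # dtypes and non-null counts reveal the sparsely filled fields
```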
Looking at the data, many fields are not filled in or are defaulted to zeros. This is due to the many sources feeding the archive. In my case I think this is acceptable, since I am really using it as a lookup for the conditions on a given day. An important note is that this archive contains all of the tornado data for MN. Since I only have weather data for the Twin Cities, I will later filter it down to the surrounding counties.
An image of the counties I am going to pick can be seen below:
[Map of the selected counties surrounding the Twin Cities]
A full data dictionary for the storm data export is located at data_dictionaries/Storm-Data-Export.pdf. These will eventually be transcribed into the README.md files for each of the datasets.
The other data I will be using is queried from a weather API; the code used to do this can be seen in the Source Code section of the proposal. This set contains daily weather data for the last 5 years (2020-current). It comes out to 1,827 rows × 20 columns of daily weather observations from MSP International Airport, the major weather hub for the Twin Cities. The data is almost entirely numerical and appears to be filled in accurately, although I do see some zero values that may need to be addressed later (the first rows and column info can be checked with WeatherRaw.head() and WeatherRaw.info()). My plan is to use this data to get the conditions on the days when tornadoes occurred; this is where most of the model data will be derived from. A data dictionary can be seen in data/README.md or in the codebook below; a minimal sketch of the API query comes first.
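This is roughly how the Open-Meteo archive is queried (the full version, with all 19 daily variables, is in the Source Code section; the shortened variable list here is only for illustration):

```python
import openmeteo_requests
import requests_cache
from retry_requests import retry
import pandas as pd

# Cached, retrying client, as in the full proposal code
cache_session = requests_cache.CachedSession(".cache", expire_after=-1)
openmeteo = openmeteo_requests.Client(session=retry(cache_session, retries=5, backoff_factor=0.2))

params = {
    "latitude": 44.98, "longitude": -93.2638,            # Minneapolis-St. Paul
    "start_date": "2020-01-01", "end_date": "2024-12-31",
    "daily": ["weather_code", "temperature_2m_mean", "precipitation_sum"],  # shortened list
    "timezone": "America/Chicago", "temperature_unit": "fahrenheit",
    "wind_speed_unit": "mph", "precipitation_unit": "inch",
}
response = openmeteo.weather_api("https://archive-api.open-meteo.com/v1/archive", params=params)[0]
daily = response.Daily()

# Build a date index, then pull each variable back in the order it was requested
daily_data = {"date": pd.date_range(
    start=pd.to_datetime(daily.Time(), unit="s", utc=True),
    end=pd.to_datetime(daily.TimeEnd(), unit="s", utc=True),
    freq=pd.Timedelta(seconds=daily.Interval()), inclusive="left")}
for i, name in enumerate(params["daily"]):
    daily_data[name] = daily.Variables(i).ValuesAsNumpy()

WeatherRaw = pd.DataFrame(daily_data)
```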
Codebook for [mn_weather_data] Dataset
| Variable Names | Data Types | Description | Unit |
|----------------|------------|-------------|------|
| date | TimeDate | Date and time of weather collection | (YYYY-MM-DD hh-mm-ss) |
| weather_code | Float | The most severe weather condition on a given day | (WMO Code) |
| temperature_2m_mean | Float | Mean daily air temperature at 2 meters above ground | (°F) |
| temperature_2m_max | Float | Maximum daily air temperature at 2 meters above ground | (°F) |
| temperature_2m_min | Float | Minimum daily air temperature at 2 meters above ground | (°F) |
| precipitation_sum | Float | Sum of daily precipitation (including rain, showers and snowfall) | (inch) |
| rain_sum | Float | Sum of daily rain | (inch) |
| snowfall_sum | Float | Sum of daily snowfall | (inch) |
| wind_speed_10m_max | Float | Maximum wind speed on a day | (mph) |
| wind_gusts_10m_max | Float | Maximum wind gusts on a day | (mph) |
| wind_direction_10m_dominant | Float | Dominant wind direction | (°) |
| dew_point_2m_mean | Float | Mean daily dew point at 2 meters above ground | (°F) |
| dew_point_2m_max | Float | Maximum daily dew point at 2 meters above ground | (°F) |
| dew_point_2m_min | Float | Minimum daily dew point at 2 meters above ground | (°F) |
| relative_humidity_2m_mean | Float | Mean daily relative humidity at 2 meters above ground | (%) |
| relative_humidity_2m_max | Float | Maximum daily relative humidity at 2 meters above ground | (%) |
| relative_humidity_2m_min | Float | Minimum daily relative humidity at 2 meters above ground | (%) |
| surface_pressure_mean | Float | Mean daily pressure at surface | (hPa) |
| surface_pressure_max | Float | Maximum daily pressure at surface | (hPa) |
| surface_pressure_min | Float | Minimum daily pressure at surface | (hPa) |
Codebook for [storm_data_search_results] Dataset
| Variable Names | Data Types | Description | Unit |
|----------------|------------|-------------|------|
| EVENT_ID | Int | ID assigned by NWS to note a single, small part that goes into a specific storm episode | Database ID |
| CZ_NAME_STR | String | County/Parish, Zone or Marine Name assigned to the county FIPS number or NWS Forecast Zone | None |
| BEGIN_LOCATION | String | Location the event began | None |
| BEGIN_DATE | Date | Date the event was reported | MM-DD-YYYY |
| BEGIN_TIME | Time | Time the event was reported | hh:mm:ss |
| EVENT_TYPE | String | Type of storm event (e.g., Tornado, Hail) | None |
| TOR_F_SCALE | String | Enhanced Fujita Scale describing the strength of the tornado based on the amount and type of damage it caused | Fujita Scale |
| DEATHS_DIRECT | Int | The number of deaths directly related to the weather event | Deaths |
| INJURIES_DIRECT | Int | The number of injuries directly related to the weather event | Injuries |
| DAMAGE_PROPERTY_NUM | Float | Estimated monetary damage to property in the affected areas | $ |
| DAMAGE_CROPS_NUM | Float | Estimated monetary damage to crops in the affected areas | $ |
| STATE_ABBR | String | State code of the affected area | State Code |
| CZ_TIMEZONE | String | Time zone for the County/Parish, Zone or Marine Name | Time Zone |
| EPISODE_ID | Int | ID assigned by NWS to denote the storm episode | None |
| CZ_TYPE | String | Type of jurisdiction (County/Parish, Zone or Marine Name) | None |
| CZ_FIPS | Int | FIPS ID number given to the County/Parish, Zone or Marine Name | FIPS Code |
| WFO | String | National Weather Service Forecast Office's area of responsibility in which the event occurred | WFO Code |
| INJURIES_INDIRECT | Int | The number of injuries indirectly related to the weather event | Injuries |
| DEATHS_INDIRECT | Int | The number of deaths indirectly related to the weather event | Deaths |
| SOURCE | String | Source of the report (e.g., Weather Radar, Storm Chaser, Sighting) | None |
| FLOOD_CAUSE | String | Reported or estimated cause of the flood | None |
| TOR_LENGTH | Float | Length of the tornado or tornado segment while on the ground | Miles |
| TOR_WIDTH | Int | Width of the tornado or tornado segment while on the ground | Yards |
| END_LOCATION | String | Location the event ended | None |
| END_DATE | Date | Date the event ended | MM-DD-YYYY |
| END_TIME | Time | Time the event ended | hh:mm:ss |
| BEGIN_LAT | Float | The latitude where the event began | (°) |
| BEGIN_LON | Float | The longitude where the event began | (°) |
| END_LAT | Float | The latitude where the event ended | (°) |
| END_LON | Float | The longitude where the event ended | (°) |
| EPISODE_NARRATIVE | Sentence | The episode narrative depicting the general nature and overall activity of the episode; created by NWS | None |
| EVENT_NARRATIVE | Sentence | The event narrative providing more specific details of the individual event; provided by NWS | None |
Data Clean Up
The tornado data has holes in many of its categories. For this analysis I really only need the date, county, and tornado statistics (Fujita scale, width, length, etc.) so that they can be joined to the daily weather data. The daily weather data will be left mostly untouched, since it has no null values and contains all of the relevant columns. I will therefore append the relevant tornado data to the weather data by date, as sketched below.
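A condensed sketch of that merge and the new indicator features (the full clean-data code is in the Source Code section; the column choices here mirror it):

```python
import pandas as pd

WeatherRaw = pd.read_csv("data/mn_weather_data.csv")
HistoricRaw = pd.read_csv("data/storm_data_search_results.csv")

# Tornado events: keep only the columns needed, keyed by date
torn = HistoricRaw[["CZ_NAME_STR", "BEGIN_DATE", "TOR_F_SCALE", "TOR_LENGTH", "TOR_WIDTH"]].copy()
torn["date"] = pd.to_datetime(torn["BEGIN_DATE"])
torn = torn.drop(columns=["BEGIN_DATE"])

# Daily weather: normalize the timestamp to a plain date so the two tables join cleanly
weather = WeatherRaw.copy()
weather["date"] = pd.to_datetime(pd.to_datetime(weather["date"], utc=True).dt.date)

# Outer merge keeps every weather day; tornado columns are filled only on event days
merged = pd.merge(torn, weather, on="date", how="outer")
merged["tornado_occurred"] = merged["TOR_F_SCALE"].notna().astype(int)  # 1 on tornado days
merged["time_of_year"] = merged["date"].dt.month                        # month as a seasonal feature
```

The first few rows of the resulting merged table: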
|   | CZ_NAME_STR | date | TOR_F_SCALE | TOR_LENGTH | TOR_WIDTH | weather_code | temperature_2m_mean | temperature_2m_max | temperature_2m_min | precipitation_sum | ... | dew_point_2m_max | dew_point_2m_min | relative_humidity_2m_mean | relative_humidity_2m_max | relative_humidity_2m_min | surface_pressure_mean | surface_pressure_max | surface_pressure_min | tornado_occurred | time_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 2000-01-02 | None | 0.0 | 0.0 | 73.0 | 29.362250 | 33.4085 | 26.838501 | 0.039370 | ... | 26.6585 | 20.088500 | 80.965065 | 90.481580 | 67.150380 | 977.19560 | 984.67926 | 972.81570 | 0 | 1 |
| 1 | None | 2000-01-03 | None | 0.0 | 0.0 | 3.0 | 24.491000 | 26.3885 | 22.788500 | 0.000000 | ... | 19.5485 | 15.768499 | 74.725880 | 82.030160 | 66.527940 | 986.42120 | 988.42944 | 984.91630 | 0 | 1 |
| 2 | None | 2000-01-04 | None | 0.0 | 0.0 | 71.0 | 16.728498 | 24.2285 | 10.638500 | 0.000000 | ... | 18.3785 | -1.871502 | 60.066280 | 78.316124 | 44.035545 | 987.85956 | 990.71370 | 984.34186 | 0 | 1 |
| 3 | None | 2000-01-05 | None | 0.0 | 0.0 | 73.0 | 17.831000 | 28.0085 | 10.368500 | 0.090551 | ... | 25.5785 | -1.781502 | 69.617760 | 91.112190 | 55.900590 | 983.71606 | 990.35310 | 978.67990 | 0 | 1 |
| 4 | None | 2000-01-06 | None | 0.0 | 0.0 | 3.0 | 22.987251 | 29.3585 | 13.248501 | 0.000000 | ... | 25.7585 | 3.348501 | 69.387764 | 89.471180 | 56.031500 | 986.35736 | 992.35600 | 979.56040 | 0 | 1 |

5 rows × 26 columns
Exploratory Data Analysis
Questions
Can I develop a model that successfully classifies which days tornado formation will occur using daily weather data for the Twin Cities area?
Can I create a GUI/dashboard that allows users to quickly tune and view predictions?
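To make the first question concrete, here is a minimal baseline sketch (it assumes the data/tornado_days.csv file produced in the clean-up step; the feature list, split, and class_weight choice are illustrative, not final):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Assumed output of the data clean-up step above
df = pd.read_csv("data/tornado_days.csv")

features = ["temperature_2m_mean", "dew_point_2m_mean", "precipitation_sum",
            "wind_gusts_10m_max", "surface_pressure_mean", "relative_humidity_2m_mean",
            "time_of_year"]
X, y = df[features], df["tornado_occurred"]

# Tornado days are rare, so stratify the split and weight the classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```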
Analysis plan
| Week | Dates | Tasks | Status | Date Finished |
|------|-------|-------|--------|---------------|
| 1 | 7/28->8/1 | Search and Find Data | Complete | 8/1 |
| 2 | 8/4->8/8 | Create Proposal and Load and Make Data Dict and Clean Data | Complete | 8/11 |
| 2.5 | 8/7->8/10 | Perform EDA and Create Engineering Features and Start Write Up | Complete | 8/15 |
| 3 | 8/11->8/13 | Create Models and Examine Measures of Accuracy and Add to Write Up | Complete | 8/15 |
| 3.5 | 8/14->8/17 | Tune and Validate Classification and Regression and Add to Write Up | Complete | 8/16 |
| 4 | 8/18->8/20 | Finish Write Up and Finish Presentation | In Progress | 8/17 |
Course of Action
Search and Find Data:
Description:
The goal of this task is to find relevant data and to define the relevant questions.
In this task I want to find data on tornadoes in the state of MN and data on daily weather patterns in the Twin Cities.
Once I find this data, I want to add it to the repo, begin planning the proposal, and start the data dictionary.
Acceptance Criteria:
Data sets sourced and README Data Dictionary drafted out.
Data added to repo with relevant supporting data.
(Something New) Use API to source daily weather data.
Create Proposal and Load and Make Data Dict and Clean Data:
Description:
The goal of this task is to create the drafted proposal and make the corresponding data dictionary for the README and proposal respectively.
I want to load the data and clean it down to only the relevant parameters that I need.
The columns I want specifically are all of the daily weather data and the tornadoes from 2020-2025 in the counties surrounding the Twin Cities.
Define new features (I want a feature for whether a tornado occurred, and a feature for major storms as well).
Acceptance Criteria:
Data Loaded and Merged
Data columns scrubbed for relevant data and missing data resolved
Data Dictionary finished
Data Features Defined
Perform EDA and Create Engineering Features and Start Write Up:
Description:
The goal of this task is to create the features listed previously.
Once the features are created, I want to do a primary EDA using seaborn/matplotlib plots (a sketch follows the acceptance criteria below).
Document my findings from this EDA so that I can put those findings in my write up.
Acceptance Criteria:
Features created
EDA performed and documented (~200 words of observations)
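A short sketch of the kind of seaborn EDA intended here (column choices are illustrative; the monthly count of tornado days is an added example, not taken from the proposal code):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

NewFeatureRaw = pd.read_csv("data/tornado_days.csv")

# Compare a few candidate predictors against the tornado indicator
sns.pairplot(data=NewFeatureRaw,
             x_vars=["temperature_2m_mean", "dew_point_2m_mean", "precipitation_sum",
                     "wind_gusts_10m_max", "surface_pressure_mean"],
             y_vars=["tornado_occurred"])
plt.show()

# How are tornado days distributed across the year?
sns.countplot(data=NewFeatureRaw[NewFeatureRaw["tornado_occurred"] == 1], x="time_of_year")
plt.title("Tornado days by month")
plt.tight_layout()
plt.show()
```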
Create Models and Examine Measures of Accuracy and Add to Write Up:
Description:
The goal of this task is to create a simple classification model, probably logistic regression or Naive Bayes.
From there I want to examine how accurate the model is by looking at the accuracy and precision scores.
I also want to look at other scores, such as the F1 score or ROC-AUC (see the sketch after this list).
I also want to take a preliminary look at more complex classification models such as Random Forest and Support Vector Machines (SVM).
Acceptance Criteria:
Simple model(s) created, fit examined, and documented for the write up.
Implement scores like F1 and ROC-AUC
Start Implementation of Random Forest and SVM
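A hedged sketch of this evaluation step (Naive Bayes as one of the simple baselines; the feature list and split are assumptions carried over from the earlier sketch):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, f1_score, roc_auc_score

df = pd.read_csv("data/tornado_days.csv")   # output of the data clean-up step
features = ["temperature_2m_mean", "dew_point_2m_mean", "precipitation_sum",
            "wind_gusts_10m_max", "surface_pressure_mean", "time_of_year"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["tornado_occurred"], test_size=0.25,
    stratify=df["tornado_occurred"], random_state=42)

# Fit a simple baseline and report the scores named in this task
nb = GaussianNB().fit(X_train, y_train)
y_pred = nb.predict(X_test)
y_prob = nb.predict_proba(X_test)[:, 1]

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("F1       :", f1_score(y_test, y_pred, zero_division=0))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
```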
Tune and Validate Classification and Regression and Add to Write Up:
Description:
The goal of this task is to finish off the Random Forest and SVM from the previous task.
From there I want to examine measures of fit and see how well each model fits, using scoring similar to the previous tasks (a tuning sketch follows this list).
I want to add all of the measures of fit to the write up.
Start the app using Tkinter.
The app will use the best-fitting model's coefficients to estimate whether a tornado is likely given inputs like precipitation, CAPE, surface pressure, temperature, etc.
Acceptance Criteria:
Random Forest and SVM models finished and examined.
Documentation for all the models is finished for the write up.
Tkinter app is set up and ready to begin use.
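A minimal sketch of the tuning described above (the parameter grid and ROC-AUC scoring are assumptions, not final choices):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("data/tornado_days.csv")   # output of the data clean-up step
features = ["temperature_2m_mean", "dew_point_2m_mean", "precipitation_sum",
            "wind_gusts_10m_max", "surface_pressure_mean", "time_of_year"]

# Small illustrative grid; ROC-AUC copes with the heavy class imbalance better than accuracy
param_grid = {"n_estimators": [200, 500], "max_depth": [None, 10, 20],
              "class_weight": ["balanced", None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(df[features], df["tornado_occurred"])
print(search.best_params_, round(search.best_score_, 3))
```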
Finish Write Up and Finish Presentation:
Description:
The goal of this task is to finish up the Tkinter app (a minimal sketch of the idea follows the acceptance criteria below).
The second goal of this task is to finish the write up.
The other major goal is to start and finish the presentation based on the final write up.
Acceptance Criteria:
Write up, App, and presentation finished
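For the planned GUI, a minimal Tkinter sketch of the idea (the models/best_model.joblib path and the input fields are hypothetical; the real app would load whichever tuned model performs best):

```python
import tkinter as tk
import joblib

# Assumed: the best tuned classifier was saved earlier, e.g. joblib.dump(best_model, "models/best_model.joblib")
clf = joblib.load("models/best_model.joblib")   # hypothetical path

FEATURES = ["temperature_2m_mean", "dew_point_2m_mean", "precipitation_sum",
            "wind_gusts_10m_max", "surface_pressure_mean", "time_of_year"]

root = tk.Tk()
root.title("Tornado likelihood (sketch)")

# One entry box per feature the model expects
entries = {}
for row, name in enumerate(FEATURES):
    tk.Label(root, text=name).grid(row=row, column=0, sticky="w")
    entries[name] = tk.Entry(root)
    entries[name].grid(row=row, column=1)

result = tk.Label(root, text="Probability: ?")
result.grid(row=len(FEATURES) + 1, column=0, columnspan=2)

def predict():
    # Read the typed values in feature order and ask the fitted model for a probability
    values = [[float(entries[name].get()) for name in FEATURES]]
    prob = clf.predict_proba(values)[0, 1]
    result.config(text=f"Probability: {prob:.2%}")

tk.Button(root, text="Predict", command=predict).grid(row=len(FEATURES), column=0, columnspan=2)
root.mainloop()
```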
Repo Organization
.github/: specifically designated for GitHub-related files, including workflows, actions, and templates customized for managing issues.
_extra/: dedicated to storing miscellaneous files that do not categorically align with other project sections, serving as a versatile repository for various supplementary documents.
_freeze/: houses frozen environment files that contain detailed information about the project’s environment setup and dependencies.
data/: designated for storing essential data files crucial for the project’s operations, including input files, datasets, and other vital data resources.
data_dictionary/: designated as a place for supplemental documentation that I have used or found helpful.
images/: functioning as a central repository for visual assets utilized across the project, such as diagrams, charts, and screenshots, this directory houses the visual components essential for project documentation and presentation purposes.
.gitignore: designed to define exclusions from version control, ensuring that specific files and directories are not tracked by Git, thereby simplifying the versioning process.
README.md: functioning as the central repository of project information, this README document provides vital details covering project setup, usage instructions, and an overarching summary of project objectives and scope.
about.qmd: a Quarto file that supplements project documentation by providing additional context, describing the project purpose as well as the names and backgrounds of individual team members.
index.qmd: serves as the main page for our project, where our write up will eventually be. This Quarto file offers in-depth descriptions of our project, encompassing all code and visualizations, as well as eventually our results.
presentation.qmd: serves as a Quarto file that will present our slideshow of our final results of our project.
proposal.qmd: designed as the Quarto file responsible for our project proposal, housing our dataset, metadata, description, and questions, as well as our weekly plan of attack that will be updated weekly.
requirements.txt: specifies the dependencies and their respective versions required for the project to run successfully.
Source: a shared Kaggle competition project has a similar layout; for consistency, I checked that all of these files serve a similar purpose.
Source Code
---
title: "Proposal for Prediction of Tornado Occurrence in Minnesota"
subtitle: "Proposal"
author:
  - name: "The Crengineers - Tyler Hart"
    affiliations:
      - name: "College of Information Science, College of Systems and Industrial Engineering, University of Arizona"
description: "An analysis of historical weather/tornado patterns in the Twin Cities area of Minnesota (MN) from 2020 to the present (2025) to predict tornado occurrence"
format:
  html:
    code-tools: true
    code-overflow: wrap
    code-line-numbers: true
    embed-resources: true
editor: visual
code-annotations: hover
execute:
  warning: false
  echo: false
  freeze: auto # re-render only when source changes
jupyter: python3
---

```{python}
#| label: load-pkgs
#| message: false

# Packages needed to query the API and do basic dataset operations
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import openmeteo_requests
import requests_cache
from retry_requests import retry

# Boolean to regenerate the call to the API in case more data needs to be queried
callAPIbool = False
```

```{python}
#| label: load-dataset
#| message: false

# Function to call the Open-Meteo archive API for daily weather data
def callDailyWeatherAPI():
    # Set up the Open-Meteo API client with cache and retry on error
    cache_session = requests_cache.CachedSession('.cache', expire_after=-1)
    retry_session = retry(cache_session, retries=5, backoff_factor=0.2)
    openmeteo = openmeteo_requests.Client(session=retry_session)

    # The order of variables in "daily" is important to assign them correctly below
    url = "https://archive-api.open-meteo.com/v1/archive"
    params = {
        "latitude": 44.98,
        "longitude": -93.2638,
        "start_date": "2000-01-01",
        "end_date": "2024-12-31",
        "daily": ["weather_code", "temperature_2m_mean", "temperature_2m_max", "temperature_2m_min",
                  "precipitation_sum", "rain_sum", "snowfall_sum", "wind_speed_10m_max",
                  "wind_gusts_10m_max", "wind_direction_10m_dominant", "dew_point_2m_mean",
                  "dew_point_2m_max", "dew_point_2m_min", "relative_humidity_2m_mean",
                  "relative_humidity_2m_max", "relative_humidity_2m_min", "surface_pressure_mean",
                  "surface_pressure_max", "surface_pressure_min"],
        "timezone": "America/Chicago",
        "temperature_unit": "fahrenheit",
        "wind_speed_unit": "mph",
        "precipitation_unit": "inch",
    }
    responses = openmeteo.weather_api(url, params=params)
    response = responses[0]
    print(f"Coordinates: {response.Latitude()}°N {response.Longitude()}°E")
    print(f"Elevation: {response.Elevation()} m asl")
    print(f"Timezone: {response.Timezone()} {response.TimezoneAbbreviation()}")
    print(f"Timezone difference to GMT+0: {response.UtcOffsetSeconds()}s")

    # Process daily data; variables come back in the same order they were requested
    daily = response.Daily()
    daily_data = {"date": pd.date_range(
        start=pd.to_datetime(daily.Time(), unit="s", utc=True),
        end=pd.to_datetime(daily.TimeEnd(), unit="s", utc=True),
        freq=pd.Timedelta(seconds=daily.Interval()),
        inclusive="left"
    )}
    for i, name in enumerate(params["daily"]):
        daily_data[name] = daily.Variables(i).ValuesAsNumpy()

    WeatherRaw = pd.DataFrame(data=daily_data)
    WeatherRaw.to_csv('data/mn_weather_data.csv')
    # print("\nDaily data\n", WeatherRaw)
    return WeatherRaw

if callAPIbool:
    WeatherRaw = callDailyWeatherAPI()
else:
    WeatherRaw = pd.read_csv("data/mn_weather_data.csv")
# WeatherRaw.head()
# WeatherRaw.info()
# WeatherRaw.isna().sum()

HistoricRaw = pd.read_csv("data/storm_data_search_results.csv")
# HistoricRaw.head()
# HistoricRaw.info()
# HistoricRaw.isna().sum()
```

```{python}
#| label: clean-data
#| message: false

# Keep only the tornado columns needed, restricted to the Twin Cities area counties
CleanRaw = HistoricRaw[['CZ_NAME_STR', 'BEGIN_DATE', 'TOR_F_SCALE', 'TOR_LENGTH', 'TOR_WIDTH']].copy()
CleanRaw['CZ_NAME_STR'] = CleanRaw['CZ_NAME_STR'].astype(str)
CleanRaw['BEGIN_DATE'] = pd.to_datetime(CleanRaw['BEGIN_DATE'])
FilteredRaw = CleanRaw.loc[CleanRaw['BEGIN_DATE'] > '2000-01-01'].copy()
# removesuffix only drops the trailing " CO.", unlike str.strip which also eats leading characters
FilteredRaw['CZ_NAME_STR'] = FilteredRaw['CZ_NAME_STR'].apply(lambda x: x.removesuffix(' CO.'))
FilteredRaw = FilteredRaw.rename(columns={'BEGIN_DATE': 'date'})
counties_list = ['ANOKA', 'CARVER', 'CHISAGO', 'DAKOTA', 'HENNEPIN', 'ISANTI', 'LE SUEUR', 'MCLEOD',
                 'MILLE LACS', 'RAMSEY', 'RICE', 'SCOTT', 'SHERBURNE', 'SIBLEY', 'WASHINGTON', 'WRIGHT']
TornFin = FilteredRaw[FilteredRaw['CZ_NAME_STR'].isin(counties_list)]
TornFin = TornFin.drop_duplicates(subset=['date', 'CZ_NAME_STR'], keep='first')
# TornFin.info()
# TornFin.head()

# Normalize the weather timestamps to plain dates so the two tables join cleanly
Clean2Raw = WeatherRaw.copy()
Clean2Raw['date'] = pd.to_datetime(pd.to_datetime(Clean2Raw['date']).dt.date)
Filtered2Raw = Clean2Raw.loc[Clean2Raw['date'] > '2000-01-01']
DailyFin = Filtered2Raw.drop_duplicates(subset=['date'], keep='first')
# Filtered2Raw.info()
# Filtered2Raw.head()

MergeRaw = pd.merge(TornFin, DailyFin, on='date', how='outer')
MergeRaw = MergeRaw.drop('Unnamed: 0', axis=1)
# MergeRaw.info()
# MergeRaw.head()

# New features: tornado indicator, month of year, and filled tornado columns
NewFeatureRaw = MergeRaw.copy()
NewFeatureRaw['tornado_occurred'] = NewFeatureRaw['TOR_F_SCALE'].apply(lambda x: 1 if pd.notna(x) else 0)
NewFeatureRaw['time_of_year'] = NewFeatureRaw['date'].dt.month
NewFeatureRaw['CZ_NAME_STR'] = NewFeatureRaw['CZ_NAME_STR'].fillna('None')
NewFeatureRaw['TOR_F_SCALE'] = NewFeatureRaw['TOR_F_SCALE'].fillna('None')
NewFeatureRaw['TOR_LENGTH'] = NewFeatureRaw['TOR_LENGTH'].fillna(0)
NewFeatureRaw['TOR_WIDTH'] = NewFeatureRaw['TOR_WIDTH'].fillna(0)
# NewFeatureRaw.info()
# NewFeatureRaw.isnull().sum()
NewFeatureRaw.to_csv('data/tornado_days.csv')
NewFeatureRaw.head()
```

```{python}
#| label: eda
#| message: false

def UnivariateAnalysis(df, column=None):
    """
    Perform univariate analysis on the dataset. Displays descriptive statistics and creates
    the appropriate plot for both numerical and categorical variables.
    """
    numeric_columns = column if column is not None else df.select_dtypes(include=[np.number]).columns.tolist()
    for data in numeric_columns:
        print(df[data].describe())  # descriptive statistics for each numeric column
        sns.histplot(data=df, x=data, bins=30).set_title(f'Distribution of {data}')
        plt.show()                  # histogram for each numeric column

    categorical_columns = column if column is not None else df.select_dtypes(exclude=[np.number]).columns.tolist()
    for data in categorical_columns:
        print(df[data].describe())  # descriptive statistics for each categorical column
        sns.countplot(data=df, x=data).set_title(f'Count of {data}')
        plt.xticks(rotation=45)     # rotate x-axis labels for better readability
        plt.tight_layout()          # adjust layout to prevent label overlap
        plt.show()                  # count plot for each categorical column

# Down-selected to a few interesting columns
sns.pairplot(data=NewFeatureRaw,
             x_vars=['temperature_2m_mean', 'precipitation_sum', 'dew_point_2m_mean',
                     'wind_direction_10m_dominant', 'surface_pressure_mean', 'wind_gusts_10m_max',
                     'wind_speed_10m_max', 'date', 'relative_humidity_2m_mean', 'time_of_year'],
             y_vars=['tornado_occurred'])
plt.show()
sns.pairplot(data=NewFeatureRaw,
             x_vars=['temperature_2m_mean', 'precipitation_sum', 'dew_point_2m_mean',
                     'wind_direction_10m_dominant', 'surface_pressure_mean', 'wind_gusts_10m_max',
                     'wind_speed_10m_max', 'relative_humidity_2m_mean', 'tornado_occurred', 'time_of_year'],
             y_vars=['date'])
plt.show()
sns.pairplot(data=NewFeatureRaw,
             x_vars=['temperature_2m_mean', 'precipitation_sum', 'dew_point_2m_mean',
                     'wind_direction_10m_dominant', 'surface_pressure_mean', 'wind_gusts_10m_max',
                     'wind_speed_10m_max', 'relative_humidity_2m_mean', 'tornado_occurred', 'date'],
             y_vars=['time_of_year'])
plt.show()

# UnivariateAnalysis(NewFeatureRaw)
```