Proposal: Predicting Tornado Occurrence in Minnesota

Proposal

An analysis of historical weather/tornado patterns in the Twin Cities, Minnesota (MN) area from 2020 to the present (2025) to predict tornado occurrence
Author
Affiliation

The Crengineers - Tyler Hart

College of Information Science, College of Systems and Industrial Engineering, University of Arizona

Dataset

First and foremost, I chose this project because some of my friends were recently affected by tornadoes in the state of MN. Since a large amount of weather data is publicly available, I expected the data to be easy to source, with existing models available as a baseline. I think this will let me focus on the ML part of the course rather than on data preprocessing, which is a new area for me. Furthermore, I think there are good classification questions about how to classify weather from basic data, and good regression questions about whether tornado occurrence can be predicted from basic daily weather data. As of now I have two datasets. The first is the historical archive of tornadoes in MN, reaching back as far as the 1950s; it contains the day of each tornado as well as the relative size/damage along its path. It has 38 columns, which can be seen in the codebook subsections.

Looking at the data, many of the fields are not filled in or are defaulted to zeros. This is due to the many sources of data. In my case I think this is acceptable, as I am really using this dataset as a query for the conditions on a given day. An important note is that it contains tornado data for all of MN. Since I only have weather data for the Twin Cities, I will later filter it down to the surrounding counties.

[Image of Minnesota]

An image of the counties I am going to pick can be seen below:

[Image of Minnesota Counties]

A full data dictionary for the storm data export is located at data_dictionaries/Storm-Data-Export.pdf. These will eventually be transcribed to the README.md files for each of the datasets.

The other data I will be using is queried from a weather API; the code used to do this can be seen in the main Python sections of the proposal. This set contains daily weather data for the last five years (2020 to the present), recorded at MSP International Airport, the major weather hub for the Twin Cities; it is 1827 rows by 20 columns of daily weather data. The first few rows can be seen with HistoricRaw.head(). The data is almost entirely numerical and appears to be filled in accurately, though I do see some zero values that may need to be addressed later (see HistoricRaw.info()). My plan is to use this dataset to get the conditions on days when tornadoes occurred; this is where most of the model data will be derived from. A data dictionary can be seen in data/README.md or below:
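The proposal does not name the weather provider, so as a hedged sketch, here is how the daily query could be assembled assuming the Open-Meteo historical archive API; the coordinates (MSP airport), date range, and parameter names are illustrative assumptions:

```python
# Sketch of the daily weather pull, ASSUMING the Open-Meteo historical
# archive API; endpoint and parameter names are assumptions, not taken
# from the proposal's actual code.
import urllib.parse

BASE_URL = "https://archive-api.open-meteo.com/v1/archive"

# Approximate coordinates of MSP International Airport
params = {
    "latitude": 44.88,
    "longitude": -93.22,
    "start_date": "2020-01-01",
    "end_date": "2025-01-01",
    "daily": ",".join([
        "weather_code", "temperature_2m_mean", "temperature_2m_max",
        "temperature_2m_min", "precipitation_sum", "rain_sum",
        "snowfall_sum", "wind_speed_10m_max", "wind_gusts_10m_max",
        "wind_direction_10m_dominant",
    ]),
    "temperature_unit": "fahrenheit",
    "wind_speed_unit": "mph",
    "timezone": "America/Chicago",
}

query_url = BASE_URL + "?" + urllib.parse.urlencode(params)
# The actual request (e.g. requests.get(query_url).json()) is omitted
# so the sketch stays runnable offline.
```

The JSON response would then be loaded into a DataFrame such as HistoricRaw.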

Codebook for [mn_weather_data] Dataset

Variable Names Data Types Description Unit
date TimeDate Date and time of weather collection (YYYY-MM-DD hh:mm:ss)
weather_code Float The most severe weather condition on a given day (WMO Code)
temperature_2m_mean Float Mean daily air temperature at 2 meters above ground (°F)
temperature_2m_max Float Maximum daily air temperature at 2 meters above ground (°F)
temperature_2m_min Float Minimum daily air temperature at 2 meters above ground (°F)
precipitation_sum Float Sum of daily precipitation (including rain, showers and snowfall) (mm)
rain_sum Float Sum of daily rain (mm)
snowfall_sum Float Sum of daily snowfall (cm)
wind_speed_10m_max Float Maximum wind speed on a day (mph)
wind_gusts_10m_max Float Maximum wind gusts on a day (mph)
wind_direction_10m_dominant Float Dominant wind direction (°)
dew_point_2m_mean Float Mean daily dew point at 2 meters above ground (°F)
dew_point_2m_max Float Maximum daily dew point at 2 meters above ground (°F)
dew_point_2m_min Float Minimum daily dew point at 2 meters above ground (°F)
relative_humidity_2m_mean Float Mean daily relative humidity at 2 meters above ground (%)
relative_humidity_2m_max Float Maximum daily relative humidity at 2 meters above ground (%)
relative_humidity_2m_min Float Minimum daily relative humidity at 2 meters above ground (%)
surface_pressure_mean Float Mean daily pressure at surface (hPa)
surface_pressure_max Float Maximum daily pressure at surface (hPa)
surface_pressure_min Float Minimum daily pressure at surface (hPa)

Codebook for [storm_data_search_results] Dataset

Variable Names Data Types Description Unit
EVENT_ID Int ID assigned by NWS to denote a single, small event that makes up part of a specific storm episode Database ID
CZ_NAME_STR String County/Parish, Zone or Marine Name assigned to the county FIPS number or NWS Forecast Zone None
BEGIN_LOCATION String Location the event began None
BEGIN_DATE Date Date the event was reported MM-DD-YYYY
BEGIN_TIME Time Time the event was reported hh:mm:ss
EVENT_TYPE String Type of Storm Event (ex. Tornadoes, Hail, etc.) None
TOR_F_SCALE String Enhanced Fujita Scale describes the strength of the tornado based on the amount and type of damage caused by the tornado Fujita Scale
DEATHS_DIRECT Int The number of deaths directly related to the weather event Deaths
INJURIES_DIRECT Int The number of injuries directly related to the weather event Injuries
DAMAGE_PROPERTY_NUM Float Estimated monetary damage of property in the affected areas $
DAMAGE_CROPS_NUM Float Estimated monetary damage of crops in the affected areas $
STATE_ABBR String State code of affected area State Code
CZ_TIMEZONE String Time Zone for the County/Parish, Zone or Marine Name Time Zone
EPISODE_ID Int ID assigned by NWS to denote the storm episode None
CZ_TYPE String Type of Jurisdiction (County/Parish, Zone or Marine Name) None
CZ_FIPS Int FIPS ID number given to County/Parish, Zone or Marine Name FIPS Code
WFO String National Weather Service Forecast Office’s area of responsibility in which the event occurred WFO Code
INJURIES_INDIRECT Int The number of injuries indirectly related to the weather event Injuries
DEATHS_INDIRECT Int The number of deaths indirectly related to the weather event Deaths
SOURCE String Source of where information came from (ex. Weather Radar, Storm Chaser, Sighting, etc.) None
FLOOD_CAUSE String Reported or estimated cause of the flood None
TOR_LENGTH Float Length of the tornado or tornado segment while on the ground Miles
TOR_WIDTH Int Width of the tornado or tornado segment while on the ground Yards
END_LOCATION String Location the event ended None
END_DATE Date Date the event ended MM-DD-YYYY
END_TIME Time Time the event ended hh:mm:ss
BEGIN_LAT Float The latitude where the event began (°)
BEGIN_LON Float The longitude where the event began (°)
END_LAT Float The latitude where the event ended (°)
END_LON Float The longitude where the event ended (°)
EPISODE_NARRATIVE Sentence The episode narrative depicting the general nature and overall activity of the episode. The narrative is created by NWS. None
EVENT_NARRATIVE Sentence The event narrative provides more specific details of the individual event. The event narrative is provided by NWS. None

Data Clean Up

The tornado data has holes in many of its categories. For this analysis I really only need the date, county, and tornado statistics (Fujita scale, width, length, etc.) in order to join it to the daily weather data. The daily weather data will be mostly untouched, as it has no null values and contains all of the relevant data. As such, I will append the relevant tornado data to the weather data by date.
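The date-based join described above can be sketched with toy frames in place of the real CSVs (column names follow the codebooks; the values are made up for illustration):

```python
# Minimal sketch of joining tornado records onto daily weather by date.
import pandas as pd

weather = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-11", "2022-05-12", "2022-05-13"]),
    "temperature_2m_mean": [61.2, 74.8, 55.0],
    "wind_gusts_10m_max": [18.0, 52.3, 21.5],
})

tornadoes = pd.DataFrame({
    "BEGIN_DATE": pd.to_datetime(["2022-05-12"]),
    "CZ_NAME_STR": ["HENNEPIN CO."],
    "TOR_F_SCALE": ["EF1"],
})

# Left-join so every weather day is kept; tornado columns stay NaN on
# quiet days, which then drives the binary target.
merged = weather.merge(
    tornadoes, left_on="date", right_on="BEGIN_DATE", how="left"
)
merged["tornado_occurred"] = merged["TOR_F_SCALE"].notna().astype(int)
```

A left join keeps the class imbalance of the real problem intact: most days have no tornado row.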

CZ_NAME_STR date TOR_F_SCALE TOR_LENGTH TOR_WIDTH weather_code temperature_2m_mean temperature_2m_max temperature_2m_min precipitation_sum ... dew_point_2m_max dew_point_2m_min relative_humidity_2m_mean relative_humidity_2m_max relative_humidity_2m_min surface_pressure_mean surface_pressure_max surface_pressure_min tornado_occurred time_of_year
0 None 2000-01-02 None 0.0 0.0 73.0 29.362250 33.4085 26.838501 0.039370 ... 26.6585 20.088500 80.965065 90.481580 67.150380 977.19560 984.67926 972.81570 0 1
1 None 2000-01-03 None 0.0 0.0 3.0 24.491000 26.3885 22.788500 0.000000 ... 19.5485 15.768499 74.725880 82.030160 66.527940 986.42120 988.42944 984.91630 0 1
2 None 2000-01-04 None 0.0 0.0 71.0 16.728498 24.2285 10.638500 0.000000 ... 18.3785 -1.871502 60.066280 78.316124 44.035545 987.85956 990.71370 984.34186 0 1
3 None 2000-01-05 None 0.0 0.0 73.0 17.831000 28.0085 10.368500 0.090551 ... 25.5785 -1.781502 69.617760 91.112190 55.900590 983.71606 990.35310 978.67990 0 1
4 None 2000-01-06 None 0.0 0.0 3.0 22.987251 29.3585 13.248501 0.000000 ... 25.7585 3.348501 69.387764 89.471180 56.031500 986.35736 992.35600 979.56040 0 1

5 rows × 26 columns

Exploratory Data Analysis

Questions

  1. Can I develop a model that successfully classifies which days tornado formation will occur using daily weather data for the Twin Cities area?

  2. Can I create a GUI/dashboard that quickly allows users to tune/view predictions?

Analysis plan

Week Dates Tasks Status Date Finished
1 7/28->8/1 Search and Find Data Complete 8/1
2 8/4->8/8 Create Proposal and Load and Make Data Dict and Clean Data Complete 8/11
2.5 8/7->8/10 Perform EDA and Create Engineering Features and Start Write Up Complete 8/15
3 8/11->8/13 Create Models and Examine Measures of Accuracy and Add to Write Up Complete 8/15
3.5 8/14->8/17 Tune and Validate Classification and Regression and Add to Write Up Complete 8/16
4 8/18->8/20 Finish Write Up and Finish Presentation In Progress 8/17

Course of Action

Search and Find Data:

Description:

  • The goal of this task is to find relevant data and to define the relevant questions.

  • In this task I want to find data on tornadoes in the state of MN and data on daily weather patterns in the Twin Cities.

  • Once I find this data, I want to add it to the repo, begin planning the proposal, and start the data dictionary.

Acceptance Criteria:

  • Data sets sourced and README Data Dictionary drafted out.

  • Data added to repo with relevant supporting data.

  • (Something New) Use API to source daily weather data.

Create Proposal and Load and Make Data Dict and Clean Data:

Description:

  • The goal of this task is to create the drafted proposal and make the corresponding data dictionary for the README and proposal respectively.

  • I want to load in and clean the data to only the relevant parameters that I need.

    • The columns I want specifically are all the daily weather data and the tornadoes from 2020-2025 in the surrounding counties to the twin cities.
  • Define new features (a feature for whether a tornado occurred, and a feature for major storms as well)

Acceptance Criteria:

  • Data Loaded and Merged

  • Data columns scrubbed for relevant data and missing data resolved

  • Data Dictionary finished

  • Data Features Defined
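As a hedged sketch of the planned features: the "major storm" flag could be derived from the daily WMO weather code, where the threshold used here (codes 95 and up, i.e. thunderstorms) is my assumption rather than a choice stated in the proposal, and the time_of_year column is sketched as the calendar month, one plausible reading of the merged preview shown earlier:

```python
# Hypothetical feature definitions; the >= 95 storm threshold and the
# month-based time_of_year encoding are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-05", "2022-06-20", "2022-06-21"]),
    "weather_code": [73.0, 95.0, 3.0],
})

# Severe-weather flag: WMO codes 95+ correspond to thunderstorms.
df["major_storm"] = (df["weather_code"] >= 95).astype(int)
# Simple seasonality feature: calendar month, 1-12.
df["time_of_year"] = df["date"].dt.month
```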

Perform EDA and Create Engineering Features and Start Write Up:

Description:

  • The goal of this task is to create the features listed previously.

  • Once the features are created, I want to do a primary EDA using seaborn/matplotlib plots.

    • Document my findings from this EDA so that I can put those findings in my write up.

Acceptance Criteria:

  • Features created

  • EDA performed and documented with roughly 200 words of observations
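One EDA cut that fits the plan above is comparing mean daily conditions on tornado days versus quiet days; the numbers below are toy stand-ins, not the real MSP data:

```python
# Class-conditional means as a first EDA pass (toy data for illustration).
import pandas as pd

df = pd.DataFrame({
    "tornado_occurred": [0, 0, 0, 1, 1],
    "wind_gusts_10m_max": [20.0, 25.0, 18.0, 55.0, 48.0],
    "dew_point_2m_mean": [40.0, 45.0, 38.0, 66.0, 70.0],
})

# Mean of each predictor per class; a seaborn barplot or boxplot over
# the same grouping would be the visual version of this table.
summary = df.groupby("tornado_occurred").mean()
```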

Create Models and Examine Measures of Accuracy and Add to Write Up:

Description:

  • The goal of this task is to create a simple classification model, probably logistic regression or naive Bayes.

  • From there I want to examine how accurate the model is by looking at the accuracy and precision scores.

  • I want to look at other scores potentially like the F1 score or ROC-AUC.

  • I also want to take some preliminary looks at implementing other, more complex classification models like Random Forest and Support Vector Machines (SVM)

Acceptance Criteria:

  • Simple Model(s) Created and fit examined and documented for write up.

  • Implement scores like F1 and ROC-AUC

  • Start Implementation of Random Forest and SVM
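A minimal sketch of the baseline classifier and the scores named above, run on synthetic stand-in data (the real features would come from the merged weather table):

```python
# Baseline logistic regression with accuracy, precision, F1, and
# ROC-AUC, on a synthetic toy target (not the real tornado data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))              # e.g. gusts, dew point, pressure, temp
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # toy "tornado day" rule

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

scores = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}
```

Because tornado days are rare, F1 and ROC-AUC will say more than raw accuracy here.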

Tune and Validate Classification and Regression and Add to Write Up:

Description:

  • The goal of this task is to finish off the Random Forest and SVM from the previous task.

  • From there I want to examine measures of fit and see how well they perform, using similar scoring to the previous tasks.

  • I want to add all of the measures of fit to the write up.

  • Start app using Tkinter

    • The app will use the best-fitting model coefficients to estimate whether a tornado is likely given inputs like precipitation, CAPE, surface pressure, temperature, etc.

Acceptance Criteria:

  • Random Forest and SVM models finished and examined.

  • Documentation for all the models is finished for the write up

  • Tkinter app is set up and ready to begin use.
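The tuning step could look like the small grid search below over a random forest; the grid values and scoring choice are placeholders, not tuned choices from the project:

```python
# Hedged sketch of hyperparameter tuning via cross-validated grid
# search (toy data; grid values are illustrative assumptions).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)   # toy, easily learnable target

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="f1",   # F1 suits the class imbalance of rare tornado days
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_
```

The winning estimator (grid.best_estimator_) would then be the one whose outputs feed the Tkinter app.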

Finish Write Up and Finish Presentation:

Description:

  • The goal of this task is to finish up the Tkinter app.

  • The second goal of this task is to finish the write up.

  • The other major goal is to start and finish the presentation based off the final write up.

Acceptance Criteria:

  • Write up, App, and presentation finished

Repo Organization

  • .github/: specifically designated for GitHub-related files, including workflows, actions, and templates customized for managing issues.

  • _extra/: dedicated to storing miscellaneous files that do not categorically align with other project sections, serving as a versatile repository for various supplementary documents.

  • _freeze/: houses frozen environment files that contain detailed information about the project’s environment setup and dependencies.

  • data/: designated for storing essential data files crucial for the project’s operations, including input files, datasets, and other vital data resources.

  • data_dictionary/: designated as a place for supplemental documentation that I have used or found helpful

  • images/: functioning as a central repository for visual assets utilized across the project, such as diagrams, charts, and screenshots, this directory houses essential visual components essential for project documentation and presentation purposes.

  • .gitignore: designed to define exclusions from version control, ensuring that specific files and directories are not tracked by Git, thereby simplifying the versioning process.

  • README.md: functioning as the central repository of project information, this README document provides vital details covering project setup, usage instructions, and an overarching summary of project objectives and scope.

  • about.qmd: Quarto file supplements project documentation by providing additional contextual information, describing our project purpose, as well as names and background of individual team members.

  • index.qmd: serves as the main page for our project, where our write up will eventually be. This Quarto file offers in-depth descriptions of our project, encompassing all code and visualizations, as well as eventually our results.

  • presentation.qmd: serves as a Quarto file that will present our slideshow of our final results of our project.

  • proposal.qmd: designed as the Quarto file responsible for our project proposal, housing our dataset, metadata, description, and questions, as well as our weekly plan of attack that will be updated weekly.

  • requirements.txt: specifies the dependencies and their respective versions required for the project to run successfully.

Source: a shared Kaggle competition project has a similar layout; for consistency, I checked that all of these files serve similar purposes.