Prediction of Tornado Occurrence in the Twin Cities

INFO 523 - Final Project

An analysis of historical tornado patterns in the Twin Cities area from 2000 to present (2025) to predict tornado occurrence
Author
Affiliation

The Crengineers - Tyler Hart

College of Information Science, College of Systems and Industrial Engineering, University of Arizona

Abstract

TODO: Add project abstract here.

Intro

The primary goal of this project is to use publicly available daily weather data to form relationships with NOAA records of tornado formation. This is crucial because, unlike hurricanes, which have a longer warning timeline, tornadoes are often detected only shortly before they occur. This can cause more extreme damage, as people do not have time to secure crucial belongings or shelter themselves in a non-affected area. Daily weather forecasts are available and are usually part of every person’s daily life, whether through a weather app or the news, and this publicly available data could potentially be used to warn and inform people of the chance of a tornado occurring in their area. One approach to this problem is to train classification models, such as logistic regression, naive Bayes, support vector machines, and random forests, on these daily measurements. TODO: More Here

This project is special to me, as I have had the unfortunate experience of having friends' and families' lives be altered by unexpected tornadoes. Because of this, I am going to focus my analysis on the Twin Cities area, pictured in the Map of Analysis Area section. This means I will focus my analysis on the surrounding counties and on weather data from the Minneapolis-Saint Paul (MSP) airport. This has led me to try and answer the two questions seen below in the Questions section.

Questions:

  1. Can I develop a model that successfully classifies which days tornado formation will occur using daily weather data for the Twin Cities area?

  2. Can I create a GUI/dashboard to quickly allow users to tune and view predictions?

Map of Analysis Area:

The counties surrounding the Twin Cities (Minneapolis-Saint Paul, MSP) are the only region where the data is valid. Since the airport is located in the heart of the Twin Cities, this analysis will only encompass the counties shown in the map below:

County Map

The counties were selected on a region-selection basis. To properly select the region where the data is valid, I drew a 60-mile circle around the MSP airport and selected any counties within that region. A 60-mile radius was chosen because it provides ample data while staying within roughly an hour's drive of the airport, which I felt was an applicable size.
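For reproducibility, a minimal sketch of this radius filter is shown below. The MSP coordinates are approximate, and the county_centroids dictionary is a hypothetical stand-in for a real centroid source such as a census gazetteer.

```
import math

# Approximate coordinates of MSP International Airport.
MSP_LAT, MSP_LON = 44.8848, -93.2223

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    r = 3958.8  # Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical county centroids (name -> (lat, lon)); the real list would
# come from a shapefile or gazetteer of Minnesota/Wisconsin counties.
county_centroids = {
    "Hennepin": (45.02, -93.49),
    "Ramsey": (45.02, -93.10),
}

selected = [name for name, (lat, lon) in county_centroids.items()
            if haversine_miles(MSP_LAT, MSP_LON, lat, lon) <= 60]
print(selected)
```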

Section Definitions:

In this write-up, there are six main sections: Intro, Dataset and Feature Engineering, Exploratory Data Analysis, Model Creation and Tuning, Results and Analysis, and App/Dashboard. In the Intro section I have identified the purpose, scope, and major questions I intend to try and answer. In the Dataset and Feature Engineering section I will discuss the cleaned-up dataset with the engineered features pertinent to the later model creation and analysis. In the Exploratory Data Analysis section, I will show plots pertaining to the variables of interest as well as comment on different relationships I noticed in an initial analysis. In the Model Creation and Tuning section I will discuss why I chose each model and the corresponding cross-validation I used to build the most representative model possible. In the Results and Analysis section, I will show analysis results and plots as well as discuss the results of the model tuning and cross-validation. Lastly, I will attempt to collate those results into a dashboard for user interfacing and quick use to inform of tornado formation.

Dataset and Feature Engineering

Dataset Description

I have gathered two raw datasets in trying to answer these questions. The first dataset, found in this repo at NOAA Data*, is the historical record of every tornado that has been reported as far back as 1952. It contains the relevant county, date, and tornado information, allowing me to filter and clean the data to compare against the second dataset. The second dataset, found in this repo at API Daily Data*, is the historical daily weather data measured at MSP International Airport as far back as 2000. It contains the daily weather information that I will be training my models against to hopefully predict tornado formation. The combination** of these datasets can be found in this repo at Combined Data*; the combined dataset merges the two sources and adds the new features described in the Feature Engineering section.

Feature Engineering

This data will need a few more columns. Namely, I will need to engineer a column for tornado_occurred. This column will simply be a boolean represented as an integer that is 1 if a tornado occurred on that day in an applicable county. Another column I will need is time_of_year, which represents the month the event occurred in. This is important because tornadoes are seasonal in most of the United States, so this column will help surface relationships and serve as a potential training feature for the model. A minimal sketch of how these columns can be derived is shown after the table below.

Variable           Type           Description                                           Unit
tornado_occurred   Bool (Float)   Did a tornado occur on this day (1 if yes, 0 if no)   None
time_of_year       Int            The index of the month the event occurred             Month
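As a rough illustration (not the exact merge code, which is documented in the proposal), these two columns could be derived from the combined daily dataframe roughly as follows. The file name, the date column, and the tornado_events count are assumptions about the merged dataset's schema.

```
import pandas as pd

# Hypothetical file name and schema: one row per day at MSP, with a `date`
# column and a `tornado_events` count merged in from the NOAA records.
combined = pd.read_csv("combined_data.csv", parse_dates=["date"])

# tornado_occurred: 1.0 if at least one tornado was reported in an in-scope
# county on that day, else 0.0.
combined["tornado_occurred"] = (combined["tornado_events"] > 0).astype(float)

# time_of_year: integer month index (1-12) to capture seasonality.
combined["time_of_year"] = combined["date"].dt.month
```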

* For more information on what is contained in the dataset, please go to the Data Dictionary. The dataset I am going to be using is a combination of the NOAA tornado archive data and the daily weather Open-Meteo API data.

** For more detailed information about the merge process, see my proposal at Proposal Link.


Exploratory Data Analysis (EDA)

The first thing I notice is that there appear to be clear conditions under which tornadoes form. Mean temperature is roughly 60 to 80 degrees, dew point is 50 to 75, wind direction over 10 minutes is generally 100 to 200 degrees, surface pressure is generally tightly grouped, and the wind gust and wind speed maximums are generally 20 to 40 mph and 10 to 20 mph, respectively.

Trends compared against date look fairly consistent from year to year. The values form a dense cloud for most of the variables, which tells me conditions have not changed over time. This gives me hope that the trends are stable and will continue going forward, so I plan to treat them as consistent and extend them into future forecasting.

Since tornadoes are seasonal in most of the US, I also wanted to see whether the initial trends I noticed were correlated with time of year. On an initial look, the conditions do appear to be connected, as the middle months (May to August) have conditions that match when the tornadoes occur. This is a little concerning, and I will need to look at some initial model fits to understand how these features interact.

The last major thing to consider is how sporadic these events are: roughly 380 events in the area of interest after duplicates are removed, over a period of about 24 years (roughly 24 × 365 days). I am worried that this imbalance may leak into my train/test splits, and something may need to be done to rectify it.
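A minimal sketch of the kind of comparison plots and imbalance check behind these observations is shown below, reusing the combined dataframe from the previous section; the exact plotting choices are illustrative rather than a copy of the notebook.

```
import matplotlib.pyplot as plt
import seaborn as sns

# Compare a few daily variables on tornado vs. non-tornado days.
features = ["temperature_2m_mean", "dew_point_2m_max", "surface_pressure_mean"]
fig, axes = plt.subplots(1, len(features), figsize=(15, 4))
for ax, col in zip(axes, features):
    sns.boxplot(data=combined, x="tornado_occurred", y=col, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()

# Seasonality: count of tornado days by month.
tornado_months = combined.loc[combined["tornado_occurred"] == 1, "time_of_year"]
ax = tornado_months.value_counts().sort_index().plot(kind="bar")
ax.set_xlabel("Month")
ax.set_ylabel("Tornado days")
plt.show()

# Class imbalance: roughly 380 tornado days against many thousands of days.
print(combined["tornado_occurred"].value_counts())
```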

Model Creation and Tuning

Because of the small sample size I mentioned previously, I am electing to do a 40/60 test/train split. I am doing this so that both sets have ample data and neither set ends up with only 15 to 20 positive points. I am trying to remedy the imbalance here and will adjust as needed. I also dropped the date column, as the time_of_year column encompasses this variance and is better suited to the seasonal nature of tornado events.
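A sketch of the split and column pruning described above, under the assumption that the feature matrix is everything except the date and the target; stratifying on the target spreads the handful of tornado days proportionally across both sets.

```
from sklearn.model_selection import train_test_split

# Drop the raw date (time_of_year already encodes the seasonality) and the
# target from the feature matrix.
X = combined.drop(columns=["date", "tornado_occurred"])
y = combined["tornado_occurred"]

# test_size=0.4 mirrors the 40/60 test/train split described above;
# stratify keeps the rare tornado days in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=63
)
```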

{'Component 1': [], 'Component 2': [], 'Component 3': [], 'Component 4': [], 'Component 5': [], 'Component 6': [], 'Component 7': ['CZ_NAME_STR'], 'Component 8': [], 'Component 9': [], 'Component 10': [], 'Component 11': ['weather_code'], 'Component 12': []}
[Output: 5 × 25 loading table for the first five principal components across the features, including CZ_NAME_STR, TOR_F_SCALE, TOR_LENGTH, TOR_WIDTH, weather_code, the daily temperature, precipitation, dew point, humidity, and surface pressure measurements, tornado_occurred, and time_of_year.]

Class counts after the split:
  Train: 7279 non-tornado days (0), 32 tornado days (1)
  Test:  1819 non-tornado days (0),  9 tornado days (1)

Classification Models

For this analysis I chose to create and tune four different classification models, each tuned with randomized-search cross-validation over a generated parameter sweep. For each model I deemed the F1 score the best measure of fit because it balances precision and recall. This allows some models to fully predict the tornado cases at the expense of some false tornado alarms. I believe this is the right approach because the cost of missing an early warning is much greater than the cost of giving a warning when it is not needed.
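The general pattern for each model is sketched below using the logistic regression search as the example; the parameter grid mirrors the search output that follows, while the fit and inspection calls are the standard scikit-learn idiom rather than a verbatim copy of the notebook.

```
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "C": [0.01, 0.1, 1, 10, 100],
    "class_weight": [None, "balanced"],
    "max_iter": [100, 500, 1000],
    "penalty": ["l2", "l1"],
    "solver": ["liblinear", "saga"],
}

# Random search over the grid, scored with macro-averaged F1 to balance
# precision and recall on the heavily imbalanced target.
lr_grid = RandomizedSearchCV(
    LogisticRegression(random_state=63),
    param_distributions=param_dist,
    n_iter=100, cv=5, scoring="f1_macro", random_state=63,
)
lr_grid.fit(X_train, y_train)
print("Best LR parameters:", lr_grid.best_params_)
print("Best LR score:", lr_grid.best_score_)
```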

RandomizedSearchCV(cv=5, estimator=LogisticRegression(random_state=63),
                   n_iter=100,
                   param_distributions={'C': [0.01, 0.1, 1, 10, 100],
                                        'class_weight': [None, 'balanced'],
                                        'max_iter': [100, 500, 1000],
                                        'penalty': ['l2', 'l1'],
                                        'solver': ['liblinear', 'saga']},
                   random_state=63, scoring='f1_macro')
  • Logistic Regression
    • Rationale:
    • Chosen Hyper-Parameters:
    Best LR parameters: {'solver': 'saga', 'penalty': 'l1', 'max_iter': 1000, 'class_weight': 'balanced', 'C': 0.1}
    • Results:
    Best LR score: 0.5297675861856518
RandomizedSearchCV(cv=5, estimator=GaussianNB(), n_iter=100,
                   param_distributions={'var_smoothing': [1e-09, 1e-08, 1e-07,
                                                          1e-06, 1e-05,
                                                          0.0001]},
                   random_state=63, scoring='f1_macro')

bayes_grid.fit(X_train, y_train)
  • Naive Bayes
    • Rationale:
    • Chosen Hyper-Parameters:
    Best Bayes parameters: {'var_smoothing': 0.0001}
    • Results:
    Best Bayes score: 0.49878008543006136

RandomizedSearchCV(cv=5, estimator=SVC(random_state=63), n_iter=100,
                   param_distributions={'C': [0.1, 1, 10, 100],
                                        'degree': [2, 3, 4],
                                        'gamma': ['scale', 'auto', 0.01, 0.1,
                                                  1],
                                        'kernel': ['linear', 'rbf', 'poly']},
                   random_state=63, scoring='f1_macro')
  • Support Vector Classifier (SVC)

    • Rationale:

    • Chosen Hyper-Parameters:

      Best SVC parameters: {'kernel': 'rbf', 'gamma': 1, 'degree': 2, 'C': 10}
    • Results:

      Best SVC score: 0.7305947951647276

RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=63),
                   n_iter=100,
                   param_distributions={'max_depth': [None, 5, 10, 20],
                                        'max_features': ['auto', 'sqrt',
                                                         'log2'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [50, 100, 200]},
                   random_state=63, scoring='f1_macro')
  • Random Forest Classifier

    • Rationale:

    • Chosen Hyper-Parameters:

      Best RF parameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 10}
    • Results:

      Best RF score: 0.7305947951647276

Results and Analysis

As can be seen in the confusion matrices, the logistic regression model performs the best when trained. It correctly predicts all of the tornado events at the cost of 128 false tornado predictions over a 25-year time span. Similarly, the Naive Bayes predictor also correctly classified the tornado conditions, but at the expense of 202 false positives; I believe this is an artifact of the model's complexity being poorly matched to this problem. The last two models, the Random Forest classifier and the Support Vector Machine, performed exactly the same: both correctly classified all of the non-tornado days but identified only 2 of the 9 tornado days.

Overall, in terms of accuracy, the Support Vector Machine and Random Forest models had the highest accuracy scores, each incorrectly classifying only 7 events. This does not mean they were in fact the best: the 7 events they misclassified are the costly ones, i.e., loss of life, belongings, and loved ones. For this reason I have elected to select the Logistic Regression as the best model. It both correctly classified the tornadoes and minimized the number of false tornado predictions when compared to the Naive Bayes model. The classification report for the Logistic Regression model can be seen below:

              precision    recall  f1-score   support

           0       1.00      0.93      0.96      1819
           1       0.07      1.00      0.12         9

    accuracy                           0.93      1828
   macro avg       0.53      0.96      0.54      1828
weighted avg       1.00      0.93      0.96      1828
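
For reference, the confusion matrix and the report above can be produced from the fitted search object with a few lines of scikit-learn; lr_grid here is assumed to be the fitted logistic regression search from the Model Creation and Tuning section.

```
from sklearn.metrics import classification_report, confusion_matrix

# Predict on the held-out test set with the best logistic regression found.
y_pred = lr_grid.best_estimator_.predict(X_test)

# Rows are true classes (0 = no tornado, 1 = tornado), columns are predictions.
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall, and F1, as reported above.
print(classification_report(y_test, y_pred))
```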

App/Dashboard

Below is a simple application/GUI meant to take the best model and predict tornado occurrence from user-supplied daily weather inputs.
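
As a starting point, a heavily simplified PySimpleGUI sketch is shown below; the three input fields and the predict_tornado placeholder are illustrative assumptions, and the real dashboard would pass a full feature row to the fitted logistic regression model instead of the stub rule.

```
import PySimpleGUI as sg

def predict_tornado(values: dict) -> str:
    """Placeholder prediction hook. In the real dashboard this would build a
    one-row feature frame from the inputs and call the fitted model."""
    # Stub rule so the GUI runs standalone.
    return ("Possible tornado conditions"
            if float(values["temperature_2m_mean"]) >= 60
            else "No tornado expected")

layout = [
    [sg.Text("Mean temperature (F)"), sg.Input(key="temperature_2m_mean")],
    [sg.Text("Max dew point (F)"), sg.Input(key="dew_point_2m_max")],
    [sg.Text("Month (1-12)"), sg.Input(key="time_of_year")],
    [sg.Button("Predict"), sg.Text("", key="-RESULT-", size=(30, 1))],
]

window = sg.Window("Twin Cities Tornado Predictor", layout)
while True:
    event, values = window.read()
    if event in (sg.WIN_CLOSED, "Exit"):
        break
    if event == "Predict":
        window["-RESULT-"].update(predict_tornado(values))
window.close()
```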