Prediction of Tornado Occurance in the Twin Cities

INFO 523 - Final Project

An analysis of historical tornado patterns in the area of Twin Ciries from 2000-Current(2025) to predict Tornado Occurence

Author

Affiliation

The Crengineers - Tyler Hart

College of Information Science, College of Systems and Industrial Engineering, University of Arizona

Abstract

TODO: Add project abstract here.

Intro

The primary goal of this project is to use publically available daily weather data to form relationships with NOAA records of tornado formation. This is crucial because unlike hurricanes which have a longer warning timeline, tornadoes are often learned about shortly before they occur. This can cause more extreme damage as people do not have time to secure crucial belongins or themselves in a non-affected area. Daily weather forecast are available and usually a part of every person’s daily life. Whether that be through a weather app or the news this publically available data could be potentially utilized to help warn/inform people of the chance of tornado occurring in their areas. One approach to this problem is to create regression models such as, Lasso, Ridge, and Random Forest using polynomial features. TODO: More Here

This project is special to me as I have had the unfortunate experience of having friends and families lives be altered by unexpected tornados. Because of this I am going to focus my analysis to the Twin Cities area, pictured in the Map of Analysis Area Section. This means I will focus my analysis to the surrounding counties and weather data from the Mineapolis-Saint Paul, MSP, airport. This has lead me to two try and answer the two questions seen below in the Questions section.

Questions:

Can I develop a model that successfully classifies which days tornado formation will occur using daily weather data for the Twin Cities area?
Can I create a GUI/dashboard to quickly allow users to tune/view predictions.

Map of Analysis Area:

The surrounding colonies to the Twin Cities, also known as MSP, are the only region where the data is valid. Since the airport is located in the heart of the twin cities this analysis will only encompass the counties shown in the map below:

Picture of Applicable Counties — County Map

The counties were selected on a region selection basis. To properly select the region where the data is valid I drew a 60 mile circle around the MSP airport and selected any counties within that region. 60 miles was chosen to give me both ample data but also only have roughly an hour drive from the airport as I felt that was an applicable size.

Section Definitions:

In this write-up, there will be 6 main sections, Intro, Dataset and Feature Engineering, Exploratory Data Analysis, Model Creation and Tuning, Results and Analysis, and App/Dashboard. In the Intro section I have identified the purpose, scope and major questions I intend to try and answer. In the Dataset and Feature Engineering section I will discuss the cleaned up dataset with the engineered features pertinent to the future model creation and analysis. In the Ecploratory Data Analysis Section, I will show plots pertaining to the variable of interest as well as comment on different relationships I noticed in a primary analysis. In the Model Creation and Tuning section I will discuss why I chose each model and the corresponding cross validation I used to validate/make the most representative model possible. In the Results and Analysis section, I will show analysis results and plots as well as discuss the results of the model tuning and cross validation. Laslty I will attempt to collate those results into a dashboard for user interfacing and quick use to inform of tornado creation.

DataSet and Feature Engineering

DataSet Description

I have 2 raw datasets that I have gathered in trying to answer these questions. The 1st data set found in this repo at NOAA Data* is the historical data of every tornado that has been reported as far back as 1952. This data contains the relevant county, data, and tornado information, to allow me to filter and clean the data to compare to the 2nd data set I will be using. The 2nd data set found in this repo at API Daily Data* is the historical daily weather data measured at MSP International Airport as far back as 2000. This data contains daily weather information that I will be training my regression models against to hopefully predict tornado formation. The combination** of these datasets can be found in this repo at Combined Data*. The combined dataset is the combination of these two datasets with new features from the Feature Engineering section.

Feature Engineering

This data will need a few more columns. Namely I will need to Feature Engineer a column for tornado_occurred. This column will simply be boolean represented as an integer that is true if a tornado occurred on that day in an applicable county. Another column I will need time_of_year. This column will represent the month the event occurred in. This is imoortant as tornadoes are seasonal in most of the United States. This will help find relationships and be a potential trainer of the model.

Variable	Type	Description	Unit
tornado_occurred	Bool (Float)	Did a tornado occurr on this day (1 if yes, 0 if no)	None
time_of_year	Int	The index of the month the event occurred	Month

* for more information on what is contained in the dataset plese go to Data Dictionary The dataset I am going to be using is a combination of the NOAA tornado archives data and the daily weather open-meteo API data.

** for more detailed information about the merge process look in my proposal at Proposal Link

PySimpleGUI is now located on a private PyPI server.  Please add to your pip command: -i https://PySimpleGUI.net/install

The version you just installed should uninstalled:
   python -m pip uninstall PySimpleGUI
   python -m pip cache purge

Then install the latest from the private server:
python -m pip install --upgrade --extra-index-url https://PySimpleGUI.net/install PySimpleGUI

You can also force a reinstall using this command and it'll install the latest regardless of what you have installed currently
python -m pip install --force-reinstall --extra-index-url https://PySimpleGUI.net/install PySimpleGUI

Use python3 command if you're running on the Mac or Linux

Exploratory Data Analysis (EDA)

The first thing I notice is that there looks like clear conditions for what makes tornados. Mean temperature is roughly 60 to 80 degrees, dew point is 50 to 75, wind direction over 10 minutes is generaly 100 to 200 degrees, surface pressure is generally grouped, wind gust and wind spee maximums are generally 20 to 40 mph and 10 to 20 mph respectively.

Trends when compared to date look fairly normally distributed from year to year. It looks like a dense cloud for most of the variables which tells me conditions have not changed over time. This gives me hope that the trends are consistent and will continue going forward. I plan on using this information to say trends are consistent and can be extended into the future forecasting.

Since tornados are seasonal in most of the US I wanted to also see if the initial trends I noticed were correlated. On an initial look the conditions do appear to be connected as the middle months (May to August) appear to have conditions that match when the tornados occurr. This is a little concerning and I will need to look at some initial model fits to understand how they interact.

The last major thing to think about is how sporadic these events occurr. Roughly 380 events in the area of interest after duplicates are removed in a period of 24 * 365 days. I am worried that this may leak into my train test splits and something may need to be done in order to rectify this.

Model Creation and Tuning

Because of the issue of a small sample side I mentioned previously, I am electing to do a 40/60 test/train split. I am doing this so that the datasets both have ample data and one set does not have only 15-20 points. I am trying to remedy this here and will adjust as needed. I got rid of the date section here as well as the time_of_year column encompasses this variance and is more apt to the seasonal nature of the tornado event.

{'Component 1': [], 'Component 2': [], 'Component 3': [], 'Component 4': [], 'Component 5': [], 'Component 6': [], 'Component 7': ['CZ_NAME_STR'], 'Component 8': [], 'Component 9': [], 'Component 10': [], 'Component 11': ['weather_code'], 'Component 12': []}

	CZ_NAME_STR	TOR_F_SCALE	TOR_LENGTH	TOR_WIDTH	weather_code	temperature_2m_mean	temperature_2m_max	temperature_2m_min	precipitation_sum	rain_sum	...	dew_point_2m_max	dew_point_2m_min	relative_humidity_2m_mean	relative_humidity_2m_max	relative_humidity_2m_min	surface_pressure_mean	surface_pressure_max	surface_pressure_min	tornado_occurred	time_of_year
0	-0.009300	-0.073332	0.065846	0.061042	0.092541	0.336861	0.329270	0.339568	0.160680	0.175481	...	0.353257	0.344675	0.057663	0.157992	-0.007560	-0.228894	-0.254573	-0.189044	0.075383	0.090824
1	-0.039695	-0.184275	0.184793	0.174903	0.278559	-0.207202	-0.220331	-0.186266	0.280730	0.235538	...	-0.128632	-0.148664	0.328693	0.199453	0.339753	-0.202312	-0.134906	-0.237144	0.187763	-0.066202
2	-0.091596	-0.450961	0.447813	0.434640	-0.125033	0.035215	0.043584	0.026687	-0.027067	0.001868	...	0.000355	0.012331	-0.144284	-0.116238	-0.138544	0.181068	0.154567	0.182200	0.458211	0.007587
3	-0.004646	-0.024410	0.030190	0.029057	-0.027897	-0.007013	0.001511	-0.014834	-0.066826	-0.070795	...	-0.055338	-0.105310	-0.365922	-0.346174	-0.287834	-0.198999	-0.133762	-0.249427	0.025082	-0.075326
4	0.015050	0.072639	-0.078721	-0.094574	0.185181	0.039919	0.030063	0.053417	0.464426	0.446787	...	0.027551	0.015534	-0.115127	-0.113053	-0.090028	0.352344	0.364245	0.311488	-0.068637	0.011909

5 rows × 25 columns

Value Train 0: Count Train 7279
Value Train 1: Count Train 32
Value Test 0: Count Test 1819
Value Test 1: Count Test 9

Regression Models

For this analysis I chose to create and tune 4 different classification models. All undergoing random search cross validation over a generated parameter sweep. For each model I deemed f1 score to be the best measure of fit because it balances percision and recall. This allows some models to fully predict the tornado cases at the expense of some false tornado alarms. I believe this is the right approach because the cost of missing an early warning alarm is much greater than giving an early warning when it is not needed.

RandomizedSearchCV(cv=5, estimator=LogisticRegression(random_state=63),
                   n_iter=100,
                   param_distributions={'C': [0.01, 0.1, 1, 10, 100],
                                        'class_weight': [None, 'balanced'],
                                        'max_iter': [100, 500, 1000],
                                        'penalty': ['l2', 'l1'],
                                        'solver': ['liblinear', 'saga']},
                   random_state=63, scoring='f1_macro')

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Logistic Regression

Rationale:
Chosen Hyper-Parameters:

Best LR parameters: {'solver': 'saga', 'penalty': 'l1', 'max_iter': 1000, 'class_weight': 'balanced', 'C': 0.1}

Results:

Best LR score: 0.5297675861856518

RandomizedSearchCV(cv=5, estimator=GaussianNB(), n_iter=100,
                   param_distributions={'var_smoothing': [1e-09, 1e-08, 1e-07,
                                                          1e-06, 1e-05,
                                                          0.0001]},
                   random_state=63, scoring='f1_macro')

bayes_grid.fit(X_train, y_train) * Naive Bayes * Rationale: * Chosen Hyper-Parameters:

::: {#da1df00b .cell execution_count=10}

::: {.cell-output .cell-output-stdout}
```
Best Bayes parameters: {'var_smoothing': 0.0001}
```
:::
:::

Results:

::: {#ebd8a870 .cell execution_count=11}

::: {.cell-output .cell-output-stdout} Best Bayes score: 0.49878008543006136 ::: :::

RandomizedSearchCV(cv=5, estimator=SVC(random_state=63), n_iter=100,
                   param_distributions={'C': [0.1, 1, 10, 100],
                                        'degree': [2, 3, 4],
                                        'gamma': ['scale', 'auto', 0.01, 0.1,
                                                  1],
                                        'kernel': ['linear', 'rbf', 'poly']},
                   random_state=63, scoring='f1_macro')

Vector Control Regression
- Rationale:
- Chosen Hyper-Parameters:
```
Best SVC parameters: {'kernel': 'rbf', 'gamma': 1, 'degree': 2, 'C': 10}
```
- Results:
::: {#5f92abb6 .cell execution_count=14}

::: {.cell-output .cell-output-stdout} Best SVC score: 0.7305947951647276 ::: :::

RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=63),
                   n_iter=100,
                   param_distributions={'max_depth': [None, 5, 10, 20],
                                        'max_features': ['auto', 'sqrt',
                                                         'log2'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [50, 100, 200]},
                   random_state=63, scoring='f1_macro')

Random Forest Classifier
- Rationale:
- Chosen Hyper-Parameters:
```
Best RF parameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 10}
```
- Results:
::: {#bced74f9 .cell execution_count=17}

::: {.cell-output .cell-output-stdout} Best RF score: 0.7305947951647276 ::: :::

Results and Analysis

As can be seen in the confusion matrices, the logistic regression model performs the best when trained. It correctly perdicts all of the tornado events at the cost of 128 false tornado predictions over a 25 year time span. Similarly, the Naive Bayes predictor also correctly classified the tornado conditions at the expense of 202 false positives. I believe this is an artifact of the model complexity being high for this model. The last two models, the Random Forest Classifier and the Support Vector Machines, performed exactly the same. Both of these models correctly classified all of the non-tonado days, but failed to classify all but 2 of the tornado days.

Overall in terms of accuracy, the Support Vector Machines and Random Forest models had the highest accuracy score only incorrectly classifiying 7 events each. This does not mean they were in fact the best. Both of these models as the 7 events they misclassified are costly. i.e. lost of life, belongings, and loved ones. For this reason I have elected to select the Logistic Regression as the best model. It both correctly classified the tornadoes and minimized the number of false tornado predictions when compared to the Naive Bayes model. The classification report for the Logistic Regression model can be seen below:

              precision    recall  f1-score   support

           0       1.00      0.93      0.96      1819
           1       0.07      1.00      0.12         9

    accuracy                           0.93      1828
   macro avg       0.53      0.96      0.54      1828
weighted avg       1.00      0.93      0.96      1828

App/Dashboard

Below is a simple application/GUI meant to take the best model and predict

--- title: "Prediction of Tornado Occurance in the Twin Cities" subtitle: "INFO 523 - Final Project" author: - name: "The Crengineers - Tyler Hart" affiliations: - name: "College of Information Science, College of Systems and Industrial Engineering, University of Arizona" description: "An analysis of historical tornado patterns in the area of Twin Ciries from 2000-Current(2025) to predict Tornado Occurence" format: html: code-tools: true code-overflow: wrap embed-resources: true editor: visual execute: warning: false echo: false freeze: auto # re-render only when source changes jupyter: python3 --- ## Abstract TODO: Add project abstract here. ## Intro The primary goal of this project is to use publically available daily weather data to form relationships with NOAA records of tornado formation. This is crucial because unlike hurricanes which have a longer warning timeline, tornadoes are often learned about shortly before they occur. This can cause more extreme damage as people do not have time to secure crucial belongins or themselves in a non-affected area. Daily weather forecast are available and usually a part of every person's daily life. Whether that be through a weather app or the news this publically available data could be potentially utilized to help warn/inform people of the chance of tornado occurring in their areas. One approach to this problem is to create regression models such as, Lasso, Ridge, and Random Forest using polynomial features. TODO: More Here This project is special to me as I have had the unfortunate experience of having friends and families lives be altered by unexpected tornados. Because of this I am going to focus my analysis to the Twin Cities area, pictured in the Map of Analysis Area Section. This means I will focus my analysis to the surrounding counties and weather data from the Mineapolis-Saint Paul, MSP, airport. This has lead me to two try and answer the two questions seen below in the Questions section. # Questions: 1. Can I develop a model that successfully classifies which days tornado formation will occur using daily weather data for the Twin Cities area? 2. Can I create a GUI/dashboard to quickly allow users to tune/view predictions. # Map of Analysis Area: The surrounding colonies to the Twin Cities, also known as MSP, are the only region where the data is valid. Since the airport is located in the heart of the twin cities this analysis will only encompass the counties shown in the map below: ![County Map](images/MN-counties.png "Picture of Applicable Counties") The counties were selected on a region selection basis. To properly select the region where the data is valid I drew a 60 mile circle around the MSP airport and selected any counties within that region. 60 miles was chosen to give me both ample data but also only have roughly an hour drive from the airport as I felt that was an applicable size. # Section Definitions: In this write-up, there will be 6 main sections, Intro, Dataset and Feature Engineering, Exploratory Data Analysis, Model Creation and Tuning, Results and Analysis, and App/Dashboard. In the Intro section I have identified the purpose, scope and major questions I intend to try and answer. In the Dataset and Feature Engineering section I will discuss the cleaned up dataset with the engineered features pertinent to the future model creation and analysis. In the Ecploratory Data Analysis Section, I will show plots pertaining to the variable of interest as well as comment on different relationships I noticed in a primary analysis. In the Model Creation and Tuning section I will discuss why I chose each model and the corresponding cross validation I used to validate/make the most representative model possible. In the Results and Analysis section, I will show analysis results and plots as well as discuss the results of the model tuning and cross validation. Laslty I will attempt to collate those results into a dashboard for user interfacing and quick use to inform of tornado creation. ## DataSet and Feature Engineering # DataSet Description I have 2 raw datasets that I have gathered in trying to answer these questions. The 1st data set found in this repo at [NOAA Data](https://github.com/INFO-523-SU25/final-project-hart/blob/main/data/storm_data_search_results.csv)* is the historical data of every tornado that has been reported as far back as 1952. This data contains the relevant county, data, and tornado information, to allow me to filter and clean the data to compare to the 2nd data set I will be using. The 2nd data set found in this repo at [API Daily Data](https://github.com/INFO-523-SU25/final-project-hart/blob/main/data/mn_weather_data.csv)* is the historical daily weather data measured at MSP International Airport as far back as 2000. This data contains daily weather information that I will be training my regression models against to hopefully predict tornado formation. The combination** of these datasets can be found in this repo at [Combined Data](https://github.com/INFO-523-SU25/final-project-hart/blob/main/data/tornado_days.csv)*. The combined dataset is the combination of these two datasets with new features from the Feature Engineering section. # Feature Engineering This data will need a few more columns. Namely I will need to Feature Engineer a column for tornado_occurred. This column will simply be boolean represented as an integer that is true if a tornado occurred on that day in an applicable county. Another column I will need time_of_year. This column will represent the month the event occurred in. This is imoortant as tornadoes are seasonal in most of the United States. This will help find relationships and be a potential trainer of the model. | Variable | Type | Description | Unit | |----------|------|-------------|------| | tornado_occurred | Bool (Float) | Did a tornado occurr on this day (1 if yes, 0 if no) | None | | time_of_year | Int | The index of the month the event occurred | Month | \* for more information on what is contained in the dataset plese go to [Data Dictionary](https://github.com/INFO-523-SU25/final-project-hart/blob/main/data/README.md) The dataset I am going to be using is a combination of the NOAA tornado archives data and the daily weather open-meteo API data. ** for more detailed information about the merge process look in my proposal at [Proposal Link](https://github.com/INFO-523-SU25/final-project-hart/blob/main/proposal.qmd) ```{python} #| label: load-pkgs #| message: false # Here are the packages I need as of now to do basic dataset operations import numpy as np import statsmodels.api as sm import pandas as pd import seaborn as sns import matplotlib.pyplot as plt import PySimpleGUI as sg from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import LabelEncoder from sklearn.ensemble import RandomForestClassifier from sklearn.neighbors import NearestNeighbors from sklearn import svm from sklearn.linear_model import LogisticRegression from sklearn.naive_bayes import GaussianNB from sklearn.model_selection import GridSearchCV, RandomizedSearchCV,cross_val_score from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report,roc_curve, roc_auc_score, accuracy_score from sklearn.decomposition import PCA ``` ```{python} #| label: clean-data #| message: false AllDaysWithTornados = pd.read_csv('data/tornado_days.csv') AllDaysWithTornados = AllDaysWithTornados.drop('Unnamed: 0', axis=1) AllDaysWithTornados['CZ_NAME_STR'] = AllDaysWithTornados['CZ_NAME_STR'].fillna('None') AllDaysWithTornados['TOR_F_SCALE'] = AllDaysWithTornados['TOR_F_SCALE'].fillna('None') label_encoder = LabelEncoder() AllDaysWithTornados['CZ_NAME_STR'] = label_encoder.fit_transform(AllDaysWithTornados['CZ_NAME_STR']) # print("Encoded Labels:", AllDaysWithTornados['CZ_NAME_STR']) AllDaysWithTornados['TOR_F_SCALE'] = label_encoder.fit_transform(AllDaysWithTornados['TOR_F_SCALE']) # print("Encoded Labels:", AllDaysWithTornados['TOR_F_SCALE']) # AllDaysWithTornados['TOR_F_SCALE'].unique() # AllDaysWithTornados['CZ_NAME_STR'].unique() ``` ## Exploratory Data Analysis (EDA) ```{python} #| label: eda #| message: false def UnivariateAnalysis(df, column = None): """ Perform univariate analysis on the dataset. This function displays descriptive statistics and creates the appropriate plot for both numerical and categorical variables. """ if column != None: numeric_columns = column else: numeric_columns = df.select_dtypes(include=[np.number]).columns.tolist() # Get numeric columns for plotting # print(f"Numeric columns in {df} dataset:", numeric_columns) # Print the list of numeric for data in numeric_columns: print(df[data].describe()) # Display descriptive statistics for each numeric column sns.histplot(data=df, x=data, bins=30).set_title(f'Distribution of {data}') plt.show() # Display the histogram for each numeric column if column != None: categorical_columns = column else: categorical_columns = df.select_dtypes(exclude=[np.number]).columns.tolist() # Get numeric columns for plotting # print(f"Categoric columns in {df} dataset:", categorical_columns) # Print the list of categoric for data in categorical_columns: print(df[data].describe()) # Display descriptive statistics for each categoric column sns.countplot(data=df, x=data).set_title(f'Count of {data}') plt.xticks(rotation=45) # Rotate x-axis labels for better readability plt.tight_layout() # Adjust layout to prevent label overlap plt.show() # Display the count plot for each categoric column #Down selected to a few interesting columns sns.pairplot(data=AllDaysWithTornados, x_vars=['temperature_2m_mean', 'precipitation_sum', 'dew_point_2m_mean', 'wind_direction_10m_dominant', 'surface_pressure_mean', 'wind_gusts_10m_max', 'wind_speed_10m_max', 'date', 'relative_humidity_2m_mean', 'time_of_year', 'weather_code'], y_vars=['tornado_occurred']) plt.show() sns.pairplot(data=AllDaysWithTornados, x_vars=['temperature_2m_mean', 'precipitation_sum', 'dew_point_2m_mean', 'wind_direction_10m_dominant', 'surface_pressure_mean', 'wind_gusts_10m_max', 'wind_speed_10m_max', 'relative_humidity_2m_mean', 'tornado_occurred', 'time_of_year', 'weather_code'], y_vars=['date']) plt.show() sns.pairplot(data=AllDaysWithTornados, x_vars=['temperature_2m_mean', 'precipitation_sum', 'dew_point_2m_mean', 'wind_direction_10m_dominant', 'surface_pressure_mean', 'wind_gusts_10m_max', 'wind_speed_10m_max', 'relative_humidity_2m_mean', 'tornado_occurred', 'date', 'weather_code'], y_vars=['time_of_year']) plt.show() # ''' # Perform univariate analysis on the datasets. # ''' # UnivariateAnalysis(InitialDS) ``` The first thing I notice is that there looks like clear conditions for what makes tornados. Mean temperature is roughly 60 to 80 degrees, dew point is 50 to 75, wind direction over 10 minutes is generaly 100 to 200 degrees, surface pressure is generally grouped, wind gust and wind spee maximums are generally 20 to 40 mph and 10 to 20 mph respectively. Trends when compared to date look fairly normally distributed from year to year. It looks like a dense cloud for most of the variables which tells me conditions have not changed over time. This gives me hope that the trends are consistent and will continue going forward. I plan on using this information to say trends are consistent and can be extended into the future forecasting. Since tornados are seasonal in most of the US I wanted to also see if the initial trends I noticed were correlated. On an initial look the conditions do appear to be connected as the middle months (May to August) appear to have conditions that match when the tornados occurr. This is a little concerning and I will need to look at some initial model fits to understand how they interact. The last major thing to think about is how sporadic these events occurr. Roughly 380 events in the area of interest after duplicates are removed in a period of 24 * 365 days. I am worried that this may leak into my train test splits and something may need to be done in order to rectify this. ## Model Creation and Tuning Because of the issue of a small sample side I mentioned previously, I am electing to do a 40/60 test/train split. I am doing this so that the datasets both have ample data and one set does not have only 15-20 points. I am trying to remedy this here and will adjust as needed. I got rid of the date section here as well as the time_of_year column encompasses this variance and is more apt to the seasonal nature of the tornado event. ```{python} #| label: pca #| message: false def DR_PCA(df): numerical_cols = df.select_dtypes(include = ['int64', 'float64']).columns scaler = StandardScaler() scaled_data = scaler.fit_transform(df[numerical_cols]) pca = PCA(n_components = 0.95) reduced_data = pca.fit_transform(scaled_data) loading_scores = pd.DataFrame(pca.components_, columns = numerical_cols) pca_full = PCA() pca_full.fit(scaled_data) cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_) plt.figure(figsize = (8, 6)) plt.plot(cumulative_variance) plt.xlabel('Number of Components') plt.ylabel('Cumulative Explained Variance') plt.title('Scree Plot') plt.grid(True) inflection_point = np.argmax(cumulative_variance >= 0.95) + 1 plt.axvline(x=inflection_point, color='red', linestyle='--') plt.axhline(y=cumulative_variance[inflection_point], color='red', linestyle='--') plt.text(inflection_point+1, 0.5, f'Inflection Point:\n{inflection_point} Components', color = 'red') plt.show() return loading_scores def get_top_features_for_each_component(loading_scores, threshold): top_features = {} # print(loading_scores.shape[0]) for i in range(loading_scores.shape[0]): component = f"Component {i+1}" scores = loading_scores.iloc[i] top_features_for_component = scores[abs(scores) > threshold].index.tolist() top_features[component] = top_features_for_component return top_features pca_data = DR_PCA(AllDaysWithTornados) top_feats = get_top_features_for_each_component(pca_data, 0.7) print(top_feats) pca_data.head() ``` ```{python} #| label: tt-split #| message: false ScaledTornados = AllDaysWithTornados.copy() ScaledTornados.head() dropped = ScaledTornados.drop('tornado_occurred', axis = 1) numerical_cols = dropped.select_dtypes(include = ['int64', 'float64']).columns scaler = StandardScaler() ScaledTornados[numerical_cols] = scaler.fit_transform(ScaledTornados[numerical_cols]) feature_names = ScaledTornados.drop(['tornado_occurred', 'CZ_NAME_STR', 'TOR_F_SCALE', 'TOR_LENGTH', 'TOR_WIDTH', 'date'], axis = 1).columns.to_list() X = ScaledTornados[feature_names] y = ScaledTornados['tornado_occurred'] X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=63) X_train_with_const = sm.add_constant(X_train) X_test_with_const = sm.add_constant(X_test) X_train_with_const = X_train_with_const.astype(int) X_test_with_const = X_test_with_const.astype(int) unique, counts = np.unique(y_train, return_counts=True) for val, count in zip(unique, counts): print(f"Value Train {val}: Count Train {count}") unique, counts = np.unique(y_test, return_counts=True) for val, count in zip(unique, counts): print(f"Value Test {val}: Count Test {count}") ``` # Regression Models For this analysis I chose to create and tune 4 different classification models. All undergoing random search cross validation over a generated parameter sweep. For each model I deemed f1 score to be the best measure of fit because it balances percision and recall. This allows some models to fully predict the tornado cases at the expense of some false tornado alarms. I believe this is the right approach because the cost of missing an early warning alarm is much greater than giving an early warning when it is not needed. ```{python} #| label: logreg-model #| message: false # Model Creation # All param grids generated by copilot when asked for a param grid for {model_type} logreg_params = { 'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l2', 'l1'], 'solver': ['liblinear', 'saga'], 'class_weight': [None, 'balanced'], 'max_iter': [100, 500, 1000] } logreg_model = LogisticRegression(random_state=63) logreg_grid = RandomizedSearchCV(logreg_model, param_distributions = logreg_params, cv = 5, scoring = 'f1_macro', n_iter = 100, random_state = 63) logreg_grid.fit(X_train, y_train) ``` * **Logistic Regression** * Rationale: * Chosen Hyper-Parameters: ```{python} print("Best LR parameters:", logreg_grid.best_params_) ``` * Results: ```{python} print("Best LR score:", logreg_grid.best_score_) ``` ```{python} #| label: bayes-model #| message: false # Model Creation # All param grids generated by copilot when asked for a param grid for {model_type} bayes_params = { 'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4] } bayes_model = GaussianNB() bayes_grid = RandomizedSearchCV(bayes_model, param_distributions = bayes_params, cv = 5, scoring = 'f1_macro', n_iter = 100, random_state = 63) bayes_grid.fit(X_train, y_train) ``` bayes_grid.fit(X_train, y_train) * **Naive Bayes** * Rationale: * Chosen Hyper-Parameters: ```{python} print("Best Bayes parameters:", bayes_grid.best_params_) ``` * Results: ```{python} print("Best Bayes score:", bayes_grid.best_score_) ``` ```{python} #| label: svc-model #| message: false # Model Creation # All param grids generated by copilot when asked for a param grid for {model_type} svc_params = { 'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'rbf', 'poly'], 'gamma': ['scale', 'auto', 0.01, 0.1, 1], 'degree': [2, 3, 4] # Only used for 'poly' kernel } svc_model = svm.SVC(random_state = 63) svc_grid = RandomizedSearchCV(svc_model, param_distributions = svc_params, cv = 5, scoring = 'f1_macro', n_iter = 100, random_state = 63) svc_grid.fit(X_train, y_train) ``` * **Vector Control Regression** * Rationale: * Chosen Hyper-Parameters: ```{python} print("Best SVC parameters:", svc_grid.best_params_) ``` * Results: ```{python} print("Best SVC score:", svc_grid.best_score_) ``` ```{python} #| label: rf-model #| message: false # Model Creation # All param grids generated by copilot when asked for a param grid for {model_type} rf_params = { 'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10, 20], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'max_features': ['auto', 'sqrt', 'log2'] } rf_model = RandomForestClassifier(random_state = 63) rf_grid = RandomizedSearchCV(rf_model, rf_params, cv = 5, scoring = 'f1_macro', n_iter = 100, random_state = 63) rf_grid.fit(X_train, y_train) ``` * **Random Forest Classifier** * Rationale: * Chosen Hyper-Parameters: ```{python} print("Best RF parameters:", rf_grid.best_params_) ``` * Results: ```{python} print("Best RF score:", rf_grid.best_score_) ``` ```{python} #| label: assign-best #| message: false # Model Creation # All param grids generated by copilot when asked for a param grid for {model_type} best_logreg_model = logreg_grid.best_estimator_ best_bayes_model = bayes_grid.best_estimator_ best_svc_model = svc_grid.best_estimator_ best_rf_model = rf_grid.best_estimator_ ``` ## Results and Analysis ```{python} def show_metrics(model, X_test, y_test): y_pred = model.predict(X_test) # print("Accuracy:", accuracy_score(y_test, y_pred)) # print(classification_report(y_test, y_pred)) cm = confusion_matrix(y_test, y_pred) ConfusionMatrixDisplay(cm).plot() show_metrics(best_logreg_model, X_test, y_test) show_metrics(best_bayes_model, X_test, y_test) show_metrics(best_svc_model, X_test, y_test) show_metrics(best_rf_model, X_test, y_test) ``` As can be seen in the confusion matrices, the logistic regression model performs the best when trained. It correctly perdicts all of the tornado events at the cost of 128 false tornado predictions over a 25 year time span. Similarly, the Naive Bayes predictor also correctly classified the tornado conditions at the expense of 202 false positives. I believe this is an artifact of the model complexity being high for this model. The last two models, the Random Forest Classifier and the Support Vector Machines, performed exactly the same. Both of these models correctly classified all of the non-tonado days, but failed to classify all but 2 of the tornado days. Overall in terms of accuracy, the Support Vector Machines and Random Forest models had the highest accuracy score only incorrectly classifiying 7 events each. This does not mean they were in fact the best. Both of these models as the 7 events they misclassified are costly. i.e. lost of life, belongings, and loved ones. For this reason I have elected to select the Logistic Regression as the best model. It both correctly classified the tornadoes and minimized the number of false tornado predictions when compared to the Naive Bayes model. The classification report for the Logistic Regression model can be seen below: ```{python} #Determine which model is best best_model = best_logreg_model y_pred = best_model.predict(X_test) print(classification_report(y_test, y_pred)) ``` ## App/Dashboard Below is a simple application/GUI meant to take the best model and predict ```{python} #| label: app #| message: false # Configure and run App # Get min/max for each feature for slider ranges # feature_ranges = {feat: ((ScaledTornados[feat].min()), (ScaledTornados[feat].max())) for feat in feature_names} # def plot_probability(prob): # plt.figure(figsize=(4, 3)) # plt.bar(['Tornado'], [prob], color='red') # plt.ylim(0, 1) # plt.ylabel('Predicted Probability') # plt.title('Tornado Prediction') # plt.tight_layout() # fig = plt.gcf() # return fig # # Layout for sliders # layout = [ # [sg.Text('Adjust Weather Conditions:')], # ] # for feat in feature_names: # min_val, max_val = feature_ranges[feat] # layout.append([ # sg.Text(feat, size=(25, 1)), # sg.Slider(range=(min_val, max_val), orientation='h', size=(40, 15), key=feat, resolution=0.01, default_value=(min_val + max_val) / 2) # ]) # layout += [ # [sg.Button('Predict'), sg.Button('Exit')], # [sg.Text('', size=(40, 2), key='-RESULT-')], # [sg.Canvas(key='-CANVAS-')] # ] # window = sg.Window('Tornado Predictor', layout, finalize=True) # while True: # event, values = window.read() # if event in (sg.WIN_CLOSED, 'Exit'): # break # if event == 'Predict': # input_vals = np.array([values[feat] for feat in feature_names]).reshape(1, -1) # pred = model.predict(input_vals)[0] # prob = model.predict_proba(input_vals)[0][1] if hasattr(model, "predict_proba") else None # if pred == 1: # if prob is not None: # result_text = f"Warning: Conditions are pertinent for a tornado!\nProbability: {prob:.2f}" # else: # result_text = "Warning: Conditions are pertinent for a tornado!" # else: # if prob is not None: # result_text = f"No tornado predicted.\nProbability: {prob:.2f}" # else: # result_text = "No tornado predicted." # window['-RESULT-'].update(result_text) # # Plot probability # if prob is not None: # fig = plot_probability(prob) # # Draw with PySimpleGUI # fig_canvas_agg = sg.pyplot.FigureCanvasTkAgg(fig, window['-CANVAS-'].TKCanvas) # fig_canvas_agg.draw() # plt.close(fig) # window.close() ```