Navigating the S&P 500 with Machine Learning: What Can We Discover?

Proposal

Machine learning gives us smart tools to find patterns, make predictions, and understand complicated trends in the stock market. By using these techniques with S&P 500 data, we can answer important questions about where the market might go, how risky it is, and how stocks relate to each other. This project explores new methods—from deep learning (which helps predict future returns), to models that sort days by how volatile they are, to clustering methods that group similar stocks together. With these tools, we aim to discover fresh insights about the S&P 500 in 2024.

Author

Affiliation

Macdonald-Kuthalaraja - Trevor Macdonald, Nandakumar Kuthalaraja

College of Information Science, University of Arizona

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import EarlyStopping

Dataset

print("Downloading SPX data...")
data = yf.download('^GSPC', start='2014-01-01', end='2025-01-01')  # end date is exclusive

# Flatten MultiIndex columns if necessary
if isinstance(data.columns, pd.MultiIndex):
    data.columns = ['_'.join(col).strip() for col in data.columns.values]

Downloading SPX data...

This dataset contains daily historical price data for the S&P 500 index (^GSPC), downloaded using the yfinance Python package. It spans from January 1, 2014 to December 31, 2024.

The dataset includes the following key variables:

• Open: Opening price of the index each day
• High: Highest price during the trading day
• Low: Lowest price during the trading day
• Close: Closing price of the day

We chose the S&P 500 dataset because it serves as a benchmark for the U.S. stock market, reflecting the performance of 500 leading publicly traded companies. This makes it ideal for exploring machine learning techniques in financial time series analysis, including trend prediction, volatility classification, and return forecasting.

With the techniques proposed here, we aim to uncover specific insights: improved short- and long-term return forecasts from LSTM models, clearer volatility regime patterns via classification, and meaningful stock groupings through clustering of factor exposures. By quantifying accuracy across forecast horizons, identifying key predictors of risk regimes, and mapping stocks by shared characteristics, we aim to move beyond general findings to an actionable understanding of market behavior in 2024.

Questions

This project aims to answer the following four questions.

Q1. Can a Long Short-Term Memory (LSTM) model accurately forecast short-, medium-, or long-term S&P 500 returns?

Q2. How does forecast accuracy degrade as a function of prediction horizon, and what does this suggest about the LSTM’s ability to model longer-term financial trends?

Q3. How well can we classify each trading day in 2024 into low, medium, or high volatility regimes based on recent price action and market indicators?

Q4. How can hierarchical clustering organize S&P 500 stocks into a taxonomy based on multi-factor risk or return exposures?

Analysis plan (Q1, Q2)

The analysis will begin with the acquisition of data from Yahoo Finance. We will download four tickers and merge them into one data frame containing the selected variables and engineered features.
The variables will receive a basic visual inspection and dimensional analysis.
The data will be cleaned and standardized to produce a “tidy” set, which will be split for training the LSTM model.
The model will be tested on an unseen test set and the results recorded.
The performance will be evaluated for each time horizon and compared using plots.

Ticker	Description
`^GSPC`	S&P 500 Index (price and volume data)
`^VIX`	30-day implied volatility
`^VVIX`	Volatility of volatility
`^VIX9D`	9-day implied volatility

Variable	Description
`Open`	Opening price of SPX
`High`	Daily high price of SPX
`Low`	Daily low price of SPX
`Close`	Daily closing price of SPX
`Volume`	Daily trading volume of SPX
`VIX`	Implied volatility index (30-day horizon)
`VVIX`	Volatility-of-volatility index
`VIX9D`	9-day implied volatility index
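The merge of the four tickers into one frame could be sketched as follows. This is a minimal illustration on synthetic data rather than the project's actual pipeline: the helper name `merge_index_frames` and the inner-join-by-dropping-nulls policy are assumptions, and in practice the inputs would come from `yf.download`.

```python
import numpy as np
import pandas as pd


def merge_index_frames(spx: pd.DataFrame, vol_closes: dict[str, pd.Series]) -> pd.DataFrame:
    """Join SPX OHLCV with the closing level of each volatility index by date.

    Hypothetical helper: names and the drop-nulls policy are illustrative.
    """
    merged = spx[["Open", "High", "Low", "Close", "Volume"]].copy()
    for name, series in vol_closes.items():
        merged[name] = series  # pandas aligns the series on the date index
    # Keep only the dates on which every series has a value.
    return merged.dropna()


# Synthetic stand-in for downloaded data (six business days).
dates = pd.bdate_range("2024-01-02", periods=6)
rng = np.random.default_rng(0)
close = 4700 + rng.normal(0, 10, len(dates)).cumsum()
spx = pd.DataFrame(
    {"Open": close - 1, "High": close + 5, "Low": close - 5,
     "Close": close, "Volume": 1_000_000},
    index=dates,
)
vol_closes = {
    "VIX": pd.Series(13.0, index=dates),
    "VVIX": pd.Series(85.0, index=dates[1:]),  # one date deliberately missing
    "VIX9D": pd.Series(12.0, index=dates),
}
merged = merge_index_frames(spx, vol_closes)
print(merged.shape)
```

Dropping incomplete dates is the simplest alignment rule; forward-filling short gaps would be an alternative if too many rows are lost.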

Features	Description
`log_return_t`	Log returns of SPX: `log(Close_t / Close_{t-1})`
`ParkinsonVol`	Range-based (Parkinson) daily variance estimate from high/low: `ln(High/Low)^2 / (4 ln 2)`; its square root gives the volatility
`EMA_10`, `SMA_21`	Short-term and medium-term trend indicators
`lag_volatility`	Lagged daily volatility measures (RV, VIX, VVIX)
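A minimal sketch of how the features in the table above might be computed with pandas, run here on synthetic prices. The function name `add_features`, the one-day lag, and the `_lag1` column names are assumptions made for illustration; the proposal does not fix the lag length.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the tabulated features to a frame with High, Low, Close and VIX columns."""
    out = df.copy()
    out["log_return_t"] = np.log(out["Close"] / out["Close"].shift(1))
    # Daily Parkinson estimate: ln(High/Low)^2 / (4 ln 2).
    out["ParkinsonVol"] = np.log(out["High"] / out["Low"]) ** 2 / (4 * np.log(2))
    out["EMA_10"] = out["Close"].ewm(span=10, adjust=False).mean()
    out["SMA_21"] = out["Close"].rolling(21).mean()
    # Lagged volatility measures (a one-day lag is an assumption).
    out["ParkinsonVol_lag1"] = out["ParkinsonVol"].shift(1)
    out["VIX_lag1"] = out["VIX"].shift(1)
    return out


# Synthetic stand-in for the merged price frame.
dates = pd.bdate_range("2024-01-02", periods=30)
rng = np.random.default_rng(0)
close = 4700 * np.exp(rng.normal(0, 0.01, len(dates)).cumsum())
prices = pd.DataFrame(
    {"High": close * 1.005, "Low": close * 0.995, "Close": close,
     "VIX": rng.uniform(12, 20, len(dates))},
    index=dates,
)
feats = add_features(prices)
print(feats.tail(3))
```

The rolling and lagged columns start with missing values, so the leading rows would need to be dropped before training.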

Target Variable	Definition
`Return_t+1`	1-day ahead return: `pct_change(1).shift(-1)`
`Return_t+5`	5-day ahead return: `pct_change(5).shift(-5)`
`Return_t+21`	21-day ahead return: `pct_change(21).shift(-21)`
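The target definitions above can be checked on a toy price series. The sketch below uses the same `pct_change(h).shift(-h)` construction; the helper name `add_targets` and the four-point series are hypothetical, and the two-day horizon is used only so the toy series is long enough.

```python
import pandas as pd


def add_targets(close: pd.Series, horizons=(1, 5, 21)) -> pd.DataFrame:
    """Forward returns: the value at date t is Close[t+h] / Close[t] - 1."""
    targets = pd.DataFrame(index=close.index)
    for h in horizons:
        targets[f"Return_t+{h}"] = close.pct_change(h).shift(-h)
    return targets


# Toy series growing 10% per day, so forward returns are easy to verify by hand.
close = pd.Series(
    [100.0, 110.0, 121.0, 133.1],
    index=pd.bdate_range("2024-01-02", periods=4),
)
targets = add_targets(close, horizons=(1, 2))
print(targets)
```

The last `h` rows of each target are undefined because the future price is not yet known, so those rows have to be excluded from training and evaluation.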

Step	Description
Data Acquisition	Download OHLCV for `^GSPC` (SPX) and the volatility indices `^VIX`, `^VVIX`, `^VIX9D` via `yfinance`
Data Cleaning/Inspection	Align on the date index, remove nulls, filter data for consistency
Feature Engineering/Standardization	Construct technical indicators, lag features, and volatility-based predictors
Train/Test Split	80/20 time-based (chronological) split
Model Architecture	LSTM
Evaluation Metrics	MAE and RMSE for each forecast horizon
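The split, standardization, and evaluation steps could look roughly like the sketch below. It makes several assumptions not stated in the plan: a 20-day lookback window, a scaler fitted on the training rows only, and a zero-return baseline as a reference point. The data are random, and `make_windows` is a hypothetical helper that shapes inputs the way a Keras LSTM layer expects them.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler


def make_windows(features: np.ndarray, target: np.ndarray, lookback: int):
    """Stack overlapping windows into shape (samples, lookback, n_features).

    Each window is paired with the target on the window's last row.
    """
    X = np.stack([features[i - lookback:i] for i in range(lookback, len(features) + 1)])
    y = target[lookback - 1:]
    return X, y


rng = np.random.default_rng(42)
n_obs, n_feat, lookback = 200, 4, 20  # illustrative sizes
features = rng.normal(size=(n_obs, n_feat))
target = rng.normal(scale=0.01, size=n_obs)

split = int(n_obs * 0.8)  # chronological 80/20 split, no shuffling
scaler = StandardScaler().fit(features[:split])  # fit on training rows only
scaled = scaler.transform(features)

X_train, y_train = make_windows(scaled[:split], target[:split], lookback)
X_test, y_test = make_windows(scaled[split:], target[split:], lookback)

# Zero-return baseline to compare LSTM forecasts against.
baseline = np.zeros_like(y_test)
mae = mean_absolute_error(y_test, baseline)
rmse = np.sqrt(mean_squared_error(y_test, baseline))
print(X_train.shape, X_test.shape, round(mae, 4), round(rmse, 4))
```

With an h-day-ahead target, the last h training rows depend on prices inside the test period, so leaving a gap of h rows between the two sets may be worth considering.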

We selected Long Short-Term Memory (LSTM) networks because they are designed to capture long-term dependencies in sequential data, which makes them a natural candidate for time series forecasting tasks such as predicting index returns. Unlike simpler models such as linear regression or basic moving averages, LSTMs can learn patterns across varying time lags and can model the noisy, non-linear behavior common in financial markets. This may yield more accurate forecasts than models that assume fixed, short-term dependencies, and the evaluation for Q1 and Q2 is intended to test whether it does.

Analysis plan (Q3)

This question will be explored using supervised machine learning methods.

Feature Engineering: Calculate recent price action and market indicators (e.g., rolling volatility, intraday range, moving averages, volume, VIX).
Target Creation: Label each trading day as “low”, “medium”, or “high” volatility based on daily realized volatility bins (e.g., using tertiles or quantiles).
Model Training: Use supervised learning models (e.g., decision tree, random forest, XGBoost) to classify the volatility regime from engineered features.
Evaluation: Evaluate classification results with accuracy and a confusion matrix, and analyze which features most influence volatility regime assignment.

Step	Description
Prepare Features	Gather recent price action (returns, volatility) and market indicators for each trading day.
Create Volatility Classes	Divide days into low, medium, and high volatility using quantiles or thresholds.
Train Classifier	Fit a supervised learning model to predict volatility class from the features.
Evaluate Model	Assess model accuracy and review which features are most predictive.
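The four steps above might be sketched as follows on synthetic data. A random forest is only one of the candidate models named in the plan. Taking the tertile edges from the training window alone is an assumption made here to avoid look-ahead, and the integer coding 0/1/2 for low/medium/high is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(7)
n = 300
# Synthetic stand-ins for two engineered features and a realized-volatility series.
feats = pd.DataFrame({
    "rolling_std_5": rng.gamma(2.0, 0.005, n),
    "prev_day_return": rng.normal(0, 0.01, n),
})
realized_vol = feats["rolling_std_5"] * rng.lognormal(0, 0.2, n)

split = int(n * 0.8)  # chronological split
# Tertile edges are estimated on the training window only (assumption).
edges = realized_vol.iloc[:split].quantile([1 / 3, 2 / 3]).to_numpy()
labels = np.digitize(realized_vol, edges)  # 0 = low, 1 = medium, 2 = high

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(feats.iloc[:split], labels[:split])
pred = clf.predict(feats.iloc[split:])

acc = accuracy_score(labels[split:], pred)
cm = confusion_matrix(labels[split:], pred, labels=[0, 1, 2])
importances = pd.Series(clf.feature_importances_, index=feats.columns)
print(acc)
print(cm)
print(importances.sort_values(ascending=False))
```

The confusion matrix shows which regimes are confused with each other, which accuracy alone hides, and the importances give a first view of which features drive the assignment.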

Potential Variables

Variable	Description
prev_day_return	Return from the previous trading day
rolling_std_5	5-day rolling standard deviation (volatility)
rolling_std_21	21-day (1-month) rolling standard deviation
ATR_14	14-day Average True Range (volatility indicator)
volume_zscore_5	5-day Z-score of trading volume
RSI_14	14-day Relative Strength Index
VIX	Implied volatility index (if available for the SPX)
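One way to compute the candidate variables in the table from an OHLCV frame is sketched below. The averaging choices are assumptions: ATR and RSI are built here with simple rolling means rather than Wilder's smoothing, and `add_regime_features` is a hypothetical helper name. The input is synthetic.

```python
import numpy as np
import pandas as pd


def add_regime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build volatility-regime features from High, Low, Close and Volume columns."""
    out = pd.DataFrame(index=df.index)
    ret = df["Close"].pct_change()
    out["prev_day_return"] = ret.shift(1)
    out["rolling_std_5"] = ret.rolling(5).std()
    out["rolling_std_21"] = ret.rolling(21).std()

    # True range: the largest of the three daily range measures.
    prev_close = df["Close"].shift(1)
    true_range = pd.concat(
        [df["High"] - df["Low"],
         (df["High"] - prev_close).abs(),
         (df["Low"] - prev_close).abs()],
        axis=1,
    ).max(axis=1)
    out["ATR_14"] = true_range.rolling(14).mean()  # simple-mean variant

    vol = df["Volume"]
    out["volume_zscore_5"] = (vol - vol.rolling(5).mean()) / vol.rolling(5).std()

    # RSI from average gains and losses over 14 days (simple-mean variant).
    delta = df["Close"].diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["RSI_14"] = 100 - 100 / (1 + avg_gain / avg_loss)
    return out


# Synthetic OHLCV stand-in.
dates = pd.bdate_range("2024-01-02", periods=60)
rng = np.random.default_rng(3)
close = 4700 * np.exp(rng.normal(0, 0.01, len(dates)).cumsum())
spread = rng.uniform(0.002, 0.01, len(dates))
ohlcv = pd.DataFrame(
    {"High": close * (1 + spread), "Low": close * (1 - spread), "Close": close,
     "Volume": rng.integers(2_000_000, 5_000_000, len(dates))},
    index=dates,
)
regime_feats = add_regime_features(ohlcv)
print(regime_feats.dropna().head())
```

If the label is the same day's realized volatility, the rolling features would need to be shifted by one day so that each row uses only information available before that day.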

Analysis plan (Q4)

This question will be explored using unsupervised machine learning methods.

Collect and preprocess S&P 500 stock data for 2024, calculating each stock’s exposure to multiple risk and return factors (e.g., momentum, value, size, volatility).
Construct a feature matrix where each row represents a stock and columns represent factor exposures.
Apply hierarchical clustering (such as Ward’s method) to the feature matrix to group stocks with similar profiles.
Visualize the dendrogram and analyze the resulting clusters to interpret the taxonomy and identify meaningful stock groupings.

Step	Description
Data Prep	Collect and preprocess S&P 500 stock data for 2024, computing multi-factor exposures.
Matrix Build	Create a matrix with stocks as rows and factor exposures as columns.
Clustering	Perform hierarchical clustering (e.g., Ward’s method) on the feature matrix.
Interpretation	Visualize and interpret the dendrogram to analyze and describe stock group taxonomy.
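The clustering step might look like the sketch below, which applies Ward linkage from SciPy to a small synthetic exposure matrix. The tickers are hypothetical placeholders, the two groups are generated to be clearly separated, and standardizing the columns before clustering is an assumption, made because Ward's method relies on Euclidean distances.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
tickers = [f"STK{i:02d}" for i in range(12)]  # hypothetical tickers
cols = ["avg_return_2024", "volatility_2024", "beta_market"]

# Two synthetic groups: defensive (low vol, low beta) and aggressive (high vol, high beta).
low = rng.normal([0.0003, 0.008, 0.7], [0.0001, 0.001, 0.05], size=(6, 3))
high = rng.normal([0.0010, 0.025, 1.5], [0.0001, 0.001, 0.05], size=(6, 3))
exposures = pd.DataFrame(np.vstack([low, high]), index=tickers, columns=cols)

# Standardize so no single factor dominates the distance, then build the tree.
Z = linkage(StandardScaler().fit_transform(exposures), method="ward")

# Cut the tree into two clusters and recover the leaf order without plotting.
clusters = pd.Series(fcluster(Z, t=2, criterion="maxclust"), index=tickers)
tree = dendrogram(Z, labels=tickers, no_plot=True)
print(clusters)
print(tree["ivl"])
```

Calling `dendrogram` without `no_plot=True` draws the tree with matplotlib. Cutting at different heights gives coarser or finer levels of the taxonomy.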

Potential Variables for Use

Variable	Description
avg_return_2024	Average daily return in 2024
volatility_2024	Standard deviation of daily returns in 2024
momentum_3m	3-month price momentum (percent change)
value_ratio	Price-to-earnings (P/E) or price-to-book (P/B) ratio
size	Market capitalization
beta_market	Beta (sensitivity to S&P 500 index)
dividend_yield	Dividend yield
sector	Categorical variable for sector (for interpretation/grouping)
skewness_2024	Skewness of daily returns in 2024
kurtosis_2024	Kurtosis of daily returns in 2024
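The return-based variables in the table can be derived from a matrix of daily returns, as in the sketch below. The helper name `factor_exposures` and the synthetic tickers are illustrative assumptions. Fundamentals such as P/E, market capitalization, and dividend yield would have to come from a separate source and are not sketched here.

```python
import numpy as np
import pandas as pd


def factor_exposures(stock_returns: pd.DataFrame, market_returns: pd.Series) -> pd.DataFrame:
    """One row per stock, one column per return-based factor."""
    out = pd.DataFrame(index=stock_returns.columns)
    out["avg_return_2024"] = stock_returns.mean()
    out["volatility_2024"] = stock_returns.std()
    out["skewness_2024"] = stock_returns.skew()
    out["kurtosis_2024"] = stock_returns.kurt()  # pandas reports excess kurtosis
    # Beta = cov(stock, market) / var(market).
    cov_with_market = stock_returns.apply(lambda r: r.cov(market_returns))
    out["beta_market"] = cov_with_market / market_returns.var()
    return out


# Synthetic daily returns with known betas of 1.5 and 0.5.
dates = pd.bdate_range("2024-01-02", periods=252)
rng = np.random.default_rng(5)
market = pd.Series(rng.normal(0.0004, 0.01, len(dates)), index=dates)
stocks = pd.DataFrame(
    {
        "AAA": 1.5 * market + rng.normal(0, 0.001, len(dates)),
        "BBB": 0.5 * market + rng.normal(0, 0.001, len(dates)),
    },
    index=dates,
)
exposure_matrix = factor_exposures(stocks, market)
print(exposure_matrix.round(4))
```

The resulting frame has stocks as rows and factors as columns, which is the shape the clustering step takes as input.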
