Navigating the S&P 500 with Machine Learning: What Can We Discover?

Proposal

Machine learning offers tools for finding patterns, making predictions, and interpreting complex trends in the stock market. Applying these techniques to S&P 500 data, we address questions about where the market might go, how risky it is, and how stocks relate to each other. This project explores three approaches: deep learning to forecast future returns, classification models that sort trading days by volatility, and clustering methods that group similar stocks. With these tools, we aim to uncover new insights about the S&P 500 in 2024.

Author
Macdonald-Kuthalaraja - Trevor Macdonald, Nandakumar Kuthalaraja

Affiliation
College of Information Science, University of Arizona

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import EarlyStopping

Dataset

print("Downloading SPX data...")
data = yf.download('^GSPC', start='2014-01-01', end='2024-12-31')

# Flatten MultiIndex columns if necessary
if isinstance(data.columns, pd.MultiIndex):
    data.columns = ['_'.join(col).strip() for col in data.columns.values]
Downloading SPX data...

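This dataset contains daily historical price data for the S&P 500 index (^GSPC), downloaded using the yfinance Python package. The requested range is January 1, 2014 to December 31, 2024; because yfinance treats the end date as exclusive, the last trading day returned is December 30, 2024.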

The dataset includes the following key variables:

  • Open: Opening price of the index each day
  • High: Highest price during the trading day
  • Low: Lowest price during the trading day
  • Close: Closing price of the day

We chose the S&P 500 dataset because it serves as a benchmark for the U.S. stock market, reflecting the performance of 500 leading publicly traded companies. This makes it ideal for exploring machine learning techniques in financial time series analysis, including trend prediction, volatility classification, and return forecasting.

With the techniques in this proposal, we aim to uncover specific insights: improved short- and long-term return forecasts from LSTM models, clearer volatility regime patterns via classification, and meaningful stock groupings through clustering of factor exposures. By quantifying accuracy across forecast horizons, identifying key predictors of risk regimes, and mapping stocks by shared characteristics, we aim to move beyond general findings to an actionable understanding of market behavior in 2024.

Questions

This project addresses four questions.

Q1. Can a Long Short-Term Memory (LSTM) model accurately forecast short-, medium-, and long-term S&P 500 returns?

Q2. How does forecast accuracy degrade as a function of prediction horizon, and what does this suggest about the LSTM’s ability to model longer-term financial trends?

Q3. How well can we classify each trading day in 2024 into low, medium, or high volatility regimes based on recent price action and market indicators?

Q4. How can hierarchical clustering organize S&P 500 stocks into a taxonomy based on multi-factor risk or return exposures?

Analysis plan (Q1, Q2)

  • The analysis will begin with data acquisition from Yahoo Finance. We will download four tickers and merge them into one data frame containing the selected variables and features.

  • The variables will receive a basic visual inspection and dimensional analysis.

  • The data will be cleaned and standardized to produce a “tidy” set, which will then be split for training the LSTM model.

  • The model will be tested on an unseen test set and the results recorded.

  • The performance will be evaluated for each time horizon and compared using plots.
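
As a rough illustration of the merge step above, the sketch below aligns several series on their shared trading dates with pandas. The frame names and all values are made-up placeholders rather than downloaded data, and the inner join is only one possible alignment choice, not necessarily the one the project will use.

```python
import pandas as pd

# Placeholder values for illustration only (not real market data)
dates = pd.bdate_range("2024-01-02", periods=5)
spx = pd.DataFrame({"Close": [4700.0, 4710.0, 4690.0, 4725.0, 4740.0]}, index=dates)
vix = pd.Series([13.2, 13.5, 14.1, 13.0], index=dates[:4], name="VIX")  # one day missing
vvix = pd.Series([85.0, 86.5, 88.0, 84.0, 83.5], index=dates, name="VVIX")

# Inner join keeps only the dates present in every series
merged = pd.concat([spx, vix, vvix], axis=1, join="inner").dropna()
print(merged)
```

An inner join drops any date that is missing from one of the sources, which keeps the merged frame free of nulls at the cost of losing a few rows.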

| Ticker | Description |
| --- | --- |
| ^GSPC | S&P 500 Index (price and volume data) |
| ^VIX | 30-day implied volatility |
| ^VVIX | Volatility of volatility |
| ^VIX9D | 9-day implied volatility |


| Variable | Description |
| --- | --- |
| Open | Opening price of SPX |
| High | Daily high price of SPX |
| Low | Daily low price of SPX |
| Close | Daily closing price of SPX |
| Volume | Daily trading volume of SPX |
| VIX | Implied volatility index (30-day horizon) |
| VVIX | Volatility-of-volatility index |
| VIX9D | 9-day implied volatility index |


| Features | Description |
| --- | --- |
| log_return_t | Log returns of SPX: log(Close_t / Close_{t-1}) |
| ParkinsonVol | Realized volatility from high/low: (ln(High_t / Low_t))^2 / (4 ln 2) |
| EMA_10, SMA_21 | Short-term and medium-term trend indicators |
| lag_volatility | Lagged daily volatility measures (RV, VIX, VVIX) |

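
The features listed above could be computed along the following lines. This is a minimal sketch on synthetic prices, not the project's final pipeline: `add_features` is a hypothetical helper, the column names follow the table, and the one-day lag on the volatility measure is an assumed choice.

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the tabled features to an OHLC frame (hypothetical helper)."""
    out = df.copy()
    # Log return: log(Close_t / Close_{t-1}); the first row is NaN
    out["log_return"] = np.log(out["Close"] / out["Close"].shift(1))
    # Parkinson estimator as written in the table; this is a daily variance
    # estimate, so its square root would be in volatility units
    out["ParkinsonVol"] = np.log(out["High"] / out["Low"]) ** 2 / (4 * np.log(2))
    # Trend indicators: 10-day exponential and 21-day simple moving averages
    out["EMA_10"] = out["Close"].ewm(span=10, adjust=False).mean()
    out["SMA_21"] = out["Close"].rolling(21).mean()
    # Lagged volatility (assumed lag of one trading day)
    out["lag_volatility"] = out["ParkinsonVol"].shift(1)
    return out

# Synthetic prices for illustration only
rng = np.random.default_rng(0)
n = 60
close = 4000 * np.exp(np.cumsum(rng.normal(0, 0.01, n)))
prices = pd.DataFrame(
    {
        "Close": close,
        "High": close * (1 + rng.uniform(0.001, 0.01, n)),
        "Low": close * (1 - rng.uniform(0.001, 0.01, n)),
    },
    index=pd.bdate_range("2024-01-02", periods=n),
)
features = add_features(prices)
print(features.tail(3))
```

Rows at the start of the frame carry NaN values until each rolling window fills, so they would need to be dropped before training.
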

| Target Variable | Definition |
| --- | --- |
| Return_t+1 | 1-day ahead return: pct_change(1).shift(-1) |
| Return_t+5 | 5-day ahead return: pct_change(5).shift(-5) |
| Return_t+21 | 21-day ahead return: pct_change(21).shift(-21) |

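
To make the target definitions concrete, here is a small sketch that applies `pct_change(h).shift(-h)` for each horizon. The helper name `add_forward_returns` and the synthetic price series (growing 10% per step) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def add_forward_returns(close: pd.Series, horizons=(1, 5, 21)) -> pd.DataFrame:
    """Build h-day ahead return targets (hypothetical helper)."""
    targets = pd.DataFrame(index=close.index)
    for h in horizons:
        # pct_change(h) at row t+h equals Close_{t+h} / Close_t - 1;
        # shift(-h) moves that value back to row t, where it is the target
        targets[f"Return_t+{h}"] = close.pct_change(h).shift(-h)
    return targets

# Synthetic series growing 10% per step, for illustration only
close = pd.Series(100 * 1.1 ** np.arange(30))
targets = add_forward_returns(close)
print(targets.head(3))
```

The last h rows of each target are NaN because the future price is not yet known, so each horizon loses h rows at the end of the sample.
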

| Step | Description |
| --- | --- |
| Data Acquisition | Download OHLCV for ^GSPC (SPX) and the volatility indices ^VIX, ^VVIX, and ^VIX9D via yfinance |
| Data Cleaning/Inspection | Align, index, remove nulls, filter data for consistency |
| Feature Engineering/Standardization | Construct technical indicators, lag features, and volatility-based predictors |
| Train/Test Split | 80/20 time-based split |
| Model Architecture | LSTM |
| Evaluation Metrics | MAE and RMSE for each forecast horizon |
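
The split and evaluation steps in the table might look like the sketch below. It assumes a chronological 80/20 cut, standardization using training statistics only, and a zero-return baseline forecast standing in for model output; the helper names and the random data are illustrative.

```python
import numpy as np

def time_split(n_rows: int, train_frac: float = 0.8):
    """Return (train, test) slices; the test block is strictly later in time."""
    cut = int(n_rows * train_frac)
    return slice(0, cut), slice(cut, n_rows)

def standardize(train: np.ndarray, test: np.ndarray):
    """Scale both blocks with mean and std taken from the training block only."""
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sigma, (test - mu) / sigma

def mae(y_true, y_pred) -> float:
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred) -> float:
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Random placeholder features and returns, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = rng.normal(scale=0.01, size=100)

train_idx, test_idx = time_split(len(X))
X_train, X_test = standardize(X[train_idx], X[test_idx])

# Zero-return baseline in place of a trained model's predictions
baseline = np.zeros(len(y[test_idx]))
print("MAE:", mae(y[test_idx], baseline), "RMSE:", rmse(y[test_idx], baseline))
```

Fitting the scaler on the training block alone avoids leaking information from the test period into the model inputs.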

We selected Long Short-Term Memory (LSTM) networks because they are designed to capture long-term dependencies in sequential data, which makes them well suited to time series forecasting tasks such as stock prices. Unlike simpler models such as linear regression or basic moving averages, LSTMs can learn patterns across varying time lags and handle the noisy, non-linear trends that are common in financial markets. This may allow more accurate forecasts of market movements than models that assume fixed, short-term dependencies.
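
An LSTM consumes fixed-length sequences, so the tabular features have to be reshaped into overlapping windows first. The sketch below shows one way to do that with NumPy; `make_windows` is a hypothetical helper and the 21-day lookback is an assumed value, not one fixed by this proposal.

```python
import numpy as np

def make_windows(features: np.ndarray, target: np.ndarray, lookback: int = 21):
    """Stack sliding windows of `lookback` rows; each window's label is the
    target on the window's last row (hypothetical helper)."""
    X, y = [], []
    for end in range(lookback, len(features) + 1):
        X.append(features[end - lookback:end])  # rows end-lookback .. end-1
        y.append(target[end - 1])               # target aligned to the last row
    return np.stack(X), np.array(y)

# Placeholder arrays: 100 days, 4 features
features = np.arange(100 * 4, dtype=float).reshape(100, 4)
target = np.arange(100, dtype=float)

X, y = make_windows(features, target, lookback=21)
print(X.shape, y.shape)
```

The resulting array has the (samples, timesteps, features) layout that Keras LSTM layers expect as input.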

Analysis plan (Q3)

This question uses supervised machine learning methods to classify volatility regimes:

  1. Feature Engineering: Calculate recent price action and market indicators (e.g., rolling volatility, intraday range, moving averages, volume, VIX).
  2. Target Creation: Label each trading day as “low”, “medium”, or “high” volatility based on daily realized volatility bins (e.g., using tertiles or quantiles).
  3. Model Training: Use supervised learning models (e.g., decision tree, random forest, XGBoost) to classify the volatility regime from engineered features.
  4. Evaluation: Evaluate classification results with accuracy and a confusion matrix, and analyze which features most influence volatility regime assignment.

| Step | Description |
| --- | --- |
| Prepare Features | Gather recent price action (returns, volatility) and market indicators for each trading day. |
| Create Volatility Classes | Divide days into low, medium, and high volatility using quantiles or thresholds. |
| Train Classifier | Fit a supervised learning model to predict volatility class from the features. |
| Evaluate Model | Assess model accuracy and review which features are most predictive. |
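
The four steps above can be sketched end to end on synthetic data. The example assumes tertile labels from `pd.qcut`, two lagged-volatility features, and a random forest from scikit-learn as the classifier; all names and data here are illustrative rather than the project's final choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic daily volatility with some persistence, for illustration only
rng = np.random.default_rng(2)
vol = pd.Series(np.abs(rng.normal(0.01, 0.004, 250))).rolling(5, min_periods=1).mean()

data = pd.DataFrame({"vol": vol})
# Target: tertile bins of the day's volatility
data["regime"] = pd.qcut(data["vol"], q=3, labels=["low", "medium", "high"]).astype(str)
# Features: information available before the day being classified
data["vol_lag1"] = data["vol"].shift(1)
data["vol_lag5_mean"] = data["vol"].shift(1).rolling(5).mean()
data = data.dropna()

# Chronological 80/20 split
feature_cols = ["vol_lag1", "vol_lag5_mean"]
cut = int(len(data) * 0.8)
train_df, test_df = data.iloc[:cut], data.iloc[cut:]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_df[feature_cols], train_df["regime"])
pred = clf.predict(test_df[feature_cols])

acc = accuracy_score(test_df["regime"], pred)
cm = confusion_matrix(test_df["regime"], pred, labels=["low", "medium", "high"])
importances = pd.Series(clf.feature_importances_, index=feature_cols)
print(acc, cm, importances, sep="\n")
```

Note that this sketch computes the tertile thresholds on the full sample for brevity; deriving them from the training period only would avoid leaking test-period information into the labels.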

Potential Variables

| Variable | Description |
| --- | --- |
| prev_day_return | Return from the previous trading day |
| rolling_std_5 | 5-day rolling standard deviation (volatility) |
| rolling_std_21 | 21-day (1-month) rolling standard deviation |
| ATR_14 | 14-day Average True Range (volatility indicator) |
| volume_zscore_5 | 5-day Z-score of trading volume |
| RSI_14 | 14-day Relative Strength Index |
| VIX | Implied volatility index (if available for the SPX) |
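
The candidate variables above could be derived from OHLCV data roughly as follows. This sketch uses simple rolling averages for both ATR and RSI, which is an assumed simplification (Wilder's original definitions use exponential smoothing); `volatility_features` is a hypothetical helper and the prices are synthetic.

```python
import numpy as np
import pandas as pd

def volatility_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the tabled predictors from an OHLCV frame (hypothetical helper)."""
    out = pd.DataFrame(index=df.index)
    ret = df["Close"].pct_change()
    out["prev_day_return"] = ret.shift(1)
    out["rolling_std_5"] = ret.rolling(5).std()
    out["rolling_std_21"] = ret.rolling(21).std()
    # True range: largest of high-low and the two gaps from the prior close
    prev_close = df["Close"].shift(1)
    true_range = pd.concat(
        [df["High"] - df["Low"], (df["High"] - prev_close).abs(), (df["Low"] - prev_close).abs()],
        axis=1,
    ).max(axis=1)
    out["ATR_14"] = true_range.rolling(14).mean()  # simple-average variant
    # Volume z-score over a 5-day window
    out["volume_zscore_5"] = (df["Volume"] - df["Volume"].rolling(5).mean()) / df["Volume"].rolling(5).std()
    # RSI from simple averages of gains and losses over 14 days
    delta = df["Close"].diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["RSI_14"] = 100 - 100 / (1 + avg_gain / avg_loss)
    return out

# Synthetic OHLCV data for illustration only
rng = np.random.default_rng(4)
n = 80
close = 4000 * np.exp(np.cumsum(rng.normal(0, 0.01, n)))
prices = pd.DataFrame(
    {
        "Close": close,
        "High": close * (1 + rng.uniform(0.001, 0.01, n)),
        "Low": close * (1 - rng.uniform(0.001, 0.01, n)),
        "Volume": rng.integers(1_000_000, 5_000_000, n).astype(float),
    },
    index=pd.bdate_range("2024-01-02", periods=n),
)
feats = volatility_features(prices)
print(feats.dropna().head(3))
```

Several of these columns use the current day's prices, so they would need to be shifted by one day if the label is that same day's volatility.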

Analysis plan (Q4)

This question uses unsupervised machine learning methods to group stocks:

  1. Collect and preprocess S&P 500 stock data for 2024, calculating each stock’s exposure to multiple risk and return factors (e.g., momentum, value, size, volatility).
  2. Construct a feature matrix where each row represents a stock and columns represent factor exposures.
  3. Apply hierarchical clustering (such as Ward’s method) to the feature matrix to group stocks with similar profiles.
  4. Visualize the dendrogram and analyze the resulting clusters to interpret the taxonomy and identify meaningful stock groupings.

| Step | Description |
| --- | --- |
| Data Prep | Collect and preprocess S&P 500 stock data for 2024, computing multi-factor exposures. |
| Matrix Build | Create a matrix with stocks as rows and factor exposures as columns. |
| Clustering | Perform hierarchical clustering (e.g., Ward’s method) on the feature matrix. |
| Interpretation | Visualize and interpret the dendrogram to analyze and describe stock group taxonomy. |
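
A minimal sketch of the clustering step, assuming SciPy's hierarchical clustering routines and a small synthetic factor matrix with two well-separated groups. The ticker labels and factor values are placeholders, and cutting the tree at two clusters is only for demonstration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic factor matrix: 10 "stocks" x 3 factor exposures, in two groups
rng = np.random.default_rng(3)
group_a = rng.normal(loc=[1.0, 0.5, -0.5], scale=0.1, size=(5, 3))
group_b = rng.normal(loc=[-1.0, -0.5, 0.5], scale=0.1, size=(5, 3))
X = np.vstack([group_a, group_b])
tickers = [f"STOCK_{i}" for i in range(len(X))]  # placeholder names

# Standardize each factor so no single exposure dominates the distances
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Ward linkage merges the pair of clusters with the smallest variance increase
link = linkage(X_std, method="ward")

# Cut the tree into two flat clusters
labels = fcluster(link, t=2, criterion="maxclust")

# Dendrogram layout without drawing; drop no_plot to render with matplotlib
tree = dendrogram(link, labels=tickers, no_plot=True)
print(dict(zip(tickers, labels)))
```

On real data the number of clusters would be chosen by inspecting the dendrogram rather than fixed in advance.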

Potential Variables for Use

| Variable | Description |
| --- | --- |
| avg_return_2024 | Average daily return in 2024 |
| volatility_2024 | Standard deviation of daily returns in 2024 |
| momentum_3m | 3-month price momentum (percent change) |
| value_ratio | Price-to-earnings (P/E) or price-to-book (P/B) ratio |
| size | Market capitalization |
| beta_market | Beta (sensitivity to S&P 500 index) |
| dividend_yield | Dividend yield |
| sector | Categorical variable for sector (for interpretation/grouping) |
| skewness_2024 | Skewness of daily returns in 2024 |
| kurtosis_2024 | Kurtosis of daily returns in 2024 |
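
The return-based variables in this table could be computed from a frame of daily returns as sketched below. The helper `factor_exposures`, the ticker names, and the 63-trading-day window used for three-month momentum are assumptions; fundamentals such as value_ratio, size, and dividend_yield would come from a separate data source and are not covered here.

```python
import numpy as np
import pandas as pd

def factor_exposures(returns: pd.DataFrame, market: pd.Series) -> pd.DataFrame:
    """One row per stock, one column per return-based factor (hypothetical helper)."""
    out = pd.DataFrame(index=returns.columns)
    out["avg_return_2024"] = returns.mean()
    out["volatility_2024"] = returns.std()
    # Compounded return over the last 63 trading days (assumed ~3 months)
    out["momentum_3m"] = (1 + returns.tail(63)).prod() - 1
    # Beta: covariance with the market divided by market variance
    out["beta_market"] = returns.apply(lambda r: r.cov(market)) / market.var()
    out["skewness_2024"] = returns.skew()
    out["kurtosis_2024"] = returns.kurt()  # pandas reports excess kurtosis
    return out

# Synthetic daily returns for illustration only
rng = np.random.default_rng(5)
idx = pd.bdate_range("2024-01-02", periods=252)
market = pd.Series(rng.normal(0.0004, 0.01, len(idx)), index=idx)
returns = pd.DataFrame(
    {
        "AAA": 1.5 * market + rng.normal(0, 0.005, len(idx)),
        "BBB": 0.5 * market + rng.normal(0, 0.005, len(idx)),
        "MKT2X": 2.0 * market,  # exactly twice the market, so beta is 2
    },
    index=idx,
)
exposures = factor_exposures(returns, market)
print(exposures.round(4))
```

The resulting frame already has stocks as rows and factors as columns, which is the layout the clustering step expects.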