import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import EarlyStopping
Navigating the S&P 500 with Machine Learning: What Can We Discover?
Proposal
Dataset
print("Downloading SPX data...")
data = yf.download('^GSPC', start='2014-01-01', end='2024-12-31')
# Flatten MultiIndex columns if necessary
if isinstance(data.columns, pd.MultiIndex):
    data.columns = ['_'.join(col).strip() for col in data.columns.values]
Downloading SPX data...
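This dataset contains daily historical price data for the S&P 500 index (^GSPC), downloaded using the yfinance Python package. It spans from January 1, 2014 to December 31, 2024. Because yfinance treats the `end` date as exclusive, the last session returned is the one before December 31, 2024.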
The dataset includes the following key variables:
- Open: Opening price of the index each day
- High: Highest price during the trading day
- Low: Lowest price during the trading day
- Close: Closing price of the day
We chose the S&P 500 dataset because it serves as a benchmark for the U.S. stock market, reflecting the performance of 500 leading publicly traded companies. This makes it ideal for exploring machine learning techniques in financial time series analysis, including trend prediction, volatility classification, and return forecasting.
With these techniques, we aim to answer specific questions: whether LSTM models improve short- and long-term return forecasts, whether classification reveals clear volatility regime patterns, and whether clustering on factor exposures yields meaningful stock groupings. By quantifying accuracy across forecast horizons, identifying key predictors of risk regimes, and mapping stocks by shared characteristics, we aim to move beyond general findings toward an actionable understanding of market behavior in 2024.
Questions
Q1. Can a Long Short-Term Memory (LSTM) model accurately forecast short-, medium-, or long-term S&P 500 returns?
Q2. How does forecast accuracy degrade as a function of prediction horizon, and what does this suggest about the LSTM’s ability to model longer-term financial trends?
Q3. How well can we classify each trading day in 2024 into low, medium, or high volatility regimes based on recent price action and market indicators?
Q4. How can hierarchical clustering organize S&P 500 stocks into a taxonomy based on multi-factor risk or return exposures?
Analysis plan (Q1, Q2)
The analysis will begin with data acquisition from Yahoo Finance. We will download four tickers and merge them into a single data frame containing the selected variables and engineered features.
The variables will receive a basic visual inspection and dimensional analysis.
The data will be cleaned and standardized to produce a “tidy” set, which will then be split for training the LSTM model.
The model will be tested on an unseen data set and the results recorded.
The performance will be evaluated for each time horizon and compared using plots.
Ticker | Description |
---|---|
^GSPC | S&P 500 Index (price and volume data) |
^VIX | 30-day implied volatility |
^VVIX | Volatility of volatility |
^VIX9D | 9-day implied volatility |
Variable | Description |
---|---|
Open | Opening price of SPX |
High | Daily high price of SPX |
Low | Daily low price of SPX |
Close | Daily closing price of SPX |
Volume | Daily trading volume of SPX |
VIX | Implied volatility index (30-day horizon) |
VVIX | Volatility-of-volatility index |
VIX9D | 9-day implied volatility index |
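The merge of the four tickers into one data frame could be sketched as follows. This is only an illustration under stated assumptions: `merge_tickers` is a hypothetical helper, the frames are synthetic stand-ins for the yfinance downloads (so the example runs offline), and an inner join on the date index is one possible alignment choice rather than the project's confirmed method.

```python
import numpy as np
import pandas as pd


def merge_tickers(spx: pd.DataFrame, indices: dict) -> pd.DataFrame:
    """Join SPX OHLCV with the closing level of each volatility index.

    `indices` maps an output column name (e.g. "VIX") to a frame with a
    "Close" column. The inner join keeps only dates present in every series.
    """
    closes = [frame["Close"].rename(name) for name, frame in indices.items()]
    return pd.concat([spx, *closes], axis=1, join="inner").sort_index()


# Synthetic stand-ins for the four downloads (not real market data).
dates = pd.bdate_range("2024-01-02", periods=6)
rng = np.random.default_rng(0)
close = 4700 + rng.normal(0, 10, len(dates)).cumsum()
spx = pd.DataFrame(
    {
        "Open": close - 2,
        "High": close + 8,
        "Low": close - 9,
        "Close": close,
        "Volume": rng.integers(3_000_000_000, 5_000_000_000, len(dates)),
    },
    index=dates,
)
vix = pd.DataFrame({"Close": 13 + rng.normal(0, 1, len(dates))}, index=dates)
vvix = pd.DataFrame({"Close": 85 + rng.normal(0, 3, len(dates))}, index=dates)
# Drop one date to mimic an index that has no value on some trading day.
vix9d = pd.DataFrame(
    {"Close": 12 + rng.normal(0, 1, len(dates))}, index=dates
).drop(dates[2])

merged = merge_tickers(spx, {"VIX": vix, "VVIX": vvix, "VIX9D": vix9d})
print(merged.round(2))
```

With an inner join, any date missing from one series is dropped from the merged frame, which handles the "remove nulls" step at the cost of losing those rows.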
Features | Description |
---|---|
log_return_t | Log returns of SPX: log(Close_t / Close_{t-1}) |
ParkinsonVol | Range-based (Parkinson) daily variance estimate from high/low: (ln(High/Low))^2 / (4 ln 2); its square root is the corresponding volatility |
EMA_10, SMA_21 | Short-term and medium-term trend indicators |
lag_volatility | Lagged daily volatility measures (RV, VIX, VVIX) |
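A minimal sketch of how the features in the table might be computed with pandas. The helper name `add_features`, the lag set `(1, 2, 5)`, and the lag column names are assumptions made for illustration; the input here is a synthetic price path, not market data.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame, lags=(1, 2, 5)) -> pd.DataFrame:
    """Append the proposal's features to a frame with High, Low, Close columns."""
    out = df.copy()
    # Log return: log(Close_t / Close_{t-1}); the first row is NaN.
    out["log_return_t"] = np.log(out["Close"] / out["Close"].shift(1))
    # Parkinson estimator: (ln(High/Low))^2 / (4 ln 2).
    out["ParkinsonVol"] = np.log(out["High"] / out["Low"]) ** 2 / (4 * np.log(2))
    # Trend indicators: 10-day exponential and 21-day simple moving averages.
    out["EMA_10"] = out["Close"].ewm(span=10, adjust=False).mean()
    out["SMA_21"] = out["Close"].rolling(21).mean()
    # Lagged volatility (hypothetical column names).
    for k in lags:
        out[f"ParkinsonVol_lag{k}"] = out["ParkinsonVol"].shift(k)
    return out


# Synthetic path: 1% daily growth with a fixed +/-1% intraday range.
close = pd.Series(100 * 1.01 ** np.arange(30))
prices = pd.DataFrame({"High": close * 1.01, "Low": close * 0.99, "Close": close})
features = add_features(prices)
print(features.tail(3).round(6))
```

VIX and VVIX lags would follow the same `shift(k)` pattern once those columns are merged in.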
Target Variable | Definition |
---|---|
Return_t+1 | 1-day ahead return: pct_change(1).shift(-1) |
Return_t+5 | 5-day ahead return: pct_change(5).shift(-5) |
Return_t+21 | 21-day ahead return: pct_change(21).shift(-21) |
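The target definitions above can be checked on a toy series. In this sketch `add_forward_returns` is a hypothetical helper, and the input is a synthetic series growing 10% per step so the expected values are easy to verify by hand.

```python
import numpy as np
import pandas as pd


def add_forward_returns(close: pd.Series, horizons=(1, 5, 21)) -> pd.DataFrame:
    """Build h-day-ahead simple returns aligned to the forecast date t.

    pct_change(h) at row t+h equals Close_{t+h} / Close_t - 1; shifting by -h
    moves that value back to row t. The last h rows have no future price and
    are therefore NaN.
    """
    return pd.DataFrame(
        {f"Return_t+{h}": close.pct_change(h).shift(-h) for h in horizons}
    )


close = pd.Series(100 * 1.1 ** np.arange(30))  # synthetic, not market data
targets = add_forward_returns(close)
print(targets.head(3).round(4))
print(targets.isna().sum())
```

The trailing NaN rows need to be dropped before training, and the count differs per horizon, so each horizon ends up with a slightly different usable sample.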
Step | Description |
---|---|
Data Acquisition | Download OHLCV for ^GSPC (SPX) and the volatility indices ^VIX, ^VVIX, and ^VIX9D via yfinance |
Data Cleaning/Inspection | Align on the date index, remove nulls, filter data for consistency |
Feature Engineering/Standardization | Construct technical indicators, lag features, and volatility-based predictors, then standardize them |
Train/Test Split | 80/20 time based split |
Model Architecture | LSTM |
Evaluation Metrics | MAE and RMSE for each forecast horizon |
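The split, standardization, and evaluation steps in the table might look like the sketch below. Several things here are assumptions rather than the project's settled design: the `make_windows`, `rmse`, and `mae` helpers are hypothetical, the lookback of 10 days is arbitrary, the data are random placeholders, and the zero forecast is only a naive baseline that stands in for model predictions. The scaler is fitted on the training portion only so that test-period statistics do not leak into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def make_windows(features, target, lookback):
    """Turn a 2-D feature array into overlapping (lookback, n_features) samples.

    Sample i covers rows i .. i+lookback-1 and is paired with the target on
    its last row, so each sample only contains data up to the forecast date.
    """
    n_samples = len(features) - lookback + 1
    X = np.stack([features[i : i + lookback] for i in range(n_samples)])
    y = target[lookback - 1 :]
    return X, y


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


rng = np.random.default_rng(42)
features = rng.normal(size=(100, 3))        # placeholder feature matrix
target = rng.normal(scale=0.01, size=100)   # placeholder forward returns

# 80/20 time-based split: the first 80% of rows train, the rest test.
split = int(len(features) * 0.8)
scaler = StandardScaler().fit(features[:split])
train_scaled = scaler.transform(features[:split])
test_scaled = scaler.transform(features[split:])

X_train, y_train = make_windows(train_scaled, target[:split], lookback=10)
X_test, y_test = make_windows(test_scaled, target[split:], lookback=10)

# Naive baseline: always predict a zero return.
baseline = np.zeros_like(y_test)
print(X_train.shape, X_test.shape)
print(rmse(y_test, baseline), mae(y_test, baseline))
```

The resulting `(samples, lookback, features)` arrays match the 3-D input shape that Keras LSTM layers expect. For an h-day-ahead target, the last h training rows have labels that reach into the test period, so dropping them is one way to keep the split clean.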
We selected Long Short-Term Memory (LSTM) networks because they are designed to capture long-term dependencies in sequential data, which makes them a natural candidate for time series forecasting such as index returns. Unlike simpler models such as linear regression or basic moving averages, LSTMs can learn patterns across varying time lags and handle the noisy, non-linear trends that are common in financial markets. This allows for potentially more accurate forecasts of market movements than models that assume fixed, short-term dependencies.
Analysis plan (Q3)
This question uses supervised machine learning methods to explore volatility regime classification:
- Feature Engineering: Calculate recent price action and market indicators (e.g., rolling volatility, intraday range, moving averages, volume, VIX).
- Target Creation: Label each trading day as “low”, “medium”, or “high” volatility based on daily realized volatility bins (e.g., using tertiles or quantiles).
- Model Training: Use supervised learning models (e.g., decision tree, random forest, XGBoost) to classify the volatility regime from engineered features.
- Evaluation: Evaluate classification results with accuracy, confusion matrix, and analyze which features most influence volatility regime assignment.
Step | Description |
---|---|
Prepare Features | Gather recent price action (returns, volatility) and market indicators for each trading day. |
Create Volatility Classes | Divide days into low, medium, and high volatility using quantiles or thresholds. |
Train Classifier | Fit a supervised learning model to predict volatility class from the features. |
Evaluate Model | Assess model accuracy and review which features are most predictive. |
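The four steps above could be sketched end to end as follows. The data are a synthetic, persistent volatility series (not market data), the two feature names are hypothetical, and the random forest is just one of the candidate models named in the plan. Tertile cut points are taken from the training portion only, which is an assumption made here to avoid using future information when labeling.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic daily volatility with persistence (AR(1) in logs).
rng = np.random.default_rng(1)
n = 300
log_vol = np.zeros(n)
for t in range(1, n):
    log_vol[t] = 0.9 * log_vol[t - 1] + rng.normal(0, 0.3)
realized_vol = pd.Series(0.01 * np.exp(log_vol), name="realized_vol")

# Features use only information available before day t.
data = pd.DataFrame(
    {
        "prev_day_vol": realized_vol.shift(1),
        "mean_vol_5": realized_vol.shift(1).rolling(5).mean(),
        "realized_vol": realized_vol,
    }
).dropna()

# Chronological 80/20 split; tertile thresholds come from the training part.
split = int(len(data) * 0.8)
q_low, q_high = data["realized_vol"].iloc[:split].quantile([1 / 3, 2 / 3])
classes = ["low", "medium", "high"]
regime = pd.cut(
    data["realized_vol"], bins=[-np.inf, q_low, q_high, np.inf], labels=classes
).astype(str)

X = data[["prev_day_vol", "mean_vol_5"]]
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = regime.iloc[:split], regime.iloc[split:]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

acc = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred, labels=classes)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(f"accuracy: {acc:.3f}")
print(cm)
print(importances.sort_values(ascending=False))
```

Rows of the confusion matrix are true classes and columns are predicted classes, in the order low, medium, high. The feature importances give a first view of which inputs drive the regime assignment.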
Potential Variables
Variable | Description |
---|---|
prev_day_return | Return from the previous trading day |
rolling_std_5 | 5-day rolling standard deviation (volatility) |
rolling_std_21 | 21-day (1-month) rolling standard deviation |
ATR_14 | 14-day Average True Range (volatility indicator) |
volume_zscore_5 | 5-day Z-score of trading volume |
RSI_14 | 14-day Relative Strength Index |
VIX | Implied volatility index (if available for the SPX) |
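A possible construction of the variables in the table, assuming a frame with High, Low, Close, and Volume columns. `add_regime_features` is a hypothetical helper. ATR and RSI are computed here with simple rolling means, which is a simplifying assumption; Wilder's smoothed versions are a common alternative. Shifting the whole frame by one day at the end is also an assumption, made so that row t holds only information available before day t.

```python
import numpy as np
import pandas as pd


def add_regime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute candidate predictors, lagged so row t uses data through t-1."""
    close, high, low = df["Close"], df["High"], df["Low"]
    ret = close.pct_change()
    out = pd.DataFrame(index=df.index)

    out["prev_day_return"] = ret
    out["rolling_std_5"] = ret.rolling(5).std()
    out["rolling_std_21"] = ret.rolling(21).std()

    # True range: largest of the three candidate ranges for each day.
    prev_close = close.shift(1)
    true_range = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1
    ).max(axis=1)
    out["ATR_14"] = true_range.rolling(14).mean()

    # Volume z-score against its own 5-day window.
    vol = df["Volume"]
    out["volume_zscore_5"] = (vol - vol.rolling(5).mean()) / vol.rolling(5).std()

    # RSI from average gains and losses over 14 days.
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta).clip(lower=0).rolling(14).mean()
    out["RSI_14"] = 100 - 100 / (1 + avg_gain / avg_loss)

    # Shift everything by one day so features describe the previous session.
    return out.shift(1)


# Synthetic steadily rising market (not real data) for a hand-checkable result.
close = pd.Series(100.0 + np.arange(40))
bars = pd.DataFrame(
    {"High": close + 1, "Low": close - 1, "Close": close,
     "Volume": 1000.0 + np.arange(40)}
)
regime_features = add_regime_features(bars)
print(regime_features.tail(3).round(4))
```

On this toy path the price never falls, so the average loss is zero and RSI saturates at 100, while the true range is a constant 2.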
Analysis plan (Q4)
This question uses unsupervised machine learning methods to explore how stocks group by factor exposure:
- Collect and preprocess S&P 500 stock data for 2024, calculating each stock’s exposure to multiple risk and return factors (e.g., momentum, value, size, volatility).
- Construct a feature matrix where each row represents a stock and columns represent factor exposures.
- Apply hierarchical clustering (such as Ward’s method) to the feature matrix to group stocks with similar profiles.
- Visualize the dendrogram and analyze the resulting clusters to interpret the taxonomy and identify meaningful stock groupings.
Step | Description |
---|---|
Data Prep | Collect and preprocess S&P 500 stock data for 2024, computing multi-factor exposures. |
Matrix Build | Create a matrix with stocks as rows and factor exposures as columns. |
Clustering | Perform hierarchical clustering (e.g., Ward’s method) on the feature matrix. |
Interpretation | Visualize and interpret the dendrogram to analyze and describe stock group taxonomy. |
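The clustering step could be sketched with SciPy as below. The feature matrix is synthetic (two well-separated groups of made-up tickers), and the choice to z-score each column first and to cut the tree into two clusters are assumptions for illustration. Standardizing matters because Ward's method works on Euclidean distances, so a factor on a large scale such as market capitalization would otherwise dominate.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic feature matrix: rows are stocks, columns are factor exposures.
rng = np.random.default_rng(7)
tickers = [f"STK{i:02d}" for i in range(10)]  # hypothetical tickers
group_a = rng.normal(loc=0.0, scale=0.1, size=(5, 3))
group_b = rng.normal(loc=3.0, scale=0.1, size=(5, 3))
X = np.vstack([group_a, group_b])

# Standardize each factor so no single column dominates the distances.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Ward linkage builds the merge tree; each row of Z records one merge.
Z = linkage(X_std, method="ward")

# Cut the tree into a chosen number of flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")

# no_plot=True returns the leaf ordering without requiring matplotlib.
tree = dendrogram(Z, labels=tickers, no_plot=True)
print(dict(zip(tickers, labels)))
print(tree["ivl"])
```

Dropping `no_plot=True` draws the dendrogram on the current matplotlib axes, which is the view used for interpreting the taxonomy.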
Potential Variables for Use
Variable | Description |
---|---|
avg_return_2024 | Average daily return in 2024 |
volatility_2024 | Standard deviation of daily returns in 2024 |
momentum_3m | 3-month price momentum (percent change) |
value_ratio | Price-to-earnings (P/E) or price-to-book (P/B) ratio |
size | Market capitalization |
beta_market | Beta (sensitivity to S&P 500 index) |
dividend_yield | Dividend yield |
sector | Categorical variable for sector (for interpretation/grouping) |
skewness_2024 | Skewness of daily returns in 2024 |
kurtosis_2024 | Kurtosis of daily returns in 2024 |
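The return-based variables in the table can be derived from a panel of daily returns, as in this sketch. `stock_features` is a hypothetical helper and the two tickers and their returns are synthetic. Fundamental variables such as value_ratio, size, and dividend_yield would need a separate data source and are not covered here.

```python
import numpy as np
import pandas as pd


def stock_features(returns: pd.DataFrame, market: pd.Series) -> pd.DataFrame:
    """One row per stock, one column per return-based factor exposure."""
    return pd.DataFrame(
        {
            "avg_return_2024": returns.mean(),
            "volatility_2024": returns.std(),
            # Beta: covariance with the market divided by market variance.
            "beta_market": returns.apply(lambda r: r.cov(market)) / market.var(),
            "skewness_2024": returns.skew(),
            # pandas reports excess kurtosis (a normal distribution gives 0).
            "kurtosis_2024": returns.kurt(),
        }
    )


# Synthetic daily returns (not market data): AAA moves exactly 2x the market.
rng = np.random.default_rng(3)
dates = pd.bdate_range("2024-01-02", periods=250)
market = pd.Series(rng.normal(0.0005, 0.01, len(dates)), index=dates)
returns = pd.DataFrame(
    {
        "AAA": 2.0 * market,
        "BBB": 0.5 * market + rng.normal(0, 0.005, len(dates)),
    },
    index=dates,
)

factor_matrix = stock_features(returns, market)
print(factor_matrix.round(4))
```

The resulting frame has stocks as rows and factors as columns, which is the layout the clustering step takes as input.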