This project explores the predictive modeling of short-term volatility anomalies in the Chinese equity market, with a specific focus on the stock AtHub (603881.SH)—a data center infrastructure company. The goal is to develop a machine learning pipeline that can detect abnormal daily price or volume movements using technical analysis (TA) indicators.
Author: Annabelle Zhu
Affiliation: College of Information Science, University of Arizona
📝 Proposal
This project proposes the development of an interpretable machine learning model for forecasting short-term volatility anomalies in the Chinese equity market, using AtHub (603881.SH) as a case study. AtHub is a data center infrastructure provider whose stock demonstrates unusually high daily volatility and frequent sensitivity to external events such as government policy announcements (Lin et al. 2024). Rather than predicting stock prices directly—a notoriously noisy and non-stationary target—this study focuses on detecting next-day abnormal price or volume events, defined as daily returns exceeding ±5% or volume spikes greater than 2× the rolling average.
We aim to construct a binary classifier that leverages over 30 technical analysis (TA) indicators across momentum, volume, trend, and volatility domains. These features are engineered using the Tushare API and the ta library, covering 218 trading days of AtHub data. The project will incorporate time-aware cross-validation to avoid look-ahead bias and SHAP analysis for post-hoc interpretability. Ensemble models like LightGBM and XGBoost will serve as the backbone of the predictive framework, selected for their robustness in handling noisy, nonlinear tabular data. The final outcome will include an interactive visualization of feature contributions, along with a brief report and presentation.
This proposal reflects a practical and scalable approach to market anomaly detection, especially relevant for traders and risk managers seeking data-driven early warning systems.
🎯 High-Level Goal
To develop a machine learning classifier that predicts next-day abnormal volatility events in AtHub (603881.SH) stock using technical analysis (TA) indicators, with anomalies defined as price movements exceeding ±5% or volume surges >2× the 30-day average.
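📊 Dataset
Dataset Summary
The dataset is sourced via the Tushare API and engineered using Python's ta technical indicator library. It includes:
375 daily records of AtHub stock trading from the past ~18 months
31 columns, including:
Price data: open, high, low, close, pct_chg
Volume metrics: vol, amount, volume_obv, volume_cmf, volume_vpt, volume_vwap, volume_mfi
Volatility indicators: volatility_bbw, volatility_atr, volatility_ui
Trend & momentum indicators: trend_macd, trend_adx, momentum_rsi, momentum_wr, momentum_roc, trend_aroon_up, etc.
This feature-rich time series provides a robust foundation for testing anomaly detection models under realistic, noisy conditions.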
Target Anomalies
[Figure: Distribution of Anomaly Labels (Target Variable). Panels: Anomaly Class Distribution; Daily Returns Histogram.]
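The thresholds above flag an event on the day it happens, while the stated goal is to predict it one day ahead. A minimal sketch of how the label could be shifted so that the features of day t are paired with the event of day t+1 is shown below. It assumes the `pct_chg` (percent) and `vol` columns described in this proposal; the function name `label_next_day_anomaly`, the `anomaly_next` column, and the toy values are hypothetical.

```python
import pandas as pd


def label_next_day_anomaly(df, ret_thresh=5.0, vol_mult=2.0, window=30):
    """Add same-day and next-day anomaly flags to a copy of df.

    Assumes `pct_chg` is the daily return in percent and `vol` is daily volume.
    """
    out = df.copy()
    vol_avg = out["vol"].rolling(window).mean()
    same_day = (out["pct_chg"].abs() >= ret_thresh) | (out["vol"] > vol_mult * vol_avg)
    out["anomaly"] = same_day.astype(int)
    # Shift by -1 so that row t carries the label of day t+1. The last row has
    # no next day, stays NaN, and should be dropped before training.
    out["anomaly_next"] = out["anomaly"].shift(-1)
    return out


# Toy data for illustration only (not AtHub prices); a 3-day window keeps it short.
demo = pd.DataFrame({
    "pct_chg": [0.5, -1.0, 6.2, 0.3, -0.4, 0.1],
    "vol": [100, 110, 105, 500, 120, 115],
})
labeled = label_next_day_anomaly(demo, window=3)
print(labeled)
```

Rows inside the rolling warm-up period have no volume average, so only the return rule can fire there.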
Feature Construction Strategy
To effectively model price and volume anomalies, we engineered over 30 technical indicators across four core dimensions widely adopted in quantitative trading:
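Momentum (e.g., RSI, MACD, Williams %R): capture price velocity and potential reversals.
Volume-based (e.g., OBV, MFI, VPT): track accumulation/distribution behavior.
Volatility (e.g., ATR, Bollinger Band Width, Ulcer Index): quantify market turbulence.
Trend strength (e.g., ADX, Aroon, CCI): detect the emergence or weakening of price trends.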
These indicators were computed using the ta Python library and merged with daily OHLCV data. We conducted correlation analysis to filter redundant signals and retain complementary ones, as illustrated in the figure below.
[Figure: Correlation Between Features and Target (Anomaly)]
This engineered feature space provides interpretable signals that are sensitive to both directional shifts and liquidity changes—two major components in anomaly formation.
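One common way to filter redundant signals is to drop one column from every highly correlated pair. The sketch below illustrates that idea on synthetic data. The 0.9 cutoff, the function name `drop_redundant_features`, and the column `momentum_rsi_scaled` are assumptions made for illustration, not the procedure actually used in this project.

```python
import numpy as np
import pandas as pd


def drop_redundant_features(X, threshold=0.9):
    """Drop the later column of every pair whose |Pearson correlation| exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop), to_drop


# Synthetic features: the second column is a near-copy of the first.
rng = np.random.default_rng(0)
rsi = rng.normal(size=200)
features = pd.DataFrame({
    "momentum_rsi": rsi,
    "momentum_rsi_scaled": 2 * rsi + 0.01 * rng.normal(size=200),
    "volume_obv": rng.normal(size=200),
})
reduced, dropped = drop_redundant_features(features)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

The greedy rule keeps whichever column appears first, so column order acts as a priority list when two indicators carry the same information.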
Model Development Approach
Data Splitting Strategy
Given the time-series nature of stock data, we implement:
Chronological split: First 80% for training, last 20% for testing
Walk-forward validation: Expanding window cross-validation
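The splitting scheme can be sketched with plain index arithmetic. The helper names `chronological_split` and `expanding_window_splits` and the fold counts are hypothetical; scikit-learn's `TimeSeriesSplit` offers a comparable expanding-window splitter and may be the more practical choice in the actual pipeline.

```python
def chronological_split(n_rows, train_frac=0.8):
    """Return (train_idx, test_idx) with every test row later than every train row."""
    cut = int(n_rows * train_frac)
    return list(range(cut)), list(range(cut, n_rows))


def expanding_window_splits(n_rows, n_folds=4):
    """Yield (train_idx, val_idx) pairs whose training window grows each fold."""
    fold_size = n_rows // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("not enough rows for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any remainder rows.
        val_end = n_rows if k == n_folds else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))


# Ten rows stand in for trading days sorted by date.
train_idx, test_idx = chronological_split(10)
folds = list(expanding_window_splits(len(train_idx), n_folds=3))
for train, val in folds:
    print(f"train rows 0..{train[-1]} -> validate rows {val[0]}..{val[-1]}")
```

Validation rows always come after the training rows of the same fold, which is what prevents look-ahead bias during tuning.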
Evaluation Metrics
We prioritize:
Recall: Minimizing false negatives (missed anomalies)
F1-score: Balancing precision/recall
Matthews Correlation Coefficient: Robust to class imbalance
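To make these metrics concrete, the sketch below computes them from confusion-matrix counts on a made-up set of labels. The function name `anomaly_metrics` and the example vectors are hypothetical; accuracy is included only to show why it is not the headline metric when anomalies are rare.

```python
import math


def anomaly_metrics(y_true, y_pred):
    """Return accuracy, recall, F1 and MCC for binary labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0 when undefined.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "recall": recall, "f1": f1, "mcc": mcc}


# Made-up labels: 3 anomalies in 10 days.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
never_alert = [0] * 10

model_scores = anomaly_metrics(y_true, model_pred)
naive_scores = anomaly_metrics(y_true, never_alert)
print("model:", model_scores)
print("never alert:", naive_scores)
```

A classifier that never raises an alert still reaches 70% accuracy on this toy sample, yet its recall and MCC are both zero, which is the behavior the prioritized metrics are meant to expose.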
Baseline Models
| Model | Strengths | Weaknesses |
|----|----|----|
| XGBoost | Handles nonlinear relationships | Requires careful tuning |
| LightGBM | Efficient with large features | Sensitive to outliers |
| Logistic Regression | Interpretable coefficients | Limited nonlinear capacity |
📍 Motivation & Goals
Why AtHub (603881.SH)? AtHub is a leading Chinese data center infrastructure provider whose stock exhibits unusually high short-term volatility, making it a strong candidate for anomaly-based forecasting. Over the past six months, its daily return volatility (\(\sigma \approx\) 35%) has far exceeded the industry average (\(\approx\) 22%). Moreover, its price reacts sharply to regulatory announcements and policy shifts, reflecting its sensitivity to macro-level and sector-specific events.
This project aims to detect and forecast short-term abnormal volatility events in AtHub’s stock using supervised machine learning. Instead of heuristic rules, we define volatility anomalies using quantifiable thresholds: price changes beyond ±5% or trading volumes exceeding 2\(\times\) the 30-day average. Our key goals are:
Build an interpretable prediction model using over 30 engineered technical indicators (e.g., MACD, RSI, OBV, ATR).
Evaluate event-driven prediction performance using time-series-aware cross-validation and dynamic thresholding strategies.
Provide real-world utility in the form of a probabilistic alert system for volatility-prone trading days.
SHAP analysis is integrated to uncover feature interactions that precede volatility (e.g., “high RSI + declining OBV” may precede reversals), offering not just predictive power but also interpretability.
👓 Research Questions
Q1. Can TA features detect anomalies 1–3 days in advance? Which indicators lead?
Q2. Which features drive predictions? Do they align with financial theory?
Q3. How do anomaly thresholds (\(\pm\) 3% vs. \(\pm\) 5% vs. \(\pm\) 7% price; 1.8 \(\times\) vs. 2.5\(\times\) volume) impact model performance?
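For Q3, a useful first step before any modeling is to see how the share of positive labels changes across the threshold grid, since stricter thresholds shrink the anomaly class. The sketch below tabulates that rate on synthetic data. The function name `anomaly_rate_grid`, the random inputs, and the inclusion of the baseline 2× multiplier alongside 1.8× and 2.5× are assumptions for illustration.

```python
import itertools

import numpy as np
import pandas as pd


def anomaly_rate_grid(df, ret_thresholds=(3.0, 5.0, 7.0),
                      vol_mults=(1.8, 2.0, 2.5), window=30):
    """Tabulate the share of days flagged as anomalies for each threshold pair."""
    vol_avg = df["vol"].rolling(window).mean()
    rows = []
    for ret_thresh, vol_mult in itertools.product(ret_thresholds, vol_mults):
        flag = (df["pct_chg"].abs() >= ret_thresh) | (df["vol"] > vol_mult * vol_avg)
        rows.append({
            "ret_thresh": ret_thresh,
            "vol_mult": vol_mult,
            "anomaly_rate": flag.mean(),
        })
    return pd.DataFrame(rows)


# Synthetic returns (percent) and volumes; not AtHub data.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "pct_chg": rng.normal(0, 3, 250),
    "vol": rng.lognormal(mean=10, sigma=0.5, size=250),
})
grid = anomaly_rate_grid(demo)
print(grid)
```

On real data the same table indicates how imbalanced each labeling choice would make the training set, which helps when comparing model scores across thresholds.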
🚩 Analysis plan
The weekly plan below runs EDA before feature engineering, maps each task to the research questions, and covers all required deliverables (write-up, presentation, website). Tools are listed separately for clarity:
Weekly Plan: Predicting Abnormal Volatility in AtHub (603881.SH)
Week 1: Data Collection & Exploratory Analysis (EDA)
Tasks:
Collect 1+ year of OHLCV data for AtHub using Tushare API.
Generate TA features (momentum, volume, volatility, trend indicators).
Perform EDA:
Visualize price/volume trends and anomaly frequency.
Check for missing data, outliers, and stationarity.
Analyze correlation between raw price/volume metrics.
---
title: "Forecasting Anomalies in AtHub’s Stock Behavior"
subtitle: "Data-Driven Detection of Local Peaks and Dips"
author: 
  - name: "Annabelle Zhu"
    affiliations:
      - name: "College of Information Science, University of Arizona"
description: "This project explores the predictive modeling of short-term volatility anomalies in the Chinese equity market, with a specific focus on the stock *AtHub (603881.SH)*, a data center infrastructure company. The goal is to develop a machine learning pipeline that can detect abnormal daily price or volume movements using technical analysis (TA) indicators."
format:
  html:
    code-tools: true
    code-overflow: wrap
    code-line-numbers: true
    embed-resources: true
editor: visual
bibliography: references.bib
code-annotations: hover
execute:
  warning: false
  echo: false
jupyter: python3
---

# 📝 Proposal

This project proposes the development of an interpretable machine learning model for forecasting short-term volatility anomalies in the Chinese equity market, using AtHub (603881.SH) as a case study. AtHub is a data center infrastructure provider whose stock demonstrates unusually high daily volatility and frequent sensitivity to external events such as government policy announcements [@lin2024datahub]. Rather than predicting stock prices directly (a notoriously noisy and non-stationary target), this study focuses on detecting next-day abnormal price or volume events, defined as daily returns exceeding ±5% or volume spikes greater than 2× the rolling average.

We aim to construct a binary classifier that leverages over 30 technical analysis (TA) indicators across momentum, volume, trend, and volatility domains. These features are engineered using the Tushare API and the `ta` library, covering 218 trading days of AtHub data. The project will incorporate time-aware cross-validation to avoid look-ahead bias and SHAP analysis for post-hoc interpretability. Ensemble models like LightGBM and XGBoost will serve as the backbone of the predictive framework, selected for their robustness in handling noisy, nonlinear tabular data. The final outcome will include an interactive visualization of feature contributions, along with a brief report and presentation.

This proposal reflects a practical and scalable approach to market anomaly detection, especially relevant for traders and risk managers seeking data-driven early warning systems.

------------------------------------------------------------------------

# 🎯 High-Level Goal

To develop a machine learning classifier that predicts next-day abnormal volatility events in AtHub (603881.SH) stock using technical analysis (TA) indicators, with anomalies defined as price movements exceeding ±5% or volume surges \>2× the 30-day average.

```{python}
#| label: load-pkgs
#| message: false

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
```

------------------------------------------------------------------------

# 📊 Dataset

```{python}
#| label: load-dataset
#| message: false

df = pd.read_csv("data/stock_cleaned.csv")
df.head()
```

## Dataset info

```{python}
f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns."
```

```{python}
df.info()
```

## Dataset Summary

The dataset is sourced via the **Tushare API** and engineered using Python’s `ta` technical indicator library. It includes:

- **375 daily records** of AtHub stock trading from the past \~18 months
- **31 columns**, including:
  - **Price data**: `open`, `high`, `low`, `close`, `pct_chg`
  - **Volume metrics**: `vol`, `amount`, `volume_obv`, `volume_cmf`, `volume_vpt`, `volume_vwap`, `volume_mfi`
  - **Volatility indicators**: `volatility_bbw`, `volatility_atr`, `volatility_ui`
  - **Trend & momentum indicators**: `trend_macd`, `trend_adx`, `momentum_rsi`, `momentum_wr`, `momentum_roc`, `trend_aroon_up`, etc.

This feature-rich time series provides a robust foundation for testing anomaly detection models under realistic, noisy conditions.

## Target `Anomalies`

```{python}
#| label: target-visualization
#| fig-cap: "Distribution of Anomaly Labels (Target Variable)"
#| fig-subcap: ["Anomaly Class Distribution", "Daily Returns Histogram"]

# Flag anomaly days: |return| >= 5% or volume > 2x its 30-day rolling mean
df['anomaly'] = ((df['pct_chg'].abs() >= 5) | (df['vol'] > 2 * df['vol'].rolling(30).mean())).astype(int)

# Plot class balance (left) and the return distribution with thresholds (right)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
df['anomaly'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=['#66b3ff', '#ff9999'], ax=ax1)
ax1.set_title('Anomaly Class Distribution')
df['pct_chg'].plot(kind='hist', bins=50, color='#66b3ff', ax=ax2)
ax2.axvline(x=5, color='red', linestyle='--')
ax2.axvline(x=-5, color='red', linestyle='--')
ax2.set_title('Daily Returns Distribution (±5% Threshold)')
plt.tight_layout()
plt.show()
```

## Feature Construction Strategy

To effectively model price and volume anomalies, we engineered over 30 technical indicators across four core dimensions widely adopted in quantitative trading:

- **Momentum** (e.g., RSI, MACD, Williams %R): capture price velocity and potential reversals.
- **Volume-based** (e.g., OBV, MFI, VPT): track accumulation/distribution behavior.
- **Volatility** (e.g., ATR, Bollinger Band Width, Ulcer Index): quantify market turbulence.
- **Trend strength** (e.g., ADX, Aroon, CCI): detect the emergence or weakening of price trends.

These indicators were computed using the `ta` Python library and merged with daily OHLCV data. We conducted correlation analysis to filter redundant signals and retain complementary ones, as illustrated in the figure below.

```{python}
#| label: feature-target-corr
#| fig-cap: "Correlation Between Features and Target (Anomaly)"

from scipy.stats import pointbiserialr

selected_features = ['momentum_rsi', 'volume_obv', 'volatility_atr', 'trend_macd_diff', 'volume_vpt', 'trend_adx']
corr_with_target = {}
for col in selected_features:
    # Drop rows where the indicator is missing (e.g., warm-up period) before correlating
    valid = df[['anomaly', col]].dropna()
    corr, _ = pointbiserialr(valid['anomaly'], valid[col])
    corr_with_target[col] = corr

# Plot
plt.figure(figsize=(8, 5))
sns.barplot(x=list(corr_with_target.values()), y=list(corr_with_target.keys()), palette='coolwarm')
plt.xlabel('Point Biserial Correlation with Target (Anomaly)')
plt.title('Feature-Target Correlation')
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
```

This engineered feature space provides interpretable signals that are sensitive to both directional shifts and liquidity changes, two major components in anomaly formation.

## Model Development Approach

### Data Splitting Strategy

Given the time-series nature of stock data, we implement:

- **Chronological split**: First 80% for training, last 20% for testing
- **Walk-forward validation**: Expanding window cross-validation

### Evaluation Metrics

We prioritize:

- **Recall**: Minimizing false negatives (missed anomalies)
- **F1-score**: Balancing precision/recall
- **Matthews Correlation Coefficient**: Robust to class imbalance

### Baseline Models

| Model | Strengths | Weaknesses |
|----|----|----|
| **XGBoost** | Handles nonlinear relationships | Requires careful tuning |
| **LightGBM** | Efficient with large features | Sensitive to outliers |
| **Logistic Regression** | Interpretable coefficients | Limited nonlinear capacity |

------------------------------------------------------------------------

# 📍 Motivation & Goals

**Why AtHub (603881.SH)?** AtHub is a leading Chinese data center infrastructure provider whose stock exhibits unusually high short-term volatility, making it a strong candidate for anomaly-based forecasting. Over the past six months, its daily return volatility ($\sigma \approx$ 35%) has far exceeded the industry average ($\approx$ 22%). Moreover, its price reacts sharply to regulatory announcements and policy shifts, reflecting its sensitivity to macro-level and sector-specific events.

This project aims to detect and forecast short-term abnormal volatility events in AtHub’s stock using supervised machine learning. Instead of heuristic rules, we define volatility anomalies using **quantifiable thresholds**: price changes beyond ±5% or trading volumes exceeding 2$\times$ the 30-day average. Our key goals are:

- **Build an interpretable prediction model** using over 30 engineered technical indicators (e.g., MACD, RSI, OBV, ATR).
- **Evaluate event-driven prediction performance** using time-series-aware cross-validation and dynamic thresholding strategies.
- **Provide real-world utility** in the form of a probabilistic alert system for volatility-prone trading days.

SHAP analysis is integrated to uncover feature interactions that precede volatility (e.g., “high RSI + declining OBV” may precede reversals), offering not just predictive power but also interpretability.

------------------------------------------------------------------------

# 👓 Research Questions

- Q1. Can TA features detect anomalies 1–3 days in advance? Which indicators lead?
- Q2. Which features drive predictions? Do they align with financial theory?
- Q3. How do anomaly thresholds ($\pm$ 3% vs. $\pm$ 5% vs. $\pm$ 7% price; 1.8 $\times$ vs. 2.5$\times$ volume) impact model performance?

------------------------------------------------------------------------

# 🚩 Analysis plan

The **weekly plan** below runs EDA before feature engineering, maps each task to the research questions, and covers all required deliverables (write-up, presentation, website). Tools are listed separately for clarity:

## **Weekly Plan: Predicting Abnormal Volatility in AtHub (603881.SH)**

#### **Week 1: Data Collection & Exploratory Analysis (EDA)**

- **Tasks**:
  - Collect 1+ year of OHLCV data for AtHub using Tushare API.
  - Generate TA features (momentum, volume, volatility, trend indicators).
  - Perform EDA:
    - Visualize price/volume trends and anomaly frequency.
    - Check for missing data, outliers, and stationarity.
    - Analyze correlation between raw price/volume metrics.
  - Define preliminary anomaly thresholds ($\pm$5% returns, 2$\times$ volume).
- **Tools**: `tushare`, `pandas`, `matplotlib`, `ta`, `seaborn`.

#### **Week 2: Feature Engineering & Baseline Model**

- **Tasks**:
  - Refine anomaly labels based on EDA insights.
  - Split data chronologically (e.g., 80% train, 20% test).
  - Train baseline models (XGBoost/LightGBM) and evaluate with accuracy/F1.
- **Research Questions Addressed**:
  - *Q3 (Threshold Impact)*: Test initial thresholds.
- **Tools**: `scikit-learn`, `xgboost`.

#### **Week 3: Model Tuning & Interpretability**

- **Tasks**:
  - Optimize hyperparameters using time-series cross-validation.
  - Compare performance across thresholds ($\pm$ 3%, $\pm$ 5%, $\pm$ 7%).
  - Apply SHAP to identify top predictive features and patterns.
  - Test feature lead times (1–3 days pre-anomaly).
- **Research Questions Addressed**:
  - *Q1 (Predictive Horizon)*: Lag feature analysis.
  - *Q2 (Feature Importance)*: SHAP/partial dependence plots.
- **Tools**: `optuna`, `shap`, `statsmodels` (Granger causality).

#### **Week 4: Final Evaluation & Deliverables**

- **Tasks**:
  - **Write-up (1,000–2,000 words)**:
    - Introduction, Methods, Results (SHAP plots, threshold analysis), Conclusion.
  - **Presentation (5 mins)**:
    - Quarto slides covering motivation, methods, key findings, Q&A prep.
  - **Website**:
    - Host report, code, and interactive visualizations (e.g., Plotly dashboards).
  - **Repo Organization**:
    - Logical structure (e.g., `data/`, `notebooks/`, `results/`).
    - Clear `index.qmd` as entry point.
- **Tools**: Quarto, `plotly`, `pkgdown`

## Expected Outcomes

1. **Threshold Analysis Results**:
   - Precision-recall curves for $\pm$ 3% vs. $\pm$ 5% vs. $\pm$ 7% thresholds
   - Optimal threshold selection based on trading costs
2. **Top Predictive Features**:
   - SHAP summary plot of top 10 influential indicators
   - Temporal importance patterns (e.g., volume leads price)
3. **Practical Trading Rules**:
   - Actionable signals like:\
     *"When RSI \> 70 AND OBV \< 30-day average* $\to$ 67% probability of next-day drop \>5%"
4. **Interactive Dashboard**:
   - Dynamic visualization of anomaly predictions
   - Threshold adjustment interface

------------------------------------------------------------------------

# 📁 Repository Organization

| Folder / File Name | Description |
|----|----|
| `.quarto/` | Internal Quarto system files; manages cache and config for rendering. Not manually edited. |
| `_extra/` | Holds supplementary files or artifacts not directly part of deliverables. |
| `_freeze/` | Stores frozen snapshots of document outputs to ensure reproducibility across builds. |
| `_site/` | Output folder generated when the site is rendered; contains final HTML files. |
| `data/` | Contains all datasets used in the project, both raw and processed. Includes README for data schema and source. |
| `images/` | Stores all image assets, including plots and figures used in `.qmd` files. |
| `style/` | Contains custom theming files (e.g., `customtheming.scss`) used to style the website. |
| `index.qmd` | Landing page of the Quarto website; typically includes a high-level **project overview** or introduction. |
| `about.qmd` | Additional project background or author info. Can serve as a **detailed project description**. |
| `proposal.qmd` | Contains the research proposal, including motivation, methodology, timeline, and repo organization. |
| `presentation.qmd` | A Quarto-based presentation (slides) summarizing key findings from the final report. |