Forecasting Anomalies in AtHub’s Stock Behavior

Data-Driven Detection of Local Peaks and Dips

This project explores the predictive modeling of short-term volatility anomalies in the Chinese equity market, with a specific focus on the stock AtHub (603881.SH)—a data center infrastructure company. The goal is to develop a machine learning pipeline that can detect abnormal daily price or volume movements using technical analysis (TA) indicators.

Author: Annabelle Zhu

Affiliation: College of Information Science, University of Arizona

📝 Proposal

This project proposes the development of an interpretable machine learning model for forecasting short-term volatility anomalies in the Chinese equity market, using AtHub (603881.SH) as a case study. AtHub is a data center infrastructure provider whose stock demonstrates unusually high daily volatility and frequent sensitivity to external events such as government policy announcements (Lin et al. 2024). Rather than predicting stock prices directly—a notoriously noisy and non-stationary target—this study focuses on detecting next-day abnormal price or volume events, defined as daily returns exceeding ±5% or volume spikes greater than 2× the rolling average.

We aim to construct a binary classifier that leverages over 30 technical analysis (TA) indicators across momentum, volume, trend, and volatility domains. The raw data are retrieved through the Tushare API and the features are computed with the ta library, covering 375 trading days of AtHub data. The project will incorporate time-aware cross-validation to avoid look-ahead bias and SHAP analysis for post-hoc interpretability. Ensemble models like LightGBM and XGBoost will serve as the backbone of the predictive framework, selected for their robustness in handling noisy, nonlinear tabular data. The final outcome will include an interactive visualization of feature contributions, along with a brief report and presentation.

This proposal reflects a practical and scalable approach to market anomaly detection, especially relevant for traders and risk managers seeking data-driven early warning systems.


🎯 High-Level Goal

To develop a machine learning classifier that predicts next-day abnormal volatility events in AtHub (603881.SH) stock using technical analysis (TA) indicators, with anomalies defined as price movements exceeding ±5% or volume surges >2× the 30-day average.
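
The labeling rule stated above can be sketched in pandas. This is a minimal illustration, not the project's actual code: it assumes `pct_chg` holds the daily return in percent and `vol` the daily volume (as in the dataset columns), and the function name, the default arguments, and the choice to exclude the current day from the rolling baseline are all assumptions made for the example.

```python
import pandas as pd


def label_next_day_anomaly(df, ret_thresh=5.0, vol_mult=2.0, window=30):
    """Return 1.0 where the NEXT trading day is an anomaly, 0.0 otherwise.

    Hypothetical helper: the thresholds mirror the text (+/-5% return,
    2x the 30-day average volume); everything else is an assumption.
    """
    # Average volume of the previous `window` days (current day excluded),
    # so a spike is never compared against a baseline that contains it.
    baseline = df["vol"].shift(1).rolling(window, min_periods=window).mean()

    price_event = df["pct_chg"].abs() > ret_thresh
    # Comparisons against a NaN baseline are False, so the first `window`
    # rows can only be flagged by the price rule.
    volume_event = df["vol"] > vol_mult * baseline
    same_day = price_event | volume_event

    # Pair features observed on day t with the event on day t+1.
    # The last row has no next day and stays NaN.
    return same_day.astype(float).shift(-1)
```

Rows with a NaN target (the final row) would be dropped before training, since their next-day outcome is not yet known.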


📊 Dataset

ts_code open high low close pct_chg vol amount volume_obv volume_cmf ... trend_adx_neg momentum_rsi momentum_wr momentum_roc momentum_ao momentum_ppo_hist trend_cci trend_aroon_up trend_aroon_down trend_aroon_ind
0 603881.SH 26.78 26.80 26.31 26.42 -1.7479 274104.38 725328.815 274104.38 -0.551020 ... 0.0 100.0 -77.551020 0.0 0.0 0.000000 0.000000 0 0 0
1 603881.SH 26.81 26.95 26.50 26.89 -0.5547 328799.42 879021.084 602903.80 0.149414 ... 0.0 100.0 -9.375000 0.0 0.0 0.113379 66.666667 4 0 4
2 603881.SH 27.77 27.99 27.02 27.04 -2.8387 453556.90 1240793.683 1056460.70 -0.326345 ... 0.0 100.0 -56.547619 0.0 0.0 0.214038 100.000000 8 0 8
3 603881.SH 27.81 27.98 27.38 27.83 -0.8550 438206.80 1214125.954 1494667.50 -0.084077 ... 0.0 100.0 -9.523810 0.0 0.0 0.453638 94.972067 8 0 8
4 603881.SH 28.30 28.86 27.56 28.07 0.2858 694568.92 1954292.549 2189236.42 -0.125737 ... 0.0 100.0 -30.980392 0.0 0.0 0.633289 107.892527 16 0 16

5 rows × 31 columns

Dataset info

'Dataset contains 375 rows and 31 columns.'
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 375 entries, 0 to 374
Data columns (total 31 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ts_code            375 non-null    object 
 1   open               375 non-null    float64
 2   high               375 non-null    float64
 3   low                375 non-null    float64
 4   close              375 non-null    float64
 5   pct_chg            375 non-null    float64
 6   vol                375 non-null    float64
 7   amount             375 non-null    float64
 8   volume_obv         375 non-null    float64
 9   volume_cmf         375 non-null    float64
 10  volume_vpt         375 non-null    float64
 11  volume_vwap        375 non-null    float64
 12  volume_mfi         375 non-null    float64
 13  volatility_bbw     375 non-null    float64
 14  volatility_atr     375 non-null    float64
 15  volatility_ui      375 non-null    float64
 16  trend_macd         375 non-null    float64
 17  trend_macd_signal  375 non-null    float64
 18  trend_macd_diff    375 non-null    float64
 19  trend_adx          375 non-null    float64
 20  trend_adx_pos      375 non-null    float64
 21  trend_adx_neg      375 non-null    float64
 22  momentum_rsi       375 non-null    float64
 23  momentum_wr        375 non-null    float64
 24  momentum_roc       375 non-null    float64
 25  momentum_ao        375 non-null    float64
 26  momentum_ppo_hist  375 non-null    float64
 27  trend_cci          375 non-null    float64
 28  trend_aroon_up     375 non-null    int64  
 29  trend_aroon_down   375 non-null    int64  
 30  trend_aroon_ind    375 non-null    int64  
dtypes: float64(27), int64(3), object(1)
memory usage: 90.9+ KB

Dataset Summary

The dataset is sourced via the Tushare API and engineered using Python’s ta technical indicator library. It includes:

  • 375 daily records of AtHub stock trading from the past ~18 months

  • 31 columns, including:

    • Price data: open, high, low, close, pct_chg
    • Volume metrics: vol, amount, volume_obv, volume_cmf, volume_vpt, volume_vwap, volume_mfi
    • Volatility indicators: volatility_bbw, volatility_atr, volatility_ui
    • Trend & momentum indicators: trend_macd, trend_adx, momentum_rsi, momentum_wr, momentum_roc, trend_aroon_up, etc.

This feature-rich time series provides a robust foundation for testing anomaly detection models under realistic, noisy conditions.

Target Anomalies

Anomaly Class Distribution

[Figure: Distribution of Anomaly Labels (Target Variable)]

Feature Construction Strategy

To effectively model price and volume anomalies, we engineered over 30 technical indicators across four core dimensions widely adopted in quantitative trading:

  • Momentum (e.g., RSI, MACD, Williams %R): capture price velocity and potential reversals.
  • Volume-based (e.g., OBV, MFI, VPT): track accumulation/distribution behavior.
  • Volatility (e.g., ATR, Bollinger Band Width, Ulcer Index): quantify market turbulence.
  • Trend strength (e.g., ADX, Aroon, CCI): detect the emergence or weakening of price trends.

These indicators were computed using the ta Python library and merged with daily OHLCV data. We conducted correlation analysis to filter redundant signals and retain complementary ones, as illustrated in the figure below.

[Figure: Correlation Between Features and Target (Anomaly)]

This engineered feature space provides interpretable signals that are sensitive to both directional shifts and liquidity changes—two major components in anomaly formation.
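
One common way to filter redundant indicators is to drop one member of every highly correlated pair. The sketch below illustrates that idea only; the 0.9 cutoff, the function name, and the rule of keeping the first column of each pair are assumptions, not choices documented in this proposal.

```python
import numpy as np
import pandas as pd


def drop_correlated(features, threshold=0.9):
    """Drop the later column of each pair whose |Pearson r| exceeds `threshold`.

    Hypothetical helper; `threshold` is an illustrative value.
    """
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and a column is never compared with itself.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop
```

To stay consistent with the time-aware evaluation, the correlations would be estimated on the training period only and the resulting column list then applied to the test period.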

Model Development Approach

Data Splitting Strategy

Given the time-series nature of stock data, we implement:

  • Chronological split: First 80% for training, last 20% for testing

  • Walk-forward validation: Expanding window cross-validation
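
The two splitting schemes above can be sketched with plain index ranges. This is an illustrative version under stated assumptions: the helper names, the five-fold default, and the equal-sized validation blocks are not taken from the project.

```python
def chronological_split(n_rows, train_frac=0.8):
    """Split row positions into an earlier training range and a later test range."""
    cut = int(n_rows * train_frac)
    return range(0, cut), range(cut, n_rows)


def expanding_window_splits(n_rows, n_folds=5):
    """Yield (train, validation) index ranges for walk-forward validation.

    Each fold trains on every row that precedes its validation block,
    so the training window grows while validation always lies in the future.
    """
    fold_size = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any remainder rows.
        val_end = n_rows if k == n_folds else train_end + fold_size
        yield range(0, train_end), range(train_end, val_end)
```

With 375 rows, the chronological split gives 300 training and 75 test rows, and the walk-forward folds would be generated inside the 300 training rows only. scikit-learn's `TimeSeriesSplit` provides a comparable expanding-window splitter if a library implementation is preferred.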

Evaluation Metrics

We prioritize:

  • Recall: Minimizing false negatives (missed anomalies)

  • F1-score: Balancing precision/recall

  • Matthews Correlation Coefficient: Robust to class imbalance
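
For reference, all three metrics follow from the confusion-matrix counts. The sketch below uses only the standard library; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
from math import sqrt


def recall_f1_mcc(y_true, y_pred):
    """Compute recall, F1, and Matthews correlation for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return recall, f1, mcc
```

A classifier that never predicts an anomaly scores zero on all three metrics here, even when most days are normal and its raw accuracy looks high, which is the reason these metrics are preferred for a rare-event target.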

Baseline Models

| Model | Strengths | Weaknesses |
|---|---|---|
| XGBoost | Handles nonlinear relationships | Requires careful tuning |
| LightGBM | Efficient with large feature sets | Sensitive to outliers |
| Logistic Regression | Interpretable coefficients | Limited nonlinear capacity |

📍 Motivation & Goals

Why AtHub (603881.SH)? AtHub is a leading Chinese data center infrastructure provider whose stock exhibits unusually high short-term volatility, making it a strong candidate for anomaly-based forecasting. Over the past six months, its daily return volatility (\(\sigma \approx\) 35%) has far exceeded the industry average (\(\approx\) 22%). Moreover, its price reacts sharply to regulatory announcements and policy shifts, reflecting its sensitivity to macro-level and sector-specific events.

This project aims to detect and forecast short-term abnormal volatility events in AtHub’s stock using supervised machine learning. Instead of heuristic rules, we define volatility anomalies using quantifiable thresholds: price changes beyond ±5% or trading volumes exceeding 2\(\times\) the 30-day average. Our key goals are:

  • Build an interpretable prediction model using over 30 engineered technical indicators (e.g., MACD, RSI, OBV, ATR).
  • Evaluate event-driven prediction performance using time-series-aware cross-validation and dynamic thresholding strategies.
  • Provide real-world utility in the form of a probabilistic alert system for volatility-prone trading days.

SHAP analysis is integrated to uncover feature interactions that precede volatility (e.g., “high RSI + declining OBV” may precede reversals), offering not just predictive power but also interpretability.


👓 Research Questions

  • Q1. Can TA features detect anomalies 1–3 days in advance? Which indicators lead?

  • Q2. Which features drive predictions? Do they align with financial theory?

  • Q3. How do anomaly thresholds (\(\pm\)3% vs. \(\pm\)5% vs. \(\pm\)7% price; 1.8\(\times\) vs. 2.5\(\times\) volume) impact model performance?
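
A first step toward Q3 is simply counting how many days each threshold pair flags, since the class balance changes sharply with the cutoffs. The sketch below is illustrative: the function name and output columns are hypothetical, and it counts same-day events (before any next-day shift) using `pct_chg` in percent and `vol` as in the dataset.

```python
import pandas as pd


def anomaly_rate_grid(df, ret_thresholds=(3.0, 5.0, 7.0),
                      vol_multipliers=(1.8, 2.5), window=30):
    """Tabulate the number and share of flagged days for each threshold pair."""
    # Rolling volume baseline from the previous `window` days only.
    baseline = df["vol"].shift(1).rolling(window, min_periods=window).mean()
    rows = []
    for r in ret_thresholds:
        for m in vol_multipliers:
            flagged = (df["pct_chg"].abs() > r) | (df["vol"] > m * baseline)
            rows.append({
                "ret_thresh": r,
                "vol_mult": m,
                "n_anomalies": int(flagged.sum()),
                "rate": float(flagged.mean()),
            })
    return pd.DataFrame(rows)
```

Very low anomaly counts at the stricter settings would be a warning that model metrics for those thresholds rest on only a handful of positive examples.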


🚩 Analysis plan

The weekly plan below places EDA before feature engineering, maps tasks to the research questions, and covers all required deliverables (write-up, presentation, website). Tools are listed separately for each week.

Weekly Plan: Predicting Abnormal Volatility in AtHub (603881.SH)

Week 1: Data Collection & Exploratory Analysis (EDA)

  • Tasks:
    • Collect 1+ year of OHLCV data for AtHub using Tushare API.
    • Generate TA features (momentum, volume, volatility, trend indicators).
    • Perform EDA:
      • Visualize price/volume trends and anomaly frequency.
      • Check for missing data, outliers, and stationarity.
      • Analyze correlation between raw price/volume metrics.
    • Define preliminary anomaly thresholds (\(\pm\)5% returns, 2\(\times\) volume).
  • Tools: tushare, pandas, matplotlib, ta, seaborn.

Week 2: Feature Engineering & Baseline Model

  • Tasks:
    • Refine anomaly labels based on EDA insights.
    • Split data chronologically (e.g., 80% train, 20% test).
    • Train baseline models (XGBoost/LightGBM) and evaluate with recall, F1, and MCC, reporting accuracy only as a secondary figure because anomalies are rare.
  • Research Questions Addressed:
    • Q3 (Threshold Impact): Test initial thresholds.
  • Tools: scikit-learn, xgboost.

Week 3: Model Tuning & Interpretability

  • Tasks:
    • Optimize hyperparameters using time-series cross-validation.
    • Compare performance across thresholds (\(\pm\)3%, \(\pm\)5%, \(\pm\)7%).
    • Apply SHAP to identify top predictive features and patterns.
    • Test feature lead times (1–3 days pre-anomaly).
  • Research Questions Addressed:
    • Q1 (Predictive Horizon): Lag feature analysis.
    • Q2 (Feature Importance): SHAP/partial dependence plots.
  • Tools: optuna, shap, statsmodels (Granger causality).
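
The lead-time test in Week 3 can be set up by shifting each indicator back by one to three days. The sketch below shows only the lagging step; the function name and the `_lag{k}` column suffix are hypothetical.

```python
import pandas as pd


def add_lagged_features(df, columns, lags=(1, 2, 3)):
    """Append copies of `columns` shifted by each lag (in trading days).

    A value in `<col>_lag2` on day t is the indicator observed on day t-2,
    so a model restricted to lag-k columns uses information that is at
    least k days old.
    """
    out = df.copy()
    for col in columns:
        for k in lags:
            out[f"{col}_lag{k}"] = df[col].shift(k)
    return out
```

Comparing models trained on lag-1, lag-2, and lag-3 columns separately gives one rough indication of how early each indicator carries signal; the first `k` rows of each lagged column are NaN and would be dropped or left to a model that tolerates missing values.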

Week 4: Final Evaluation & Deliverables

  • Tasks:
    • Write-up (1,000–2,000 words):
      • Introduction, Methods, Results (SHAP plots, threshold analysis), Conclusion.
    • Presentation (5 mins):
      • Quarto slides covering motivation, methods, key findings, Q&A prep.
    • Website:
      • Host report, code, and interactive visualizations (e.g., Plotly dashboards).
    • Repo Organization:
      • Logical structure (e.g., data/, notebooks/, results/).
      • Clear index.qmd as entry point.
  • Tools: Quarto, plotly, pkgdown

Expected Outcomes

  1. Threshold Analysis Results:
    • Precision-recall curves for \(\pm\)3% vs. \(\pm\)5% vs. \(\pm\)7% thresholds
    • Optimal threshold selection based on trading costs
  2. Top Predictive Features:
    • SHAP summary plot of top 10 influential indicators
    • Temporal importance patterns (e.g., volume leads price)
  3. Practical Trading Rules:
    • Actionable signals like:
      “When RSI > 70 AND OBV < 30-day average \(\to\) 67% probability of next-day drop >5%”
  4. Interactive Dashboard:
    • Dynamic visualization of anomaly predictions
    • Threshold adjustment interface
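
A rule of the kind quoted under "Practical Trading Rules" can be checked against history by measuring how often the outcome followed the signal. The sketch below is an assumption-laden illustration: it reads "OBV < 30-day average" as OBV below its own 30-day rolling mean, uses the dataset columns `momentum_rsi`, `volume_obv`, and `pct_chg`, and the function name is hypothetical.

```python
import pandas as pd


def rule_hit_rate(df, rsi_level=70.0, drop_pct=-5.0, window=30):
    """Return (hit_rate, n_signals) for the RSI/OBV rule on historical data."""
    obv_avg = df["volume_obv"].rolling(window, min_periods=window).mean()
    signal = (df["momentum_rsi"] > rsi_level) & (df["volume_obv"] < obv_avg)

    # Return realised on the trading day after the signal.
    next_ret = df["pct_chg"].shift(-1)
    valid = signal & next_ret.notna()

    n_signals = int(valid.sum())
    if n_signals == 0:
        return float("nan"), 0
    hits = int((next_ret[valid] < drop_pct).sum())
    return hits / n_signals, n_signals
```

Reporting the number of signals alongside the hit rate matters here: with only 375 trading days, a rule may fire just a few times, and a percentage based on so few cases should be read cautiously.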

📁 Repository Organization

| Folder / File Name | Description |
|---|---|
| .quarto/ | Internal Quarto system files; manages cache and config for rendering. Not manually edited. |
| _extra/ | Holds supplementary files or artifacts not directly part of deliverables. |
| _freeze/ | Stores frozen snapshots of document outputs to ensure reproducibility across builds. |
| _site/ | Output folder generated when the site is rendered; contains final HTML files. |
| data/ | Contains all datasets used in the project, both raw and processed. Includes README for data schema and source. |
| images/ | Stores all image assets, including plots and figures used in .qmd files. |
| style/ | Contains custom theming files (e.g., customtheming.scss) used to style the website. |
| index.qmd | Landing page of the Quarto website; typically includes a high-level project overview or introduction. |
| about.qmd | Additional project background or author info. Can serve as a detailed project description. |
| proposal.qmd | Contains the research proposal, including motivation, methodology, timeline, and repo organization. |
| presentation.qmd | A Quarto-based presentation (slides) summarizing key findings from the final report. |

References

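Lin, Honghao, Haiqing Tang, Yihong Wang, and Xinjun Miao. 2024. “数据港(603881):国有数据中心龙头,整体经营稳步向上” [AtHub (603881): A Leading State-Owned Data Center Operator with Steadily Improving Operations]. TF Securities (天风证券). https://www.tfzq.com.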