Forecasting Anomalies in AtHub’s Stock Behavior

INFO 523 - Final Project

Project description
Author
Affiliation

Annabelle Zhu

College of Information Science, University of Arizona

Abstract

This project investigates whether abnormal price and volume fluctuations in AtHub (603881.SH)—a Chinese data center infrastructure firm—can be predicted using technical analysis (TA) features. We define volatility anomalies as daily returns exceeding ±5% or volume surges exceeding twice the 30-day rolling average. Drawing on over 30 engineered TA indicators spanning momentum, trend, volume, and volatility categories, we construct a supervised learning pipeline to forecast next-day anomalies. The model is evaluated using time-aware cross-validation and interpreted through SHAP analysis to reveal leading patterns and feature contributions. Results suggest that certain TA combinations (e.g., high RSI with declining OBV) consistently precede large movements, demonstrating the potential of interpretable, data-driven tools for anomaly detection in high-volatility equities.


Introduction

Predicting sudden shifts in equity price or trading volume is a long-standing challenge in financial forecasting, particularly for high-volatility stocks sensitive to external shocks. This project centers on AtHub (603881.SH), a stock known for its erratic short-term behavior and policy-driven sensitivity, to assess whether machine learning models can detect early signs of abnormal market activity. Unlike traditional models that aim to forecast precise price levels, our approach reframes the task as a binary classification problem focused on identifying rare but impactful events. We rely exclusively on market-based features—technical indicators derived from historical prices and volumes—to build a predictive framework that aligns with real-world constraints where external signals (e.g., news sentiment, fundamentals) may be unavailable or delayed. By integrating explainable AI methods into the model workflow, this project also emphasizes transparency and trustworthiness in financial ML applications.


Research Questions

  • Q1. Can TA features detect anomalies 1–3 days in advance? Which indicators lead?

  • Q2. Which features drive predictions? Do they align with financial theory?

  • Q3. How do anomaly thresholds (\(\pm\) 3% vs. \(\pm\) 5% vs. \(\pm\) 7% price; 1.8\(\times\) vs. 2.5\(\times\) volume) impact model performance?


Exploratory Analysis

Loading and Initial Preparation

Total observations: 375
Number of Columns: 31

Target Variable Engineering

Define the binary target: will there be an anomaly tomorrow?
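A minimal sketch of this labeling step, assuming the daily bars sit in a pandas DataFrame `df` whose `pct_chg` column is expressed in percent; the column names `vol_ma30`, `anomaly`, and `target` match the dataset listing later in this report, but the exact code is an assumption:

```python
import pandas as pd

# 30-day rolling volume baseline (first 29 rows are NaN, matching the
# missing-value counts reported below).
df["vol_ma30"] = df["vol"].rolling(window=30).mean()

price_anomaly = df["pct_chg"].abs() >= 5          # |daily return| >= 5%
volume_anomaly = df["vol"] > 2 * df["vol_ma30"]   # volume > 2x 30-day average
df["anomaly"] = (price_anomaly | volume_anomaly).astype(int)

# Binary target: does an anomaly occur on the *next* trading day?
df["target"] = df["anomaly"].shift(-1)
```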

To better understand the imbalance in the target variable, we plot the proportion of anomaly vs. normal days. An anomaly day is defined as either a \(\pm\) 5% price change or a volume spike above twice the 30-day moving average. The bar chart highlights the class imbalance, a common challenge in financial anomaly detection.

Class Distribution of Target Labels

Data Preprocessing

Data Cleaning

Missing values per column:
ts_code               0
open                  0
high                  0
low                   0
close                 0
pct_chg               0
vol                   0
amount                0
volume_obv            0
volume_cmf            0
volume_vpt            0
volume_vwap           0
volume_mfi            0
volatility_bbw        0
volatility_atr        0
volatility_ui         0
trend_macd            0
trend_macd_signal     0
trend_macd_diff       0
trend_adx             0
trend_adx_pos         0
trend_adx_neg         0
momentum_rsi          0
momentum_wr           0
momentum_roc          0
momentum_ao           0
momentum_ppo_hist     0
trend_cci             0
trend_aroon_up        0
trend_aroon_down      0
trend_aroon_ind       0
vol_ma30             29
anomaly               0
target                0
dtype: int64
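The 29 missing values in `vol_ma30` are simply the 30-day rolling window's warm-up period. One straightforward remedy (an assumption, not necessarily the exact choice made here) is to drop those initial rows:

```python
# Drop the warm-up rows where the 30-day volume average is undefined.
df = df.dropna(subset=["vol_ma30"]).reset_index(drop=True)
```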

Data Reduction

Remove unnecessary columns

Remaining features: 30

Correlation Analysis

Correlation Matrix of Selected Features

No highly correlated feature pairs were found, so all features are retained at this stage.
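A sketch of how such a redundancy check might be run; the \(|r| > 0.9\) cutoff is an illustrative assumption:

```python
import numpy as np

# Absolute pairwise correlations among numeric features.
corr = df.select_dtypes("number").corr().abs()

# Keep only the upper triangle (each pair counted once, diagonal excluded).
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
high_pairs = corr.where(mask).stack()

print(high_pairs[high_pairs > 0.9])   # empty output => no highly correlated pairs
```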

Data Transformation

Feature skewness before transformation:
vol           2.260647
amount        2.817781
volume_obv    2.174151
volume_vpt    0.949351
dtype: float64

The output shows that vol, amount, and volume_obv are highly right-skewed, while volume_vpt is mildly right-skewed. We apply a log transformation to reduce this skewness.
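A sketch of the transformation; since OBV and VPT can take negative values, each series is shifted to be non-negative before `log1p` (the shifting detail is an assumption not stated in the report):

```python
import numpy as np

for col in ["vol", "amount", "volume_obv", "volume_vpt"]:
    shifted = df[col] - min(df[col].min(), 0)   # ensure non-negative input
    df[f"log_{col}"] = np.log1p(shifted)        # compress the right tail
```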

Feature Engineering

Creating Lag Features

To capture predictive patterns leading up to volatility events, we create lagged versions of key indicators. This allows the model to detect precursor signals 1-3 days before anomalies.

These lagged features serve as candidate leading indicators, designed to capture anomaly signals up to 3 days ahead of their occurrence.
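A sketch of the lag construction; the naming convention matches features cited later (e.g., `volatility_atr_lag1`, `momentum_rsi_lag2`), though the full set of lagged indicators is an assumption:

```python
# Lag selected indicators by 1-3 trading days as candidate leading signals.
lag_cols = ["momentum_rsi", "volatility_atr", "volume_cmf", "trend_macd"]
for col in lag_cols:
    for k in (1, 2, 3):
        df[f"{col}_lag{k}"] = df[col].shift(k)
```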

Creating Rolling Statistics

Rolling window statistics help capture evolving market conditions and short-term trends that may precede volatility events.
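A sketch of the rolling statistics; the 5- and 10-day windows match feature names seen later (e.g., `volatility_atr_ma10`, `log_volume_vpt_ma5`), while the exact column set is an assumption:

```python
# Rolling means capture the short-term level; rolling stds capture dispersion.
for col in ["volatility_atr", "log_volume_vpt", "momentum_rsi"]:
    for w in (5, 10):
        df[f"{col}_ma{w}"] = df[col].rolling(window=w).mean()
        df[f"{col}_std{w}"] = df[col].rolling(window=w).std()
```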

Interaction Features

We create interaction terms between key indicators that financial theory suggests may combine to signal impending volatility.
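A sketch of these terms; the three names match the interactions analyzed in the SHAP section, but the simple-product formulas are an assumption:

```python
# Overbought momentum coinciding with heavy volume.
df["rsi_vol_interaction"] = df["momentum_rsi"] * df["log_vol"]
# Money-flow (OBV) shifts coinciding with volatility expansion.
df["obv_atr_interaction"] = df["log_volume_obv"] * df["volatility_atr"]
# Trend divergence coinciding with heavy volume.
df["macd_vol_interaction"] = df["trend_macd"] * df["log_vol"]
```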

Feature Importance

We use mutual information to identify the most predictive features for our anomaly target.

Top 20 features by mutual information:
['log_amount', 'log_vol', 'high', 'volume_vwap', 'open', 'low', 'volatility_atr_lag1', 'trend_macd', 'volatility_atr', 'log_volume_vpt_ma5', 'volatility_atr_ma10', 'volatility_atr_lag2', 'close', 'trend_cci', 'volatility_atr_lag3', 'momentum_rsi_lag2', 'volatility_ui', 'rsi_vol_interaction', 'log_volume_vpt', 'pct_chg']
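A sketch of how this ranking might be computed; the feature-matrix construction is illustrative:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Drop warm-up/lag/last-row NaNs, then separate features from labels
# (exact column handling is an assumption).
data = df.drop(columns=["ts_code"]).dropna()
X = data.drop(columns=["anomaly", "target"])
y = data["target"]

mi = mutual_info_classif(X, y, random_state=42)   # nonparametric dependence measure
mi_rank = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(mi_rank.head(20).index.tolist())
```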


Baseline Model Development

Train-Test Split
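Because the data form a time series, the split is chronological rather than shuffled; a minimal sketch, reusing `X` and `y` from the mutual-information step, with the 80/20 ratio inferred from the reported train/test sizes (268 / 68):

```python
# Chronological split: train on the earlier 80%, test on the most recent 20%.
split = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]
```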

Handling Class Imbalance

To address the significant class imbalance (\(\approx\) 15% anomalies), we implement class weighting in our models to prioritize correct identification of rare events.

Class weights: {0.0: 0.6118721461187214, 1.0: 2.7346938775510203}

Handling class imbalance ensures the model does not ignore rare but important anomalies, which is essential for a volatility anomaly detection task.
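A short sketch reproducing the "balanced" weights shown above, where each class weight is \(n_{\text{samples}} / (n_{\text{classes}} \cdot n_c)\):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)                       # array([0., 1.])
weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))        # {0.0: ~0.61, 1.0: ~2.73}
```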

Model Selection and Initialization

We initialize three baseline models with class weighting to address imbalance:

  1. Logistic Regression – interpretable linear baseline
  2. XGBoost – robust gradient boosting
  3. LightGBM – efficient for large feature spaces
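A sketch of this initialization; the Logistic Regression settings mirror those shown in the grid-search output later, while the booster parameters are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# XGBoost weights positives via the negative/positive ratio instead of a dict.
scale = class_weights[1.0] / class_weights[0.0]

models = {
    "Logistic Regression": LogisticRegression(class_weight="balanced",
                                              max_iter=3000, random_state=42),
    "XGBoost": XGBClassifier(scale_pos_weight=scale, eval_metric="logloss",
                             random_state=42),
    "LightGBM": LGBMClassifier(class_weight="balanced", random_state=42),
}
```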

Model Training

We train all models on the training set while preserving the temporal order of data.
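A minimal sketch of the training loop behind the log output below, fitting each model on the chronologically earlier split:

```python
for name, model in models.items():
    print(f"Training {name}")
    model.fit(X_train, y_train)   # no shuffling: temporal order preserved
```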

Training Logistic Regression
Training XGBoost
Training LightGBM
[LightGBM] [Warning] min_data_in_leaf is set=1, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=1
[LightGBM] [Warning] min_gain_to_split is set=0.0, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.0
[LightGBM] [Info] Number of positive: 49, number of negative: 219
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000452 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3968
[LightGBM] [Info] Number of data points in the train set: 268, number of used features: 55
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf (message repeated 19 times)

Baseline Evaluation

We evaluate model performance using time-series appropriate metrics focused on anomaly detection capability.
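A sketch of the evaluation loop that produces the reports below, reusing the fitted `models` and the chronological test split from the earlier sketches:

```python
from sklearn.metrics import classification_report, matthews_corrcoef

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"{name} Classification Report:")
    print(classification_report(y_test, y_pred))
    print(f"MCC: {matthews_corrcoef(y_test, y_pred):.3f}")
```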

Logistic Regression Classification Report:
              precision    recall  f1-score   support

         0.0       0.95      0.78      0.86        54
         1.0       0.50      0.86      0.63        14

    accuracy                           0.79        68
   macro avg       0.73      0.82      0.74        68
weighted avg       0.86      0.79      0.81        68

XGBoost Classification Report:
              precision    recall  f1-score   support

         0.0       0.90      0.87      0.89        54
         1.0       0.56      0.64      0.60        14

    accuracy                           0.82        68
   macro avg       0.73      0.76      0.74        68
weighted avg       0.83      0.82      0.83        68

LightGBM Classification Report:
              precision    recall  f1-score   support

         0.0       0.88      0.80      0.83        54
         1.0       0.42      0.57      0.48        14

    accuracy                           0.75        68
   macro avg       0.65      0.68      0.66        68
weighted avg       0.78      0.75      0.76        68

Baseline Model Performance Comparison

🧩 Confusion Matrix Analysis

The confusion matrices above illustrate the detailed classification outcomes for each model:

  • Logistic Regression:

    • Correctly identified 12 out of 14 anomalies (true positives), with only 2 false negatives.
    • Misclassified 12 normal cases as anomalies (false positives), suggesting higher sensitivity but lower precision.
  • XGBoost:

    • Achieved a more balanced trade-off, with 9 true positives and 5 false negatives, while maintaining fewer false positives (7).
    • Indicates more conservative but precise predictions.
  • LightGBM:

    • Detected 8 anomalies, missing 6, and misclassified 11 normal cases as anomalies.
    • Shows relatively weaker performance both in recall and precision.

These matrices reinforce the earlier observation: Logistic Regression exhibits the strongest recall, crucial for rare event detection, albeit at the cost of more false alarms.


📊 Baseline Model Performance Comparison

To evaluate the effectiveness of different classification models in identifying short-term volatility anomalies, we trained three baselines with class weighting to mitigate the heavy class imbalance (\(\approx\) 15% anomalies):

  • Logistic Regression
  • XGBoost
  • LightGBM

The bar chart above compares their performance on three key evaluation metrics:

  • Recall (Sensitivity): Measures the model’s ability to correctly detect anomalies (true positives).
  • F1-Score: Harmonic mean of precision and recall, balancing false positives and false negatives.
  • MCC (Matthews Correlation Coefficient): A balanced metric even for imbalanced classes, ranging from -1 to 1.

🔍 Observations:

  • Logistic Regression performed best across all metrics:

    • It achieved the highest recall (~86%), indicating strong ability to detect rare anomaly cases.
    • Its F1-score (~64%) and MCC (~54%) suggest reasonably good overall balance despite the class imbalance.
  • XGBoost delivered moderate recall (~64%) and slightly lower F1 and MCC, suggesting it is more conservative but still effective.

  • LightGBM underperformed in this setup:

    • Although recall was fair (~57%), its MCC dropped below 0.4, indicating weaker overall discriminative power.

Model Refinement

Cross-Validation for Robustness Assessment

To ensure our models generalize well and to get a more reliable estimate of performance, we implement stratified k-fold cross-validation. This approach maintains the class distribution in each fold, which is crucial given our imbalanced dataset.
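A sketch of this check for the strongest baseline; the fold settings match the grid-search output below, while the recall scorer is an assumption consistent with the stated tuning objective:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(models["Logistic Regression"], X_train, y_train,
                         cv=cv, scoring="recall")
print(f"Recall per fold: {scores.round(3)}; mean = {scores.mean():.3f}")
```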

Hyperparameter Tuning for Improved Performance

We focus on tuning the Logistic Regression model since it showed the best performance in our baseline evaluation. We optimize for recall to maximize anomaly detection while balancing precision through regularization.

Fitting 5 folds for each of 28 candidates, totalling 140 fits
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
             estimator=LogisticRegression(class_weight='balanced',
                                          max_iter=3000, random_state=42),
             n_jobs=-1,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'penalty': ['l1', 'l2'],
                         'solver': ['liblinear', 'saga']},
             scoring='recall', verbose=1)

We prioritize recall because, in early-warning systems, missing a real event is costlier than investigating a few false alerts.

Model Evaluation

Best parameters: {'C': 0.001, 'penalty': 'l1', 'solver': 'liblinear'}
Best recall score: 0.9077

We conducted hyperparameter tuning on the Logistic Regression model using a 5-fold stratified cross-validation strategy. The tuning process explored various combinations of regularization strength (C), penalty types (l1, l2), and solvers compatible with L1 regularization (liblinear, saga).

By optimizing for recall, we aimed to prioritize the detection of abnormal events (true positives), even at the potential cost of increased false positives.

The best-performing configuration is as follows:

  • C: 0.001
  • Penalty: L1
  • Solver: liblinear
  • Cross-validated Recall: 0.9077

This configuration reflects a strong preference for sparsity and regularization, which is suitable for handling high-dimensional or potentially collinear feature spaces. The high recall indicates the model is effective at identifying rare but critical anomaly events.

We use this best estimator for final model training and evaluation.

              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00        54
         1.0       0.21      1.00      0.34        14

    accuracy                           0.21        68
   macro avg       0.10      0.50      0.17        68
weighted avg       0.04      0.21      0.07        68

With C = 0.001, the strong L1 penalty shrinks nearly all coefficients to zero, leaving the class-weighted intercept to dominate: the model flags every observation as an anomaly. It is therefore extremely sensitive (perfect recall) but sacrifices all specificity, which may be tolerable for a crude early-warning trigger but is impractical for production without relaxing the regularization or rebalancing the recall-precision trade-off.


Model Interpretation with SHAP

To address our research question about which features drive predictions and whether they align with financial theory, we use SHAP (SHapley Additive exPlanations) analysis on our best-performing model.
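A sketch of the SHAP workflow for the tuned logistic model; the `grid_search` variable name and the explainer choice are assumptions:

```python
import shap

# Assumes `grid_search` is the fitted GridSearchCV shown above.
best_model = grid_search.best_estimator_

# LinearExplainer suits a linear model; the training set serves as the
# background distribution.
explainer = shap.LinearExplainer(best_model, X_train)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test, plot_type="bar")           # mean |SHAP| ranking
shap.dependence_plot("rsi_vol_interaction", shap_values, X_test)  # directional effects
```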

SHAP Feature Importance and Dependence Plots

  1. Feature Ranking:
    • rsi_vol_interaction (top) has the highest mean absolute SHAP value (0.0200), meaning it has the largest average impact on predictions
    • Lagged features appear lower but still significant (e.g., volume_cmf_lag3)
  2. Directional Impact (from SHAP dependence plots):
    • High rsi_vol_interaction \(\to\) Increases anomaly probability
    • Low obv_atr_interaction \(\to\) Increases anomaly probability
    • Extreme macd_vol_interaction values (both high/low) \(\to\) Raise alerts
  3. Financial Theory Alignment:
    • Interaction terms dominate, confirming that anomalies emerge from combinations of:
      • Overbought conditions (high RSI) + Volume spikes
      • MACD divergence + Volatility expansion
      • OBV breakdown + ATR surge