
Data Pipeline Bug Analysis

Summary

The generated embeddings do not match the database 0_7 embeddings because of multiple bugs introduced when the data pipeline was migrated from qlib to a standalone Polars implementation.


Bugs Fixed

1. Market Classification (FlagMarketInjector) ✓ FIXED

Original (incorrect):

market_0 = (instrument >= 600000)  # SH
market_1 = (instrument < 600000)   # SZ

Fixed:

inst_str = str(instrument).zfill(6)         # zero-pad so that 1 becomes '000001'
market_0 = inst_str.startswith('6')         # SH: 6xxxxx
market_1 = inst_str.startswith(('0', '3'))  # SZ: 0xxxxx, 3xxxxx
market_2 = inst_str.startswith(('4', '8'))  # NE: 4xxxxx, 8xxxxx

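Impact: 167 instruments (4xxxxx and 8xxxxx, listed on the NEEQ "New Third Board") were misclassified. Under the numeric comparison, 4xxxxx codes fell into SZ and 8xxxxx codes fell into SH.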
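
The difference between the two rules can be checked with a small plain-Python sketch. The function names and the instrument codes below are illustrative assumptions, not taken from the pipeline; only the comparison logic mirrors the snippets above.

```python
def old_flags(instrument):
    """Original numeric rule: every code is either SH or SZ, never NE."""
    return (instrument >= 600000, instrument < 600000, False)


def new_flags(instrument):
    """Prefix rule from the fix: returns (market_0, market_1, market_2)."""
    inst_str = str(instrument).zfill(6)
    return (
        inst_str.startswith("6"),         # SH
        inst_str.startswith(("0", "3")),  # SZ
        inst_str.startswith(("4", "8")),  # NE
    )


# Illustrative codes, one per prefix (not real pipeline data).
for code in (600001, 1, 300001, 400001, 800001):
    changed = old_flags(code) != new_flags(code)
    print(str(code).zfill(6), old_flags(code), new_flags(code), "CHANGED" if changed else "")
```

Only the 4xxxxx and 8xxxxx rows are marked as changed. Note that under the prefix rule a code with any other leading digit gets all three flags set to False.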


2. ColumnRemover Missing IsN ✓ FIXED

Original (incorrect):

columns_to_remove = ['TotalValue_diff', 'IsZt', 'IsDt']

Fixed:

columns_to_remove = ['TotalValue_diff', 'IsN', 'IsZt', 'IsDt']

Impact: The extra IsN column caused a feature dimension mismatch.


3. RobustZScoreNorm Applied to Wrong Columns ✓ FIXED

Original (incorrect): Applied normalization to ALL 341 features including market flags and indus_idx.

Fixed: Only normalize alpha158 + alpha158_ntrl + market_ext + market_ext_ntrl (330 features), excluding:

  • Market flags (Limit, Stopping, IsTp, IsXD, IsXR, IsDR, market_0, market_1, market_2, IsST)
  • indus_idx
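
The ten market flags plus indus_idx account for the 341 - 330 = 11 excluded columns. The selective normalization can be sketched as below. This is a minimal illustration under assumptions: it computes the median and MAD from the data itself, whereas the pipeline applies pre-fitted qlib parameters, and the 1.4826 scale factor and the clip to [-3, 3] reflect the usual robust z-score definition rather than anything confirmed by this document. The column name feat_a is hypothetical.

```python
from statistics import median

# Columns that must pass through untouched (from the list above).
SKIP = {"Limit", "Stopping", "IsTp", "IsXD", "IsXR", "IsDR",
        "market_0", "market_1", "market_2", "IsST", "indus_idx"}
assert len(SKIP) == 341 - 330


def robust_zscore(values, clip=3.0, eps=1e-12):
    """(x - median) / (1.4826 * MAD), clipped to [-clip, clip]."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad + eps  # eps avoids division by zero on constant columns
    return [max(-clip, min(clip, (v - med) / scale)) for v in values]


def normalize_features(table, skip=SKIP):
    """Normalize every column except those named in `skip`."""
    return {name: (list(col) if name in skip else robust_zscore(col))
            for name, col in table.items()}


# Toy table: one continuous feature with an outlier, one market flag.
table = {"feat_a": [1.0, 2.0, 3.0, 4.0, 100.0], "market_0": [1, 0, 1, 0, 1]}
out = normalize_features(table)
print(out["feat_a"])    # outlier is clipped to 3.0
print(out["market_0"])  # unchanged
```

Normalizing a 0/1 flag with median and MAD would distort it badly (the MAD of a mostly-zero column is 0), which is why the flags are skipped.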

Critical Remaining Issue: Data Schema Mismatch

Limit and Stopping Column Types Changed

Original qlib pipeline expected:

  • Limit: Boolean flag (True = limit up)
  • Stopping: Boolean flag (True = suspended trading)

Current Parquet data has:

  • Limit: Float64 price change percentage (0.0 to 1301.3)
  • Stopping: Float64 price change percentage

Evidence:

Limit values sample: [8.86, 9.36, 31.0, 7.32, 2.28, 6.39, 5.38, 4.03, 3.86, 9.89]
Limit == 0: only 2 rows
Limit > 0: 3738 rows

This is a fundamental data schema change. The current Parquet files contain different data than what the original VAE model was trained on.

Possible fixes:

  1. Convert Limit and Stopping to boolean flags using a threshold
  2. Find the original data source that had boolean flags
  3. Re-train the VAE model with the new data schema
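
A quick schema check makes the mismatch explicit, and the first fix can be sketched alongside it. Both helpers are hypothetical, and the threshold value in the usage line is an arbitrary placeholder: the document does not say which threshold would reproduce the original flags.

```python
def is_flag_like(values):
    """True if every value is 0 or 1 (False/True also count)."""
    return all(v in (0, 1) for v in values)


def to_flag(values, threshold):
    """Hypothetical conversion: flag rows whose value is at or above `threshold`."""
    return [v >= threshold for v in values]


# Sample taken from the evidence above.
limit_sample = [8.86, 9.36, 31.0, 7.32, 2.28, 6.39, 5.38, 4.03, 3.86, 9.89]

print(is_flag_like(limit_sample))  # False: the column is not a boolean flag

# 9.5 is a placeholder only; the right threshold must come from the original data definition.
flags = to_flag(limit_sample, threshold=9.5)
print(sum(flags), "of", len(flags), "rows flagged")
```

Running is_flag_like on Limit and Stopping at load time would catch this kind of schema drift before the features reach the model.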

Correlation Results

After fixing bugs 1-3, the embedding correlation with database 0_7 is as follows:

Metric                        Value
Mean correlation (32 dims)    0.0068
Median correlation            0.0094
Overall correlation           0.2330

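Conclusion: Per-dimension correlations remain essentially zero (mean 0.0068, median 0.0094), and the overall correlation of 0.2330 is far too weak to indicate matching embeddings.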
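
For reference, here is one way such metrics could be computed. It is a sketch under assumptions: it treats "overall correlation" as the Pearson correlation of the flattened matrices, which may not be how the figure above was produced, and the toy data is invented to show an effect rather than to reproduce the results.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined for constant input)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def compare_embeddings(generated, reference):
    """Rows are samples, columns are embedding dimensions."""
    per_dim = [pearson(g, r) for g, r in zip(zip(*generated), zip(*reference))]
    flat_gen = [v for row in generated for v in row]
    flat_ref = [v for row in reference for v in row]
    return {"mean": mean(per_dim),
            "median": median(per_dim),
            "overall": pearson(flat_gen, flat_ref)}


# Toy data: within each dimension the two sets are uncorrelated,
# but the dimensions have very different offsets (about 0 vs about 10).
generated = [[0.0, 10.0], [1.0, 11.0], [0.0, 10.0], [1.0, 11.0]]
reference = [[0.0, 10.0], [0.0, 10.0], [1.0, 11.0], [1.0, 11.0]]
print(compare_embeddings(generated, reference))
```

In the toy data the per-dimension correlations are zero while the flattened correlation is close to 1, because flattening picks up the differences between dimension means. If the overall figure was computed this way, a nonzero value is compatible with per-dimension values near zero; the per-dimension numbers are the more meaningful check.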


Root Cause

The Limit/Stopping data schema change is the most likely root cause. The VAE model learned to encode features that included binary limit/stopping flags, but the standalone pipeline feeds it continuous price change percentages instead.


Next Steps

  1. Verify original data schema:

    • Check if the original DolphinDB table had boolean Limit and Stopping columns
    • Compare with the current Parquet schema
  2. Fix the data loading:

    • Either convert continuous values to binary flags
    • Or use the correct boolean columns (IsZt, IsDt) for limit flags
  3. Verify feature order:

    • Ensure the qlib RobustZScoreNorm parameters are applied in the correct order
    • Check that [alpha158, alpha158_ntrl, market_ext, market_ext_ntrl] matches the 330-parameter shape
  4. Re-run comparison:

    • Generate new embeddings with the corrected pipeline
    • Compare correlation with database
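
Step 3 can be turned into a guard that runs before normalization. The sketch below is hypothetical: the helper name, the dictionary layout, and the toy column names are assumptions, and only the group order and the idea of matching the column count against the parameter count come from the steps above.

```python
GROUP_ORDER = ["alpha158", "alpha158_ntrl", "market_ext", "market_ext_ntrl"]


def ordered_columns(groups, n_params):
    """Concatenate the groups in GROUP_ORDER and check the total against n_params."""
    missing = [g for g in GROUP_ORDER if g not in groups]
    if missing:
        raise KeyError(f"missing feature groups: {missing}")
    columns = [c for g in GROUP_ORDER for c in groups[g]]
    if len(columns) != n_params:
        raise ValueError(
            f"{len(columns)} columns but {n_params} normalization parameters")
    if len(set(columns)) != len(columns):
        raise ValueError("duplicate column names across groups")
    return columns


# Toy groups with made-up names; the real pipeline would pass n_params=330.
toy = {
    "alpha158": ["a1", "a2"],
    "alpha158_ntrl": ["a1_ntrl", "a2_ntrl"],
    "market_ext": ["m1"],
    "market_ext_ntrl": ["m1_ntrl"],
}
print(ordered_columns(toy, n_params=6))
```

A count check cannot detect two same-sized groups being swapped, so comparing the resulting column names against the order used when the parameters were fitted is still necessary.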