Data Pipeline Bug Analysis - Final Status
Summary
After fixing all identified bugs, the feature count now matches the VAE input dimension (341), but the embeddings remain uncorrelated with the database 0_7 version.
Latest Version: v6
- Feature count: 341 ✓ (matches VAE input dim)
- Mean correlation with DB: 0.0050 (essentially zero)
- Status: All identified bugs fixed, IsST issue documented
- New: Polars-based dataset generation script added (scripts/dump_polars_dataset.py)
Bugs Fixed
1. Market Classification (FlagMarketInjector) ✓ FIXED
- Bug: Used instrument >= 600000, which misclassified 新三板 (NEEQ) instruments
- Fix: Use string prefix matching with vocab_size=2 (not 3)
- Impact: 167 instruments corrected
2. ColumnRemover Missing IsN ✓ FIXED
- Bug: Only removed IsZt, IsDt but not IsN
- Fix: Added IsN to removal list
- Impact: Feature count alignment
3. RobustZScoreNorm Scope ✓ FIXED
- Bug: Applied normalization to all 341 features
- Fix: Only normalize 330 features (alpha158 + market_ext, both original + neutralized)
- Impact: Correct normalization scope
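The scope fix can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the 330 normalized features occupy the first 330 positions of each row, that the center and scale parameters are already fitted, and that outliers are clipped to [-3, 3]. The function name `normalize_row` is hypothetical.

```python
from typing import List, Sequence

N_NORMALIZED = 330  # alpha158 + market_ext, original + neutralized (from the text)


def normalize_row(
    row: Sequence[float],
    center: Sequence[float],
    scale: Sequence[float],
    clip: float = 3.0,  # assumption: outliers are clipped to [-clip, clip]
) -> List[float]:
    """Apply pre-fitted robust z-score parameters to the first 330 values only.

    Values after position 330 (flag-like columns) pass through unchanged.
    """
    if len(center) != N_NORMALIZED or len(scale) != N_NORMALIZED:
        raise ValueError("expected exactly 330 center/scale parameters")
    if len(row) < N_NORMALIZED:
        raise ValueError("row has fewer than 330 features")
    # zip stops after the 330 parameters, so only the head is normalized
    head = [
        max(-clip, min(clip, (x - c) / s))
        for x, c, s in zip(row, center, scale)
    ]
    return head + list(row[N_NORMALIZED:])
```

Passing the parameters positionally makes the ordering requirement explicit: parameter i is always applied to feature i, so any reordering of the input columns silently changes the result.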
4. Wrong Data Sources for Market Flags ✓ FIXED
- Bug: Used Limit, Stopping (Float64) from kline_adjusted
- Fix: Load from correct sources:
  - kline_adjusted: IsZt, IsDt, IsN, IsXD, IsXR, IsDR (Boolean)
  - market_flag: open_limit, close_limit, low_limit, high_stop (Boolean, 4 cols)
- Impact: Correct boolean flag data
5. Feature Count Mismatch ✓ FIXED
- Bug: 344 features (3 extra)
- Fix: vocab_size=2 + 4 market_flag cols = 341 features
- Impact: VAE input dimension matches
6. Fixed* Processors Not Adding Required Columns ✓ FIXED
- Bug: FixedFlagMarketInjector only converted dtype but didn't add market_0, market_1 columns
- Bug: FixedFlagSTInjector only converted dtype but didn't create IsST column from ST_S, ST_Y
- Fix:
  - FixedFlagMarketInjector: Now adds market_0 (SH60xxx, SZ00xxx) and market_1 (SH688xxx, SH689xxx, SZ300xxx, SZ301xxx)
  - FixedFlagSTInjector: Now creates IsST = ST_S | ST_Y
- Impact: Processed data now has 408 columns (was 405), matching original qlib output
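The prefix-based market classification can be sketched in plain Python. This assumes instrument codes are strings with an exchange prefix such as "SH600000"; the helper name `market_flags` and the sample codes are illustrative, not taken from the pipeline.

```python
from typing import Dict, Tuple

# Prefix sets taken from the fix description above
MARKET_PREFIXES: Dict[str, Tuple[str, ...]] = {
    "market_0": ("SH60", "SZ00"),
    "market_1": ("SH688", "SH689", "SZ300", "SZ301"),
}


def market_flags(instrument: str) -> Dict[str, bool]:
    """Return one boolean per market column, based on string prefixes only."""
    # str.startswith accepts a tuple and returns True if any prefix matches
    return {
        name: instrument.startswith(prefixes)
        for name, prefixes in MARKET_PREFIXES.items()
    }


print(market_flags("SH600000"))  # main-board style code
print(market_flags("SZ300750"))  # market_1 style code
```

Unlike a numeric threshold, a code that matches neither prefix set gets both flags set to False instead of being assigned to a market by accident.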
Important Discovery: IsST Column Issue in Gold-Standard Code
Problem Description
The FlagSTInjector processor in the original qlib proc_list is supposed to create an IsST column in the feature_flag group from the ST_S and ST_Y columns in the st_flag group. However, this processor fails silently even in the gold-standard qlib code.
Root Cause
The FlagSTInjector processor attempts to access columns using a format that doesn't match the actual column structure in the data:
- Expected format: The processor expects columns like st_flag::ST_S and st_flag::ST_Y (string format with :: separator)
- Actual format: The qlib handler produces MultiIndex tuple columns like ('st_flag', 'ST_S') and ('st_flag', 'ST_Y')
This format mismatch causes the processor to fail to find the ST flag columns, and thus no IsST column is created.
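The mismatch can be reproduced without qlib. The sketch below is not the FlagSTInjector source; it only shows why a "::"-joined string key never matches a tuple column label, and how a hypothetical helper (`to_tuple_key`) could bridge the two formats.

```python
from typing import List, Optional, Tuple, Union

Column = Tuple[str, str]

# Tuple-style column labels, as produced by a MultiIndex
columns: List[Column] = [
    ("st_flag", "ST_S"),
    ("st_flag", "ST_Y"),
    ("feature_flag", "IsZt"),
]


def find_column(cols: List[Column], key: Union[str, Column]) -> Optional[Column]:
    """Return the key if it is an exact column label, else None (no exception)."""
    return key if key in cols else None


def to_tuple_key(key: str, sep: str = "::") -> Column:
    """Convert 'group::name' into the ('group', 'name') tuple form."""
    group, _, name = key.partition(sep)
    return (group, name)


print(find_column(columns, "st_flag::ST_S"))                # string key: not found
print(find_column(columns, to_tuple_key("st_flag::ST_S")))  # tuple key: found
```

A lookup that returns nothing rather than raising is what makes this kind of failure silent: the downstream step simply never creates its output column.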
Evidence
# Check proc_list
import pickle as pkl

with open('proc_list.proc', 'rb') as f:
    proc_list = pkl.load(f)

# FlagSTInjector config
flag_st = proc_list[2]
print(f"fields_group: {flag_st.fields_group}")  # 'feature_flag'
print(f"col_name: {flag_st.col_name}")  # 'IsST'
print(f"st_group: {flag_st.st_group}")  # 'st_flag'

# Check if IsST exists in processed data
with open('processed_data.pkl', 'rb') as f:
    df = pkl.load(f)

feature_flag_cols = [c[1] for c in df.columns if c[0] == 'feature_flag']
print('IsST' in feature_flag_cols)  # False!
Impact
- VAE training: The VAE model was trained on data without the IsST column
- VAE input dimension: 341 features (excluding IsST), not 342
- Polars pipeline: Should also skip IsST to maintain compatibility
Resolution
The polars-based pipeline (dump_polars_dataset.py) now correctly skips the FlagSTInjector step to match the gold-standard behavior:
# Step 3: FlagSTInjector - SKIPPED (fails even in gold-standard)
print("[3] Skipping FlagSTInjector (as per gold-standard behavior)...")
market_flag_with_st = market_flag_with_market # No IsST added
Lessons Learned
- Verify processor execution: Don't assume all processors in the proc_list executed successfully. Check the output data to verify expected columns exist.
- Column format matters: The qlib processors were designed for specific column formats (MultiIndex tuples vs :: separator strings). Format mismatches can cause silent failures.
- Match the gold-standard bugs: When replicating a pipeline, sometimes you need to replicate the bugs too. The VAE was trained on data without IsST, so our pipeline must also exclude it.
- Debug by comparing intermediate outputs: Use scripts like debug_data_divergence.py to compare raw and processed data between the gold-standard and polars pipelines.
Correlation Results (v5)
| Metric | Value |
|---|---|
| Mean correlation (32 dims) | 0.0050 |
| Median correlation | 0.0079 |
| Min | -0.0420 |
| Max | 0.0372 |
| Overall (flattened) | 0.2225 |
Conclusion: Every per-dimension correlation lies within ±0.05, so the embeddings remain essentially uncorrelated with the database.
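For reference, the two kinds of metric in the table can be computed as sketched below. This is a dependency-free illustration with toy data, not the script that produced the numbers above; the function names are hypothetical. It also shows a general property: pooling all dimensions into one vector can yield a much higher correlation than any single dimension when the dimensions have different means. Whether that explains the 0.2225 value here has not been verified.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def per_dim_correlation(a: List[List[float]], b: List[List[float]]) -> List[float]:
    """Correlate column d of a with column d of b (rows are samples)."""
    return [pearson(ca, cb) for ca, cb in zip(zip(*a), zip(*b))]


def flattened_correlation(a: List[List[float]], b: List[List[float]]) -> float:
    """Correlate all values at once, ignoring which dimension they came from."""
    return pearson([v for r in a for v in r], [v for r in b for v in r])


# Toy data: each dimension is uncorrelated, but dimension 1 has a large offset
emb_a = [[0.0, 10.0], [1.0, 11.0], [0.0, 10.0], [1.0, 11.0]]
emb_b = [[0.0, 10.0], [0.0, 10.0], [1.0, 11.0], [1.0, 11.0]]
print(per_dim_correlation(emb_a, emb_b))
print(flattened_correlation(emb_a, emb_b))
```

Because of this effect, the per-dimension values are the more reliable indicator of whether two embedding sets agree.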
Possible Remaining Issues
- Different input data values: The alpha158_0_7_beta Parquet files may contain different values than the original DolphinDB data used to train the VAE.
- Feature ordering mismatch: The 330 RobustZScoreNorm parameters must be applied in the exact order:
  - [0:158] = alpha158 original
  - [158:316] = alpha158_ntrl
  - [316:323] = market_ext original (7 cols)
  - [323:330] = market_ext_ntrl (7 cols)
- Industry neutralization differences: Our IndusNtrlInjector implementation may differ from qlib's.
- Missing transformations: There may be additional preprocessing steps not captured in handler.yaml.
- VAE model mismatch: The VAE model may have been trained with different data than what handler.yaml specifies.
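The feature ordering requirement listed above can be checked mechanically. The sketch below derives the slice boundaries from the group widths and reports the first position where a pipeline's column order deviates. The helper names are hypothetical, and the input is assumed to be the group name of each of the 330 normalized columns in pipeline order.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Group order and widths taken from the list above
FEATURE_LAYOUT: List[Tuple[str, int]] = [
    ("alpha158", 158),
    ("alpha158_ntrl", 158),
    ("market_ext", 7),
    ("market_ext_ntrl", 7),
]


def build_slices(layout: Sequence[Tuple[str, int]]) -> Tuple[Dict[str, slice], int]:
    """Return {group: slice} plus the total width implied by the layout."""
    slices: Dict[str, slice] = {}
    start = 0
    for name, width in layout:
        slices[name] = slice(start, start + width)
        start += width
    return slices, start


def first_order_mismatch(
    column_groups: Sequence[str], layout: Sequence[Tuple[str, int]]
) -> Optional[int]:
    """Index of the first column whose group breaks the layout, or None."""
    expected = [name for name, width in layout for _ in range(width)]
    if len(column_groups) != len(expected):
        return min(len(column_groups), len(expected))
    for i, (got, want) in enumerate(zip(column_groups, expected)):
        if got != want:
            return i
    return None


slices, total = build_slices(FEATURE_LAYOUT)
print(total, slices["market_ext"])
```

Running such a check before applying the normalization parameters would rule out ordering as a cause, or pinpoint the first misplaced column.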
Recommended Next Steps
- Compare intermediate features: Run both the qlib pipeline and our pipeline on the same input data and compare outputs at each step.
- Verify RobustZScoreNorm parameter order: Check if our feature ordering matches the order used during VAE training.
- Compare predictions, not embeddings: Instead of comparing VAE embeddings, compare the final d033 model predictions with the original 0_7 predictions.
- Check alpha158 data source: Verify that stg_1day_wind_alpha158_0_7_beta_1D contains the same data as the original DolphinDB stg_1day_wind_alpha158_0_7_beta table.
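The step-by-step comparison recommended above can be sketched as a search for the first divergent step. This is an illustration only, not the contents of debug_data_divergence.py: it assumes each pipeline can dump its intermediate outputs as an ordered list of (step name, {column: values}) pairs, and the function name and tolerance are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

StepOutput = Tuple[str, Dict[str, Sequence[float]]]


def first_divergent_step(
    reference: List[StepOutput],
    candidate: List[StepOutput],
    tol: float = 1e-6,  # assumed absolute tolerance
) -> Optional[Tuple[str, str]]:
    """Return (step name, reason) for the first step that differs, else None."""
    for (name_r, out_r), (name_c, out_c) in zip(reference, candidate):
        if name_r != name_c:
            return name_r, "step name mismatch"
        if set(out_r) != set(out_c):
            return name_r, "column set differs"
        for col in sorted(out_r):
            a, b = out_r[col], out_c[col]
            if len(a) != len(b):
                return name_r, f"row count differs in {col}"
            for x, y in zip(a, b):
                # Treat NaN in both pipelines as equal
                if math.isnan(x) and math.isnan(y):
                    continue
                if not math.isclose(x, y, rel_tol=0.0, abs_tol=tol):
                    return name_r, f"values differ in {col}"
    return None


ref = [("load", {"a": [1.0, 2.0]}), ("norm", {"a": [0.0, 1.0]})]
cand = [("load", {"a": [1.0, 2.0]}), ("norm", {"a": [0.0, 1.5]})]
print(first_divergent_step(ref, cand))
```

Stopping at the first difference keeps the investigation focused: every step after a divergence will differ too, so later mismatches carry no extra information.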