# Data Pipeline Bug Analysis

## Summary

The generated embeddings do not match the database `0_7` embeddings due to multiple bugs in the data pipeline migration from qlib to a standalone Polars implementation.

---

## Bugs Fixed

### 1. Market Classification (`FlagMarketInjector`) ✓ FIXED

**Original (incorrect):**

```python
market_0 = (instrument >= 600000)  # SH
market_1 = (instrument < 600000)   # SZ
```

**Fixed:**

```python
inst_str = str(instrument).zfill(6)
market_0 = inst_str.startswith('6')                             # SH: 6xxxxx
market_1 = inst_str.startswith('0') | inst_str.startswith('3')  # SZ: 0xxxxx, 3xxxxx
market_2 = inst_str.startswith('4') | inst_str.startswith('8')  # NEEQ: 4xxxxx, 8xxxxx
```

**Impact:** 167 instruments (4xxxxx and 8xxxxx, the NEEQ / 新三板 board) were misclassified.

---

### 2. `ColumnRemover` Missing `IsN` ✓ FIXED

**Original (incorrect):**

```python
columns_to_remove = ['TotalValue_diff', 'IsZt', 'IsDt']
```

**Fixed:**

```python
columns_to_remove = ['TotalValue_diff', 'IsN', 'IsZt', 'IsDt']
```

**Impact:** The extra column caused a feature dimension mismatch.

---

### 3. `RobustZScoreNorm` Applied to Wrong Columns ✓ FIXED

**Original (incorrect):** Applied normalization to all 341 features, including the market flags and `indus_idx`.

**Fixed:** Only normalize `alpha158 + alpha158_ntrl + market_ext + market_ext_ntrl` (330 features), excluding:

- Market flags (`Limit`, `Stopping`, `IsTp`, `IsXD`, `IsXR`, `IsDR`, `market_0`, `market_1`, `market_2`, `IsST`)
- `indus_idx`

---

## Critical Remaining Issue: Data Schema Mismatch

### `Limit` and `Stopping` Column Types Changed

**Original qlib pipeline expected:**

- `Limit`: **Boolean** flag (True = limit up)
- `Stopping`: **Boolean** flag (True = suspended trading)

**Current Parquet data has:**

- `Limit`: **Float64** price-change percentage (0.0 to 1301.3)
- `Stopping`: **Float64** price-change percentage

**Evidence:**

```
Limit values sample: [8.86, 9.36, 31.0, 7.32, 2.28, 6.39, 5.38, 4.03, 3.86, 9.89]
Limit == 0: only 2 rows
Limit > 0:  3738 rows
```

This is a **fundamental data schema change**.
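One way to bridge this mismatch is to re-derive boolean flags from the continuous percentages. The sketch below is a hypothetical illustration, not the pipeline's actual code: the `9.8` cutoff is an assumed main-board limit-up threshold and would need to be confirmed against the original data semantics (boards with 20% or 30% bands would need different cutoffs).

```python
import numpy as np

# Assumed threshold (percent) above which a day counts as limit-up on the
# main board. This value is an assumption for illustration, not confirmed.
LIMIT_UP_THRESHOLD = 9.8

def to_limit_flag(pct_change: np.ndarray,
                  threshold: float = LIMIT_UP_THRESHOLD) -> np.ndarray:
    """Convert continuous price-change percentages to boolean limit flags."""
    return pct_change >= threshold

# Sample values taken from the evidence above.
sample = np.array([8.86, 9.36, 31.0, 7.32, 2.28, 6.39, 5.38, 4.03, 3.86, 9.89])
flags = to_limit_flag(sample)
print(int(flags.sum()))  # rows flagged as limit-up under this threshold -> 2
```

The same thresholding idea would apply to `Stopping`, though a suspended-trading flag is more naturally derived from volume or trading-status columns than from a price-change percentage.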
The current Parquet files contain different data than what the original VAE model was trained on.

**Possible fixes:**

1. Convert `Limit` and `Stopping` to boolean flags using a threshold
2. Find the original data source that had boolean flags
3. Re-train the VAE model with the new data schema

---

## Correlation Results

After fixing bugs 1-3, the embedding correlation with database `0_7`:

| Metric | Value |
|--------|-------|
| Mean correlation (32 dims) | 0.0068 |
| Median correlation | 0.0094 |
| Overall correlation | 0.2330 |

**Conclusion:** The embeddings remain essentially uncorrelated (≈ 0).

---

## Root Cause

The **`Limit`/`Stopping` data schema change** is the most likely root cause. The VAE model learned to encode features that included binary limit/stopping flags, but the standalone pipeline feeds it continuous price-change percentages instead.

---

## Next Steps

1. **Verify the original data schema:**
   - Check whether the original DolphinDB table had boolean `Limit` and `Stopping` columns
   - Compare with the current Parquet schema
2. **Fix the data loading:**
   - Either convert the continuous values to binary flags
   - Or use the correct boolean columns (`IsZt`, `IsDt`) as limit flags
3. **Verify feature order:**
   - Ensure the qlib `RobustZScoreNorm` parameters are applied in the correct order
   - Check that `[alpha158, alpha158_ntrl, market_ext, market_ext_ntrl]` matches the 330-parameter shape
4. **Re-run the comparison:**
   - Generate new embeddings with the corrected pipeline
   - Compare correlation with the database embeddings
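For the re-run comparison step, a minimal sketch of the metric computation: given two `(n_samples, 32)` embedding matrices, one regenerated by the corrected pipeline and one pulled from the database `0_7` table, compute the per-dimension and overall Pearson correlations reported in the table above. The function name and shapes are illustrative assumptions, not the project's actual comparison code.

```python
import numpy as np

def embedding_correlations(a: np.ndarray, b: np.ndarray):
    """Per-dimension and flattened-overall Pearson correlation of two
    equally shaped (n_samples, n_dims) embedding matrices."""
    per_dim = np.array([
        np.corrcoef(a[:, i], b[:, i])[0, 1] for i in range(a.shape[1])
    ])
    overall = np.corrcoef(a.ravel(), b.ravel())[0, 1]
    return per_dim, overall

# Sanity check on synthetic data: identical inputs correlate perfectly.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 32))
per_dim, overall = embedding_correlations(x, x)
print(per_dim.mean(), overall)  # both ~1.0 for identical inputs
```

Mean and median of `per_dim` correspond to the first two table rows; `overall` corresponds to the third. A mean near 1.0 on real data would indicate the pipelines agree.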