Alpha158 0_7 vs 0_7_beta Prediction Comparison
This directory contains a workflow for comparing Alpha158 version 0_7 (original) vs 0_7_beta (enhanced with VAE embeddings) predictions.
Overview
The goal is to evaluate whether the beta version of Alpha158 factors produces better predictions than the original 0_7 version when used with the d033 prediction model.
Directory Structure
stock_1d/d033/alpha158_beta/
├── README.md # This file
├── config.yaml # VAE model configuration
├── pipeline.py # Main orchestration script
├── scripts/ # Core pipeline scripts
│ ├── generate_beta_embedding.py # Generate VAE embeddings from beta factors
│ ├── generate_returns.py # Generate actual returns from kline data
│ ├── fetch_predictions.py # Fetch original predictions from DolphinDB
│ ├── predict_with_embedding.py # Generate predictions using beta embeddings
│ ├── compare_predictions.py # Compare 0_7 vs 0_7_beta predictions
│ ├── dump_polars_dataset.py # Dump raw and processed datasets using polars pipeline
│ └── extract_qlib_params.py # Extract RobustZScoreNorm parameters from Qlib proc_list
├── src/ # Source modules
│ └── qlib_loader.py # Qlib data loader with configurable date range
├── config/ # Configuration files
│ └── handler.yaml # Modified handler with configurable end date
├── data/ # Data files (gitignored)
│ ├── robust_zscore_params/ # Pre-fitted normalization parameters
│ │ └── csiallx_feature2_ntrla_flag_pnlnorm/
│ │ ├── mean_train.npy
│ │ ├── std_train.npy
│ │ └── metadata.json
│ ├── embedding_0_7_beta.parquet
│ ├── predictions_beta_embedding.parquet
│ ├── original_predictions_0_7.parquet
│ ├── actual_returns.parquet
│ ├── raw_data_*.pkl # Raw data before preprocessing
│ └── processed_data_*.pkl # Processed data after preprocessing
└── data_polars/ # Polars-generated datasets (gitignored)
├── raw_data_*.pkl
└── processed_data_*.pkl
Data Loading with Configurable Date Range
handler.yaml Modification
The original handler.yaml uses the <TODAY> placeholder, which always loads data up to today's date. The modified version in config/handler.yaml uses a <LOAD_END> placeholder that can be set via command-line arguments:
# Original (always loads until today)
load_start: &load_start <SINCE_DATE>
load_end: &load_end <TODAY>
# Modified (configurable end date)
load_start: &load_start <LOAD_START>
load_end: &load_end <LOAD_END>
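Filling the placeholders before the YAML is parsed can be as simple as string substitution; a minimal sketch (the helper and argument names here are illustrative, not the project's actual API):

```python
from datetime import date

def render_handler(template, load_start, load_end=None):
    """Fill the <LOAD_START>/<LOAD_END> placeholders before YAML parsing.

    When load_end is None, fall back to today's date, mimicking the
    original <TODAY> behaviour.
    """
    if load_end is None:
        load_end = date.today().isoformat()
    return (template
            .replace("<LOAD_START>", load_start)
            .replace("<LOAD_END>", load_end))

template = "load_start: &load_start <LOAD_START>\nload_end: &load_end <LOAD_END>\n"
rendered = render_handler(template, "2019-01-01", "2019-01-31")
```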
Using qlib_loader.py
from stock_1d.d033.alpha158_beta.src.qlib_loader import (
load_data_from_handler,
load_data_with_proc_list,
load_and_dump_data
)
# Load data with configurable date range
df = load_data_from_handler(
since_date="2019-01-01",
end_date="2019-01-31",
buffer_days=20, # Extra days for diff calculations
verbose=True
)
# Load and apply preprocessing pipeline
df_processed = load_data_with_proc_list(
since_date="2019-01-01",
end_date="2019-01-31",
proc_list_path="/path/to/proc_list.proc",
buffer_days=20
)
# Load and dump both raw and processed data to pickle files
raw_df, processed_df = load_and_dump_data(
since_date="2019-01-01",
end_date="2019-01-31",
output_dir="data/",
fill_con_rating_nan=True, # Fill NaN in con_rating_strength column
verbose=True
)
Key Features
- Configurable end date: Unlike the original handler.yaml, the end date is now respected
- Buffer period handling: Automatically loads extra days before since_date for diff calculations
- NaN handling: Optional filling of NaN values in the con_rating_strength column
- Dual output: Saves both raw (before proc_list) and processed (after proc_list) data
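The buffer handling can be pictured as shifting the load window back by buffer_days and trimming the extra rows once the diff features are computed; a sketch under that assumption (helper names are illustrative):

```python
from datetime import date, timedelta
import pandas as pd

def effective_load_start(since_date: str, buffer_days: int) -> str:
    """Shift the load start back so diff features have enough history."""
    start = date.fromisoformat(since_date) - timedelta(days=buffer_days)
    return start.isoformat()

def trim_buffer(df: pd.DataFrame, since_date: str) -> pd.DataFrame:
    """Drop the buffer rows after diffs are computed (ISO dates sort lexically)."""
    return df[df["datetime"] >= since_date]
```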
Processor Fixes
The qlib_loader.py includes fixed implementations of qlib processors that correctly handle the :: separator column format:
- FixedDiff - Fixes column naming bug (creates proper feature::col_diff names)
- FixedColumnRemover - Handles the :: separator format
- FixedRobustZScoreNorm - Uses trained mean_train/std_train parameters from the pickle
- FixedIndusNtrlInjector - Industry neutralization with the :: format
- FixedFlagMarketInjector - Adds market_0, market_1 columns based on instrument codes
- FixedFlagSTInjector - Creates the IsST column from the ST_S, ST_Y flags
All fixed processors preserve the trained parameters from the original proc_list pickle.
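As an illustration of the :: handling, here is a minimal re-implementation in the spirit of FixedColumnRemover (the real class lives in qlib_loader.py; the call signature here is an assumption):

```python
import pandas as pd

class FixedColumnRemover:
    """Drop columns by base name, matching the 'group::name' column format.

    Illustrative sketch only; the project's actual class may differ.
    """
    def __init__(self, fields_to_remove):
        self.fields_to_remove = set(fields_to_remove)

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        # Match on the part after '::' so 'feature::log_size_diff' is caught.
        drop = [c for c in df.columns
                if c.split("::")[-1] in self.fields_to_remove]
        return df.drop(columns=drop)

df = pd.DataFrame({"feature::close": [1.0], "feature::log_size_diff": [2.0]})
df = FixedColumnRemover(["log_size_diff"])(df)
```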
Polars Dataset Generation
The scripts/dump_features.py script generates datasets using a polars-based pipeline that replicates the qlib preprocessing:
# Generate merged features (flat columns)
python scripts/dump_features.py --start-date 2024-01-01 --end-date 2024-01-31 --groups merged
# Generate with struct columns (packed feature groups)
python scripts/dump_features.py --start-date 2024-01-01 --end-date 2024-01-31 --groups merged --pack-struct
# Generate specific feature groups
python scripts/dump_features.py --start-date 2024-01-01 --end-date 2024-01-31 --groups alpha158 market_ext
This script:
- Loads data from Parquet files (alpha158, kline, market flags, industry flags)
- Applies the full processor pipeline:
- Diff processor (adds diff features)
- FlagMarketInjector (adds market_0, market_1)
- ColumnRemover (removes log_size_diff, IsN, IsZt, IsDt)
- FlagToOnehot (converts 29 industry flags to indus_idx)
- IndusNtrlInjector (industry neutralization)
- RobustZScoreNorm (using pre-fitted qlib parameters via from_version())
- Fillna (fill NaN with 0)
- Saves to parquet/pickle format
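The pipeline above amounts to calling each processor in sequence; a minimal sketch of the chaining (the `process()` method name follows the usage shown later in this README, but the exact interface is an assumption):

```python
import pandas as pd

def run_pipeline(df: pd.DataFrame, processors) -> pd.DataFrame:
    """Apply each processor's process() in the order listed above."""
    for proc in processors:
        df = proc.process(df)
    return df

class Fillna:
    """Last step of the list: replace remaining NaN with 0."""
    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.fillna(0.0)
```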
Output modes:
- Flat mode (default): All columns as separate fields (348 columns for merged)
- Struct mode (--pack-struct): Feature groups packed into struct columns:
  - features_alpha158 (316 fields)
  - features_market_ext (14 fields)
  - features_market_flag (11 fields)
Note: The FlagSTInjector step is skipped because it fails silently even in the gold-standard qlib code (see BUG_ANALYSIS_FINAL.md for details).
Output structure:
- Raw data: ~204 columns (158 feature + 4 feature_ext + 12 feature_flag + 30 indus_flag)
- Processed data: 348 columns (318 alpha158 + 14 market_ext + 14 market_flag + 2 index)
- VAE input dimension: 341 (excluding indus_idx)
RobustZScoreNorm Parameter Extraction
The pipeline uses pre-fitted normalization parameters extracted from Qlib's proc_list.proc file. These parameters are stored in data/robust_zscore_params/ and can be loaded using the RobustZScoreNorm.from_version() method.
Extract parameters from Qlib proc_list:
python scripts/extract_qlib_params.py --version csiallx_feature2_ntrla_flag_pnlnorm
This creates:
- data/robust_zscore_params/{version}/mean_train.npy - Pre-fitted mean parameters (330,)
- data/robust_zscore_params/{version}/std_train.npy - Pre-fitted std parameters (330,)
- data/robust_zscore_params/{version}/metadata.json - Feature column names and metadata
Use in Polars processors:
from cta_1d.src.processors import RobustZScoreNorm
# Load pre-fitted parameters by version name
processor = RobustZScoreNorm.from_version("csiallx_feature2_ntrla_flag_pnlnorm")
# Apply normalization to DataFrame
df = processor.process(df)
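Under the directory layout above, from_version() plausibly reduces to loading the two .npy arrays plus the metadata and then applying a clipped z-score; a hedged sketch (helper names and the clip value are illustrative, not the real API):

```python
import json
from pathlib import Path
import numpy as np

PARAMS_ROOT = Path("data/robust_zscore_params")  # layout assumed from above

def load_zscore_params(version: str, root: Path = PARAMS_ROOT):
    """Load the pre-fitted parameters that from_version() relies on."""
    base = root / version
    mean = np.load(base / "mean_train.npy")   # shape (330,)
    std = np.load(base / "std_train.npy")     # shape (330,)
    meta = json.loads((base / "metadata.json").read_text())
    return mean, std, meta

def robust_zscore(x: np.ndarray, mean: np.ndarray, std: np.ndarray,
                  clip: float = 3.0) -> np.ndarray:
    """Normalize with the pre-fitted parameters, clipping outliers."""
    return np.clip((x - mean) / std, -clip, clip)
```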
Parameter details:
- Fit period: 2013-01-01 to 2018-12-31
- Feature count: 330 (158 alpha158_ntrl + 158 alpha158_raw + 7 market_ext_ntrl + 7 market_ext_raw)
- Fields: ['feature', 'feature_ext']
Workflow
1. Generate Beta Embeddings
Generate VAE embeddings from the alpha158_0_7_beta factors:
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
This loads data from Parquet, applies the full feature transformation pipeline, and encodes with the VAE model.
Output: data/embedding_0_7_beta.parquet
2. Fetch Original Predictions
Fetch the original 0_7 predictions from DolphinDB:
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30
Output: data/original_predictions_0_7.parquet
3. Generate Predictions with Beta Embeddings
Use the d033 model to generate predictions from the beta embeddings:
python scripts/predict_with_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
Output: data/predictions_beta_embedding.parquet
4. Generate Actual Returns
Generate actual returns from kline data for IC calculation:
python scripts/generate_returns.py
Output: data/actual_returns.parquet
5. Compare Predictions
Compare the 0_7 vs 0_7_beta predictions:
python scripts/compare_predictions.py
This calculates:
- Prediction correlation (Pearson and Spearman)
- Daily correlation statistics
- IC metrics (mean, std, IR)
- RankIC metrics
- Top-tier returns (top 10%)
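The metrics above can be computed per day with pandas; a minimal sketch (the column names `date`, `pred`, and `ret` are illustrative, not the script's actual schema):

```python
import pandas as pd

def daily_ic(df: pd.DataFrame, pred_col="pred", ret_col="ret",
             rank=False) -> pd.Series:
    """Per-date correlation between predictions and returns (IC / RankIC)."""
    method = "spearman" if rank else "pearson"
    return df.groupby("date").apply(
        lambda g: g[pred_col].corr(g[ret_col], method=method))

def ic_summary(ic: pd.Series) -> dict:
    """Mean, std, and information ratio of the daily IC series."""
    return {"mean": ic.mean(), "std": ic.std(), "ir": ic.mean() / ic.std()}

def top_tier_return(df: pd.DataFrame, pct=0.10,
                    pred_col="pred", ret_col="ret") -> float:
    """Average return of the top `pct` of predictions per day."""
    def top(g):
        k = max(1, int(len(g) * pct))
        return g.nlargest(k, pred_col)[ret_col].mean()
    return df.groupby("date").apply(top).mean()
```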
Quick Start
Run the full pipeline:
python pipeline.py --start-date 2019-01-01 --end-date 2020-11-30
Or run individual steps:
# Step 1: Generate embeddings
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
# Step 2: Fetch original predictions
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30
# Step 3: Generate beta predictions
python scripts/predict_with_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
# Step 4: Generate returns
python scripts/generate_returns.py
# Step 5: Compare
python scripts/compare_predictions.py
Data Dependencies
Input Data (from Parquet)
- /data/parquet/dataset/stg_1day_wind_alpha158_0_7_beta_1D/ - Alpha158 beta factors
- /data/parquet/dataset/stg_1day_wind_kline_adjusted_1D/ - Market data (kline)
- /data/parquet/dataset/stg_1day_gds_indus_flag_cc1_1D/ - Industry flags
Models
- /home/guofu/Workspaces/alpha/data_ops/tasks/dwm_feature_vae/model/csiallx_feature2_ntrla_flag_pnlnorm_vae4_dim32a_beta0001/module.pt - VAE encoder
- /home/guofu/Workspaces/alpha/data_ops/tasks/app_longsignal/model/host140_exp20_d033/module.pt - d033 prediction model
DolphinDB
- Table: dfs://daily_stock_run_multicast/app_1day_multicast_longsignal_port
- Version: host140_exp20_d033
Key Metrics
The comparison script outputs:
| Metric | Description |
|---|---|
| Pearson Correlation | Overall correlation between 0_7 and beta predictions |
| Spearman Correlation | Rank correlation between predictions |
| Daily Correlation | Mean and std of daily correlations |
| IC Mean | Average information coefficient |
| IC Std | Standard deviation of IC |
| IC IR | Information ratio (IC Mean / IC Std) |
| RankIC | Spearman correlation with returns |
| Top-tier Return | Average return of top 10% predictions |
Notes
- All scripts can be run from the alpha158_beta/ directory
- Scripts use relative paths (../data/) to locate data files
- The VAE model expects 341 input features after the transformation pipeline
- The d033 model expects 32-dimensional embeddings with a 40-day lookback window
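Given the 40-day lookback and 32-dimensional embeddings, each model input is a (40, 32) window of consecutive embeddings; a sketch of building those windows for one instrument (the shapes come from the notes above, the stacking itself is an assumption about how the model is fed):

```python
import numpy as np

LOOKBACK = 40  # days of history per prediction
EMB_DIM = 32   # d033 embedding dimension

def build_windows(embeddings: np.ndarray) -> np.ndarray:
    """Stack a (T, 32) embedding series into (T - 39, 40, 32) model inputs."""
    T = embeddings.shape[0]
    if T < LOOKBACK:
        return np.empty((0, LOOKBACK, EMB_DIM))
    # Fancy-index rows so window i covers days i .. i + 39.
    idx = np.arange(LOOKBACK)[None, :] + np.arange(T - LOOKBACK + 1)[:, None]
    return embeddings[idx]

emb = np.random.randn(50, EMB_DIM)
windows = build_windows(emb)  # shape (11, 40, 32)
```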