# Alpha158 0_7 vs 0_7_beta Prediction Comparison
This directory contains a workflow for comparing Alpha158 version 0_7 (original) vs 0_7_beta (enhanced with VAE embeddings) predictions.
## Overview
The goal is to evaluate whether the beta version of Alpha158 factors produces better predictions than the original 0_7 version when used with the d033 prediction model.
## Directory Structure
```
stock_1d/d033/alpha158_beta/
├── README.md                          # This file
├── config.yaml                        # VAE model configuration
├── pipeline.py                        # Main orchestration script
├── scripts/                           # Core pipeline scripts
│   ├── generate_beta_embedding.py     # Generate VAE embeddings from beta factors
│   ├── generate_returns.py            # Generate actual returns from kline data
│   ├── fetch_predictions.py           # Fetch original predictions from DolphinDB
│   ├── predict_with_embedding.py      # Generate predictions using beta embeddings
│   └── compare_predictions.py         # Compare 0_7 vs 0_7_beta predictions
├── src/                               # Source modules
│   └── qlib_loader.py                 # Qlib data loader with configurable date range
├── config/                            # Configuration files
│   └── handler.yaml                   # Modified handler with configurable end date
└── data/                              # Data files (gitignored)
    ├── embedding_0_7_beta.parquet
    ├── predictions_beta_embedding.parquet
    ├── original_predictions_0_7.parquet
    ├── actual_returns.parquet
    ├── raw_data_*.pkl                 # Raw data before preprocessing
    └── processed_data_*.pkl           # Processed data after preprocessing
```
## Data Loading with Configurable Date Range
### handler.yaml Modification
The original `handler.yaml` uses a `<TODAY>` placeholder for `load_end`, so it always loads data up to the current date. The modified version in `config/handler.yaml` replaces it with a `<LOAD_END>` placeholder (and `<SINCE_DATE>` with `<LOAD_START>`), so both ends of the range can be set through arguments:
```yaml
# Original (always loads until today)
load_start: &load_start <SINCE_DATE>
load_end: &load_end <TODAY>
# Modified (configurable end date)
load_start: &load_start <LOAD_START>
load_end: &load_end <LOAD_END>
```
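As a rough illustration of how such placeholders could be filled in before the YAML is parsed, the sketch below does a plain string substitution. The helper name `render_handler_template` is hypothetical; the actual substitution logic in `qlib_loader.py` may differ.

```python
def render_handler_template(template: str, load_start: str, load_end: str) -> str:
    """Replace the <LOAD_START> and <LOAD_END> placeholders with concrete dates.

    Hypothetical helper for illustration only; not taken from qlib_loader.py.
    """
    replacements = {"<LOAD_START>": load_start, "<LOAD_END>": load_end}
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template


# The two modified lines from config/handler.yaml, used as a template.
template = (
    "load_start: &load_start <LOAD_START>\n"
    "load_end: &load_end <LOAD_END>\n"
)
rendered = render_handler_template(template, "2019-01-01", "2019-01-31")
print(rendered)
```

Substituting on the raw text keeps the YAML anchors (`&load_start`, `&load_end`) intact, so any aliases elsewhere in the file pick up the new dates.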
### Using qlib_loader.py
```python
from stock_1d.d033.alpha158_beta.src.qlib_loader import (
    load_data_from_handler,
    load_data_with_proc_list,
    load_and_dump_data,
)

# Load data with configurable date range
df = load_data_from_handler(
    since_date="2019-01-01",
    end_date="2019-01-31",
    buffer_days=20,  # Extra days for diff calculations
    verbose=True,
)

# Load and apply preprocessing pipeline
df_processed = load_data_with_proc_list(
    since_date="2019-01-01",
    end_date="2019-01-31",
    proc_list_path="/path/to/proc_list.proc",
    buffer_days=20,
)

# Load and dump both raw and processed data to pickle files
raw_df, processed_df = load_and_dump_data(
    since_date="2019-01-01",
    end_date="2019-01-31",
    output_dir="data/",
    fill_con_rating_nan=True,  # Fill NaN in con_rating_strength column
    verbose=True,
)
```
### Key Features
1. **Configurable end date**: Unlike the original `handler.yaml`, the modified handler respects the requested end date
2. **Buffer period handling**: Automatically loads extra days before `since_date` for diff calculations
3. **NaN handling**: Optional filling of NaN values in `con_rating_strength` column
4. **Dual output**: Saves both raw (before proc_list) and processed (after proc_list) data
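To make the buffer idea concrete, here is a minimal sketch of why extra days are loaded before `since_date`: a difference needs the previous row, so the first requested day only gets a value if an earlier row is present. It assumes `buffer_days` counts calendar days (the real loader may count trading days), and the helper names are hypothetical.

```python
from datetime import date, timedelta


def buffered_start(since_date: str, buffer_days: int) -> str:
    """Return the date `buffer_days` calendar days before `since_date`."""
    start = date.fromisoformat(since_date) - timedelta(days=buffer_days)
    return start.isoformat()


def diff_then_trim(rows, since_date):
    """Compute day-over-day differences, then drop rows before `since_date`.

    `rows` is a list of (iso_date, value) tuples sorted by date.
    """
    diffs = [(d, v - prev) for (_, prev), (d, v) in zip(rows, rows[1:])]
    # ISO dates compare correctly as strings.
    return [(d, x) for d, x in diffs if d >= since_date]


load_from = buffered_start("2019-01-01", 20)
print(load_from)  # 2018-12-12

# The 2018-12-28 row lies in the buffer; it exists only so that the first
# in-range day (2019-01-02) has a previous value to difference against.
rows = [("2018-12-28", 10.0), ("2019-01-02", 10.5), ("2019-01-03", 10.25)]
print(diff_then_trim(rows, "2019-01-01"))
```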
### Processor Fixes
The `qlib_loader.py` includes fixed implementations of qlib processors that correctly handle the `::` separator column format:
- `FixedDiff` - Fixes column naming bug (creates proper `feature::col_diff` names)
- `FixedColumnRemover` - Handles `::` separator format
- `FixedRobustZScoreNorm` - Uses trained `mean_train`/`std_train` parameters from pickle
- `FixedIndusNtrlInjector` - Industry neutralization with `::` format
- Other fixed processors for the full preprocessing pipeline
All fixed processors preserve the trained parameters from the original proc_list pickle.
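The `::` naming convention can be sketched as follows. This only illustrates how a `group::name` column maps to its `_diff` counterpart; it is not the `FixedDiff` implementation. The helper names are hypothetical, and the label column below is illustrative.

```python
SEP = "::"


def split_column(column: str) -> tuple[str, str]:
    """Split 'group::name' into (group, name)."""
    group, sep, name = column.partition(SEP)
    if not sep:
        raise ValueError(f"column {column!r} has no {SEP!r} separator")
    return group, name


def diff_column_names(columns, group="feature", suffix="_diff"):
    """Return the diff column name for every column in `group`."""
    names = []
    for column in columns:
        col_group, name = split_column(column)
        if col_group == group:
            # Keep the group prefix and append the suffix to the field name.
            names.append(f"{col_group}{SEP}{name}{suffix}")
    return names


columns = ["feature::KMID", "feature::ROC5", "label::ret"]
print(diff_column_names(columns))
```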
## Workflow
### 1. Generate Beta Embeddings
Generate VAE embeddings from the alpha158_0_7_beta factors:
```bash
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
```
This loads data from Parquet, applies the full feature transformation pipeline, and encodes the result with the VAE model.
Output: `data/embedding_0_7_beta.parquet`
### 2. Fetch Original Predictions
Fetch the original 0_7 predictions from DolphinDB:
```bash
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30
```
Output: `data/original_predictions_0_7.parquet`
### 3. Generate Predictions with Beta Embeddings
Use the d033 model to generate predictions from the beta embeddings:
```bash
python scripts/predict_with_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
```
Output: `data/predictions_beta_embedding.parquet`
### 4. Generate Actual Returns
Generate actual returns from kline data for IC calculation:
```bash
python scripts/generate_returns.py
```
Output: `data/actual_returns.parquet`
### 5. Compare Predictions
Compare the 0_7 vs 0_7_beta predictions:
```bash
python scripts/compare_predictions.py
```
This calculates:
- Prediction correlation (Pearson and Spearman)
- Daily correlation statistics
- IC metrics (mean, std, IR)
- RankIC metrics
- Top-tier returns (top 10%)
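For reference, the IC-style metrics can be sketched in plain Python as below. This is a minimal illustration, not the code in `compare_predictions.py`: it assumes IC is the per-day Pearson correlation between predictions and actual returns, RankIC is the same on ranks, and IR uses the sample standard deviation. The record layout `(date, prediction, actual_return)` is also an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def daily_ic(records, use_rank=False):
    """Correlation of prediction vs. return, computed separately per date."""
    by_date = defaultdict(list)
    for day, pred, ret in records:
        by_date[day].append((pred, ret))
    ics = {}
    for day in sorted(by_date):
        preds = [p for p, _ in by_date[day]]
        rets = [r for _, r in by_date[day]]
        if use_rank:
            preds, rets = ranks(preds), ranks(rets)
        ics[day] = pearson(preds, rets)
    return ics


def summarize(ics):
    """Mean, sample std, and information ratio of a daily IC series."""
    values = list(ics.values())
    ic_mean, ic_std = mean(values), stdev(values)
    return {"mean": ic_mean, "std": ic_std, "ir": ic_mean / ic_std}


# Toy data: two dates, three stocks each.
records = [
    ("d1", 0.1, 0.01), ("d1", 0.2, 0.02), ("d1", 0.3, 0.03),
    ("d2", 0.3, 0.01), ("d2", 0.1, 0.03), ("d2", 0.2, 0.00),
]
ic = daily_ic(records)
rank_ic = daily_ic(records, use_rank=True)
print(summarize(ic))
print(summarize(rank_ic))
```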
## Quick Start
Run the full pipeline:
```bash
python pipeline.py --start-date 2019-01-01 --end-date 2020-11-30
```
Or run individual steps:
```bash
# Step 1: Generate embeddings
python scripts/generate_beta_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
# Step 2: Fetch original predictions
python scripts/fetch_predictions.py --start-date 2019-01-01 --end-date 2020-11-30
# Step 3: Generate beta predictions
python scripts/predict_with_embedding.py --start-date 2019-01-01 --end-date 2020-11-30
# Step 4: Generate returns
python scripts/generate_returns.py
# Step 5: Compare
python scripts/compare_predictions.py
```
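The five steps above could be chained with a small driver like the one sketched below. This is not the actual `pipeline.py`; it only assumes the script paths and the `--start-date`/`--end-date` flags shown in this README, and that the last two scripts take no date arguments.

```python
import sys

# (script path, whether it accepts --start-date/--end-date), in run order.
STEPS = [
    ("scripts/generate_beta_embedding.py", True),
    ("scripts/fetch_predictions.py", True),
    ("scripts/predict_with_embedding.py", True),
    ("scripts/generate_returns.py", False),
    ("scripts/compare_predictions.py", False),
]


def build_commands(start_date, end_date, python=sys.executable):
    """Return one argv list per pipeline step."""
    commands = []
    for script, takes_dates in STEPS:
        cmd = [python, script]
        if takes_dates:
            cmd += ["--start-date", start_date, "--end-date", end_date]
        commands.append(cmd)
    return commands


commands = build_commands("2019-01-01", "2020-11-30")
for cmd in commands:
    print(" ".join(cmd))
    # To actually run a step, stopping on the first failure:
    #   import subprocess; subprocess.run(cmd, check=True)
```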
## Data Dependencies
### Input Data (from Parquet)
- `/data/parquet/dataset/stg_1day_wind_alpha158_0_7_beta_1D/` - Alpha158 beta factors
- `/data/parquet/dataset/stg_1day_wind_kline_adjusted_1D/` - Market data (kline)
- `/data/parquet/dataset/stg_1day_gds_indus_flag_cc1_1D/` - Industry flags
### Models
- `/home/guofu/Workspaces/alpha/data_ops/tasks/dwm_feature_vae/model/csiallx_feature2_ntrla_flag_pnlnorm_vae4_dim32a_beta0001/module.pt` - VAE encoder
- `/home/guofu/Workspaces/alpha/data_ops/tasks/app_longsignal/model/host140_exp20_d033/module.pt` - d033 prediction model
### DolphinDB
- Table: `dfs://daily_stock_run_multicast/app_1day_multicast_longsignal_port`
- Version: `host140_exp20_d033`
## Key Metrics
The comparison script outputs:
| Metric | Description |
|--------|-------------|
| Pearson Correlation | Overall correlation between 0_7 and beta predictions |
| Spearman Correlation | Rank correlation between predictions |
| Daily Correlation | Mean and std of daily correlations |
| IC Mean | Average information coefficient |
| IC Std | Standard deviation of IC |
| IC IR | Information ratio (IC Mean / IC Std) |
| RankIC | Spearman correlation with returns |
| Top-tier Return | Average return of top 10% predictions |
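As an illustration of the top-tier metric, the sketch below averages, per day, the actual return of the stocks whose predictions fall in the top 10%, then averages across days. The `(date, prediction, actual_return)` layout, the rounding down of the cutoff, and the equal weighting are assumptions; `compare_predictions.py` may differ in all three.

```python
from collections import defaultdict
from statistics import mean


def top_tier_return(records, fraction=0.10):
    """Mean over days of the average return of the top `fraction` by prediction."""
    by_date = defaultdict(list)
    for day, pred, ret in records:
        by_date[day].append((pred, ret))
    daily = []
    for day in sorted(by_date):
        # Highest predictions first.
        pairs = sorted(by_date[day], key=lambda p: p[0], reverse=True)
        # Keep at least one stock even on thin days.
        k = max(1, int(len(pairs) * fraction))
        daily.append(mean(ret for _, ret in pairs[:k]))
    return mean(daily)


# Toy data: 10 stocks per day. On d1 returns rise with the prediction,
# on d2 they fall with it.
records = [("d1", float(i), i / 100) for i in range(10)]
records += [("d2", float(i), (9 - i) / 100) for i in range(10)]
print(top_tier_return(records))
```

On this toy data the top stock earns 0.09 on the first day and 0.0 on the second, so the metric is roughly 0.045.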
## Notes
- All scripts can be run from the `alpha158_beta/` directory
- Scripts locate data files through relative paths (`../data/`, i.e. relative to the `scripts/` directory)
- The VAE model expects 341 input features after the transformation pipeline
- The d033 model expects 32-dimensional embeddings with a 40-day lookback window
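The input shape implied by the last note can be sketched as a sliding window over one stock's daily embeddings. This is an assumption-laden illustration: the real `predict_with_embedding.py` may batch, pad, or order the data differently, and `build_windows` is a hypothetical helper.

```python
LOOKBACK = 40   # days of history per prediction (from the note above)
EMBED_DIM = 32  # embedding size per day (from the note above)


def build_windows(embeddings, lookback=LOOKBACK):
    """Slice a date-ordered list of daily vectors into overlapping windows.

    Each window holds `lookback` consecutive days and ends on the day being
    predicted, so the first `lookback - 1` days produce no window.
    """
    windows = []
    for end in range(lookback, len(embeddings) + 1):
        windows.append(embeddings[end - lookback:end])
    return windows


# 45 days of dummy embeddings for one stock; day t is a vector filled with t.
days = [[float(t)] * EMBED_DIM for t in range(45)]
windows = build_windows(days)
print(len(windows), len(windows[0]), len(windows[0][0]))  # 6 40 32
```

One practical consequence of this layout is that embeddings need to start at least 39 trading days before the first date for which a prediction is wanted.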