# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview
Alpha Lab is a quantitative research experiment framework for the `qshare` library. It uses a notebook-centric approach for exploring trading strategies and ML models. The codebase is organized around two prediction tasks:
- **cta_1d**: CTA (Commodity Trading Advisor) futures 1-day return prediction
- **stock_15m**: Stock 15-minute forward return prediction using high-frequency features
## Directory Structure
```
alpha_lab/
├── common/              # Shared utilities
│   ├── __init__.py
│   ├── paths.py         # Path management
│   └── plotting.py      # Common plotting functions
├── cta_1d/              # CTA 1-day return prediction
│   ├── __init__.py      # Re-exports from src/
│   ├── config.yaml      # Task configuration
│   ├── src/             # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py    # CTA1DLoader
│   │   ├── train.py     # Training functions
│   │   ├── backtest.py  # Backtest functions
│   │   └── labels.py    # Label blending utilities
│   └── *.ipynb          # Experiment notebooks
├── stock_15m/           # Stock 15-minute return prediction
│   ├── __init__.py      # Re-exports from src/
│   ├── config.yaml      # Task configuration
│   ├── src/             # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py    # Stock15mLoader
│   │   └── train.py     # Training functions
│   └── *.ipynb          # Experiment notebooks
└── results/             # Output directory (gitignored)
```
## Common Commands
### Development Setup
```bash
# Install dependencies
pip install -r requirements.txt
# Create environment configuration
cp .env.template .env
# Edit .env with your DolphinDB host and data paths
```
### Running Experiments
```bash
# Start Jupyter for interactive experiments
jupyter notebook
# Train CTA model from config
python -m cta_1d.train --config cta_1d/config.yaml --output results/cta_1d/exp01
# Train Stock 15m model
python -m stock_15m.train --config stock_15m/config.yaml --output results/stock_15m/exp01
# Run CTA backtest
python -m cta_1d.backtest \
    --model results/cta_1d/exp01/model.json \
    --dt-range 2023-01-01 2023-12-31 \
    --output results/cta_1d/backtest_01
```
### Python API Usage
```python
# CTA 1D workflow
from cta_1d import CTA1DLoader, train_model, TrainConfig
loader = CTA1DLoader(return_type='o2c_twap1min', normalization='dual')
dataset = loader.load(dt_range=['2020-01-01', '2023-12-31'])
config = TrainConfig(dt_range=['2020-01-01', '2023-12-31'], feature_sets=['alpha158'])
model, metrics = train_model(config, output_dir='results/exp01')
# Stock 15m workflow
from stock_15m import Stock15mLoader, train_model, TrainConfig
loader = Stock15mLoader(normalization_mode='dual')
dataset = loader.load(
    dt_range=['2020-01-01', '2023-12-31'],
    feature_path='/data/parquet/stock_1min_alpha158',
    kline_path='/data/parquet/stock_1min_kline'
)
```
## Architecture
### Module Organization
All implementation code lives in `src/` subdirectories:
- **`cta_1d/src/`**: CTA-specific implementations
  - `loader.py`: CTA1DLoader class
  - `train.py`: train_model, TrainConfig
  - `backtest.py`: run_backtest, BacktestConfig
  - `labels.py`: Label blending utilities
- **`stock_15m/src/`**: Stock-specific implementations
  - `loader.py`: Stock15mLoader class
  - `train.py`: train_model, TrainConfig
Root `__init__.py` files re-export public APIs for backward compatibility:
```python
from cta_1d import CTA1DLoader # Imports from cta_1d.src
```
### Data Flow
Both tasks follow a consistent pattern:
1. **Loaders** (`src/loader.py`): Fetch data from DolphinDB (CTA) or Parquet files (Stock), apply normalization, compute sample weights, and return a `pl_Dataset`
2. **Training** (`src/train.py`): XGBoost with early stopping, outputs model JSON + metrics
3. **Backtest** (`src/backtest.py`): CTA-only; uses `qshare.eval.cta.backtest.CTABacktester` for strategy simulation
### Key Classes
- **`CTA1DLoader`**: Loads alpha158/hffactor features from DolphinDB; supports 5 normalization modes (`zscore`, `cs_zscore`, `rolling_20`, `rolling_60`, `dual`)
- **`Stock15mLoader`**: Loads Alpha158 on 1-min data; computes 15-min forward returns; normalization modes: `industry`, `cs_zscore`, `dual`
- **`pl_Dataset`**: From `qshare.data`; provides `.with_segments()`, `.split()`, `.to_numpy()` methods
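Conceptually, `.with_segments()` tags rows by date range and `.split()` pulls one segment back out. A minimal self-contained sketch of that behavior, assuming the real `pl_Dataset` is Polars-backed and its exact signatures differ (the segment names and plain-tuple rows here are illustrative, not the qshare API):

```python
# Hedged sketch of pl_Dataset-style segment tagging and splitting.
# Segment names and row layout are assumptions for illustration only.
from datetime import date

def with_segments(rows, segments):
    """Tag each (dt, features) row with the segment whose date range contains dt."""
    tagged = []
    for dt, feats in rows:
        for name, (start, end) in segments.items():
            if start <= dt <= end:
                tagged.append((name, dt, feats))
                break
    return tagged

def split(tagged, name):
    """Return only the rows belonging to the named segment."""
    return [(dt, feats) for seg, dt, feats in tagged if seg == name]

rows = [(date(2020, 6, 1), [0.1]), (date(2023, 6, 1), [0.2])]
segments = {
    "train": (date(2020, 1, 1), date(2022, 12, 31)),
    "test":  (date(2023, 1, 1), date(2023, 12, 31)),
}
tagged = with_segments(rows, segments)
train = split(tagged, "train")   # only the 2020 row
```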
### Normalization Modes
**CTA 1D** (`dual` blending):
- `zscore`: Fit-time mean/std normalization
- `cs_zscore`: Cross-sectional z-score per datetime
- `rolling_20/60`: Rolling window normalization
- `dual`: Weighted blend (default: [0.2, 0.1, 0.3, 0.4])
**Stock 15m**:
- `industry`: Industry-neutralized returns
- `cs_zscore`: Cross-sectional z-score
- `dual`: 80% industry-neutral + 20% cs_zscore
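Both `dual` modes reduce to the same arithmetic: a convex combination of per-mode normalized values. A sketch under that assumption (the component ordering for CTA, `[zscore, cs_zscore, rolling_20, rolling_60]`, is inferred from the list above and may not match the actual implementation):

```python
# Hedged sketch of the 'dual' blended normalization; weights follow the
# documented defaults, component ordering is an assumption.
def blend(components, weights):
    """Weighted sum of per-mode normalized feature values."""
    assert abs(sum(weights) - 1.0) < 1e-9, "blend weights should sum to 1"
    return sum(w * c for w, c in zip(components, weights))

# CTA 1D: [zscore, cs_zscore, rolling_20, rolling_60] x [0.2, 0.1, 0.3, 0.4]
cta_value = blend([1.0, 2.0, 0.5, -0.5], [0.2, 0.1, 0.3, 0.4])   # -> 0.35

# Stock 15m: 80% industry-neutral + 20% cs_zscore
stock_value = blend([0.3, 1.5], [0.8, 0.2])                      # -> 0.54
```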
### Experiment Tracking
Manual tracking in `results/{task}/README.md`:
```markdown
## 2025-01-15: Baseline XGB
- Notebook: `cta_1d/03_baseline_xgb.ipynb` (cells 1-50)
- Config: eta=0.5, lambda=0.1
- Train IC: 0.042
- Test IC: 0.038
- Notes: Dual normalization, 4 trades/day
```
### Dependencies on qshare
The codebase relies heavily on the `qshare` library (already installed in the venv):
- `qshare.data.pl_Dataset`: Dataset container with Polars backend
- `qshare.io.ddb`: DolphinDB session management
- `qshare.io.polars`: Parquet loading utilities
- `qshare.algo.polars`: Industry neutralization, cross-sectional z-score
- `qshare.eval.cta.backtest`: CTA backtesting framework
- `qshare.config.research.cta`: Predefined column lists (HFFACTOR_COLS)
### Configuration Files
YAML configs define data ranges, model hyperparameters, and output settings:
```yaml
data:
  dt_range: ['2020-01-01', '2023-12-31']
  feature_sets: [alpha158, hffactor]
  normalization: dual
model:
  type: xgb
  params: {eta: 0.05, max_depth: 6}
```
Load via the CLI (`python -m cta_1d.train --config config.yaml`) or parse the file directly with `yaml.safe_load()`.
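When loading a config programmatically, it helps to lift the raw dict into a typed object. A sketch assuming the YAML keys shown above; the field names on this `TrainConfig` mirror those keys but the real `TrainConfig` signature in `cta_1d/src/train.py` may differ:

```python
# Hedged sketch: mapping the YAML config structure above onto a typed object.
# Field names mirror the documented keys; the real TrainConfig may differ.
from dataclasses import dataclass, field

@dataclass
class TrainConfig:
    dt_range: list
    feature_sets: list
    normalization: str = "dual"
    model_type: str = "xgb"
    params: dict = field(default_factory=dict)

def from_config(cfg: dict) -> TrainConfig:
    data, model = cfg["data"], cfg["model"]
    return TrainConfig(
        dt_range=data["dt_range"],
        feature_sets=data["feature_sets"],
        normalization=data.get("normalization", "dual"),
        model_type=model["type"],
        params=model.get("params", {}),
    )

# In practice the dict would come from yaml.safe_load(open("cta_1d/config.yaml"));
# it is inlined here to keep the sketch self-contained.
cfg = {
    "data": {"dt_range": ["2020-01-01", "2023-12-31"],
             "feature_sets": ["alpha158", "hffactor"],
             "normalization": "dual"},
    "model": {"type": "xgb", "params": {"eta": 0.05, "max_depth": 6}},
}
tc = from_config(cfg)
```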