Update documentation for src/ consolidation

- Add detailed directory structure to CLAUDE.md and README.md
- Document Module Organization section explaining src/ layout
- Add Python API import examples showing re-export pattern
- Add Command Line usage section with examples
- Update "Adding a New Task" instructions for src/ structure
- Add module organization best practice

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
master · guofu · 3 weeks ago
parent 966c17d7a9 · commit 19f7c522e4

@@ -0,0 +1,185 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview
Alpha Lab is a quantitative research experiment framework for the `qshare` library. It uses a notebook-centric approach for exploring trading strategies and ML models. The codebase is organized around two prediction tasks:
- **cta_1d**: CTA (Commodity Trading Advisor) futures 1-day return prediction
- **stock_15m**: Stock 15-minute forward return prediction using high-frequency features
## Directory Structure
```
alpha_lab/
├── common/                    # Shared utilities
│   ├── __init__.py
│   ├── paths.py               # Path management
│   └── plotting.py            # Common plotting functions
├── cta_1d/                    # CTA 1-day return prediction
│   ├── __init__.py            # Re-exports from src/
│   ├── config.yaml            # Task configuration
│   ├── src/                   # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py          # CTA1DLoader
│   │   ├── train.py           # Training functions
│   │   ├── backtest.py        # Backtest functions
│   │   └── labels.py          # Label blending utilities
│   └── *.ipynb                # Experiment notebooks
├── stock_15m/                 # Stock 15-minute return prediction
│   ├── __init__.py            # Re-exports from src/
│   ├── config.yaml            # Task configuration
│   ├── src/                   # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py          # Stock15mLoader
│   │   └── train.py           # Training functions
│   └── *.ipynb                # Experiment notebooks
└── results/                   # Output directory (gitignored)
```
## Common Commands
### Development Setup
```bash
# Install dependencies
pip install -r requirements.txt

# Create environment configuration
cp .env.template .env
# Edit .env with your DolphinDB host and data paths
```
### Running Experiments
```bash
# Start Jupyter for interactive experiments
jupyter notebook

# Train CTA model from config
python -m cta_1d.train --config cta_1d/config.yaml --output results/cta_1d/exp01

# Train Stock 15m model
python -m stock_15m.train --config stock_15m/config.yaml --output results/stock_15m/exp01

# Run CTA backtest
python -m cta_1d.backtest \
    --model results/cta_1d/exp01/model.json \
    --dt-range 2023-01-01 2023-12-31 \
    --output results/cta_1d/backtest_01
```
### Python API Usage
```python
# CTA 1D workflow
from cta_1d import CTA1DLoader, train_model, TrainConfig

loader = CTA1DLoader(return_type='o2c_twap1min', normalization='dual')
dataset = loader.load(dt_range=['2020-01-01', '2023-12-31'])

config = TrainConfig(dt_range=['2020-01-01', '2023-12-31'], feature_sets=['alpha158'])
model, metrics = train_model(config, output_dir='results/exp01')

# Stock 15m workflow
from stock_15m import Stock15mLoader, train_model, TrainConfig

loader = Stock15mLoader(normalization_mode='dual')
dataset = loader.load(
    dt_range=['2020-01-01', '2023-12-31'],
    feature_path='/data/parquet/stock_1min_alpha158',
    kline_path='/data/parquet/stock_1min_kline',
)
```
## Architecture
### Module Organization
All implementation code lives in `src/` subdirectories:
- **`cta_1d/src/`**: CTA-specific implementations
  - `loader.py`: CTA1DLoader class
  - `train.py`: train_model, TrainConfig
  - `backtest.py`: run_backtest, BacktestConfig
  - `labels.py`: Label blending utilities
- **`stock_15m/src/`**: Stock-specific implementations
  - `loader.py`: Stock15mLoader class
  - `train.py`: train_model, TrainConfig
Root `__init__.py` files re-export public APIs for backward compatibility:
```python
from cta_1d import CTA1DLoader # Imports from cta_1d.src
```
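A minimal sketch of such a root `__init__.py`, assuming the module layout above (the exact `__all__` contents are an assumption, chosen to match the imports used in this document):
```python
# cta_1d/__init__.py -- re-export sketch; implementation stays in src/
from .src.loader import CTA1DLoader
from .src.train import TrainConfig, train_model
from .src.backtest import BacktestConfig, run_backtest

# Assumed public surface
__all__ = ['CTA1DLoader', 'TrainConfig', 'train_model',
           'BacktestConfig', 'run_backtest']
```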
### Data Flow
Both tasks follow a consistent pattern (a training sketch follows the list):
1. **Loaders** (`src/loader.py`): Fetch data from DolphinDB (CTA) or Parquet files (Stock), apply normalization, compute sample weights, return `pl_Dataset`
2. **Training** (`src/train.py`): XGBoost with early stopping, outputs model JSON + metrics
3. **Backtest** (`src/backtest.py`): CTA-only; uses `qshare.eval.cta.backtest.CTABacktester` for strategy simulation
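A compact sketch of step 2, assuming numpy feature/label arrays already extracted from a loader's dataset; this is illustrative, not the actual `src/train.py`:
```python
import json

import xgboost as xgb

def train_xgb(X_tr, y_tr, X_va, y_va, out_dir):
    """XGBoost regression with early stopping; writes model JSON + metrics."""
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dvalid = xgb.DMatrix(X_va, label=y_va)
    history = {}
    model = xgb.train(
        {'eta': 0.05, 'max_depth': 6, 'objective': 'reg:squarederror'},
        dtrain,
        num_boost_round=1000,
        evals=[(dvalid, 'valid')],
        early_stopping_rounds=50,   # stop when validation loss stalls
        evals_result=history,
        verbose_eval=False,
    )
    model.save_model(f'{out_dir}/model.json')         # model JSON
    with open(f'{out_dir}/metrics.json', 'w') as f:   # metrics
        json.dump(history, f)
    return model
```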
### Key Classes
- **`CTA1DLoader`**: Loads alpha158/hffactor features from DolphinDB; supports 5 normalization modes (`zscore`, `cs_zscore`, `rolling_20`, `rolling_60`, `dual`)
- **`Stock15mLoader`**: Loads Alpha158 on 1-min data; computes 15-min forward returns; normalization modes: `industry`, `cs_zscore`, `dual`
- **`pl_Dataset`**: From `qshare.data`; provides `.with_segments()`, `.split()`, `.to_numpy()` methods (usage sketched below)
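Hypothetical `pl_Dataset` usage based only on the method names above; `qshare` is internal, so the exact signatures and segment specs are assumptions:
```python
# Illustrative only: argument names and return shapes are assumed.
dataset = loader.load(dt_range=['2020-01-01', '2023-12-31'])
dataset = dataset.with_segments(
    train=['2020-01-01', '2022-12-31'],
    valid=['2023-01-01', '2023-12-31'],
)
train_ds, valid_ds = dataset.split()   # one dataset per segment
X_tr, y_tr = train_ds.to_numpy()       # features and labels as arrays
```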
### Normalization Modes
**CTA 1D** (`dual` blending):
- `zscore`: Fit-time mean/std normalization
- `cs_zscore`: Cross-sectional z-score per datetime
- `rolling_20/60`: Rolling window normalization
- `dual`: Weighted blend of the four modes above (default weights: [0.2, 0.1, 0.3, 0.4])
**Stock 15m**:
- `industry`: Industry-neutralized returns
- `cs_zscore`: Cross-sectional z-score
- `dual`: 80% industry-neutral + 20% cs_zscore (both blends are sketched below)
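The blend arithmetic both `dual` modes describe, as a self-contained sketch with random stand-in arrays; the real code lives in each task's `src/loader.py`:
```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the normalized label arrays (hypothetical data)
zscore, cs_zscore, rolling_20, rolling_60 = rng.normal(size=(4, 100))

# CTA 1D `dual`: weighted blend with default weights [0.2, 0.1, 0.3, 0.4]
w = np.array([0.2, 0.1, 0.3, 0.4])
cta_dual = w @ np.stack([zscore, cs_zscore, rolling_20, rolling_60])

# Stock 15m `dual`: 80% industry-neutral + 20% cross-sectional z-score
industry_neutral = rng.normal(size=100)
stock_dual = 0.8 * industry_neutral + 0.2 * cs_zscore
```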
### Experiment Tracking
Manual tracking in `results/{task}/README.md`:
```markdown
## 2025-01-15: Baseline XGB
- Notebook: `cta_1d/03_baseline_xgb.ipynb` (cells 1-50)
- Config: eta=0.5, lambda=0.1
- Train IC: 0.042
- Test IC: 0.038
- Notes: Dual normalization, 4 trades/day
```
### Dependencies on qshare
The codebase relies heavily on the `qshare` library (already installed in the venv). Its main touch-points, also rendered as imports after this list:
- `qshare.data.pl_Dataset`: Dataset container with Polars backend
- `qshare.io.ddb`: DolphinDB session management
- `qshare.io.polars`: Parquet loading utilities
- `qshare.algo.polars`: Industry neutralization, cross-sectional z-score
- `qshare.eval.cta.backtest`: CTA backtesting framework
- `qshare.config.research.cta`: Predefined column lists (HFFACTOR_COLS)
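The same touch-points as imports (module paths taken from the list above; `qshare` is a private library, so treat these as illustrative):
```python
from qshare.data import pl_Dataset                    # dataset container
from qshare.io import ddb                             # DolphinDB sessions
from qshare.io import polars as qio_polars            # Parquet loading
from qshare.algo import polars as qalgo_polars        # neutralization, cs z-score
from qshare.eval.cta.backtest import CTABacktester    # CTA backtesting
from qshare.config.research.cta import HFFACTOR_COLS  # predefined columns
```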
### Configuration Files
YAML configs define data ranges, model hyperparameters, and output settings:
```yaml
data:
  dt_range: ['2020-01-01', '2023-12-31']
  feature_sets: [alpha158, hffactor]
  normalization: dual
model:
  type: xgb
  params: {eta: 0.05, max_depth: 6}
```
Load configs via the CLI (`python -m cta_1d.train --config config.yaml`) or read them directly with `yaml.safe_load()`, as sketched below.
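A direct-loading sketch, assuming the YAML schema shown above:
```python
import yaml

with open('cta_1d/config.yaml') as f:
    cfg = yaml.safe_load(f)

print(cfg['data']['dt_range'])   # ['2020-01-01', '2023-12-31']
print(cfg['model']['params'])    # {'eta': 0.05, 'max_depth': 6}
```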

@@ -14,20 +14,33 @@ Quantitative research experiments for qshare library. This repository contains J
```
alpha_lab/
├── common/                    # Shared utilities (keep minimal!)
│   ├── __init__.py
│   ├── paths.py               # Path management
│   └── plotting.py            # Common plotting functions
├── cta_1d/                    # CTA 1-day return prediction
│   ├── __init__.py            # Re-exports from src/
│   ├── config.yaml            # Task configuration
│   ├── src/                   # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py          # CTA1DLoader
│   │   ├── train.py           # Training functions
│   │   ├── backtest.py        # Backtest functions
│   │   └── labels.py          # Label blending utilities
│   ├── 01_data_check.ipynb
│   ├── 02_label_analysis.ipynb
│   ├── 03_baseline_xgb.ipynb
│   └── 04_blend_comparison.ipynb
├── stock_15m/                 # Stock 15-minute return prediction
│   ├── __init__.py            # Re-exports from src/
│   ├── config.yaml            # Task configuration
│   ├── src/                   # Implementation modules
│   │   ├── __init__.py
│   │   ├── loader.py          # Stock15mLoader
│   │   └── train.py           # Training functions
│   ├── 01_data_exploration.ipynb
│   └── 02_baseline_model.ipynb
└── results/                   # Output directory (gitignored)
    ├── cta_1d/
@@ -47,6 +60,8 @@ cp .env.template .env
## Usage
### Interactive (Notebooks)
Start Jupyter and run notebooks interactively:
```bash
@@ -59,6 +74,33 @@ Each task directory contains numbered notebooks:
- `03_*.ipynb` - Advanced experiments
- `04_*.ipynb` - Comparisons and ablations
### Command Line
Train models from config files:
```bash
# CTA 1D
python -m cta_1d.train --config cta_1d/config.yaml --output results/cta_1d/exp01

# Stock 15m
python -m stock_15m.train --config stock_15m/config.yaml --output results/stock_15m/exp01

# CTA Backtest
python -m cta_1d.backtest \
    --model results/cta_1d/exp01/model.json \
    --dt-range 2023-01-01 2023-12-31 \
    --output results/cta_1d/backtest_01
```
### Python API
```python
# Import from task root (re-exports from src/)
from cta_1d import CTA1DLoader, train_model, TrainConfig
from stock_15m import Stock15mLoader, train_model, TrainConfig
from common import create_experiment_dir
```
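A hypothetical end-to-end use of these imports; `create_experiment_dir`'s signature is an assumption, not documented here:
```python
from common import create_experiment_dir
from cta_1d import CTA1DLoader, TrainConfig, train_model

out_dir = create_experiment_dir('cta_1d')  # assumed: returns an output path
loader = CTA1DLoader(return_type='o2c_twap1min', normalization='dual')
dataset = loader.load(dt_range=['2020-01-01', '2023-12-31'])

config = TrainConfig(dt_range=['2020-01-01', '2023-12-31'],
                     feature_sets=['alpha158'])
model, metrics = train_model(config, output_dir=out_dir)
```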
## Experiment Tracking
Experiments are tracked manually in `results/{task}/README.md`:
@@ -75,13 +117,18 @@ Experiments are tracked manually in `results/{task}/README.md`:
## Adding a New Task
1. Create directory: `mkdir my_task`
2. Add `src/` subdirectory with:
   - `__init__.py` - Export public APIs
   - `loader.py` - Dataset loader class
   - Other modules as needed
3. Add root `__init__.py` that re-exports from `src/` (see the sketch after this list)
4. Create numbered notebooks
5. Add entry to `results/my_task/README.md`
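A sketch of the two `__init__.py` files for a hypothetical `my_task` (all names here are placeholders):
```python
# my_task/src/__init__.py -- export the task's public API
from .loader import MyTaskLoader          # hypothetical loader class
from .train import TrainConfig, train_model

# my_task/__init__.py -- root package re-exports from src/
from .src import MyTaskLoader, TrainConfig, train_model
```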
## Best Practices
1. **Keep it simple**: Only add to `common/` after 3+ copies
2. **Module organization**: Place implementation in `src/`, re-export from root `__init__.py`
3. **Notebook configs**: Define CONFIG dict in first cell for easy modification
4. **Document results**: Update results README after significant runs
5. **Git discipline**: Don't commit large files, results, or credentials
