Skip to content

Latest commit

Β 

History

History
755 lines (564 loc) Β· 19.9 KB

File metadata and controls

755 lines (564 loc) Β· 19.9 KB

AlgEngine Usage Guide

This guide covers how to use AlgEngine for training end-to-end autonomous driving models, running evaluations, extracting rare cases, and fine-tuning. AlgEngine is built on MMDetection3D and supports UniAD, VADv2, and HydraMDP architectures.

Table of Contents


Quick Reference

cd projects/AlgEngine
# Training (8 GPUs)
./scripts/e2e_dist_train.sh <config> <num_gpus> [resume_checkpoint]

# Open-loop evaluation
./scripts/e2e_dist_eval.sh <config> <checkpoint> <num_gpus>

# Full train set evaluation (chunked)
bash scripts/e2e_dist_eval_navtrain_chunked.sh <config> <checkpoint> <num_gpus> [num_chunks]

# Rare case extraction
python scripts/rare_case_sampling_by_pdms.py \
    --pdm-result <csv_file> \
    --base-split <yaml_file> \
    --output-dir <output_dir>

# Closed-loop evaluation
bash scripts/run_ray_distributed_testing.sh <config> <checkpoint> <model_name> <data_type> <react_type>

Training

Prerequisites

Before training, ensure:

  • βœ… AlgEngine environment installed (algengine conda env)
  • βœ… Data prepared (see Data Organization)
  • βœ… Pre-trained backbone weights downloaded

Training from Scratch

Train a model on 50% of the training data:

conda activate algengine
cd projects/AlgEngine

# Train VADv2 with 50% data (8 GPUs)
./scripts/e2e_dist_train.sh configs/worldengine/e2e_vadv2_50pct.py 8

Arguments:

  1. <config>: Configuration file path
  2. <num_gpus>: Number of GPUs to use
  3. [resume_checkpoint] (optional): Checkpoint to resume from

Training with 100% Data

./scripts/e2e_dist_train.sh configs/worldengine/e2e_vadv2_100pct.py 8

Resume Training

Resume from a checkpoint:

./scripts/e2e_dist_train.sh \
    configs/worldengine/e2e_vadv2_50pct.py \
    8 \
    work_dirs/e2e_vadv2_50pct/latest.pth

Auto-resume: If latest.pth exists in work_dirs/, training will auto-resume.

Monitor Training

# Watch training log
tail -f work_dirs/e2e_vadv2_50pct/logs/train.*

# TensorBoard (if enabled)
tensorboard --logdir work_dirs/e2e_vadv2_50pct/tf_logs

Key metrics to monitor:

  • loss: Total training loss (should decrease)
  • loss_planning: Planning loss
  • loss_track: Tracking loss
  • ade_4s: Average displacement error at 4 seconds
  • fde_4s: Final displacement error at 4 seconds

Training Output

work_dirs/e2e_vadv2_50pct/
β”œβ”€β”€ e2e_vadv2_50pct.py          # Config backup
β”œβ”€β”€ logs/
β”‚   └── train.26040614*         # Training logs
β”œβ”€β”€ epoch_1.pth                 # Checkpoints
β”œβ”€β”€ epoch_2.pth
...
β”œβ”€β”€ epoch_20.pth
└── latest.pth                  # Symlink to latest checkpoint

Evaluation

Open-Loop Evaluation

Evaluate model predictions against ground truth trajectories.

Full Test Set Evaluation

conda activate algengine
cd projects/AlgEngine

# Evaluate on navtest (8 GPUs)
./scripts/e2e_dist_eval.sh \
    configs/worldengine/e2e_vadv2_50pct.py \
    work_dirs/e2e_vadv2_50pct/epoch_20.pth \
    8

Output:

work_dirs/e2e_vadv2_50pct/
└── navtest.csv                 # Evaluation results

Rare Navtest Cases Only

Evaluate on known rare navtest cases:

./scripts/e2e_dist_eval_navtest_failures.sh \
    configs/worldengine/e2e_vadv2_50pct.py \
    work_dirs/e2e_vadv2_50pct/epoch_20.pth \
    8

Output:

work_dirs/e2e_vadv2_50pct/
└── navtest_failures.csv        # rare navtest cases only

Full Train Set Evaluation

Evaluate on the full training set (navtrain) to produce per-scenario metrics for Rare Case Extraction. Because navtrain is large, the script splits it into chunks to avoid OOM.

conda activate algengine
cd projects/AlgEngine

# Chunked evaluation on navtrain (8 GPUs, 20 chunks)
bash scripts/e2e_dist_eval_navtrain_chunked.sh \
    configs/worldengine/e2e_vadv2_50pct.py \
    work_dirs/e2e_vadv2_50pct/epoch.pth \
    8 \
    20

Arguments:

  1. <config>: Configuration file path
  2. <checkpoint>: Model checkpoint to evaluate
  3. <num_gpus>: Number of GPUs to use
  4. [num_chunks] (optional, default 10): Number of chunks to split navtrain into

The script automatically:

  1. Splits navtrain.yaml into chunks under configs/navsim_splits/navtrain_split/chunks/
  2. Evaluates each chunk sequentially
  3. Merges all chunk CSVs into a single file

Output:

experiments/worldengine/e2e_vadv2_50pct/
└── navtrain.csv                # Full train set evaluation results

Understanding Evaluation Metrics

Open-loop metrics CSV format:

token,ade_4s,fde_4s,no_at_fault_collisions,drivable_area_compliance,ego_progress,comfort,score
abc123,0.42,0.85,1.0,0.95,0.88,0.92,0.89
...

Key metrics:

  • ade_4s: Average trajectory error over 4 seconds (meters, lower is better)
  • fde_4s: Final position error at 4 seconds (meters, lower is better)
  • no_at_fault_collisions: Collision avoidance rate (0-1, higher is better)
  • drivable_area_compliance: Stay in drivable area (0-1, higher is better)
  • ego_progress: Route completion (0-1, higher is better)
  • comfort: Comfort metric (0-1, higher is better)
  • score: Overall PDM score (0-1, higher is better)

Closed-Loop Evaluation

Evaluate model in simulation (requires SimEngine).

See SimEngine Usage Guide for:

  • Single-GPU testing
  • Multi-GPU distributed testing
  • Reactive vs non-reactive modes

Quick example:

cd projects/AlgEngine

bash scripts/run_ray_distributed_testing.sh \
    $WORLDENGINE_ROOT/projects/AlgEngine/configs/worldengine/e2e_vadv2_50pct.py \
    $WORLDENGINE_ROOT/projects/AlgEngine/work_dirs/e2e_vadv2_50pct/epoch_20.pth \
    e2e_vadv2_50pct_epoch20 \
    navtest_failures \
    NR

Rare Case Extraction

Extract failure scenarios from evaluation results for targeted fine-tuning.

Prerequisites

Before extracting rare cases, you must complete a Full Train Set Evaluation to generate navtrain.csv with per-scenario metrics. The rare case extraction script uses this CSV to identify failure scenarios.

Basic Extraction

conda activate algengine
cd projects/AlgEngine

python scripts/rare_case_sampling_by_pdms.py \
    --pdm-result ${WORLDENGINE_ROOT}/experiments/worldengine/e2e_vadv2_50pct/navtrain.csv \
    --base-split configs/navsim_splits/navtrain_split/navtrain_50pct.yaml \
    --output-dir configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_rare

Arguments:

  • --pdm-result: CSV file with evaluation metrics
  • --base-split: Base scenario split YAML file
  • --output-dir: Directory to save extracted split files

Extracted Splits

The script generates three rare case split files:

configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_rare/
β”œβ”€β”€ navtrain_50pct_collision.yaml      # Collision scenarios
β”œβ”€β”€ navtrain_50pct_off_road.yaml       # Off-road scenarios
└── navtrain_50pct_ep_1pct.yaml        # Low ego-progress (bottom 1%)

Using Custom Thresholds

Edit the script to customize:

# In rare_case_sampling_by_pdms.py

# Change collision threshold
collision_scenarios = df[df['no_at_fault_collisions'] < 0.95]  # From 1.0

# Change ego-progress percentile
ep_threshold = df['ego_progress'].quantile(0.05)  # From 0.01 (1% -> 5%)

Verify Extracted Scenarios

# Check how many scenarios were extracted
wc -l configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_rare/*.yaml

# View first few scenarios
head -20 configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_rare/navtrain_50pct_collision.yaml

Fine-Tuning

Fine-tune a trained model on rare cases using reinforcement learning.

Prerequisites: Generating Rollouts with SimEngine

Important: Rare case extraction and Rollout data must be generated by SimEngine before fine-tuning. This involves:

  1. Convert nuPlan Data to SimEngine Format β€” convert rare case scenarios to SimEngine scenario format:

    conda activate simengine
    
    python projects/SimEngine/worldengine/utils/dataset_utils/nuplan/digitaltwin_nuplan_converter_navsim_filter.py \
        --navsim-filters $ALGENGINE_ROOT/configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_rare/navtrain_50pct_collision.yaml \
         $ALGENGINE_ROOT/configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_rare/navtrain_50pct_ep_1pct.yaml \
         $ALGENGINE_ROOT/configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_rare/navtrain_50pct_off_road.yaml \
        --out-dir data/sim_engine/scenarios/original/navtrain_vadv2_50pct_rare \
        --num-processes 8

    Output: data/sim_engine/scenarios/original/navtrain_vadv2_50pct_rare/all_scenarios.pkl

    For full parameter reference, see SimEngine: Convert nuPlan Data.

  2. Run SimEngine Rollout to generate trajectory data:

    # Multi-GPU distributed rollout (recommended for large datasets)
    export WORLDENGINE_ROOT=/path/to/WorldEngine
    cd projects/SimEngine
    bash scripts/run_ray_distributed_rollout.sh \
        $WORLDENGINE_ROOT/projects/AlgEngine/configs/worldengine/e2e_vadv2_50pct.py \
        $WORLDENGINE_ROOT/data/alg_engine/ckpts/e2e_vadv2_50pct_ep8.pth \
        e2e_vadv2_50pct \
        navtrain_vadv2_50pct_rare \
        navtrain
  3. Rollout Output is saved to:

    experiments/closed_loop_exps/e2e_vadv2_50pct/navtrain_NR/
    └── WE_output/
        └── openscene_format/
            β”œβ”€β”€ sensor_blobs/        # Camera images, LiDAR
            β”œβ”€β”€ meta_datas/          # Per-scenario metadata
            β”œβ”€β”€ pdms_pkl/            # Metric pdms pkl
            └── all_scenes_pdm_averages_NR.csv
    
  4. Reorganize to AlgEngine Format (creates openscene-synthetic dataset):

    conda activate simengine
    
    cd projects/SimEngine
    python scripts/export_simulation_data.py \
        --test_path experiments/closed_loop_exps/e2e_vadv2_50pct/navtrain_NR \
        --appendix 260406  # Date suffix for versioning, Default None

    Output location: data/alg_engine/openscene-synthetic/

  5. Verify Data Structure:

    data/alg_engine/openscene-synthetic/
    β”œβ”€β”€ sensor_blobs/              # Replayed scenario sensor data
    β”œβ”€β”€ meta_datas/                # Metadata
    └── pdms_pkl/                  # Metric pdms pkl

For detailed SimEngine usage, see SimEngine Usage Guide.

RL-Based Fine-Tuning

conda activate algengine
cd projects/AlgEngine

# Fine-tune on extracted rare cases (8 GPUs)
./scripts/e2e_dist_train.sh \
    configs/worldengine/e2e_vadv2_50pct_rlft_rare_log.py \
    8 \
    work_dirs/e2e_vadv2_50pct/epoch_20.pth

Arguments:

  1. Config with _rlft_rare_log suffix (uses rare case splits)
  2. Number of GPUs
  3. Base checkpoint to fine-tune from

Fine-Tuning Configuration

The _rlft_rare_log config typically includes:

# configs/worldengine/e2e_vadv2_50pct_rlft_rare_log.py

# Use rare case splits
data = dict(
    train=dict(
        ann_file='merged_infos_navformer/nuplan_openscene_navtrain.pkl',
        scenario_filter=[
            'configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_rare/navtrain_50pct_collision.yaml',
            'configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_rare/navtrain_50pct_off_road.yaml',
            'configs/navsim_splits/navtrain_split/e2e_vadv2_50pct_rare/navtrain_50pct_ep_1pct.yaml',
        ]
    )
)

# RL training settings
optimizer = dict(type='AdamW', lr=5e-5)  # Lower learning rate
total_epochs = 8  # Fewer epochs for fine-tuning

Fine-Tuning Output

work_dirs/e2e_vadv2_50pct_rlft_rare_log/
β”œβ”€β”€ e2e_vadv2_50pct_rlft_rare_log.py
β”œβ”€β”€ logs/
β”‚   └── train.*
β”œβ”€β”€ epoch_1.pth
...
└── epoch_8.pth

Evaluate Fine-Tuned Model

cd projects/AlgEngine
# Open-loop evaluation
./scripts/e2e_dist_eval.sh \
    configs/worldengine/e2e_vadv2_50pct_rlft_rare_log.py \
    work_dirs/e2e_vadv2_50pct_rlft_rare_log/epoch_8.pth \
    8

# Closed-loop evaluation
bash scripts/run_ray_distributed_testing.sh \
    $WORLDENGINE_ROOT/projects/AlgEngine/configs/worldengine/e2e_vadv2_50pct_rlft_rare_log.py \
    $WORLDENGINE_ROOT/projects/AlgEngine/work_dirs/e2e_vadv2_50pct_rlft_rare_log/epoch_8.pth \
    e2e_vadv2_50pct_rlft \
    navtest_failures \
    NR

Configuration

AlgEngine uses hierarchical configuration with MMDetection3D. For a detailed reference of all config parameters, variants, and their relationships, see the Configuration Guide.

Configuration Hierarchy

configs/
β”œβ”€β”€ _base_/
β”‚   └── default_runtime.py              # Base runtime settings
β”œβ”€β”€ worldengine/
β”‚   β”œβ”€β”€ e2e_vadv2_50pct.py              # 50% data training
β”‚   β”œβ”€β”€ e2e_vadv2_100pct.py             # 100% data training
β”‚   β”œβ”€β”€ e2e_vadv2_50pct_rlft_rare_log.py  # Rare case fine-tuning
β”‚   └── ...
└── navsim_splits/
    β”œβ”€β”€ navtrain_split/
    β”‚   β”œβ”€β”€ navtrain.yaml               # Full training set
    β”‚   β”œβ”€β”€ navtrain_50pct.yaml         # 50% subset
    β”‚   └── e2e_vadv2_50pct_rare/       # Rare case splits
    β”‚       β”œβ”€β”€ navtrain_50pct_collision.yaml
    β”‚       β”œβ”€β”€ navtrain_50pct_off_road.yaml
    β”‚       └── navtrain_50pct_ep_1pct.yaml
    └── navtest_split/
        β”œβ”€β”€ navtest.yaml                # Full test set
        └── navtest_failures.yaml       # Failure subset

Key Configuration Parameters

# Model architecture
model = dict(
    type='VADv2',  # or 'UniAD', 'HydraMDP'
    num_query=900,
    num_classes=7,
    planning_steps=8,
    img_backbone=dict(type='ResNet50', ...),
    img_neck=dict(type='FPN', ...),
)

# BEV configuration
bev_h_, bev_w_ = 200, 200
patch_size = [102.4, 102.4]  # Physical range (meters)

# Input modality
input_modality = dict(
    use_lidar=False,
    use_camera=True,  # 8 cameras
    use_radar=False,
    use_external=True  # CAN bus
)

# Training
total_epochs = 20
optimizer = dict(type='AdamW', lr=2e-4, weight_decay=0.01)
lr_config = dict(
    policy='CosineAnnealing',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=1.0 / 3,
)

# Data
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=4,
    train=dict(
        ann_file='merged_infos_navformer/nuplan_openscene_navtrain.pkl',
        scenario_filter='configs/navsim_splits/navtrain_split/navtrain_50pct.yaml',
    ),
    val=dict(
        ann_file='merged_infos_navformer/nuplan_openscene_navtest.pkl',
        scenario_filter='configs/navsim_splits/navtest_split/navtest.yaml',
    ),
)

Override Configuration

Override config parameters at runtime:

./scripts/e2e_dist_train.sh \
    configs/worldengine/e2e_vadv2_50pct.py \
    8 \
    --cfg-options \
    optimizer.lr=1e-4 \
    total_epochs=30 \
    data.samples_per_gpu=2

Model Architectures

AlgEngine supports multiple end-to-end autonomous driving architectures.

VADv2 (Default)

Features:

  • Vector-based scene representation
  • Planning-oriented perception
  • Efficient trajectory prediction

Config: configs/worldengine/e2e_vadv2_*.py

Best for: General driving scenarios, fast inference

UniAD

Features:

  • Unified perception-prediction-planning
  • Multi-task learning
  • Strong generalization

Config: configs/worldengine/e2e_uniad_*.py

Best for: Complex scenarios, research

HydraMDP

Features:

  • Multi-modal trajectory prediction
  • Distribution-aware planning
  • Behavior world model integration

Config: configs/worldengine/e2e_hydramdp_*.py

Best for: Safety-critical scenarios, rare cases

Switching Architectures

# Train UniAD instead of VADv2
./scripts/e2e_dist_train.sh configs/worldengine/e2e_uniad_50pct.py 8

# Evaluate HydraMDP
./scripts/e2e_dist_eval.sh \
    configs/worldengine/e2e_hydramdp_50pct.py \
    work_dirs/e2e_hydramdp_50pct/epoch_20.pth \
    8

Advanced Training

Multi-Node Training

For very large models or datasets:

# Node 0 (master)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=28567
export WORLD_SIZE=16  # Total GPUs
export RANK=0  # Node rank

./scripts/e2e_dist_train.sh configs/worldengine/e2e_vadv2_100pct.py 8

# Node 1 (worker)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=28567
export WORLD_SIZE=16
export RANK=8

./scripts/e2e_dist_train.sh configs/worldengine/e2e_vadv2_100pct.py 8

Mixed Precision Training

Enable automatic mixed precision (AMP) for faster training:

# In config
fp16 = dict(loss_scale='dynamic')

Gradient Accumulation

For large batch sizes with limited GPU memory:

# In config
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=4,
)

# Set gradient accumulation steps
runner = dict(
    max_epochs=20,
    gradient_accumulation_steps=4,  # Effective batch size = 1 * 8 GPUs * 4 = 32
)

Troubleshooting

Issue 1: CUDA out of memory during training

Solution:

# Reduce batch size
# Edit config: data.samples_per_gpu = 1 (from 2)

# Reduce BEV resolution
# Edit config: bev_h_, bev_w_ = 150, 150 (from 200, 200)

# Use gradient checkpointing
# Edit config: model.img_backbone.with_cp = True

Issue 2: Training loss not decreasing

Possible causes:

  • Learning rate too high/low
  • Data loading issues
  • Incorrect pre-trained weights

Solution:

# Check data loading
python tools/analysis_tools/browse_dataset.py configs/worldengine/e2e_vadv2_50pct.py

# Verify pre-trained weights loaded
grep "load checkpoint" work_dirs/*/logs/train.*

# Try different learning rate
./scripts/e2e_dist_train.sh ... --cfg-options optimizer.lr=1e-4

Issue 3: Evaluation hangs

Solution:

# Check if processes are stuck
ps aux | grep python

# Kill stuck processes
pkill -f "test.py"

# Restart evaluation with fewer GPUs
./scripts/e2e_dist_eval.sh ... 4  # Use 4 instead of 8

Issue 4: "ModuleNotFoundError: No module named mmdet3d"

Solution:

# Ensure you're in the right environment
conda activate algengine

# Verify MMCV installation
python -c "import mmcv; print(mmcv.__version__)"

# Reinstall MMDetection3D if needed
pip uninstall mmdet3d -y
pip install mmdet3d==1.0.0rc6

Issue 5: Checkpoint file corrupted

Solution:

# Use a previous checkpoint
./scripts/e2e_dist_train.sh ... work_dirs/*/epoch_18.pth  # Instead of epoch_20

# Or train from scratch
rm work_dirs/e2e_vadv2_50pct/latest.pth
./scripts/e2e_dist_train.sh ...

Performance Optimization

Training Speed

  1. Increase workers: data.workers_per_gpu = 8 (if CPU/RAM allows)
  2. Use SSD: Store data on fast NVMe SSD
  3. Mixed precision: Enable fp16 = dict(loss_scale='dynamic')
  4. Persistent workers: data.persistent_workers = True

Memory Optimization

  1. Reduce batch size: data.samples_per_gpu = 1
  2. Lower BEV resolution: bev_h_, bev_w_ = 150, 150
  3. Gradient checkpointing: model.img_backbone.with_cp = True
  4. Clear cache: torch.cuda.empty_cache() in code

Distributed Training Tips

  1. Even GPU allocation: Use same GPU type across nodes
  2. InfiniBand: Use high-speed interconnect for multi-node
  3. Shared filesystem: Use NFS/Lustre for data loading
  4. Monitor network: Watch for communication bottlenecks

Next Steps

For questions, visit GitHub Discussions.