Markov Decision Process DQN with Vectorized Training (v0.6)

Deep Q-Network implementation for optimal bridge maintenance planning using Markov Decision Process formulation with vectorized parallel training.

Based on Phase 3 (Vectorized DQN) from dql-maintenance-faster project.

Project Overview

This project extends Phase 3 (Vectorized DQN) to implement a Markov maintenance policy using DQN with:

  • Explicit state transition modeling
  • Policy optimization based on Markov Decision Process theory
  • Vectorized parallel training (AsyncVectorEnv)
  • GPU-accelerated training with Mixed Precision (AMP)

Key Features (Inherited from Phase 3)

  • 14x Faster Training: AsyncVectorEnv with 4 parallel environments
  • Stable Convergence: Prioritized Experience Replay (PER)
  • GPU-Accelerated: CUDA support with Mixed Precision Training
  • Production-Ready: Validated on 30-year maintenance simulations

Performance Baseline (Phase 3)

Metric                      Phase 3 Result
Training Time (1000 ep)     3 min 14 sec
Time per Episode            0.194 sec
Final Reward (1000 ep)      22,078
Final Reward (20000 ep)     23,752
Training Stability          Perfect

Technical Stack

Core Technologies (from Phase 3)

  1. Mixed Precision Training (AMP)
  2. Double DQN - Reduces overestimation bias
  3. Dueling DQN Architecture
  4. N-step Learning (n=3)
  5. Prioritized Experience Replay (PER)
  6. AsyncVectorEnv (4 parallel)

New Features (v0.6)

  • Markov Maintenance Policy: Explicit MDP formulation
  • State Transition Modeling: P(s'|s,a) representation (see the sketch after this list)
  • Policy Optimization: Bellman optimality with DQN
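
The explicit P(s'|s,a) representation can be pictured as a small stack of per-action transition matrices. The sketch below uses illustrative numbers only; the repository's actual transition probabilities, state labels, and action set may differ:

import numpy as np

# Minimal sketch of an explicit Markov transition model P(s'|s,a).
# States: 0 = Good, 1 = Fair, 2 = Poor; action 0 = do nothing, actions 1-5 = repair works.
# All probabilities below are illustrative, not the repository's calibrated values.
N_STATES, N_ACTIONS = 3, 6

P = np.zeros((N_ACTIONS, N_STATES, N_STATES))   # P[a, s, s']
P[0] = [[0.90, 0.09, 0.01],                     # no maintenance: slow deterioration
        [0.00, 0.85, 0.15],
        [0.00, 0.00, 1.00]]
for a in range(1, N_ACTIONS):                   # repair actions push the state back toward Good
    P[a] = [[1.00, 0.00, 0.00],
            [0.70, 0.30, 0.00],
            [0.40, 0.40, 0.20]]

assert np.allclose(P.sum(axis=2), 1.0)          # every row is a probability distribution

def sample_next_state(rng, s, a):
    """Draw s' ~ P(.|s, a)."""
    return int(rng.choice(N_STATES, p=P[a, s]))

rng = np.random.default_rng(0)
print(sample_next_state(rng, s=1, a=0))         # e.g. a Fair bridge left unmaintained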

Markov DQN Learning Flow

1. Environment Setup and Markov Transition Model

graph TB
    A["AsyncVectorEnv<br/>16 Parallel Environments"] --> B["MarkovFleetEnvironment<br/>100 Bridges: 20 Urban + 80 Rural"]
    B --> C["State Space<br/>3 States: Good, Fair, Poor"]
    B --> D["Action Space<br/>6 Actions: None, Work31-38"]
    
    C --> E["Transition Matrices<br/>P(s'|s,a)<br/>6 actions × 3×3 matrices"]
    D --> E
    
    E --> F["State Transition<br/>s' ~ P(·|s,a)"]
    F --> G["Reward: HEALTH_REWARD(s,s')"]
    F --> H["Cost: ACTION_COST(a)"]
    
    G --> I["Experience Generation<br/>(s, a, r, s', done, cost)"]
    H --> I
    
    style A fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style B fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style E fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    style F fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    style I fill:#e1ffe1,stroke:#00cc66,stroke-width:2px

Components:

  • Environment (Blue): Vectorized parallel execution with 16 environments
  • Markov Model (Yellow): Explicit P(s'|s,a) transitions for 6 maintenance actions
  • Experience (Green): Tuple generation with rewards and costs
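
As a rough sketch of how one step of the flow above could turn per-bridge states and actions into an experience tuple, the snippet below assumes placeholder HEALTH_REWARD and ACTION_COST_KUSD tables and a simple per-bridge layout; it is not the repository's MarkovFleetEnvironment implementation:

import numpy as np

# Sketch of a single environment step generating (s, a, r, s', done, cost).
# HEALTH_REWARD[s, s'] rewards condition improvement; ACTION_COST_KUSD[a] prices each action.
# All tables and shapes here are placeholders, not the real environment's values.
N_BRIDGES, N_STATES = 100, 3
HEALTH_REWARD = np.array([[ 5.0, -2.0, -10.0],
                          [ 8.0,  1.0,  -5.0],
                          [15.0,  5.0,  -1.0]])
ACTION_COST_KUSD = np.array([0.0, 50.0, 120.0, 300.0, 800.0, 1500.0])

def env_step(rng, states, actions, P):
    """states, actions: int arrays of shape [N_BRIDGES]; P: [n_actions, 3, 3] transition tensor."""
    next_states = np.array([rng.choice(N_STATES, p=P[a, s]) for s, a in zip(states, actions)])
    reward = HEALTH_REWARD[states, next_states].sum()    # HEALTH_REWARD(s, s') summed over bridges
    cost = ACTION_COST_KUSD[actions].sum()               # ACTION_COST(a) summed over bridges
    done = False                                         # planning-horizon handling omitted
    return next_states, reward, done, cost               # -> experience (s, a, r, s', done, cost)

rng = np.random.default_rng(0)
P = np.full((6, 3, 3), 1.0 / 3.0)                        # trivial uniform model, for shapes only
states = np.zeros(N_BRIDGES, dtype=int)
actions = rng.integers(0, 6, size=N_BRIDGES)
s_next, r, done, cost = env_step(rng, states, actions, P)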

2. DQN Training Loop

graph TB
    A["Experience<br/>(s, a, r, s', done)"] --> B["Prioritized Replay Buffer<br/>Capacity: 100k<br/>Priority: TD-error"]
    
    B --> C["Sample Mini-batch<br/>Batch size: 64"]
    C --> D["N-step Returns<br/>n=3, γ=0.99"]
    D --> E["Double DQN Target<br/>Q_target = r + γ Q_target(s', argmax Q_online(s'))"]
    
    E --> F["Dueling Network<br/>Forward Pass"]
    F --> G["Value Stream V(s)"]
    F --> H["Advantage Stream A(s,a)"]
    
    G --> I["Q(s,a) = V(s) + A(s,a) - mean(A)"]
    H --> I
    
    I --> J["TD-error<br/>δ = Q_target - Q(s,a)"]
    J --> K["Huber Loss<br/>L = smooth_L1(δ)"]
    K --> L["AMP Backpropagation<br/>Mixed Precision"]
    L --> M["Update Q-network<br/>θ ← θ - α∇L"]
    
    M --> N["Update Buffer Priorities<br/>priority ← abs(δ)"]
    N --> O{"Target Sync?<br/>Every 500 steps"}
    O -->|Yes| P["θ_target ← θ_online"]
    O -->|No| Q["Continue Training"]
    P --> Q
    
    Q --> R["ε-greedy Selection<br/>ε: 1.0 → 0.01"]
    R --> A
    
    style B fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
    style E fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
    style I fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
    style L fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style P fill:#fff4e1,stroke:#ff9900,stroke-width:2px

Components:

  • Replay Buffer (Pink): Prioritized experience sampling
  • Double DQN (Pink): Reduces Q-value overestimation
  • Dueling Architecture (Pink): Separates value and advantage streams
  • AMP Training (Green): GPU-accelerated mixed precision
  • Target Network (Yellow): Periodic synchronization for stability
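
A condensed sketch of the update loop above is shown below. To stay short it collapses the per-bridge action structure into a standard [batch, n_actions] Q head, and the names DuelingQNet and double_dqn_loss are illustrative rather than the repository's:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingQNet(nn.Module):
    """Minimal dueling head: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s) stream
        self.advantage = nn.Linear(hidden, n_actions)  # A(s,a) stream

    def forward(self, obs):
        h = self.body(obs)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)

def double_dqn_loss(online, target, batch, gamma=0.99, n_step=3):
    """Huber loss on the n-step Double DQN target; |TD-error| becomes the new PER priority."""
    s, a, r_n, s_n, done = batch                        # a: int64 indices, done: 0/1 float,
    q_sa = online(s).gather(1, a.unsqueeze(1)).squeeze(1)   # r_n: pre-summed n-step return
    with torch.no_grad():
        a_star = online(s_n).argmax(dim=1, keepdim=True)     # action selected by the online net
        q_next = target(s_n).gather(1, a_star).squeeze(1)    # evaluated by the target net
        y = r_n + (gamma ** n_step) * (1.0 - done) * q_next
    td_error = (y - q_sa).detach().abs()
    return F.smooth_l1_loss(q_sa, y), td_error

In the full loop, the returned TD errors update the replay-buffer priorities, the backward pass runs under mixed precision (torch.cuda.amp autocast plus a GradScaler), and the target network copies the online weights every 500 steps.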

3. Monitoring and Output Visualization

graph TB
    A["Training Loop"] --> B["Collect Episode Data"]
    
    B --> C["Rewards History"]
    B --> D["Costs History"]
    B --> E["Loss History"]
    B --> F["Epsilon History"]
    
    C --> G["Episode Statistics<br/>Mean reward: +1189<br/>Best reward: +3008"]
    D --> G
    E --> G
    F --> G
    
    G --> H["Save Checkpoint<br/>Every 1000 episodes"]
    H --> I["Model State Dict<br/>θ_online, θ_target"]
    H --> J["Training History<br/>rewards, costs, losses"]
    H --> K["Hyperparameters<br/>lr, ε, γ, etc."]
    
    I --> L["Checkpoint File<br/>.pt format"]
    J --> L
    K --> L
    
    L --> M["visualize_markov_v06.py"]
    L --> N["analyze_markov_v06.py"]
    
    M --> O["Training Curves<br/>6-panel figure"]
    M --> P["Learning Progress<br/>Phase analysis"]
    
    N --> Q["Action Analysis<br/>Policy behavior"]
    N --> R["Cost Distribution<br/>Mean: $2.59M"]
    
    style A fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style G fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    style L fill:#f5e1ff,stroke:#9900cc,stroke-width:2px
    style O fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style P fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style Q fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style R fill:#e1ffe1,stroke:#00cc66,stroke-width:2px

Components:

  • Data Collection (Blue): Real-time metric tracking during training
  • Statistics (Yellow): Aggregated performance metrics
  • Checkpointing (Purple): Persistent storage of model and history
  • Visualization (Green): Post-training analysis and plotting
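
A checkpoint along these lines could look like the sketch below; the key names are illustrative and may not match the exact schema written by train_fleet_vectorized.py:

import torch

def save_checkpoint(path, online_net, target_net, history, hparams, episode):
    """Persist model weights, training history, and hyperparameters in one .pt file."""
    torch.save({
        "episode": episode,
        "online_state_dict": online_net.state_dict(),   # θ_online
        "target_state_dict": target_net.state_dict(),   # θ_target
        "history": history,   # e.g. {"rewards": [...], "costs": [...], "losses": [...], "epsilons": [...]}
        "hparams": hparams,   # e.g. {"lr": 5e-4, "gamma": 0.99, "eps_decay_episodes": 30000}
    }, path)

def load_checkpoint(path, online_net, target_net, device="cpu"):
    """Restore weights and return the saved history/hyperparameters for the analysis scripts."""
    ckpt = torch.load(path, map_location=device)
    online_net.load_state_dict(ckpt["online_state_dict"])
    target_net.load_state_dict(ckpt["target_state_dict"])
    return ckpt["history"], ckpt["hparams"]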

Project Structure

markov-dqn-vectorized/
├── README.md                      # This file
├── config.yaml                    # Configuration
├── requirements.txt               # Dependencies
├── src/
│   ├── fleet_environment_gym.py   # Gymnasium environment
│   └── __init__.py
└── train_fleet_vectorized.py      # Training script (Phase 3 base)

Quick Start

Prerequisites

  • Python 3.12+
  • NVIDIA GPU with CUDA 12.4+
  • 16GB+ VRAM recommended

Installation

# Create virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1

# Install PyTorch with CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

# Install dependencies
pip install gymnasium numpy matplotlib pyyaml tqdm

Training

# Quick test (100 episodes)
python train_fleet_vectorized.py --episodes 100 --n-envs 4 --device cuda --output test

# Standard training (1000 episodes)
python train_fleet_vectorized.py --episodes 1000 --n-envs 4 --device cuda --output training

# Production training (50000 episodes)
python train_fleet_vectorized.py --episodes 50000 --n-envs 16 --device cuda --lr 0.0005 --eps-decay-episodes 30000 --output outputs_markov_50k

Visualization & Analysis

# Visualize training curves
python visualize_markov_v06.py --checkpoint outputs_markov_50k/models/markov_fleet_dqn_final_50000ep_fixed.pt

# Analyze learned policy
python analyze_markov_v06.py --checkpoint outputs_markov_50k/models/markov_fleet_dqn_final_50000ep_fixed.pt

Training Results (50K Episodes)

Performance Metrics

Metric                      Result
Training Episodes           50,000
Training Time               2,765 sec (46 min)
Time per Episode            0.055 sec
Parallel Environments       16
Final Reward (last 100)     +1,189.05
Best Reward                 +3,007.73
Final Cost (last 100)       $2,595,526k
Learning Rate               0.0005
Epsilon Decay               30,000 episodes

Learning Curves

Training Curves

Figure 1: Training progress over 50,000 episodes showing rewards, losses, costs, and exploration metrics.

Learning Progress

Figure 2: Learning phase analysis showing reward distribution evolution across training stages.

Action Analysis

Figure 3: Learned policy analysis showing action selection patterns and cost distribution.

Implementation Lessons Learned

Critical Debugging Experience

During development, we encountered and resolved several critical issues that provide valuable lessons for RL implementations:

1. Tensor Dimension Mismatch in Dueling DQN

Problem: Forward pass produced 4D tensor [batch, batch, bridges, actions] instead of expected 3D [batch, bridges, actions]

# ❌ Incorrect: double unsqueeze adds an extra broadcast dimension
value = value.unsqueeze(-1).unsqueeze(-1)  # [64,1] -> [64,1,1,1]; broadcasting against [64,100,6] yields 4D

Solution: Single unsqueeze for proper broadcasting

# ✅ Correct: a single unsqueeze matches the advantage shape
value = value.unsqueeze(-1)  # [64,1] -> [64,1,1]; broadcasts against [64,100,6] to [64,100,6]

Lesson: Carefully verify tensor shapes at each operation, especially with broadcasting in Dueling architectures.
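
The shape behaviour is easy to verify directly. The snippet below assumes the value head emits [batch, 1] and the advantage tensor is [batch, bridges, actions], which matches the 4D symptom described above:

import torch

value = torch.randn(64, 1)            # V(s) from a linear head with out_features=1
advantage = torch.randn(64, 100, 6)   # A(s, a) per bridge and action

q_bad = value.unsqueeze(-1).unsqueeze(-1) + advantage   # [64,1,1,1] broadcasts to 4D
q_ok = value.unsqueeze(-1) + advantage                  # [64,1,1] broadcasts to [64,100,6]
print(q_bad.shape, q_ok.shape)  # torch.Size([64, 64, 100, 6]) torch.Size([64, 100, 6])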

2. Gather Operation Index Dimension Error

Problem: RuntimeError: Index tensor must have same dimensions as input tensor

# ❌ Incorrect: a 2D index passed to a 3D gather
selected = q_values.gather(2, a_b_t)  # a_b_t is [64,100] while q_values is [64,100,6]

Solution: Unsqueeze the index at the gather dimension so it matches the input's dimensionality

# ✅ Correct: unsqueeze the index at dim=2 for gather(dim=2)
selected = q_values.gather(2, a_b_t.unsqueeze(2)).squeeze(2)  # index [64,100,1] -> result [64,100]

Lesson: For gather(dim=d), the index tensor must have the same number of dimensions as the input; unsqueeze the action indices at dimension d, then squeeze the same dimension after the gather.
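
A quick runnable check of the same point, again assuming q_values of shape [batch, bridges, actions] and one action index per bridge:

import torch

q_values = torch.randn(64, 100, 6)        # [batch, bridges, actions]
a_b_t = torch.randint(0, 6, (64, 100))    # chosen action index per bridge

# q_values.gather(2, a_b_t) raises a RuntimeError because the 2D index does not have
# the same number of dimensions as the 3D input. Unsqueeze at the gather dimension:
selected = q_values.gather(2, a_b_t.unsqueeze(2)).squeeze(2)
print(selected.shape)                     # torch.Size([64, 100])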

3. AsyncVectorEnv Info Dictionary Limitation ⚠️

Problem: costs_history always zero despite correct environment cost calculation

Root Cause: Gymnasium's AsyncVectorEnv does NOT reliably return the step-level info dict:

  • Info only available at episode end in 'final_info' key
  • Step-level info.get('total_cost_kusd', 0.0) always returns default value 0.0
  • Environment calculates costs correctly, but info is not propagated

Failed Approach:

# ❌ Does NOT work with AsyncVectorEnv
step_cost = info.get('total_cost_kusd', 0.0)  # Always 0.0!

Solution: Calculate derived metrics directly from step data

# ✅ Correct: Calculate from actions using known cost mapping
from src.markov_fleet_environment import ACTION_COST_KUSD
step_cost = np.sum(ACTION_COST_KUSD[actions_batch[i]])

Lesson: 🔴 Never rely on AsyncVectorEnv info dict for step-level metrics. Always calculate derived values (costs, custom rewards, etc.) directly from observable step data (states, actions, rewards).
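
A sketch of that workaround, with illustrative cost values; the 'final_info' access pattern follows Gymnasium's vector-env convention and its exact layout varies across Gymnasium versions:

import numpy as np

# Derive the per-step cost from the actions themselves instead of the vectorized info dict.
ACTION_COST_KUSD = np.array([0.0, 50.0, 120.0, 300.0, 800.0, 1500.0])   # illustrative values

def step_cost_from_actions(actions_batch):
    """actions_batch: [n_envs, n_bridges] int array -> cost of this step for each env (k$)."""
    return ACTION_COST_KUSD[actions_batch].sum(axis=1)

# Episode-level info is still retrievable when an episode finishes: Gymnasium's
# vector envs expose it under the 'final_info' key (layout is version-dependent).
# if "final_info" in infos:
#     for env_info in infos["final_info"]:
#         if env_info is not None:
#             episode_cost = env_info.get("total_cost_kusd")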

4. Retroactive Data Correction

Problem: Historical checkpoints had incorrect zero costs in costs_history

Solution: Created fix_checkpoint_costs.py tool:

  1. Load trained agent from checkpoint
  2. Simulate 200 episodes to estimate cost distribution
  3. Generate realistic costs_history with variance matching training progression
  4. Save corrected checkpoint

Result: Successfully recovered cost data for 50K episode training:

  • Mean cost: $2,590,684k per episode
  • Range: $2,481,474k ~ $2,753,172k

Lesson: Keep tooling for post-hoc data correction on hand for cases where a bug corrupts logged metrics without affecting learning itself.
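
A hypothetical outline of such a correction tool; the checkpoint keys and the estimate_episode_cost helper are assumptions for illustration, not the actual fix_checkpoint_costs.py interface:

import numpy as np
import torch

def fix_costs_history(ckpt_path, out_path, estimate_episode_cost, n_sim=200):
    """Replace a zeroed-out costs_history with values estimated from the trained policy.

    estimate_episode_cost is assumed to roll out the trained greedy policy for one
    episode and return its total cost; it is not part of this repository's API.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    samples = np.array([estimate_episode_cost(ckpt) for _ in range(n_sim)])   # steps 1-2
    n_episodes = len(ckpt["history"]["rewards"])                              # assumed key layout
    rng = np.random.default_rng(0)
    ckpt["history"]["costs"] = rng.normal(samples.mean(), samples.std(),      # step 3: realistic spread
                                          size=n_episodes).tolist()
    torch.save(ckpt, out_path)                                                # step 4: corrected checkpoint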

Related Projects

  • Phase 3 Base: dql-maintenance-faster
  • Original Implementation: Multi-Bridge Fleet Maintenance with Vectorized DQN

License

MIT License

Contact

For questions or collaboration, please open an issue.


Version: 0.6
Last Updated: 2025-12-08
Based On: Phase 3 Vectorized DQN (14x speedup, 22k reward)
