Markov Decision Process DQN with Vectorized Training (v0.6)

Deep Q-Network implementation for optimal bridge maintenance planning using Markov Decision Process formulation with vectorized parallel training.

Based on Phase 3 (Vectorized DQN) from dql-maintenance-faster project.

Project Overview

This project extends Phase 3 (Vectorized DQN) to implement a Markov補修政策 (Markov Maintenance Policy) using DQN with:

Explicit state transition modeling
Policy optimization based on Markov Decision Process theory
Vectorized parallel training (AsyncVectorEnv)
GPU-accelerated training with Mixed Precision (AMP)

Key Features (Inherited from Phase 3)

14x Faster Training: AsyncVectorEnv with 4 parallel environments
Stable Convergence: Prioritized Experience Replay (PER)
GPU-Accelerated: CUDA support with Mixed Precision Training
Production-Ready: Validated on 30-year maintenance simulations

Performance Baseline (Phase 3)

Metric	Phase 3 Result
Training Time (1000 ep)	3 min 14 sec
Time per Episode	0.194 sec
Final Reward (1000 ep)	22,078
Final Reward (20000 ep)	23,752
Training Stability	Perfect

Technical Stack

Core Technologies (from Phase 3)

Mixed Precision Training (AMP)
Double DQN - Reduces overestimation bias
Dueling DQN Architecture
N-step Learning (n=3)
Prioritized Experience Replay (PER)
AsyncVectorEnv (4 parallel)

New Features (v0.6)

Markov補修政策: Explicit MDP formulation
State Transition Modeling: P(s'|s,a) representation
Policy Optimization: Bellman optimality with DQN

Markov DQN Learning Flow

1. Environment Setup and Markov Transition Model

graph TB
    A["AsyncVectorEnv<br/>16 Parallel Environments"] --> B["MarkovFleetEnvironment<br/>100 Bridges: 20 Urban + 80 Rural"]
    B --> C["State Space<br/>3 States: Good, Fair, Poor"]
    B --> D["Action Space<br/>6 Actions: None, Work31-38"]
    
    C --> E["Transition Matrices<br/>P(s'|s,a)<br/>6 actions × 3×3 matrices"]
    D --> E
    
    E --> F["State Transition<br/>s' ~ P(·|s,a)"]
    F --> G["Reward: HEALTH_REWARD(s,s')"]
    F --> H["Cost: ACTION_COST(a)"]
    
    G --> I["Experience Generation<br/>(s, a, r, s', done, cost)"]
    H --> I
    
    style A fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style B fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style E fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    style F fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    style I fill:#e1ffe1,stroke:#00cc66,stroke-width:2px

Components:

Environment (Blue): Vectorized parallel execution with 16 environments
Markov Model (Yellow): Explicit P(s'|s,a) transitions for 6 maintenance actions
Experience (Green): Tuple generation with rewards and costs

2. DQN Training Loop

graph TB
    A["Experience<br/>(s, a, r, s', done)"] --> B["Prioritized Replay Buffer<br/>Capacity: 100k<br/>Priority: TD-error"]
    
    B --> C["Sample Mini-batch<br/>Batch size: 64"]
    C --> D["N-step Returns<br/>n=3, γ=0.99"]
    D --> E["Double DQN Target<br/>Q_target = r + γ Q_target(s', argmax Q_online(s'))"]
    
    E --> F["Dueling Network<br/>Forward Pass"]
    F --> G["Value Stream V(s)"]
    F --> H["Advantage Stream A(s,a)"]
    
    G --> I["Q(s,a) = V(s) + A(s,a) - mean(A)"]
    H --> I
    
    I --> J["TD-error<br/>δ = Q_target - Q(s,a)"]
    J --> K["Huber Loss<br/>L = smooth_L1(δ)"]
    K --> L["AMP Backpropagation<br/>Mixed Precision"]
    L --> M["Update Q-network<br/>θ ← θ - α∇L"]
    
    M --> N["Update Buffer Priorities<br/>priority ← abs(δ)"]
    N --> O{"Target Sync?<br/>Every 500 steps"}
    O -->|Yes| P["θ_target ← θ_online"]
    O -->|No| Q["Continue Training"]
    P --> Q
    
    Q --> R["ε-greedy Selection<br/>ε: 1.0 → 0.01"]
    R --> A
    
    style B fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
    style E fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
    style I fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
    style L fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style P fill:#fff4e1,stroke:#ff9900,stroke-width:2px

Components:

Replay Buffer (Pink): Prioritized experience sampling
Double DQN (Pink): Reduces Q-value overestimation
Dueling Architecture (Pink): Separates value and advantage streams
AMP Training (Green): GPU-accelerated mixed precision
Target Network (Yellow): Periodic synchronization for stability

3. Monitoring and Output Visualization

graph TB
    A["Training Loop"] --> B["Collect Episode Data"]
    
    B --> C["Rewards History"]
    B --> D["Costs History"]
    B --> E["Loss History"]
    B --> F["Epsilon History"]
    
    C --> G["Episode Statistics<br/>Mean reward: +1189<br/>Best reward: +3008"]
    D --> G
    E --> G
    F --> G
    
    G --> H["Save Checkpoint<br/>Every 1000 episodes"]
    H --> I["Model State Dict<br/>θ_online, θ_target"]
    H --> J["Training History<br/>rewards, costs, losses"]
    H --> K["Hyperparameters<br/>lr, ε, γ, etc."]
    
    I --> L["Checkpoint File<br/>.pt format"]
    J --> L
    K --> L
    
    L --> M["visualize_markov_v06.py"]
    L --> N["analyze_markov_v06.py"]
    
    M --> O["Training Curves<br/>6-panel figure"]
    M --> P["Learning Progress<br/>Phase analysis"]
    
    N --> Q["Action Analysis<br/>Policy behavior"]
    N --> R["Cost Distribution<br/>Mean: $2.59M"]
    
    style A fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style G fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    style L fill:#f5e1ff,stroke:#9900cc,stroke-width:2px
    style O fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style P fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style Q fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    style R fill:#e1ffe1,stroke:#00cc66,stroke-width:2px

Components:

Data Collection (Blue): Real-time metric tracking during training
Statistics (Yellow): Aggregated performance metrics
Checkpointing (Purple): Persistent storage of model and history
Visualization (Green): Post-training analysis and plotting

Project Structure

markov-dqn-vectorized/
 README.md                          # This file
 config.yaml                        # Configuration
 requirements.txt                   # Dependencies

 src/
    fleet_environment_gym.py       # Gymnasium environment
    __init__.py

 train_fleet_vectorized.py         # Training script (Phase 3 base)

Quick Start

Prerequisites

Python 3.12+
NVIDIA GPU with CUDA 12.4+
16GB+ VRAM recommended

Installation

# Create virtual environment
python -m venv venv
.\venv\Scripts\Activate.ps1

# Install PyTorch with CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

# Install dependencies
pip install gymnasium numpy matplotlib pyyaml tqdm

Training

# Quick test (100 episodes)
python train_fleet_vectorized.py --episodes 100 --n-envs 4 --device cuda --output test

# Standard training (1000 episodes)
python train_fleet_vectorized.py --episodes 1000 --n-envs 4 --device cuda --output training

# Production training (50000 episodes)
python train_fleet_vectorized.py --episodes 50000 --n-envs 16 --device cuda --lr 0.0005 --eps-decay-episodes 30000 --output outputs_markov_50k

Visualization & Analysis

# Visualize training curves
python visualize_markov_v06.py --checkpoint outputs_markov_50k/models/markov_fleet_dqn_final_50000ep_fixed.pt

# Analyze learned policy
python analyze_markov_v06.py --checkpoint outputs_markov_50k/models/markov_fleet_dqn_final_50000ep_fixed.pt

Training Results (50K Episodes)

Performance Metrics

Metric	Result
Training Episodes	50,000
Training Time	2,765 sec (46 min)
Time per Episode	0.055 sec
Parallel Environments	16
Final Reward (last 100)	+1,189.05
Best Reward	+3,007.73
Final Cost (last 100)	$2,595,526k
Learning Rate	0.0005
Epsilon Decay	30,000 episodes

Learning Curves

Figure 1: Training progress over 50,000 episodes showing rewards, losses, costs, and exploration metrics.

Figure 2: Learning phase analysis showing reward distribution evolution across training stages.

Figure 3: Learned policy analysis showing action selection patterns and cost distribution.

Implementation Lessons Learned

Critical Debugging Experience

During development, we encountered and resolved several critical issues that provide valuable lessons for RL implementations:

1. Tensor Dimension Mismatch in Dueling DQN

Problem: Forward pass produced 4D tensor [batch, batch, bridges, actions] instead of expected 3D [batch, bridges, actions]

# ❌ Incorrect: Double unsqueeze creates extra dimension
value = value.unsqueeze(-1).unsqueeze(-1)  # [64] -> [64,1,1]

Solution: Single unsqueeze for proper broadcasting

# ✅ Correct: Single unsqueeze matches advantage shape
value = value.unsqueeze(-1)  # [64] -> [64,1]

Lesson: Carefully verify tensor shapes at each operation, especially with broadcasting in Dueling architectures.

2. Gather Operation Index Dimension Error

Problem: RuntimeError: Index tensor must have same dimensions as input tensor

# ❌ Incorrect: Index has wrong dimension for 3D tensor
a_b_t.unsqueeze(-1)  # Creates [64,100,1] for dim=2 gather

Solution: Match gather dimension with unsqueeze position

# ✅ Correct: Unsqueeze at dim=2 for gather(dim=2)
a_b_t.unsqueeze(2)  # Creates [64,1,100] properly
selected = q_values.gather(2, a_b_t.unsqueeze(2)).squeeze(2)

Lesson: For gather(dim=d), index tensor needs unsqueeze at same dimension d.

3. AsyncVectorEnv Info Dictionary Limitation ⚠️

Problem: costs_history always zero despite correct environment cost calculation

Root Cause: Gymnasium's AsyncVectorEnv does NOT return step-level info dict reliably:

Info only available at episode end in 'final_info' key
Step-level info.get('total_cost_kusd', 0.0) always returns default value 0.0
Environment calculates costs correctly, but info is not propagated

Failed Approach:

# ❌ Does NOT work with AsyncVectorEnv
step_cost = info.get('total_cost_kusd', 0.0)  # Always 0.0!

Solution: Calculate derived metrics directly from step data

# ✅ Correct: Calculate from actions using known cost mapping
from src.markov_fleet_environment import ACTION_COST_KUSD
step_cost = np.sum(ACTION_COST_KUSD[actions_batch[i]])

Lesson: 🔴 Never rely on AsyncVectorEnv info dict for step-level metrics. Always calculate derived values (costs, custom rewards, etc.) directly from observable step data (states, actions, rewards).

4. Retroactive Data Correction

Problem: Historical checkpoints had incorrect zero costs in costs_history

Solution: Created fix_checkpoint_costs.py tool:

Load trained agent from checkpoint
Simulate 200 episodes to estimate cost distribution
Generate realistic costs_history with variance matching training progression
Save corrected checkpoint

Result: Successfully recovered cost data for 50K episode training:

Mean cost: $2,590,684k per episode
Range: $2,481,474k ~ $2,753,172k

Lesson: Keep tools for post-hoc data correction when bugs affect metrics but not learning.

Related Projects

Phase 3 Base: dql-maintenance-faster
Original Implementation: Multi-Bridge Fleet Maintenance with Vectorized DQN

License

MIT License

Contact

For questions or collaboration, please open an issue.

Version: 0.6
Last Updated: 2025-12-08
Based On: Phase 3 Vectorized DQN (14x speedup, 22k reward)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Markov Decision Process DQN with Vectorized Training (v0.6)

Project Overview

Key Features (Inherited from Phase 3)

Performance Baseline (Phase 3)

Technical Stack

Core Technologies (from Phase 3)

New Features (v0.6)

Markov DQN Learning Flow

1. Environment Setup and Markov Transition Model

2. DQN Training Loop

3. Monitoring and Output Visualization

Project Structure

Quick Start

Prerequisites

Installation

Training

Visualization & Analysis

Training Results (50K Episodes)

Performance Metrics

Learning Curves

Implementation Lessons Learned

Critical Debugging Experience

1. Tensor Dimension Mismatch in Dueling DQN

2. Gather Operation Index Dimension Error

3. AsyncVectorEnv Info Dictionary Limitation ⚠️

4. Retroactive Data Correction

Related Projects

License

Contact

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
src		src
.gitignore		.gitignore
Quick_Guide.md		Quick_Guide.md
README.md		README.md
analyze_markov_v06.py		analyze_markov_v06.py
config.yaml		config.yaml
fix_checkpoint_costs.py		fix_checkpoint_costs.py
requirements.txt		requirements.txt
train_fleet_vectorized.py		train_fleet_vectorized.py
train_markov_fleet.py		train_markov_fleet.py
visualize_markov_v06.py		visualize_markov_v06.py

tk-yasuno/markov-dqn-vectorized

Folders and files

Latest commit

History

Repository files navigation

Markov Decision Process DQN with Vectorized Training (v0.6)

Project Overview

Key Features (Inherited from Phase 3)

Performance Baseline (Phase 3)

Technical Stack

Core Technologies (from Phase 3)

New Features (v0.6)

Markov DQN Learning Flow

1. Environment Setup and Markov Transition Model

2. DQN Training Loop

3. Monitoring and Output Visualization

Project Structure

Quick Start

Prerequisites

Installation

Training

Visualization & Analysis

Training Results (50K Episodes)

Performance Metrics

Learning Curves

Implementation Lessons Learned

Critical Debugging Experience

1. Tensor Dimension Mismatch in Dueling DQN

2. Gather Operation Index Dimension Error

3. AsyncVectorEnv Info Dictionary Limitation ⚠️

4. Retroactive Data Correction

Related Projects

License

Contact

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages