diff --git a/reward_analysis.md b/reward_analysis.md
new file mode 100644
index 0000000..17d450f
--- /dev/null
+++ b/reward_analysis.md
@@ -0,0 +1,228 @@
+# Custom Reward Function Analysis
+
+## Issues Found
+
+### 🚨 **Critical Issue: Duplicate Function Definitions**
+
+All four reward functions you asked me to review exist **twice** in your code:
+
+1. **Lines 459-494**: First definitions (wrong docstrings; they look like copy-paste errors)
+2. **Lines 544-622**: Second definitions (correct docstrings)
+
+`gen_reward_manager()` at line 627 uses the SECOND definitions (lines 544-622), but the duplicates are confusing and could cause issues.
+
+---
+
+## Function-by-Function Analysis
+
+### 1. `target_height_reward` (Lines 544-559)
+
+**Current Implementation:**
+```python
+return -((obj.body.position.y - target_height)**2)
+```
+
+**Issues:**
+- ✅ **This is well-implemented** - negative squared L2 distance is a standard reward formulation
+- ✅ Returns a negative reward (penalty) that shrinks as the player approaches the target height
+- ✅ Smooth and continuous, which is good for RL
+
+**Usage Note:**
+- Currently used with weight=0.0 in line 629, so it's effectively disabled
+
+---
+
+### 2. `head_to_middle_reward` (Lines 561-580)
+
+**Current Implementation:**
+```python
+multiplier = -1 if player.body.position.x > 0 else 1
+reward = multiplier * (player.body.position.x - player.prev_x)
+return reward
+```
+
+**Issues:**
+- ⚠️ The sign logic is correct, but the formulation is fragile near x=0 and only looks at horizontal movement
+
+**What the multiplier does:**
+- When `player.position.x > 0` (right side), multiplier is -1
+- When `player.position.x < 0` (left side), multiplier is 1
+- This means:
+  - Player on RIGHT side: reward = `-1 * (current - prev)` = `prev - current`, so moving LEFT (toward the middle) is rewarded
+  - Player on LEFT side: reward = `1 * (current - prev)` = `current - prev`, so moving RIGHT (toward the middle) is rewarded
+- ✅ So the direction logic is right
+
+**However:**
+- It only rewards horizontal movement toward x=0
+- If the player is already at the middle (x ≈ 0), the multiplier flips sign on tiny differences
+- This creates a discontinuity at x=0
+
+**Potential Issues:**
+- The discontinuity at x=0 could confuse the agent
+- No bounds checking - what if the arena is larger than expected?
+
+**Better Implementation Suggestion:**
+```python
+def head_to_middle_reward(env: WarehouseBrawl) -> float:
+    player: Player = env.objects["player"]
+
+    # Distance from middle
+    dist_from_middle = abs(player.body.position.x)
+
+    # Reward for reducing distance (moving toward middle)
+    prev_dist = abs(player.prev_x)
+    reward = prev_dist - dist_from_middle
+
+    return reward
+```
+
+**Why this is better:**
+- Continuous everywhere
+- Directly measures progress toward the middle
+- No sign-flipping issues
+- Works regardless of arena size
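+
+To make the discontinuity point concrete, here's a quick standalone check you can run in a plain Python shell (no env objects needed; `original_reward` / `distance_reward` are just illustrative names for the two formulations above):
+
+```python
+def original_reward(x, prev_x):
+    # multiplier-based version currently in train_agent.py
+    multiplier = -1 if x > 0 else 1
+    return multiplier * (x - prev_x)
+
+def distance_reward(x, prev_x):
+    # distance-based rewrite suggested above
+    return abs(prev_x) - abs(x)
+
+# Three small rightward steps: approaching, crossing, and leaving the middle.
+for prev_x, x in [(-0.10, -0.05), (-0.01, 0.01), (0.05, 0.10)]:
+    print(f"{prev_x:+.2f} -> {x:+.2f}: "
+          f"original={original_reward(x, prev_x):+.2f}, "
+          f"distance={distance_reward(x, prev_x):+.2f}")
+```
+
+Both agree away from x=0, but on the crossing step the multiplier version hands out a penalty (-0.02) for barely overshooting the middle, while the distance version returns zero because the player is exactly as close to the middle as before.
+
+---
+
+### 3. 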
`head_to_opponent` (Lines 582-602)
+
+**Current Implementation:**
+```python
+multiplier = -1 if player.body.position.x > opponent.body.position.x else 1
+reward = multiplier * (player.body.position.x - player.prev_x)
+return reward
+```
+
+**Issues:**
+- ⚠️ **Same discontinuity problem** as `head_to_middle_reward`
+- ⚠️ **Only considers horizontal movement**
+
+**What this does:**
+- If player is RIGHT of opponent, reward = `-1 * (current - prev)` = reward for moving left (toward opponent)
+- If player is LEFT of opponent, reward = `1 * (current - prev)` = reward for moving right (toward opponent)
+- ✅ Logic is correct for the horizontal component
+
+**Problems:**
+1. **No vertical component** - doesn't reward moving up/down toward the opponent
+2. **Discontinuity** when player.x exactly equals opponent.x
+3. **Only tracks the x-axis** - the reward depends on the sign of the horizontal step alone, not on whether the move actually closes the gap to the opponent
+
+**Better Implementation Suggestion:**
+```python
+def head_to_opponent(env: WarehouseBrawl) -> float:
+    player: Player = env.objects["player"]
+    opponent: Player = env.objects["opponent"]
+
+    # Current distance
+    current_dist = np.sqrt(
+        (player.body.position.x - opponent.body.position.x)**2 +
+        (player.body.position.y - opponent.body.position.y)**2
+    )
+
+    # Previous distance
+    prev_dist = np.sqrt(
+        (player.prev_x - opponent.body.position.x)**2 +
+        (player.prev_y - opponent.body.position.y)**2
+    )
+
+    # Reward for getting closer
+    reward = prev_dist - current_dist
+
+    return reward
+```
+
+**Why this is better:**
+- Considers both horizontal AND vertical movement
+- Continuous everywhere
+- Works for any approach direction, not just left/right
+- Natural interpretation: reward proportional to distance reduction
+
+**OR, for movement-based (similar to current):**
+```python
+def head_to_opponent(env: WarehouseBrawl) -> float:
+    player: Player = env.objects["player"]
+    opponent: Player = env.objects["opponent"]
+
+    # Direction vector toward opponent
+    dx = opponent.body.position.x - player.body.position.x
+    dy = opponent.body.position.y - player.body.position.y
+
+    # Normalize (handle division by zero)
+    dist = np.sqrt(dx**2 + dy**2)
+    if dist < 0.001:
+        return 0.0  # Already on top of opponent
+
+    dx_norm = dx / dist
+    dy_norm = dy / dist
+
+    # Movement this frame
+    vel_x = player.body.position.x - player.prev_x
+    vel_y = player.body.position.y - player.prev_y
+
+    # Reward velocity in direction of opponent
+    reward = dx_norm * vel_x + dy_norm * vel_y
+
+    return reward
+```
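+
+The missing vertical component is easy to see with a standalone check (plain Python, no env objects; `xonly_reward` / `euclid_reward` are illustrative names): a player dropping straight down onto the opponent earns nothing from the current x-only formulation, but a positive reward from the distance-based rewrite.
+
+```python
+import math
+
+def xonly_reward(px, prev_px, ox):
+    # x-only, multiplier-based version currently in train_agent.py
+    multiplier = -1 if px > ox else 1
+    return multiplier * (px - prev_px)
+
+def euclid_reward(px, py, prev_px, prev_py, ox, oy):
+    # Euclidean-distance rewrite suggested above
+    prev_dist = math.hypot(prev_px - ox, prev_py - oy)
+    curr_dist = math.hypot(px - ox, py - oy)
+    return prev_dist - curr_dist
+
+# Player falls from y=2 to y=1, directly above an opponent at the origin.
+print(xonly_reward(px=0.0, prev_px=0.0, ox=0.0))                                # 0.0
+print(euclid_reward(px=0.0, py=1.0, prev_px=0.0, prev_py=2.0, ox=0.0, oy=0.0))  # 1.0
+```
+
+---
+
+### 4. 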
`taunt_reward` (Lines 604-622) + +**Current Implementation:** +```python +reward = 1 if isinstance(player.state, TauntState) else 0.0 +return reward * env.dt +``` + +**Issues:** +- āœ… **This is well-implemented** +- āœ… Returns a fixed reward for being in taunt state +- āœ… Multiplied by dt for proper time integration + +**One potential improvement:** +If you want to discourage spamming taunt, you could add a small negative reward when entering taunt from certain states: + +```python +# Could track previous state and penalize frequent taunting +# But current implementation is fine for basic use +``` + +--- + +## Summary Recommendations + +| Function | Current Quality | Recommended Action | +|----------|----------------|-------------------| +| `target_height_reward` | āœ… Good | Keep as-is | +| `head_to_middle_reward` | āš ļø Works but has issues | **Rewrite** to distance-based | +| `head_to_opponent` | āš ļø Only 1D, has issues | **Rewrite** to 2D distance-based | +| `taunt_reward` | āœ… Good | Keep as-is | + +## General Recommendations + +1. **Delete duplicate functions** (lines 459-494) to avoid confusion +2. **Fix head_to_middle** to use distance-based formulation +3. **Fix head_to_opponent** to consider both X and Y movement +4. **Test reward scaling** - make sure magnitudes are appropriate relative to other rewards + +## Context: How prev_x Works + +From code analysis: +- `prev_x` is updated in `physics_process()` after physics step completes (line 3933) +- This means `position.x - prev_x` gives the **change in position this frame** (velocity * dt) +- This is correct for calculating movement rewards + +## Note on Weights + +Your current weights in `gen_reward_manager()`: +- `head_to_middle_reward`: 0.01 +- `head_to_opponent`: 0.05 +- `taunt_reward`: 0.2 + +These seem reasonable for encouraging engagement (head_to_opponent) without overwhelming other signals. + diff --git a/user/grid_search.py b/user/grid_search.py new file mode 100644 index 0000000..f966443 --- /dev/null +++ b/user/grid_search.py @@ -0,0 +1,91 @@ +import sys, os +sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))) +import pandas as pd +from train_agent import RecurrentPPOAgent, train, TrainLogging, danger_zone_reward, damage_interaction_reward, in_state_reward, on_win_reward, AttackState, RewardManager, RewTerm, SaveHandler, SaveHandlerMode, OpponentsCfg +from functools import partial +from environment.agent import BasedAgent +from environment.environment import CameraResolution +from itertools import product + +""" +Grid Search over reward function weights for training an RL agent. 
+""" + +# Create an array to store results of Grid Search training +results = [] + +# Define search space +search_space = { + 'damage_interaction_reward': [0.5, 1.0], + 'danger_zone_reward': [0.1, 0.5], + 'penalize_attack_reward': [-0.1, -0.04], +} + +for dmg, dz, atk in product( + search_space['damage_interaction_reward'], + search_space['danger_zone_reward'], + search_space['penalize_attack_reward'] +): + print(f"\nšŸš€ Training with dmg={dmg}, dz={dz}, atk={atk}") + + reward_manager = RewardManager( + { + 'danger_zone_reward': RewTerm(func=danger_zone_reward, weight=dz), + 'damage_interaction_reward': RewTerm(func=damage_interaction_reward, weight=dmg), + 'penalize_attack_reward': RewTerm(func=in_state_reward, weight=atk, params={'desired_state': AttackState}), + }, + { + 'on_win_reward': ('win_signal', RewTerm(func=on_win_reward, weight=50)), + } + ) + + my_agent = RecurrentPPOAgent() + save_path = f'checkpoints/gridsearch_d{dmg}_z{dz}_a{atk}' + save_handler = SaveHandler( + agent=my_agent, + save_freq=1000, + save_path=save_path, + run_name=f'exp_d{dmg}_z{dz}_a{atk}', + mode=SaveHandlerMode.FORCE + ) + + opponent_cfg = OpponentsCfg(opponents={'based_agent': (1.0, partial(BasedAgent))}) + + # Run training + try: + train( + my_agent, + reward_manager, + save_handler, + opponent_cfg, + CameraResolution.LOW, + train_timesteps=2000, + train_logging=TrainLogging.PLOT + ) + + # Load training log (usually saved as monitor.csv) + import os + log_path = os.path.join(save_path, "monitor.csv") + if os.path.exists(log_path): + df = pd.read_csv(log_path, skiprows=1) + mean_reward = df['r'].mean() # average episode reward + else: + mean_reward = None + + results.append({ + "damage_interaction_reward": dmg, + "danger_zone_reward": dz, + "penalize_attack_reward": atk, + "mean_reward": mean_reward, + "checkpoint": save_path + }) + except Exception as e: + print(f"āŒ Failed for dmg={dmg}, dz={dz}, atk={atk}: {e}") + +# Save results to CSV for inspection +pd.DataFrame(results).to_csv("gridsearch_results.csv", index=False) + +# Print the best one +best = max(results, key=lambda x: x["mean_reward"] or float('-inf')) +print("\nšŸ† Best configuration:") +print(best) \ No newline at end of file diff --git a/user/grid_search_optuna.py b/user/grid_search_optuna.py new file mode 100644 index 0000000..9073d10 --- /dev/null +++ b/user/grid_search_optuna.py @@ -0,0 +1,86 @@ +import os +import optuna +import pandas as pd +from functools import partial + +from train_agent import ( + RecurrentPPOAgent, train, TrainLogging, + danger_zone_reward, damage_interaction_reward, in_state_reward, + holding_more_than_3_kets, on_win_reward, AttackState, + RewardManager, RewTerm, SaveHandler, SaveHandlerMode, OpponentsCfg +) +from environment.agent import BasedAgent +from environment.environment import CameraResolution + + +def objective(trial): + """ + Defines one optimization trial. + Each trial uses a unique combination of reward weights. 
+ """ + # --- Suggest weights for each reward term --- + dmg = trial.suggest_float("damage_interaction_reward", 0.5, 2.0) + dz = trial.suggest_float("danger_zone_reward", 0.05, 0.5) + atk = trial.suggest_float("penalize_attack_reward", -0.15, -0.02) + hold = trial.suggest_float("holding_more_than_3_kets", -0.5, 0.5) + + # --- Build Reward Manager --- + reward_manager = RewardManager( + { + 'danger_zone_reward': RewTerm(func=danger_zone_reward, weight=dz), + 'damage_interaction_reward': RewTerm(func=damage_interaction_reward, weight=dmg), + 'penalize_attack_reward': RewTerm( + func=in_state_reward, + weight=atk, + params={'desired_state': AttackState} + ), + 'holding_more_than_3_kets': RewTerm( + func=holding_more_than_3_kets, + weight=hold + ), + }, + { + 'on_win_reward': ('win_signal', RewTerm(func=on_win_reward, weight=50)), + } + ) + + # --- Initialize agent, save handler, and opponent --- + my_agent = RecurrentPPOAgent() + save_path = f"checkpoints/optuna_trial_{trial.number}" + os.makedirs(save_path, exist_ok=True) + + save_handler = SaveHandler( + agent=my_agent, + save_freq=10_000, + save_path=save_path, + run_name=f"optuna_trial_{trial.number}", + mode=SaveHandlerMode.FORCE, + ) + + opponent_cfg = OpponentsCfg(opponents={'based_agent': (1.0, partial(BasedAgent))}) + + # --- Train briefly (for evaluation) --- + try: + train( + my_agent, + reward_manager, + save_handler, + opponent_cfg, + CameraResolution.LOW, + train_timesteps=100_000, + train_logging=TrainLogging.NONE + ) + except Exception as e: + print(f"āŒ Trial {trial.number} failed: {e}") + return -999 # Penalize crashed runs + + # --- Load training rewards --- + log_path = os.path.join(save_path, "monitor.csv") + if os.path.exists(log_path): + df = pd.read_csv(log_path, skiprows=1) + mean_reward = df['r'].mean() + print(f"Trial {trial.number}: mean_reward={mean_reward:.2f}") + return mean_reward + else: + print(f"āŒ Trial {trial.number} failed: log file not found.") + return -999 # Penalize missing logs \ No newline at end of file diff --git a/user/pvp_match.py b/user/pvp_match.py index 9ee7d754..b1486a1 100644 --- a/user/pvp_match.py +++ b/user/pvp_match.py @@ -1,17 +1,16 @@ # import skvideo # import skvideo.io from environment.environment import RenderMode -from environment.agent import SB3Agent, CameraResolution, RecurrentPPOAgent, BasedAgent, UserInputAgent, ConstantAgent, run_match, run_real_time_match, gen_reward_manager -from user.my_agent import SubmittedAgent, ConstantAgent +from environment.agent import SB3Agent, CameraResolution, RecurrentPPOAgent, BasedAgent, UserInputAgent, ConstantAgent, run_match, run_real_time_match +from user.my_agent import SubmittedAgent -reward_manager = gen_reward_manager() experiment_dir_1 = "experiment_6/" #input('Model experiment directory name (e.g. experiment_1): ') model_name_1 = "rl_model00_steps" #input('Name of first model (e.g. 
rl_model_100_steps): ')
 
 my_agent = UserInputAgent()
 #opponent = SubmittedAgent(None)
-opponent = ConstantAgent()
+opponent = SubmittedAgent("C:/Users/HMUQRI/Downloads/UTMIST-AI2/checkpoints/experiment_9/rl_model_700007_steps.zip")
 
 # my_agent = UserInputAgent()
 # opponent = ConstantAgent()
diff --git a/user/train_agent.py b/user/train_agent.py
index 7356155..9cc4864 100644
--- a/user/train_agent.py
+++ b/user/train_agent.py
@@ -537,6 +537,108 @@ def on_combo_reward(env: WarehouseBrawl, agent: str) -> float:
         return -1.0
     else:
         return 1.0
+
+# -------------------------------------------------------------------------
+# ------------------------- CUSTOM REWARD FUNCTIONS -----------------------
+
+def target_height_reward(
+    env: WarehouseBrawl,
+    target_height: float,
+    obj_name: str = 'player'
+) -> float:
+    """Reward asset for being close to target height using L2 squared kernel.
+
+    Note:
+        For flat terrain, target height is in the world frame. For rough terrain,
+        sensor readings can adjust the target height to account for the terrain.
+    """
+    # Extract the used quantities (to enable type-hinting)
+    obj: GameObject = env.objects[obj_name]
+
+    # Compute the L2 squared penalty
+    return -((obj.body.position.y - target_height)**2)
+
+def head_to_middle_reward(
+    env: WarehouseBrawl,
+) -> float:
+    """
+    Rewards player for moving towards the middle of the arena.
+
+    Args:
+        env (WarehouseBrawl): The game environment.
+
+    Returns:
+        float: The computed reward.
+
+    Fix: Continuous everywhere, directly measures progress toward the middle, and has no
+    sign-flipping issue.
+
+    """
+    player: Player = env.objects["player"]
+
+    # Distance from middle
+    dist_from_middle = abs(player.body.position.x)
+
+    # Reward for reducing distance (moving toward middle)
+    prev_dist = abs(player.prev_x)
+    reward = prev_dist - dist_from_middle
+
+    return reward
+
+
+def head_to_opponent(
+    env: WarehouseBrawl,
+) -> float:
+    """
+    Rewards player for moving towards the opponent.
+
+    Args:
+        env (WarehouseBrawl): The game environment.
+
+    Returns:
+        float: The computed reward.
+
+    Fix: Considers horizontal and vertical movement via the per-frame reduction in Euclidean distance.
+    """
+    player: Player = env.objects["player"]
+    opponent: Player = env.objects["opponent"]
+
+    # Current distance
+    current_dist = np.sqrt(
+        (player.body.position.x - opponent.body.position.x)**2 +
+        (player.body.position.y - opponent.body.position.y)**2
+    )
+
+    # Previous distance
+    prev_dist = np.sqrt(
+        (player.prev_x - opponent.body.position.x)**2 +
+        (player.prev_y - opponent.body.position.y)**2
+    )
+
+    # Reward for getting closer
+    reward = prev_dist - current_dist
+
+    return reward
+
+def taunt_reward(
+    env: WarehouseBrawl,
+) -> float:
+    """
+    Rewards player for being in the Taunt state.
+
+    Args:
+        env (WarehouseBrawl): The game environment.
+
+    Returns:
+        float: The computed reward.
+    """
+    # Get player object from the environment
+    player: Player = env.objects["player"]
+
+    # Reward if the player is in the Taunt state
+    reward = 1 if isinstance(player.state, TauntState) else 0.0
+
+    return reward * env.dt
 
 '''
 Add your dictionary of RewardFunctions here using RewTerms
@@ -572,7 +674,7 @@ def gen_reward_manager():
 my_agent = CustomAgent(sb3_class=PPO, extractor=MLPExtractor)
 
 # Start here if you want to train from scratch. e.g:
-#my_agent = RecurrentPPOAgent()
+my_agent = RecurrentPPOAgent()
 
 # Start here if you want to train from a specific timestep. e.g:
 #my_agent = RecurrentPPOAgent(file_path='checkpoints/experiment_3/rl_model_120006_steps.zip')
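
Note on `user/grid_search_optuna.py`: the new file above only defines `objective()`; nothing in the diff creates or runs a study. A minimal driver sketch is below, using Optuna's standard `create_study`/`optimize` API — the `n_trials` value, the import path, and the output CSV name are illustrative assumptions, not taken from the repo:

```python
# Hypothetical driver for user/grid_search_optuna.py (not part of the diff above).
import optuna

from grid_search_optuna import objective  # assumes the module layout shown above

if __name__ == "__main__":
    # Maximize the mean episode reward returned by objective()
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)  # n_trials is a placeholder budget

    print("Best trial:", study.best_trial.number)
    print("Best params:", study.best_params)

    # Persist all trial results for later inspection
    study.trials_dataframe().to_csv("optuna_results.csv", index=False)
```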