Commit 92abcb5 (1 parent: c33b71b)

Enhance project structure and documentation

- Move language selection to the top of the README for better visibility
- Add comprehensive API reference and contributing guidelines
- Create GitHub Actions CI workflow for automated testing
- Add CONTRIBUTING.md with development guidelines
- Include example.py for easy getting started
- Update .gitignore for better development file management
- Set up development dependencies and code quality tools

File tree: 5 files changed (+253, −3 lines)

.github/workflows/ci.yml (new file: 63 additions, 0 deletions)

```yaml
name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.8, 3.9, "3.10", "3.11"]

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .[dev]

      - name: Lint with flake8
        run: |
          # Stop the build if there are Python syntax errors or undefined names
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
          # Exit-zero treats all errors as warnings
          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=88 --statistics

      - name: Check code formatting with black
        run: |
          black --check .

      - name: Type checking with mypy
        run: |
          mypy . --ignore-missing-imports

      - name: Test installation
        run: |
          python -c "from sb3_grpo import GRPO; print('Import successful!')"

      - name: Run basic functionality test
        run: |
          python -c "
          import gymnasium as gym
          import torch
          from stable_baselines3.common.vec_env import DummyVecEnv
          from sb3_grpo import GRPO

          def simple_reward(state, action, next_state):
              return torch.ones(state.shape[0], 1)

          env = DummyVecEnv([lambda: gym.make('CartPole-v1')])
          agent = GRPO('MlpPolicy', env, reward_function=simple_reward, verbose=0)
          agent.learn(total_timesteps=100)
          print('Basic functionality test passed!')
          "
```

.gitignore (5 additions, 0 deletions)

```diff
@@ -3,6 +3,11 @@ grpo_アクションサンプル用メモリサイズ.md
 grpoの流れ.md
 README_ja.md
 
+# Example outputs and temporary files
+example_outputs/
+temp/
+*.tmp
+
 # Python
 __pycache__/
 *.py[cod]
```

CONTRIBUTING.md (new file: 68 additions, 0 deletions)

````markdown
# Contributing to SB3-GRPO

Thank you for your interest in contributing to SB3-GRPO!

## Development Setup

1. Fork the repository
2. Clone your fork:
   ```bash
   git clone https://github.com/yourusername/sb3-grpo.git
   cd sb3-grpo
   ```
3. Install in development mode:
   ```bash
   pip install -e .[dev]
   ```

## Code Style

We use the following tools to maintain code quality:

- **Black**: Code formatting
- **Flake8**: Linting
- **MyPy**: Type checking

Run these before submitting:

```bash
black .
flake8 .
mypy . --ignore-missing-imports
```

## Testing

Make sure your changes don't break existing functionality:

```bash
python -c "from sb3_grpo import GRPO; print('Import test passed!')"
```

## Pull Request Process

1. Create a feature branch from `main`
2. Make your changes
3. Run the code quality tools
4. Test your changes
5. Submit a pull request with a clear description

## Reporting Issues

When reporting bugs, please include:

- Python version
- PyTorch version
- Stable Baselines3 version
- Complete error traceback
- Minimal reproduction example

## Feature Requests

We welcome feature requests! Please open an issue with:

- Clear description of the feature
- Use case examples
- Proposed API (if applicable)

Thank you for contributing!
````
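The Black/Flake8/MyPy checks described in CONTRIBUTING.md could also be run automatically on each commit via pre-commit. The following configuration is purely illustrative and not part of this commit; the pinned `rev` tags are assumptions and should be updated to current releases:

```yaml
# .pre-commit-config.yaml (hypothetical; mirrors the tools listed in CONTRIBUTING.md)
repos:
  - repo: https://github.com/psf/black
    rev: 24.3.0          # assumed tag; pin to a current release
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8
    rev: 7.0.0           # assumed tag
    hooks:
      - id: flake8
        args: ["--max-line-length=88"]   # matches the CI flake8 settings
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.9.0          # assumed tag
    hooks:
      - id: mypy
        args: ["--ignore-missing-imports"]
```

After `pip install pre-commit && pre-commit install`, the hooks run on staged files before each commit, catching formatting and lint issues earlier than CI.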

README.md (48 additions, 3 deletions)

````diff
@@ -5,6 +5,13 @@
 
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
+## Language Versions / 言語選択
+
+- **English**: [README.md](README.md) (this file)
+- **日本語**: [README_ja.md](README_ja.md)
+
+---
+
 `sb3-grpo` is a [Stable Baselines3](https://github.com/DLR-RM/stable-baselines3) (SB3) compatible implementation of **Group Relative Policy Optimization (GRPO)**.
 
 This algorithm can be used as a drop-in replacement for standard PPO, providing stable learning especially in environments where rewards can be densely defined for states and actions.
@@ -159,10 +166,48 @@ python example.py
 
 As training progresses, standard SB3 logs will be displayed. If the agent can maintain CartPole upright for extended periods after training, it's successful.
 
-## Language Versions
+## API Reference
 
-- **English**: [README.md](README.md) (this file)
-- **日本語**: [README_ja.md](README_ja.md)
+### GRPO Class
+
+```python
+class GRPO(PPO):
+    """
+    Group Relative Policy Optimization (GRPO) implementation extending PPO.
+
+    Args:
+        policy: The policy model to use (MlpPolicy, CnnPolicy, ...)
+        env: The environment to learn from
+        reward_function: Function to calculate rewards from (state, action, next_state)
+        **kwargs: Other standard PPO arguments (learning_rate, n_steps, etc.)
+    """
+```
+
+### Reward Function Interface
+
+Your reward function must follow this signature:
+
+```python
+def your_reward_function(
+    state: torch.Tensor,       # Current state [batch_size, state_dim]
+    action: torch.Tensor,      # Action taken [batch_size, 1]
+    next_state: torch.Tensor,  # Resulting state [batch_size, state_dim]
+) -> torch.Tensor:             # Returns: rewards [batch_size, 1]
+    # Your reward calculation logic here
+    return rewards
+```
+
+## Contributing
+
+Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
+
+### Development Setup
+
+```bash
+git clone https://github.com/kechirojp/sb3-grpo.git
+cd sb3-grpo
+pip install -e .[dev]  # Install with development dependencies
+```
 
 ## License
````
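As a quick sanity check of the reward-function interface documented in the README, the sketch below builds a batch of fake CartPole observations and verifies the `[batch_size, 1]` output contract. The function name and reward constants are illustrative (loosely mirroring the commit's example.py), not part of the library API:

```python
import torch


def angle_position_reward(state: torch.Tensor,
                          action: torch.Tensor,
                          next_state: torch.Tensor) -> torch.Tensor:
    """Toy CartPole-style reward: prefer a small pole angle and a centered cart."""
    # next_state layout (CartPole): [cart_pos, cart_vel, pole_angle, pole_vel]
    cart_pos = next_state[:, 0]
    pole_angle = next_state[:, 2]
    reward = 1.0 - torch.abs(pole_angle) - 0.1 * torch.abs(cart_pos)
    return reward.unsqueeze(-1)  # shape [batch_size, 1], per the interface


# Fake batch of 8 CartPole observations (state_dim = 4), all zeros
batch = torch.zeros(8, 4)
r = angle_position_reward(batch, torch.zeros(8, 1), batch)
assert r.shape == (8, 1)  # matches the documented return shape
```

Keeping the trailing `unsqueeze(-1)` is the easy part to forget: indexing with `[:, 2]` drops a dimension, and GRPO expects a `[batch_size, 1]` reward tensor, not `[batch_size]`.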

example.py (new file: 69 additions, 0 deletions)

```python
# example.py

import gymnasium as gym
import torch
from stable_baselines3.common.vec_env import DummyVecEnv

# Import GRPO from the package
from sb3_grpo import GRPO


# --- 1. Define reward function for GRPO ---
# The core of GRPO is the ability to inject custom reward functions.
# Here we define a function that evaluates how "good" the next state is.
def cartpole_reward_fn(state: torch.Tensor, action: torch.Tensor, next_state: torch.Tensor) -> torch.Tensor:
    """
    Reward function for the CartPole environment.
    Evaluates how "good" the next_state is:
    - Higher reward for a pole angle closer to vertical
    - Higher reward for a cart position closer to center
    """
    # next_state contents: [cart_pos, cart_vel, pole_angle, pole_vel]
    cart_pos = next_state[:, 0]
    pole_angle = next_state[:, 2]

    # Reward is higher when angle and position are closer to 0
    reward = 1.0 - torch.abs(pole_angle) - 0.1 * torch.abs(cart_pos)

    return reward.unsqueeze(-1)


# --- 2. Environment setup ---
# Standard Stable Baselines3 environment preparation
env = gym.make("CartPole-v1")
env = DummyVecEnv([lambda: env])


# --- 3. Create GRPO agent ---
# Usage is almost identical to PPO instantiation.
agent = GRPO(
    "MlpPolicy",
    env,
    reward_function=cartpole_reward_fn,  # Inject reward function here
    n_steps=256,
    batch_size=64,
    n_epochs=10,
    learning_rate=3e-4,
    verbose=1,
)

# --- 4. Training ---
# Just call the `learn` method like standard SB3 PPO
print("--- Starting GRPO Training ---")
agent.learn(total_timesteps=20000)
print("--- Training Finished ---")


# --- 5. Evaluate trained agent ---
print("\n--- Evaluating Trained Agent ---")
eval_env = gym.make("CartPole-v1")
obs, _ = eval_env.reset()
total_reward = 0
for _ in range(1000):
    action, _ = agent.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = eval_env.step(action)
    total_reward += reward
    if terminated or truncated:
        print(f"Episode finished with total reward: {total_reward}")
        total_reward = 0
        obs, _ = eval_env.reset()
eval_env.close()
```