Important note: This project is still a work in progress and does not yet contain an optimal RL implementation. It may also contain bugs in the game mechanics!
An advanced implementation of a reinforcement learning agent for the card game Skyjo using PPO (Proximal Policy Optimization) with action masking and sophisticated state representation.
This project implements a complete Skyjo game environment with multiple agent types, focusing on a reinforcement learning approach using the PPO algorithm. The implementation features a custom Gymnasium (OpenAI Gym API) environment, action masking to enforce valid moves, and a detailed state representation.
- Core game logic implementation
- Manages game state transitions
- Handles player actions and card operations
- Implements state tracking and validation
The RL environment extends the Gymnasium (OpenAI Gym) Env interface and implements:
- Field representation: 12x18 one-hot encoded matrix
  - Encodes card values (-2 to 12)
  - Special symbols (♦ for stars)
  - Hidden/visible card states
- Last action encoding: 5-dimensional one-hot vector
Total observation space:
Box(low=0, high=1, shape=(12 * 18 + 5,), dtype=float32)
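The exact channel layout of the 18-dimensional per-card encoding is defined in the environment code; the sketch below is only an illustration, assuming 15 channels for the values -2 to 12 plus flags for star, hidden, and removed cards (the removed flag and the channel indices are assumptions), with a hypothetical helper `encode_card`.

```python
import numpy as np

N_CHANNELS = 18  # assumed layout: 15 value channels + star + hidden + removed

def encode_card(value=None, is_star=False, hidden=False, removed=False):
    """One-hot encode a single field position into an 18-dim vector (illustrative)."""
    vec = np.zeros(N_CHANNELS, dtype=np.float32)
    if hidden:
        vec[16] = 1.0           # hidden card (assumed channel index)
    elif removed:
        vec[17] = 1.0           # removed/empty slot (assumed channel index)
    elif is_star:
        vec[15] = 1.0           # star symbol (assumed channel index)
    else:
        vec[value + 2] = 1.0    # card values -2..12 map to indices 0..14
    return vec

# Full observation: 12 field positions plus a 5-dim last-action one-hot.
field = np.concatenate([encode_card(hidden=True) for _ in range(12)])
last_action = np.zeros(5, dtype=np.float32)
observation = np.concatenate([field, last_action])
assert observation.shape == (12 * 18 + 5,)
```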
- 16 discrete actions combining:
  - Card positions (0-11)
  - Game actions (pull, discard, change)
- Action masking ensures that only valid moves can be selected (see the sketch below)
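The concrete index-to-action mapping lives in the environment code; the following sketch only illustrates how a 16-entry boolean mask restricts the discrete action space to legal moves (the example mask itself is hypothetical).

```python
import numpy as np

# Hypothetical mask: only the 12 position actions are legal in this state.
action_mask = np.array([True] * 12 + [False] * 4)

# Sampling only among legal indices is what masking guarantees during training:
# the probabilities of masked actions are forced to zero.
legal_indices = np.flatnonzero(action_mask)
action = int(np.random.choice(legal_indices))
assert action_mask[action]
```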
Multiple reward components:
- Point-based rewards
if current_points[self.rl_name] < self.points_in_running[-2]:
    reward += 1
if current_points[self.rl_name] < min(self.points_in_running):
    reward += 2
- Normalized score rewards
reward += (1 - self.theoretical_normalize(current_points[self.rl_name])) * 2
- Line completion rewards (see the sketch after this list)
  - Additional rewards for completing rows/columns
  - Rewards for matching card combinations
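A hedged sketch of what a line-completion check might look like, assuming the 12-card field is stored as a 3x4 value array with a visibility mask; `count_completed_columns` is a hypothetical helper, not the repository's GameField logic, and the reward weighting at the end is illustrative.

```python
import numpy as np

def count_completed_columns(grid, visible):
    """Count columns whose three visible cards share the same value (illustrative)."""
    completed = 0
    for col in range(grid.shape[1]):
        if visible[:, col].all() and len(set(grid[:, col])) == 1:
            completed += 1
    return completed

grid = np.array([[5, 5, -2, 7],
                 [5, 1,  0, 7],
                 [5, 9,  3, 7]])
visible = np.ones_like(grid, dtype=bool)
bonus = count_completed_columns(grid, visible)  # columns 0 and 3 -> 2
# e.g. reward += bonus * 1.0  (the actual weighting is project-specific)
```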
- Implements PPO with action masking
- Uses a neural network (MLP) policy
- Maintains game state history
- Rule-based decision making
- Card value thresholds
- Position-based strategies
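For illustration only, here is a minimal threshold rule of the kind such an agent might use; `decide_on_drawn_card` and its threshold are assumptions, not the repository's exact policy.

```python
def decide_on_drawn_card(drawn_value, grid_values, threshold=4):
    """Illustrative threshold rule: keep a drawn card if it is low enough
    or improves the worst visible position, otherwise discard it."""
    worst_position = max(range(len(grid_values)), key=lambda i: grid_values[i])
    if drawn_value <= threshold or drawn_value < grid_values[worst_position]:
        return ("swap", worst_position)
    return ("discard", None)

print(decide_on_drawn_card(2, [9, 4, 12, 7, 0, 3, 8, 5, 11, 6, 1, 10]))
# ('swap', 2) - replace the 12 with the drawn 2
```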
- Baseline implementation
- Random action selection
- Used for comparison
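A minimal random-rollout sketch for such a baseline, assuming the environment follows the Gymnasium reset/step API and exposes an action_masks() method (as in the masking code later in this README); `rollout_random` is a hypothetical helper.

```python
import numpy as np

def rollout_random(env):
    """Play one episode with uniformly random legal actions (illustrative)."""
    obs, info = env.reset()
    total, done = 0.0, False
    while not done:
        mask = env.action_masks()
        action = int(np.random.choice(np.flatnonzero(mask)))
        obs, reward, terminated, truncated, info = env.step(action)
        total += reward
        done = terminated or truncated
    return total
```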
- Manages card distribution
- Card value mappings
- Special card handling (stars)
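For reference, the standard Skyjo deck composition is 150 cards; the repository's deck, which also includes star cards, may differ from these counts.

```python
from collections import Counter

def build_standard_deck():
    """Standard Skyjo deck: 5x -2, 10x -1, 15x 0, and 10 of each value 1-12."""
    deck = [-2] * 5 + [-1] * 10 + [0] * 15
    for value in range(1, 13):
        deck += [value] * 10
    return deck

deck = build_standard_deck()
assert len(deck) == 150
print(Counter(deck))
```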
- Grid-based card layout
- Line completion detection
- Score calculation
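A hedged sketch of a Skyjo-style score calculation (the sum of card values still on the field); `field_score` is a hypothetical helper, and the repository's GameField performs the authoritative calculation, including star handling.

```python
import numpy as np

def field_score(grid, removed_mask):
    """Sum of all card values still on the field; removed columns count as zero."""
    return int(np.where(removed_mask, 0, grid).sum())

grid = np.array([[5, -2, 7, 0],
                 [5,  1, 7, 3],
                 [5,  9, 7, 12]])
removed = np.zeros_like(grid, dtype=bool)
removed[:, 2] = True               # the completed column of 7s was removed
print(field_score(grid, removed))  # 15 + 8 + 0 + 15 = 38
```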
The state is represented as a flat vector combining the one-hot field matrix and the last-action encoding:
self.observation_space = gym.spaces.Box(
    low=0,
    high=1,
    shape=(12 * 18 + 5,),
    dtype=np.float32
)
Implemented using MaskablePPO from sb3-contrib (the Stable-Baselines3 contrib package):
def action_masks(self):
    legal_actions = self._legal_actions()
    action_mask = [False] * 16
    # Mask based on game state (abridged)
    if self.game_state == "running":
        if self.last_action == "pull deck":
            action_mask[12:14] = [True, True]
    return np.array(action_mask, dtype=bool)
from sb3_contrib import MaskablePPO

model = MaskablePPO(
    "MlpPolicy",
    rl_env,
    verbose=1,
    device='cuda',
    learning_rate=0.0001
)
model.learn(
    total_timesteps=100000,
    progress_bar=True,
    callback=logging_callback
)
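After training, the policy can be evaluated with explicit masks, for example via get_action_masks from sb3-contrib's maskable utilities; the loop below is a sketch assuming rl_env follows the Gymnasium API and exposes action_masks().

```python
from sb3_contrib.common.maskable.utils import get_action_masks

# Evaluation sketch: roll out the trained policy with explicit action masks.
obs, info = rl_env.reset()
done = False
while not done:
    masks = get_action_masks(rl_env)
    action, _ = model.predict(obs, action_masks=masks, deterministic=True)
    obs, reward, terminated, truncated, info = rl_env.step(action)
    done = terminated or truncated
```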
- Episode rewards tracking
- Wrong action counting
- Performance visualization
- Learning curves
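The LoggingCallback used in the usage example is defined in this repository; below is a minimal sketch of what such a callback could look like on top of Stable-Baselines3's BaseCallback (the tracked quantities here are illustrative, and a single training environment is assumed).

```python
from stable_baselines3.common.callbacks import BaseCallback

class LoggingCallback(BaseCallback):
    """Minimal episode-reward logger; the repository's own callback may also
    track wrong actions and other statistics."""

    def __init__(self, verbose=0):
        super().__init__(verbose)
        self.episode_rewards = []
        self._running_reward = 0.0

    def _on_step(self) -> bool:
        # self.locals holds the rollout variables of the training loop;
        # index 0 assumes a single (non-vectorized) environment.
        self._running_reward += float(self.locals["rewards"][0])
        if self.locals["dones"][0]:
            self.episode_rewards.append(self._running_reward)
            self._running_reward = 0.0
        return True  # returning False would abort training
```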
- Clone the repository
- Install dependencies:
pip install torch gymnasium stable-baselines3 sb3-contrib numpy matplotlib
import matplotlib.pyplot as plt
from sb3_contrib import MaskablePPO
# Carddeck, RLAgent, GameField, Environment, RLEnvironment and
# LoggingCallback come from this repository's modules

# Initialize environment
carddeck = Carddeck()
agent = RLAgent("RLBOT", carddeck, (4, 3))
gamefield = GameField(4, 3, [agent], carddeck)
env = Environment(gamefield)
rl_env = RLEnvironment(env)
# Training with monitoring
logging_callback = LoggingCallback()
model = MaskablePPO("MlpPolicy", rl_env, verbose=1, device='cuda')
model.learn(total_timesteps=100000, callback=logging_callback)
# Visualize results
plt.plot(rl_env.points, "-o", label="Points of agent")
plt.title("Points over time")
plt.grid()
plt.legend()
plt.show()
- Complete game state tracking
- Efficient state updates
- Memory-optimized representations
- Comprehensive move validation
- Legal action enforcement
- Game rule compliance
- Multi-component reward system
- Progress-based incentives
- Strategic play encouragement
- GPU acceleration support
- Efficient state representations
- Optimized action masking
- Advanced Features
  - Multi-agent training capabilities
  - Self-play implementation
  - Advanced reward shaping
- Optimizations
  - Enhanced state representation
  - Improved action space efficiency
  - Advanced policy architectures
Contributions are welcome! Please refer to the contribution guidelines for more information.
MIT License - see LICENSE file for details