Development of a reinforcement learning-based drone surveillance system for efficient area coverage and bandit detection in obstacle-rich environments.
Date: Early Development
Objective: Create foundational grid-based drone simulation
- 10x10 grid environment with sprites (ground, obstacle, drone, bandit)
- Multiple drones with random movement patterns
- Basic obstacle avoidance
- Coverage tracking with blue flags for visited areas
- Red flags for bandit detection
- Visual sensory field display (3x3 vision range)
- Problem: Random movement was inefficient
- Problem: No learning mechanism
- Problem: Drones often got stuck in local areas
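As a reference point for the later sessions, here is a minimal sketch of the kind of grid world and random-movement baseline described above. The class and helper names (GridWorld, step, in_bounds) and the obstacle/bandit counts are illustrative assumptions, not the project's actual code.

```python
import random

class GridWorld:
    """Illustrative 10x10 grid; obstacle and bandit counts are assumed values."""
    SIZE = 10
    ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0),
               "right": (1, 0), "stay": (0, 0)}

    def __init__(self, n_obstacles=15, n_bandits=3):
        cells = [(x, y) for x in range(self.SIZE) for y in range(self.SIZE)]
        random.shuffle(cells)
        self.obstacles = set(cells[:n_obstacles])
        self.bandits = set(cells[n_obstacles:n_obstacles + n_bandits])
        self.visited = set()    # cells marked with blue flags
        self.detected = set()   # bandits marked with red flags

    def in_bounds(self, pos):
        x, y = pos
        return 0 <= x < self.SIZE and 0 <= y < self.SIZE

    def step(self, pos, action):
        """Apply one move; basic obstacle avoidance keeps the drone in place."""
        dx, dy = self.ACTIONS[action]
        new_pos = (pos[0] + dx, pos[1] + dy)
        if not self.in_bounds(new_pos) or new_pos in self.obstacles:
            return pos
        self.visited.add(new_pos)
        if new_pos in self.bandits:
            self.detected.add(new_pos)
        return new_pos

# Random-movement baseline from this phase: every drone picks a random action.
world = GridWorld()
drones = [(0, 0), (9, 9)]
for _ in range(200):
    drones = [world.step(p, random.choice(list(world.ACTIONS))) for p in drones]
print(f"coverage: {len(world.visited)} cells, bandits detected: {len(world.detected)}")
```

Running this baseline a few times makes the limitations listed above easy to see: coverage plateaus quickly because random walks keep revisiting the same neighbourhood.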
Date: Development Session 1
Objective: Improve visual debugging and information display
- Added console panel below game grid
- Real-time statistics display (coverage, bandits detected, simulation status)
- Message logging system with scrolling capability
- Separation of game area from information display
- Better visual debugging
- Non-intrusive information display
- Enhanced user experience
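A rough sketch of how such a scrolling console buffer can be kept separate from the game grid; the line budget, header format, and class name are assumptions for illustration only.

```python
from collections import deque

class ConsolePanel:
    """Scrolling message log plus a one-line statistics header."""
    def __init__(self, max_lines=6):
        self.messages = deque(maxlen=max_lines)   # oldest lines scroll off automatically

    def log(self, text):
        self.messages.append(text)

    def render(self, coverage_pct, bandits_found, running):
        status = "running" if running else "stopped"
        header = f"coverage: {coverage_pct:.0f}% | bandits: {bandits_found} | {status}"
        return "\n".join([header, *self.messages])

panel = ConsolePanel()
panel.log("drone 1 detected bandit at (4, 7)")
print(panel.render(coverage_pct=32, bandits_found=1, running=True))
```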
Date: Development Session 2
Objective: Introduce reinforcement learning for intelligent drone behavior
- Tabular Q-learning with state = (x, y) position
- Action space: [Up, Down, Left, Right, Stay]
- Reward structure:
  - +1 for visiting new cells
  - +10 for detecting bandits
  - -10 for hitting obstacles
- Epsilon-greedy exploration with decay
- Problem: Drone repeatedly hit walls/obstacles
- Problem: Drone stuck in repetitive movement loops
- Problem: Insufficient exploration of right side of grid
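A minimal sketch of the tabular Q-learning setup listed above, with the state as the drone's (x, y) cell and a NumPy Q-table (the array representation is what Session 4 later replaces with a dictionary). The learning rate, discount factor, and epsilon-decay values here are placeholder assumptions.

```python
import numpy as np

N, N_ACTIONS = 10, 5                      # grid size; actions: up, down, left, right, stay
Q = np.zeros((N, N, N_ACTIONS))           # Q-table indexed by (x, y, action)
alpha, gamma = 0.1, 0.95                  # assumed learning rate and discount factor
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.99   # assumed epsilon-greedy schedule

def reward_for(outcome):
    """Reward structure at this stage: new cell +1, bandit +10, obstacle -10."""
    return {"new_cell": 1, "bandit": 10, "obstacle": -10}.get(outcome, 0)

def choose_action(state, eps):
    """Epsilon-greedy: explore with probability eps, otherwise act greedily."""
    if np.random.rand() < eps:
        return np.random.randint(N_ACTIONS)
    x, y = state
    return int(np.argmax(Q[x, y]))

def q_update(state, action, reward, next_state):
    """Standard one-step Q-learning update."""
    x, y = state
    nx, ny = next_state
    target = reward + gamma * np.max(Q[nx, ny])
    Q[x, y, action] += alpha * (target - Q[x, y, action])

# Example of one learning step; epsilon decays once per episode.
s, s_next = (0, 0), (0, 1)
a = choose_action(s, epsilon)
q_update(s, a, reward_for("new_cell"), s_next)
epsilon = max(eps_min, epsilon * eps_decay)
```

With only (x, y) as state, the agent cannot distinguish situations that differ only in their surroundings, which helps explain the looping and wall-hitting problems noted above.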
Date: Development Session 3
Objective: Address exploration and repetitive behavior issues
- Enhanced reward structure:
  - +2 reward for new cell visits (increased from +1)
  - -1 penalty for revisiting already-covered areas
  - Maintained +10 bonus for bandit detection
- Improved epsilon scheduling:
  - Initial epsilon: 1.0 (full exploration)
  - Final epsilon: 0.05
  - Decay rate: 0.995 per episode
- Vision-based coverage:
  - Drone covers entire 3x3 sensory field, not just current position
  - Reward based on newly visible cells in sensory range
- Reduced wall-hitting behavior
- Improved exploration patterns
- Still some repetitive behavior remained
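One plausible reading of the vision-based coverage reward above, sketched below: everything in the 3x3 sensory field counts as covered, new cells earn +2 each, a step that reveals nothing new takes the -1 revisit penalty, and bandits in view keep the +10 bonus. The helper names and the per-cell interpretation of the +2 reward are assumptions.

```python
def sensory_field(pos, size=10):
    """All in-bounds cells in the 3x3 neighbourhood around the drone."""
    x, y = pos
    return {(x + dx, y + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if 0 <= x + dx < size and 0 <= y + dy < size}

def coverage_reward(pos, visited, bandits):
    """+2 per newly seen cell, -1 if the field was already covered, +10 per bandit in view."""
    seen = sensory_field(pos)
    new_cells = seen - visited
    visited |= seen                          # the entire sensory field counts as covered
    reward = 2 * len(new_cells) if new_cells else -1
    return reward + 10 * len(seen & bandits)

# Epsilon schedule introduced in this session, applied once per episode:
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
epsilon = max(eps_min, epsilon * eps_decay)

visited, bandits = set(), {(4, 7)}
print(coverage_reward((4, 6), visited, bandits))   # 9 new cells (+18) and one bandit in view (+10) -> 28
```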
Date: Development Session 4
Objective: Solve persistent looping issues through better state encoding
- Rich state representation:
  - State = (x, y, tuple of 3x3 sensory field)
  - Q-table changed from NumPy array to dictionary
  - States include local environment context
- Optimistic initialization:
  - New states initialized with Q-values = 50 (optimistic)
  - Encourages exploration of unseen state-action pairs
- Legal action filtering:
  - Prevented selection of actions leading to obstacles
- Eliminated wall-hitting behavior completely
- Significant reduction in repetitive behavior
- More intelligent navigation around obstacles
- Better coverage patterns
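A sketch of how the three Session 4 changes fit together: a dictionary Q-table keyed by the rich (x, y, sensory field) state, optimistic defaults of 50 for unseen states, and action selection restricted to legal moves. The is_blocked callback and other helper names are illustrative assumptions.

```python
from collections import defaultdict
import numpy as np

ACTIONS = [(0, -1), (0, 1), (-1, 0), (1, 0), (0, 0)]   # up, down, left, right, stay
OPTIMISTIC_Q = 50.0

# Dictionary Q-table: unseen states start with optimistic values for every action,
# so untried state-action pairs look attractive and get explored.
Q = defaultdict(lambda: np.full(len(ACTIONS), OPTIMISTIC_Q))

def make_state(pos, field):
    """Rich state: grid position plus the flattened 3x3 sensory field contents."""
    return (pos[0], pos[1], tuple(field))

def legal_actions(pos, is_blocked):
    """Keep only actions that do not lead into an obstacle or off the grid."""
    return [i for i, (dx, dy) in enumerate(ACTIONS)
            if not is_blocked((pos[0] + dx, pos[1] + dy))]

def choose_action(state, pos, is_blocked, epsilon):
    """Epsilon-greedy restricted to legal actions, so walls are never hit."""
    allowed = legal_actions(pos, is_blocked)
    if np.random.rand() < epsilon:
        return int(np.random.choice(allowed))
    q = Q[state]
    return max(allowed, key=lambda a: q[a])

# Example: a drone at (2, 3) whose 3x3 field contains an obstacle to the east.
field = ("ground",) * 5 + ("obstacle",) + ("ground",) * 3
state = make_state((2, 3), field)
blocked = {(3, 3)}
def is_blocked(c):
    return c in blocked or not (0 <= c[0] < 10 and 0 <= c[1] < 10)
print(choose_action(state, (2, 3), is_blocked, 0.1))
```

Because the Q-table key includes the local field, positions with different surroundings no longer share values, which is what breaks the repetitive loops noted above.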
Date: Development Session 5
Objective: Improve training efficiency and visualization
- Fast headless training: No rendering during learning phase
- Selective visualization: Show only first 5, best 5, and last 5 episodes
- Performance tracking: Episode statistics storage and analysis
- Dramatically reduced training time
- Better insight into learning progression
- Maintained visual debugging capabilities
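A sketch of the training flow this session introduces: learn with rendering off, store per-episode statistics, then replay only the first 5, best 5, and last 5 episodes with rendering on. The run_episode stand-in and the seed-based replay are assumptions about how the real loop is organised.

```python
import random

def run_episode(seed, render=False):
    """Stand-in for the real episode loop; returns the episode's coverage."""
    rng = random.Random(seed)
    return rng.randint(20, 100)          # placeholder coverage percentage

def train(n_episodes=500):
    stats = []                                        # (episode, seed, coverage)
    for ep in range(n_episodes):
        seed = ep                                     # remember seeds so episodes can be replayed
        coverage = run_episode(seed, render=False)    # fast headless pass: no drawing
        stats.append((ep, seed, coverage))

    best = sorted(stats, key=lambda s: s[2], reverse=True)[:5]
    selected = sorted(set(stats[:5] + best + stats[-5:]))
    for ep, seed, _ in selected:                      # first 5, best 5, last 5 only
        run_episode(seed, render=True)
    return stats

episode_stats = train(50)
```

Skipping rendering during learning is where the speedup comes from; the stored per-episode statistics are what drives both the first/best/last selection and the later analysis of learning progression.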
Date: Current Session
Objective: Implement systematic solution for remaining stuck behaviors
- Stuck Detection Algorithm: