Add Nonogram puzzle environment #398
Background
Nonogram is a logic puzzle: you are given a grid whose cells must be colored black or white, together with "hints" for each row and column that indicate the lengths of the runs of filled cells.
For example, a clue of "4 8 3" would mean there are sets of four, eight, and three filled squares, in that order, with at least one blank square between successive sets.
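To make the clue semantics concrete, here is a small hypothetical Python sketch (not part of this PR) that checks a single row against its clue by comparing run lengths:

```python
from itertools import groupby

def run_lengths(row):
    """Lengths of consecutive runs of filled (black) cells in a row.

    `row` is a sequence of 0/1 values, 1 meaning a filled cell.
    """
    return [sum(group) for value, group in groupby(row) if value == 1]

def row_satisfies_clue(row, clue):
    """True if the filled runs in `row` match the clue exactly."""
    return run_lengths(row) == list(clue)

# A clue of [4, 8, 3] is satisfied by e.g. 4 filled, a gap, 8 filled, a gap, 3 filled.
row = [1]*4 + [0] + [1]*8 + [0]*2 + [1]*3
assert row_satisfies_clue(row, [4, 8, 3])
```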

This game is mathematically hard (it is NP-hard, meaning we don't expect an algorithm that solves every instance efficiently, though good heuristics exist).
The Wikipedia article describes many nice solving techniques.
Environment
There are two types of actions: coloring a cell white and coloring it black. I also experimented with allowing cells to be toggled on and off, but the agent kept getting stuck in loops, and I wanted to force it to see new boards.
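For illustration, one way such an action space could be encoded is as a flat discrete space over (cell, color) pairs; the names and encoding below are hypothetical, not necessarily what this PR does:

```python
# Hypothetical action encoding: one integer per (cell, color) pair.
# This is an illustrative sketch, not the encoding used in the PR.
WHITE, BLACK = 0, 1

def encode_action(row, col, color, width):
    """Flatten (row, col, color) into a single discrete action id."""
    return (row * width + col) * 2 + color

def decode_action(action, width):
    """Inverse of encode_action."""
    cell, color = divmod(action, 2)
    row, col = divmod(cell, width)
    return row, col, color

# For a 5x5 board there are 2 * 5 * 5 = 50 possible actions.
assert decode_action(encode_action(3, 2, BLACK, width=5), width=5) == (3, 2, BLACK)
```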
I implemented two modes of the environment:
- `easy_learn` (in practice it is hard for humans to play), in which a wrong move leads to instant death. There is some ambiguity for randomly generated boards, but I think the signal is strong regardless.
- The other mode uses a couple of heuristics to check whether the board is correct or incorrect. This mode is harder to learn in practice.
I also give negative rewards for invalid moves, and positive rewards when the player completes a row, a column, or the whole board (with a larger reward for completing the board).
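Putting the pieces above together, a step function for the instant-death mode with this reward shaping could look roughly like the sketch below; the constants, cell values, and helper names are hypothetical and only illustrate the described behavior:

```python
# Hypothetical cell values and reward constants, chosen only for illustration.
EMPTY, WHITE, BLACK = 0, 1, 2
INVALID_MOVE_PENALTY = -0.1
LINE_COMPLETE_REWARD = 0.5
BOARD_COMPLETE_REWARD = 5.0

def step_easy_learn(board, solution, row, col, color):
    """One transition of the 'easy_learn' mode: a wrong move ends the episode.

    `board` is a list of lists holding EMPTY/WHITE/BLACK cells, `solution`
    is the target coloring with the same shape, and `color` is WHITE or BLACK.
    Returns (reward, done). A sketch of the described behavior, not the
    PR's implementation.
    """
    if board[row][col] != EMPTY:
        # Recoloring an already-colored cell counts as an invalid move.
        return INVALID_MOVE_PENALTY, False
    if solution[row][col] != color:
        # Instant death: any wrong coloring terminates the episode.
        return INVALID_MOVE_PENALTY, True

    board[row][col] = color
    reward = 0.0
    if all(cell != EMPTY for cell in board[row]):
        reward += LINE_COMPLETE_REWARD               # completed a row
    if all(r[col] != EMPTY for r in board):
        reward += LINE_COMPLETE_REWARD               # completed a column
    if all(cell != EMPTY for r in board for cell in r):
        return reward + BOARD_COMPLETE_REWARD, True  # completed the board
    return reward, False
```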
Neural network
I tried many different architectures. In the end, I settled on encoding the board with a CNN, encoding the row and column clues separately with dense layers, and also encoding the board size.
The grid has 4 types of cells (padding, empty, black, white), and clue values theoretically range from 0 to 8 (so 9 types).
In practice I fix the maximum board size at 8.
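A rough PyTorch sketch of that kind of encoder, assuming an 8x8 padded board, 4 cell types, clue values 0-8, and hypothetical layer sizes and clue counts (the PR's actual architecture may differ):

```python
import torch
import torch.nn as nn

MAX_SIZE = 8         # boards are padded to 8x8
NUM_CELL_TYPES = 4   # padding, empty, black, white
NUM_CLUE_VALUES = 9  # clue values 0..8
MAX_CLUES = 4        # hypothetical maximum number of clues per line

class NonogramEncoder(nn.Module):
    """Sketch: CNN over the board, dense encoders for clues and board size."""

    def __init__(self, hidden=128):
        super().__init__()
        self.cell_embed = nn.Embedding(NUM_CELL_TYPES, 8)
        self.board_cnn = nn.Sequential(
            nn.Conv2d(8, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        clue_dim = MAX_SIZE * MAX_CLUES * NUM_CLUE_VALUES
        self.row_clues = nn.Sequential(nn.Linear(clue_dim, hidden), nn.ReLU())
        self.col_clues = nn.Sequential(nn.Linear(clue_dim, hidden), nn.ReLU())
        self.size_embed = nn.Embedding(MAX_SIZE + 1, 16)
        self.head = nn.Linear(32 * MAX_SIZE * MAX_SIZE + 2 * hidden + 16, hidden)

    def forward(self, board, row_clues, col_clues, size):
        # board: (B, 8, 8) integer cell ids; clues: (B, 8, MAX_CLUES) integer clue values
        b = self.cell_embed(board).permute(0, 3, 1, 2)   # (B, H, W, C) -> (B, C, H, W)
        b = self.board_cnn(b)
        r = self.row_clues(nn.functional.one_hot(row_clues, NUM_CLUE_VALUES).float().flatten(1))
        c = self.col_clues(nn.functional.one_hot(col_clues, NUM_CLUE_VALUES).float().flatten(1))
        s = self.size_embed(size)                        # (B, 16)
        return torch.relu(self.head(torch.cat([b, r, c, s], dim=-1)))
```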
Curriculum Learning
I use two types of randomization.
I evaluated my best policy on different board sizes.
It solves all sizes up to 5x5, and about 70% of 6x6 boards.
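A minimal sketch of how such an evaluation could be run, using a placeholder environment/policy API rather than the PR's actual interfaces:

```python
# Hypothetical evaluation loop over board sizes (all names are placeholders).
def solve_rate(policy, make_env, size, episodes=100):
    """Fraction of randomly generated size x size boards the policy solves."""
    solved = 0
    for _ in range(episodes):
        env = make_env(size)
        obs, done, won = env.reset(), False, False
        while not done:
            obs, reward, done, info = env.step(policy(obs))
            won = info.get("board_complete", False)
        solved += won
    return solved / episodes

# e.g. solve_rate(policy, make_env, size=6) would be about 0.7 for the policy reported above.
```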