Thanks for the code. I have two questions.
- How can I define custom bandits and their rewards, instead of using a predefined distribution?
- The reward part looks like a policy network. Can we add a policy network with layers of neurons (an MLP) to the model? How exactly would that be done?
(Sorry, I'm quite new to this.)