Custom bandit rewards and policy network layers

Thanks for the code. I have two questions.
1) How can I define custom bandits and their rewards, instead of using a predefined distribution?
2) The reward part looks like a policy network. Can we add a policy network with layers of neurons (an MLP) to the model? How exactly would that be done?
(Sorry, I'm quite new to this.)
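
To make both questions concrete, here is a rough sketch of what I have in mind. None of these names (`CustomBandit`, `MLPPolicy`, `reward_fns`) come from the posted code; they are hypothetical and only illustrate the idea. Each arm is an arbitrary reward function rather than a predefined distribution, and the policy is a small MLP with one hidden layer and a softmax over arms. Training is left out on purpose.

```python
import math
import random


class CustomBandit:
    """Hypothetical bandit whose arms are arbitrary reward functions."""

    def __init__(self, reward_fns, seed=0):
        self.reward_fns = reward_fns      # one callable per arm
        self.rng = random.Random(seed)

    @property
    def n_arms(self):
        return len(self.reward_fns)

    def pull(self, arm):
        # Each callable receives the RNG and returns a scalar reward.
        return self.reward_fns[arm](self.rng)


class MLPPolicy:
    """Hypothetical policy: one tanh hidden layer, then a softmax over arms."""

    def __init__(self, n_inputs, n_hidden, n_arms, seed=0):
        self.rng = random.Random(seed)
        g = self.rng.gauss
        # Small random weights, zero biases.
        self.w1 = [[g(0, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[g(0, 0.1) for _ in range(n_hidden)] for _ in range(n_arms)]
        self.b2 = [0.0] * n_arms

    def probs(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        logits = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(self.w2, self.b2)]
        # Subtract the max logit for a numerically stable softmax.
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self, x):
        p = self.probs(x)
        return self.rng.choices(range(len(p)), weights=p)[0]


bandit = CustomBandit([
    lambda rng: 1.0 if rng.random() < 0.3 else 0.0,  # pays 1 with probability 0.3
    lambda rng: rng.uniform(0.0, 2.0),               # uniform reward in [0, 2]
    lambda rng: 0.5,                                 # deterministic reward
])
policy = MLPPolicy(n_inputs=1, n_hidden=8, n_arms=bandit.n_arms)

arm = policy.sample([1.0])   # constant input, since a plain bandit has no state
reward = bandit.pull(arm)
```

Is something like this possible with the current model, and where would the update step (for example a policy-gradient update of the MLP weights) plug in?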