Request to add Debug mode

Currently, the eval script only supports running a benchmark as a whole. It would be nice to have a `debug` mode that evaluates a single example, or similarly a `num_example` mode that evaluates on a subset of a benchmark.
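
To make the request concrete, here is a rough sketch of what I have in mind. It assumes the eval script is written in Python and loads a benchmark into a list-like object. The flag names `--debug` and `--num_example` and the helpers `build_parser`, `select_examples`, and `main` are all hypothetical, not existing parts of the script.

```python
import argparse
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags; the real eval script may name or wire these differently.
    parser = argparse.ArgumentParser(description="Sketch of an eval entry point")
    parser.add_argument("--debug", action="store_true",
                        help="evaluate a single example only")
    parser.add_argument("--num_example", type=int, default=None,
                        help="evaluate only the first N examples")
    return parser


def select_examples(examples: Sequence[T],
                    debug: bool = False,
                    num_example: Optional[int] = None) -> List[T]:
    """Return the subset of examples to evaluate.

    debug takes precedence and keeps only the first example;
    num_example keeps the first N; otherwise everything is kept.
    """
    if debug:
        return list(examples[:1])
    if num_example is not None:
        if num_example < 1:
            raise ValueError("num_example must be a positive integer")
        return list(examples[:num_example])
    return list(examples)


def main(argv: Optional[Sequence[str]] = None) -> None:
    args = build_parser().parse_args(argv)
    # Stand-in data; a real script would load the benchmark here.
    benchmark = [f"example_{i}" for i in range(10)]
    subset = select_examples(benchmark, args.debug, args.num_example)
    print(f"Evaluating {len(subset)} of {len(benchmark)} examples")
```

With something like this, `main(["--debug"])` would evaluate one example and `main(["--num_example", "3"])` would evaluate the first three. Taking the first N keeps runs deterministic; random sampling with a seed would be a reasonable alternative.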