Is your feature request related to a problem? Please describe.
The current router reason bench uses MMLU-Pro for model eval. Since the classifier is also trained on the same dataset, the eval data overlaps with the training data, so it is more reasonable to eval the router's classification accuracy and reasoning setting on other datasets.
Describe the solution you'd like
Build a dataset factory and support datasets like GPQA, BIG-bench, etc.
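
A rough sketch of what the factory could look like, to make the request concrete. Every name here (`Question`, `BenchmarkDataset`, `register`, `create_dataset`, `ToyDataset`) is hypothetical and not part of the existing bench code. The toy dataset is in-memory only so the example runs without downloads; real GPQA or BIG-bench loaders would need their own data access and are not shown.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Type


@dataclass
class Question:
    """One benchmark item in a dataset-agnostic shape (hypothetical schema)."""
    question: str
    options: List[str]
    answer: str
    category: str


class BenchmarkDataset(ABC):
    """Common interface the bench would call, regardless of the source dataset."""

    @abstractmethod
    def load(self, limit: Optional[int] = None) -> List[Question]:
        """Return up to `limit` questions (all of them if limit is None)."""


# Maps a lowercase dataset name to its loader class.
_REGISTRY: Dict[str, Type[BenchmarkDataset]] = {}


def register(name: str) -> Callable[[Type[BenchmarkDataset]], Type[BenchmarkDataset]]:
    """Class decorator that adds a loader to the registry under `name`."""
    def wrap(cls: Type[BenchmarkDataset]) -> Type[BenchmarkDataset]:
        _REGISTRY[name.lower()] = cls
        return cls
    return wrap


def create_dataset(name: str) -> BenchmarkDataset:
    """Factory entry point: look up a loader by name and instantiate it."""
    try:
        return _REGISTRY[name.lower()]()
    except KeyError:
        known = ", ".join(sorted(_REGISTRY)) or "<none>"
        raise ValueError(f"Unknown dataset '{name}'. Registered: {known}")


@register("toy")
class ToyDataset(BenchmarkDataset):
    """In-memory stand-in; a real loader (e.g. for GPQA) would read actual data."""

    _ITEMS = [
        Question("2 + 2 = ?", ["3", "4"], "4", "math"),
        Question("H2O is commonly called?", ["water", "salt"], "water", "chemistry"),
    ]

    def load(self, limit: Optional[int] = None) -> List[Question]:
        return list(self._ITEMS[:limit])
```

With something like this, the bench could take a dataset name as an option and call `create_dataset(name).load(limit=...)`, so adding a new dataset only means registering one more loader class.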