Evaluation details

Hello, could you add some details (the code and eval benchmarks used) about the evaluation on the datasets listed below?
1) HumanEval
2) MT-Bench
3) LiveCodeBench

The paper states that for MT-Bench you used Qwen 2.5 as the judge and gives some details about the samples, but could you add more details about the methods, eval frameworks, etc. that you used?