With the generator sped up by vLLM's `support_torch_compile`, the trainer is now the bottleneck. Let's enable `torch.compile` and CUDA graphs there to get similar speedups.
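
As a rough illustration of the idea (not the actual trainer code), here is a minimal sketch of wrapping a model with `torch.compile` for training. The toy model, the `run_training` helper, and the hyperparameters are hypothetical. On a GPU, `mode="reduce-overhead"` asks the compiler to use CUDA graphs where it can; the CPU fallback to `backend="eager"` is only an assumption made here so the sketch runs without a GPU or a C++ toolchain.

```python
import torch
import torch.nn as nn


def run_training(steps: int = 5) -> list[float]:
    """Train a toy linear model for a few steps and return the per-step losses."""
    torch.manual_seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Hypothetical stand-in for the real trainer model.
    model = nn.Linear(8, 4).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # On GPU, "reduce-overhead" enables CUDA graphs where possible.
    # On CPU, fall back to the "eager" backend (assumption: keeps the sketch portable).
    compile_kwargs = {"mode": "reduce-overhead"} if device == "cuda" else {"backend": "eager"}
    compiled_model = torch.compile(model, **compile_kwargs)

    # Fixed-shape inputs: static shapes are what make CUDA graph replay possible.
    inputs = torch.randn(32, 8, device=device)
    targets = torch.randn(32, 4, device=device)

    losses = []
    for _ in range(steps):
        optimizer.zero_grad(set_to_none=True)
        loss = nn.functional.mse_loss(compiled_model(inputs), targets)
        loss.backward()
        optimizer.step()
        # .item() copies the value out before the next iteration runs.
        losses.append(loss.item())
    return losses


if __name__ == "__main__":
    print(run_training())
```

The compiled module shares parameters with the original `model`, so the optimizer can keep referencing `model.parameters()`. Whether CUDA graphs actually apply to a real trainer depends on static input shapes and on the ops involved, so the speedup should be measured rather than assumed.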