[RL] Improve vLLM/Generator startup time with cudagraphs and support_torch_compile

As noted in the initial PR https://github.com/pytorch/torchtitan/pull/2486, startup with our vLLMWrapper takes considerably longer (4x) than with native vLLM. We should investigate where this startup time goes and how to reduce it.
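
One way to start the investigation is to time each startup phase separately, so the 4x gap can be attributed to a specific step. The sketch below is only a generic harness: `timed`, `report`, and the phase names are hypothetical and not part of torchtitan or vLLM, and the `time.sleep` calls are placeholders for the real startup steps (for example weight loading, compilation, and CUDA graph capture).

```python
import time
from contextlib import contextmanager

# Hypothetical helper, not part of torchtitan or vLLM.
# Maps phase name -> accumulated wall-clock seconds.
timings: dict[str, float] = {}


@contextmanager
def timed(phase: str):
    """Accumulate the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = timings.get(phase, 0.0) + time.perf_counter() - start


def report(phase_timings: dict[str, float]) -> list[str]:
    """Return one formatted line per phase, slowest phase first."""
    total = sum(phase_timings.values())
    lines = []
    for phase, secs in sorted(phase_timings.items(), key=lambda kv: -kv[1]):
        share = 100.0 * secs / total if total else 0.0
        lines.append(f"{phase:<20s} {secs:8.3f}s {share:5.1f}%")
    return lines


# Placeholder workloads; replace with the actual startup steps being measured.
with timed("load_weights"):
    time.sleep(0.01)
with timed("compile_and_capture"):
    time.sleep(0.02)

print("\n".join(report(timings)))
```

Running the same breakdown for both the vLLMWrapper path and the native vLLM path should show which phase accounts for most of the difference.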