Given how much both LLM training (via FSDP) and inference (often with vllm) are needed for RL/GRPO, I wonder if it's time to upstream some basic components / utils for okay-speed inference directly into PyTorch, especially as vllm gets ever more complicated...
The goal would be to run inference on FSDP-wrapped models immediately, without much weight conversion, or to use torchao to quantize the existing weights. It could also drive dynamic-shape testing for torch.compile / CUDA graphs...
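To make the first goal concrete, here is a minimal sketch of what "inference without weight conversion" already looks like today (the toy model, shapes, and single-rank process group below are assumptions for illustration, not a proposed API): FSDP's forward runs as-is, so a naive greedy-decode loop is all you need; what an upstreamed utility would add is the fast path (KV cache, batching, CUDA graphs). The torchao half would presumably be a `quantize_` call on the existing weights, though I'd hedge on the exact signature since torchao's API is still moving.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def naive_greedy_decode(model, input_ids, max_new_tokens=32, eos_id=None):
    """Greedy decoding by re-running the full prefix every step.

    No KV cache, no CUDA graphs: the 'okay-speed' baseline that already
    works on an FSDP-wrapped model, since FSDP forward runs as-is.
    """
    model.eval()
    ids = input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids)  # [batch, seq, vocab]
            next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
            if eos_id is not None and (next_id == eos_id).all():
                break
    return ids


if __name__ == "__main__":
    # Single-rank process group, just enough for FSDP to be constructed.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=0, world_size=1)
    torch.cuda.set_device(0)

    # Toy stand-in for a causal LM: any module mapping token ids to
    # [batch, seq, vocab] logits works the same way here.
    vocab, dim = 1000, 64
    model = torch.nn.Sequential(
        torch.nn.Embedding(vocab, dim),
        torch.nn.Linear(dim, vocab),
    ).cuda()
    model = FSDP(model)

    prompt = torch.randint(0, vocab, (1, 8), device="cuda")
    out = naive_greedy_decode(model, prompt, max_new_tokens=8)
    print(out.shape)  # torch.Size([1, 16])
    dist.destroy_process_group()
```

The same loop works unchanged across ranks; the single-rank setup above is only there to exercise the FSDP code path in isolation.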
It is a bit strange to need no special framework for training beyond FSDP, yet to need a full inference framework for basic inference. So maybe it's time to upstream some of the time-proven components from the inference engines...
This is already happening a bit with: