Given how much both LLM training (via FSDP) and inference (often with vllm) are needed for RL/GRPO, I wonder if it's time to upstream some basic components / utils for okay-speed inference directly into PyTorch, especially as vllm gets ever more complicated...
The goal would be to run inference on FSDP-wrapped models immediately, without much weight conversion, or to use torchao to quantize the existing weights. It could also drive dynamic-shape testing for torch.compile / CUDA graphs...
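To make the first goal concrete, here is a minimal sketch of what "inference without weight conversion" already looks like today (the toy model, shapes, and single-rank process group below are assumptions for illustration, not a proposed API): FSDP's forward runs as-is, so a naive greedy-decode loop is all you need; what an upstreamed utility would add is the fast path (KV cache, batching, CUDA graphs). The torchao half would presumably be a `quantize_` call on the existing weights, though I'd hedge on the exact signature since torchao's API is still moving.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def naive_greedy_decode(model, input_ids, max_new_tokens=32, eos_id=None):
    """Greedy decoding by re-running the full prefix every step.

    No KV cache, no CUDA graphs: the 'okay-speed' baseline that already
    works on an FSDP-wrapped model, since FSDP forward runs as-is.
    """
    model.eval()
    ids = input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids)  # [batch, seq, vocab]
            next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
            if eos_id is not None and (next_id == eos_id).all():
                break
    return ids


if __name__ == "__main__":
    # Single-rank process group, just enough for FSDP to be constructed.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=0, world_size=1)
    torch.cuda.set_device(0)

    # Toy stand-in for a causal LM: any module mapping token ids to
    # [batch, seq, vocab] logits works the same way here.
    vocab, dim = 1000, 64
    model = torch.nn.Sequential(
        torch.nn.Embedding(vocab, dim),
        torch.nn.Linear(dim, vocab),
    ).cuda()
    model = FSDP(model)

    prompt = torch.randint(0, vocab, (1, 8), device="cuda")
    out = naive_greedy_decode(model, prompt, max_new_tokens=8)
    print(out.shape)  # torch.Size([1, 16])
    dist.destroy_process_group()
```

The same loop works unchanged across ranks; the single-rank setup above is only there to exercise the FSDP code path in isolation.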
It is a bit strange to need no special framework for training beyond FSDP, yet to need a full inference framework for basic inference. So maybe it's time to upstream some of the time-proven components from the inference engines...
This is already happening a bit with: