
Conversation

@wallashss (Collaborator) commented Aug 21, 2025

Description

This PR adds initial support for FP8 on continuous batching.

Changes

  • Included FP8 logic in spyre.py, which needs to set the scale on the weights for CB
  • [UPDATE] Added padding for bs=1
  • [EXTRA] Added decoding of the generated tokens in the scheduler test test_spyre_cb_scheduler_steps.py for easier debugging later

TODOs

  • Set a tolerance for the logprobs difference of quantized models during tests
  • Currently, the test matrix does not include tests/e2e/test_spyre_cb_scheduler_steps.py for FP8; we have to figure out a clean way to include it. Moreover, most of these tests are failing and need more thought before being activated.


👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

docs: improved docs

Signed-off-by: Wallas Santos <[email protected]>
Comment on lines 329 to 332
if self.model.model.dtype in [torch.float8_e4m3fn]:
mask = mask.to(torch.float32)
else:
mask = mask.to(self.model.model.dtype)
@prashantgupta24 (Collaborator) commented Aug 25, 2025:

Since we control the self.model.model.dtype (through the get_dtype function), can we not make sure that self.model.model.dtype is always what we want it to be?

@prashantgupta24 (Collaborator) commented Aug 25, 2025:

It would be good to keep such unintuitive code in one place (unintuitive because I would have expected fp8 to just work here, but it's not supposed to work that way).

@wallashss (Author) commented:

Which place? I didn't understand.

Can I put a TODO there and you fix it in your follow-up PR?

@prashantgupta24 (Collaborator) commented Aug 26, 2025:

I meant the get_dtype function in spyre.py - that's where we get the dtype that eventually gets plugged into self.model.model.dtype. I'm not 100% sure that's the correct way though; worth adding a TODO.

@prashantgupta24 (Collaborator) commented Aug 26, 2025:

So right now we have this for CB:

def get_dtype(self) -> torch.dtype:
    # Get the model's data type
    # This should be:
    # FP32 for un-quantized models on cpu
    # FP16 for un-quantized models on spyre
    # FP8 (float8_e4m3fn) for quantized models
    # (only fp8 quantization is supported)
    if self.model_config.quantization:
        return torch.float8_e4m3fn
    else:
        if envs_spyre.VLLM_SPYRE_DYNAMO_BACKEND in BACKEND_LIST:
            return torch.float16
        else:
            return torch.float32

If we don't want to work with torch.float8_e4m3fn (since you explicitly check for it in your if condition above), maybe we shouldn't return it inside the if self.model_config.quantization branch and instead return fp32 directly? That way we can get rid of the if condition you wrote above.
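
Something along these lines, as a rough sketch of the idea (untested, just to illustrate the suggestion):

def get_dtype(self) -> torch.dtype:
    if self.model_config.quantization:
        # weights are fp8-quantized, but runtime tensors such as the
        # mask would stay fp32 (sketch of the suggestion above)
        return torch.float32
    if envs_spyre.VLLM_SPYRE_DYNAMO_BACKEND in BACKEND_LIST:
        return torch.float16
    return torch.float32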

@wallashss (Author) commented:

Hmmm, I think I understood your point, but changing the behavior of this method would be misleading. I did a quick search and this method is only used to set the dtype of the model in the constructor, so changing it to something else would be wrong. The check I added is only there to prevent the mask tensor from using torch.float8_e4m3fn, a corner case I identified during development; elsewhere it should be fine to keep the real dtype.

Collaborator commented:

Actually yeah the other place where self.dtype is used is in the scale of past_key_value_states - it would be ugly if that needs fp8 and this needs fp32 :(

Collaborator commented:

I thought that for running on spyre we wanted to set the mask as fp16 though, not necessarily fp32?

Maybe we should scrap model.dtype completely, and instead specify the dtypes that we need for specific tensors. e.g.

def get_mask_dtype(self) -> torch.dtype:
    # fp16 when running on spyre, fp32 otherwise
    return (torch.float16
            if envs_spyre.VLLM_SPYRE_DYNAMO_BACKEND in BACKEND_LIST
            else torch.float32)

def get_kv_cache_dtype(self) -> torch.dtype:
    # fp8 kv cache only for fp8-quantized models
    return torch.float8_e4m3fn if self.is_fp8_model else torch.float16

@wallashss (Author) commented:

Sorry, but I missed the last comment from Joe.

I changed to something similar to his suggestion. Running the tests to check if everything's still alright.

BTW, this change breaks the cache.

@@ -430,25 +438,32 @@ def _set_past_key_value_states(self, num_blocks) -> None:
# TODO: This does not work yet. The scale needs to be handled, see:
# https://github.com/foundation-model-stack/aiu-fms-testing-utils/blob/v0.1.0rc3/aiu_fms_testing_utils/utils/paged.py#L306-L319
from fms_mo.aiu_addons.fp8.fp8_utils import ScaledTensor
batch_size = max(2, self.scheduler_config.max_num_seqs)
Collaborator commented:

Is FP8 only supported with batch size >= 2?

@wallashss (Author) commented:

At least that batch size, yes. Currently I found that I had to set bs=2, but maybe it's not necessary for now. I'll revert it.

Collaborator commented:

Then maybe we should update the scheduler_config in Platform.check_and_update_config.

@wallashss (Author) commented Sep 2, 2025:

IMHO, the intention here is to fix/work around the compiler limitation, whereas in platform we change the behavior of the system as a whole. There's something similar there already:

# min value 2 needed for VLLM_DT_MAX_BATCH_SIZE (compiler constraint)
# Note that we can still have decodes of batch size 1 as the env var
# only concerns the max batch size.
os.environ["VLLM_DT_MAX_BATCH_SIZE"] = str(
    max(vllm_config.scheduler_config.max_num_seqs, 2))

But setting this env var does NOT change the original setup of vLLM.

Collaborator commented:

Either way is probably fine, but does everything still work if the user sets --max-num-seqs 1? If not, then I'd prefer overriding it in platform.py.
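
For example, something along these lines in Platform.check_and_update_config (sketch only, untested; the exact config fields are assumed from the snippet above):

if vllm_config.model_config.quantization:
    # the fp8 path seems to need a minimum batch size of 2
    # (compiler constraint), so enforce it here rather than
    # inside the model runner
    vllm_config.scheduler_config.max_num_seqs = max(
        2, vllm_config.scheduler_config.max_num_seqs)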

test: minor improvement on tolerance for quantized models

Signed-off-by: Wallas Santos <[email protected]>
@wallashss changed the title from "[WIP] feat: FP8 support on continuous batch" to "[WIP] feat: FP8 initial support on continuous batching" on Aug 26, 2025
@wallashss changed the title from "[WIP] feat: FP8 initial support on continuous batching" to "feat: FP8 initial support on continuous batching" on Aug 26, 2025
@wallashss wallashss marked this pull request as ready for review August 26, 2025 17:50
@@ -212,7 +221,12 @@ def check_scheduler_inference_steps(
new_token_ids[0])
collected_outputs[output.request_id]["logprobs"].append(
new_logprobs[0][0])
collected_outputs[output.request_id]["tokens"].append(
Collaborator commented:

I'm not sure I understand why the decoding is needed. Is it just to print text for debugging instead of token indices?

@wallashss (Author) commented:

Yes. These tests were failing for me, and just from the logs I couldn't tell why. For example, the generation was diverging due to a different token choice, but I couldn't tell whether the output was gibberish or something reasonable given the difference in logprobs. It is also helpful to get the exact prompt and test it in a different environment to see the response outside of a batch. For instance, the prompts in these tests are slightly different from the chicken soup prompts; they were truncated to an exact token count.

That's why I think the decoding is helpful for debugging these tests.


@@ -24,6 +25,9 @@
ISCLOSE_REL_TOL_CPU = 0.35
ISCLOSE_REL_TOL_SPYRE = 0.35

# TODO: improve this
ISCLOSE_REL_TOL_QUANTIZATION = 0.451
Collaborator commented:

uhhh.... at what point is this tolerance just too loose?

isclose takes an absolute tolerance as well; maybe we should instead start maintaining both tolerances. For example, if the two logprobs we're comparing are -9.1 and -15.2 we should fail, but with -0.000001 and -0.000002 maybe we can pass.

If we add ISCLOSE_ABS_TOL = 0.0001, how tight can we make the relative tolerance again?
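
For reference, this is roughly how the two tolerances combine with math.isclose (the 0.2 rel_tol below is just an illustrative guess, not a proposal):

import math

REL_TOL = 0.2     # hypothetical tighter relative tolerance
ABS_TOL = 0.0001  # proposed ISCLOSE_ABS_TOL

# math.isclose(a, b) passes when
# abs(a - b) <= max(REL_TOL * max(abs(a), abs(b)), ABS_TOL)
print(math.isclose(-9.1, -15.2, rel_tol=REL_TOL, abs_tol=ABS_TOL))
# False: clearly-different logprobs still fail
print(math.isclose(-1e-6, -2e-6, rel_tol=REL_TOL, abs_tol=ABS_TOL))
# True: near-zero logprobs pass thanks to the absolute tolerance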

@joerunde (Collaborator) left a comment:

lgtm! With a slight preference for adding an absolute tolerance so we don't have to relax the relative one so much.
