

@Manan17 Manan17 commented Jun 5, 2025

Summary

Just testing out logprobs as mentioned in #742.
It worked for the models where the test using logits was failing.
Also tried setting a 1e-1 tolerance for Qwen (previously 1), and it passed.

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

Comment on lines +899 to 900
1e-1, # 1e-1
1e-1, # 1e-2
Collaborator


After removing all logprobs comparison, we can try setting it lower.
sglang only has atol and sets it to 5e-2 (decode_tolerance)
verl sets (atol, rtol) = (1e-2, 1e-5), but it's mean of all logprobs not topk
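For reference, both of those settings are instances of the standard allclose-style criterion |actual - expected| <= atol + rtol * |expected|; a minimal sketch in plain Python, with illustrative values:

```python
# Allclose-style criterion that both sglang and verl approximate (sketch).
def within_tolerance(actual, expected, atol=0.0, rtol=0.0):
    # |actual - expected| <= atol + rtol * |expected|
    return abs(actual - expected) <= atol + rtol * abs(expected)

# sglang-style: a single absolute tolerance on logprobs (atol=5e-2).
print(within_tolerance(-1.02, -1.00, atol=5e-2))             # True
# verl-style: tight atol plus a tiny rtol (atol=1e-2, rtol=1e-5).
print(within_tolerance(-1.02, -1.00, atol=1e-2, rtol=1e-5))  # False
```

The same 0.02 log-space gap passes the sglang-style check but fails the verl-style one, which is why the mean-vs-topk distinction below matters.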

Contributor Author

@Manan17 Manan17 Jun 5, 2025


It does not pass with a lower tolerance.
For gemma3, it passes when atol=1e-1 and rtol=1.

Contributor Author


I tested this out with fp32; it fails for most of the models where the old logits-checking logic passes.

Collaborator


Since we are comparing values in log-space, the total tolerance here is effectively a relative tolerance on the underlying probabilities.

Contributor Author


Can we just check the rtol?
like: tolerance = rtol * torch.abs(tensor2)

Collaborator

@Tcc0403 Tcc0403 Jun 9, 2025


The absolute diff of two logprobs (logA - logB) corresponds to the relative diff of the two probs (A / B), which means the whole tolerance (atol + rtol * torch.abs(expected)) should be read as the maximum relative diff we can accept.

I think that's also why sglang only has a single tolerance in their test.
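A quick numeric check of that identity, with illustrative probabilities: an absolute tolerance on logprobs bounds the ratio of the underlying probabilities.

```python
import math

# Two hypothetical probabilities for the same token from two implementations.
A, B = 0.105, 0.100

# Absolute diff in log-space equals the log of the probability ratio.
log_diff = abs(math.log(A) - math.log(B))
print(math.isclose(log_diff, abs(math.log(A / B))))  # True

# So atol=5e-2 on logprobs accepts probs differing by a factor of up to e**0.05,
# i.e. roughly a 5% relative difference.
print(math.exp(0.05))
```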


Manan17 commented Jun 8, 2025

@Tcc0403 Can you have a look at the changes? I have tested them.
Let me know what you think, and I will update the multimodal tests as well.
What should be done for test_mini_models_with_logits?


Tcc0403 commented Jun 9, 2025

> What should be done for test_mini_models_with_logits

check logprobs as well for consistency

I'm planning to rewrite convergence tests so just ignore namings for now.


Manan17 commented Jun 9, 2025

> > What should be done for test_mini_models_with_logits
>
> check logprobs as well for consistency
>
> I'm planning to rewrite convergence tests so just ignore namings for now.

Gotcha!
I tried testing with mean logprobs as well.
The tests pass with lower tolerance values. Verl has set atol=1e-2 and rtol=1e-5, which works for us as well in bf16.


Tcc0403 commented Jun 9, 2025

> I tried testing with mean logprobs as well.
> The tests pass with lower tolerance values. Verl has set atol=1e-2 and rtol=1e-5, which works for us as well in bf16.

Which mean logprobs do you pick? I checked the verl implementation; they pick per-token logprobs for the given labels.
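For context, the verl-style extraction described here can be sketched as follows (function name and shapes are hypothetical, not the actual verl code): gather the logprob of each label token from the log_softmax output, then average.

```python
import torch
import torch.nn.functional as F

def mean_label_logprob(logits, labels):
    # logits: (batch, seq, vocab); labels: (batch, seq)
    logprobs = F.log_softmax(logits.float(), dim=-1)
    # Pick the logprob of each label token, not the top-k of the distribution.
    per_token = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return per_token.mean()

logits = torch.randn(2, 4, 8)
labels = torch.randint(0, 8, (2, 4))
print(mean_label_logprob(logits, labels))  # a single scalar, always <= 0
```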


Manan17 commented Jun 10, 2025

I tried top 20 logprobs and it was able to pass tests for all the models! @Tcc0403
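A minimal sketch of a top-k logprob comparison along those lines (function name and tolerances are illustrative, not the exact test code): take the top-k indices from one side and gather the same vocab positions from the other, so the compared values are index-aligned.

```python
import torch
import torch.nn.functional as F

def topk_logprobs_close(logits_a, logits_b, k=20, atol=1e-1, rtol=1e-1):
    lp_a = F.log_softmax(logits_a.float(), dim=-1)
    lp_b = F.log_softmax(logits_b.float(), dim=-1)
    # Top-k indices come from one side; gather the same positions on the other.
    top_vals, top_idx = lp_a.topk(k, dim=-1)
    return torch.allclose(top_vals, lp_b.gather(-1, top_idx), atol=atol, rtol=rtol)

logits = torch.randn(2, 4, 64)
# softmax is shift-invariant, so a constant offset should still compare equal
print(topk_logprobs_close(logits, logits + 0.5))  # True
```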


Manan17 commented Jun 11, 2025

The tolerance for the gemma3 multimodal model had to be set high, as it does not pass the loss and topk_logprobs tests.
Both atol and rtol are set to 1e-1.


Tcc0403 commented Jun 11, 2025

> The tolerance for the gemma3 multimodal model had to be set high, as it does not pass the loss and topk_logprobs tests.
> Both atol and rtol are set to 1e-1.

Yeah, I think we can compromise with 1e-1 before further investigation into the numerical issue. Just make them all green first unless there's an obvious mismatch.


@shimizust shimizust left a comment


Thanks for making these changes!

@shimizust shimizust merged commit 1f640a5 into linkedin:main Jun 13, 2025
3 of 7 checks passed
@Manan17 Manan17 changed the title from "Trying out logprobs and top logprobs for testing rather than logits." to "Changed tests from logits to topk logprobs" on Jul 9, 2025