JSON Serialization Error: NaN Values Causing Serialization Failure
Problem Description
When using the Llama-3.1-8B-Instruct model for inference, a JSON serialization error occurs: the `/v1/chat/completions/tokens` endpoint fails to serialize the `ChatCompletionResponse` object to JSON when returning a response.
Error Message
```
ValueError: Out of range float values are not JSON compliant: nan
```
Full Error Stack Trace
```
(ApiServer_0 pid=467463) ERROR: Exception in ASGI application
(ApiServer_0 pid=467463) Traceback (most recent call last):
...
  File "/proj-vertical-llms-pvc/users/zhihan/webarena/WebAgent-R1/WebAgent-R1-prime/prime-rl/src/prime_rl/inference/vllm/server.py", line 119, in _chat_with_tokens
    return JSONResponse(content=generator.model_dump())
...
  File "/root/.local/share/uv/python/cpython-3.12.12-linux-x86_64-gnu/lib/python3.12/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
(ApiServer_0 pid=467463) ValueError: Out of range float values are not JSON compliant: nan
```
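For context, this failure is reproducible outside the server: Starlette's `JSONResponse` (which FastAPI re-exports) serializes its content with `allow_nan=False`, so any `nan` left in the dumped response dict raises exactly this `ValueError`. A minimal sketch (the `logprob` key is illustrative, not the actual response schema):

```python
import json

# Plain json.dumps tolerates NaN by default and emits non-standard "NaN"...
print(json.dumps({"logprob": float("nan")}))  # -> {"logprob": NaN}

# ...but Starlette's JSONResponse passes allow_nan=False, which raises the
# same error seen in the trace above (message shown on Python 3.12).
try:
    json.dumps({"logprob": float("nan")}, allow_nan=False)
except ValueError as err:
    print(err)  # Out of range float values are not JSON compliant: nan
```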
Problem Analysis
- Error Location: The error occurs in the `_chat_with_tokens` function in `src/prime_rl/inference/vllm/server.py`, when the `ChatCompletionResponse` object is converted to a dictionary via `model_dump()` and serialized to JSON.
- Root Cause: The `ChatCompletionResponse` object contains `nan` (Not a Number) float values, which are not valid in standard JSON (along with `inf` and other special float values). A sketch for locating the offending fields follows this list.
- Model Differences:
  - ✅ Qwen series models: With Qwen3-4B-Instruct-2507 and Qwen2.5-7B-Instruct, this error was not observed
  - ❌ Llama-3.1-8B: After switching to Llama-3.1-8B-Instruct, the error started occurring
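To pinpoint which response fields actually carry the `nan` values, one option is to recursively scan the `model_dump()` output just before serialization. A minimal diagnostic sketch, assuming nothing about the response schema (`find_non_finite` is a hypothetical helper, not part of prime-rl or vLLM):

```python
import math

def find_non_finite(obj, path="$"):
    """Yield (json_path, value) for every NaN/inf float in a nested structure."""
    if isinstance(obj, float) and not math.isfinite(obj):
        yield path, obj
    elif isinstance(obj, dict):
        for key, val in obj.items():
            yield from find_non_finite(val, f"{path}.{key}")
    elif isinstance(obj, (list, tuple)):
        for i, val in enumerate(obj):
            yield from find_non_finite(val, f"{path}[{i}]")

# Usage inside _chat_with_tokens, just before building the JSONResponse:
#     content = generator.model_dump()
#     for json_path, value in find_non_finite(content):
#         print("non-finite float at", json_path, value)
```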
Configuration
Full configuration file content:
```toml
inference_gpu_ids = [0,1]
trainer_gpu_ids = [2,3,4,5,6,7]
max_steps = 200   # Number of training steps
seq_len = 16384   # Sequence length for training

[model]
name = "Llama-3.1-8B-Instruct"

[wandb]
project = "webarena-rl"
name = "webarena-debug"

[trainer.model]
impl = "liger_kernel"   # Training implementation
ac = { freq = 1 }       # freq=1 means full activation checkpointing (checkpoint every layer)

[trainer.optim]
lr = 1e-5

[orchestrator]
batch_size = 128            # Batch size (adjust based on GPU memory)
rollouts_per_example = 16   # Number of rollouts per example

[orchestrator.optim]
lr = 1e-5   # Learning rate (should match trainer.optim.lr)

[orchestrator.sampling]
max_tokens = 512   # Max tokens per generation

# WebArena environment configuration
[[orchestrator.env]]
id = "webarena-env"     # Environment ID (must match the installed package name)
name = "webarena-env"   # Display name for logs

[inference]
# Inference server configuration
api_server_count = 2   # Number of inference servers
gpu_memory_utilization = 0.9

[inference.parallel]
dp = 2

[inference.model]
```
Questions
- Why does this problem occur?
- How can it be fixed? (one possible workaround is sketched below)
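One possible mitigation, sketched under the assumption that the client can tolerate `null` in the affected fields (e.g., per-token logprobs): replace non-finite floats before handing the dict to `JSONResponse`. The `sanitize_floats` helper is hypothetical, not an existing prime-rl or vLLM API:

```python
import math

def sanitize_floats(obj):
    """Replace NaN/inf floats with None so json.dumps(allow_nan=False) succeeds."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, dict):
        return {key: sanitize_floats(val) for key, val in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [sanitize_floats(val) for val in obj]
    return obj

# In _chat_with_tokens (src/prime_rl/inference/vllm/server.py:119), instead of
#     return JSONResponse(content=generator.model_dump())
# one could write
#     return JSONResponse(content=sanitize_floats(generator.model_dump()))
```

This only makes the response serializable; why Llama-3.1-8B-Instruct produces `nan` in the first place (while the Qwen models did not) would still need investigation upstream.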
Related Code Locations
- `src/prime_rl/inference/vllm/server.py:119` - `_chat_with_tokens` function
- `src/prime_rl/inference/vllm/serving_chat_with_tokens.py` - Chat completion with tokens implementation