In fastchat/serve/model_worker.py, the get_embeddings function computes mean-pooled embeddings incorrectly when it processes multiple inputs in one batch.
sum_embeddings is computed per sequence, but token_num sums the attention mask across the entire batch, so every sequence is divided by the batch-wide token count:
sum_embeddings = torch.sum(masked_embeddings, dim=1) # [batch_size, hidden_dim]
token_num = torch.sum(attention_mask).item() # scalar across ALL sequences
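A toy reproduction under assumed shapes (the tensor values here are illustrative, not taken from the worker): with two sequences of different lengths whose unmasked token embeddings are all 1.0, a correct mean pool would return 1.0 for both rows, but the batch-wide divisor skews each row.

```python
import torch

# Hypothetical batch: two sequences, lengths 4 and 2 (1 = real token, 0 = padding).
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
# Every real token's embedding is 1.0 in all hidden dims; padding is zeroed out.
masked_embeddings = torch.ones(2, 4, 8) * attention_mask.unsqueeze(-1)

sum_embeddings = torch.sum(masked_embeddings, dim=1)  # [2, 8]: rows sum to 4 and 2
token_num = torch.sum(attention_mask).item()          # 6, summed over the whole batch

buggy = sum_embeddings / token_num
# Correct per-sequence means are 1.0 and 1.0, but this yields 4/6 and 2/6.
```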
Each sequence should be divided by its own token count:
token_counts = attention_mask.sum(dim=1, keepdim=True) # [batch_size, 1]
mean_embeddings = sum_embeddings / token_counts
This silently returns wrong embeddings whenever batch_size > 1: each sequence's embedding is scaled down by the total token count of the batch rather than by its own length, so a sequence's embedding changes depending on what else happens to be in the same batch.
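The per-sequence fix can be sketched as a small standalone helper (masked_mean_pool is a hypothetical name, not the function in the worker; the clamp guard against all-padding rows is an added assumption):

```python
import torch

def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: [batch_size, seq_len, hidden_dim]
    # attention_mask: [batch_size, seq_len], 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).float()         # [batch_size, seq_len, 1]
    sum_embeddings = (hidden_states * mask).sum(dim=1)  # [batch_size, hidden_dim]
    # Divide each row by its OWN token count, not the batch total.
    token_counts = attention_mask.sum(dim=1, keepdim=True).clamp(min=1)  # [batch_size, 1]
    return sum_embeddings / token_counts
```

With this helper, a batch of one sequence and a batch of many produce identical embeddings for the same input, which is the invariant the original code violates.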