Conversation

@Jackmin801 commented on Jan 17, 2026

Note

Introduces token-in chat completions and adapts to vLLM nightly module/layout changes.

  • Adds OpenAIServingChatWithTokens and a /v1/chat/completions/tokens route; allows overriding prompt_token_ids with tokens from the request and supports both streaming and full responses (see the route sketch after this list)
  • Updates imports to new vLLM OpenAI modules (chat_completion.protocol/serving, engine.protocol) and adjusts server wiring
  • Monkeypatches (a generic patching sketch follows this list):
    • PrometheusStatLogger init to bypass DP-mode LoRA check while restoring LoRA metrics
    • OpenAIServingModels.load_lora_adapter to reuse/update existing adapters by name
    • LRUCacheWorkerLoRAManager to avoid redundant per-request loads and manage cache activation/eviction
  • Customizes init_app_state and worker proc to ensure patches are applied in multi-API-server setups
  • Test utils now consider only SUCCESS step lines that match the reward pattern for numeric checks (see the filtering sketch below)
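
The route sketch below illustrates the token-in flow from the first bullet: a thin wrapper around the regular chat serving object that, when the request carries pre-tokenized input, uses it as prompt_token_ids instead of re-tokenizing the chat messages. This is a minimal sketch only; the request model, field names (`token_ids`, `prompt_token_ids`), app-state wiring, and the wrapper's internals are assumptions, not the exact code in this PR or vLLM's actual classes.

```python
# Illustrative sketch only: class, field, and route names follow the summary
# above, but the exact vLLM types, imports, and override point are assumptions.
from typing import Optional

from fastapi import APIRouter, Request
from pydantic import BaseModel

router = APIRouter()


class TokenChatRequest(BaseModel):
    # Stand-in for vLLM's ChatCompletionRequest, extended with pre-tokenized input.
    model: str
    messages: list[dict] = []
    token_ids: Optional[list[int]] = None         # tokens supplied by the caller (hypothetical field)
    prompt_token_ids: Optional[list[int]] = None  # what the serving layer ultimately consumes


class OpenAIServingChatWithTokens:
    """Wraps the regular chat serving object. If the request carries token IDs,
    they override prompt_token_ids so the chat messages are not re-tokenized."""

    def __init__(self, serving_chat):
        self._serving_chat = serving_chat

    async def create_chat_completion(self, request, raw_request):
        if request.token_ids:
            request.prompt_token_ids = request.token_ids
        # Streaming and full responses are both handled by the wrapped object.
        return await self._serving_chat.create_chat_completion(request, raw_request)


@router.post("/v1/chat/completions/tokens")
async def create_chat_completion_with_tokens(request: TokenChatRequest, raw_request: Request):
    serving = raw_request.app.state.serving_chat_with_tokens  # assumed app-state wiring
    return await serving.create_chat_completion(request, raw_request)
```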
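For the load_lora_adapter patch, the general shape is wrapping the original coroutine so that an adapter whose name is already registered is updated rather than duplicated or rejected. The sketch below shows that pattern only; the `lora_requests` attribute, `lora_name` field, and method signature are assumptions about vLLM internals, not necessarily what the PR does.

```python
# Generic monkeypatch sketch. The `lora_requests` attribute, `lora_name` field,
# and method signature are assumptions about vLLM internals, not verified APIs.
import functools


def patch_load_lora_adapter(serving_models_cls):
    """Patch load_lora_adapter so that loading an adapter whose name is already
    registered replaces the existing entry instead of duplicating or rejecting it."""
    original = serving_models_cls.load_lora_adapter

    @functools.wraps(original)
    async def load_lora_adapter(self, request, *args, **kwargs):
        name = getattr(request, "lora_name", None)
        existing = getattr(self, "lora_requests", [])
        if name is not None and any(getattr(a, "lora_name", None) == name for a in existing):
            # Reuse/update path: drop the stale entry so the reload replaces it
            # (assumed behavior; the PR may instead update the entry in place).
            self.lora_requests = [a for a in existing if getattr(a, "lora_name", None) != name]
        return await original(self, request, *args, **kwargs)

    serving_models_cls.load_lora_adapter = load_lora_adapter
```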
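The test-utils change amounts to filtering log lines before parsing numbers. A hypothetical sketch follows, assuming a log format like `SUCCESS step=12 reward=0.85`; the actual pattern and format live in the repo's test utilities and may differ.

```python
# Hypothetical filtering sketch, assuming a log format like
# "SUCCESS step=12 reward=0.85"; the real pattern lives in the repo's test utils.
import re

REWARD_PATTERN = re.compile(r"SUCCESS\s+step=(\d+)\s+reward=([-+]?\d*\.?\d+)")


def extract_rewards(log_lines: list[str]) -> list[float]:
    """Return reward values only from SUCCESS step lines that match the pattern;
    failures and unrelated output do not feed the numeric checks."""
    rewards = []
    for line in log_lines:
        match = REWARD_PATTERN.search(line)
        if match:
            rewards.append(float(match.group(2)))
    return rewards
```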

Written by Cursor Bugbot for commit b16f34a. This will update automatically on new commits. Configure here.

@Jackmin801 marked this pull request as ready for review on January 18, 2026, 18:25