Questions on Evaluation Metrics and Concurrent Request Support #4

@EanWang

First of all, thank you for your excellent paper and for sharing the code with the community. I have several questions about your evaluation methodology and the practical applicability of the method:

1. Fairness in Resource Allocation for Baseline Comparison

In your proposed testing methodology, the baseline configuration uses 2 GPUs for the target model, while your approach requires additional GPU resources for both the draft model and verification model (e.g., an extra 2 GPUs).

This raises a question about fairness in resource comparison: If the baseline target model were allocated 4 GPUs (matching the total resource consumption of your method), would it achieve better performance? Theoretically, more GPU resources could enable:

  • Higher tensor parallelism, potentially improving computational efficiency
  • More KV cache space, supporting longer sequence processing
  • Larger batch capacity

Under such equal-resource conditions, would the baseline's performance exceed the acceleration results reported in the paper? I would appreciate seeing comparative results under this matched resource budget.
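
For concreteness, the equal-resource baseline I have in mind would look roughly like the sketch below, assuming the target model is served with vLLM; the model name and generation settings are placeholders, not your actual setup.

```python
from vllm import LLM, SamplingParams

# Hypothetical equal-resource baseline: give the target model all 4 GPUs via
# tensor parallelism instead of reserving 2 of them for the draft/verification
# models. Model name and sampling settings are placeholders.
baseline = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder target model
    tensor_parallel_size=4,                     # all 4 GPUs serve the target
    gpu_memory_utilization=0.90,                # extra headroom for KV cache
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = baseline.generate(
    ["Explain speculative decoding in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```

Comparing your method against this 4-GPU tensor-parallel configuration (rather than the 2-GPU one) would make the speedup claims easier to interpret.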

2. Concerns About Effectiveness in Multi-Concurrency Scenarios

Your current method appears to leverage the target model's spare concurrent-processing capacity, and it seems to have been evaluated primarily with single requests. In real-world production environments, however, handling multiple concurrent requests is far more common.

I'm concerned that in multi-concurrency scenarios:

  • End-to-end latency for individual verification requests might increase due to resource contention
  • Overall system throughput improvements might be less significant compared to single-request scenarios
  • Resource allocation efficiency could degrade when processing multiple requests simultaneously

Do you have plans to provide test results for multi-concurrency scenarios? This would be crucial for evaluating the method's practicality in real deployments.
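
To make the concern concrete, the kind of measurement I would hope to see is sketched below, assuming the system exposes a vLLM-style OpenAI-compatible endpoint; the URL, model name, prompts, and concurrency levels are placeholders, not your actual benchmark.

```python
import asyncio
import time

from openai import AsyncOpenAI  # vLLM exposes an OpenAI-compatible API


async def one_request(client: AsyncOpenAI, prompt: str) -> float:
    # Measure end-to-end latency of a single completion request.
    start = time.perf_counter()
    await client.completions.create(
        model="target-model",  # placeholder model name
        prompt=prompt,
        max_tokens=256,
        temperature=0.0,
    )
    return time.perf_counter() - start


async def run(concurrency: int) -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    prompts = [f"Request {i}: summarize speculative decoding." for i in range(concurrency)]
    start = time.perf_counter()
    latencies = await asyncio.gather(*(one_request(client, p) for p in prompts))
    wall = time.perf_counter() - start
    print(f"concurrency={concurrency:3d}  "
          f"mean latency: {sum(latencies) / len(latencies):.2f}s  "
          f"throughput: {concurrency / wall:.2f} req/s")


if __name__ == "__main__":
    for c in (1, 4, 16, 64):
        asyncio.run(run(c))
```

Reporting how per-request latency and aggregate throughput change as concurrency grows, for both the baseline and your method, would directly address this concern.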

3. Integration Possibilities with vLLM's Internal Speculative Decoding

I notice that your current solution optimizes primarily through an external draft model, but it does not seem to address the performance bottlenecks in vLLM's internal speculative-decoding verification process.

Given that mainstream speculative decoding methods such as EAGLE have gained significant attention and adoption in the vLLM community, have you considered building your modifications and optimizations on top of vLLM's internal draft-model verification mechanism?

Such integration could offer several advantages:

  • Better compatibility with vLLM's existing ecosystem
  • Leveraging vLLM's internally optimized KV cache management
  • Reducing communication overhead from external components
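
To clarify which integration point I mean: with vLLM's built-in speculative decoding, one would configure something roughly like the sketch below. The argument names follow one recent vLLM release and may differ in other versions, and the model names/paths are placeholders; my question is whether your verification-side optimizations could be applied inside this path rather than through an external draft model.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder target model
    tensor_parallel_size=4,
    # speculative_config keys follow one recent vLLM release and may differ
    # across versions; the draft-weight path below is a placeholder.
    speculative_config={
        "method": "eagle",
        "model": "path/to/eagle-draft-head",
        "num_speculative_tokens": 5,
    },
)

outputs = llm.generate(
    ["Explain the verification step in speculative decoding."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```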

Thank you again for your valuable work. I look forward to your response and further discussion!
