Questions on Evaluation Metrics and Concurrent Request Support
First of all, thank you for your excellent paper and for sharing the code with the community. I have several questions regarding your evaluation methodology and practical applicability:
1. Fairness in Resource Allocation for Baseline Comparison
In the evaluation setup described in the paper, the baseline configuration allocates 2 GPUs to the target model, while your approach requires additional GPU resources for the draft model and the verification model (e.g., an extra 2 GPUs).
This raises a fairness question about the resource comparison: if the baseline target model were allocated 4 GPUs (matching the total resource consumption of your method), would it achieve better performance? With more GPU resources, the baseline could in principle gain:
- Higher tensor parallelism, potentially improving computational efficiency
- More KV cache space, supporting longer sequence processing
- Larger batch capacity
Under such equal-resource conditions, would the baseline's performance exceed the accelerated results reported in the paper? I would appreciate seeing comparative results under this fairer resource allocation, for example along the lines sketched below.
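To make the equal-resource comparison concrete, here is a minimal sketch of the kind of baseline run I have in mind, assuming a vLLM backend; the model name, memory setting, and prompt are placeholders rather than values from the paper:

```python
# Hypothetical equal-resource baseline: give the target model all 4 GPUs
# via tensor parallelism instead of reserving 2 of them for the draft side.
# Model name, memory utilization, and prompt are placeholders.
from vllm import LLM, SamplingParams

baseline = LLM(
    model="meta-llama/Llama-2-70b-hf",   # placeholder target model
    tensor_parallel_size=4,              # all 4 GPUs go to the baseline
    gpu_memory_utilization=0.90,         # more aggregate KV-cache space
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = baseline.generate(
    ["Summarize speculative decoding in one sentence."], params
)
print(outputs[0].outputs[0].text)
```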
2. Concerns About Effectiveness in Multi-Concurrency Scenarios
Your method appears to exploit the target model's spare capacity for concurrent processing, and the evaluation focuses primarily on single-request testing. In real-world production environments, however, serving many concurrent requests is the norm.
I'm concerned that in multi-concurrency scenarios:
- End-to-end latency for individual verification requests might increase due to resource contention
- Overall system throughput improvements might be less significant compared to single-request scenarios
- Resource allocation efficiency could degrade when processing multiple requests simultaneously
Do you have plans to provide test results for multi-concurrency scenarios? This would be crucial for evaluating the method's practicality in real deployments.
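Concretely, the kind of measurement that would address this is a sweep over concurrency levels, something like the sketch below run against an OpenAI-compatible vLLM server; the endpoint URL, model name, prompts, and concurrency levels are placeholders:

```python
# Rough sketch of the concurrency test I have in mind: fire N requests at once
# against an OpenAI-compatible endpoint (e.g. started with `vllm serve ...`)
# and record per-request latency plus aggregate throughput.
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/completions"   # placeholder endpoint
MODEL = "target-model"                         # placeholder model name

async def one_request(session: aiohttp.ClientSession, prompt: str) -> float:
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": 256}
    start = time.perf_counter()
    async with session.post(URL, json=payload) as resp:
        await resp.json()
    return time.perf_counter() - start          # end-to-end latency in seconds

async def run(concurrency: int) -> None:
    async with aiohttp.ClientSession() as session:
        t0 = time.perf_counter()
        latencies = await asyncio.gather(
            *[one_request(session, f"Request {i}: explain KV cache.")
              for i in range(concurrency)]
        )
        wall = time.perf_counter() - t0
    print(f"concurrency={concurrency} "
          f"mean latency={sum(latencies) / len(latencies):.2f}s "
          f"throughput={concurrency / wall:.2f} req/s")

if __name__ == "__main__":
    for n in (1, 4, 16):                        # single-request vs. concurrent load
        asyncio.run(run(n))
```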
3. Integration Possibilities with vLLM's Internal Speculative Decoding
I notice that your current solution optimizes primarily via an external draft model, and does not appear to address the performance bottlenecks in vLLM's internal speculative-decoding verification path.
Given that mainstream speculative decoding methods like Eagle have gained significant attention and adoption in the vLLM community, have you considered building on and optimizing vLLM's internal draft-model verification mechanism instead (see the configuration sketch after the list below)?
Such integration could offer several advantages:
- Better compatibility with vLLM's existing ecosystem
- Leveraging vLLM's internally optimized KV cache management
- Reducing communication overhead from external components
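For reference, my understanding of how vLLM's built-in (draft-model-based) speculative decoding is configured is roughly the following; the argument names vary across vLLM versions and the model choices are placeholders, so treat this as an assumption about the integration point rather than a statement about your code:

```python
# My (version-dependent) understanding of vLLM's built-in speculative decoding;
# exact argument names differ between vLLM releases, and the model names are
# placeholders, not taken from the paper or repository.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",             # placeholder target model
    tensor_parallel_size=4,
    speculative_model="meta-llama/Llama-2-7b-hf",  # placeholder draft model
    num_speculative_tokens=5,                      # draft length per verification step
)

out = llm.generate(
    ["Explain why speculative decoding reduces latency."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(out[0].outputs[0].text)
```

The question, then, is whether your scheduling and verification optimizations could be applied inside this path rather than around it.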
Thank you again for your valuable work. I look forward to your response and further discussion!