Questions on Evaluation Metrics and Concurrent Request Support
First of all, thank you for your excellent paper and for sharing the code with the community. I have several questions regarding your evaluation methodology and practical applicability:
1. Fairness in Resource Allocation for Baseline Comparison
In the evaluation setup described in the paper, the baseline configuration allocates 2 GPUs to the target model, while your approach requires additional GPU resources for the draft model and the verification model (e.g., an extra 2 GPUs).
This raises a fairness question about the resource comparison: if the baseline target model were allocated 4 GPUs (matching the total resource consumption of your method), would it achieve better performance? With more GPU resources, the baseline could in principle gain:
- Higher tensor parallelism, potentially improving computational efficiency
- More KV cache space, supporting longer sequence processing
- Larger batch capacity
Under such equal-resource conditions, would the baseline's performance exceed the accelerated results reported in the paper? I would appreciate seeing comparative results under this fairer resource allocation, for example along the lines sketched below.
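To make the equal-resource comparison concrete, here is a minimal sketch of the kind of baseline run I have in mind, assuming a vLLM backend; the model name, memory setting, and prompt are placeholders rather than values from the paper:

```python
# Hypothetical equal-resource baseline: give the target model all 4 GPUs
# via tensor parallelism instead of reserving 2 of them for the draft side.
# Model name, memory utilization, and prompt are placeholders.
from vllm import LLM, SamplingParams

baseline = LLM(
    model="meta-llama/Llama-2-70b-hf",   # placeholder target model
    tensor_parallel_size=4,              # all 4 GPUs go to the baseline
    gpu_memory_utilization=0.90,         # more aggregate KV-cache space
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = baseline.generate(
    ["Summarize speculative decoding in one sentence."], params
)
print(outputs[0].outputs[0].text)
```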
2. Concerns About Effectiveness in Multi-Concurrency Scenarios
Your method appears to exploit the target model's spare capacity for concurrent processing, and the evaluation focuses primarily on single-request testing. In real-world production environments, however, serving many concurrent requests is the norm.
I'm concerned that in multi-concurrency scenarios:
- End-to-end latency for individual verification requests might increase due to resource contention
- Overall system throughput improvements might be less significant compared to single-request scenarios
- Resource allocation efficiency could degrade when processing multiple requests simultaneously
Do you have plans to provide test results for multi-concurrency scenarios? This would be crucial for evaluating the method's practicality in real deployments.
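Concretely, the kind of measurement that would address this is a sweep over concurrency levels, something like the sketch below run against an OpenAI-compatible vLLM server; the endpoint URL, model name, prompts, and concurrency levels are placeholders:

```python
# Rough sketch of the concurrency test I have in mind: fire N requests at once
# against an OpenAI-compatible endpoint (e.g. started with `vllm serve ...`)
# and record per-request latency plus aggregate throughput.
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/completions"   # placeholder endpoint
MODEL = "target-model"                         # placeholder model name

async def one_request(session: aiohttp.ClientSession, prompt: str) -> float:
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": 256}
    start = time.perf_counter()
    async with session.post(URL, json=payload) as resp:
        await resp.json()
    return time.perf_counter() - start          # end-to-end latency in seconds

async def run(concurrency: int) -> None:
    async with aiohttp.ClientSession() as session:
        t0 = time.perf_counter()
        latencies = await asyncio.gather(
            *[one_request(session, f"Request {i}: explain KV cache.")
              for i in range(concurrency)]
        )
        wall = time.perf_counter() - t0
    print(f"concurrency={concurrency} "
          f"mean latency={sum(latencies) / len(latencies):.2f}s "
          f"throughput={concurrency / wall:.2f} req/s")

if __name__ == "__main__":
    for n in (1, 4, 16):                        # single-request vs. concurrent load
        asyncio.run(run(n))
```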
3. Integration Possibilities with vLLM's Internal Speculative Decoding
I notice that your current solution optimizes primarily via an external draft model, and does not appear to address the performance bottlenecks in vLLM's internal speculative-decoding verification path.
Given that mainstream speculative decoding methods like Eagle have gained significant attention and adoption in the vLLM community, have you considered building on and optimizing vLLM's internal draft-model verification mechanism instead (see the configuration sketch after the list below)?
Such integration could offer several advantages:
- Better compatibility with vLLM's existing ecosystem
- Leveraging vLLM's internally optimized KV cache management
- Reducing communication overhead from external components
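For reference, my understanding of how vLLM's built-in (draft-model-based) speculative decoding is configured is roughly the following; the argument names vary across vLLM versions and the model choices are placeholders, so treat this as an assumption about the integration point rather than a statement about your code:

```python
# My (version-dependent) understanding of vLLM's built-in speculative decoding;
# exact argument names differ between vLLM releases, and the model names are
# placeholders, not taken from the paper or repository.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",             # placeholder target model
    tensor_parallel_size=4,
    speculative_model="meta-llama/Llama-2-7b-hf",  # placeholder draft model
    num_speculative_tokens=5,                      # draft length per verification step
)

out = llm.generate(
    ["Explain why speculative decoding reduces latency."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(out[0].outputs[0].text)
```

The question, then, is whether your scheduling and verification optimizations could be applied inside this path rather than around it.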
Thank you again for your valuable work. I look forward to your response and further discussion!