Commit 08d4613

Change time-to-first-token parameter to be based on number of request tokens #137 (#165)
* Fix comments on prefill arg in completion request interface
* Add feature of calc ttft by prefill overhead. TODO: kvcache transfer overhead
* Rename prefill-overhead-complexity to prefill-complexity
* Calc kv cache transfer overhead based on prompt length
* Add invalid test cases for args prefill-overhead and kv-cache-transfer-overhead
* Add standard deviation in utils
* Add stddev for prefill overhead and kvcache trans overhead
* Fix test condition when remove p/d is enabled and in-place policy is used
* Use simplified implementation of ttft
* Add sep lines in readme params
* Update readme with explanation of new ttft
* Fix ttft new params tests
* Fix kv cache transfer tests and impl
* Fix invalid config test of new ttft params
* Revert "Add standard deviation in utils" This reverts commit 18d3075.
* Remove additional variables in prefill time calculation
* Improve is remote prefill/decode interface doc
* Improve implementation of ttft calc
* Remove unnecessary variable

---------

Signed-off-by: Qifan Deng <[email protected]>
1 parent 33e7210 commit 08d4613

7 files changed, with 211 additions and 19 deletions

README.md

Lines changed: 12 additions & 0 deletions
@@ -101,13 +101,22 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
101101
- `mode`: the simulator mode, optional, by default `random`
102102
- `echo`: returns the same text that was sent in the request
103103
- `random`: returns a sentence chosen at random from a set of pre-defined sentences
104+
---
104105
- `time-to-first-token`: the time to the first token (in milliseconds), optional, by default zero
105106
- `time-to-first-token-std-dev`: standard deviation for time before the first token will be returned, in milliseconds, optional, default is 0, can't be more than 30% of `time-to-first-token`, will not cause the actual time to first token to differ by more than 70% from `time-to-first-token`
106107
- `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, by default zero
107108
- `inter-token-latency-std-dev`: standard deviation for time between generated tokens, in milliseconds, optional, default is 0, can't be more than 30% of `inter-token-latency`, will not cause the actual inter token latency to differ by more than 70% from `inter-token-latency`
108109
- `kv-cache-transfer-latency`: time for KV-cache transfer from a remote vLLM (in milliseconds), by default zero. Usually much shorter than `time-to-first-token`
109110
- `kv-cache-transfer-latency-std-dev`: standard deviation for time to "transfer" kv-cache from another vLLM instance in case P/D is activated, in milliseconds, optional, default is 0, can't be more than 30% of `kv-cache-transfer-latency`, will not cause the actual latency to differ by more than 70% from `kv-cache-transfer-latency`
111+
---
112+
- `prefill-overhead`: constant overhead for prefill (in milliseconds), optional, by default zero. Used to calculate the time to first token; ignored if `time-to-first-token` is not `0`
113+
- `prefill-time-per-token`: time taken to process each prompt token during prefill (in milliseconds), optional, by default zero. Ignored if `time-to-first-token` is not `0`
114+
- `prefill-time-std-dev`: similar to `time-to-first-token-std-dev`, but applied to the final prefill time, which is calculated from `prefill-overhead`, `prefill-time-per-token`, and the number of prompt tokens. Ignored if `time-to-first-token` is not `0`
115+
- `kv-cache-transfer-time-per-token`: time taken to transfer the KV cache for each prompt token when P/D is enabled (in milliseconds), optional, by default zero. Ignored if `kv-cache-transfer-latency` is not `0`
116+
- `kv-cache-transfer-time-std-dev`: similar to `time-to-first-token-std-dev`, but applied to the final KV-cache transfer time when P/D is enabled (in milliseconds), which is calculated from `kv-cache-transfer-time-per-token` and the number of prompt tokens. Ignored if `kv-cache-transfer-latency` is not `0`
117+
---
110118
- `seed`: random seed for operations (if not set, current Unix time in nanoseconds is used)
119+
---
111120
- `max-tool-call-integer-param`: the maximum possible value of integer parameters in a tool call, optional, defaults to 100
112121
- `min-tool-call-integer-param`: the minimum possible value of integer parameters in a tool call, optional, defaults to 0
113122
- `max-tool-call-number-param`: the maximum possible value of number (float) parameters in a tool call, optional, defaults to 100
@@ -116,6 +125,7 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
116125
- `min-tool-call-array-param-length`: the minimum possible length of array parameters in a tool call, optional, defaults to 1
117126
- `tool-call-not-required-param-probability`: the probability to add a parameter, that is not required, in a tool call, optional, defaults to 50
118127
- `object-tool-call-not-required-field-probability`: the probability to add a field, that is not required, in an object in a tool call, optional, defaults to 50
128+
---
119129
- `enable-kvcache`: if true, the KV cache support will be enabled in the simulator. In this case, the KV cache will be simulated, and ZMQ events will be published when a KV cache block is added or evicted.
120130
- `kv-cache-size`: the maximum number of token blocks in kv cache
121131
- `block-size`: token block size for contiguous chunks of tokens, possible values: 8,16,32,64,128
@@ -124,8 +134,10 @@ For more details see the <a href="https://docs.vllm.ai/en/stable/getting_started
124134
- `zmq-endpoint`: ZMQ address to publish events
125135
- `zmq-max-connect-attempts`: the maximum number of ZMQ connection attempts, defaults to 0, maximum: 10
126136
- `event-batch-size`: the maximum number of kv-cache events to be sent together, defaults to 16
137+
---
127138
- `failure-injection-rate`: probability (0-100) of injecting failures, optional, default is 0
128139
- `failure-types`: list of specific failure types to inject (rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found), optional, if empty all types are used
140+
---
129141
- `fake-metrics`: represents a predefined set of metrics to be sent to Prometheus as a substitute for the real metrics. When specified, only these fake metrics will be reported — real metrics and fake metrics will never be reported together. The set should include values for
130142
- `running-requests`
131143
- `waiting-requests`
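
As a rough illustration of how the new README options combine, the sketch below computes the mean latencies before any standard deviation is applied. The `LatencyOptions` struct and both function names are hypothetical and exist only for this example; only the option semantics come from the README text above.

```go
package main

// LatencyOptions holds a subset of the README options, all in milliseconds.
// The struct is hypothetical; it is not the simulator's Configuration type.
type LatencyOptions struct {
	TimeToFirstToken            int // time-to-first-token
	PrefillOverhead             int // prefill-overhead
	PrefillTimePerToken         int // prefill-time-per-token
	KVCacheTransferLatency      int // kv-cache-transfer-latency
	KVCacheTransferTimePerToken int // kv-cache-transfer-time-per-token
}

// MeanPrefillTime returns the mean time to first token for a local prefill.
// A non-zero time-to-first-token takes precedence over the per-token options.
func MeanPrefillTime(o LatencyOptions, promptTokens int) int {
	if o.TimeToFirstToken != 0 {
		return o.TimeToFirstToken
	}
	return o.PrefillOverhead + promptTokens*o.PrefillTimePerToken
}

// MeanKVCacheTransferTime returns the mean KV-cache transfer time with P/D enabled.
// A non-zero kv-cache-transfer-latency takes precedence over the per-token option.
func MeanKVCacheTransferTime(o LatencyOptions, promptTokens int) int {
	if o.KVCacheTransferLatency != 0 {
		return o.KVCacheTransferLatency
	}
	return promptTokens * o.KVCacheTransferTimePerToken
}
```

For example, with `prefill-overhead` 100 and `prefill-time-per-token` 200, a 128-token prompt gives a mean of 100 + 128 * 200 = 25700 ms.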

pkg/common/config.go

Lines changed: 42 additions & 0 deletions
@@ -72,6 +72,7 @@ type Configuration struct {
7272
// in milliseconds, optional, default is 0, can't be more than 30% of TimeToFirstToken, will not
7373
// cause the actual time to first token to differ by more than 70% from TimeToFirstToken
7474
TimeToFirstTokenStdDev int `yaml:"time-to-first-token-std-dev" json:"time-to-first-token-std-dev"`
75+
7576
// InterTokenLatency time between generated tokens, in milliseconds
7677
InterTokenLatency int `yaml:"inter-token-latency" json:"inter-token-latency"`
7778
// InterTokenLatencyStdDev standard deviation for time between generated tokens, in milliseconds,
@@ -87,6 +88,21 @@ type Configuration struct {
8788
// KVCacheTransferLatency
8889
KVCacheTransferLatencyStdDev int `yaml:"kv-cache-transfer-latency-std-dev" json:"kv-cache-transfer-latency-std-dev"`
8990

91+
// $Total Prefill Time = PrefillOverhead + n * PrefillTimePerToken$
92+
// where n is the number of prompt tokens; the assumption is that n is less than k, the number of parallelism units of the GPU
93+
// PrefillOverhead constant overhead of prefill, PrefillTimePerToken additional prefill time per prompt token, both in milliseconds
94+
PrefillOverhead int `yaml:"prefill-overhead" json:"prefill-overhead"`
95+
PrefillTimePerToken int `yaml:"prefill-time-per-token" json:"prefill-time-per-token"`
96+
// PrefillTimeStdDev similar to TimeToFirstTokenStdDev
97+
PrefillTimeStdDev int `yaml:"prefill-time-std-dev" json:"prefill-time-std-dev"`
98+
// $Total KV Cache Transfer Time = n * KVCacheTransferTimePerToken$
99+
// the assumption is that the cache blocks are all missed at the remote pod
100+
// KVCacheTransferTimePerToken time taken per prompt token to transfer kv-cache from another vLLM instance in case P/D is activated,
101+
// in milliseconds.
102+
KVCacheTransferTimePerToken int `yaml:"kv-cache-transfer-time-per-token" json:"kv-cache-transfer-time-per-token"`
103+
// KVCacheTransferTimeStdDev similar to TimeToFirstTokenStdDev
104+
KVCacheTransferTimeStdDev int `yaml:"kv-cache-transfer-time-std-dev" json:"kv-cache-transfer-time-std-dev"`
105+
90106
// Mode defines the simulator response generation mode, valid values: echo, random
91107
Mode string `yaml:"mode" json:"mode"`
92108
// Seed defines random seed for operations
@@ -307,6 +323,24 @@ func (c *Configuration) validate() error {
307323
if float32(c.TimeToFirstTokenStdDev) > 0.3*float32(c.TimeToFirstToken) {
308324
return errors.New("time to first token standard deviation cannot be more than 30% of time to first token")
309325
}
326+
327+
if c.PrefillOverhead < 0 {
328+
return errors.New("prefill overhead cannot be negative")
329+
}
330+
if c.PrefillTimePerToken < 0 {
331+
return errors.New("prefill time per token cannot be negative")
332+
}
333+
if c.PrefillTimeStdDev < 0 {
334+
return errors.New("prefill time standard deviation cannot be negative")
335+
}
336+
337+
if c.KVCacheTransferTimePerToken < 0 {
338+
return errors.New("kv-cache transfer time per token cannot be negative")
339+
}
340+
if c.KVCacheTransferTimeStdDev < 0 {
341+
return errors.New("kv-cache transfer time standard deviation cannot be negative")
342+
}
343+
310344
if c.KVCacheTransferLatency < 0 {
311345
return errors.New("kv-cache transfer time cannot be negative")
312346
}
@@ -316,6 +350,7 @@ func (c *Configuration) validate() error {
316350
if float32(c.KVCacheTransferLatencyStdDev) > 0.3*float32(c.KVCacheTransferLatency) {
317351
return errors.New("kv-cache transfer standard deviation cannot be more than 30% of kv-cache transfer")
318352
}
353+
319354
if c.MaxLoras < 1 {
320355
return errors.New("max LoRAs cannot be less than 1")
321356
}
@@ -433,6 +468,13 @@ func ParseCommandParamsAndLoadConfig() (*Configuration, error) {
433468
f.StringVar(&config.Mode, "mode", config.Mode, "Simulator mode: echo - returns the same text that was sent in the request, for chat completion returns the last message; random - returns random sentence from a bank of pre-defined sentences")
434469
f.IntVar(&config.InterTokenLatency, "inter-token-latency", config.InterTokenLatency, "Time to generate one token (in milliseconds)")
435470
f.IntVar(&config.TimeToFirstToken, "time-to-first-token", config.TimeToFirstToken, "Time to first token (in milliseconds)")
471+
472+
f.IntVar(&config.PrefillOverhead, "prefill-overhead", config.PrefillOverhead, "Constant overhead time for prefill (in milliseconds). This argument is ignored if <time-to-first-token> is not 0.")
473+
f.IntVar(&config.PrefillTimePerToken, "prefill-time-per-token", config.PrefillTimePerToken, "Time to prefill per token (in milliseconds)")
474+
f.IntVar(&config.PrefillTimeStdDev, "prefill-time-std-dev", config.PrefillTimeStdDev, "Standard deviation for time to prefill (in milliseconds)")
475+
f.IntVar(&config.KVCacheTransferTimePerToken, "kv-cache-transfer-time-per-token", config.KVCacheTransferTimePerToken, "Time for KV-cache transfer per token from a remote vLLM (in milliseconds)")
476+
f.IntVar(&config.KVCacheTransferTimeStdDev, "kv-cache-transfer-time-std-dev", config.KVCacheTransferTimeStdDev, "Standard deviation for time for KV-cache transfer from a remote vLLM (in milliseconds)")
477+
436478
f.IntVar(&config.KVCacheTransferLatency, "kv-cache-transfer-latency", config.KVCacheTransferLatency, "Time for KV-cache transfer from a remote vLLM (in milliseconds)")
437479
f.IntVar(&config.InterTokenLatencyStdDev, "inter-token-latency-std-dev", config.InterTokenLatencyStdDev, "Standard deviation for time between generated tokens (in milliseconds)")
438480
f.IntVar(&config.TimeToFirstTokenStdDev, "time-to-first-token-std-dev", config.TimeToFirstTokenStdDev, "Standard deviation for time before the first token will be returned (in milliseconds)")
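
The non-negativity checks added to `validate()` above all follow one pattern, which could also be expressed as a table. The following is a minimal sketch under that assumption, not the code from this commit: `TokenTimingConfig` and its `Validate` method are hypothetical stand-ins for the five new `Configuration` fields, and the error wording is illustrative.

```go
package main

import "fmt"

// TokenTimingConfig is a hypothetical stand-in for the five fields this
// commit adds to Configuration; all values are in milliseconds.
type TokenTimingConfig struct {
	PrefillOverhead             int
	PrefillTimePerToken         int
	PrefillTimeStdDev           int
	KVCacheTransferTimePerToken int
	KVCacheTransferTimeStdDev   int
}

// Validate returns an error naming the first option whose value is negative,
// or nil when every value is zero or positive.
func (c TokenTimingConfig) Validate() error {
	checks := []struct {
		name  string
		value int
	}{
		{"prefill-overhead", c.PrefillOverhead},
		{"prefill-time-per-token", c.PrefillTimePerToken},
		{"prefill-time-std-dev", c.PrefillTimeStdDev},
		{"kv-cache-transfer-time-per-token", c.KVCacheTransferTimePerToken},
		{"kv-cache-transfer-time-std-dev", c.KVCacheTransferTimeStdDev},
	}
	for _, chk := range checks {
		if chk.value < 0 {
			return fmt.Errorf("%s cannot be negative", chk.name)
		}
	}
	return nil
}
```

A table like this keeps the option name and the check together, at the cost of less specific error messages than the hand-written checks in the diff.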

pkg/common/config_test.go

Lines changed: 25 additions & 0 deletions
@@ -401,6 +401,31 @@ var _ = Describe("Simulator configuration", func() {
401401
name: "invalid (negative) zmq-max-connect-attempts for config file",
402402
args: []string{"cmd", "--config", "../../manifests/invalid-config.yaml"},
403403
},
404+
{
405+
name: "invalid (negative) prefill-overhead",
406+
args: []string{"cmd", "--prefill-overhead", "-1",
407+
"--config", "../../manifests/config.yaml"},
408+
},
409+
{
410+
name: "invalid (negative) prefill-time-per-token",
411+
args: []string{"cmd", "--prefill-time-per-token", "-1",
412+
"--config", "../../manifests/config.yaml"},
413+
},
414+
{
415+
name: "invalid (negative) prefill-time-std-dev",
416+
args: []string{"cmd", "--prefill-time-std-dev", "-1",
417+
"--config", "../../manifests/config.yaml"},
418+
},
419+
{
420+
name: "invalid (negative) kv-cache-transfer-time-per-token",
421+
args: []string{"cmd", "--kv-cache-transfer-time-per-token", "-1",
422+
"--config", "../../manifests/config.yaml"},
423+
},
424+
{
425+
name: "invalid (negative) kv-cache-transfer-time-std-dev",
426+
args: []string{"cmd", "--kv-cache-transfer-time-std-dev", "-1",
427+
"--config", "../../manifests/config.yaml"},
428+
},
404429
}
405430

406431
for _, test := range invalidTests {

pkg/llm-d-inference-sim/simulator.go

Lines changed: 19 additions & 9 deletions
@@ -495,7 +495,7 @@ func (s *VllmSimulator) reqProcessingWorker(ctx context.Context, id int) {
495495
model: displayModel,
496496
doRemotePrefill: req.IsDoRemotePrefill(),
497497
},
498-
responseTokens, toolCalls, finishReason, usageDataToSend,
498+
usageDataToSend.PromptTokens, responseTokens, toolCalls, finishReason, usageDataToSend,
499499
)
500500
} else {
501501
if req.IsDoRemoteDecode() {
@@ -646,8 +646,9 @@ func (s *VllmSimulator) sendResponse(isChatCompletion bool, ctx *fasthttp.Reques
646646
}
647647

648648
// calculate how long to wait before returning the response, time is based on number of tokens
649-
numOfTokens := usageData.CompletionTokens
650-
totalMillisToWait := s.getTimeToFirstToken(doRemotePrefill) + s.getTotalInterTokenLatency(numOfTokens)
649+
nPromptTokens := usageData.PromptTokens
650+
nGenTokens := usageData.CompletionTokens
651+
totalMillisToWait := s.getTimeToFirstToken(nPromptTokens, doRemotePrefill) + s.getTotalInterTokenLatency(nGenTokens)
651652
time.Sleep(time.Duration(totalMillisToWait) * time.Millisecond)
652653

653654
ctx.Response.Header.SetContentType("application/json")
@@ -665,14 +666,23 @@ func (s *VllmSimulator) sendResponse(isChatCompletion bool, ctx *fasthttp.Reques
665666
}
666667

667668
// returns time to first token based on the current request's doRemotePrefill
668-
func (s *VllmSimulator) getTimeToFirstToken(doRemotePrefill bool) int {
669-
mean := float64(s.config.TimeToFirstToken)
670-
stddev := float64(s.config.TimeToFirstTokenStdDev)
669+
func (s *VllmSimulator) getTimeToFirstToken(nPromptTokens int, doRemotePrefill bool) int {
671670
if doRemotePrefill {
672-
mean = float64(s.config.KVCacheTransferLatency)
673-
stddev = float64(s.config.KVCacheTransferLatencyStdDev)
671+
if s.config.KVCacheTransferLatency == 0 && s.config.KVCacheTransferLatencyStdDev == 0 {
672+
// is disaggregated PD and ttft is calculated using number of prompt tokens
673+
kvCacheTransT := s.config.KVCacheTransferTimePerToken * nPromptTokens
674+
return int(common.RandomNorm(float64(kvCacheTransT), float64(s.config.KVCacheTransferTimeStdDev)))
675+
}
676+
// is disaggregated PD and *not* using number of prompt tokens
677+
return int(common.RandomNorm(float64(s.config.KVCacheTransferLatency), float64(s.config.KVCacheTransferLatencyStdDev)))
674678
}
675-
return int(common.RandomNorm(mean, stddev))
679+
if s.config.TimeToFirstToken == 0 && s.config.TimeToFirstTokenStdDev == 0 {
680+
// is aggregated PD and ttft is calculated using number of prompt tokens
681+
prefillTime := s.config.PrefillOverhead + nPromptTokens*s.config.PrefillTimePerToken
682+
return int(common.RandomNorm(float64(prefillTime), float64(s.config.PrefillTimeStdDev)))
683+
}
684+
// is aggregated PD and *not* using number of prompt tokens
685+
return int(common.RandomNorm(float64(s.config.TimeToFirstToken), float64(s.config.TimeToFirstTokenStdDev)))
676686
}
677687

678688
// returns inter token latency
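
The four branches of the new `getTimeToFirstToken` can be read as choosing a (mean, standard deviation) pair and then sampling once. The sketch below restates that selection as a standalone function. It assumes a pluggable `Sampler` in place of `common.RandomNorm`, whose behavior is not shown in this diff; `TTFTConfig`, `SelectTTFT`, `Sampler`, and `NoNoise` are hypothetical names.

```go
package main

// Sampler draws a value around mean with the given standard deviation.
// It stands in for common.RandomNorm, which this diff does not show.
type Sampler func(mean, stddev float64) float64

// TTFTConfig is a hypothetical subset of Configuration, in milliseconds.
type TTFTConfig struct {
	TimeToFirstToken, TimeToFirstTokenStdDev                int
	KVCacheTransferLatency, KVCacheTransferLatencyStdDev    int
	PrefillOverhead, PrefillTimePerToken, PrefillTimeStdDev int
	KVCacheTransferTimePerToken, KVCacheTransferTimeStdDev  int
}

// SelectTTFT picks the mean and standard deviation the same way the diff does:
// the fixed-latency options win unless both the latency and its std-dev are zero.
func SelectTTFT(c TTFTConfig, nPromptTokens int, doRemotePrefill bool, sample Sampler) int {
	var mean, stddev int
	switch {
	case doRemotePrefill && c.KVCacheTransferLatency == 0 && c.KVCacheTransferLatencyStdDev == 0:
		// disaggregated P/D, transfer time scales with the prompt length
		mean, stddev = c.KVCacheTransferTimePerToken*nPromptTokens, c.KVCacheTransferTimeStdDev
	case doRemotePrefill:
		// disaggregated P/D, fixed transfer latency
		mean, stddev = c.KVCacheTransferLatency, c.KVCacheTransferLatencyStdDev
	case c.TimeToFirstToken == 0 && c.TimeToFirstTokenStdDev == 0:
		// aggregated P/D, prefill time scales with the prompt length
		mean, stddev = c.PrefillOverhead+nPromptTokens*c.PrefillTimePerToken, c.PrefillTimeStdDev
	default:
		// aggregated P/D, fixed time to first token
		mean, stddev = c.TimeToFirstToken, c.TimeToFirstTokenStdDev
	}
	return int(sample(float64(mean), float64(stddev)))
}

// NoNoise is a deterministic Sampler that ignores the standard deviation.
func NoNoise(mean, _ float64) float64 { return mean }
```

Passing the sampler as a parameter is a choice made for this sketch so that the selection logic can be checked without randomness; the commit itself calls `common.RandomNorm` directly.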

pkg/llm-d-inference-sim/simulator_test.go

Lines changed: 100 additions & 1 deletion
@@ -807,7 +807,7 @@ var _ = Describe("Simulator", func() {
807807
simulator.config.TimeToFirstTokenStdDev = timeToFirstTokenStdDev
808808
simulator.config.KVCacheTransferLatency = kvCacheLatency
809809
simulator.config.KVCacheTransferLatencyStdDev = kvCacheLatencyStdDev
810-
timeToFirst := simulator.getTimeToFirstToken(doREmotePrefill)
810+
timeToFirst := simulator.getTimeToFirstToken(1, doREmotePrefill)
811811
if doREmotePrefill {
812812
Expect(timeToFirst).To(BeNumerically(">=", int(float32(kvCacheLatency)*0.3)))
813813
Expect(timeToFirst).To(BeNumerically("<=", int(float32(kvCacheLatency)*1.7)))
@@ -828,5 +828,104 @@ var _ = Describe("Simulator", func() {
828828
Entry(nil, 10000, 0, 1000, 0, true),
829829
Entry(nil, 10000, 0, 1000, 0, false),
830830
)
831+
832+
It("when <time-to-first-token> is not 0, ignore <prefill-overhead>", func() {
833+
timeToFirstToken := 1000
834+
simulator.config.TimeToFirstToken = timeToFirstToken
835+
simulator.config.TimeToFirstTokenStdDev = 0
836+
837+
simulator.config.PrefillOverhead = 100
838+
simulator.config.PrefillTimePerToken = 200
839+
simulator.config.PrefillTimeStdDev = 80
840+
841+
ttft := simulator.getTimeToFirstToken(128, false)
842+
843+
Expect(ttft).To(BeNumerically("==", timeToFirstToken))
844+
})
845+
846+
It("when <time-to-first-token> is 0, and <prefill-overhead> is not 0, use <prefill-overhead>", func() {
847+
simulator.config.TimeToFirstToken = 0
848+
simulator.config.TimeToFirstTokenStdDev = 0
849+
850+
simulator.config.PrefillOverhead = 100
851+
simulator.config.PrefillTimePerToken = 200
852+
simulator.config.PrefillTimeStdDev = 80
853+
854+
ttft := simulator.getTimeToFirstToken(128, false)
855+
Expect(ttft).NotTo(BeNumerically("==", 0))
856+
})
857+
858+
DescribeTable("time to first token against number of prompt tokens",
859+
func(prefillOverhead int, prefillTimePerToken int, stdDev int, nTokens int) {
860+
simulator.config.TimeToFirstToken = 0
861+
simulator.config.PrefillOverhead = prefillOverhead
862+
simulator.config.PrefillTimePerToken = prefillTimePerToken
863+
simulator.config.PrefillTimeStdDev = stdDev
864+
865+
ttft := simulator.getTimeToFirstToken(nTokens, false)
866+
867+
expectedTTFT := prefillOverhead + prefillTimePerToken*nTokens
868+
Expect(ttft).To(BeNumerically(">=", int(float64(expectedTTFT)*0.3)))
869+
Expect(ttft).To(BeNumerically("<=", int(float64(expectedTTFT)*1.7)))
870+
871+
},
872+
func(prefillOverhead int, prefillTimePerToken, stdDev int, nTokens int) string {
873+
return fmt.Sprintf("prefillOverhead: %d, prefillTimePerToken: %d, stdDev: %d, nTokens: %d",
874+
prefillOverhead, prefillTimePerToken, stdDev, nTokens)
875+
},
876+
Entry("single token", 100, 50, 70, 1),
877+
Entry("stddev is 0", 100, 50, 0, 1),
878+
Entry("medium overhead, 512 tokens", 200, 1000, 150, 512),
879+
Entry("large overhead, 1024 tokens", 2000, 3000, 1800, 1024),
880+
Entry("very long prompt", 150, 200, 100, 20000),
881+
)
882+
883+
It("when <kv-cache-transfer-latency> is not 0, ignore <kv-cache-transfer-time-per-token>", func() {
884+
simulator.config.KVCacheTransferLatency = 200
885+
simulator.config.KVCacheTransferLatencyStdDev = 0
886+
887+
simulator.config.KVCacheTransferTimePerToken = 100
888+
simulator.config.KVCacheTransferTimeStdDev = 0
889+
890+
ttft := simulator.getTimeToFirstToken(128, true)
891+
Expect(ttft).To(BeNumerically("==", 200))
892+
})
893+
894+
It("when <kv-cache-transfer-latency> is 0, and <kv-cache-transfer-time-per-token> is not 0, use <kv-cache-transfer-time-per-token>", func() {
895+
simulator.config.KVCacheTransferLatency = 0
896+
simulator.config.KVCacheTransferLatencyStdDev = 0
897+
898+
simulator.config.KVCacheTransferTimePerToken = 100
899+
simulator.config.KVCacheTransferTimeStdDev = 0
900+
901+
ttft := simulator.getTimeToFirstToken(128, true)
902+
Expect(ttft).To(BeNumerically("==", 12800))
903+
})
904+
905+
DescribeTable("kv cache transfer time against number of prompt tokens",
906+
func(kvCacheTransTPT int, stddev int, nTokens int) {
907+
simulator.config.TimeToFirstToken = 0
908+
simulator.config.PrefillOverhead = 1
909+
simulator.config.KVCacheTransferTimePerToken = kvCacheTransTPT
910+
simulator.config.KVCacheTransferTimeStdDev = stddev
911+
912+
ttft := simulator.getTimeToFirstToken(nTokens, true)
913+
914+
expectedTTFT := kvCacheTransTPT * nTokens
915+
Expect(ttft).To(BeNumerically(">=", int(float64(expectedTTFT)*0.3)))
916+
Expect(ttft).To(BeNumerically("<=", int(float64(expectedTTFT)*1.7)))
917+
918+
},
919+
func(kvCacheTransferTimePerToken int, stddev int, nTokens int) string {
920+
return fmt.Sprintf("kvCacheTransferTimePerToken: %d stddev: %d nTokens: %d",
921+
kvCacheTransferTimePerToken, stddev, nTokens)
922+
},
923+
Entry("single token", 100, 70, 1),
924+
Entry("stddev is 0", 100, 0, 1),
925+
Entry("medium overhead, 512 tokens", 200, 150, 512),
926+
Entry("large overhead, 1024 tokens", 2000, 1800, 1024),
927+
Entry("very long prompt", 150, 100, 20000),
928+
)
929+
831930
})
832931
})
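
The table tests above assert that the sampled value stays between 30% and 170% of the expected mean, even when the configured standard deviation is large. One way such a guarantee could hold is by clamping a normal draw, as in the sketch below. This is an assumption made for illustration and may not match how `common.RandomNorm` works; `NewRNG`, `ClampedNorm`, and `WithinBounds` are hypothetical helpers.

```go
package main

import "math/rand"

// NewRNG returns a seeded generator so that runs are reproducible.
func NewRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// ClampedNorm draws from a normal distribution with the given mean and
// standard deviation, then clamps the result to [0.3*mean, 1.7*mean].
// It assumes mean is non-negative.
func ClampedNorm(rng *rand.Rand, mean, stddev float64) float64 {
	v := rng.NormFloat64()*stddev + mean
	lo, hi := 0.3*mean, 1.7*mean
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// WithinBounds repeats the range check used by the table tests.
func WithinBounds(ttft, expected int) bool {
	return ttft >= int(float64(expected)*0.3) && ttft <= int(float64(expected)*1.7)
}
```

With a clamp like this, `WithinBounds(int(ClampedNorm(rng, m, s)), m)` holds for any non-negative integer mean `m` and any `s`, because truncation to `int` preserves the ordering of non-negative values.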
