
[QUESTION] I cannot achieve token/s speed as in your tests even with better GPU #806

@piercarlo62

Description

OS

Linux

GPU Library

CUDA 12.x

Python version

3.11

Pytorch version

2.8.0 (cu128)

Model

TinyLlama-1.1B-intermediate

Describe the bug

I tried the library with the TinyLlama-1.1B-intermediate model, quantized to EXL2 at 4.0 bpw.

My hardware is the following:

CPU: Intel i9, 24 cores
GPU: RTX 5090 Laptop, 24 GB
RAM: 64 GB

The maximum speed I achieved is around 350-370 tokens/s, while your tests report 602-700 t/s (roughly double), achieved with GPUs that are certainly less powerful than mine.

Maybe there are some wrong settings in my code?

Thank you for any suggestions.

Below is the code I used for the test:

Reproduction steps

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob, ExLlamaV2Sampler
import torch
import time
from enum import Enum
import argparse
import readline
class PromptType(Enum):
    NORMAL = "Normal"
    TEKKEN3 = "Tekken3"
    CHAT = "Chat"


general_instructions = """
You are a JSON-only AI assistant for a medical clinic. Process the input JSON and respond with valid JSON only.

Extract entities from UserMessage: Nome (first name), Cognome (last name), CognomeDottore (doctor last name).
Detect intent from AvailableIntents.
Generate a natural Italian response in AssistantResponse.
Set NextPathName to empty string if IsStepCompleted = true, otherwise choose from AvailablePaths based on DetectedIntentName.

Output JSON format:
{
  "AssistantResponse": "natural reply in Italian",
  "CurrentPathName": "copy from input",
  "CurrentStepName": "copy from StepToExecute.StepName",
  "DetectedIntentName": "from AvailableIntents",
  "Language": "it",
  "ExtractedEntities": [{"EntityName": "...", "EntityValue": "..."}],
  "IsStepCompleted": true/false,
  "NextPathName": "from AvailablePaths if applicable",
  "FunctionCall": {"FunctionName": "", "Parameters": []}
}

Example:
Input UserMessage: "Buongiorno, sono Giovanni Rossi, ho telefonato per chiedere se il dottor Andreoli è in clinica oggi."
Output:
{
  "AssistantResponse": "Buongiorno Sig. Rossi, il dottor. Filippucci è presente in studio oggi. Come posso aiutarla?",
  "CurrentPathName": "ConversationStart",
  "CurrentStepName": "UserIdentification",
  "DetectedIntentName": "Informazioni",
  "Language": "it",
  "ExtractedEntities": [
    {"EntityName": "Nome", "EntityValue": "Amedeo"},
    {"EntityName": "Cognome", "EntityValue": "Garibaldi"},
    {"EntityName": "CognomeDottore", "EntityValue": "Filippucci"}
  ],
  "IsStepCompleted": true,
  "NextPathName": "",
  "FunctionCall": {"FunctionName": "", "Parameters": []}
}
"""

user_input = """
{
  "CurrentPathName": "ConversationStart",
  "StepToExecute": {
    "StepNumber": 0,
    "StepName": "UserIdentification",
    "PathName": "StartPath",
    "StepInstructions": "Identify user intent and extract mandatory entities: Nome, Cognome, CognomeDottore",
    "IsLatestStep": true,
    "FunctionToUse": "",
    "MandatoryStepEntities": [
      {"EntityName": "Nome", "EntityType": "String", "Mandatory": true},
      {"EntityName": "Cognome", "EntityType": "String", "Mandatory": true},
      {"EntityName": "CognomeDottore", "EntityType": "String", "Mandatory": true}
    ],
    "ResponseExamples": [
      "Buongiorno, come posso aiutarla oggi?",
      "Si, Il dottor {CognomeDottore} è in studio oggi, come posso aiutarla?",
      "No, il dottor {CognomeDottore} non riceve oggi, lo trova domani!"
    ],
    "SkipCondition": "",
    "AlternativePathsNames": []
  },
  "Context": {"CollectedEntities": [], "Error": "", "StepStatus": 0},
  "EntitiesToCollect": [
    {"EntityName": "Nome", "EntityType": "String", "Mandatory": true},
    {"EntityName": "Cognome", "EntityType": "String", "Mandatory": true},
    {"EntityName": "CognomeDottore", "EntityType": "String", "Mandatory": true}
  ],
  "AvailableIntents": [
    {"IntentName": "Informazioni", "Description": "User wants information"},
    {"IntentName": "Prenotare", "Description": "User wants to book a visit"},
    {"IntentName": "Anullare", "Description": "User wants to cancel a visit"}
  ],
  "AvailableFunctions": [],
  "AvailablePaths": [
  ],
  "UserMessage": "Buongiorno, sono Amedeo Garibaldi, ho telefonato per chiedere se il dermatologo Filippucci può ricevere oggi.",
  "FunctionResult": ""
}
"""

torch.set_num_threads(24)

# Configuration
model_directory = "TinyLlama-1.1B-intermediate_output/full"  # Path to the quantized model directory

# Argument parser
parser = argparse.ArgumentParser()
parser.add_argument('--prompt_type', choices=[e.value for e in PromptType], default='Chat')
args = parser.parse_args()
prompt_type = PromptType(args.prompt_type)

# Load model
config = ExLlamaV2Config(model_directory)
config.attention_mode = "flash"  # Enable flash attention for speed
config.max_seq_len = 4096  # Increase context length
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
cache.quantize = True  # Enable cache quantization
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
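# Note: whether "cache.quantize = True" above actually quantizes the cache is an
# assumption worth verifying; exllamav2 also provides dedicated quantized-cache
# classes (ExLlamaV2Cache_Q4 / Q6 / Q8). A sketch of that alternative, not used in
# this test:
#   from exllamav2 import ExLlamaV2Cache_Q4
#   cache = ExLlamaV2Cache_Q4(model, lazy=True)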

# Pre-encode the static prompt (optimization for repeated use)
if prompt_type == PromptType.NORMAL:
    inference_prompt = general_instructions
elif prompt_type == PromptType.TEKKEN3:
    inference_prompt = f"[INST]{general_instructions}[/INST]"
elif prompt_type == PromptType.CHAT:
    inference_prompt = "<|im_start|>system\n" + general_instructions + "<|im_end|>\n<|im_start|>user\n" + user_input + "<|im_end|>\n<|im_start|>assistant\n"
encoded_prompt = tokenizer.encode(inference_prompt, add_bos = True)
input_token_count = encoded_prompt.shape[1] if len(encoded_prompt.shape) > 1 else encoded_prompt.shape[0]

# Create generator
generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
    max_batch_size=256,
    max_seq_len=1024,
    max_sampling_threads=24,
    min_sampling_threads=24,
    filter_background_eval=False,
    paged=False
)

# Generation settings (optimized for multilingual speed-quality balance)
gen_settings = ExLlamaV2Sampler.Settings(
    token_repetition_penalty = 1.05,
    temperature = 0.1,  # Low temperature for quality while maintaining speed
    top_k = 50,  # Allow diverse token selection for multilingual
    top_p = 0.9,
)

# Interactive loop
while True:
    prompt = input("Enter prompt (or 'quit' to exit): ")
    if prompt.lower() == 'quit':
        break

    # Streaming loop
    start_time = time.time()

    # Use pre-encoded prompt
    # Create job
    job = ExLlamaV2DynamicJob(
        input_ids = encoded_prompt,
        max_new_tokens = 512,
        stop_conditions = [tokenizer.eos_token_id],
        gen_settings = gen_settings
    )
    generator.enqueue(job)

    generated_text = ""
    output_token_count = 0
    eos = False
                
    while not eos:
        results = generator.iterate()
        for result in results:
            if result["stage"] == "streaming":
                eos = result["eos"]
                if "text" in result:
                    generated_text += result["text"]
                if "token_ids" in result:
                    output_token_count += len(result["token_ids"][0])

    end_time = time.time()

    print(generated_text.replace("<|im_end|>",""))

    # Calculate metrics from the streamed token counts
    total_tokens = output_token_count 
    total_time = end_time - start_time
    tps = total_tokens / total_time if total_time > 0 else 0

    print("\nMetrics:")
    print(f"Total input tokens: {input_token_count}")
    print(f"Total tokens generated: {total_tokens}")
    print(f"Total time: {total_time:.2f} seconds")
    print(f"Output tokens per second: {tps:.2f}")
    print()
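
For comparison, here is a minimal sketch of the same streaming loop that only times the generation phase, starting the clock at the first streamed chunk so that prefilling the ~1100-token prompt is not counted against the generation rate. It reuses the generator, tokenizer, gen_settings and encoded_prompt created above, and it assumes the first "streaming" result is only emitted once the prompt has been ingested (worth verifying):

# Sketch: time only the generation phase (assumption: the first "streaming" result
# arrives after prompt ingestion has finished, so prefill is excluded).
job = ExLlamaV2DynamicJob(
    input_ids = encoded_prompt,
    max_new_tokens = 512,
    stop_conditions = [tokenizer.eos_token_id],
    gen_settings = gen_settings
)
generator.enqueue(job)

gen_start = None
gen_tokens = 0
eos = False
while not eos:
    for result in generator.iterate():
        if result["stage"] != "streaming":
            continue
        if gen_start is None:
            gen_start = time.time()   # start the clock at the first generated chunk
        eos = result["eos"]
        if "token_ids" in result:
            gen_tokens += len(result["token_ids"][0])

gen_time = time.time() - gen_start
print(f"Generation-only rate: {gen_tokens / gen_time:.2f} tokens/s (approximate)")

Another common check is to run one warm-up generation before timing, since the first pass can include one-off initialization costs.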

Expected behavior

Speed in the 600-700 tokens/s range.

Logs

{
  "CurrentPathName": "ConversationStart",
  "StepToExecute": {
    "StepNumber": 1,
    "StepName": "UserIdentification",
    "PathName": "StartPath",
    "StepInstructions": "Identify user intent and extract mandatory entities: Nome, Cognome, CognomeDottore",
    "IsLatestStep": true,
    "FunctionToUse": "",
    "MandatoryStepEntities": [
      {"EntityName": "Nome", "EntityType": "String", "Mandatory": true},
      {"EntityName": "Cognome", "EntityType": "String", "Mandatory": true},
      {"EntityName": "CognomeDottore", "EntityType": "String", "Mandatory": true}
    ],
    "ResponseExamples": [
      "Buongiorno, come posso aiutarla oggi?",
      "Si, Il dottor {CognomeDottore} è in studio oggi, come posso aiutarla?",
      ""No, il dottot {CognomeDottore} non riceve oggi, lo trova domani!
    ],
    "SkipCondition": "",
    "AlternativePathsNames": []
  },
  "Context": {"CollectedEntities": [], "Error": "", "StepStatus": 0},
  "EntitiesToCollect": [
    {"EntityName": "Nome", "EntityType": "String", "Mandatory": true},
    {"EntityName": "Cognome", "EntityType": "String", "Mandatory": true},
    {"EntityName": "CognomeDottore", "EntityType": "String", "Mandatory": true}
  ],
  "AvailableIntents": [
    {"IntentName": "Informazioni", "Description": "User wants information"},
    {"IntentName": "Prenotare", "Description": "User wants to book a visit"},
    {"IntentName": "Anullare", "Description": "User wants to cancel a visit"}
  ],
  "AvailableFunctions": [],
  "AvailablePaths": [
 

Metrics:
Total input tokens: 1116
Total tokens generated: 512
Total time: 1.57 seconds
Output tokens per second: 326.84
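
For illustration only, with an assumed (not measured) prefill time: because the 1.57 s above also covers ingesting the 1116-token prompt, the reported figure understates the pure generation rate. A quick back-of-the-envelope check:

# Back-of-the-envelope check; the 0.40 s prefill time is an ASSUMPTION for
# illustration, not a measured value from this run.
total_time = 1.57        # seconds, from the log above
generated = 512          # tokens generated, from the log above
assumed_prefill = 0.40   # hypothetical time spent ingesting the 1116-token prompt
print(generated / total_time)                       # ~326 t/s, as reported
print(generated / (total_time - assumed_prefill))   # ~438 t/s generation-only, under this assumption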

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
