Open
Labels
bug (Something isn't working)
Description
OS
Linux
GPU Library
CUDA 12.x
Python version
3.11
Pytorch version
2.8.0 - cuda 128
Model
TinyLlama-1.1B-intermediate
Describe the bug
I tried the library with the TinyLlama-1.1B-intermediate EXL2 4.0 bpw model.
My hardware is the following:
CPU: Intel i9, 24 cores
GPU: RTX 5090 24GB laptop
RAM: 64GB
The maximum speed I achieved is around 350-370 tokens/s, while your tests report 602 t/s | 700 t/s (roughly double), achieved with GPUs that are certainly less powerful than mine.
Maybe there are some wrong settings in my code?
Thank you for any suggestions.
Below is the code I used for the test:
Reproduction steps
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob, ExLlamaV2Sampler
import torch
import time
from enum import Enum
import argparse
import readline
class PromptType(Enum):
    NORMAL = "Normal"
    TEKKEN3 = "Tekken3"
    CHAT = "Chat"
general_instructions = """
You are a JSON-only AI assistant for a medical clinic. Process the input JSON and respond with valid JSON only.
Extract entities from UserMessage: Nome (first name), Cognome (last name), CognomeDottore (doctor last name).
Detect intent from AvailableIntents.
Generate a natural Italian response in AssistantResponse.
Set NextPathName to empty string if IsStepCompleted = true, otherwise choose from AvailablePaths based on DetectedIntentName.
Output JSON format:
{
"AssistantResponse": "natural reply in Italian",
"CurrentPathName": "copy from input",
"CurrentStepName": "copy from StepToExecute.StepName",
"DetectedIntentName": "from AvailableIntents",
"Language": "it",
"ExtractedEntities": [{"EntityName": "...", "EntityValue": "..."}],
"IsStepCompleted": true/false,
"NextPathName": "from AvailablePaths if applicable",
"FunctionCall": {"FunctionName": "", "Parameters": []}
}
Example:
Input UserMessage: "Buongiorno, sono Amedeo Garibaldi, ho telefonato per chiedere se il dottor Filippucci è in clinica oggi."
Output:
{
"AssistantResponse": "Buongiorno Sig. Rossi, il dottor. Filippucci è presente in studio oggi. Come posso aiutarla?",
"CurrentPathName": "ConversationStart",
"CurrentStepName": "UserIdentification",
"DetectedIntentName": "Informazioni",
"Language": "it",
"ExtractedEntities": [
{"EntityName": "Nome", "EntityValue": "Amedeo"},
{"EntityName": "Cognome", "EntityValue": "Garibaldi"},
{"EntityName": "CognomeDottore", "EntityValue": "Filippucci"}
],
"IsStepCompleted": true,
"NextPathName": "",
"FunctionCall": {"FunctionName": "", "Parameters": []}
}
"""
user_input = """
{
"CurrentPathName": "ConversationStart",
"StepToExecute": {
"StepNumber": 0,
"StepName": "UserIdentification",
"PathName": "StartPath",
"StepInstructions": "Identify user intent and extract mandatory entities: Nome, Cognome, CognomeDottore",
"IsLatestStep": true,
"FunctionToUse": "",
"MandatoryStepEntities": [
{"EntityName": "Nome", "EntityType": "String", "Mandatory": true},
{"EntityName": "Cognome", "EntityType": "String", "Mandatory": true},
{"EntityName": "CognomeDottore", "EntityType": "String", "Mandatory": true}
],
"ResponseExamples": [
"Buongiorno, come posso aiutarla oggi?",
"Si, Il dottor {CognomeDottore} è in studio oggi, come posso aiutarla?",
""No, il dottot {CognomeDottore} non riceve oggi, lo trova domani!
],
"SkipCondition": "",
"AlternativePathsNames": []
},
"Context": {"CollectedEntities": [], "Error": "", "StepStatus": 0},
"EntitiesToCollect": [
{"EntityName": "Nome", "EntityType": "String", "Mandatory": true},
{"EntityName": "Cognome", "EntityType": "String", "Mandatory": true},
{"EntityName": "CognomeDottore", "EntityType": "String", "Mandatory": true}
],
"AvailableIntents": [
{"IntentName": "Informazioni", "Description": "User wants information"},
{"IntentName": "Prenotare", "Description": "User wants to book a visit"},
{"IntentName": "Anullare", "Description": "User wants to cancel a visit"}
],
"AvailableFunctions": [],
"AvailablePaths": [
],
"UserMessage": "Buongiorno, sono Amedeo Garibaldi, ho telefonato per chiedere se il dermatologo Filippucci può ricevere oggi.",
"FunctionResult": ""
}
"""
torch.set_num_threads(24)
# Configuration
model_directory = "TinyLlama-1.1B-intermediate_output/full" # Path to the quantized model directory
# Argument parser
parser = argparse.ArgumentParser()
parser.add_argument('--prompt_type', choices=[e.value for e in PromptType], default='Chat')
args = parser.parse_args()
prompt_type = PromptType(args.prompt_type)
# Load model
config = ExLlamaV2Config(model_directory)
config.attention_mode = "flash" # Enable flash attention for speed
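# NB: "attention_mode" is not, as far as I can tell, a documented ExLlamaV2Config
# field; exllamav2 normally enables flash-attn automatically when the package is
# installed (the opt-out flag is config.no_flash_attn), so this line may be a no-op.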
config.max_seq_len = 4096 # Increase context length
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
cache.quantize = True # Enable cache quantization
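# NB: likewise, cache.quantize does not appear to be a documented switch; quantized
# K/V caches in exllamav2 are separate classes such as ExLlamaV2Cache_Q4 (e.g.
# cache = ExLlamaV2Cache_Q4(model, lazy = True)), and they save VRAM rather than time.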
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
# Pre-encode the static prompt (optimization for repeated use)
if prompt_type == PromptType.NORMAL:
    inference_prompt = general_instructions
elif prompt_type == PromptType.TEKKEN3:
    inference_prompt = f"[INST]{general_instructions}[/INST]"
elif prompt_type == PromptType.CHAT:
    inference_prompt = "<|im_start|>system\n" + general_instructions + "<|im_end|>\n<|im_start|>user\n" + user_input + "<|im_end|>\n<|im_start|>assistant\n"
encoded_prompt = tokenizer.encode(inference_prompt, add_bos = True)
input_token_count = encoded_prompt.shape[1] if len(encoded_prompt.shape) > 1 else encoded_prompt.shape[0]
# Create generator
generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
    max_batch_size = 256,
    max_seq_len = 1024,
    max_sampling_threads = 24,
    min_sampling_threads = 24,
    filter_background_eval = False,
    paged = False
)
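# NB: with paged = False the dynamic generator falls back (as far as I understand)
# to an unbatched compatibility mode, so max_batch_size = 256 has no effect here.
# Also, max_seq_len = 1024 is smaller than my ~1116-token prompt plus 512 new
# tokens, which looks suspicious on its own.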
# Generation settings (optimized for multilingual speed-quality balance)
gen_settings = ExLlamaV2Sampler.Settings(
    token_repetition_penalty = 1.05,
    temperature = 0.1,  # Low temperature for quality while maintaining speed
    top_k = 50,         # Allow diverse token selection for multilingual
    top_p = 0.9,
)
# Interactive loop
while True:
    prompt = input("Enter prompt (or 'quit' to exit): ")
    if prompt.lower() == 'quit':
        break
    start_time = time.time()
    # Create a job from the pre-encoded prompt
    job = ExLlamaV2DynamicJob(
        input_ids = encoded_prompt,
        max_new_tokens = 512,
        stop_conditions = [tokenizer.eos_token_id],
        gen_settings = gen_settings
    )
    generator.enqueue(job)
    generated_text = ""
    output_token_count = 0
    eos = False
    # Streaming loop
    while not eos:
        results = generator.iterate()
        for result in results:
            if result["stage"] == "streaming":
                eos = result["eos"]
                if "text" in result:
                    generated_text += result["text"]
                if "token_ids" in result:
                    output_token_count += len(result["token_ids"][0])
    end_time = time.time()
    print(generated_text.replace("<|im_end|>", ""))
    # Calculate metrics from the streamed token counts
    total_tokens = output_token_count
    total_time = end_time - start_time
    tps = total_tokens / total_time if total_time > 0 else 0
    print("\nMetrics:")
    print(f"Total input tokens: {input_token_count}")
    print(f"Total tokens generated: {total_tokens}")
    print(f"Total time: {total_time:.2f} seconds")
    print(f"Output tokens per second: {tps:.2f}")
    print()
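For what it's worth, my understanding is that the published 600-700 t/s figures are aggregate throughput from batched generation with the paged dynamic generator, not a single sequential stream. Below is a minimal sketch of a batched measurement, replacing the cache/generator setup above; batch_size and the cache sizing are my own illustrative choices, and generate(..., completion_only = True) assumes a recent exllamav2 with flash-attn installed:

# Sketch: batched aggregate-throughput test (replaces the Cache/Generator setup above)
batch_size = 8
cache = ExLlamaV2Cache(model, max_seq_len = batch_size * 2048, lazy = True)
model.load_autosplit(cache)
generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
    max_batch_size = batch_size,
)  # paged mode is the default when flash-attn is available
prompts = [inference_prompt] * batch_size
start = time.time()
outputs = generator.generate(
    prompt = prompts,
    max_new_tokens = 512,
    gen_settings = gen_settings,
    add_bos = True,
    completion_only = True,  # assumes a recent exllamav2 version
)
elapsed = time.time() - start
generated = sum(tokenizer.encode(o).shape[-1] for o in outputs)
print(f"{generated} tokens in {elapsed:.2f} s -> {generated / elapsed:.1f} t/s aggregate")

Even if my real workload is a single stream, this would at least show whether the GPU can reach the advertised aggregate numbers.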
Expected behavior
A speed in the 600-700 tokens/s range.
Logs
{
"CurrentPathName": "ConversationStart",
"StepToExecute": {
"StepNumber": 1,
"StepName": "UserIdentification",
"PathName": "StartPath",
"StepInstructions": "Identify user intent and extract mandatory entities: Nome, Cognome, CognomeDottore",
"IsLatestStep": true,
"FunctionToUse": "",
"MandatoryStepEntities": [
{"EntityName": "Nome", "EntityType": "String", "Mandatory": true},
{"EntityName": "Cognome", "EntityType": "String", "Mandatory": true},
{"EntityName": "CognomeDottore", "EntityType": "String", "Mandatory": true}
],
"ResponseExamples": [
"Buongiorno, come posso aiutarla oggi?",
"Si, Il dottor {CognomeDottore} è in studio oggi, come posso aiutarla?",
""No, il dottot {CognomeDottore} non riceve oggi, lo trova domani!
],
"SkipCondition": "",
"AlternativePathsNames": []
},
"Context": {"CollectedEntities": [], "Error": "", "StepStatus": 0},
"EntitiesToCollect": [
{"EntityName": "Nome", "EntityType": "String", "Mandatory": true},
{"EntityName": "Cognome", "EntityType": "String", "Mandatory": true},
{"EntityName": "CognomeDottore", "EntityType": "String", "Mandatory": true}
],
"AvailableIntents": [
{"IntentName": "Informazioni", "Description": "User wants information"},
{"IntentName": "Prenotare", "Description": "User wants to book a visit"},
{"IntentName": "Anullare", "Description": "User wants to cancel a visit"}
],
"AvailableFunctions": [],
"AvailablePaths": [
Metrics:
Total input tokens: 1116
Total tokens generated: 512
Total time: 1.57 seconds
Output tokens per second: 326.84
Additional context
No response
Acknowledgements
- I have looked for similar issues before submitting this one.
- I understand that the developers have lives and my issue will be answered when possible.
- I understand the developers of this program are human, and I will ask my questions politely.