Hi,
I've noticed that the provided pred.py script is only compatible with chat models. If I wanted to use a non-chat model (for example, Llama2-7b instead of Llama2-7b-chat), how would I do this? I added llama2 to the JSON files in the config folder and attempted to modify the generation code from:
completion = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    temperature=temperature,
    max_tokens=max_new_tokens,
    stream=False,
)
return completion.choices[0].message.content
to:
completion = client.completions.create(
    model=model,
    prompt=prompt,
    temperature=temperature,
    max_tokens=max_new_tokens,
    stream=False,
)
return completion.choices[0].text
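For reference, this is a self-contained version of that modified call (just a sketch; the base URL, dummy API key, and the query_completion wrapper are my own placeholders for my local setup, not part of pred.py):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible server; the URL and dummy key are placeholders for my local setup
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def query_completion(model, prompt, temperature=0.1, max_new_tokens=128):
    # /v1/completions takes the raw prompt string instead of a chat messages list
    completion = client.completions.create(
        model=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=max_new_tokens,
        stream=False,
    )
    return completion.choices[0].text

print(query_completion("llama-2-7b-hf", "The capital of France is"))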
However, the model's predictions are always 'null', even though vLLM raises no error. For example:
INFO 08-06 04:46:11 [async_llm.py:269] Added request cmpl-ba91b440856f49ec9aa181e79e9dbb21-0.
INFO: 127.0.0.1:47438 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 08-06 04:46:11 [logger.py:41] Received request cmpl-cab8ab6cf4bb4fe180efb27e5dfb59bf-0: prompt: 'those corridors and he could have killed Frank without realising he’d got the wrong man. As it happens, we only have Derek’s word for it that Stefan ever went into the room.\n</text>\n\nWhat is the correct answer to this question: Please try to deduce the true story based on the evidence currently known. Who murdered Frank Parris in your deduction?\nChoices:\n(A) Aiden MacNeil\n(B) Martin Williams\n(C) Stefan Codrescu\n(D) Lisa Treherne\n\nFormat your response as follows: "The correct answer is (insert answer here)".', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=0.9, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [1, 386, 852, 1034, 2429, 943, 322, 540, 1033, 505, 9445, 4976, 1728, 1855, 5921, 540, 30010, 29881, 2355, 278, 2743, 767, 29889, 1094, 372, 5930, 29892, 591, 871, 505, 360, 20400, 30010, 29879, 1734, 363, 372, 393, 21512, 3926, 3512, 964, 278, 5716, 29889, 13, 829, 726, 29958, 13, 13, 5618, 338, 278, 1959, 1234, 304, 445, 1139, 29901, 3529, 1018, 304, 21049, 346, 278, 1565, 5828, 2729, 373, 278, 10757, 5279, 2998, 29889, 11644, 13406, 287, 4976, 1459, 3780, 297, 596, 21049, 428, 29973, 13, 15954, 1575, 29901, 13, 29898, 29909, 29897, 319, 3615, 4326, 8139, 309, 13, 29898, 29933, 29897, 6502, 11648, 13, 29898, 29907, 29897, 21512, 315, 397, 690, 4979, 13, 29898, 29928, 29897, 29420, 6479, 2276, 484, 13, 13, 5809, 596, 2933, 408, 4477, 29901, 376, 1576, 1959, 1234, 338, 313, 7851, 1234, 1244, 29897, 1642], prompt_embeds shape: None, lora_request: None.
I run the following commands:
vllm serve llama-2-7b-hf \
--max-model-len 65536 \
--gpu-memory-utilization 0.98
python3 pred.py --model llama-2-7b-hf
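One way I'm considering to check whether the server itself returns empty text, rather than pred.py dropping it, is to query the completions endpoint directly and dump the raw choice. A rough sketch (assuming the server runs on the default port 8000 with the same model name):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="llama-2-7b-hf",
    prompt="The correct answer is (",
    temperature=0.1,
    max_tokens=128,
    stream=False,
)
# Dump the whole choice: an empty `text` field here would mean the server
# itself generated nothing; a non-empty one would point at pred.py's post-processing.
print(completion.choices[0].model_dump())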