Adapting pred.py to non-chat models. #131

@DragonEntropy

Hi,

I've noticed that the provided pred.py script is only compatible with chat models. If I wanted to use a non-chat model (for example, Llama2-7b instead of Llama2-7b-chat), how would I go about it? I added llama2 to the JSON files in the config folder and attempted to modify the generation code from:

completion = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=temperature,
                max_tokens=max_new_tokens,
                stream=False
            )
            return completion.choices[0].message.content

to:

completion = client.completions.create(
                model=model,
                prompt=prompt,
                temperature=temperature,
                max_tokens=max_new_tokens,
                stream=False
            )
            return completion.choices[0].text
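
For context, a minimal self-contained version of that modified call looks roughly like the sketch below; the base URL, API key, and function name are placeholders for illustration and not taken from pred.py (vLLM's OpenAI-compatible server listens on port 8000 by default):

from openai import OpenAI

# Placeholder endpoint: the API key is unused by vLLM but required by the client.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

def query_completion(prompt, model="llama-2-7b-hf", temperature=0.1, max_new_tokens=128):
    # Text-completions endpoint: the raw prompt is sent as-is, with no chat
    # template applied server-side.
    completion = client.completions.create(
        model=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=max_new_tokens,
        stream=False,
    )
    return completion.choices[0].text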

However, the model's predictions are always 'null', yet vLLM raises no error. For example:

INFO 08-06 04:46:11 [async_llm.py:269] Added request cmpl-ba91b440856f49ec9aa181e79e9dbb21-0.
INFO:     127.0.0.1:47438 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 08-06 04:46:11 [logger.py:41] Received request cmpl-cab8ab6cf4bb4fe180efb27e5dfb59bf-0: prompt: 'those corridors and he could have killed Frank without realising he’d got the wrong man. As it happens, we only have Derek’s word for it that Stefan ever went into the room.\n</text>\n\nWhat is the correct answer to this question: Please try to deduce the true story based on the evidence currently known. Who murdered Frank Parris in your deduction?\nChoices:\n(A) Aiden MacNeil\n(B) Martin Williams\n(C) Stefan Codrescu\n(D) Lisa Treherne\n\nFormat your response as follows: "The correct answer is (insert answer here)".', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=0.9, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [1, 386, 852, 1034, 2429, 943, 322, 540, 1033, 505, 9445, 4976, 1728, 1855, 5921, 540, 30010, 29881, 2355, 278, 2743, 767, 29889, 1094, 372, 5930, 29892, 591, 871, 505, 360, 20400, 30010, 29879, 1734, 363, 372, 393, 21512, 3926, 3512, 964, 278, 5716, 29889, 13, 829, 726, 29958, 13, 13, 5618, 338, 278, 1959, 1234, 304, 445, 1139, 29901, 3529, 1018, 304, 21049, 346, 278, 1565, 5828, 2729, 373, 278, 10757, 5279, 2998, 29889, 11644, 13406, 287, 4976, 1459, 3780, 297, 596, 21049, 428, 29973, 13, 15954, 1575, 29901, 13, 29898, 29909, 29897, 319, 3615, 4326, 8139, 309, 13, 29898, 29933, 29897, 6502, 11648, 13, 29898, 29907, 29897, 21512, 315, 397, 690, 4979, 13, 29898, 29928, 29897, 29420, 6479, 2276, 484, 13, 13, 5809, 596, 2933, 408, 4477, 29901, 376, 1576, 1959, 1234, 338, 313, 7851, 1234, 1244, 29897, 1642], prompt_embeds shape: None, lora_request: None.

I run the following commands:

vllm serve llama-2-7b-hf \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.98

python3 pred.py --model llama-2-7b-hf
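
To rule out pred.py itself, the completions endpoint can also be queried directly; a minimal sketch is below, with the port and model name assumed from the vLLM defaults and the serve command above:

import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    json={
        "model": "llama-2-7b-hf",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0.0,
    },
)
print(resp.status_code)
# Inspect the raw 'text' and 'finish_reason' fields of the first choice.
print(resp.json()["choices"][0])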
