Adapting pred.py to non-chat models. #131

@DragonEntropy

Hi,

I've noticed that the provided pred.py script is only compatible with chat models. If I wanted to use a non-chat model (for example, Llama2-7b instead of Llama2-7b-chat), how would I go about it? I added llama2 to the JSON files in the config folder and attempted to modify the generation code from:

completion = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=temperature,
                max_tokens=max_new_tokens,
                stream=False
            )
            return completion.choices[0].message.content

to:

completion = client.completions.create(
                model=model,
                prompt=prompt,
                temperature=temperature,
                max_tokens=max_new_tokens,
                stream=False
            )
            return completion.choices[0].text
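
For context, a minimal self-contained version of that modified call looks roughly like the sketch below; the base URL, API key, and function name are placeholders for illustration and not taken from pred.py (vLLM's OpenAI-compatible server listens on port 8000 by default):

from openai import OpenAI

# Placeholder endpoint: the API key is unused by vLLM but required by the client.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

def query_completion(prompt, model="llama-2-7b-hf", temperature=0.1, max_new_tokens=128):
    # Text-completions endpoint: the raw prompt is sent as-is, with no chat
    # template applied server-side.
    completion = client.completions.create(
        model=model,
        prompt=prompt,
        temperature=temperature,
        max_tokens=max_new_tokens,
        stream=False,
    )
    return completion.choices[0].text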

However, the model's predictions are always 'null', yet vLLM raises no error. For example:

INFO 08-06 04:46:11 [async_llm.py:269] Added request cmpl-ba91b440856f49ec9aa181e79e9dbb21-0.
INFO:     127.0.0.1:47438 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 08-06 04:46:11 [logger.py:41] Received request cmpl-cab8ab6cf4bb4fe180efb27e5dfb59bf-0: prompt: 'those corridors and he could have killed Frank without realising he’d got the wrong man. As it happens, we only have Derek’s word for it that Stefan ever went into the room.\n</text>\n\nWhat is the correct answer to this question: Please try to deduce the true story based on the evidence currently known. Who murdered Frank Parris in your deduction?\nChoices:\n(A) Aiden MacNeil\n(B) Martin Williams\n(C) Stefan Codrescu\n(D) Lisa Treherne\n\nFormat your response as follows: "The correct answer is (insert answer here)".', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=0.9, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [1, 386, 852, 1034, 2429, 943, 322, 540, 1033, 505, 9445, 4976, 1728, 1855, 5921, 540, 30010, 29881, 2355, 278, 2743, 767, 29889, 1094, 372, 5930, 29892, 591, 871, 505, 360, 20400, 30010, 29879, 1734, 363, 372, 393, 21512, 3926, 3512, 964, 278, 5716, 29889, 13, 829, 726, 29958, 13, 13, 5618, 338, 278, 1959, 1234, 304, 445, 1139, 29901, 3529, 1018, 304, 21049, 346, 278, 1565, 5828, 2729, 373, 278, 10757, 5279, 2998, 29889, 11644, 13406, 287, 4976, 1459, 3780, 297, 596, 21049, 428, 29973, 13, 15954, 1575, 29901, 13, 29898, 29909, 29897, 319, 3615, 4326, 8139, 309, 13, 29898, 29933, 29897, 6502, 11648, 13, 29898, 29907, 29897, 21512, 315, 397, 690, 4979, 13, 29898, 29928, 29897, 29420, 6479, 2276, 484, 13, 13, 5809, 596, 2933, 408, 4477, 29901, 376, 1576, 1959, 1234, 338, 313, 7851, 1234, 1244, 29897, 1642], prompt_embeds shape: None, lora_request: None.

I run the following commands:

vllm serve llama-2-7b-hf \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.98

python3 pred.py --model llama-2-7b-hf
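
To rule out pred.py itself, the completions endpoint can also be queried directly; a minimal sketch is below, with the port and model name assumed from the vLLM defaults and the serve command above:

import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    json={
        "model": "llama-2-7b-hf",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0.0,
    },
)
print(resp.status_code)
# Inspect the raw 'text' and 'finish_reason' fields of the first choice.
print(resp.json()["choices"][0])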
