Add extended task for LiveCodeBench codegeneration by plaguss · Pull Request #548 · huggingface/lighteval

plaguss · 2025-02-10T11:11:37Z

Adds a new extended task to run LiveCodeBench's codegeneration subset.

The results for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B:

lighteval vllm \
    "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=float16,data_parallel_size=4,max_model_length=32768,gpu_memory_utilisation=0.8,generation_parameters={temperature: 0.7}" \
    "extended|lcb:codegeneration|0|0" \
    --use-chat-template

Alternatively, with a YAML file (lcb.yaml) like so:

model:
  base_params:
    model_args: "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=float16,data_parallel_size=4,max_model_length=32768,gpu_memory_utilisation=0.8"
  generation:
    temperature: 0.6
    top_p: 0.95

lighteval vllm \
    "lcb.yaml" \
    "extended|lcb:codegeneration|0|0" \
    --use-chat-template
...
|            Task             |Version|Metric|Value|   |Stderr|
|-----------------------------|------:|------|----:|---|-----:|
|all                          |       |maj@16|0.163|±  |0.0188|
|extended:lcb:codegeneration:0|      0|maj@16|0.163|±  |0.0188|

Note: This is just an idea, not sure it's the best approach.

Additionally it adds a way of updating the number of samples required to run a metric via the yaml file:

model:
  base_params:
    model_args: "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=bfloat16,data_parallel_size=4,max_model_length=32768,gpu_memory_utilisation=0.8"
  generation:
    temperature: 0.6
    top_p: 0.95
  metric_options:
    codegen_pass@1:16:
      num_samples: 16

Under metric_options, add an entry keyed by the name of the metric to update. For now only num_samples is supported, but with this layout other options could be added without further changes. Otherwise, num_samples can be specified through the metric name itself.
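A minimal sketch of how the sample count could be resolved, assuming the metric name carries the number of generations after a colon (as in codegen_pass@1:16) and that an entry in metric_options takes precedence over it. The helpers `parse_num_samples` and `resolve_num_samples` and the name pattern are hypothetical, written for illustration only, and are not lighteval's API.

```python
import re
from typing import Optional

# Assumed pattern: "<name>@<k>:<n>", e.g. "codegen_pass@1:16" -> n = 16.
_METRIC_NAME_RE = re.compile(r"^(?P<name>.+)@(?P<k>\d+):(?P<n>\d+)$")


def parse_num_samples(metric_name: str) -> Optional[int]:
    """Return the number of generations encoded in the metric name, if any."""
    match = _METRIC_NAME_RE.match(metric_name)
    return int(match.group("n")) if match else None


def resolve_num_samples(metric_name: str, metric_options: dict) -> Optional[int]:
    """Prefer an explicit override from metric_options, else parse the name."""
    override = metric_options.get(metric_name, {}).get("num_samples")
    if override:
        return int(override)
    return parse_num_samples(metric_name)


# Same shape as the metric_options section of the YAML file, loaded as a dict.
metric_options = {"codegen_pass@1:16": {"num_samples": 8}}
print(resolve_num_samples("codegen_pass@1:16", metric_options))  # override wins
print(resolve_num_samples("codegen_pass@1:16", {}))  # falls back to the name
```

Keying the overrides by metric name keeps the YAML flat, at the cost of having to repeat the exact metric name in the config.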

HuggingFaceDocBuilderDev · 2025-02-10T11:13:36Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

NathanHB · 2025-02-11T13:39:15Z

Hi! Thanks for the PR.
To select dates, I think the only way would be to select the right dataset splits in the task config; there is no way of doing it from the CLI.
Using different prompts is not possible at runtime; you need to define them at the task level.

src/lighteval/tasks/extended/lcb/main.py

plaguss · 2025-02-14T16:43:11Z

There's a job currently running with the following command:

lighteval vllm \
    "lcb.yaml" \
    "extended|lcb:codegeneration|0|0" \
    --custom-tasks src/lighteval/tasks/extended/lcb/main.py \
    --system-prompt "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\n\n" \
    --output-dir $OUTPUT_DIR \
    --save-details

and yaml file:

model:
  base_params:
    model_args: "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=bfloat16,tensor_parallel_size=4,max_model_length=32768,gpu_memory_utilisation=0.8"
  generation:
    temperature: 0.6
    top_p: 0.95

I'll mark the PR as ready for review once a full run completes.
It's been running for 4.5 hours and seems to be halfway through the generations. GPU utilization looks like this:

[image: GPU utilization]

lewtun · 2025-02-15T21:47:08Z

Hi @plaguss @NathanHB will it be possible to run this eval without needing a YAML file?

The reason I ask is that all of our codebases assume one can run lighteval vllm {ARGS} where we just populate {ARGS} at runtime. Having a YAML adds another layer of complexity, where we would need to grep / regex the model_args params and update them.

Also, perhaps we can speed this up dramatically by using data_parallel_size instead of tensor_parallel_size for models that fit on a single H100 (i.e. use a full node with 8 copies)?

plaguss · 2025-02-16T08:05:50Z

> Hi @plaguss @NathanHB will it be possible to run this eval without needing a YAML file?
>
> The reason I ask is that all of our codebases assume one can run lighteval vllm {ARGS} where we just populate {ARGS} at runtime. Having a YAML adds another layer of complexity, where we would need to grep / regex the model_args params and update them.

Hi Lewis, I couldn't find a way of passing the generation parameters in the CLI, and they seem relevant for this model. I can update the code to pass them through ARGS (it should be here unless there's already a better way, @NathanHB?)

NEW:
I added the following logic to allow reading the arguments from the CLI to simplify things:

lighteval vllm \
    "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=float16,data_parallel_size=4,max_model_length=32768,gpu_memory_utilisation=0.8,generation_parameters={temperature:0.7,top_p:5}" \
    "extended|lcb:codegeneration|0|0" \
    --custom-tasks src/lighteval/tasks/extended/lcb/main.py \
    --output-dir $OUTPUT_DIR \
    --save-details

Now we could read the generation parameters from the model args following this pattern, let me know what you both think.
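As a rough illustration of the pattern above, here is a sketch of splitting such a model-args string into plain arguments and generation parameters. The functions `split_top_level` and `parse_model_args` are hypothetical and are not the code added in this PR. The sketch also assumes every generation parameter is numeric and that braces are not nested more than one level.

```python
from __future__ import annotations


def split_top_level(text: str, sep: str = ",") -> list[str]:
    """Split on `sep`, ignoring separators nested inside {...}."""
    parts, depth, current = [], 0, []
    for char in text:
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
        if char == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    if current:
        parts.append("".join(current))
    return parts


def parse_model_args(model_args: str) -> tuple[dict, dict]:
    """Return (plain model args, generation parameters) from a CLI string."""
    plain, generation = {}, {}
    for item in split_top_level(model_args):
        key, _, value = item.partition("=")
        key = key.strip()
        if key == "generation_parameters":
            inner = value.strip().strip("{}")
            for pair in split_top_level(inner):
                name, _, raw = pair.partition(":")
                generation[name.strip()] = float(raw)  # assumption: numeric values only
        else:
            plain[key] = value.strip()
    return plain, generation


args = (
    "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B,dtype=float16,"
    "data_parallel_size=4,generation_parameters={temperature:0.6,top_p:0.95}"
)
plain, generation = parse_model_args(args)
print(plain)
print(generation)
```

Tracking brace depth while splitting is what lets the commas inside generation_parameters={...} coexist with the commas separating the top-level arguments.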

> Also, perhaps we can speed this up dramatically by using data_parallel_size instead of tensor_parallel_size for models that fit on a single H100 (i.e. use a full node with 8 copies)?

Sure, I ran it with data_parallel_size=4 for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, and it took approximately 4h.

plaguss · 2025-02-17T08:29:04Z

The 32B is still running due to an error, but the other values can be found here:

| Model                         | Lighteval (Replica) | LiveCodeBench (DeepSeek Reported) |
|-------------------------------|--------------------:|----------------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B |               0.163 |                             0.169 |
| DeepSeek-R1-Distill-Qwen-7B   |               0.366 |                             0.376 |
| DeepSeek-R1-Distill-Llama-8B  |               0.370 |                             0.396 |
| DeepSeek-R1-Distill-Qwen-14B  |               0.515 |                             0.531 |
| DeepSeek-R1-Distill-Qwen-32B  |               0.566 |                             0.572 |
| DeepSeek-R1-Distill-Llama-70B |               0.545 |                             0.575 |
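For a quick read of how close the replication is, the gaps can be computed directly from the numbers in the table. This is only an arithmetic sketch. The short size labels are shorthand introduced here for the six rows, and the values are copied from the table as posted.

```python
# (replica, reported) scores copied from the table, keyed by model size.
results = {
    "1.5B": (0.163, 0.169),
    "7B": (0.366, 0.376),
    "8B": (0.370, 0.396),
    "14B": (0.515, 0.531),
    "32B": (0.566, 0.572),
    "70B": (0.545, 0.575),
}

# Absolute gap between the reported score and the lighteval replica.
gaps = {size: reported - replica for size, (replica, reported) in results.items()}

for size, gap in gaps.items():
    reported = results[size][1]
    print(f"{size:>4}: gap = {gap:.3f} ({gap / reported:.1%} of the reported score)")

largest = max(gaps, key=gaps.get)
print(f"largest gap: {largest}")
```

For reference, the 1.5B run above reported a stderr of ±0.0188, so its gap of about 0.006 sits well inside one standard error.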

NathanHB · 2025-02-17T12:19:44Z

> The 32B is still running due to an error, but the other values can be found here:
>
> | Model                         | Lighteval (Replica) | LiveCodeBench (DeepSeek Reported) |
> |-------------------------------|--------------------:|----------------------------------:|
> | DeepSeek-R1-Distill-Qwen-1.5B |               0.163 |                             0.169 |
> | DeepSeek-R1-Distill-Qwen-7B   |               0.366 |                             0.376 |
> | DeepSeek-R1-Distill-Llama-8B  |               0.370 |                             0.396 |
> | DeepSeek-R1-Distill-Qwen-14B  |               0.515 |                             0.531 |
> | DeepSeek-R1-Distill-Qwen-32B  |                   - |                             0.572 |
> | DeepSeek-R1-Distill-Llama-70B |               0.545 |                             0.575 |

Great! Thanks for adding a way to pass generation params as args.

NathanHB

Great work on this! The results look great! I was only wondering about dynamically changing the metric config at runtime, and whether you could add some docs.
Otherwise ready to merge :)

NathanHB · 2025-02-17T12:21:20Z

src/lighteval/main_vllm.py

        with open(model_args, "r") as f:
            config = yaml.safe_load(f)["model"]
        model_args = config["base_params"]["model_args"]
+        metric_options = config.get("metric_options", {})


Can you add some docs for this?

NathanHB · 2025-02-17T12:23:28Z

src/lighteval/pipeline.py

+                if metric_data := self._metric_options.get(metric.metric_name, None):
+                    num_samples = metric_data.get("num_samples", None)
+                    if num_samples:
+                        task.num_samples.append(num_samples)


Has this been tested?

Done. It had two bugs in fact, thanks! It now works as expected:

            for metric in task.metrics:
                if metric_data := self._metric_options.get(metric.metric_name, None):
                    num_samples = metric_data.get("num_samples", None)
                    if num_samples:
                        task.num_samples = [num_samples]
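To illustrate the visible change from the earlier diff (assigning a new list instead of appending), here is a self-contained sketch. The `Metric` and `Task` dataclasses and the `apply_metric_options` function are hypothetical stand-ins, much simpler than lighteval's real classes. Only the loop body follows the snippet above.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    metric_name: str


@dataclass
class Task:
    metrics: list
    num_samples: list = field(default_factory=list)


def apply_metric_options(tasks, metric_options):
    """Override each task's sample count from a metric_options mapping."""
    for task in tasks:
        for metric in task.metrics:
            if metric_data := metric_options.get(metric.metric_name):
                num_samples = metric_data.get("num_samples")
                if num_samples:
                    # Replace rather than append, so the default count is dropped.
                    task.num_samples = [num_samples]


task = Task(metrics=[Metric("codegen_pass@1:16")], num_samples=[16])
apply_metric_options([task], {"codegen_pass@1:16": {"num_samples": 4}})
print(task.num_samples)  # [4]; appending would have left [16, 4]
```

With `append`, the task would keep its default sample count alongside the override, so the override would not actually take effect as the single value to use.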

plaguss added 2 commits February 10, 2025 12:10

Add draft for livecodebench code generation

9369a37

Add extra argument version_tag

2001e7b

plaguss added 5 commits February 10, 2025 12:14

Fix import name

fece552

Remove unused typed dict

e46fc2a

Checkpoint, not ready yet, try simplifying code running and reuse pass_at_k

6a3c007

Add notes for expected values

987eb2a

Pass version tag to downloader

42fb0f5

NathanHB reviewed Feb 11, 2025 (4 review comments on src/lighteval/tasks/extended/lcb/main.py)

plaguss and others added 10 commits February 14, 2025 09:57

Modify helper module and remove dataset version tag

b700dc4

Remove version_tag

29b2bbe

Initial version for lcb:codegeneration

a60e662

Remove outdated argument docs

05a7f01

Remove hardcoded system prompt and pass it via arg

deea663

Merge branch 'main' into lcb-codegeneration

a2863f9

Add kwargs to allow passing other arguments

44f45b5

Make generic function to parse the metric name and obtain the number of generations

127b4cd

Change metric name to make it more informative

a372e05

Add experimental way of passing the number of samples for a metric from the yaml file

53ab417

Add more processes to run the tests

f6a7c4f

plaguss marked this pull request as ready for review February 16, 2025 08:08

Allow reading the generation parameters from the CLI

d6abcd0

plaguss requested a review from NathanHB February 17, 2025 08:22

plaguss added 2 commits February 17, 2025 09:53

Update parsing arguments from CLI

158d660

Remove dead code and fix test value

54fa032

plaguss mentioned this pull request Feb 17, 2025

Add LiveCodeBench's codegeneration task from lighteval huggingface/open-r1#346

Merged

NathanHB approved these changes Feb 17, 2025

plaguss added 2 commits February 17, 2025 15:55

Fix num_samples update

4a0fe89

Add docs for the new metric_options

f945fdf

NathanHB merged commit fd479ee into huggingface:main Feb 18, 2025
3 checks passed

edbeeching mentioned this pull request Feb 20, 2025

Cannot find tasks extended|lcb:codegeneration in task list or in custom task registry huggingface/open-r1#378

Open
