
Conversation

@jiacheng-xu (Collaborator)

WIP

Known issue: tool calling doesn't work in the customized generation module.

Signed-off-by: Jiacheng Xu <[email protected]>
@jiacheng-xu (Collaborator, Author)

Hi @Kipok, I need some help here. I found that in the generation task I designed, the model can't use tools. The conversation always ends with a tool call (e.g. import numpy), and that's the end of the conversation.
I also didn't find a very elegant way to handle all the if-conditions like if self.cfg.code_execution: or parse_reasoning. Is there a centralized util function to handle all of these? (See the sketch after this comment for the kind of thing I mean.)

Thanks a lot!

file: nemo_skills/inference/eval/critpt.py
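No single such utility is referenced in this thread, but here is a minimal sketch of the kind of centralized helper being asked about. GenerationConfig, postprocess_generation, and the </think> delimiter are all illustrative assumptions, not NeMo-Skills APIs:

```python
# Hypothetical sketch, not an existing NeMo-Skills utility: a single helper
# that owns all config-conditional post-processing, so call sites don't need
# scattered `if self.cfg.code_execution:` / `parse_reasoning` checks.
from dataclasses import dataclass


@dataclass
class GenerationConfig:
    code_execution: bool = False   # legacy sandboxed Python execution
    parse_reasoning: bool = False  # split the reasoning trace from the answer


def postprocess_generation(output: dict, cfg: GenerationConfig) -> dict:
    """Apply every enabled post-processing step in one place."""
    if cfg.parse_reasoning and "</think>" in output["generation"]:
        # Assumed convention: reasoning is separated from the final answer
        # by a closing think tag; the real delimiter depends on the model.
        reasoning, _, answer = output["generation"].partition("</think>")
        output["reasoning"] = reasoning.strip()
        output["generation"] = answer.strip()
    if cfg.code_execution:
        # Placeholder for code-execution-specific cleanup (stripping
        # sandbox scaffolding, counting executions, etc.).
        output["generation"] = output["generation"].strip()
    return output
```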

@gwarmstrong (Collaborator)

@jiacheng-xu can you show an example ns generate/ns eval script that you are using? I suspect the issue is that you are setting ++code_execution=True. That's probably not what you actually want here, as it is an older way of invoking Python execution from before we had tool parsers/structured tool calls integrated.
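For context on why the legacy path behaves differently from structured tool calls, here is a rough sketch of the kind of loop ++code_execution=True implies: generation stops at a code tag, the sandbox runs the code, and the output is spliced back in before resuming. The tags and function signatures are illustrative, not the actual NeMo-Skills implementation; a mismatch anywhere in such a loop (e.g. tags the model never emits) would show up exactly as a conversation that ends at the first tool call.

```python
# Illustrative sketch of a legacy code-execution loop (not the actual
# NeMo-Skills code). llm_complete and sandbox_run stand in for real clients.
CODE_BEGIN, CODE_END = "<code>", "</code>"  # placeholders; cf. ++code_tags


def generate_with_code_execution(prompt, llm_complete, sandbox_run,
                                 max_code_executions=100):
    """llm_complete(text, stop) -> (completion, stop_reason);
    sandbox_run(code) -> stdout."""
    text = prompt
    for _ in range(max_code_executions):
        completion, stop_reason = llm_complete(text, stop=[CODE_END])
        text += completion
        if stop_reason != "stop_string":
            return text  # model finished without asking to run code
        # Execute the last code block and splice its output back in,
        # then resume generation from the extended context.
        code = text.rsplit(CODE_BEGIN, 1)[-1]
        text += f"{CODE_END}\n<output>\n{sandbox_run(code)}\n</output>\n"
    return text
```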

@jiacheng-xu (Collaborator, Author)

> @jiacheng-xu can you show an example ns generate/ns eval script that you are using? I suspect the issue is that you are setting ++code_execution=True. That's probably not what you actually want here, as it is an older way of invoking Python execution from before we had tool parsers/structured tool calls integrated.

Yes, I am using code_execution=True. Here is the config I'm running:

```json
{
  "benchmarks": "critpt:1",
  "split": "test",
  "num_chunks": 1,
  "server_type": "vllm",
  "server_gpus": 4,
  "server_args": "--async-scheduling",
  "model": "/hf_models/gpt-oss-20b",
  "with_sandbox": true,
  "expname": "critpttest-ghb-model_gpt_oss_20b-oci-debug",
  "cluster": "oci",
  "output_dir": "/workspace/critpttest-ghb-model_gpt_oss_20b-oci-debug",
  "wandb_project": "critpt",
  "wandb_name": "012918-critpttest-ghb-model_gpt_oss_20b-oci",
  "__ctx_args": "++max_samples=10 ++inference.temperature=1.0 ++inference.tokens_to_generate=65536 ++code_tags=gpt-oss ++server.code_execution.max_code_executions=100 ++inference.endpoint_type=text --config-path=/nemo_run/code/nemo_skills_stem/configs --config-name=gpt_oss ++chat_template_kwargs.builtin_tools=[python] ++chat_template_kwargs.reasoning_effort=high ++parse_reasoning=True ++code_execution=true"
}
```

Could you share the recommended way to do tool calling, one for endpoint='text' (like gpt-oss) and one for endpoint='chat'? Thanks!
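Not speaking for the maintainers, but as a sketch of how the two request shapes typically differ against an OpenAI-compatible server such as vLLM (assuming it was launched with a tool parser enabled): with a text endpoint the chat template, including built-in tool definitions, is rendered client-side into one prompt string and the client must handle tool-call tokens itself, while with a chat endpoint tools are passed structured and come back as parsed tool calls. The URL, prompt string, and tool schema below are illustrative.

```python
# Illustrative request shapes against an OpenAI-compatible server (e.g. vLLM
# started with a tool parser enabled). URL and tool schema are assumptions.
import requests

BASE = "http://localhost:8000/v1"

# endpoint_type=text: the chat template (with built-in tools like `python`
# inlined) is applied client-side; the server just continues the string, so
# the client must detect and act on any tool-call tokens itself.
rendered_prompt = "<|start|>user<|message|>What is 2**20?<|end|>..."  # placeholder
text_resp = requests.post(f"{BASE}/completions", json={
    "model": "gpt-oss-20b",
    "prompt": rendered_prompt,
    "max_tokens": 1024,
})

# endpoint_type=chat: messages and tools are sent structured; the server's
# tool parser returns any tool invocation as choices[0].message.tool_calls.
chat_resp = requests.post(f"{BASE}/chat/completions", json={
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is 2**20?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "python",  # illustrative tool definition
            "description": "Execute Python code and return stdout.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }],
})
```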
