Start the vLLM server
In thinking mode:
vllm serve Qwen/Qwen3-32B \
--dtype auto \
--reasoning-parser deepseek_r1 \
--task generate \
--disable-log-requests \
--max-model-len 8192 \
--gpu-memory-utilization 0.95 \
--enable-chunked-prefill

Use non-thinking mode, as described in the Qwen3 docs:
vllm serve Qwen/Qwen3-32B \
--dtype auto \
--chat-template ./qwen3_nonthinking.jinja \
--task generate \
--disable-log-requests \
--max-model-len 8192 \
--gpu-memory-utilization 0.95 \
--enable-chunked-prefill

Start the server and run the inference:
python inference.py \
--model_id Qwen/Qwen3-4B \
--gen_kwargs thinking \
--datasets VariErrNLI \
--template_id 01 \
--remote_call_concurrency 10 \
--n_examples 10 \
--vllm.port 8000 \
--vllm.start_server=False

Create the sbatch files:
cd lewidi2025/slurm
python create_sbatch_files.py

Submit all the jobs:
cd slurm_scripts/
ls | xargs -n 1 sbatch

Check the status of the jobs:
squeue -u $USER

After installing the package, you can plot the metrics by running:
lewidi-plot --log_file /dss/dssfs02/lwp-dss-0001/pn76je/pn76je-dss-0000/lewidi-data/sbatch/di38bec/Qwen_Qwen3-32B_thinking/out.logs

where out.logs is generated by the sbatch file.
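When several models have been run, each model/mode combination gets its own subdirectory (e.g. Qwen_Qwen3-32B_thinking) with its own out.logs. A short Python sketch can collect all of them so each can be passed to lewidi-plot in turn; the directory layout assumed here is inferred from the path above:

```python
from pathlib import Path

def find_log_files(sbatch_root: str) -> list[Path]:
    """Collect every out.logs file found one level below the sbatch
    output root, i.e. one per model/mode subdirectory."""
    return sorted(Path(sbatch_root).glob("*/out.logs"))
```

For example, `find_log_files("lewidi-data/sbatch/di38bec")` (a hypothetical root; adjust to your own output directory) returns the log paths in sorted order.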
find . -name '*_responses.jsonl' | xargs wc -l
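Note that when the file list is long, xargs may split it across several wc invocations, leaving you with multiple "total" lines to add up by hand. A stdlib Python sketch that produces a single per-file and overall count:

```python
from pathlib import Path

def count_response_lines(root: str) -> dict[str, int]:
    """Count lines in every *_responses.jsonl under root, keyed by path."""
    counts = {}
    for p in sorted(Path(root).rglob("*_responses.jsonl")):
        with p.open() as f:
            counts[str(p)] = sum(1 for _ in f)
    return counts
```

`sum(count_response_lines(".").values())` then gives the overall total regardless of how many files there are.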
find . -name '*_responses.jsonl' | xargs cat > combined.jsonl

duckdb -c "COPY (SELECT * FROM read_json_auto('combined.jsonl', union_by_name=True)) TO 'combined.parquet'"

On the cluster, the same conversion can be run on a CPU node:

srun --partition=lrz-cpu --cpus-per-task=20 --time=01:00:00 --qos=cpu duckdb -c "COPY (SELECT * FROM read_json_auto('combined.jsonl', union_by_name=True)) TO 'combined.parquet'"

Alternatively, convert with pandas:
import pandas as pd
df = pd.read_json("combined.jsonl", lines=True, dtype={"error": "string"})
df.to_parquet("combined.parquet")
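If either conversion chokes on a malformed row (e.g. a truncated write from a job that was killed mid-flight), a quick stdlib pass can locate the offending lines before retrying. This is a sketch operating on the combined JSONL file produced above:

```python
import json

def find_bad_lines(path: str) -> list[int]:
    """Return the 1-based line numbers that are not valid JSON."""
    bad = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are harmless, skip them
            try:
                json.loads(line)
            except json.JSONDecodeError:
                bad.append(i)
    return bad
```

Running `find_bad_lines("combined.jsonl")` on a clean file returns an empty list; any line numbers it reports can be inspected (or dropped) before re-running the conversion.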