Hi,
How can I set vLLM's "enforce eager" mode from the Llama-3.1-8B benchmark command line? For this command:
python -u main.py \
--scenario Offline \
--model-path $CHECKPOINT_PATH \
--batch-size $BATCH_SIZE \
--dtype bfloat16 \
--user-conf user.conf \
--total-sample-count 1 \
--dataset-path $DATASET_PATH \
--output-log-dir output \
--tensor-parallel-size $GPU_COUNT \
--vllm
I see this message:
INFO 12-08 09:22:13 gpu_executor.py:122] # GPU blocks: 28190, # CPU blocks: 2048
INFO 12-08 09:22:13 gpu_executor.py:126] Maximum concurrency for 131072 tokens per request: 3.44x
INFO 12-08 09:22:16 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-08 09:22:16 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
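As far as I understand the message, when using vllm.LLM directly this is just a constructor argument. A minimal sketch of what I mean (the model id below is only a placeholder; in the benchmark I point at $CHECKPOINT_PATH):

from vllm import LLM

# enforce_eager=True skips CUDA graph capture (lower memory use, some throughput cost)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    dtype="bfloat16",
    enforce_eager=True,
)

But I would like to enable this through main.py rather than by calling vLLM directly.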
I tried adding --enforce-eager to the python command (alongside --vllm), but apparently main.py has no such option.
I also tried setting that option manually in main.py as below:
if args.vllm:
    sut = sut_cls(
        model_path=args.model_path,
        dtype=args.dtype,
        batch_size=args.batch_size,
        dataset_path=args.dataset_path,
        total_sample_count=args.total_sample_count,
        workers=args.num_workers,
        tensor_parallel_size=args.tensor_parallel_size,
        enforce_eager=True,  # <======= added this line
    )
But I get this error:
  File "/mnt/users/m/inference/language/llama3.1-8b/main.py", line 173, in main
    sut = sut_cls(
          ^^^^^^^^
TypeError: SUT.__init__() got an unexpected keyword argument 'enforce_eager'
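So it seems the SUT class (and main.py's argument parser) would also need to accept and forward this option. Below is a minimal standalone sketch of the kind of change I am imagining; EagerSUT and the flag wiring are hypothetical stand-ins, not the actual repo code, and I don't know whether this is the intended way to do it:

from vllm import LLM

class EagerSUT:
    """Hypothetical stand-in for the repo's SUT class, assuming it wraps vllm.LLM."""

    def __init__(self, model_path, dtype, tensor_parallel_size, enforce_eager=False):
        # Forward enforce_eager so vLLM skips CUDA graph capture when requested.
        self.llm = LLM(
            model=model_path,
            dtype=dtype,
            tensor_parallel_size=tensor_parallel_size,
            enforce_eager=enforce_eager,
        )

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", required=True)
    parser.add_argument("--dtype", default="bfloat16")
    parser.add_argument("--tensor-parallel-size", type=int, default=1)
    # Hypothetical flag corresponding to vLLM's --enforce-eager
    parser.add_argument("--enforce-eager", action="store_true")
    args = parser.parse_args()

    sut = EagerSUT(
        args.model_path,
        args.dtype,
        args.tensor_parallel_size,
        enforce_eager=args.enforce_eager,
    )

Is something like this the recommended approach, or is there an existing way to pass enforce_eager through the benchmark harness?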