Labels: bug (Something isn't working), pending (This problem is yet to be addressed)
Description
Reminder
- I have read the above rules and searched the existing issues.
System Info
- `llamafactory` version: 0.9.5.dev0
- Platform: Linux-5.10.134-19.100.al8.x86_64-x86_64-with-glibc2.39
- Python version: 3.12.12
- PyTorch version: 2.10.0+cu128 (GPU)
- Transformers version: 5.2.0
- Datasets version: 4.0.0
- Accelerate version: 1.11.0
- PEFT version: 0.18.1
- GPU type: NVIDIA L20Y
- GPU number: 8
- GPU memory: 79.19GB
- TRL version: 0.24.0
- DeepSpeed version: 0.18.4
- Git commit: fc5b85c
- Default data directory: detected
Reproduction
When I test the Ray integration for distributed training, I encounter the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/demo/bin/llamafactory-cli", line 10, in <module>
sys.exit(main())
^^^^^^
File "/mnt/ali-sh-1/dataset/zeus/ylqiu/codes/demo_exp/LlamaFactory/src/llamafactory/cli.py", line 24, in main
launcher.launch()
File "/mnt/ali-sh-1/dataset/zeus/ylqiu/codes/demo_exp/LlamaFactory/src/llamafactory/launcher.py", line 157, in launch
run_exp()
File "/mnt/ali-sh-1/dataset/zeus/ylqiu/codes/demo_exp/LlamaFactory/src/llamafactory/train/tuner.py", line 123, in run_exp
_ray_training_function(ray_args, config={"args": args, "callbacks": callbacks})
File "/mnt/ali-sh-1/dataset/zeus/ylqiu/codes/demo_exp/LlamaFactory/src/llamafactory/train/tuner.py", line 314, in _ray_training_function
ray.get([worker._training_function.remote(config=config) for worker in workers])
File "/root/miniconda3/envs/demo/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/demo/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/demo/lib/python3.12/site-packages/ray/_private/worker.py", line 2981, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/demo/lib/python3.12/site-packages/ray/_private/worker.py", line 1012, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::Worker._training_function() (pid=382911, ip=10.144.205.242, actor_id=feb329387fe251e2405dbe9904000000, repr=<llamafactory.train.tuner.Worker object at 0x7f6f850480e0>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/ali-sh-1/dataset/zeus/ylqiu/codes/demo_exp/LlamaFactory/src/llamafactory/train/tuner.py", line 248, in _training_function
_training_function(config)
File "/mnt/ali-sh-1/dataset/zeus/ylqiu/codes/demo_exp/LlamaFactory/src/llamafactory/train/tuner.py", line 60, in _training_function
model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
^^^^^^^^^^^^^^^^^^^^
File "/mnt/ali-sh-1/dataset/zeus/ylqiu/codes/demo_exp/LlamaFactory/src/llamafactory/hparams/parser.py", line 290, in get_train_args
model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/ali-sh-1/dataset/zeus/ylqiu/codes/demo_exp/LlamaFactory/src/llamafactory/hparams/parser.py", line 244, in _parse_train_args
return _parse_args(parser, args, allow_extra_keys=allow_extra_keys)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/ali-sh-1/dataset/zeus/ylqiu/codes/demo_exp/LlamaFactory/src/llamafactory/hparams/parser.py", line 91, in _parse_args
return parser.parse_dict(args, allow_extra_keys=allow_extra_keys)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/demo/lib/python3.12/site-packages/transformers/hf_argparser.py", line 383, in parse_dict
raise ValueError(f"Some keys are not used by the HfArgumentParser: {sorted(unused_keys)}")
ValueError: Some keys are not used by the HfArgumentParser: ['placement_strategy', 'ray_run_name', 'ray_storage_path', 'resources_per_worker']
It seems these keys are not recognized by the transformers HfArgumentParser, and simply removing them from the argument dict before parsing fixes the problem.
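As a possible workaround until this is fixed upstream, something like the sketch below could keep these keys out of the dict that reaches the HfArgumentParser. The strip_ray_keys helper is hypothetical, and where exactly to apply it before parser.parse_dict in _parse_args is an assumption on my side:

# Minimal sketch: drop the Ray-only keys before the dict reaches
# HfArgumentParser.parse_dict(). The key names come from the error above;
# the helper name and call site are assumptions, not existing LLaMA-Factory API.
RAY_ONLY_KEYS = {
    "placement_strategy",
    "ray_run_name",
    "ray_storage_path",
    "resources_per_worker",
}

def strip_ray_keys(args: dict) -> dict:
    """Return a copy of the training args without the Ray-only keys."""
    return {k: v for k, v in args.items() if k not in RAY_ONLY_KEYS}

# e.g. inside _parse_args:
# return parser.parse_dict(strip_ray_keys(args), allow_extra_keys=allow_extra_keys)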
The following script reproduces the issue:
export USE_RAY=1
llamafactory-cli train examples/train_lora/qwen3_lora_sft_ray.yaml
Others
No response