Confused: How to deploy a training task across multiple nodes with accelerate and deepspeed? #1288
Replies: 3 comments
- Potentially related to this: I have experienced similar problems when trying to run axolotl training with a very large dataset (ca. 100B tokens at 8k context length).
- Hey, could you check the https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/nccl.md docs to see if they help resolve that NCCL issue? Alternatively, can you try running multi-node with accelerate, following https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/multi-node.md (a config sketch follows below).
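For reference, the multi-node setup described in those docs typically gives every node the same accelerate config, differing only in `machine_rank`. Below is a minimal sketch of such a config; it assumes DeepSpeed ZeRO stage 2, four 8-GPU nodes, and a placeholder IP and port. None of these values come from the original post, so adjust them to your cluster.

```yaml
# multinode.yaml -- illustrative accelerate config for multi-node DeepSpeed training (values are placeholders)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_multinode_launcher: standard
  zero_stage: 2                 # assumed; match whatever ZeRO stage you use on a single node
  gradient_accumulation_steps: 1
machine_rank: 0                 # 0 on the main node, 1..3 on the other nodes
main_process_ip: 10.0.0.1       # placeholder: IP of the rank-0 node, reachable from all nodes
main_process_port: 29500        # placeholder port; must be open between the nodes
num_machines: 4
num_processes: 32               # total GPUs across all machines (4 nodes x 8 GPUs here)
mixed_precision: bf16
rdzv_backend: static
same_network: true
main_training_function: main
use_cpu: false
```

Each node then runs the same launch command, e.g. `accelerate launch --config_file multinode.yaml -m axolotl.cli.train your_axolotl_config.yml`, with only `machine_rank` differing between nodes.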
## Here's the accelerate config:

## Here's the log record:
"""
gpu004: [2024-02-13 02:00:39,717] [DEBUG] [axolotl.load_tokenizer:248] [PID:3056709] [RANK:7] UNK: 0 /
gpu004: [2024-02-13 02:00:39,717] [INFO] [axolotl.load_tokenizer:259] [PID:3056709] [RANK:7] No Chat template selected. Consider adding a chat template for easier inference.
Downloading readme: 591B [00:00, 802kB/s]
Downloading data: 100%|██████████| 11.7M/11.7M [00:05<00:00, 2.20MB/s]
Generating train split: 100%|██████████| 10000/10000 [00:00<00:00, 121481.07 examples/s]
Tokenizing Prompts (num_proc=64): 4%|▍ | 394/10000 [00:00<00:18, 532.87 examples/s]
gpu003: multiprocess.pool.RemoteTraceback:
gpu003: """
gpu003: Traceback (most recent call last):
gpu003: File "/data/vayu/train/axolotl/src/axolotl/prompt_tokenizers.py", line 356, in tokenize_prompt
gpu003: for _, part in enumerate(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/prompters.py", line 328, in build_prompt
gpu003: turns = self._build_result(source)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/prompters.py", line 318, in _build_result
gpu003: role = roles[sentence["from"]]
gpu003: KeyError: None
gpu003:
gpu003: The above exception was the direct cause of the following exception:
gpu003:
gpu003: Traceback (most recent call last):
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
gpu003: result = (True, func(*args, **kwds))
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 625, in _write_generator_to_queue
gpu003: for i, result in enumerate(func(**kwargs)):
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3458, in _map_single
gpu003: example = apply_function_on_filtered_inputs(example, i, offset=offset)
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3361, in apply_function_on_filtered_inputs
gpu003: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/prompt_tokenizers.py", line 443, in tokenize_prompt
gpu003: raise InvalidDataException(str(err)) from err
gpu003: axolotl.prompt_tokenizers.InvalidDataException: None
gpu003: """
gpu003:
gpu003: The above exception was the direct cause of the following exception:
gpu003:
gpu003: Traceback (most recent call last):
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu003: return _run_code(code, main_globals, None,
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu003: exec(code, run_globals)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in <module>
gpu003: fire.Fire(do_cli)
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu003: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu003: component, remaining_args = _CallAndUpdateTrace(
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu003: component = fn(*varargs, **kwargs)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu003: return do_train(parsed_cfg, parsed_cli_args)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu003: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/__init__.py", line 366, in load_datasets
gpu003: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 79, in prepare_dataset
gpu003: train_dataset, eval_dataset, prompters = load_prepare_datasets(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 461, in load_prepare_datasets
gpu003: dataset, prompters = load_tokenized_prepared_datasets(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 403, in load_tokenized_prepared_datasets
gpu003: dataset_wrapper, dataset_prompter = get_dataset_wrapper(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 554, in get_dataset_wrapper
gpu003: dataset_wrapper = TokenizedPromptDataset(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/datasets.py", line 43, in __init__
gpu003: self.process(dataset).data,
gpu003: File "/data/vayu/train/axolotl/src/axolotl/datasets.py", line 55, in process
gpu003: return dataset.map(
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
gpu003: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
gpu003: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
gpu003: for rank, done, content in iflatmap_unordered(
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 665, in iflatmap_unordered
gpu003: [async_result.get(timeout=0.05) for async_result in async_results]
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 665, in <listcomp>
gpu003: [async_result.get(timeout=0.05) for async_result in async_results]
gpu003: File "/home/vayu/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
gpu003: raise self._value
gpu003: axolotl.prompt_tokenizers.InvalidDataException: None
gpu003: Traceback (most recent call last):
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu003: return _run_code(code, main_globals, None,
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu003: exec(code, run_globals)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in <module>
gpu003: fire.Fire(do_cli)
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu003: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu003: component, remaining_args = _CallAndUpdateTrace(
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu003: component = fn(*varargs, **kwargs)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu003: return do_train(parsed_cfg, parsed_cli_args)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu003: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/cli/__init__.py", line 366, in load_datasets
gpu003: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu003: with zero_first(is_main_process()):
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in __enter__
gpu003: return next(self.gen)
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu003: barrier()
gpu003: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu003: dist.barrier()
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu003: return func(*args, **kwargs)
gpu003: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu003: work = default_pg.barrier(opts=opts)
gpu003: RuntimeError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu003/gpu004/gpu005/gpu006: [the same dist.barrier() traceback then repeats, interleaved across the four nodes, for the remaining ranks (1-5, 7, 10-18, 20, 24, 27, 31), each ending in the identical RuntimeError: "[N] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue."]
"""
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: component = fn(*varargs, **kwargs)
gpu006: return do_train(parsed_cfg, parsed_cli_args)
gpu004: with zero_first(is_main_process()):
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu004: return next(self.gen)
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu004: barrier()
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu005: return do_train(parsed_cfg, parsed_cli_args)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu005: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu006: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu004: dist.barrier()
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu005: with zero_first(is_main_process()):
gpu004: return func(*args, **kwargs)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: with zero_first(is_main_process()):
gpu005: return next(self.gen)
gpu004: work = default_pg.barrier(opts=opts)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu004: RuntimeError: [8] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu005: barrier()
gpu006: return next(self.gen)
gpu004: Traceback (most recent call last):
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu005: dist.barrier()
gpu006: barrier()
gpu004: return _run_code(code, main_globals, None,
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu005: return func(*args, **kwargs)
gpu006: dist.barrier()
gpu004: exec(code, run_globals)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu004: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu005: work = default_pg.barrier(opts=opts)
gpu006: return func(*args, **kwargs)
gpu004: fire.Fire(do_cli)
gpu005: RuntimeError: [23] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu005: Traceback (most recent call last):
gpu006: work = default_pg.barrier(opts=opts)
gpu004: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu006: RuntimeError: [26] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu005: return _run_code(code, main_globals, None,
gpu006: Traceback (most recent call last):
gpu004: component, remaining_args = _CallAndUpdateTrace(
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu005: exec(code, run_globals)
gpu006: return _run_code(code, main_globals, None,
gpu004: component = fn(*varargs, **kwargs)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu004: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu005: fire.Fire(do_cli)
gpu006: exec(code, run_globals)
gpu004: return do_train(parsed_cfg, parsed_cli_args)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu004: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu005: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu006: fire.Fire(do_cli)
gpu004: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu004: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu005: component, remaining_args = _CallAndUpdateTrace(
gpu006: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu004: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: component = fn(*varargs, **kwargs)
gpu006: component, remaining_args = _CallAndUpdateTrace(
gpu004: with zero_first(is_main_process()):
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: return do_train(parsed_cfg, parsed_cli_args)
gpu006: component = fn(*varargs, **kwargs)
gpu004: return next(self.gen)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu006: return do_train(parsed_cfg, parsed_cli_args)
gpu004: barrier()
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu004: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu005: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu004: dist.barrier()
gpu006: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu005: with zero_first(is_main_process()):
gpu004: return func(*args, **kwargs)
gpu006: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu004: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: return next(self.gen)
gpu004: work = default_pg.barrier(opts=opts)
gpu006: with zero_first(is_main_process()):
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu004: RuntimeError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: barrier()
gpu006: return next(self.gen)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: dist.barrier()
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu005: return func(*args, **kwargs)
gpu006: barrier()
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu006: dist.barrier()
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: return func(*args, **kwargs)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: work = default_pg.barrier(opts=opts)
gpu006: RuntimeError: [28] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu006: Traceback (most recent call last):
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu006: return _run_code(code, main_globals, None,
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu006: exec(code, run_globals)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu006: fire.Fire(do_cli)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu006: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu005: work = default_pg.barrier(opts=opts)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu006: component, remaining_args = _CallAndUpdateTrace(
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: component = fn(*varargs, **kwargs)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: return do_train(parsed_cfg, parsed_cli_args)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu006: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu006: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu006: with zero_first(is_main_process()):
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: RuntimeError: [22] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu005: Traceback (most recent call last):
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu005: return _run_code(code, main_globals, None,
gpu006: return next(self.gen)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: exec(code, run_globals)
gpu006: barrier()
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu006: dist.barrier()
gpu005: fire.Fire(do_cli)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: return func(*args, **kwargs)
gpu005: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu006: work = default_pg.barrier(opts=opts)
gpu005: component, remaining_args = _CallAndUpdateTrace(
gpu006: RuntimeError: [30] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: Traceback (most recent call last):
gpu005: component = fn(*varargs, **kwargs)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: return _run_code(code, main_globals, None,
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu006: exec(code, run_globals)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu006: fire.Fire(do_cli)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu006: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu006: component, remaining_args = _CallAndUpdateTrace(
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: component = fn(*varargs, **kwargs)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: return do_train(parsed_cfg, parsed_cli_args)
gpu005: return do_train(parsed_cfg, parsed_cli_args)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu005: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu005: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: with zero_first(is_main_process()):
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu006: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu005: return next(self.gen)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu006: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu005: barrier()
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu006: with zero_first(is_main_process()):
gpu005: dist.barrier()
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu006: return next(self.gen)
gpu005: return func(*args, **kwargs)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: barrier()
gpu005: work = default_pg.barrier(opts=opts)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu005: RuntimeError: [19] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu006: dist.barrier()
gpu005: Traceback (most recent call last):
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu005: return _run_code(code, main_globals, None,
gpu006: return func(*args, **kwargs)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu006: work = default_pg.barrier(opts=opts)
gpu006: RuntimeError: [25] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu006: Traceback (most recent call last):
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
gpu006: return _run_code(code, main_globals, None,
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu006: exec(code, run_globals)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu006: fire.Fire(do_cli)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu006: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
gpu005: exec(code, run_globals)
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu006: component, remaining_args = _CallAndUpdateTrace(
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: component = fn(*varargs, **kwargs)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: return do_train(parsed_cfg, parsed_cli_args)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu006: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu006: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu006: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu006: with zero_first(is_main_process()):
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 59, in
gpu005: fire.Fire(do_cli)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
gpu005: component_trace = _Fire(component, args, parsed_flag_args, context, name)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
gpu005: component, remaining_args = _CallAndUpdateTrace(
gpu006: return next(self.gen)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: component = fn(*varargs, **kwargs)
gpu006: barrier()
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
gpu006: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu005: return do_train(parsed_cfg, parsed_cli_args)
gpu006: dist.barrier()
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/train.py", line 53, in do_train
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu005: dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
gpu006: return func(*args, **kwargs)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/cli/init.py", line 366, in load_datasets
gpu006: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu005: train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
gpu006: work = default_pg.barrier(opts=opts)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/data.py", line 70, in prepare_dataset
gpu006: RuntimeError: [29] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu005: with zero_first(is_main_process()):
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/contextlib.py", line 135, in enter
gpu005: return next(self.gen)
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 70, in zero_first
gpu005: barrier()
gpu005: File "/data/vayu/train/axolotl/src/axolotl/utils/distributed.py", line 36, in barrier
gpu005: dist.barrier()
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
gpu005: return func(*args, **kwargs)
gpu005: File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
gpu005: work = default_pg.barrier(opts=opts)
gpu005: RuntimeError: [21] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.
gpu003: [2024-02-13 02:01:06,917] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739792
gpu003: [2024-02-13 02:01:06,917] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739793
gpu005: [2024-02-13 02:01:07,233] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663086
gpu005: [2024-02-13 02:01:07,414] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663087
gpu003: [2024-02-13 02:01:07,548] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739794
gpu005: [2024-02-13 02:01:07,709] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663088
gpu005: [2024-02-13 02:01:07,710] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663089
gpu003: [2024-02-13 02:01:07,770] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739795
gpu004: [2024-02-13 02:01:07,779] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056702
gpu004: [2024-02-13 02:01:07,780] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056703
gpu003: [2024-02-13 02:01:07,818] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739796
gpu004: [2024-02-13 02:01:07,839] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056704
gpu003: [2024-02-13 02:01:07,842] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739797
gpu003: [2024-02-13 02:01:07,866] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739798
gpu004: [2024-02-13 02:01:07,875] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056705
gpu005: [2024-02-13 02:01:07,889] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663090
gpu003: [2024-02-13 02:01:07,890] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 739799
gpu004: [2024-02-13 02:01:07,909] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056706
gpu003: [2024-02-13 02:01:07,914] [ERROR] [launch.py:321:sigkill_handler] ['/data/vayu/miniconda3/envs/axo/bin/python', '-u', '-m', 'axolotl.cli.train', 'Task-34b-Chat-test.yaml'] exits with return code = 1
gpu005: [2024-02-13 02:01:07,934] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663091
gpu004: [2024-02-13 02:01:07,940] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056707
gpu005: [2024-02-13 02:01:07,958] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663092
gpu004: [2024-02-13 02:01:07,972] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056708
gpu005: [2024-02-13 02:01:07,982] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1663093
gpu004: [2024-02-13 02:01:08,002] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3056709
gpu005: [2024-02-13 02:01:08,005] [ERROR] [launch.py:321:sigkill_handler] ['/data/vayu/miniconda3/envs/axo/bin/python', '-u', '-m', 'axolotl.cli.train', 'Task-34b-Chat-test.yaml'] exits with return code = 1
gpu004: [2024-02-13 02:01:08,032] [ERROR] [launch.py:321:sigkill_handler] ['/data/vayu/miniconda3/envs/axo/bin/python', '-u', '-m', 'axolotl.cli.train', 'Task-34b-Chat-test.yaml'] exits with return code = 1
gpu006: [2024-02-13 02:01:08,205] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690903
gpu006: [2024-02-13 02:01:08,205] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690904
gpu006: [2024-02-13 02:01:08,264] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690905
gpu006: [2024-02-13 02:01:08,301] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690906
gpu006: [2024-02-13 02:01:08,329] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690907
gpu006: [2024-02-13 02:01:08,354] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690908
gpu006: [2024-02-13 02:01:08,379] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690909
gpu006: [2024-02-13 02:01:08,403] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2690910
gpu006: [2024-02-13 02:01:08,427] [ERROR] [launch.py:321:sigkill_handler] ['/data/vayu/miniconda3/envs/axo/bin/python', '-u', '-m', 'axolotl.cli.train', 'Task-34b-Chat-test.yaml'] exits with return code = 1
pdsh@gpu003: gpu003: ssh exited with exit code 1
pdsh@gpu003: gpu004: ssh exited with exit code 1
pdsh@gpu003: gpu005: ssh exited with exit code 1
pdsh@gpu003: gpu006: ssh exited with exit code 1
Traceback (most recent call last):
File "/data/vayu/miniconda3/envs/axo/bin/accelerate", line 8, in
sys.exit(main())
File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
deepspeed_launcher(args)
File "/data/vayu/miniconda3/envs/axo/lib/python3.10/site-packages/accelerate/commands/launch.py", line 712, in deepspeed_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['deepspeed', '--no_local_rank', '--hostfile', '/data/vayu/train/config/hostfile', '--launcher', 'pdsh', '--num_gpus', '8', '--master_port', '9901', '--module', 'axolotl.cli.train', 'Task-34b-Chat-test.yaml']' returned non-zero exit status 1.
"""