Hey, great work and thanks for releasing the code! I'm running into an error while trying to run inference on the released checkpoints. Any idea how to solve this?
----------------------------------------------------------------------------------------------------
args.pretrained_controlnet_path = True, so we're overwriting output dir to that directory.
args.pretrained_controlnet_path = /net/acadia1a/data/sriram/force_prompting/checkpoints/step-5000-checkpoint-point-force.pt
----------------------------------------------------------------------------------------------------
Using image_root_dir_val: datasets/point-force/test/mass_understanding_quantitative/wood/images
Using pretrained_controlnet_path: /net/acadia1a/data/sriram/force_prompting/checkpoints/step-5000-checkpoint-point-force.pt
Using model_type: controlnet_with_force_control_signal
[W603 22:05:38.149099413 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
06/03/2025 22:05:39 - INFO - __main__ - Distributed environment: DistributedType.MULTI_GPU Backend: nccl
Num processes: 8
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
(the same initialization block is printed by process indices 1–7 on cuda:1 through cuda:7)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████████████████████████████| 2/2 [00:13<00:00, 6.59s/it]
Fetching 3 files: 100%|███████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3454.95it/s]
Instantiating CustomCogVideoXTransformer3DModel model under default dtype torch.bfloat16.
{'ofs_embed_dim', 'patch_bias', 'patch_size_t'} was not found in config. Values will be initialized to default values.
Loading checkpoint shards: 100%|████████████████████████████████████████████████| 3/3 [00:00<00:00, 41.91it/s]
Zero-initializing the weights for the channels corresponding to the last 32 channels, out of 64 total channels.
All model checkpoint weights were used when initializing CustomCogVideoXTransformer3DModel.
All the weights of CustomCogVideoXTransformer3DModel were initialized from the model checkpoint at THUDM/CogVideoX-5b-I2V.
If your task is similar to the task the model of the checkpoint was trained on, you can already use CustomCogVideoXTransformer3DModel for predictions without further training.
All model checkpoint weights were used when initializing AutoencoderKLCogVideoX.
All the weights of AutoencoderKLCogVideoX were initialized from the model checkpoint at THUDM/CogVideoX-5b-I2V.
If your task is similar to the task the model of the checkpoint was trained on, you can already use AutoencoderKLCogVideoX for predictions without further training.
(each of the 8 processes prints its own copy of the loading progress bars and zero-initialization messages)
[ Weights from transformer were loaded into controlnet ] [# missing keys: 23 | # unexpected keys: 873]
List of missing keys: ['controlnet_encode_first.0.weight', 'controlnet_encode_first.0.bias', 'controlnet_encode_first.1.weight', 'controlnet_encode_first.1.bias', 'controlnet_encode_second.0.weight', 'controlnet_encode_second.0.bias', 'controlnet_encode_second.1.weight', 'controlnet_encode_second.1.bias', 'controlnet_zero_conv_before.weight', 'controlnet_zero_conv_before.bias', 'patch_embed.proj.weight', 'controlnet_zero_convs_after.0.weight', 'controlnet_zero_convs_after.0.bias', 'controlnet_zero_convs_after.1.weight', 'controlnet_zero_convs_after.1.bias', 'controlnet_zero_convs_after.2.weight', 'controlnet_zero_convs_after.2.bias', 'controlnet_zero_convs_after.3.weight', 'controlnet_zero_convs_after.3.bias', 'controlnet_zero_convs_after.4.weight', 'controlnet_zero_convs_after.4.bias', 'controlnet_zero_convs_after.5.weight', 'controlnet_zero_convs_after.5.bias']
[ Weights from pretrained controlnet was loaded into controlnet ] [# missing keys:: 0 | # unexpected keys: 0]
Found 2 unique prompts to precompute...
... never mind, we already computed and saved all these embeddings! Will just read the json directly.
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/sriram/research/force-prompting/src/force-prompting/train.py", line 811, in <module>
[rank3]: main(args)
[rank3]: File "/home/sriram/research/force-prompting/src/force-prompting/train.py", line 517, in main
[rank3]: do_inference(
[rank3]: File "/home/sriram/research/force-prompting/src/force-prompting/inference.py", line 352, in do_inference
[rank3]: text_encoder=unwrap_model(accelerator, text_encoder),
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sriram/research/force-prompting/src/force-prompting/utils/model_utils.py", line 266, in unwrap_model
[rank3]: model = accelerator.unwrap_model(model)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/accelerator.py", line 2866, in unwrap_model
[rank3]: return extract_model_from_parallel(model, keep_fp32_wrapper, keep_torch_compile)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/utils/other.py", line 176, in extract_model_from_parallel
[rank3]: has_compiled = has_compiled_regions(model)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/utils/other.py", line 70, in has_compiled_regions
[rank3]: if module._modules:
[rank3]: ^^^^^^^^^^^^^^^
[rank3]: AttributeError: 'NoneType' object has no attribute '_modules'
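The failure mode itself is easy to reproduce outside of accelerate: `has_compiled_regions` dereferences `module._modules`, which raises exactly this error when the object passed to `unwrap_model` is `None`. My hypothesis (not confirmed against the repo code) is that `text_encoder` ends up being `None` at `inference.py:352`. A minimal sketch of the failing access:

```python
# Minimal sketch of the crash, assuming the model handed to unwrap_model
# was None (a hypothesis about this issue, not confirmed from the repo).
def touch_modules(module):
    # accelerate's has_compiled_regions effectively evaluates `module._modules`
    return bool(module._modules)

try:
    touch_modules(None)
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute '_modules'
```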
W0603 22:06:27.220000 506430 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 506503 closing signal SIGTERM
W0603 22:06:27.223000 506430 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 506504 closing signal SIGTERM
W0603 22:06:27.223000 506430 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 506505 closing signal SIGTERM
W0603 22:06:27.223000 506430 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 506507 closing signal SIGTERM
W0603 22:06:27.224000 506430 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 506508 closing signal SIGTERM
W0603 22:06:27.225000 506430 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 506509 closing signal SIGTERM
W0603 22:06:27.225000 506430 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 506510 closing signal SIGTERM
E0603 22:06:29.721000 506430 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 506506) of binary: /home/sriram/research/force-prompting/conda-env/bin/python3.11
Traceback (most recent call last):
File "/home/sriram/research/force-prompting/conda-env/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1189, in launch_command
multi_gpu_launcher(args)
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/commands/launch.py", line 815, in multi_gpu_launcher
distrib_run.run(args)
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/force-prompting/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-06-03_22:06:27
host : ma-gpu02.nec-labs.com
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 506506)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(/home/sriram/research/force-prompting/conda-env) sriram@ma-gpu02:~/research/force-prompting$ clear
(/home/sriram/research/force-prompting/conda-env) sriram@ma-gpu02:~/research/force-prompting$ sh run_inference.sh
Found available port: 31850 (attempt 1)
Using force_type: point_force
Using num_validation_videos: 1
Using csv_path_val: datasets/point-force/test/mass_understanding_quantitative/wood/_materialballrollingballonwoodbowling1_obj1_prompt1.csv
Using image_root_dir_val: datasets/point-force/test/mass_understanding_quantitative/wood/images
Using pretrained_controlnet_path: /net/acadia1a/data/sriram/force_prompting/checkpoints/step-5000-checkpoint-point-force.pt
Using model_type: controlnet_with_force_control_signal
----------------------------------------------------------------------------------------------------
args.pretrained_controlnet_path = True, so we're overwriting output dir to that directory.
args.pretrained_controlnet_path = /net/acadia1a/data/sriram/force_prompting/checkpoints/step-5000-checkpoint-point-force.pt
----------------------------------------------------------------------------------------------------
[W603 22:06:59.250510421 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
06/03/2025 22:06:59 - INFO - __main__ - Distributed environment: DistributedType.MULTI_GPU Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████████████████████████████| 2/2 [00:11<00:00, 5.94s/it]
Fetching 3 files: 100%|██████████████████████████████████████████████████████| 3/3 [00:00<00:00, 26434.69it/s]
Instantiating CustomCogVideoXTransformer3DModel model under default dtype torch.bfloat16.
{'patch_size_t', 'ofs_embed_dim', 'patch_bias'} was not found in config. Values will be initialized to default values.
Loading checkpoint shards: 100%|████████████████████████████████████████████████| 3/3 [00:00<00:00, 42.92it/s]
All model checkpoint weights were used when initializing CustomCogVideoXTransformer3DModel.
All the weights of CustomCogVideoXTransformer3DModel were initialized from the model checkpoint at THUDM/CogVideoX-5b-I2V.
If your task is similar to the task the model of the checkpoint was trained on, you can already use CustomCogVideoXTransformer3DModel for predictions without further training.
All model checkpoint weights were used when initializing AutoencoderKLCogVideoX.
All the weights of AutoencoderKLCogVideoX were initialized from the model checkpoint at THUDM/CogVideoX-5b-I2V.
If your task is similar to the task the model of the checkpoint was trained on, you can already use AutoencoderKLCogVideoX for predictions without further training.
Zero-initializing the weights for the channels corresponding to the last 32 channels, out of 64 total channels.
[ Weights from transformer were loaded into controlnet ] [# missing keys: 23 | # unexpected keys: 873]
List of missing keys: (the same 23 keys listed for the first run above)
[ Weights from pretrained controlnet was loaded into controlnet ] [# missing keys:: 0 | # unexpected keys: 0]
Found 2 unique prompts to precompute...
... never mind, we already computed and saved all these embeddings! Will just read the json directly.
06/03/2025 22:07:39 - INFO - __main__ - ***** Running validation *****
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/sriram/research/force-prompting/src/force-prompting/train.py", line 811, in <module>
[rank0]: main(args)
[rank0]: File "/home/sriram/research/force-prompting/src/force-prompting/train.py", line 517, in main
[rank0]: do_inference(
[rank0]: File "/home/sriram/research/force-prompting/src/force-prompting/inference.py", line 352, in do_inference
[rank0]: text_encoder=unwrap_model(accelerator, text_encoder),
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/sriram/research/force-prompting/src/force-prompting/utils/model_utils.py", line 266, in unwrap_model
[rank0]: model = accelerator.unwrap_model(model)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/accelerator.py", line 2866, in unwrap_model
[rank0]: return extract_model_from_parallel(model, keep_fp32_wrapper, keep_torch_compile)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/utils/other.py", line 176, in extract_model_from_parallel
[rank0]: has_compiled = has_compiled_regions(model)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/utils/other.py", line 70, in has_compiled_regions
[rank0]: if module._modules:
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'NoneType' object has no attribute '_modules'
[rank0]:[W603 22:07:39.937354529 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E0603 22:07:41.083000 506810 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 506884) of binary: /home/sriram/research/force-prompting/conda-env/bin/python3.11
Traceback (most recent call last):
File "/home/sriram/research/force-prompting/conda-env/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1189, in launch_command
multi_gpu_launcher(args)
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/accelerate/commands/launch.py", line 815, in multi_gpu_launcher
distrib_run.run(args)
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sriram/research/force-prompting/conda-env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/force-prompting/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-06-03_22:07:41
host : ma-gpu02.nec-labs.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 506884)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================