Conversation

@a-r-r-o-w
Contributor

No description provided.

@a-r-r-o-w a-r-r-o-w requested a review from sayakpaul June 24, 2025 06:43
@DN6
Collaborator

DN6 commented Jun 24, 2025

Hmm, not sure what's happening with the Cosmos AEs. The tests pass locally.

@a-r-r-o-w
Contributor Author

Hmm... let me try to look into it too. Will merge this for now and open a follow-up if I can figure it out.

@a-r-r-o-w a-r-r-o-w merged commit 474a248 into main Jun 24, 2025
28 of 29 checks passed
@a-r-r-o-w a-r-r-o-w deleted the fix-framepack-device-tests branch June 24, 2025 08:19
@a-r-r-o-w
Contributor Author

Hmm, I SSH'ed into the runner and did the exact same environment setup, and it does not fail there; it doesn't fail locally either when I try to match the versions 🤔

logs
root@b0e0581e8ddb:/__w/diffusers/diffusers# python3 -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
root@b0e0581e8ddb:/__w/diffusers/diffusers# python -m uv pip install -e .[quality,test]
   Built file:///__w/diffusers/diffusers
Built 1 editable in 731ms
Resolved 154 packages in 585ms
Downloaded 54 packages in 712ms
Installed 55 packages in 29ms
 + annotated-types==0.7.0
 + babel==2.17.0
 + clean-fid==0.1.35
 + clip-anytorch==2.6.0
 + colorama==0.4.6
 + compel==0.1.8
 + csvw==3.5.1
 + dctorch==0.1.2
 + diffusers==0.34.0.dev0 (from file:///__w/diffusers/diffusers)
 + dlinfo==2.0.0
 + einops==0.8.1
 + exceptiongroup==1.3.0
 + execnet==2.1.1
 + ftfy==6.3.1
 - gitpython==3.1.44
 + gitpython==3.1.18
 + imageio==2.37.0
 + importlib-metadata==8.7.0
 + iniconfig==2.1.0
 + isodate==0.7.2
 + isort==6.0.1
 + jsonmerge==1.9.2
 + k-diffusion==0.1.1.post1
 + kornia==0.8.1
 + kornia-rs==0.1.9
 + language-tags==1.2.0
 + parameterized==0.9.0
 + phonemizer==3.3.0
 + pluggy==1.6.0
 + pydantic==2.11.7
 + pydantic-core==2.33.2
 + pygments==2.19.2
 + pyparsing==3.2.3
 + pytest==8.4.1
 + pytest-timeout==2.4.0
 + pytest-xdist==3.7.0
 + rdflib==7.1.4
 + requests-mock==1.10.0
 + rfc3986==1.5.0
 + ruff==0.9.10
 + scikit-image==0.25.2
 + segments==2.3.0
 + sentencepiece==0.2.0
 + sentry-sdk==2.30.0
 + setproctitle==1.3.6
 + tifffile==2025.5.10
 + tiktoken==0.9.0
 + torchdiffeq==0.2.5
 + torchsde==0.2.6
 + trampoline==0.1.2
 + typing-inspection==0.4.1
 + uritemplate==4.2.0
 - urllib3==2.5.0
 + urllib3==1.26.20
 + wandb==0.20.1
 + wcwidth==0.2.13
 + zipp==3.23.0
root@b0e0581e8ddb:/__w/diffusers/diffusers# python -m uv pip install peft@git+https://github.com/huggingface/peft.git
 Updated https://github.com/huggingface/peft.git (59ef3b9)
Resolved 43 packages in 1.52s
   Built peft @ git+https://github.com/huggingface/peft.git@59ef3b93c8feda05fa92d8de7d588c30907266b5
Downloaded 1 package in 595ms
Installed 1 package in 4ms
 + peft==0.15.2.dev0 (from git+https://github.com/huggingface/peft.git@59ef3b93c8feda05fa92d8de7d588c30907266b5)
root@b0e0581e8ddb:/__w/diffusers/diffusers# pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
Found existing installation: accelerate 1.8.1
Uninstalling accelerate-1.8.1:
  Successfully uninstalled accelerate-1.8.1
 Updated https://github.com/huggingface/accelerate.git (5987d79)
Resolved 39 packages in 1.51s
   Built accelerate @ git+https://github.com/huggingface/accelerate.git@5987d79a538d2270deea1778e5625e869c4936b8
Downloaded 4 packages in 576ms
Installed 5 packages in 22ms
 + accelerate==1.9.0.dev0 (from git+https://github.com/huggingface/accelerate.git@5987d79a538d2270deea1778e5625e869c4936b8)
 - fsspec==2025.3.0
 + fsspec==2025.5.1
 - numpy==1.26.4
 + numpy==2.2.6
 - setuptools==65.5.0
 + setuptools==80.9.0
 - urllib3==1.26.20
 + urllib3==2.5.0
root@b0e0581e8ddb:/__w/diffusers/diffusers# pip uninstall transformers -y && python -m uv pip install -U transformers@git+https://github.com/huggingface/transformers.git --no-deps
Found existing installation: transformers 4.52.4
Uninstalling transformers-4.52.4:
  Successfully uninstalled transformers-4.52.4
 Updated https://github.com/huggingface/transformers.git (d3d835d)
Resolved 1 package in 15.02s
   Built transformers @ git+https://github.com/huggingface/transformers.git@d3d835d4fc145e5062d2153ac23ccd4b3e2c2cbd
Downloaded 1 package in 4.10s
Installed 1 package in 52ms
 + transformers==4.53.0.dev0 (from git+https://github.com/huggingface/transformers.git@d3d835d4fc145e5062d2153ac23ccd4b3e2c2cbd)
root@b0e0581e8ddb:/__w/diffusers/diffusers# pytest -s tests/models/autoencoders/test_models_autoencoder_cosmos.py::AutoencoderKLCosmosTests::test_layerwise_casting_inference
================================================================= test session starts ==================================================================
platform linux -- Python 3.10.18, pytest-8.4.1, pluggy-1.6.0
rootdir: /__w/diffusers/diffusers
configfile: pyproject.toml
plugins: timeout-2.4.0, xdist-3.7.0, requests-mock-1.10.0
collected 1 item

tests/models/autoencoders/test_models_autoencoder_cosmos.py .

================================================================== 1 passed in 4.32s ===================================================================

@a-r-r-o-w
Contributor Author

Managed to reproduce it! If you run the full autoencoder Cosmos test suite with pytest -s tests/models/autoencoders/test_models_autoencoder_cosmos.py, it fails. If you run the specific tests one by one, they pass 🤔

logs
>           self.assertTrue(torch.allclose(base_output[0], new_output[0], atol=1e-5))
E           AssertionError: False is not true

tests/models/test_modeling_common.py:1429: AssertionError
=============================================================== short test summary info ================================================================
FAILED tests/models/autoencoders/test_models_autoencoder_cosmos.py::AutoencoderKLCosmosTests::test_layerwise_casting_inference - AssertionError: np.False_ is not true
FAILED tests/models/autoencoders/test_models_autoencoder_cosmos.py::AutoencoderKLCosmosTests::test_sharded_checkpoints - AssertionError: False is not true
FAILED tests/models/autoencoders/test_models_autoencoder_cosmos.py::AutoencoderKLCosmosTests::test_sharded_checkpoints_with_variant - AssertionError: False is not true
====================================================== 3 failed, 31 passed, 18 skipped in 16.30s =======================================================
root@203e00907ec0:/__w/diffusers/diffusers# pytest -s tests/models/autoencoders/test_models_autoencoder_cosmos.py -k test_layerwise_casting_inference
================================================================= test session starts ==================================================================
platform linux -- Python 3.10.18, pytest-8.4.1, pluggy-1.6.0
rootdir: /__w/diffusers/diffusers
configfile: pyproject.toml
plugins: timeout-2.4.0, requests-mock-1.10.0, xdist-3.7.0
collected 52 items / 51 deselected / 1 selected

tests/models/autoencoders/test_models_autoencoder_cosmos.py .

=========================================================== 1 passed, 51 deselected in 3.82s ===========================================================
root@203e00907ec0:/__w/diffusers/diffusers#
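
This kind of order dependence usually points to an earlier test leaking global state (RNG, dtype defaults, cached modules). As a toy sketch, not the actual diffusers tests, here is how a leaked global default can make a seeded test pass alone but fail after another test:

    # toy_order_dependence.py -- hypothetical illustration, not diffusers code
    import torch

    def test_changes_global_state():
        # Simulates a test that mutates a process-wide default and never restores it.
        torch.set_default_dtype(torch.float16)

    def test_seeded_output():
        # Seeding makes this deterministic in isolation, but does not protect
        # against global state leaked by a previously-run test.
        torch.manual_seed(0)
        x = torch.randn(4)
        # Passes when run alone; fails if test_changes_global_state ran first.
        assert x.dtype == torch.float32

Running only test_seeded_output passes, while running the whole file fails it, mirroring the behavior above.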

@sayakpaul
Member

From some previous experience, we have seen that this kind of failure sometimes stems from a different underlying cause. Would it be possible to provide the full test run output (i.e., the output when you run the full Autoencoder Cosmos test suite)?

@a-r-r-o-w
Contributor Author

Yep. Running the full test suite at once reproduces it locally for me as well. Individually, they all pass.

logs
pytest -s tests/models/autoencoders/test_models_autoencoder_cosmos.py
======================================== test session starts ========================================
platform linux -- Python 3.10.14, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/aryan/work/diffusers
configfile: pyproject.toml
plugins: timeout-2.3.1, requests-mock-1.10.0, xdist-3.6.1, hydra-core-1.3.2, anyio-4.6.2.post1
collected 52 items                                                                                  

tests/models/autoencoders/test_models_autoencoder_cosmos.py An error occurred while trying to fetch /tmp/tmpd1189o4b: Error no file named diffusion_pytorch_model.safetensors found in directory /tmp/tmpd1189o4b.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
.s..sss..sAn error occurred while trying to fetch /tmp/tmpjroy2wfo: Error no file named diffusion_pytorch_model.safetensors found in directory /tmp/tmpjroy2wfo.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
.An error occurred while trying to fetch /tmp/tmpv55lrfcr: Error no file named diffusion_pytorch_model.safetensors found in directory /tmp/tmpv55lrfcr.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /tmp/tmpv55lrfcr: Error no file named diffusion_pytorch_model.safetensors found in directory /tmp/tmpv55lrfcr.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /tmp/tmpn46esqd2: Error no file named diffusion_pytorch_model.safetensors found in directory /tmp/tmpn46esqd2.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /tmp/tmpn46esqd2: Error no file named diffusion_pytorch_model.safetensors found in directory /tmp/tmpn46esqd2.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /tmp/tmpqwtc6ia8: Error no file named diffusion_pytorch_model.safetensors found in directory /tmp/tmpqwtc6ia8.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /tmp/tmpqwtc6ia8: Error no file named diffusion_pytorch_model.safetensors found in directory /tmp/tmpqwtc6ia8.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
.sAn error occurred while trying to fetch /tmp/tmpcg15oaki: Error no file named diffusion_pytorch_model.fp16.safetensors found in directory /tmp/tmpcg15oaki.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /tmp/tmpcg15oaki: Error no file named diffusion_pytorch_model.safetensors found in directory /tmp/tmpcg15oaki.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
.........F..sssssAn error occurred while trying to fetch /tmp/tmpi3ckij4c: Error no file named diffusion_pytorch_model.safetensors found in directory /tmp/tmpi3ckij4c.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
Loading checkpoint shards: 100%|███████████████████████████████████████| 2/2 [00:00<00:00, 46.04it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████| 2/2 [00:00<00:00, 46.65it/s]
F....

============================================= FAILURES ==============================================
_____________________ AutoencoderKLCosmosTests.test_layerwise_casting_inference _____________________

self = <tests.models.autoencoders.test_models_autoencoder_cosmos.AutoencoderKLCosmosTests testMethod=test_layerwise_casting_inference>

    def test_layerwise_casting_inference(self):
        from diffusers.hooks.layerwise_casting import DEFAULT_SKIP_MODULES_PATTERN, SUPPORTED_PYTORCH_LAYERS
    
        torch.manual_seed(0)
        config, inputs_dict = self.prepare_init_args_and_inputs_for_common()
        model = self.model_class(**config).eval()
        model = model.to(torch_device)
        base_slice = model(**inputs_dict)[0].flatten().detach().cpu().numpy()
    
        def check_linear_dtype(module, storage_dtype, compute_dtype):
            patterns_to_check = DEFAULT_SKIP_MODULES_PATTERN
            if getattr(module, "_skip_layerwise_casting_patterns", None) is not None:
                patterns_to_check += tuple(module._skip_layerwise_casting_patterns)
            for name, submodule in module.named_modules():
                if not isinstance(submodule, SUPPORTED_PYTORCH_LAYERS):
                    continue
                dtype_to_check = storage_dtype
                if any(re.search(pattern, name) for pattern in patterns_to_check):
                    dtype_to_check = compute_dtype
                if getattr(submodule, "weight", None) is not None:
                    self.assertEqual(submodule.weight.dtype, dtype_to_check)
                if getattr(submodule, "bias", None) is not None:
                    self.assertEqual(submodule.bias.dtype, dtype_to_check)
    
        def test_layerwise_casting(storage_dtype, compute_dtype):
            torch.manual_seed(0)
            config, inputs_dict = self.prepare_init_args_and_inputs_for_common()
            inputs_dict = cast_maybe_tensor_dtype(inputs_dict, torch.float32, compute_dtype)
            model = self.model_class(**config).eval()
            model = model.to(torch_device, dtype=compute_dtype)
            model.enable_layerwise_casting(storage_dtype=storage_dtype, compute_dtype=compute_dtype)
    
            check_linear_dtype(model, storage_dtype, compute_dtype)
            output = model(**inputs_dict)[0].float().flatten().detach().cpu().numpy()
    
            # The precision test is not very important for fast tests. In most cases, the outputs will not be the same.
            # We just want to make sure that the layerwise casting is working as expected.
            self.assertTrue(numpy_cosine_similarity_distance(base_slice, output) < 1.0)
    
>       test_layerwise_casting(torch.float16, torch.float32)

tests/models/test_modeling_common.py:1570: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/models/test_modeling_common.py:1568: in test_layerwise_casting
    self.assertTrue(numpy_cosine_similarity_distance(base_slice, output) < 1.0)
E   AssertionError: np.False_ is not true
_________________________ AutoencoderKLCosmosTests.test_sharded_checkpoints _________________________

self = <tests.models.autoencoders.test_models_autoencoder_cosmos.AutoencoderKLCosmosTests testMethod=test_sharded_checkpoints>

    @require_torch_accelerator
    def test_sharded_checkpoints(self):
        torch.manual_seed(0)
        config, inputs_dict = self.prepare_init_args_and_inputs_for_common()
        model = self.model_class(**config).eval()
        model = model.to(torch_device)
    
        base_output = model(**inputs_dict)
    
        model_size = compute_module_persistent_sizes(model)[""]
        max_shard_size = int((model_size * 0.75) / (2**10))  # Convert to KB as these test models are small.
        with tempfile.TemporaryDirectory() as tmp_dir:
            model.cpu().save_pretrained(tmp_dir, max_shard_size=f"{max_shard_size}KB")
            self.assertTrue(os.path.exists(os.path.join(tmp_dir, SAFE_WEIGHTS_INDEX_NAME)))
    
            # Now check if the right number of shards exists. First, let's get the number of shards.
            # Since this number can be dependent on the model being tested, it's important that we calculate it
            # instead of hardcoding it.
            expected_num_shards = caculate_expected_num_shards(os.path.join(tmp_dir, SAFE_WEIGHTS_INDEX_NAME))
            actual_num_shards = len([file for file in os.listdir(tmp_dir) if file.endswith(".safetensors")])
            self.assertTrue(actual_num_shards == expected_num_shards)
    
            new_model = self.model_class.from_pretrained(tmp_dir).eval()
            new_model = new_model.to(torch_device)
    
            torch.manual_seed(0)
            if "generator" in inputs_dict:
                _, inputs_dict = self.prepare_init_args_and_inputs_for_common()
            new_output = new_model(**inputs_dict)
    
>           self.assertTrue(torch.allclose(base_output[0], new_output[0], atol=1e-5))
E           AssertionError: False is not true

tests/models/test_modeling_common.py:1391: AssertionError
__________________ AutoencoderKLCosmosTests.test_sharded_checkpoints_with_variant ___________________

self = <tests.models.autoencoders.test_models_autoencoder_cosmos.AutoencoderKLCosmosTests testMethod=test_sharded_checkpoints_with_variant>

    @require_torch_accelerator
    def test_sharded_checkpoints_with_variant(self):
        torch.manual_seed(0)
        config, inputs_dict = self.prepare_init_args_and_inputs_for_common()
        model = self.model_class(**config).eval()
        model = model.to(torch_device)
    
        base_output = model(**inputs_dict)
    
        model_size = compute_module_persistent_sizes(model)[""]
        max_shard_size = int((model_size * 0.75) / (2**10))  # Convert to KB as these test models are small.
        variant = "fp16"
        with tempfile.TemporaryDirectory() as tmp_dir:
            # It doesn't matter if the actual model is in fp16 or not. Just adding the variant and
            # testing if loading works with the variant when the checkpoint is sharded should be
            # enough.
            model.cpu().save_pretrained(tmp_dir, max_shard_size=f"{max_shard_size}KB", variant=variant)
    
            index_filename = _add_variant(SAFE_WEIGHTS_INDEX_NAME, variant)
            self.assertTrue(os.path.exists(os.path.join(tmp_dir, index_filename)))
    
            # Now check if the right number of shards exists. First, let's get the number of shards.
            # Since this number can be dependent on the model being tested, it's important that we calculate it
            # instead of hardcoding it.
            expected_num_shards = caculate_expected_num_shards(os.path.join(tmp_dir, index_filename))
            actual_num_shards = len([file for file in os.listdir(tmp_dir) if file.endswith(".safetensors")])
            self.assertTrue(actual_num_shards == expected_num_shards)
    
            new_model = self.model_class.from_pretrained(tmp_dir, variant=variant).eval()
            new_model = new_model.to(torch_device)
    
            torch.manual_seed(0)
            if "generator" in inputs_dict:
                _, inputs_dict = self.prepare_init_args_and_inputs_for_common()
            new_output = new_model(**inputs_dict)
    
>           self.assertTrue(torch.allclose(base_output[0], new_output[0], atol=1e-5))
E           AssertionError: False is not true

tests/models/test_modeling_common.py:1429: AssertionError
========================================= warnings summary ==========================================
../../../../raid/aryan/nightly-venv/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../raid/aryan/nightly-venv/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../raid/aryan/nightly-venv/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../raid/aryan/nightly-venv/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
  /raid/aryan/nightly-venv/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================== short test summary info ======================================
FAILED tests/models/autoencoders/test_models_autoencoder_cosmos.py::AutoencoderKLCosmosTests::test_layerwise_casting_inference - AssertionError: np.False_ is not true
FAILED tests/models/autoencoders/test_models_autoencoder_cosmos.py::AutoencoderKLCosmosTests::test_sharded_checkpoints - AssertionError: False is not true
FAILED tests/models/autoencoders/test_models_autoencoder_cosmos.py::AutoencoderKLCosmosTests::test_sharded_checkpoints_with_variant - AssertionError: False is not true
======================= 3 failed, 32 passed, 17 skipped, 4 warnings in 38.27s =======================

@a-r-r-o-w
Contributor Author

a-r-r-o-w commented Jun 24, 2025

From a quick bisect, it looks like the tests are failing after this PR: #11682

Taking a look to understand what changed.

Update: I can confirm that removing the test_group_offloading_with_disk test results in all the other tests passing, so I think focusing on understanding that test will give us a solution.
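
For reference, a minimal sketch of one way to pin down which earlier test pollutes state for a later one: run each candidate test pairwise with the known-failing test. The helper below is hypothetical, written for illustration, and not part of the diffusers test suite:

    # find_polluter.py -- hypothetical helper, not part of the diffusers test suite.
    # Runs each collected test followed by the known-failing one; if the pair fails
    # while the failing test passes alone, the first test is the likely polluter.
    import subprocess

    SUITE = "tests/models/autoencoders/test_models_autoencoder_cosmos.py"
    VICTIM = SUITE + "::AutoencoderKLCosmosTests::test_layerwise_casting_inference"

    # Collect the test ids in the suite ("-q" prints one id per line).
    collected = subprocess.run(
        ["pytest", SUITE, "--collect-only", "-q"],
        capture_output=True, text=True, check=False,
    ).stdout.splitlines()
    test_ids = [line.strip() for line in collected if "::" in line]

    for candidate in test_ids:
        if candidate == VICTIM:
            continue
        # pytest runs the two ids in the order given on the command line.
        result = subprocess.run(
            ["pytest", candidate, VICTIM], capture_output=True, check=False
        )
        if result.returncode != 0:
            print(f"possible polluter: {candidate}")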
