Conversation

@ethanhe42

No description provided.

@copy-pr-bot

copy-pr-bot bot commented Jul 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ethanhe42 ethanhe42 requested a review from a team as a code owner July 11, 2025 19:26
@ethanhe42 ethanhe42 requested a review from Copilot July 11, 2025 19:27

Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive diffusion framework under nemo_vfm, including VAE modules, sampling pipelines (RES, EDM, Cosmos), training entrypoints, and utility support for distributed setups.

  • Added VAE components: perceptual+adversarial loss, encoder/decoder blocks, and autoencoder configuration.
  • Implemented multiple sampler backends (RES, EDM, Flow Matching, Cosmos) with full pipeline integrations.
  • Provided training scripts, parallel initialization utilities, and checkpoint conversion tools.

Reviewed Changes

Copilot reviewed 69 out of 467 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| nemo_vfm/diffusion/vae/contperceptual_loss.py | New LPIPS+discriminator loss for VAE training |
| nemo_vfm/diffusion/vae/blocks.py | Core convolutional, ResNet, and attention building blocks |
| nemo_vfm/diffusion/vae/autovae.py | VAE architecture search/generation utilities |
| nemo_vfm/diffusion/vae/autoencoder.py | Full autoencoder implementation with config dataclass |
| nemo_vfm/diffusion/utils/mcore_parallel_utils.py | Megatron model parallel initialization helper |
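
For context, a model-parallel helper like mcore_parallel_utils.py typically wraps Megatron-Core's parallel_state setup. A minimal sketch under that assumption; init_model_parallel and its parameters are hypothetical names, not necessarily the PR's actual API:

```python
# Hypothetical sketch of a Megatron-Core model-parallel init helper;
# not the PR's actual implementation.
import os

import torch
from megatron.core import parallel_state


def init_model_parallel(tp_size: int = 1, pp_size: int = 1) -> None:
    """Initialize torch.distributed, bind the local GPU, and create
    Megatron-Core tensor/pipeline model parallel groups."""
    if not torch.distributed.is_initialized():
        torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=tp_size,
        pipeline_model_parallel_size=pp_size,
    )
```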

```python
super().__init__()
assert disc_loss in ["hinge", "vanilla"]
self.kl_weight = kl_weight
self.pixel_weight = pixelloss_weight
```

Copilot AI Jul 11, 2025

The pixelloss_weight is stored in self.pixel_weight but never applied in forward. Consider multiplying the L1 reconstruction term by self.pixel_weight to respect the configured pixel loss weight.

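A minimal sketch of that fix, assuming the usual LPIPS-with-discriminator forward() pattern; `inputs` and `reconstructions` are assumed argument names, not necessarily this module's:

```python
# Sketch: apply the configured pixel weight to the L1 reconstruction term.
# `inputs` and `reconstructions` are assumed forward() argument names.
rec_loss = torch.abs(inputs.contiguous() - reconstructions.contiguous())
rec_loss = self.pixel_weight * rec_loss  # previously self.pixel_weight was never used
```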
```python
    nll_grads = torch.autograd.grad(nll_loss, last_layer, retain_graph=True)[0]
    g_grads = torch.autograd.grad(g_loss, last_layer, retain_graph=True)[0]
else:
    nll_grads = torch.autograd.grad(nll_loss, self.last_layer[0], retain_graph=True)[0]
```

Copilot AI Jul 11, 2025

The code references self.last_layer, which is never defined. This will raise an AttributeError. You should either require last_layer to always be passed, or initialize self.last_layer appropriately.

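One possible shape of that fix, sketched on the assumption that the excerpt comes from the usual adaptive discriminator-weight computation; the method name and signature are guesses, not the PR's actual code:

```python
def calculate_adaptive_weight(self, nll_loss, g_loss, last_layer):
    # Require the caller to pass last_layer instead of falling back to the
    # undefined self.last_layer attribute.
    if last_layer is None:
        raise ValueError("last_layer must be provided to balance NLL and generator gradients")
    nll_grads = torch.autograd.grad(nll_loss, last_layer, retain_graph=True)[0]
    g_grads = torch.autograd.grad(g_loss, last_layer, retain_graph=True)[0]
    d_weight = torch.norm(nll_grads) / (torch.norm(g_grads) + 1e-4)
    return torch.clamp(d_weight, 0.0, 1e4).detach()
```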
```python
        nn.Module: An instance of the requested attention block.
    """
    assert attn_type in ["vanilla", "linear", "none"], f"attn_type {attn_type} unknown"
    print(f"making attention of type '{attn_type}' with {in_channels} in_channels")
```

Copilot AI Jul 11, 2025

[nitpick] This debug print in make_attn will execute on every block creation and clutter logs. Consider removing it or replacing it with a logger at an appropriate level.

Suggested change:

```diff
-print(f"making attention of type '{attn_type}' with {in_channels} in_channels")
+logging.debug(f"Making attention of type '{attn_type}' with {in_channels} in_channels")
```

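For the suggestion above to work, the module also needs a logger. A minimal standard-library sketch; passing the values as %-style arguments (rather than an f-string) defers formatting until a debug handler is actually enabled:

```python
import logging

logger = logging.getLogger(__name__)


def make_attn(in_channels, attn_type="vanilla"):
    # ... block construction elided ...
    logger.debug("making attention of type '%s' with %d in_channels", attn_type, in_channels)
```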
Comment on lines +127 to +133
```python
search_space_skleton = self._load_base_json_skeleton()
search_space_skleton["down_block_types"] = choice["down_block_types"]
search_space_skleton["up_block_types"] = choice["up_block_types"]
search_space_skleton["block_out_channels"] = choice["block_out_channels"]
search_space_skleton["layers_per_block"] = choice["layers_per_block"]
search_space_skleton["latent_channels"] = choice["latent_channels"]
return search_space_skleton
```

Copilot AI Jul 11, 2025

Typo in variable name: search_space_skleton should likely be search_space_skeleton.

Suggested change:

```diff
-search_space_skleton = self._load_base_json_skeleton()
-search_space_skleton["down_block_types"] = choice["down_block_types"]
-search_space_skleton["up_block_types"] = choice["up_block_types"]
-search_space_skleton["block_out_channels"] = choice["block_out_channels"]
-search_space_skleton["layers_per_block"] = choice["layers_per_block"]
-search_space_skleton["latent_channels"] = choice["latent_channels"]
-return search_space_skleton
+search_space_skeleton = self._load_base_json_skeleton()
+search_space_skeleton["down_block_types"] = choice["down_block_types"]
+search_space_skeleton["up_block_types"] = choice["up_block_types"]
+search_space_skeleton["block_out_channels"] = choice["block_out_channels"]
+search_space_skeleton["layers_per_block"] = choice["layers_per_block"]
+search_space_skeleton["latent_channels"] = choice["latent_channels"]
+return search_space_skeleton
```

```python
state_dict = load_sft(ckpt_path)
missing, unexpected = self.load_state_dict(state_dict)
if len(missing) > 0:
    logger.warning(f"Following keys are missing from checkpoint loaded: {missing}")
```

Copilot AI Jul 11, 2025

logger is not imported or defined in this module, so this line will raise a NameError at runtime. Import the standard Python logging module and define a module-level logger (note that nn.Module does not provide a self.logger attribute).

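A minimal sketch of that fix at module scope:

```python
# At the top of the module, alongside the other imports.
import logging

logger = logging.getLogger(__name__)
```

The excerpt also unpacks `unexpected` but only reports `missing`; if `unexpected` is not logged later in the function, a matching `logger.warning` for it would be a cheap addition.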
Ethan He added 2 commits July 11, 2025 13:19
Signed-off-by: Ethan He <[email protected]>
Signed-off-by: Ethan He <[email protected]>
pablo-garay previously approved these changes Jul 11, 2025
Ethan He added 2 commits July 11, 2025 13:56
The sparse_attention directory was incorrectly configured as a submodule
without a corresponding .gitmodules file, causing CI/CD checkout failures.
This commit converts it to a regular directory with tracked files.

Signed-off-by: Ethan He <[email protected]>
cp
Signed-off-by: Ethan He <[email protected]>
@ethanhe42 ethanhe42 merged commit d6631a9 into main Jul 11, 2025
3 checks passed
@ethanhe42 ethanhe42 deleted the ethanhe42/projects branch July 11, 2025 21:01