[Research] Update fedumm #4390

Merged
ZiyueXu77 merged 12 commits into NVIDIA:main from ZiyueXu77:fed_umm
Apr 9, 2026

@ZiyueXu77
Collaborator

Fixes # .

Description

  • Simplify the example to match the paper's experiments; remove JanusPro, which the paper does not mention
  • Use a recipe instead of a manual job definition
  • Add TensorBoard (TB) logging
  • Restructure the code

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings April 1, 2026 19:12
Contributor

Copilot AI left a comment

Pull request overview

This PR streamlines the research/fedumm example to match the FedUMM paper’s BLIP-focused experiment flow, migrating the simulator job to NVFlare’s FedAvgRecipe API and adding TensorBoard logging, while removing the JanusPro backend and related env/scripts.

Changes:

  • Remove JanusPro backend and multi-env launch scripts; simplify backend registration around BLIP-VQA only.
  • Switch job.py to FedAvgRecipe + SimEnv execution and add experiment tracking.
  • Add step-level training logging + TensorBoard scalar recording in the shared training loop and client/baseline scripts.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| research/fedumm/src/model_registry.py | Simplifies registry module imports/docs. |
| research/fedumm/src/januspro_backend.py | Removes JanusPro backend implementation. |
| research/fedumm/src/common.py | Adds logging/TensorBoard hooks to training loop; formatting cleanup. |
| research/fedumm/src/blip_backend.py | Adds import fallback and introduces BLIPLoRAModel for recipe model init. |
| research/fedumm/src/__init__.py | Removes auto-registration behavior. |
| research/fedumm/scripts/slurm_run.sh | Removes SLURM runner script. |
| research/fedumm/scripts/setup_envs.sh | Removes conda env setup helper. |
| research/fedumm/scripts/launch_januspro.sh | Removes JanusPro env wrapper script. |
| research/fedumm/scripts/launch_blip.sh | Removes BLIP env wrapper script. |
| research/fedumm/requirements.txt | Pins datasets and adds TensorBoard/scipy dependencies. |
| research/fedumm/README.md | Rewrites README to the simplified simulator-based workflow. |
| research/fedumm/job.py | Replaces FedJob config with FedAvgRecipe + TensorBoard tracking. |
| research/fedumm/envs/env_januspro.yml | Removes JanusPro conda env file. |
| research/fedumm/envs/env_blip.yml | Removes BLIP conda env file. |
| research/fedumm/client.py | Simplifies to BLIP-only client; adds TensorBoard logging. |
| research/fedumm/centralized_baseline.py | Simplifies to BLIP-only baseline; adds TensorBoard logging. |
Comments suppressed due to low confidence (3)

research/fedumm/client.py:87

  • load_dataset(..., trust_remote_code=True) enables execution of arbitrary code from the dataset repository. If this isn’t strictly required for HuggingFaceM4/VQAv2, it should be removed; otherwise consider gating it behind an explicit CLI flag / environment variable and defaulting to False to reduce the security risk.

research/fedumm/centralized_baseline.py:60

  • load_dataset(..., trust_remote_code=True) enables execution of arbitrary code from the dataset repository. If this isn’t strictly required for HuggingFaceM4/VQAv2, it should be removed; otherwise gate it behind an explicit opt-in flag to reduce the security risk.

research/fedumm/client.py:97

  • The job enables TensorBoard tracking via add_experiment_tracking(...), but the client uses torch.utils.tensorboard.SummaryWriter, which won’t integrate with NVFlare’s tracking pipeline (and may write to colliding default runs/ directories across simulated clients). Consider using nvflare.client.tracking.SummaryWriter or explicitly setting a per-site log_dir and closing the writer on shutdown.
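
The two mitigations suggested above (an opt-in flag for trust_remote_code and a per-site log directory) could look roughly like the following. This is a minimal sketch using only the standard library, not the code in this PR: the flag names `--trust_remote_code` and `--tb_root`, and the helpers `build_parser`, `dataset_kwargs`, and `site_log_dir`, are hypothetical names chosen for illustration.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """CLI with an explicit opt-in for remote dataset code (off by default)."""
    parser = argparse.ArgumentParser(description="Hypothetical client options")
    parser.add_argument(
        "--trust_remote_code",
        action="store_true",  # absent flag means False
        help="Allow the dataset repository to run its own loading code.",
    )
    parser.add_argument(
        "--tb_root",
        default="tb_logs",
        help="Root directory for TensorBoard runs.",
    )
    return parser


def dataset_kwargs(args: argparse.Namespace) -> dict:
    """Keyword arguments a caller could forward to load_dataset(...)."""
    return {"trust_remote_code": args.trust_remote_code}


def site_log_dir(tb_root: str, site_name: str) -> str:
    """One log directory per simulated site, so writers do not collide."""
    return os.path.join(tb_root, site_name)


# Usage: parse an explicit argument list instead of sys.argv.
default_args = build_parser().parse_args([])
opt_in_args = build_parser().parse_args(["--trust_remote_code"])
print(dataset_kwargs(default_args), dataset_kwargs(opt_in_args))
print(site_log_dir(default_args.tb_root, "site-1"))
```

With this shape, remote code runs only when the user passes the flag, and the directory returned by `site_log_dir` could be handed to whichever SummaryWriter the client ends up using.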

@greptile-apps
Contributor

greptile-apps bot commented Apr 1, 2026

Greptile Summary

This PR simplifies the FedUMM research example to align with the paper: JanusPro is removed, the implementation is hardcoded to the BLIP-VQA backend, FedAvgRecipe replaces the manual job setup, and TensorBoard logging is added to both the FL client and the centralized baseline. Prior review concerns (missing cur_round guard, writer.close(), and save_pretrained for LoRA sub-modules) have all been addressed.

Confidence Score: 4/5

PR is safe to merge after confirming whether the per-round optimizer reset is intentional.

All prior P1 concerns have been addressed. One new P1 flag — optimizer state discarded each FL round — may be the intended FedAvg behavior but is undocumented and worth confirming with the author before merge.

research/fedumm/client.py — optimizer instantiation inside the FL loop (lines 155–160).
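
To make the optimizer question concrete, here is a toy sketch of what "state discarded each round" means. It is not the client code from this PR: `MomentumSGD` and `run_rounds` are hypothetical helpers, the objective is a scalar quadratic, and the sketch does not model the server replacing local weights with the aggregated model between rounds.

```python
class MomentumSGD:
    """Minimal SGD with momentum on a single scalar parameter."""

    def __init__(self, lr: float = 0.1, momentum: float = 0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = 0.0  # optimizer state that a re-created optimizer loses

    def step(self, param: float, grad: float) -> float:
        self.velocity = self.momentum * self.velocity + grad
        return param - self.lr * self.velocity


def run_rounds(num_rounds: int, steps_per_round: int, reset_each_round: bool) -> float:
    """Minimize f(w) = w**2 / 2 (gradient is w) over several simulated rounds."""
    w = 1.0
    opt = MomentumSGD()
    for _ in range(num_rounds):
        if reset_each_round:
            # Mirrors constructing the optimizer inside the FL round loop.
            opt = MomentumSGD()
        for _ in range(steps_per_round):
            w = opt.step(w, grad=w)
    return w


w_reset = run_rounds(num_rounds=3, steps_per_round=2, reset_each_round=True)
w_kept = run_rounds(num_rounds=3, steps_per_round=2, reset_each_round=False)
print(f"reset each round: {w_reset:.6f}, persistent optimizer: {w_kept:.6f}")
```

The two runs end at different values because the persistent optimizer carries its velocity across round boundaries. Either behavior can be defensible in federated training; the point of the review flag is only that the choice should be deliberate and documented.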

Vulnerabilities

No security concerns identified.

Important Files Changed

| Filename | Overview |
| --- | --- |
| research/fedumm/client.py | FL client for federated BLIP-VQA fine-tuning; hardcoded to blip_vqa backend, TensorBoard added, cur_round fallback fixed. Minor: optimizer re-created each round (possibly intentional), SummaryWriter initialized slightly before site name is available. |
| research/fedumm/job.py | Switched to FedAvgRecipe-based job; model_name_or_path is conditionally forwarded to client scripts; clean refactor. |
| research/fedumm/centralized_baseline.py | Centralized baseline now saves each PEFT sub-module individually and includes a TensorBoard writer; looks correct. |
| research/fedumm/src/common.py | Shared helpers; empty-dataloader guard raises ValueError as per custom rule; Dirichlet partition logic unchanged and correct. |
| research/fedumm/src/blip_backend.py | BLIP-VQA backend including BLIPLoRAModel server wrapper; LoRA applied to text_encoder/text_decoder; evaluate raises on empty loader. |
| research/fedumm/README.md | README significantly simplified to match paper scope; JanusPro removed; setup instructions updated. |

Sequence Diagram

sequenceDiagram
    participant job.py
    participant Server (FedAvgRecipe)
    participant client.py (site-N)

    job.py->>Server (FedAvgRecipe): FedAvgRecipe.execute(SimEnv)
    Server (FedAvgRecipe)->>Server (FedAvgRecipe): BLIPLoRAModel init (LoRA on CPU)
    loop num_rounds
        Server (FedAvgRecipe)->>client.py (site-N): flare.send(FLModel with LoRA params)
        client.py (site-N)->>client.py (site-N): load_trainable_params(model, params)
        client.py (site-N)->>client.py (site-N): train_one_epoch (local_epochs) + TensorBoard
        client.py (site-N)->>client.py (site-N): backend.evaluate → val acc + TensorBoard
        client.py (site-N)->>Server (FedAvgRecipe): flare.send(FLModel with LoRA updates + metrics)
        Server (FedAvgRecipe)->>Server (FedAvgRecipe): FedAvg aggregate LoRA deltas
    end
    Server (FedAvgRecipe)->>job.py: run.get_result()
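
The aggregation step in the diagram can be sketched in plain Python. This is an illustrative weighted average only, not NVFlare's aggregator: the `fedavg` helper, the `lora_A` key, and weighting by sample count are assumptions, and the sketch averages parameter values directly rather than deltas.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Params, int]]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    if not updates:
        raise ValueError("fedavg needs at least one client update")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    result: Params = {}
    for key, first_values in updates[0][0].items():
        acc = [0.0] * len(first_values)
        for params, n in updates:
            for i, value in enumerate(params[key]):
                acc[i] += value * n / total  # weight = client share of all samples
        result[key] = acc
    return result


# Two hypothetical sites returning a tiny LoRA tensor, flattened to a list.
site_1 = ({"lora_A": [1.0, 2.0]}, 100)
site_2 = ({"lora_A": [3.0, 6.0]}, 300)
global_params = fedavg([site_1, site_2])
print(global_params)
```

Because only the LoRA parameters travel between server and clients in the flow above, the dictionaries being averaged stay small relative to the full BLIP model.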

Reviews (8): Last reviewed commit: "Merge branch 'main' into fed_umm"

@ZiyueXu77 ZiyueXu77 requested a review from holgerroth April 1, 2026 20:00
@ZiyueXu77
Collaborator Author

/build

@ZiyueXu77
Collaborator Author

/build

@holgerroth
Collaborator

@greptileai review the latest changes.

@holgerroth
Collaborator

/build

Collaborator

@holgerroth holgerroth left a comment

Sensible changes that simplify the example.

@ZiyueXu77 ZiyueXu77 enabled auto-merge (squash) April 2, 2026 15:53
@ZiyueXu77
Collaborator Author

/build

@ZiyueXu77 ZiyueXu77 merged commit 8cc09d7 into NVIDIA:main Apr 9, 2026
29 checks passed
3 participants