Skip to content

[Research] Update fedumm #4390

Open
ZiyueXu77 wants to merge 11 commits intoNVIDIA:mainfrom
ZiyueXu77:fed_umm
Open

[Research] Update fedumm #4390
ZiyueXu77 wants to merge 11 commits intoNVIDIA:mainfrom
ZiyueXu77:fed_umm

Conversation

@ZiyueXu77
Copy link
Copy Markdown
Collaborator

Fixes # .

Description

  • Simplify the whole example to align with the paper experiment itself, remove JanusPro which is not mentioned in the paper
  • Use recipe beyond job
  • Add TB record
  • code restructure

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings April 1, 2026 19:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR streamlines the research/fedumm example to match the FedUMM paper’s BLIP-focused experiment flow, migrating the simulator job to NVFlare’s FedAvgRecipe API and adding TensorBoard logging, while removing the JanusPro backend and related env/scripts.

Changes:

  • Remove JanusPro backend and multi-env launch scripts; simplify backend registration around BLIP-VQA only.
  • Switch job.py to FedAvgRecipe + SimEnv execution and add experiment tracking.
  • Add step-level training logging + TensorBoard scalar recording in the shared training loop and client/baseline scripts.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
research/fedumm/src/model_registry.py Simplifies registry module imports/docs.
research/fedumm/src/januspro_backend.py Removes JanusPro backend implementation.
research/fedumm/src/common.py Adds logging/TensorBoard hooks to training loop; formatting cleanup.
research/fedumm/src/blip_backend.py Adds import fallback and introduces BLIPLoRAModel for recipe model init.
research/fedumm/src/init.py Removes auto-registration behavior.
research/fedumm/scripts/slurm_run.sh Removes SLURM runner script.
research/fedumm/scripts/setup_envs.sh Removes conda env setup helper.
research/fedumm/scripts/launch_januspro.sh Removes JanusPro env wrapper script.
research/fedumm/scripts/launch_blip.sh Removes BLIP env wrapper script.
research/fedumm/requirements.txt Pins datasets and adds TensorBoard/scipy dependencies.
research/fedumm/README.md Rewrites README to the simplified simulator-based workflow.
research/fedumm/job.py Replaces FedJob config with FedAvgRecipe + TensorBoard tracking.
research/fedumm/envs/env_januspro.yml Removes JanusPro conda env file.
research/fedumm/envs/env_blip.yml Removes BLIP conda env file.
research/fedumm/client.py Simplifies to BLIP-only client; adds TensorBoard logging.
research/fedumm/centralized_baseline.py Simplifies to BLIP-only baseline; adds TensorBoard logging.
Comments suppressed due to low confidence (3)

research/fedumm/client.py:87

  • load_dataset(..., trust_remote_code=True) enables execution of arbitrary code from the dataset repository. If this isn’t strictly required for HuggingFaceM4/VQAv2, it should be removed; otherwise consider gating it behind an explicit CLI flag / environment variable and defaulting to False to reduce the security risk.
    research/fedumm/centralized_baseline.py:60
  • load_dataset(..., trust_remote_code=True) enables execution of arbitrary code from the dataset repository. If this isn’t strictly required for HuggingFaceM4/VQAv2, it should be removed; otherwise gate it behind an explicit opt-in flag to reduce the security risk.
    research/fedumm/client.py:97
  • The job enables TensorBoard tracking via add_experiment_tracking(...), but the client uses torch.utils.tensorboard.SummaryWriter, which won’t integrate with NVFlare’s tracking pipeline (and may write to colliding default runs/ directories across simulated clients). Consider using nvflare.client.tracking.SummaryWriter or explicitly setting a per-site log_dir and closing the writer on shutdown.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 1, 2026

Greptile Summary

This PR simplifies the FedUMM research example to focus solely on the BLIP-VQA backbone used in the paper, removing JanusPro support, migrating from FedJob/FedAvg to FedAvgRecipe, and adding TensorBoard logging throughout. Previous review concerns (null cur_round, missing writer.close(), LoRA sub-module save, model. prefix round-trip) are all addressed.

Confidence Score: 5/5

  • Safe to merge; all prior P1 issues are resolved and only minor P2 style suggestions remain.
  • All blocking issues from the previous review round have been addressed. The three remaining findings are P2: a missing lora_dropout in the server model dict (dropout doesn't affect param structure so aggregation is unaffected), an exact datasets pin that could cause install conflicts, and a broad ImportError catch that could obscure dependency errors. None of these block correct FL execution.
  • research/fedumm/job.py (lora_dropout omission), research/fedumm/requirements.txt (exact datasets pin)

Important Files Changed

Filename Overview
research/fedumm/job.py Switched from FedJob+FedAvg to FedAvgRecipe; lora_dropout forwarded to clients via script_args but omitted from server-side BLIPLoRAModel config dict
research/fedumm/client.py Renamed from src/fl_client.py; hardcoded blip_vqa backend, added TensorBoard logging, fixed cur_round fallback to 0, symmetric model. prefix round-trip, writer.close() added
research/fedumm/centralized_baseline.py Renamed from src/local_train.py; hardcoded blip_vqa, added TensorBoard writer with explicit log_dir, fixed LoRA save via sub-module save_pretrained, writer.close() added
research/fedumm/src/blip_backend.py Added decoder_input_ids field, proper label masking for padding tokens, BLIPLoRAModel server wrapper, ValueError on empty eval dataloader
research/fedumm/src/common.py train_one_epoch extended with TensorBoard writer, prefix, log_interval, and global_step_offset parameters; backward() and optimizer step logic unchanged
research/fedumm/requirements.txt Bumped nvflare to >=2.7.2, pinned datasets==2.19.2 exactly, added tensorboard and scipy
research/fedumm/README.md Simplified README to match paper scope; removed JanusPro references and detailed CLI docs, added GPU requirements table

Sequence Diagram

sequenceDiagram
    participant job as job.py (FedAvgRecipe)
    participant server as NVFlare Server (BLIPLoRAModel)
    participant client as client.py × N

    job->>server: init BLIPLoRAModel(lora_r, lora_alpha)
    job->>client: launch with script_args (num_clients, lr, lora_*, …)

    loop FL rounds
        server->>client: FLModel(params={model.* keys})
        client->>client: strip "model." prefix → load_trainable_params
        client->>client: train_one_epoch (TensorBoard loss/step)
        client->>client: evaluate → TensorBoard val/acc
        client->>server: FLModel(params={"model."+k: v}, metrics)
        server->>server: FedAvg aggregate
    end

    server-->>job: run.get_status() / get_result()
Loading

Reviews (7): Last reviewed commit: "Merge branch 'main' into fed_umm" | Re-trigger Greptile

@ZiyueXu77 ZiyueXu77 requested a review from holgerroth April 1, 2026 20:00
@ZiyueXu77
Copy link
Copy Markdown
Collaborator Author

/build

@ZiyueXu77
Copy link
Copy Markdown
Collaborator Author

/build

@holgerroth
Copy link
Copy Markdown
Collaborator

@greptileai review the latest changes.

@holgerroth
Copy link
Copy Markdown
Collaborator

/build

Copy link
Copy Markdown
Collaborator

@holgerroth holgerroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sensible changes that simplify the example.

@ZiyueXu77 ZiyueXu77 enabled auto-merge (squash) April 2, 2026 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants