Skip to content

Conversation

@ZiyueXu77
Copy link
Collaborator

Fixes # .

Description

From job api to recipe

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings December 11, 2025 17:50

This comment was marked as outdated.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 11, 2025

Greptile Summary

This PR successfully converts the LLM HuggingFace example from the low-level FedJob API to the higher-level FedAvgRecipe pattern, modernizing the codebase and improving maintainability.

Key Changes:

  • Refactored job.py to use FedAvgRecipe instead of manually constructing FedJob with controllers and persistors
  • Introduced per_site_config dictionary for flexible per-client configuration (training arguments, commands, GPU assignments)
  • Removed the site- prefix from client names - now uses client IDs directly (e.g., dolly instead of site-dolly)
  • Updated documentation (README.md, MULTINODE.md) to reflect the recipe-based approach with comprehensive examples
  • Fixed nvflare.slurm to use correct site path (dolly instead of site-dolly)
  • Added model_name_or_path attribute storage in model classes for consistency
  • Enhanced client.py with improved documentation and clearer WandB defaults
  • Added --export_config flag to support exporting job configuration without running it

Benefits:

  • Simplified code: Recipe pattern abstracts away boilerplate controller/persistor setup (reduced from ~309 to 238 lines)
  • Better maintainability: Cleaner separation of concerns with recipe encapsulation
  • More flexible configuration: per_site_config enables site-specific settings without conditional logic
  • Improved documentation: Comprehensive updates explain the recipe pattern and multi-node setup clearly

Testing considerations:

  • Verify single-node simulation mode still works correctly
  • Test multi-node training with the updated configuration
  • Confirm quantization filters are properly applied through the recipe
  • Validate that site naming changes work in production deployments

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it's a well-executed refactoring with comprehensive documentation updates
  • Score of 4 reflects a solid refactoring with proper attention to detail. The conversion from FedJob to FedAvgRecipe is clean and maintains backward compatibility in functionality. Previous review comments about site naming have been addressed. The code changes are well-documented and the pattern is simpler than the original implementation. One point deducted because this is a significant architectural change that would benefit from integration testing verification, particularly for multi-node scenarios and quantization filters.
  • No files require special attention - all changes are clean and well-documented

Important Files Changed

Filename Overview
examples/advanced/llm_hf/job.py Converted from FedJob API to FedAvgRecipe pattern with per_site_config support for multi-node training
examples/advanced/llm_hf/client.py Added documentation, improved logging, and clarified WandB defaults - no functional changes
examples/advanced/llm_hf/README.md Updated documentation to reflect recipe-based approach and corrected site naming conventions

Sequence Diagram

sequenceDiagram
    participant User
    participant job.py
    participant FedAvgRecipe
    participant SimEnv/ProdEnv
    participant FL_Server
    participant FL_Client
    participant client.py
    
    User->>job.py: python job.py --client_ids dolly --data_path ./dataset
    job.py->>job.py: Parse arguments and setup per_site_config
    job.py->>FedAvgRecipe: Create recipe with initial_model, per_site_config
    FedAvgRecipe->>FedAvgRecipe: Configure FedAvg controller, model persistor
    job.py->>FedAvgRecipe: Add quantization filters (if enabled)
    job.py->>FedAvgRecipe: Add client timeouts and wrapper script (multi-node)
    job.py->>FedAvgRecipe: recipe.export(job_dir)
    job.py->>SimEnv/ProdEnv: Create execution environment
    job.py->>FedAvgRecipe: recipe.execute(env)
    
    FedAvgRecipe->>FL_Server: Start FL server
    FedAvgRecipe->>FL_Client: Start FL client with per_site_config
    
    loop Each FL Round
        FL_Server->>FL_Client: Send global model
        FL_Client->>client.py: Launch training script (via wrapper if multi-node)
        client.py->>client.py: flare.init(rank=rank)
        client.py->>FL_Client: flare.receive() - get global model
        client.py->>client.py: Load global model weights
        client.py->>client.py: Evaluate global model
        client.py->>client.py: Train locally (SFTTrainer)
        client.py->>client.py: Extract updated weights
        client.py->>FL_Client: flare.send(output_model)
        FL_Client->>FL_Server: Send trained model
        FL_Server->>FL_Server: Aggregate models (FedAvg)
    end
    
    FL_Server->>job.py: Training complete
    job.py->>User: Print job status and results
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@holgerroth holgerroth marked this pull request as draft December 16, 2025 18:10
@ZiyueXu77
Copy link
Collaborator Author

convert to draft, wait for per-site config feature

@chesterxgchen
Copy link
Collaborator

@ZiyueXu77 what happens to this example, why this is still in draft status 1 month later ?

@ZiyueXu77
Copy link
Collaborator Author

ZiyueXu77 commented Jan 11, 2026 via email

@ZiyueXu77 ZiyueXu77 marked this pull request as ready for review January 13, 2026 19:32
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Collaborator

@YuanTingHsieh YuanTingHsieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have some questions regarding how to set client ids and site names,
otherwise LGTM

Copy link
Collaborator

@holgerroth holgerroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I made some minor comments.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@ZiyueXu77
Copy link
Collaborator Author

/build

@ZiyueXu77
Copy link
Collaborator Author

/build

@ZiyueXu77 ZiyueXu77 enabled auto-merge (squash) January 14, 2026 19:00
@ZiyueXu77 ZiyueXu77 merged commit 08aae05 into NVIDIA:main Jan 14, 2026
21 checks passed
@ZiyueXu77 ZiyueXu77 deleted the llm_rcp branch January 14, 2026 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants