Convert job to recipe for LLM_HF example #3888

ZiyueXu77 · 2025-12-11T17:50:30Z

Fixes # .

Description

Convert the LLM_HF example from the Job API to the Recipe API.

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Quick tests passed locally by running ./runtest.sh.
In-line docstrings updated.
Documentation updated.

greptile-apps · 2025-12-11T18:06:30Z

Greptile Summary

This PR successfully converts the LLM HuggingFace example from the low-level FedJob API to the higher-level FedAvgRecipe pattern, modernizing the codebase and improving maintainability.

Key Changes:

Refactored job.py to use FedAvgRecipe instead of manually constructing FedJob with controllers and persistors
Introduced per_site_config dictionary for flexible per-client configuration (training arguments, commands, GPU assignments)
Removed the site- prefix from client names - now uses client IDs directly (e.g., dolly instead of site-dolly)
Updated documentation (README.md, MULTINODE.md) to reflect the recipe-based approach with comprehensive examples
Fixed nvflare.slurm to use correct site path (dolly instead of site-dolly)
Added model_name_or_path attribute storage in model classes for consistency
Enhanced client.py with improved documentation and clearer WandB defaults
Added --export_config flag to support exporting job configuration without running it
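To make the per_site_config idea above concrete, here is a minimal sketch of how per-client settings could be assembled from the command-line arguments. It is not the code in job.py. The flag names (--client_ids, --data_path, --export_config) come from this summary, but their types and defaults, the helper build_per_site_config, the dictionary keys "train_args" and "gpus", and the second client ID "alpaca" are all assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    # Flag names are taken from the PR summary; types and defaults are assumed.
    parser = argparse.ArgumentParser()
    parser.add_argument("--client_ids", nargs="+", default=["dolly"])
    parser.add_argument("--data_path", default="./dataset")
    parser.add_argument("--export_config", action="store_true")
    return parser.parse_args(argv)


def build_per_site_config(client_ids, data_path, gpus_per_site=1):
    """Return one settings entry per client, keyed by the bare client ID."""
    per_site_config = {}
    for index, client_id in enumerate(client_ids):
        first_gpu = index * gpus_per_site
        gpu_ids = list(range(first_gpu, first_gpu + gpus_per_site))
        per_site_config[client_id] = {
            # Hypothetical keys: the real schema is defined by the recipe.
            "train_args": f"--data_path {data_path}/{client_id}",
            "gpus": gpu_ids,
        }
    return per_site_config


# "alpaca" is an illustrative second client, not taken from this PR.
args = parse_args(["--client_ids", "dolly", "alpaca", "--data_path", "./dataset"])
config = build_per_site_config(args.client_ids, args.data_path)
print(config["dolly"]["train_args"])
```

Keying the dictionary by the bare client ID (dolly rather than site-dolly) mirrors the naming change described above, and a per-client entry avoids branching on the site name inside the training setup.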

Benefits:

Simplified code: Recipe pattern abstracts away boilerplate controller/persistor setup (reduced from ~309 to 238 lines)
Better maintainability: Cleaner separation of concerns with recipe encapsulation
More flexible configuration: per_site_config enables site-specific settings without conditional logic
Improved documentation: Comprehensive updates explain the recipe pattern and multi-node setup clearly

Testing considerations:

Verify single-node simulation mode still works correctly
Test multi-node training with the updated configuration
Confirm quantization filters are properly applied through the recipe
Validate that site naming changes work in production deployments

Confidence Score: 4/5

This PR is safe to merge with minimal risk - it's a well-executed refactoring with comprehensive documentation updates
Score of 4 reflects a solid refactoring with proper attention to detail. The conversion from FedJob to FedAvgRecipe is clean and maintains backward compatibility in functionality. Previous review comments about site naming have been addressed. The code changes are well-documented and the pattern is simpler than the original implementation. One point deducted because this is a significant architectural change that would benefit from integration testing verification, particularly for multi-node scenarios and quantization filters.
No files require special attention - all changes are clean and well-documented

Important Files Changed

Filename	Overview
examples/advanced/llm_hf/job.py	Converted from FedJob API to FedAvgRecipe pattern with per_site_config support for multi-node training
examples/advanced/llm_hf/client.py	Added documentation, improved logging, and clarified WandB defaults - no functional changes
examples/advanced/llm_hf/README.md	Updated documentation to reflect recipe-based approach and corrected site naming conventions

Sequence Diagram

sequenceDiagram
    participant User
    participant job.py
    participant FedAvgRecipe
    participant SimEnv/ProdEnv
    participant FL_Server
    participant FL_Client
    participant client.py
    
    User->>job.py: python job.py --client_ids dolly --data_path ./dataset
    job.py->>job.py: Parse arguments and setup per_site_config
    job.py->>FedAvgRecipe: Create recipe with initial_model, per_site_config
    FedAvgRecipe->>FedAvgRecipe: Configure FedAvg controller, model persistor
    job.py->>FedAvgRecipe: Add quantization filters (if enabled)
    job.py->>FedAvgRecipe: Add client timeouts and wrapper script (multi-node)
    job.py->>FedAvgRecipe: recipe.export(job_dir)
    job.py->>SimEnv/ProdEnv: Create execution environment
    job.py->>FedAvgRecipe: recipe.execute(env)
    
    FedAvgRecipe->>FL_Server: Start FL server
    FedAvgRecipe->>FL_Client: Start FL client with per_site_config
    
    loop Each FL Round
        FL_Server->>FL_Client: Send global model
        FL_Client->>client.py: Launch training script (via wrapper if multi-node)
        client.py->>client.py: flare.init(rank=rank)
        client.py->>FL_Client: flare.receive() - get global model
        client.py->>client.py: Load global model weights
        client.py->>client.py: Evaluate global model
        client.py->>client.py: Train locally (SFTTrainer)
        client.py->>client.py: Extract updated weights
        client.py->>FL_Client: flare.send(output_model)
        FL_Client->>FL_Server: Send trained model
        FL_Server->>FL_Server: Aggregate models (FedAvg)
    end
    
    FL_Server->>job.py: Training complete
    job.py->>User: Print job status and results
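The "Aggregate models (FedAvg)" step in the diagram above can be sketched in plain Python as follows. This shows the standard FedAvg rule, a sample-count-weighted average of client weights; it is not the NVFlare implementation, and both the function name fedavg and the assumption that each client reports its local sample count are illustrative.

```python
def fedavg(client_updates):
    """Weighted-average client weights by number of local training samples.

    client_updates is a list of (weights, num_samples) pairs, where weights
    maps a parameter name to a list of floats.
    """
    total = sum(num_samples for _, num_samples in client_updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    reference_weights = client_updates[0][0]
    aggregated = {}
    for name, values in reference_weights.items():
        aggregated[name] = [
            sum(weights[name][i] * n for weights, n in client_updates) / total
            for i in range(len(values))
        ]
    return aggregated


# Two hypothetical clients with 100 and 300 local samples.
updates = [
    ({"layer.weight": [1.0, 2.0]}, 100),
    ({"layer.weight": [3.0, 4.0]}, 300),
]
global_weights = fedavg(updates)
print(global_weights)  # {'layer.weight': [2.5, 3.5]}
```

The client with three times as many samples pulls the average three times as hard, which is why the result sits closer to its weights.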

greptile-apps

1 file reviewed, no comments

greptile-apps

1 file reviewed, no comments

ZiyueXu77 · 2026-01-05T15:38:30Z

convert to draft, wait for per-site config feature

chesterxgchen · 2026-01-11T02:15:08Z

@ZiyueXu77 What is happening with this example? Why is it still in draft status one month later?

ZiyueXu77 · 2026-01-11T03:07:49Z

It depends on Yuanting's multi-GPU/multi-node functionality being updated, so it has been waiting for that.

greptile-apps

3 files reviewed, 1 comment

examples/advanced/llm_hf/job.py

YuanTingHsieh

I have some questions regarding how to set client IDs and site names; otherwise LGTM.

examples/advanced/llm_hf/job.py

examples/advanced/llm_hf/client.py

examples/advanced/llm_hf/job.py

examples/advanced/llm_hf/README.md

holgerroth

Looks good. I made some minor comments.

greptile-apps

6 files reviewed, 2 comments

examples/advanced/llm_hf/job.py

examples/advanced/llm_hf/README.md

ZiyueXu77 · 2026-01-14T18:24:26Z

/build

ZiyueXu77 · 2026-01-14T18:59:59Z

/build

convert job to recipe

5dc5b01

Copilot AI review requested due to automatic review settings December 11, 2025 17:50

Copilot started reviewing on behalf of ZiyueXu77 December 11, 2025 17:51

greptile-apps bot reviewed Dec 11, 2025

bug correction, further polish

946ae81

greptile-apps bot reviewed Dec 11, 2025

ZiyueXu77 requested review from YuanTingHsieh, holgerroth and nvkevlu December 11, 2025 19:16

holgerroth marked this pull request as draft December 16, 2025 18:10

ZiyueXu77 added 2 commits January 13, 2026 13:42

Merge branch 'NVIDIA:main' into llm_rcp

1a93502

update llm example with latest multi-gpu func

2e61e2d

ZiyueXu77 marked this pull request as ready for review January 13, 2026 19:32

greptile-apps bot reviewed Jan 13, 2026

examples/advanced/llm_hf/job.py (resolved)

ZiyueXu77 added 3 commits January 13, 2026 14:59

update readmes

e3e2229

fix arg issue

8e46477

Merge branch 'main' into llm_rcp

ffee274

YuanTingHsieh reviewed Jan 13, 2026

examples/advanced/llm_hf/job.py (outdated, resolved)

examples/advanced/llm_hf/job.py (outdated, resolved)

examples/advanced/llm_hf/job.py (outdated, resolved)

ZiyueXu77 added 2 commits January 14, 2026 09:50

further polishes

885912a

Merge branch 'main' into llm_rcp

c6cc74c