-
Notifications
You must be signed in to change notification settings - Fork 233
Fix Job API TF examples #3927
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Job API TF examples #3927
Conversation
|
/build |
chesterxgchen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although LGTM, but this can be avoided simply ask user to download_data first
Greptile SummaryThis PR applies fixes from #3926 to the main branch, updating NVFlare version constraints and adding robust CIFAR10 dataset loading. The changes update the The main enhancement is the addition of
Key Changes:
Issues Found:
Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant cifar10_split
participant _partition_data
participant load_cifar10_with_retry
participant FileLock
participant TensorFlow
participant FileSystem
User->>cifar10_split: cifar10_split(split_dir, num_sites, alpha, seed)
cifar10_split->>_partition_data: _partition_data(num_sites, alpha)
_partition_data->>load_cifar10_with_retry: load_cifar10_with_retry()
loop For each retry attempt (max 3)
load_cifar10_with_retry->>FileLock: Acquire lock on /tmp/cifar10_download.lock
FileLock-->>load_cifar10_with_retry: Lock acquired
alt Retry attempt > 0
load_cifar10_with_retry->>FileSystem: Check if ~/.keras/datasets/cifar-10-batches-py exists
FileSystem-->>load_cifar10_with_retry: Directory exists
load_cifar10_with_retry->>FileSystem: shutil.rmtree(cifar10_path)
FileSystem-->>load_cifar10_with_retry: Corrupted data removed
end
load_cifar10_with_retry->>TensorFlow: datasets.cifar10.load_data()
alt Success
TensorFlow-->>load_cifar10_with_retry: Return (train_data, test_data)
load_cifar10_with_retry->>FileLock: Release lock
load_cifar10_with_retry-->>_partition_data: Return dataset
else Timeout or UnpicklingError
TensorFlow-->>load_cifar10_with_retry: Exception
load_cifar10_with_retry->>FileLock: Release lock
alt Not final attempt
load_cifar10_with_retry->>load_cifar10_with_retry: time.sleep(retry_delay)
else Final attempt
load_cifar10_with_retry-->>_partition_data: Raise RuntimeError
end
end
end
_partition_data->>_partition_data: Partition data using Dirichlet sampling
_partition_data-->>cifar10_split: Return site_idx, class_sum
cifar10_split->>FileSystem: Save site splits to .npy files
cifar10_split-->>User: Return train_idx_paths
|
|
/build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
examples/advanced/job_api/tf/src/cifar10_data_split.py, line 166 (link)logic: unreachable code - loop will always raise at line 162 when last attempt fails
4 files reviewed, 1 comment
|
/build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
examples/advanced/job_api/tf/src/cifar10_data_split.py, line 166 (link)logic: this line is unreachable because the loop either returns successfully or raises an exception on the last retry attempt (line 162)
4 files reviewed, 1 comment
|
/build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
examples/advanced/job_api/tf/src/cifar10_data_split.py, line 142 (link)logic: Lock path uses absolute
/tmpdirectory which may not exist or be writable on all systems (e.g., Windows). The reference implementation inexamples/advanced/cifar10/tf/src/cifar10_data_split.py:144uses a relative path"cifar10_download.lock"which is more portable.
4 files reviewed, 1 comment
|
Just look at this job API example structure it is not align with out new example job structure. we should change it, correct ? or we should remove it ( as @YuanTingHsieh was suggesting) |
|
/build |
|
/build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR applies fixes from #3926 to the main branch, updating NVFlare version constraints and adding robust CIFAR10 dataset loading.
Key Changes:
- Updated
nvflaredependency from~=2.5.0rcto>=2.5.0across all three job API examples (PyTorch, sklearn, TensorFlow) - Added
load_cifar10_with_retry()function with file locking and retry mechanism to address concurrent download issues
Issues Found:
- Unreachable dead code at line 166 (logic error)
- Inconsistent use of
pickleinstead of_pickle(style) - Hardcoded
/tmppath reduces Windows portability (style) - Import statement inside retry loop (style)
- Narrow exception handling may miss other download failures (style)
Confidence Score: 3/5
- This PR is moderately safe to merge with one logic error and several style improvements needed
- Score reflects one unreachable dead code issue (logic error) that should be removed, plus multiple style inconsistencies compared to other examples in the codebase. The version updates are correct, and the retry mechanism addresses the intended problem, but code quality could be improved
- Pay special attention to
examples/advanced/job_api/tf/src/cifar10_data_split.py- contains unreachable code and style inconsistencies
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| examples/advanced/job_api/tf/src/cifar10_data_split.py | 3/5 | Added retry mechanism for CIFAR10 dataset loading with file locking. Issues: unreachable code, inconsistent imports, hardcoded path, suboptimal error handling |
| examples/advanced/job_api/tf/requirements.txt | 5/5 | Updated nvflare version from ~=2.5.0rc to >=2.5.0, appropriate for stable release |
| examples/advanced/job_api/pt/requirements.txt | 5/5 | Updated nvflare version from ~=2.5.0rc to >=2.5.0, appropriate for stable release |
| examples/advanced/job_api/sklearn/requirements.txt | 5/5 | Updated nvflare version from ~=2.5.0rc to >=2.5.0, appropriate for stable release |
Sequence Diagram
sequenceDiagram
participant Client as TF Training Script
participant Loader as load_cifar10_with_retry()
participant Lock as FileLock
participant Keras as datasets.cifar10.load_data()
participant FS as File System
Client->>Loader: Call load_cifar10_with_retry()
loop Retry Loop (max 3 attempts)
Loader->>Lock: Acquire lock (cifar10_download.lock)
Lock-->>Loader: Lock acquired
alt First attempt failed
Loader->>FS: Check ~/.keras/datasets/cifar-10-batches-py
FS-->>Loader: Exists
Loader->>FS: shutil.rmtree() - Remove corrupted data
FS-->>Loader: Removed
end
Loader->>Keras: load_data()
alt Success
Keras-->>Loader: Return (train, test) data
Loader->>Lock: Release lock
Loader-->>Client: Return dataset
else Timeout or UnpicklingError
Keras-->>Loader: Exception raised
Loader->>Lock: Release lock
alt Not last attempt
Loader->>Loader: Print error, sleep(retry_delay)
else Last attempt
Loader-->>Client: Raise RuntimeError
end
end
end
Fixes # .
Description
Apply #3926 to main branch
Types of changes
./runtest.sh.