@holgerroth
Collaborator

Fixes # .

Description

Apply #3926 to main branch

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@holgerroth
Collaborator Author

/build

chesterxgchen
chesterxgchen previously approved these changes Jan 5, 2026
Collaborator

@chesterxgchen chesterxgchen left a comment

LGTM, although this could be avoided by simply asking the user to run download_data first.

@greptile-apps
Contributor

greptile-apps bot commented Jan 5, 2026

Greptile Summary

This PR applies fixes from #3926 to the main branch, updating NVFlare version constraints and adding robust CIFAR10 dataset loading. The changes update the nvflare dependency from release candidate ~=2.5.0rc to stable version >=2.5.0 across all three job API examples (PyTorch, sklearn, TensorFlow).

The main enhancement is the addition of load_cifar10_with_retry() function in the TensorFlow example, which addresses concurrent download issues and corrupted pickle data by implementing:

  • File-based locking mechanism using filelock to prevent race conditions when multiple processes download CIFAR10 simultaneously
  • Retry logic with automatic cleanup of corrupted downloads on subsequent attempts
  • Better error handling for Timeout and pickle.UnpicklingError exceptions
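
The three points above can be sketched in a TensorFlow-free form. This is a minimal illustration, not the code merged in this PR: `load_with_retry`, `load_fn`, `cache_dir`, `lock`, and `retry_on` are hypothetical names, and the sketch assumes `max_retries` is at least 1.

```python
import os
import pickle
import shutil
import time


def load_with_retry(load_fn, cache_dir, lock, max_retries=3, retry_delay=0.0,
                    retry_on=(pickle.UnpicklingError,)):
    """Call load_fn() while holding `lock`, wiping cache_dir before each retry.

    Hypothetical helper: `lock` is any context manager that serializes access,
    and `retry_on` lists the exception types that trigger another attempt.
    """
    for attempt in range(max_retries):
        try:
            with lock:  # only one caller downloads or cleans up at a time
                if attempt > 0 and os.path.isdir(cache_dir):
                    # A previous attempt failed, so treat the cache as corrupted.
                    shutil.rmtree(cache_dir)
                return load_fn()
        except retry_on as exc:
            if attempt == max_retries - 1:
                raise RuntimeError(
                    f"loading failed after {max_retries} attempts"
                ) from exc
            time.sleep(retry_delay)
```

With the `filelock` package, `lock` could be a `FileLock(path, timeout=...)` instance and `retry_on` could additionally include `filelock.Timeout`, which would match the exceptions named in the summary.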

Key Changes:

  • Updated nvflare version constraints from ~=2.5.0rc to >=2.5.0 in all three requirements.txt files
  • Added load_cifar10_with_retry() with retry mechanism and file locking to handle concurrent downloads
  • Added required imports: pickle, time, and filelock.FileLock/Timeout

Issues Found:

  • Lock file path uses hardcoded /tmp directory which may cause issues on Windows systems

Confidence Score: 4/5

  • Safe to merge with minor portability concern on Windows systems
  • The version updates are straightforward and correct. The retry mechanism implementation is sound and follows established patterns. Score reduced by 1 due to hardcoded /tmp path that differs from reference implementation and may cause issues on Windows platforms.
  • Pay attention to examples/advanced/job_api/tf/src/cifar10_data_split.py for the lock path portability issue

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/advanced/job_api/pt/requirements.txt | Updated nvflare version constraint from ~=2.5.0rc to >=2.5.0 to use stable release |
| examples/advanced/job_api/sklearn/requirements.txt | Updated nvflare version constraint from ~=2.5.0rc to >=2.5.0 to use stable release |
| examples/advanced/job_api/tf/requirements.txt | Updated nvflare version constraint from ~=2.5.0rc to >=2.5.0 to use stable release |
| examples/advanced/job_api/tf/src/cifar10_data_split.py | Added load_cifar10_with_retry function with file locking and retry logic to handle concurrent downloads and corrupted data, using hardcoded /tmp path for lock file |

Sequence Diagram

sequenceDiagram
    participant User
    participant cifar10_split
    participant _partition_data
    participant load_cifar10_with_retry
    participant FileLock
    participant TensorFlow
    participant FileSystem

    User->>cifar10_split: cifar10_split(split_dir, num_sites, alpha, seed)
    cifar10_split->>_partition_data: _partition_data(num_sites, alpha)
    _partition_data->>load_cifar10_with_retry: load_cifar10_with_retry()
    
    loop For each retry attempt (max 3)
        load_cifar10_with_retry->>FileLock: Acquire lock on /tmp/cifar10_download.lock
        FileLock-->>load_cifar10_with_retry: Lock acquired
        
        alt Retry attempt > 0
            load_cifar10_with_retry->>FileSystem: Check if ~/.keras/datasets/cifar-10-batches-py exists
            FileSystem-->>load_cifar10_with_retry: Directory exists
            load_cifar10_with_retry->>FileSystem: shutil.rmtree(cifar10_path)
            FileSystem-->>load_cifar10_with_retry: Corrupted data removed
        end
        
        load_cifar10_with_retry->>TensorFlow: datasets.cifar10.load_data()
        
        alt Success
            TensorFlow-->>load_cifar10_with_retry: Return (train_data, test_data)
            load_cifar10_with_retry->>FileLock: Release lock
            load_cifar10_with_retry-->>_partition_data: Return dataset
        else Timeout or UnpicklingError
            TensorFlow-->>load_cifar10_with_retry: Exception
            load_cifar10_with_retry->>FileLock: Release lock
            alt Not final attempt
                load_cifar10_with_retry->>load_cifar10_with_retry: time.sleep(retry_delay)
            else Final attempt
                load_cifar10_with_retry-->>_partition_data: Raise RuntimeError
            end
        end
    end
    
    _partition_data->>_partition_data: Partition data using Dirichlet sampling
    _partition_data-->>cifar10_split: Return site_idx, class_sum
    cifar10_split->>FileSystem: Save site splits to .npy files
    cifar10_split-->>User: Return train_idx_paths

greptile-apps[bot]

This comment was marked as resolved.

@holgerroth
Collaborator Author

/build

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. examples/advanced/job_api/tf/src/cifar10_data_split.py, line 166 (link)

    logic: unreachable code - loop will always raise at line 162 when last attempt fails

4 files reviewed, 1 comment

@holgerroth holgerroth enabled auto-merge (squash) January 6, 2026 03:50
@holgerroth
Collaborator Author

/build

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. examples/advanced/job_api/tf/src/cifar10_data_split.py, line 166 (link)

    logic: this line is unreachable because the loop either returns successfully or raises an exception on the last retry attempt (line 162)
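
One possible way to avoid a dead trailing statement is to let the loop only record the failure and to raise once, after the loop. This is a hedged sketch of that restructuring, not necessarily what the PR did: `call_with_retry`, `fn`, and `retry_on` are hypothetical names.

```python
def call_with_retry(fn, max_retries=3, retry_on=(ValueError,)):
    """Retry fn() and raise exactly once, after the loop, if every attempt fails."""
    last_error = None
    for _ in range(max_retries):
        try:
            return fn()  # success leaves the function immediately
        except retry_on as exc:
            last_error = exc  # remember the failure and try again
    # Reached only when no attempt succeeded (or when max_retries is 0).
    raise RuntimeError(f"failed after {max_retries} attempts") from last_error
```

Because the single `raise` sits after the loop, there is no in-loop "last attempt" branch that would make a later statement unreachable.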

4 files reviewed, 1 comment

@holgerroth
Collaborator Author

/build

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. examples/advanced/job_api/tf/src/cifar10_data_split.py, line 142 (link)

    logic: Lock path uses absolute /tmp directory which may not exist or be writable on all systems (e.g., Windows). The reference implementation in examples/advanced/cifar10/tf/src/cifar10_data_split.py:144 uses a relative path "cifar10_download.lock" which is more portable.
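
Besides the relative path mentioned above, another portable option would be to build the lock path from the platform's temporary directory. The sketch below is an assumption-laden illustration rather than the PR's code: `lock_path` is a hypothetical helper, and only the file name `cifar10_download.lock` comes from the comment.

```python
import os
import tempfile


def lock_path(name="cifar10_download.lock"):
    """Return a lock-file path inside the platform's temp directory.

    tempfile.gettempdir() resolves to a suitable directory on Linux, macOS,
    and Windows, so nothing here depends on /tmp existing.
    """
    return os.path.join(tempfile.gettempdir(), name)
```

Unlike a relative path, this location does not depend on the current working directory, so processes started from different directories would still share one lock file.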

4 files reviewed, 1 comment

@chesterxgchen
Collaborator

Looking at this job API example's structure, it is not aligned with our new example job structure. We should change it, correct? Or should we remove it (as @YuanTingHsieh suggested)?

@holgerroth
Collaborator Author

/build

greptile-apps[bot]

This comment was marked as off-topic.

greptile-apps[bot]

This comment was marked as outdated.

@holgerroth
Collaborator Author

/build

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR applies fixes from #3926 to the main branch, updating NVFlare version constraints and adding robust CIFAR10 dataset loading.

Key Changes:

  • Updated nvflare dependency from ~=2.5.0rc to >=2.5.0 across all three job API examples (PyTorch, sklearn, TensorFlow)
  • Added load_cifar10_with_retry() function with file locking and retry mechanism to address concurrent download issues

Issues Found:

  • Unreachable dead code at line 166 (logic error)
  • Inconsistent use of pickle instead of _pickle (style)
  • Hardcoded /tmp path reduces Windows portability (style)
  • Import statement inside retry loop (style)
  • Narrow exception handling may miss other download failures (style)

Confidence Score: 3/5

  • This PR is moderately safe to merge with one logic error and several style improvements needed
  • Score reflects one unreachable dead code issue (logic error) that should be removed, plus multiple style inconsistencies compared to other examples in the codebase. The version updates are correct, and the retry mechanism addresses the intended problem, but code quality could be improved
  • Pay special attention to examples/advanced/job_api/tf/src/cifar10_data_split.py - contains unreachable code and style inconsistencies

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| examples/advanced/job_api/tf/src/cifar10_data_split.py | 3/5 | Added retry mechanism for CIFAR10 dataset loading with file locking. Issues: unreachable code, inconsistent imports, hardcoded path, suboptimal error handling |
| examples/advanced/job_api/tf/requirements.txt | 5/5 | Updated nvflare version from ~=2.5.0rc to >=2.5.0, appropriate for stable release |
| examples/advanced/job_api/pt/requirements.txt | 5/5 | Updated nvflare version from ~=2.5.0rc to >=2.5.0, appropriate for stable release |
| examples/advanced/job_api/sklearn/requirements.txt | 5/5 | Updated nvflare version from ~=2.5.0rc to >=2.5.0, appropriate for stable release |

Sequence Diagram

sequenceDiagram
    participant Client as TF Training Script
    participant Loader as load_cifar10_with_retry()
    participant Lock as FileLock
    participant Keras as datasets.cifar10.load_data()
    participant FS as File System
    
    Client->>Loader: Call load_cifar10_with_retry()
    loop Retry Loop (max 3 attempts)
        Loader->>Lock: Acquire lock (cifar10_download.lock)
        Lock-->>Loader: Lock acquired
        
        alt First attempt failed
            Loader->>FS: Check ~/.keras/datasets/cifar-10-batches-py
            FS-->>Loader: Exists
            Loader->>FS: shutil.rmtree() - Remove corrupted data
            FS-->>Loader: Removed
        end
        
        Loader->>Keras: load_data()
        
        alt Success
            Keras-->>Loader: Return (train, test) data
            Loader->>Lock: Release lock
            Loader-->>Client: Return dataset
        else Timeout or UnpicklingError
            Keras-->>Loader: Exception raised
            Loader->>Lock: Release lock
            alt Not last attempt
                Loader->>Loader: Print error, sleep(retry_delay)
            else Last attempt
                Loader-->>Client: Raise RuntimeError
            end
        end
    end

@holgerroth holgerroth merged commit 60aefd7 into NVIDIA:main Jan 9, 2026
20 of 22 checks passed
@holgerroth holgerroth deleted the main_job_api_examples branch January 9, 2026 20:35
3 participants