Conversation

@Samuel-WM
Collaborator

This PR adds the initial end-to-end HPC/DDP implementation for the sklearn-style interface.

Before opening the PR, I ran scaling-oriented validation on the new distributed path to confirm the training loop was stable under multi-rank execution and that performance characteristics (epoch time / throughput / step time) behaved as expected as world size increased. Those checks focused on fit and predict workloads and included basic correctness signals.

After those initial runs, I discovered an implementation mistake introduced during subsequent development of the DDP prediction/aggregation path. This was not obvious from the earlier scaling results because DDP training can look stable and scale correctly even when the predict path omits the required cross-rank aggregation steps: synchronizing outputs across ranks, restoring the global row ordering, and deduplicating sampler padding. I have made the necessary corrections in this branch and am redoing the benchmark and perturbation tests with the updated package.
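To illustrate the aggregation steps the fix addresses, here is a minimal pure-Python sketch of merging per-rank prediction shards back into global dataset order. The function name and shapes are hypothetical, and plain lists stand in for the tensors a real run would exchange (e.g. via `torch.distributed` collectives), so the index-restoration and padding-dedup logic can be shown without a multi-process setup:

```python
def aggregate_predictions(per_rank_outputs, per_rank_indices, n_samples):
    """Merge per-rank prediction shards back into global dataset order.

    per_rank_outputs: one list of predictions per rank, in the order that
        rank produced them (may end with sampler-padding repeats).
    per_rank_indices: the matching global row index for each prediction
        (DistributedSampler-style: rank r sees rows r, r + W, r + 2W, ...,
        padded by wrapping around so every rank gets the same count).
    n_samples: true dataset length, used to size the merged output.
    """
    merged = [None] * n_samples
    for outputs, indices in zip(per_rank_outputs, per_rank_indices):
        for out, idx in zip(outputs, indices):
            # Padding wraps back to already-seen rows, so writing by global
            # index both restores ordering and deduplicates the padding.
            merged[idx] = out
    return merged


# Toy example: 5 samples, world size 2, so the sampler pads to 6 rows
# (row 0 is repeated on rank 1). Predictions are just 10 * row index.
preds = aggregate_predictions(
    per_rank_outputs=[[0, 20, 40], [10, 30, 0]],
    per_rank_indices=[[0, 2, 4], [1, 3, 0]],
    n_samples=5,
)
# -> [0, 10, 20, 30, 40]
```

Without the index restoration, concatenating shards rank-by-rank would interleave rows out of order, and without truncating or overwriting the padded repeats, the prediction array would be longer than the dataset.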

@cnellington
Collaborator

See if you can get the tests to run. If not, move to a branch on the repo instead of your fork.
