Validation issue with ring-based all reduce #209

@neoblizz

Description

Find the root cause of why validation fails when running on more than 1 GPU:

python examples/15_gemm_all_reduce_ring_based/benchmark.py --benchmark --validate --num_ranks 2
[Iris] [1/2] Validating...
[Iris] [0/2] Validating...
[Iris] [1/2] Max absolute difference: 646.0
[Iris] [0/2] Max absolute difference: 605.5
[Iris] [0/2] Mismatch at index (4641, 1056): C=45.90625, expected=55.25
[Iris] [0/2] Final C validation failed.
[Iris] [1/2] Mismatch at index (21, 3616): C=44.0, expected=22.28125
[Iris] [1/2] Final C validation failed.
[Iris] [1/2] Validation completed
[Iris] [1/2] Benchmarking...
[Iris] [0/2] Validation completed
[Iris] [0/2] Benchmarking...

Use the following branch: https://github.com/ROCm/iris/tree/muhosama/insane-all-reduce
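
For triage, a CPU-side reference of the ring pattern may be useful to compare against. The following is a minimal sketch, not the Iris kernel: it assumes a sum reduction, one contiguous chunk per rank, and the usual reduce-scatter followed by all-gather schedule. The function name `ring_all_reduce` and the chunking scheme are hypothetical and chosen only for illustration.

```python
def ring_all_reduce(inputs):
    """Simulate a sum all-reduce over a ring of len(inputs) ranks.

    inputs[r] is rank r's local vector. Returns one output vector per rank.
    Hypothetical reference helper; it does not mirror the Iris implementation.
    """
    n = len(inputs)
    length = len(inputs[0])
    if any(len(v) != length for v in inputs):
        raise ValueError("all ranks must hold vectors of the same length")

    # Chunk c covers indices [bounds[c][0], bounds[c][1]).
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]
    bufs = [list(v) for v in inputs]

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) % n to
    # rank (r + 1) % n, which accumulates it into its own copy.
    for step in range(n - 1):
        # Snapshot all outgoing chunks first so every send uses pre-step data.
        sends = []
        for r in range(n):
            c = (r - step) % n
            lo, hi = bounds[c]
            sends.append((c, bufs[r][lo:hi]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            lo, _ = bounds[c]
            for i, v in enumerate(payload):
                bufs[dst][lo + i] += v

    # After phase 1, rank r holds the fully reduced chunk (r + 1) % n.
    # Phase 2: all-gather. In step s, rank r forwards chunk (r + 1 - s) % n,
    # and the receiver overwrites its stale copy.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            lo, hi = bounds[c]
            sends.append((c, bufs[r][lo:hi]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = sends[r]
            lo, _ = bounds[c]
            for i, v in enumerate(payload):
                bufs[dst][lo + i] = v

    return bufs


if __name__ == "__main__":
    local = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    for rank, out in enumerate(ring_all_reduce(local)):
        print(rank, out)
```

If a correct schedule like this one is followed, every rank ends with the same element-wise sum. Differing results per rank, as in the log above, could point to a chunk being read before its producer finished writing it or to a chunk index being off by one step, though that is only a guess until the kernel is inspected.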

Labels

examples (Examples showcasing Iris APIs and usage), iris (Iris project issue)
