recognize and issue error if GPU does not support bf16 #1344

mikekgfb · 2024-11-05T19:43:31Z

Addresses #1298, which causes models using bf16 as dtype to fail on T4 (and other pre-9.0 arch level GPUs), by selecting an alternate dtype when possible and issuing a clear error describing the problem otherwise.
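A minimal sketch of the behavior described above, not the code from this PR. The function name `choose_dtype`, the `float16` fallback, and the `min_bf16_capability` parameter are all assumptions made for illustration; the description only says "an alternate dtype" and does not name one. The capability threshold is left to the caller rather than asserted here.

```python
from typing import Tuple

# (major, minor) compute capability, in the shape returned by
# torch.cuda.get_device_capability(); a T4 reports (7, 5).
Capability = Tuple[int, int]


def choose_dtype(
    requested: str,
    capability: Capability,
    min_bf16_capability: Capability,  # hypothetical threshold, supplied by the caller
    allow_fallback: bool = True,
) -> str:
    """Return a dtype name to use on a device with the given capability."""
    # Anything other than bfloat16, or a device at or above the threshold,
    # is passed through unchanged.
    if requested != "bfloat16" or capability >= min_bf16_capability:
        return requested
    # Assumption: float16 is the alternate dtype; the PR does not specify one.
    if allow_fallback:
        return "float16"
    raise RuntimeError(
        f"target device with compute capability {capability[0]}.{capability[1]} "
        "does not support the bfloat16 data type"
    )
```

For a T4, which reports compute capability 7.5, any threshold above `(7, 5)` makes this return `"float16"`, or raise when `allow_fallback=False`.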

pytorch-bot · 2024-11-05T19:43:35Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1344

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures

As of commit fd0f53e with merge base 9480258:

NEW FAILURES - The following jobs have failed:

pull / test-gpu-aoti-bfloat16 (cuda, stories15M) / linux-job (gh)
RuntimeError: target device cuda does not support the bfloat16 data type
pull / test-gpu-aoti-float32 (cuda, stories15M) / linux-job (gh)
RuntimeError: Command docker exec -t 0c04f41801b5b99edf38894a351f81f6035a89564e96cc0a10f282a35ffb760a /exec failed with exit code 1
pull / test-gpu-compile (cuda, stories15M) / linux-job (gh)
RuntimeError: Command docker exec -t 641a1775331a491490b919302e2203b88899c793eae3472f3fb9ac5a1081400a /exec failed with exit code 1
pull / test-gpu-eval-sanity-check (cuda, stories15M) / linux-job (gh)
RuntimeError: target device cuda does not support the bfloat16 data type
Run the aoti runner with CUDA using stories / test-runner-aot-cuda / linux-job (gh)
RuntimeError: Command docker exec -t 1fec3a511723e09118a9ce7a39ce0b139043b4d9510d28753621fe8877f01abc /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.


mikekgfb · 2024-11-06T19:22:06Z

Because we emulate BF16 on pre-V9.0 CUDA architectures, this test is overly restrictive. In a nutshell, the only problem is a small set of functions, such as torchao's linear:int4 operator (xref: pytorch/ao#1110), that don't emulate BF16. Given the general posture of PyTorch, emulating BF16 in those operators, rather than issuing an error, would be the way to go (or, alternatively, limiting the error to a more specific architecture check in the linear:int4 transformation).
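The narrower alternative mentioned in the comment above could look roughly like the following. This is a sketch under stated assumptions, not torchchat or torchao code: the function `check_quantization`, the set `SCHEMES_NEEDING_NATIVE_BF16`, and the `min_capability` parameter are hypothetical, and the quantization config is assumed to be a dict keyed by scheme name (for example `"linear:int4"`).

```python
from typing import Dict, Tuple

# Hypothetical list of quantization schemes whose kernels are assumed
# to lack BF16 emulation on older GPUs.
SCHEMES_NEEDING_NATIVE_BF16 = frozenset({"linear:int4"})


def check_quantization(
    quantize: Dict[str, dict],
    dtype: str,
    capability: Tuple[int, int],
    min_capability: Tuple[int, int],  # hypothetical threshold, supplied by the caller
) -> None:
    """Raise only when an affected scheme is combined with bfloat16 on an old GPU."""
    # Emulated BF16 is fine for everything else, so do nothing in that case.
    if dtype != "bfloat16" or capability >= min_capability:
        return
    blocked = sorted(s for s in quantize if s in SCHEMES_NEEDING_NATIVE_BF16)
    if blocked:
        raise RuntimeError(
            f"quantization scheme(s) {', '.join(blocked)} do not support bfloat16 "
            f"on compute capability {capability[0]}.{capability[1]}"
        )
```

Compared with a blanket dtype check, this leaves unquantized BF16 models running through emulation and reports an error only for the affected transformation.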

mikekgfb · 2024-11-06T19:25:13Z

Closing, as this PR implements an overly restrictive check given the emulation of BF16 for older architectures.

recognize and issue error if GPU does not support bf16

bfab3f1

Address pytorch#1298 which causes models to fail on T4 (and other pre-9.0 arch level GPUs) by selecting an alternate dtype when possible, and issue an error otherwise

facebook-github-bot added the CLA Signed label Nov 5, 2024

mikekgfb mentioned this pull request Nov 5, 2024

Named Symbol not found (torchchat #1298) pytorch/ao#1110

Open

mikekgfb added 2 commits November 5, 2024 12:20

Update build_utils.py

ae69428

typo

Update build_utils.py

fd0f53e

typo

mikekgfb closed this Nov 6, 2024
