[None][infra] Enable single-gpu CI on spark #9304
Merged
39 commits
All commits authored by EmmaQiaoCh:

- `1481e89` Add a spart test stage
- `dc89ec7` Check path /dev/gdrdrv first before mount in dlcluster
- `6be1b77` Using a new pynvml package since onld one doesn't support spark
- `42a27c8` Set a default value for spark
- `b0617b5` No need to install nvidia-ml-py
- `a21e89d` Correct stage name
- `9a1ffbe` correct test-db file path and name
- `16aec7d` Merge branch 'main' into emma/enable_spark_ci
- `58b92e9` Add blossom ci for spark
- `f966823` Fix typo
- `ad44142` Add some debug
- `fda6530` Add more debug info
- `efab04c` Fix yml format
- `d1f3f34` Add more info to podtemplate
- `eb0ec59` Merge branch 'main' into emma/enable_spark_ci
- `10b177a` Reduce memory
- `dc0c01a` Remove some properties to debug
- `eb25f1f` Reduce memory
- `9125c9a` Correct podTemplate for gb10
- `6060287` Add tolerations to template
- `1eb5751` Merge pull request #2 from JennyLiu-nv/dev-jenny-dgx-spark-gpu-mem
- `fa032ef` Merge branch 'main' into emma/enable_spark_ci
- `bb47ca9` Fix a typo
- `8c324db` Fix yml format
- `9b0e414` Update test list for gb10
- `0879ac4` Merge branch 'main' into emma/enable_spark_ci
- `41b9c01` Comment out failed cases on spark
- `4fdadaa` Fix some typo and test for flashing driver
- `bd5b92f` Update for the spark cloud env
- `f133942` Move back to not support flash driver
- `bfd660f` Merge branch 'main' into emma/enable_spark_ci
- `14563b7` Update key for slurm run
- `2f1f98a` Change to use BSL main branch since the change is merged
- `2246a1d` Update jenkins/L0_Test.groovy
- `3d4c0e8` Update jenkins/L0_Test.groovy
- `bd0b54d` Update jenkins/L0_Test.groovy
- `0d8c34d` Update tests/integration/test_lists/test-db/l0_gb10.yml
- `03df582` Fix for comments
- `f9d8dd0` Remove a debug info
New file `tests/integration/test_lists/test-db/l0_gb10.yml` (hunk `@@ -0,0 +1,42 @@`):

```yaml
version: 0.0.1
# DGX Spark is also known as the GB10 Grace Blackwell Superchip.
l0_gb10:
- condition:
    ranges:
      system_gpu_count:
        gte: 1
        lte: 1
    wildcards:
      gpu:
      - '*gb10*'
      linux_distribution_name: ubuntu*
      cpu: aarch64
    terms:
      stage: post_merge
      backend: pytorch
  tests:
  # ------------- PyTorch tests ---------------
  - unittest/_torch/attention/test_attention_mla.py
  - test_e2e.py::test_ptp_quickstart_bert[VANILLA-BertForSequenceClassification-bert/bert-base-uncased-yelp-polarity]
  - test_e2e.py::test_ptp_quickstart_bert[TRTLLM-BertForSequenceClassification-bert/bert-base-uncased-yelp-polarity]
  - accuracy/test_llm_api_pytorch.py::TestQwen3_8B::test_bf16[latency]
- condition:
    ranges:
      system_gpu_count:
        gte: 1
        lte: 1
    wildcards:
      gpu:
      - '*gb10*'
      linux_distribution_name: ubuntu*
      cpu: aarch64
    terms:
      stage: pre_merge
      backend: pytorch
  tests:
  # ------------- PyTorch tests ---------------
  # The commented-out cases below are disabled because they failed on GB10.
  # - unittest/_torch/modeling -k "modeling_mllama"
  - unittest/_torch/modeling -k "modeling_out_of_tree"
  # - unittest/_torch/modules/test_fused_moe.py::test_fused_moe_nvfp4[CUTLASS-dtype0]
  # - unittest/_torch/modules/test_fused_moe.py::test_fused_moe_nvfp4[CUTLASS-dtype1]
```