@huydhn huydhn commented Nov 15, 2024

linux.4xlarge.memory has 128 GB of memory with 16 CPU cores while linux.12xlarge, a more expensive runner, has only 96 GB of memory on a whopping 48 CPU cores.
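
To make the trade-off concrete, here is a minimal sketch that computes memory per CPU core for the two runners, using only the figures quoted above. The names `RUNNER_SPECS` and `memory_per_core_gb` are hypothetical and exist only for illustration; they are not part of the PR.

```python
# Runner specs as quoted in the PR description (memory in GB, CPU cores).
RUNNER_SPECS = {
    "linux.4xlarge.memory": {"memory_gb": 128, "cpu_cores": 16},
    "linux.12xlarge": {"memory_gb": 96, "cpu_cores": 48},
}


def memory_per_core_gb(runner: str) -> float:
    """Return the memory available per CPU core, in GB, for a runner."""
    spec = RUNNER_SPECS[runner]
    return spec["memory_gb"] / spec["cpu_cores"]


if __name__ == "__main__":
    for name in RUNNER_SPECS:
        print(f"{name}: {memory_per_core_gb(name):.1f} GB per core")
```

On these numbers the memory-optimized runner offers four times as much memory per core, which matters for memory-bound jobs such as model export; it says nothing about wall-clock time for CPU-bound steps.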

Testing

Example runs for:

pytorch-bot bot commented Nov 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6896

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 8204737 with merge base ec68eb3 (image):

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Nov 15, 2024
@huydhn huydhn marked this pull request as ready for review November 18, 2024 18:40

@guangy10 guangy10 left a comment

@huydhn Before merging, can you link a comparison of job execution time between linux.4xlarge.memory and linux.12xlarge? I would like to understand the actual trade-off we are making; ideally the difference should be tiny.

"resnet50": "linux.12xlarge",
"llava": "linux.12xlarge",
"llama3_2_vision_encoder": "linux.12xlarge",
# "llama3_2_text_decoder": "linux.12xlarge", # TODO: re-enable test when Huy's change is in / model gets smaller.
Contributor

Can you attach the job link for this model, since we are re-enabling it?

Contributor Author

Good catch, I think I will comment it out and leave it for later. It doesn't OOM, but it takes forever to export (close to 6 hours so far). I don't have much context, so I will probably need help from @dvorjackz to figure this one out.
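
For readers unfamiliar with the pattern in this thread, the sketch below shows one way a per-model runner override can work. The entries mirror the snippet quoted in the review; the names `CUSTOM_RUNNERS`, `DEFAULT_RUNNER`, and `pick_runner`, and the default value `linux.2xlarge`, are assumptions made for illustration and may not match the repository's actual code or the values merged in this PR.

```python
from typing import Dict

# Hypothetical fallback runner; the real default is not shown in this thread.
DEFAULT_RUNNER = "linux.2xlarge"

# Per-model overrides, copied from the snippet quoted in the review.
CUSTOM_RUNNERS: Dict[str, str] = {
    "resnet50": "linux.12xlarge",
    "llava": "linux.12xlarge",
    "llama3_2_vision_encoder": "linux.12xlarge",
    # "llama3_2_text_decoder" stays commented out, as in the thread.
}


def pick_runner(
    model: str,
    custom: Dict[str, str] = CUSTOM_RUNNERS,
    default: str = DEFAULT_RUNNER,
) -> str:
    """Return the override for a model, or the default when none is listed."""
    return custom.get(model, default)


if __name__ == "__main__":
    for model in ("resnet50", "llama3_2_text_decoder"):
        print(model, "->", pick_runner(model))
```

With a lookup like this, commenting out an entry does not by itself disable a test; the model simply falls back to the default runner, so any disabling has to happen elsewhere.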

@huydhn huydhn merged commit aadf2ee into main Nov 26, 2024
65 of 66 checks passed
@huydhn huydhn deleted the try-r5-instances branch November 26, 2024 22:00
Labels

CLA Signed, topic: not user facing

5 participants