
@ryan-williams (Collaborator)

Summary

  • Add .github/workflows/gpu-e2e.yml for GPU-based e2e training tests
  • Add --gpu flag to tests/e2e_train.py for GPU acceleration

Uses Open-Athena/lambda-gha to spin up ephemeral Lambda Labs GPU instances.

Required Setup

Secrets:

  • LAMBDA_API_KEY
  • GH_SA_TOKEN
  • LAMBDA_SSH_PRIVATE_KEY

Variables:

  • LAMBDA_SSH_KEY_NAMES

Test plan

  • Trigger via workflow_dispatch after merge
  • Verify GPU instance spins up
  • Verify training runs with --gpu flag
  • Verify instance self-terminates

🤖 Generated with Claude Code

- `tests/e2e_train.py`: CLI for seeded training on sample data (5 epochs,
  8 channels, 2 residual blocks); verifies that val_loss matches the expected
  value (sketched after this list)
- `tests/expected_loss.txt`: expected val_loss (0.613389) for regression testing
- `tests/test_e2e_train.py`: pytest wrapper (sketched below)
- `examples/e2e_training_demo.ipynb`: notebook version with training curve plot
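For orientation, here is a minimal Python sketch of what that CLI's core could look like. The helpers `build_model` and `run_training` are hypothetical stand-ins, not names from this repo; only the RNG seeding and the comparison against `tests/expected_loss.txt` reflect what the bullets above describe.

```python
# Hypothetical sketch of tests/e2e_train.py's core; build_model and
# run_training are stand-ins, not real names from this repo.
import argparse
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    # Seed every RNG the training loop touches so val_loss is reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--epochs", type=int, default=5)
    args = parser.parse_args()

    set_seed(args.seed)
    model = build_model(channels=8, residual_blocks=2)  # hypothetical helper
    val_loss = run_training(model, epochs=args.epochs)  # hypothetical helper

    # Regression check against the committed expected value.
    with open("tests/expected_loss.txt") as f:
        expected = float(f.read())
    assert abs(val_loss - expected) < 1e-6, f"val_loss {val_loss} != {expected}"


if __name__ == "__main__":
    main()
```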

Training is fully deterministic; val_loss improves 47% over the 5 epochs.
Add notebook deps: `nbconvert`, `ipykernel`, `papermill`.
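The pytest wrapper listed above could plausibly be a thin subprocess call around the CLI, letting the CLI's own val_loss assertion decide pass/fail. This is a guess at its shape, not the actual file:

```python
# Hypothetical shape of tests/test_e2e_train.py: run the CLI end to end
# and surface its stderr if the val_loss check inside it fails.
import subprocess
import sys


def test_e2e_train():
    result = subprocess.run(
        [sys.executable, "tests/e2e_train.py"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stderr
```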

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- `.github/workflows/gpu-e2e.yml`: workflow_dispatch workflow using
  Open-Athena/lambda-gha for ephemeral Lambda Labs GPU instances
- `tests/e2e_train.py`: add `--gpu` flag to enable GPU acceleration (sketched below)

The workflow spins up an A10 GPU instance by default, runs the training
test, then the instance self-terminates. It requires the LAMBDA_API_KEY,
GH_SA_TOKEN, and LAMBDA_SSH_PRIVATE_KEY secrets and the
LAMBDA_SSH_KEY_NAMES variable.
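For the `--gpu` flag, a plausible shape (assuming the training code is PyTorch; the actual wiring in `tests/e2e_train.py` may differ) is to fail loudly when CUDA is absent rather than silently falling back to CPU:

```python
# Hypothetical --gpu handling: error out if CUDA is missing instead of
# quietly training on CPU and producing a different val_loss.
import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument("--gpu", action="store_true", help="train on a CUDA device")
args = parser.parse_args()

if args.gpu:
    if not torch.cuda.is_available():
        raise SystemExit("--gpu was passed but no CUDA device is available")
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# The model and each batch would then be moved with .to(device).
```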

Co-Authored-By: Claude Opus 4.5 <[email protected]>