Commit 2ee5906

terraform: Add DataCrunch GPU cloud provider integration
Add comprehensive DataCrunch cloud provider support for GPU instances ranging from Blackwell B300 to Tesla V100. Features intelligent GPU selection with three strategies: wildcard variant selection (ANY_1H100), tier-based fallback (H100_OR_LESS, B300_OR_LESS) for automatic fallback through GPU tiers when top options unavailable, and explicit instance types for production workloads.

GPU tier hierarchy supports 10 tiers with automatic fallback to maximize provisioning success while capping costs. Tier-based selection recommended for most users (~$1.99/hr cap for H100 tier).

Add capacity checking infrastructure to validate instance availability before provisioning. API credential management uses OAuth2 with secure token retrieval from ~/.datacrunch/credentials. Special handling for local provider development with dev_overrides in ~/.terraformrc, skipping force_init in Ansible terraform module.

Add ML environment setup for DataCrunch instances: package updates, ML dependencies, PyTorch in virtualenv, NVIDIA driver reload with proper module unload sequence (nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia) to avoid PyTorch/driver mismatch, Claude Code installation, MOTD with PyTorch activation instructions, and bashrc auto-activation.

Add persistent volume support with KEEP=1 configuration option. Volume mappings cached in ~/.cache/kdevops/datacrunch/ enable fast reprovisioning (seconds vs minutes) while managing ~$10/month storage costs. Volume lifecycle wrapper scripts automate volume ID tracking across instance destroy/recreate cycles.

Add defconfig files for tier-based selection, specific GPUs, and multi-GPU configurations. Comprehensive documentation focuses on capacity challenges, selection strategies, and best practices.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <[email protected]>
1 parent c8cb24e commit 2ee5906
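
The tier-based fallback described in the commit message can be illustrated with a minimal Python sketch. It is not the provider's implementation: the function names `fallback_order` and `select_tier` are hypothetical, the `available` argument stands in for whatever the capacity check reports, and placing B300 above B200 at the top of the hierarchy is an assumption (the nine tiers from B200 down follow the fallback order given in the defconfig comments of this commit).

```python
from typing import Iterable, List, Optional

# Tier order, highest first. B200 and below follow the defconfig comments in
# this commit; putting B300 at the top is an assumption.
GPU_TIERS: List[str] = [
    "B300", "B200", "H100", "A100-80", "A100-40",
    "RTX PRO 6000", "RTX 6000 Ada", "L40S", "RTX A6000", "Tesla V100",
]


def fallback_order(max_tier: str) -> List[str]:
    """Return the tiers to try, starting at max_tier and walking down."""
    start = GPU_TIERS.index(max_tier)  # raises ValueError for an unknown tier
    return GPU_TIERS[start:]


def select_tier(max_tier: str, available: Iterable[str]) -> Optional[str]:
    """Pick the highest available tier that does not exceed max_tier."""
    available_set = set(available)
    for tier in fallback_order(max_tier):
        if tier in available_set:
            return tier
    return None  # nothing at or below the cap has capacity
```

For example, `select_tier("H100", ["L40S", "A100-80"])` would pick `"A100-80"`, while a B200 that happens to be available is ignored because it sits above the requested cap.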

55 files changed, 5262 additions and 10 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -131,3 +131,4 @@ terraform/oci/scripts/__pycache__/
 
 scripts/__pycache__/
 docs/contrib/kdevops_contributions*
+cloud-bill

Makefile

Lines changed: 9 additions & 0 deletions
@@ -27,6 +27,15 @@ ifdef DECLARE_HOSTS
 export DECLARED_HOSTS := $(DECLARE_HOSTS)
 endif
 
+# Export workflow CLI overrides
+ifdef KNLP
+export KNLP
+endif
+
+ifdef KEEP
+export KEEP
+endif
+
 include scripts/refs.Makefile
 
 KDEVOPS_NODES_ROLE_TEMPLATE_DIR := $(KDEVOPS_PLAYBOOKS_DIR)/roles/gen_nodes/templates
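
The added `ifdef`/`export` blocks place `KNLP` and `KEEP` in the environment of the commands that make runs, whenever those variables are defined. As a rough sketch of how a helper script might consume the `KEEP=1` option mentioned in the commit message, assuming that only the literal value `1` enables it (the function name `keep_volumes_enabled` is hypothetical and does not come from the commit):

```python
import os
from typing import Mapping, Optional


def keep_volumes_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when KEEP=1 is present in the given environment.

    Assumption: only the exact string "1" turns the option on; the real
    scripts may accept other values.
    """
    if env is None:
        env = os.environ  # default to the process environment set up by make
    return env.get("KEEP", "") == "1"
```

Passing the environment as a parameter keeps the check easy to exercise without modifying `os.environ`.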

defconfigs/datacrunch-4x-b200

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# DataCrunch 4x B200 (Blackwell) instance - latest GPU architecture
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_DATACRUNCH=y
+CONFIG_TERRAFORM_DATACRUNCH_INSTANCE_TYPE_4B200_120V=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y

defconfigs/datacrunch-4x-b300

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# DataCrunch 4x B300 (Blackwell) instance - latest GPU architecture
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_DATACRUNCH=y
+CONFIG_TERRAFORM_DATACRUNCH_INSTANCE_TYPE_4B300_120V=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# DataCrunch 4x H100 PCIe instance with PyTorch - pay-as-you-go pricing
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_DATACRUNCH=y
+CONFIG_TERRAFORM_DATACRUNCH_INSTANCE_TYPE_4H100_80S_176V=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y

defconfigs/datacrunch-a100

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# DataCrunch single A100 40GB SXM instance - pay-as-you-go pricing
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_DATACRUNCH=y
+CONFIG_TERRAFORM_DATACRUNCH_INSTANCE_TYPE_1A100_40S_22V=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# DataCrunch GPU with tier-based fallback (A100-40 maximum tier)
+# Uses A100_40_OR_LESS for best available single GPU up to A100-40
+# Fallback order: A100-40 → RTX PRO 6000 → RTX 6000 Ada → L40S → RTX A6000 → Tesla V100
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_DATACRUNCH=y
+CONFIG_TERRAFORM_DATACRUNCH_INSTANCE_TYPE_A100_40_OR_LESS=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# DataCrunch GPU with tier-based fallback (A100-80 maximum tier)
+# Uses A100_80_OR_LESS for best available single GPU up to A100-80
+# Fallback order: A100-80 → A100-40 → RTX PRO 6000 → RTX 6000 Ada → L40S → RTX A6000 → Tesla V100
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_DATACRUNCH=y
+CONFIG_TERRAFORM_DATACRUNCH_INSTANCE_TYPE_A100_80_OR_LESS=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y

defconfigs/datacrunch-b200-or-less

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# DataCrunch GPU with tier-based fallback (B200 maximum tier)
+# Uses B200_OR_LESS for best available single GPU up to B200
+# Fallback order: B200 → H100 → A100-80 → A100-40 → RTX PRO 6000 → RTX 6000 Ada → L40S → RTX A6000 → Tesla V100
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_DATACRUNCH=y
+CONFIG_TERRAFORM_DATACRUNCH_INSTANCE_TYPE_B200_OR_LESS=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y

defconfigs/datacrunch-b300

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# DataCrunch single NVIDIA Blackwell B300 GPU (latest generation)
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_DATACRUNCH=y
+CONFIG_TERRAFORM_DATACRUNCH_INSTANCE_TYPE_1B300_30V=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
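
Each defconfig above differs mainly in which `CONFIG_TERRAFORM_DATACRUNCH_INSTANCE_TYPE_*` symbol it enables. The following Python sketch shows one way to read that choice back out of a defconfig; it is an illustration only, not part of the commit. The helper names are hypothetical, and how kdevops maps the symbol suffix to an actual DataCrunch instance type name is not shown in this diff.

```python
from typing import Dict, Optional

# Symbol prefix as it appears in the defconfigs of this commit.
PREFIX = "CONFIG_TERRAFORM_DATACRUNCH_INSTANCE_TYPE_"


def parse_defconfig(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    options: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '='
            options[key] = value
    return options


def selected_instance_type(options: Dict[str, str]) -> Optional[str]:
    """Return the suffix of the enabled instance-type symbol, if any."""
    for key, value in options.items():
        if key.startswith(PREFIX) and value == "y":
            return key[len(PREFIX):]
    return None


EXAMPLE = """\
# DataCrunch single NVIDIA Blackwell B300 GPU (latest generation)
CONFIG_TERRAFORM=y
CONFIG_TERRAFORM_DATACRUNCH=y
CONFIG_TERRAFORM_DATACRUNCH_INSTANCE_TYPE_1B300_30V=y
"""
```

With the `EXAMPLE` text, `selected_instance_type(parse_defconfig(EXAMPLE))` yields the suffix `1B300_30V`; a tier-based defconfig would yield a suffix such as `B200_OR_LESS` instead.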
