Contact Details [Optional]
arrobaouassim@gmail.com
System Information
ZENML_LOCAL_VERSION: 0.94.1
ZENML_SERVER_VERSION: 0.94.1
ZENML_SERVER_DATABASE: mysql
ZENML_SERVER_DEPLOYMENT_TYPE: other
ZENML_CONFIG_DIR: /home/<user>/.config/zenml
ZENML_LOCAL_STORE_DIR: /home/<user>/.config/zenml/local_stores
ZENML_SERVER_URL: <redacted>
ZENML_ACTIVE_REPOSITORY_ROOT: /home/<user>/dev/<project>
PYTHON_VERSION: 3.13.12
ENVIRONMENT: wsl
SYSTEM_INFO: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '24.04'}
ACTIVE_PROJECT: default
ACTIVE_STACK: cloud_stack
ACTIVE_USER: <redacted>
TELEMETRY_STATUS: enabled
ANALYTICS_CLIENT_ID: <redacted>
ANALYTICS_USER_ID: <redacted>
ANALYTICS_SERVER_ID: <redacted>
INTEGRATIONS: ['scipy', 'numpy', 'kaniko', 'kubernetes', 'wandb', 'airflow', 's3', 'sklearn', 'pandas', 'pillow']
CURRENT STACK
Name: cloud_stack
ID: <redacted>
User: <redacted>
IMAGE_BUILDER: local_builder
Name: local_builder
ID: <redacted>
Type: image_builder
Flavor: local
Configuration: {}
User: <redacted>
EXPERIMENT_TRACKER: wandb_tracker
Name: wandb_tracker
ID: <redacted>
Type: experiment_tracker
Flavor: wandb
Configuration: {'api_key': '********', 'entity': '<redacted>', 'project_name': '<redacted>'}
User: <redacted>
ORCHESTRATOR: skypilot_gcp
Name: skypilot_gcp
ID: <redacted>
Type: orchestrator
Flavor: vm_gcp
Configuration: {'region': 'us-east4', 'idle_minutes_to_autostop': 30, 'down': True, 'stream_logs': True, 'project': '<gcp-project>'}
User: <redacted>
CONTAINER_REGISTRY: artifact_registry
Name: artifact_registry
ID: <redacted>
Type: container_registry
Flavor: gcp
Configuration: {'uri': '<artifact-registry-uri>'}
User: <redacted>
ARTIFACT_STORE: gcs_store
Name: gcs_store
ID: <redacted>
Type: artifact_store
Flavor: gcp
Configuration: {'path': 'gs://<bucket>/zenml'}
User: <redacted>
What happened?
I'll be honest, this is something Claude debugged for me, so I'm not entirely sure it isn't a misconfiguration on my end. That said, the root cause traces to a specific line in the integration, so I'm filing it in case it helps others.
Filling in the checkboxes I missed when creating this issue via CLI.
ZenML version: 0.94.1
Stack: vm_gcp orchestrator + GCP Artifact Registry (containerized steps)
Describe the bug
In zenml/integrations/skypilot/utils.py, the function create_docker_run_command() generates environment flags in the form -e KEY (key only, no value):
docker_environment_str = " ".join(
    f"-e {shlex.quote(k)}" for k in environment
)
Docker's -e KEY syntax means "inherit this variable from the calling shell environment." This works fine without sudo.
However, skypilot_base_vm_orchestrator.py calls this function with use_sudo=True (hardcoded), and sudo resets the environment by default. As a result, each -e KEY flag finds nothing to inherit, and the variable is never set inside the container.
Effect: the container receives none of the ZenML configuration variables (ZENML_STORE_URL, ZENML_STORE_TYPE, auth tokens, etc.), falls back to a local SQLite store, and immediately crashes:
ModuleNotFoundError: No module named 'sqlalchemy_utils'
This only affects the vm_gcp orchestrator path. The vm_kubernetes path takes a different branch (runs Python directly in a virtualenv and never calls create_docker_run_command()), which is likely why this went unnoticed.
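To make the failure mode concrete, here is a minimal sketch that models the behaviour described above. It does not call Docker or sudo. resolve_container_env() is a hypothetical helper that imitates how docker run treats -e KEY versus -e KEY=VALUE, sudo's environment reset is approximated by an empty calling environment, and the store URL is a placeholder.

```python
import shlex


def key_only_flags(environment):
    """Build flags in the key-only form quoted in this issue."""
    return " ".join(f"-e {shlex.quote(k)}" for k in environment)


def resolve_container_env(flags, calling_env):
    """Simplified model (an assumption, not Docker's code) of -e handling.

    `-e KEY=VALUE` sets the value directly. `-e KEY` copies KEY from the
    calling environment and is skipped when KEY is not set there.
    """
    tokens = shlex.split(flags)
    resolved = {}
    for flag, arg in zip(tokens[::2], tokens[1::2]):
        if flag != "-e":
            raise ValueError(f"unexpected flag: {flag}")
        key, sep, value = arg.partition("=")
        if sep:
            resolved[key] = value
        elif key in calling_env:
            resolved[key] = calling_env[key]
    return resolved


# Placeholder values for illustration only.
environment = {
    "ZENML_STORE_URL": "https://zenml.example.invalid",
    "ZENML_STORE_TYPE": "rest",
}
flags = key_only_flags(environment)

# Without sudo, the calling shell still holds the variables.
seen_without_sudo = resolve_container_env(flags, calling_env=environment)

# With sudo, the reset environment no longer contains them
# (modelled here as an empty mapping).
seen_with_sudo = resolve_container_env(flags, calling_env={})

print(seen_without_sudo)
print(seen_with_sudo)
```

Under this model the container sees every variable in the first case and none in the second, which matches the fallback to a local SQLite store reported above.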
Expected behavior
The container should receive all environment variables and connect to the remote ZenML server as configured.
Proposed fix
Use -e KEY=VALUE to pass values explicitly, bypassing sudo's environment reset:
docker_environment_str = " ".join(
    f"-e {shlex.quote(k)}={shlex.quote(str(v))}"
    for k, v in environment.items()
)
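The proposed change can be tried in isolation with a small standalone sketch. build_env_flags() is a hypothetical name used only for this illustration (the real logic lives inside create_docker_run_command()), and the values are placeholders. The round trip through shlex.split() checks that a value containing spaces survives shell parsing as a single argument.

```python
import shlex


def build_env_flags(environment):
    """Build `-e KEY=VALUE` flags with both halves shell-quoted."""
    return " ".join(
        f"-e {shlex.quote(k)}={shlex.quote(str(v))}"
        for k, v in environment.items()
    )


# Placeholder values for illustration only.
environment = {
    "ZENML_STORE_URL": "https://zenml.example.invalid",
    "EXAMPLE_WITH_SPACE": "a b",
}
flags = build_env_flags(environment)
print(flags)

# Parse the string the way a POSIX shell would split it into arguments.
parsed = shlex.split(flags)
print(parsed)
```

One trade-off to keep in mind: with inline values, secrets such as auth tokens become part of the command line that is run on the VM, so they may show up in process listings or logs there.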
Reproduction steps
- Set up a ZenML stack with a vm_gcp SkyPilot orchestrator and a container registry
- Run any pipeline that uses a Docker image (i.e. the stack has a container registry configured)
- The orchestrator submits sudo docker run -e ZENML_STORE_URL ... on the GCP VM
- Observe that the container starts with no ZenML configuration
Relevant log output
ModuleNotFoundError: No module named 'sqlalchemy_utils'
And a few lines above it in the traceback:
KeyError: 'ZENML_STORE_URL'
Code of Conduct