skypilot/utils.py: create_docker_run_command passes -e KEY without value, breaking env vars under sudo #4652

@XnetLoL

Contact Details [Optional]

arrobaouassim@gmail.com

System Information

ZENML_LOCAL_VERSION: 0.94.1
ZENML_SERVER_VERSION: 0.94.1
ZENML_SERVER_DATABASE: mysql
ZENML_SERVER_DEPLOYMENT_TYPE: other
ZENML_CONFIG_DIR: /home/<user>/.config/zenml
ZENML_LOCAL_STORE_DIR: /home/<user>/.config/zenml/local_stores
ZENML_SERVER_URL: <redacted>
ZENML_ACTIVE_REPOSITORY_ROOT: /home/<user>/dev/<project>
PYTHON_VERSION: 3.13.12
ENVIRONMENT: wsl
SYSTEM_INFO: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '24.04'}
ACTIVE_PROJECT: default
ACTIVE_STACK: cloud_stack
ACTIVE_USER: <redacted>
TELEMETRY_STATUS: enabled
ANALYTICS_CLIENT_ID: <redacted>
ANALYTICS_USER_ID: <redacted>
ANALYTICS_SERVER_ID: <redacted>
INTEGRATIONS: ['scipy', 'numpy', 'kaniko', 'kubernetes', 'wandb', 'airflow', 's3', 'sklearn', 'pandas', 'pillow']

CURRENT STACK

Name: cloud_stack
ID: <redacted>
User: <redacted>

IMAGE_BUILDER: local_builder

Name: local_builder
ID: <redacted>
Type: image_builder
Flavor: local
Configuration: {}
User: <redacted>

EXPERIMENT_TRACKER: wandb_tracker

Name: wandb_tracker
ID: <redacted>
Type: experiment_tracker
Flavor: wandb
Configuration: {'api_key': '********', 'entity': '<redacted>', 'project_name': '<redacted>'}
User: <redacted>

ORCHESTRATOR: skypilot_gcp

Name: skypilot_gcp
ID: <redacted>
Type: orchestrator
Flavor: vm_gcp
Configuration: {'region': 'us-east4', 'idle_minutes_to_autostop': 30, 'down': True, 'stream_logs': True, 'project': '<gcp-project>'}
User: <redacted>

CONTAINER_REGISTRY: artifact_registry

Name: artifact_registry
ID: <redacted>
Type: container_registry
Flavor: gcp
Configuration: {'uri': '<artifact-registry-uri>'}
User: <redacted>

ARTIFACT_STORE: gcs_store

Name: gcs_store
ID: <redacted>
Type: artifact_store
Flavor: gcp
Configuration: {'path': 'gs://<bucket>/zenml'}
User: <redacted>

What happened?

I'll be honest, this is something Claude debugged for me, so I'm not entirely sure it isn't a misconfiguration on my end. That said, the root cause traces to a specific line in the integration, so I'm filing it in case it helps others.

Filling in the checkboxes I missed when creating this issue via CLI.

ZenML version: 0.94.1
Stack: vm_gcp orchestrator + GCP Artifact Registry (containerized steps)


Describe the bug

In zenml/integrations/skypilot/utils.py, the function create_docker_run_command() generates environment flags in the form -e KEY (key only, no value):

docker_environment_str = " ".join(
    f"-e {shlex.quote(k)}" for k in environment
)

Docker's -e KEY syntax means "inherit this variable from the calling shell environment." This works fine without sudo.

However, skypilot_base_vm_orchestrator.py calls this function with use_sudo=True (hardcoded), and sudo resets the environment by default, so the variables are no longer present when docker looks them up and every -e KEY flag leaves its variable unset in the container.
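
The inheritance problem can be reproduced without Docker or sudo. The following is a minimal sketch, not ZenML code: it starts a child process once with the inherited environment and once with the variable dropped, which approximates what sudo's default environment reset does before `docker run -e KEY` looks the variable up. `DEMO_STORE_URL` and `read_in_child` are hypothetical names used only for illustration.

```python
import os
import subprocess
import sys


def read_in_child(name, env):
    """Return the value of `name` as seen by a child process, or '<unset>'."""
    code = (
        "import os, sys; "
        "sys.stdout.write(os.environ.get(sys.argv[1], '<unset>'))"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical variable standing in for ZENML_STORE_URL.
os.environ["DEMO_STORE_URL"] = "mysql://example"

# Case 1: the child inherits the caller's environment (no sudo).
inherited = read_in_child("DEMO_STORE_URL", dict(os.environ))

# Case 2: the variable is stripped before the child starts,
# which is roughly what sudo's environment reset does.
scrubbed_env = {k: v for k, v in os.environ.items() if k != "DEMO_STORE_URL"}
scrubbed = read_in_child("DEMO_STORE_URL", scrubbed_env)

print("inherited:", inherited)
print("scrubbed: ", scrubbed)
```

In the second case the child has nothing to inherit, which is the situation the container ends up in.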

Effect: the container receives none of the ZenML configuration variables (ZENML_STORE_URL, ZENML_STORE_TYPE, auth tokens, etc.), falls back to a local SQLite store, and immediately crashes:

ModuleNotFoundError: No module named 'sqlalchemy_utils'

This affects the Docker-based VM path, which I hit with the vm_gcp orchestrator. The vm_kubernetes path takes a different branch (runs Python directly in a virtualenv and never calls create_docker_run_command()), which is likely why this went unnoticed.


Expected behavior

The container should receive all environment variables and connect to the remote ZenML server as configured.


Proposed fix

Use -e KEY=VALUE to pass values explicitly, bypassing sudo's environment reset:

docker_environment_str = " ".join(
    f"-e {shlex.quote(k)}={shlex.quote(str(v))}"
    for k, v in environment.items()
)
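
To make the difference concrete, here is a standalone sketch of both variants side by side. It is not the actual ZenML function: `key_only_flags` and `key_value_flags` are hypothetical helper names, and the environment values are made up.

```python
import shlex


def key_only_flags(environment):
    """Current behaviour: emit -e KEY and rely on the caller's environment."""
    return " ".join(f"-e {shlex.quote(k)}" for k in environment)


def key_value_flags(environment):
    """Proposed behaviour: emit -e KEY=VALUE so nothing has to be inherited."""
    return " ".join(
        f"-e {shlex.quote(k)}={shlex.quote(str(v))}"
        for k, v in environment.items()
    )


# Made-up values for illustration only.
env = {
    "ZENML_STORE_URL": "mysql://example-host/zenml",
    "DEMO_NOTE": "two words",
}

print(key_only_flags(env))
# -e ZENML_STORE_URL -e DEMO_NOTE
print(key_value_flags(env))
# -e ZENML_STORE_URL=mysql://example-host/zenml -e DEMO_NOTE='two words'
```

Quoting the value separately keeps spaces and shell metacharacters intact. One trade-off to be aware of is that values passed this way appear on the docker command line, so they may be visible in the VM's process list.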

Reproduction steps

  1. Set up a ZenML stack with a vm_gcp SkyPilot orchestrator and a container registry
  2. Run any pipeline that uses a Docker image (i.e. the stack has a container registry configured)
  3. The orchestrator submits sudo docker run -e ZENML_STORE_URL ... on the GCP VM
  4. Observe that the container starts with no ZenML configuration

Relevant log output

ModuleNotFoundError: No module named 'sqlalchemy_utils'

And a few lines above it in the traceback:

KeyError: 'ZENML_STORE_URL'

Labels: core-team, planned
Status: Next-in-line