Releases: dstackai/dstack

0.19.9

15 May 09:51
2f96871

Metrics

Previously, dstack stored and displayed only the metrics collected within the last hour, so once a run or job finished, its metrics eventually disappeared. Now, dstack keeps the last hour of metrics for all finished runs.
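
For example, you can still query a finished run's metrics from the CLI (a minimal example; the run name is a placeholder):

$ dstack metrics my-finished-run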

(Screenshot: metrics shown in the UI for a finished run)

AMD

dstack now supports a wider range of ROCm/AMD SMI versions on AMD accelerators. Previously, metrics were not displayed properly with certain versions.

CLI

Container exit status

The CLI now displays the container exit status of each failed run or job:

(Screenshot: container exit status shown for a failed run)

This information can be seen via dstack ps if you pass -v:

(Screenshot: dstack ps -v output showing the exit status)

Server

Robust handling of networking issues

The dstack server sometimes cannot establish connections to running instances due to networking problems, or because instances become temporarily unreachable. Previously, dstack failed jobs very quickly in such cases. Now, the server allows a grace period of 2 minutes before considering jobs failed when their instances are unreachable.

Environment variables

Two new environment variables are now available within runs:

  • DSTACK_RUN_ID stores the UUID of the run. Unlike DSTACK_RUN_NAME, it is unique for every run.
  • DSTACK_JOB_ID stores the UUID of the job submission. It's unique for every replica, job, and retry attempt.
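
For example, a run can read these variables at runtime (a minimal sketch; the command is purely illustrative):

type: task
name: print-ids

commands:
  # both variables are injected by dstack
  - echo "Run $DSTACK_RUN_ID, job submission $DSTACK_JOB_ID"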

Full changelog: 0.19.8...0.19.9

0.19.8

07 May 15:46
2e3da2c

Nebius

InfiniBand clusters

The nebius backend now supports InfiniBand clusters. A cluster is created automatically when you apply a fleet configuration with placement: cluster and supported GPUs, e.g., 8xH100 or 8xH200.

type: fleet
name: my-fleet

nodes: 2
placement: cluster

resources:
  gpu: H100,H200:8

A suitable InfiniBand fabric for the cluster is selected automatically. You can also limit the allowed fabrics in the backend settings.
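
For instance, the backend settings in server/config.yml could limit the allowed fabrics (a sketch assuming a fabrics list setting; the fabric names are placeholders, and credentials are omitted):

projects:
- name: main
  backends:
  - type: nebius
    creds:
      type: service_account
      # ...service account credentials as described in the Nebius backend docs
    fabrics: [fabric-2, fabric-4]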

Once the cluster is provisioned, you can benefit from its high-speed networking when running distributed tasks, such as NCCL tests or Hugging Face TRL.

ARM

dstack now supports compute instances with ARM CPUs. To request ARM CPUs in a run or fleet configuration, specify the arm architecture in the resources.cpu property:

resources:
  cpu: arm:4..  # 4 or more ARM cores

If the hosts in an SSH fleet have ARM CPUs, dstack will automatically detect them and enable their use.

To see available offers with ARM CPUs, pass --cpu arm to the dstack offer command.
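
For example:

$ dstack offer --cpu arm --max-offers 10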

Lambda

GH200

With the lambda backend, it's now possible to use GH200 instances that come with an ARM-based 72-core NVIDIA Grace CPU and an NVIDIA H200 Tensor Core GPU, connected with a high-bandwidth, memory-coherent NVIDIA NVLink-C2C interconnect.

type: dev-environment
name: my-env

ide: vscode

resources:
  gpu: GH200:1

If Lambda has GH200 on-demand instances available at the time, you'll see them when you run dstack apply:

$ dstack apply -f .dstack.yml

 #   BACKEND             RESOURCES                                      INSTANCE TYPE  PRICE
 1   lambda (us-east-3)  cpu=arm:64 mem=464GB disk=4399GB GH200:96GB:1  gpu_1x_gh200   $1.49

Note: if no GH200 is available at the moment, you can specify a retry policy in your run configuration so that dstack runs the configuration once the GPU becomes available.
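
A minimal sketch of such a configuration (the one-day duration is an arbitrary choice):

type: dev-environment
name: my-env

ide: vscode

resources:
  gpu: GH200:1

retry:
  # keep retrying while there is no capacity, for up to one day
  on_events: [no-capacity]
  duration: 1d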

Azure

Managed identities

The new vm_managed_identity backend setting allows you to configure the managed identity that is assigned to VMs created in the azure backend.

projects:
- name: main
  backends:
  - type: azure
    subscription_id: 06c82ce3-28ff-4285-a146-c5e981a9d808
    tenant_id: f84a7584-88e4-4fd2-8e97-623f0a715ee1
    creds:
      type: default
    vm_managed_identity: dstack-rg/my-managed-identity

Make sure that dstack has the required permissions for managed identities to work.

What's changed

  • Fix: handle OSError from os.get_terminal_size() in CLI table rendering for non-TTY environments by @vuyelwadr in #2599
  • Clarify how retry works for tasks and services by @r4victor in #2600
  • [Docs] Added Tenstorrent example by @peterschmidt85 in #2596
  • Lambda: Docker: use cgroupfs driver by @un-def in #2603
  • Don't collect Prometheus metrics on container-based backends by @un-def in #2605
  • Support Nebius InfiniBand clusters by @jvstme in #2604
  • Add ARM64 support by @un-def in #2595
  • Allow to configure Nebius InfiniBand fabrics by @jvstme in #2607
  • Support vm_managed_identity for Azure by @r4victor in #2608
  • Fix API quota hitting when provisioning many A3 instances by @r4victor in #2610

Full changelog: 0.19.7...0.19.8

0.19.7

01 May 14:05
1321113

This update fixes multi-node fleet provisioning on GCP.

What's changed

  • Revert "Use AS_COMPACT collocation for gcp placement groups (#2587)" by @un-def in #2592

Full changelog: 0.19.6...0.19.7

0.19.6

01 May 14:04
f57e7cc

Plugins

Run configurations have many options. While dstack aims to simplify them and provide sensible defaults, teams may sometimes want to enforce their own defaults and configuration policies across projects.

To support this, we're introducing a plugin system that allows such rules to be defined programmatically. You can now define a plugin using dstack's Python SDK and bundle it with the dstack server.

For example, you can create your own plugin to override run configuration options—e.g., to prepend commands, set policies, and more.

For more information on plugin development, see the documentation and example.

Note

Plugins are currently an experimental feature. Backward compatibility is not guaranteed between releases.

Tenstorrent

This update introduces initial support for Tenstorrent's Wormhole accelerators.

Now, if you create SSH fleets with hosts that have N150 or N300 PCIe boards, dstack will automatically detect them and allow you to use such a fleet for running dev environments, tasks, and services.
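
A minimal SSH fleet sketch (the host address, user, and key path are placeholders):

type: fleet
name: tt-fleet

ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.0.17  # a host with an N150 or N300 board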

Dedicated examples for using dstack with Tenstorrent's accelerators will be published soon.

Warning

Ensure you update to 0.19.7, which includes a critical hot-fix for GCP.

Full changelog: 0.19.5...0.19.6

0.19.5

23 Apr 10:39
16ddda8

CLI

Offers

You can now list available offers (hardware configurations) from the configured backends using the CLI—without needing to define a run configuration. Just run dstack offer and specify the resource requirements. The CLI will output available offers, including backend, region, instance type, resources, spot availability, and pricing:

$ dstack offer --gpu H100:1.. --max-offers 10

 #   BACKEND     REGION     INSTANCE TYPE          RESOURCES                                     SPOT  PRICE   
 1   datacrunch  FIN-01     1H100.80S.30V          30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 2   datacrunch  FIN-02     1H100.80S.30V          30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 3   datacrunch  FIN-02     1H100.80S.32V          32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 4   datacrunch  ICE-01     1H100.80S.32V          32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 5   runpod      US-KS-2    NVIDIA H100 PCIe       16xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.39   
 6   runpod      CA         NVIDIA H100 80GB HBM3  24xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.69   
 7   nebius      eu-north1  gpu-h100-sxm           16xCPU, 200GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.95   
 8   runpod      AP-JP-1    NVIDIA H100 80GB HBM3  20xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.99   
 9   runpod      CA-MTL-1   NVIDIA H100 80GB HBM3  28xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.99   
 10  runpod      CA-MTL-2   NVIDIA H100 80GB HBM3  26xCPU, 125GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.99   
     ...                                                                                                                
 Shown 10 of 99 offers, $127.816 max

Learn more about how the new command works in the reference.

Configuration

Resource tags

It's now possible to set custom resource-level tags using the new tags property:

type: dev-environment
ide: vscode
tags:
  my_custom_tag: some_value
  another_tag: another_value_123

The tags property is supported by all configuration types: runs, fleets, volumes, gateways, and profiles. The tags are propagated to the underlying cloud resources on backends that support tags; currently, these are AWS, Azure, and GCP.

Shell configuration

With the new shell property, you can specify the shell used to run commands (or init for dev environments):

type: task
image: ubuntu

shell: bash
commands:
  # now we can use Bash features, e.g., arrays:
  - words=(dstack is)
  - words+=(awesome)
  - echo ${words[@]}  # prints "dstack is awesome"

GCP

A3 High and A3 Edge

dstack now automatically sets up GCP A3 High and A3 Edge instances with GPUDirect-TCPX optimized NCCL communication.

An example on how to provision an A3 High cluster and run NCCL tests on it using dstack is coming soon!

Volumes

Total cost

The UI now shows the total cost and termination date of volumes alongside the volume price. Previously, only the price was shown.

(Screenshot: volume total cost and termination date in the UI)

Full changelog: 0.19.4...0.19.5

0.19.4

17 Apr 10:23
fb57f55

Services

Rate limits

You can now configure rate limits for your services running behind a gateway.

type: service
image: my-app:latest
port: 80

rate_limits:
# For /api/auth/* - 1 request per second, no bursts
- prefix: /api/auth/
  rps: 1
# For other URLs - 4 requests per second + bursts of up to 9 requests
- rps: 4
  burst: 9

Examples

TensorRT-LLM

We added a new TensorRT-LLM example that shows how to deploy both DeepSeek R1 and its distilled version using TensorRT-LLM and dstack.

Llama 4

The Llama example was updated to demonstrate the deployment of Llama 4 Scout using dstack.

Contributing

We continue to make contributing to dstack easier and to improve the dev experience. Since the last release, we moved from pip to uv in the CI and dev pipelines; dependency installation time went from ~70 seconds to under 10 seconds. The Development guide was updated to show how to set up dstack for development with uv. The CI build pipeline triggered on pull requests was optimized from 9 minutes down to 4 minutes.

We also documented uv as one of the recommended installation options for dstack.

Full changelog: 0.19.3...0.19.4

0.19.3

10 Apr 10:35

GCP

A3 Mega

dstack now automatically sets up GCP A3 Mega instances with GPUDirect-TCPXO optimized NCCL communication to take advantage of the maximum network bandwidth of 1800 Gbps. Here are the NCCL test results on an A3 Mega cluster provisioned with dstack:

$ dstack apply -f examples/misc/a3mega-clusters/nccl-tests.dstack.yml

nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0

                                                             out-of-place                       in-place          
      size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
       (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608        131072     float    none      -1    166.6   50.34   47.19    N/A    164.1   51.11   47.92    N/A
    16777216        262144     float    none      -1    204.6   82.01   76.89    N/A    203.8   82.30   77.16    N/A
    33554432        524288     float    none      -1    284.0  118.17  110.78    N/A    281.7  119.12  111.67    N/A
    67108864       1048576     float    none      -1    447.4  150.00  140.62    N/A    443.5  151.31  141.86    N/A
   134217728       2097152     float    none      -1    808.3  166.05  155.67    N/A    801.9  167.38  156.92    N/A
   268435456       4194304     float    none      -1   1522.1  176.36  165.34    N/A   1518.7  176.76  165.71    N/A
   536870912       8388608     float    none      -1   2892.3  185.62  174.02    N/A   2894.4  185.49  173.89    N/A
  1073741824      16777216     float    none      -1   5532.7  194.07  181.94    N/A   5530.7  194.14  182.01    N/A
  2147483648      33554432     float    none      -1    10863  197.69  185.34    N/A    10837  198.17  185.78    N/A
  4294967296      67108864     float    none      -1    21481  199.94  187.45    N/A    21466  200.08  187.58    N/A
  8589934592     134217728     float    none      -1    42713  201.11  188.54    N/A    42701  201.16  188.59    N/A
Out of bounds values : 0 OK
Avg bus bandwidth    : 146.948 

Done

For more information on how to provision and use A3 Mega clusters with GPUDirect-TCPXO, see the A3 Mega example.

DataCrunch

H200 and B200 support

You can now provision H200 and B200 instances on DataCrunch. DataCrunch is the first dstack backend to support B200:

$ dstack apply --gpu B200
 Project              main                                   
 User                 admin                                  
 Configuration        .dstack.yml                            
 Type                 dev-environment                        
 Resources            1..xCPU, 2GB.., 1xB200, 100GB.. (disk) 
 Max price            -                                      
 Max duration         -                                      
 Inactivity duration  -                                      
 Spot policy          auto                                   
 Retry policy         -                                      
 Creation policy      reuse-or-create                        
 Idle duration        5m                                     
 Reservation          -                                      

 #  BACKEND     REGION  INSTANCE   RESOURCES                                      SPOT  PRICE                
 1  datacrunch  FIN-03  1B200.31V  31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk)  yes   $1.3                 
 2  datacrunch  FIN-03  1B200.31V  31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk)  no    $4.49
 3  datacrunch  FIN-01  1B200.31V  31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk)  yes   $1.3   not available 
    ...                                                                                                      
 Shown 3 of 8 offers, $4.49 max

Submit a new run? [y/n]:                        

CUDO

The CUDO backend has been updated to support H100, A100, A40, and all other GPUs currently offered by CUDO.

Fleets

With the new fleets property and the --fleet option of dstack apply, it's now possible to restrict the set of fleets considered for reuse:

type: task

fleets: [my-fleet-1, my-fleet-2]

or

dstack apply --fleet my-fleet-1 --fleet my-fleet-2

Full changelog: 0.19.2...0.19.3

0.19.2

03 Apr 13:38
e23783a

Nebius

This update introduces an integration with Nebius, a cloud provider offering top-tier NVIDIA GPUs at competitive prices.

$ dstack apply
 #  BACKEND  REGION     RESOURCES                        SPOT  PRICE
 1  nebius   eu-north1  8xCPU, 32GB, 1xL40S (48GB)       no    $1.5484
 2  nebius   eu-north1  16xCPU, 200GB, 1xH100 (80GB)     no    $2.95
 3  nebius   eu-north1  16xCPU, 200GB, 1xH200 (141GB)    no    $3.5
 4  nebius   eu-north1  64xCPU, 384GB, 2xL40S (48GB)     no    $4.5688
 5  nebius   eu-north1  128xCPU, 768GB, 4xL40S (48GB)    no    $9.1376
 6  nebius   eu-north1  128xCPU, 1600GB, 8xH100 (80GB)   no    $23.6
 7  nebius   eu-north1  128xCPU, 1600GB, 8xH200 (141GB)  no    $28

The new nebius backend supports CPU and GPU instances, fleets, distributed tasks, and more. Support for network volumes and enhanced inter-node connectivity is coming in future releases. See the docs for instructions on configuring Nebius in your dstack project.

Metrics

This release brings a long-awaited feature — the Metrics page in the UI:

(Screenshot: the Metrics page in the dstack UI)

In addition, the dstack stats command was renamed to dstack metrics and updated: previously, the maximum CPU utilization value depended on the number of CPUs (for example, 400% for a 4-core CPU); now it's normalized to 100%.

$ dstack metrics nccl-tests
 NAME        CPU  MEMORY            GPU
 nccl-tests  81%  2754MB/1638400MB  #0 100740MB/144384MB 100% Util
                                    #1 100740MB/144384MB 100% Util
                                    #2 100740MB/144384MB 99% Util
                                    #3 100740MB/144384MB 99% Util
                                    #4 100740MB/144384MB 99% Util
                                    #5 100740MB/144384MB 99% Util
                                    #6 100740MB/144384MB 99% Util
                                    #7 100740MB/144384MB 100% Util

Full changelog: 0.19.1...0.19.2

0.19.1

26 Mar 12:26
9d0b83f

Metrics

With this update, we've added more metrics that you can export to Prometheus. The new metrics allow tracking job CPU and system memory utilization, user and project usage stats, success/error rate, and more.

Runs

 NAME                               TYPE     DESCRIPTION                    EXAMPLES
 dstack_run_count_total             counter  The total number of runs       537
 dstack_run_count_terminated_total  counter  The number of terminated runs  118
 dstack_run_count_failed_total      counter  The number of failed runs      27
 dstack_run_count_done_total        counter  The number of successful runs  218

Run jobs

 NAME                                 TYPE     DESCRIPTION                                          EXAMPLES
 dstack_job_cpu_count                 gauge    Job CPU count                                        32.0
 dstack_job_cpu_time_seconds_total    counter  Total CPU time consumed by the job, seconds          11.727975
 dstack_job_memory_total_bytes        gauge    Total memory allocated for the job, bytes            4009754624.0
 dstack_job_memory_usage_bytes        gauge    Memory used by the job (including cache), bytes      339017728.0
 dstack_job_memory_working_set_bytes  gauge    Memory used by the job (not including cache), bytes  147251200.0

For more details, see the Metrics documentation.

Major bugfixes

Fixed a bug introduced in 0.19.0 where the working directory in the container was incorrectly set by default to / instead of /workflow.
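
If a run relies on a particular working directory, it can also be set explicitly rather than depending on the default (a minimal sketch using the working_dir property; the path is illustrative):

type: task
name: my-task

# set the working directory explicitly instead of relying on the default
working_dir: /workflow

commands:
  - pwd  # prints /workflow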

Full changelog: 0.19.0...0.19.1

0.19.0

20 Mar 10:19
575776b

Contributing

Simplified backend integration

To provide the best multi-cloud experience and GPU availability, dstack integrates with many cloud GPU providers, including AWS, Azure, GCP, RunPod, Lambda, Vultr, and others. As we'd like to see even more GPU providers supported by dstack, this release comes with a major internal refactoring aimed at simplifying the process of adding new integrations. See the Backend integration guide for more details. Join our Discord if you have any questions about the integration process.

Examples

MPI workloads and NCCL tests

dstack now configures inter-node SSH connectivity for distributed tasks. You can log in to any node from any other node with a simple ssh <node_ip> command. The out-of-the-box SSH connectivity also allows running mpirun. See the NCCL Tests example.
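
A minimal sketch of a distributed task that uses this connectivity (assuming Open MPI is installed in the image; dstack injects the DSTACK_NODE_RANK, DSTACK_NODES_NUM, and DSTACK_NODES_IPS variables):

type: task
name: mpi-hello

nodes: 2

commands:
  # only the first node launches mpirun; the others stay up so MPI can reach them over SSH
  - |
    if [ "$DSTACK_NODE_RANK" = "0" ]; then
      mpirun --allow-run-as-root -np "$DSTACK_NODES_NUM" \
        --host "$(echo "$DSTACK_NODES_IPS" | paste -sd, -)" hostname
    else
      sleep infinity
    fi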

Monitoring

Cost and usage metrics

In addition to DCGM metrics, dstack now exports a set of Prometheus metrics for cost and usage tracking. Here's how they may look in a Grafana dashboard:

(Screenshot: a Grafana dashboard built on dstack cost and usage metrics)

See the documentation for a full list of metrics and labels.

Cursor IDE support

dstack can now launch Cursor dev environments. Just specify ide: cursor in the run configuration:

type: dev-environment
ide: cursor

Deprecations

  • The Python API methods get_plan(), exec_plan(), and submit() are deprecated in favor of get_run_plan(), apply_plan(), and apply_configuration(). The deprecated methods had clumsy signatures with many top-level parameters. The new signatures align better with the CLI and HTTP API.

Breaking changes

The 0.19.0 release drops several previously deprecated or undocumented features. There are no other significant breaking changes. The 0.19.0 server continues to support 0.18.x CLI versions, but the 0.19.0 CLI does not work with older 0.18.x servers, so update the server first, or update the server and clients simultaneously.

  • Drop the dstack run CLI command.
  • Drop the --attach mode for the dstack logs CLI command.
  • Drop Pools functionality:
    • The dstack pool CLI commands.
    • The /api/project/{project_name}/runs/get_offers, /api/project/{project_name}/runs/create_instance, /api/pools/list_instances, and /api/project/{project_name}/pool/* API endpoints.
    • pool_name and instance_name parameters in profiles and run configurations.
  • Remove retry_policy from profiles.
  • Remove termination_idle_time and termination_policy from profiles and fleet configurations.
  • Drop the RUN_NAME and REPO_ID run environment variables.
  • Drop the /api/backends/config_values endpoint used for interactive configuration.
  • The API accepts and returns azure_config["regions"] instead of azure_config["locations"] (unified with server/config.yml).

Full changelog: 0.18.44...0.19.0