Releases: dstackai/dstack

0.19.9

15 May 09:51
2f96871

Metrics

Previously, dstack stored and displayed only the metrics collected within the last hour, so once a run or job finished, its metrics eventually disappeared. Now, dstack keeps the last hour of metrics for all finished runs.
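
For example, you can still query a finished run's metrics from the CLI (a minimal example; the run name is a placeholder):

$ dstack metrics my-finished-run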

(Screenshot: metrics shown in the UI for a finished run)

AMD

dstack now supports a wider range of ROCm/AMD SMI versions on AMD accelerators. Previously, metrics were not displayed properly with certain versions.

CLI

Container exit status

The CLI now displays the container exit status of each failed run or job:

(Screenshot: container exit status shown for a failed run)

This information can be seen via dstack ps if you pass -v:

(Screenshot: dstack ps -v output showing the exit status)

Server

Robust handling of networking issues

The dstack server sometimes cannot establish connections to running instances due to networking problems, or because instances become temporarily unreachable. Previously, dstack failed jobs very quickly in such cases. Now, the server allows a grace period of 2 minutes before considering jobs failed when their instances are unreachable.

Environment variables

Two new environment variables are now available within runs:

  • DSTACK_RUN_ID stores the UUID of the run. Unlike DSTACK_RUN_NAME, it is unique for every run.
  • DSTACK_JOB_ID stores the UUID of the job submission. It's unique for every replica, job, and retry attempt.
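
For example, a run can read these variables at runtime (a minimal sketch; the command is purely illustrative):

type: task
name: print-ids

commands:
  # both variables are injected by dstack
  - echo "Run $DSTACK_RUN_ID, job submission $DSTACK_JOB_ID"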

Full changelog: 0.19.8...0.19.9

0.19.8

07 May 15:46
2e3da2c

Nebius

InfiniBand clusters

The nebius backend now supports InfiniBand clusters. A cluster is created automatically when you apply a fleet configuration with placement: cluster and supported GPUs, e.g., 8xH100 or 8xH200.

type: fleet
name: my-fleet

nodes: 2
placement: cluster

resources:
  gpu: H100,H200:8

A suitable InfiniBand fabric for the cluster is selected automatically. You can also limit the allowed fabrics in the backend settings.
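
For instance, the backend settings in server/config.yml could limit the allowed fabrics (a sketch assuming a fabrics list setting; the fabric names are placeholders, and credentials are omitted):

projects:
- name: main
  backends:
  - type: nebius
    creds:
      type: service_account
      # ...service account credentials as described in the Nebius backend docs
    fabrics: [fabric-2, fabric-4]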

Once the cluster is provisioned, you can benefit from its high-speed networking when running distributed tasks, such as NCCL tests or Hugging Face TRL.

ARM

dstack now supports compute instances with ARM CPUs. To request ARM CPUs in a run or fleet configuration, specify the arm architecture in the resources.cpu property:

resources:
  cpu: arm:4..  # 4 or more ARM cores

If the hosts in an SSH fleet have ARM CPUs, dstack will automatically detect them and enable their use.

To see available offers with ARM CPUs, pass --cpu arm to the dstack offer command.
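
For example:

$ dstack offer --cpu arm --max-offers 10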

Lambda

GH200

With the lambda backend, it's now possible to use GH200 instances that come with an ARM-based 72-core NVIDIA Grace CPU and an NVIDIA H200 Tensor Core GPU, connected with a high-bandwidth, memory-coherent NVIDIA NVLink-C2C interconnect.

type: dev-environment
name: my-env

ide: vscode

resources:
  gpu: GH200:1

If Lambda has GH200 on-demand instances available at the time, you'll see them when you run dstack apply:

$ dstack apply -f .dstack.yml

 #   BACKEND             RESOURCES                                      INSTANCE TYPE  PRICE
 1   lambda (us-east-3)  cpu=arm:64 mem=464GB disk=4399GB GH200:96GB:1  gpu_1x_gh200   $1.49

Note: if no GH200 is available at the moment, you can specify a retry policy in your run configuration so that dstack runs the configuration once the GPU becomes available.
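
A minimal sketch of such a configuration (the one-day duration is an arbitrary choice):

type: dev-environment
name: my-env

ide: vscode

resources:
  gpu: GH200:1

retry:
  # keep retrying while there is no capacity, for up to one day
  on_events: [no-capacity]
  duration: 1d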

Azure

Managed identities

The new vm_managed_identity backend setting allows you to configure the managed identity that is assigned to VMs created in the azure backend.

projects:
- name: main
  backends:
  - type: azure
    subscription_id: 06c82ce3-28ff-4285-a146-c5e981a9d808
    tenant_id: f84a7584-88e4-4fd2-8e97-623f0a715ee1
    creds:
      type: default
    vm_managed_identity: dstack-rg/my-managed-identity

Make sure that dstack has the required permissions for managed identities to work.

What's changed

  • Fix: handle OSError from os.get_terminal_size() in CLI table rendering for non-TTY environments by @vuyelwadr in #2599
  • Clarify how retry works for tasks and services by @r4victor in #2600
  • [Docs] Added Tenstorrent example by @peterschmidt85 in #2596
  • Lambda: Docker: use cgroupfs driver by @un-def in #2603
  • Don't collect Prometheus metrics on container-based backends by @un-def in #2605
  • Support Nebius InfiniBand clusters by @jvstme in #2604
  • Add ARM64 support by @un-def in #2595
  • Allow to configure Nebius InfiniBand fabrics by @jvstme in #2607
  • Support vm_managed_identity for Azure by @r4victor in #2608
  • Fix API quota hitting when provisioning many A3 instances by @r4victor in #2610

Full changelog: 0.19.7...0.19.8

0.19.7

01 May 14:05
1321113

This update fixes multi-node fleet provisioning on GCP.

What's changed

  • Revert "Use AS_COMPACT collocation for gcp placement groups (#2587)" by @un-def in #2592

Full changelog: 0.19.6...0.19.7

0.19.6

01 May 14:04
f57e7cc

Plugins

Run configurations have many options. While dstack aims to simplify them and provide sensible defaults, teams may sometimes want to enforce their own defaults and configuration policies across projects.

To support this, we're introducing a plugin system that allows such rules to be defined programmatically. You can now define a plugin using dstack's Python SDK and bundle it with the dstack server.

For example, you can create your own plugin to override run configuration options—e.g., to prepend commands, set policies, and more.

For more information on plugin development, see the documentation and example.

Note

Plugins are currently an experimental feature. Backward compatibility is not guaranteed between releases.

Tenstorrent

This update introduces initial support for Tenstorrent's Wormhole accelerators.

Now, if you create SSH fleets with hosts that have N150 or N300 PCIe boards, dstack will automatically detect them and allow you to use such a fleet for running dev environments, tasks, and services.
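
A minimal SSH fleet sketch (the host address, user, and key path are placeholders):

type: fleet
name: tt-fleet

ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.0.17  # a host with an N150 or N300 board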

Dedicated examples for using dstack with Tenstorrent's accelerators will be published soon.

Warning

Ensure you update to 0.19.7, which includes a critical hot-fix for GCP.

Full changelog: 0.19.5...0.19.6

0.19.5

23 Apr 10:39
16ddda8

CLI

Offers

You can now list available offers (hardware configurations) from the configured backends using the CLI—without needing to define a run configuration. Just run dstack offer and specify the resource requirements. The CLI will output available offers, including backend, region, instance type, resources, spot availability, and pricing:

$ dstack offer --gpu H100:1.. --max-offers 10

 #   BACKEND     REGION     INSTANCE TYPE          RESOURCES                                     SPOT  PRICE   
 1   datacrunch  FIN-01     1H100.80S.30V          30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 2   datacrunch  FIN-02     1H100.80S.30V          30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 3   datacrunch  FIN-02     1H100.80S.32V          32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 4   datacrunch  ICE-01     1H100.80S.32V          32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 5   runpod      US-KS-2    NVIDIA H100 PCIe       16xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.39   
 6   runpod      CA         NVIDIA H100 80GB HBM3  24xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.69   
 7   nebius      eu-north1  gpu-h100-sxm           16xCPU, 200GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.95   
 8   runpod      AP-JP-1    NVIDIA H100 80GB HBM3  20xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.99   
 9   runpod      CA-MTL-1   NVIDIA H100 80GB HBM3  28xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.99   
 10  runpod      CA-MTL-2   NVIDIA H100 80GB HBM3  26xCPU, 125GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.99   
     ...                                                                                                                
 Shown 10 of 99 offers, $127.816 max

Learn more about how the new command works in the reference.

Configuration

Resource tags

It's now possible to set custom resource-level tags using the new tags property:

type: dev-environment
ide: vscode
tags:
  my_custom_tag: some_value
  another_tag: another_value_123

The tags property is supported by all configuration types: runs, fleets, volumes, gateways, and profiles. The tags are propagated to the underlying cloud resources on backends that support tags; currently, these are AWS, Azure, and GCP.

Shell configuration

With the new shell property, you can specify the shell used to run commands (or init for dev environments):

type: task
image: ubuntu

shell: bash
commands:
  # now we can use Bash features, e.g., arrays:
  - words=(dstack is)
  - words+=(awesome)
  - echo ${words[@]}  # prints "dstack is awesome"

GCP

A3 High and A3 Edge

dstack now automatically sets up GCP A3 High and A3 Edge instances with GPUDirect-TCPX optimized NCCL communication.

An example on how to provision an A3 High cluster and run NCCL tests on it using dstack is coming soon!

Volumes

Total cost

The UI now shows the total cost and termination date of volumes alongside the volume price. Previously, only the price was shown.

(Screenshot: volume total cost and termination date in the UI)

Full changelog: 0.19.4...0.19.5

0.19.4

17 Apr 10:23
fb57f55

Services

Rate limits

You can now configure rate limits for your services running behind a gateway.

type: service
image: my-app:latest
port: 80

rate_limits:
# For /api/auth/* - 1 request per second, no bursts
- prefix: /api/auth/
  rps: 1
# For other URLs - 4 requests per second + bursts of up to 9 requests
- rps: 4
  burst: 9

Examples

TensorRT-LLM

We added a new TensorRT-LLM example that shows how to deploy both DeepSeek R1 and its distilled version using TensorRT-LLM and dstack.

Llama 4

The Llama example was updated to demonstrate the deployment of Llama 4 Scout using dstack.

Contributing

We continue to make contributing to dstack easier and to improve the dev experience. Since the last release, we moved from pip to uv in the CI and dev pipelines; dependency installation time went from ~70 seconds to under 10 seconds. The Development guide was updated to show how to set up dstack for development with uv. The CI build pipeline triggered on pull requests was optimized from 9 minutes down to 4 minutes.

We also documented uv as one of the recommended installation options for dstack.

Full changelog: 0.19.3...0.19.4

0.19.3

10 Apr 10:35

GCP

A3 Mega

dstack now automatically sets up GCP A3 Mega instances with GPUDirect-TCPXO optimized NCCL communication to take advantage of the maximum network bandwidth of 1800 Gbps. Here are the NCCL test results on an A3 Mega cluster provisioned with dstack:

$ dstack apply -f examples/misc/a3mega-clusters/nccl-tests.dstack.yml

nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0

                                                             out-of-place                       in-place          
      size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
       (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608        131072     float    none      -1    166.6   50.34   47.19    N/A    164.1   51.11   47.92    N/A
    16777216        262144     float    none      -1    204.6   82.01   76.89    N/A    203.8   82.30   77.16    N/A
    33554432        524288     float    none      -1    284.0  118.17  110.78    N/A    281.7  119.12  111.67    N/A
    67108864       1048576     float    none      -1    447.4  150.00  140.62    N/A    443.5  151.31  141.86    N/A
   134217728       2097152     float    none      -1    808.3  166.05  155.67    N/A    801.9  167.38  156.92    N/A
   268435456       4194304     float    none      -1   1522.1  176.36  165.34    N/A   1518.7  176.76  165.71    N/A
   536870912       8388608     float    none      -1   2892.3  185.62  174.02    N/A   2894.4  185.49  173.89    N/A
  1073741824      16777216     float    none      -1   5532.7  194.07  181.94    N/A   5530.7  194.14  182.01    N/A
  2147483648      33554432     float    none      -1    10863  197.69  185.34    N/A    10837  198.17  185.78    N/A
  4294967296      67108864     float    none      -1    21481  199.94  187.45    N/A    21466  200.08  187.58    N/A
  8589934592     134217728     float    none      -1    42713  201.11  188.54    N/A    42701  201.16  188.59    N/A
Out of bounds values : 0 OK
Avg bus bandwidth    : 146.948 

Done

For more information on how to provision and use A3 Mega clusters with GPUDirect-TCPXO, see the A3 Mega example.

DataCrunch

H200 and B200 support

You can now provision H200 and B200 instances on DataCrunch. DataCrunch is the first dstack backend to support B200:

$ dstack apply --gpu B200
 Project              main                                   
 User                 admin                                  
 Configuration        .dstack.yml                            
 Type                 dev-environment                        
 Resources            1..xCPU, 2GB.., 1xB200, 100GB.. (disk) 
 Max price            -                                      
 Max duration         -                                      
 Inactivity duration  -                                      
 Spot policy          auto                                   
 Retry policy         -                                      
 Creation policy      reuse-or-create                        
 Idle duration        5m                                     
 Reservation          -                                      

 #  BACKEND     REGION  INSTANCE   RESOURCES                                      SPOT  PRICE                
 1  datacrunch  FIN-03  1B200.31V  31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk)  yes   $1.3                 
 2  datacrunch  FIN-03  1B200.31V  31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk)  no    $4.49
 3  datacrunch  FIN-01  1B200.31V  31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk)  yes   $1.3   not available 
    ...                                                                                                      
 Shown 3 of 8 offers, $4.49 max

Submit a new run? [y/n]:                        

CUDO

The CUDO backend has been updated to support H100, A100, A40, and all other GPUs currently offered by CUDO.

Fleets

With the new fleets property and the --fleet option of dstack apply, it's now possible to restrict the set of fleets considered for reuse:

type: task

fleets: [my-fleet-1, my-fleet-2]

or

dstack apply --fleet my-fleet-1 --fleet my-fleet-2

Full changelog: 0.19.2...0.19.3

0.19.2

03 Apr 13:38
e23783a

Nebius

This update introduces an integration with Nebius, a cloud provider offering top-tier NVIDIA GPUs at competitive prices.

$ dstack apply
 #  BACKEND  REGION     RESOURCES                        SPOT  PRICE
 1  nebius   eu-north1  8xCPU, 32GB, 1xL40S (48GB)       no    $1.5484
 2  nebius   eu-north1  16xCPU, 200GB, 1xH100 (80GB)     no    $2.95
 3  nebius   eu-north1  16xCPU, 200GB, 1xH200 (141GB)    no    $3.5
 4  nebius   eu-north1  64xCPU, 384GB, 2xL40S (48GB)     no    $4.5688
 5  nebius   eu-north1  128xCPU, 768GB, 4xL40S (48GB)    no    $9.1376
 6  nebius   eu-north1  128xCPU, 1600GB, 8xH100 (80GB)   no    $23.6
 7  nebius   eu-north1  128xCPU, 1600GB, 8xH200 (141GB)  no    $28

The new nebius backend supports CPU and GPU instances, fleets, distributed tasks, and more. Support for network volumes and enhanced inter-node connectivity is coming in future releases. See the docs for instructions on configuring Nebius in your dstack project.

Metrics

This release brings a long-awaited feature — the Metrics page in the UI:

(Screenshot: the Metrics page in the dstack UI)

In addition, the dstack stats command was renamed to dstack metrics and updated: previously, the maximum CPU utilization value depended on the number of CPUs (for example, 400% for a 4-core CPU); now it's normalized to 100%.

$ dstack metrics nccl-tests
 NAME        CPU  MEMORY            GPU
 nccl-tests  81%  2754MB/1638400MB  #0 100740MB/144384MB 100% Util
                                    #1 100740MB/144384MB 100% Util
                                    #2 100740MB/144384MB 99% Util
                                    #3 100740MB/144384MB 99% Util
                                    #4 100740MB/144384MB 99% Util
                                    #5 100740MB/144384MB 99% Util
                                    #6 100740MB/144384MB 99% Util
                                    #7 100740MB/144384MB 100% Util

Full changelog: 0.19.1...0.19.2

0.19.1

26 Mar 12:26
9d0b83f

Metrics

With this update, we've added more metrics that you can export to Prometheus. The new metrics allow tracking job CPU and system memory utilization, user and project usage stats, success/error rate, and more.

Runs

 NAME                               TYPE     DESCRIPTION                    EXAMPLES
 dstack_run_count_total             counter  The total number of runs       537
 dstack_run_count_terminated_total  counter  The number of terminated runs  118
 dstack_run_count_failed_total      counter  The number of failed runs      27
 dstack_run_count_done_total        counter  The number of successful runs  218

Run jobs

 NAME                                 TYPE     DESCRIPTION                                          EXAMPLES
 dstack_job_cpu_count                 gauge    Job CPU count                                        32.0
 dstack_job_cpu_time_seconds_total    counter  Total CPU time consumed by the job, seconds          11.727975
 dstack_job_memory_total_bytes        gauge    Total memory allocated for the job, bytes            4009754624.0
 dstack_job_memory_usage_bytes        gauge    Memory used by the job (including cache), bytes      339017728.0
 dstack_job_memory_working_set_bytes  gauge    Memory used by the job (not including cache), bytes  147251200.0

For more details, see the Metrics documentation.

Major bugfixes

Fixed a bug introduced in 0.19.0 where the working directory in the container was incorrectly set by default to / instead of /workflow.
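
If a run relies on a particular working directory, it can also be set explicitly rather than depending on the default (a minimal sketch using the working_dir property; the path is illustrative):

type: task
name: my-task

# set the working directory explicitly instead of relying on the default
working_dir: /workflow

commands:
  - pwd  # prints /workflow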

Full changelog: 0.19.0...0.19.1

0.19.0

20 Mar 10:19
575776b

Contributing

Simplified backend integration

To provide the best multi-cloud experience and GPU availability, dstack integrates with many cloud GPU providers, including AWS, Azure, GCP, RunPod, Lambda, Vultr, and others. As we'd like to see even more GPU providers supported by dstack, this release comes with a major internal refactoring aimed at simplifying the process of adding new integrations. See the Backend integration guide for more details. Join our Discord if you have any questions about the integration process.

Examples

MPI workloads and NCCL tests

dstack now configures inter-node SSH connectivity for distributed tasks. You can log in to any node from any other node with a simple ssh <node_ip> command. The out-of-the-box SSH connectivity also allows running mpirun. See the NCCL Tests example.
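
A minimal sketch of a distributed task that uses this connectivity (assuming Open MPI is installed in the image; dstack injects the DSTACK_NODE_RANK, DSTACK_NODES_NUM, and DSTACK_NODES_IPS variables):

type: task
name: mpi-hello

nodes: 2

commands:
  # only the first node launches mpirun; the others stay up so MPI can reach them over SSH
  - |
    if [ "$DSTACK_NODE_RANK" = "0" ]; then
      mpirun --allow-run-as-root -np "$DSTACK_NODES_NUM" \
        --host "$(echo "$DSTACK_NODES_IPS" | paste -sd, -)" hostname
    else
      sleep infinity
    fi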

Monitoring

Cost and usage metrics

In addition to DCGM metrics, dstack now exports a set of Prometheus metrics for cost and usage tracking. Here's how they may look in a Grafana dashboard:

(Screenshot: a Grafana dashboard built on dstack cost and usage metrics)

See the documentation for a full list of metrics and labels.

Cursor IDE support

dstack can now launch Cursor dev environments. Just specify ide: cursor in the run configuration:

type: dev-environment
ide: cursor

Deprecations

  • The Python API methods get_plan(), exec_plan(), and submit() are deprecated in favor of get_run_plan(), apply_plan(), and apply_configuration(). The deprecated methods had clumsy signatures with many top-level parameters. The new signatures align better with the CLI and HTTP API.

Breaking changes

The 0.19.0 release drops several previously deprecated or undocumented features. There are no other significant breaking changes. The 0.19.0 server continues to support 0.18.x CLI versions, but the 0.19.0 CLI does not work with older 0.18.x servers, so update the server first, or update the server and clients simultaneously.

  • Drop the dstack run CLI command.
  • Drop the --attach mode for the dstack logs CLI command.
  • Drop Pools functionality:
    • The dstack pool CLI commands.
    • The /api/project/{project_name}/runs/get_offers, /api/project/{project_name}/runs/create_instance, /api/pools/list_instances, and /api/project/{project_name}/pool/* API endpoints.
    • pool_name and instance_name parameters in profiles and run configurations.
  • Remove retry_policy from profiles.
  • Remove termination_idle_time and termination_policy from profiles and fleet configurations.
  • Drop the RUN_NAME and REPO_ID run environment variables.
  • Drop the /api/backends/config_values endpoint used for interactive configuration.
  • The API accepts and returns azure_config["regions"] instead of azure_config["locations"] (unified with server/config.yml).

Full changelog: 0.18.44...0.19.0