Releases: dstackai/dstack
0.19.9
Metrics
Previously, dstack stored and displayed only the metrics collected within the last hour, so once a run or job finished, its metrics eventually disappeared.
Now, dstack keeps the last hour of metrics for all finished runs and jobs.
AMD
On AMD, a wider range of ROCm/AMD SMI versions is now supported. Previously, metrics were not shown properly for certain versions.
CLI
Container exit status
The CLI now displays the container exit status of each failed run or job. This information can be seen via dstack ps if you pass -v.
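For example:

$ dstack ps -v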
Server
Robust handling of networking issues
It sometimes happens that the dstack server cannot establish connections to running instances due to networking problems, or because instances become temporarily unreachable. Previously, dstack failed jobs very quickly in such cases. Now, the server waits for a graceful timeout of 2 minutes before considering a job failed when its instance is unreachable.
Environment variables
Two new environment variables are now available within runs:
- DSTACK_RUN_ID stores the UUID of the run. Unlike DSTACK_RUN_NAME, it's unique for every run.
- DSTACK_JOB_ID stores the UUID of the job submission. It's unique for every replica, job, and retry attempt.
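For example, a minimal task (its name is illustrative) can print both variables:

type: task
name: print-ids
commands:
  - echo "Run ID: $DSTACK_RUN_ID"
  - echo "Job ID: $DSTACK_JOB_ID"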
What's changed
- [UX] Set the default gpu count to 1.. by @r4victor in #2624
- [UX] Introduce JOB_DISCONNECTED_RETRY_TIMEOUT by @r4victor in #2627
- [Feature] Pull and store process exit status from jobs by @un-def in #2615
- [Internal] Add dstackai/amd-smi image by @un-def in #2611
- [Runner] Improve GPU metrics collector by @un-def in #2612
- [Feature] Set DSTACK_RUN_ID and DSTACK_JOB_ID by @r4victor in #2622
- [Internal] Drop override message when overriding finished runs by @r4victor in #2623
- [Nebius] Support InfiniBand fabric for us-central1 by @jvstme in #2629
- [Feature] Keep the last metrics for finished jobs by @un-def in #2628
- [Nebius] Update Nebius default project detection by @jvstme in #2633
- [CUDO] Update the VM image by @r4victor in #2636
- [Docs]: Nebius InfiniBand clusters by @jvstme in #2634
- [Examples] Added examples/rccl-tests by @Bihan in #2613
- [Docs] Extracted Distributed training examples by @peterschmidt85 in #2614
- [Docs] fix YAML indent on trl example by @aaroniscode in #2617
- [Docs] Add example of including plugins into the dstack-server Docker image by @r4victor in #2620
- [Tests] Fix python-test by @peterschmidt85 in #2619
New contributors
- @aaroniscode made their first contribution in #2617
Full Changelog: 0.19.8...0.19.9
0.19.8
Nebius
InfiniBand clusters
The nebius backend now supports InfiniBand clusters. A cluster is automatically created when you apply a fleet configuration with placement: cluster and supported GPUs, e.g., 8xH100 or 8xH200:
type: fleet
name: my-fleet
nodes: 2
placement: cluster
resources:
  gpu: H100,H200:8

A suitable InfiniBand fabric for the cluster is selected automatically. You can also limit the allowed fabrics in the backend settings.
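As a sketch of such a restriction in server/config.yml, assuming the backend setting is named fabrics as in the backend reference (the fabric names are placeholders):

projects:
- name: main
  backends:
  - type: nebius
    creds:
      type: service_account
      # credentials omitted; see the Nebius backend reference
    fabrics: [fabric-2, fabric-4]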
Once the cluster is provisioned, you can benefit from its high-speed networking when running distributed tasks, such as NCCL tests or Hugging Face TRL.
ARM
dstack now supports compute instances with ARM CPUs. To request ARM CPUs in a run or fleet configuration, specify the arm architecture in the resources.cpu property:
resources:
  cpu: arm:4.. # 4 or more ARM cores

If the hosts in an SSH fleet have ARM CPUs, dstack will automatically detect them and enable their use.
To see available offers with ARM CPUs, pass --cpu arm to the dstack offer command.
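For example:

$ dstack offer --cpu arm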
Lambda
GH200
With the lambda backend, it's now possible to use GH200 instances that come with an ARM-based 72-core NVIDIA Grace CPU and an NVIDIA H200 Tensor Core GPU, connected with a high-bandwidth, memory-coherent NVIDIA NVLink-C2C interconnect.
type: dev-environment
name: my-env
ide: vscode
resources:
  gpu: GH200:1

If Lambda has GH200 on-demand instances available at the time, you'll see them when you run dstack apply:
$ dstack apply -f .dstack.yml
# BACKEND RESOURCES INSTANCE TYPE PRICE
 1  lambda (us-east-3)  cpu=arm:64 mem=464GB disk=4399GB  GH200:96GB:1  gpu_1x_gh200  $1.49

Note: if no GH200 is available at the moment, you can specify the retry policy in your run configuration so that dstack runs the configuration once the GPU becomes available.
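A minimal sketch with retry enabled (the on_events values follow the run configuration reference):

type: dev-environment
name: my-env
ide: vscode
resources:
  gpu: GH200:1
retry:
  # resubmit while there is no capacity, for up to one day
  on_events: [no-capacity]
  duration: 1d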
Azure
Managed identities
The new vm_managed_identity backend setting allows you to configure the managed identity that is assigned to VMs created in the azure backend.
projects:
- name: main
  backends:
  - type: azure
    subscription_id: 06c82ce3-28ff-4285-a146-c5e981a9d808
    tenant_id: f84a7584-88e4-4fd2-8e97-623f0a715ee1
    creds:
      type: default
    vm_managed_identity: dstack-rg/my-managed-identity

Make sure that dstack has the required permissions for managed identities to work.
What's changed
- Fix: handle OSError from os.get_terminal_size() in CLI table rendering for non-TTY environments by @vuyelwadr in #2599
- Clarify how retry works for tasks and services by @r4victor in #2600
- [Docs] Added Tenstorrent example by @peterschmidt85 in #2596
- Lambda: Docker: use cgroupfs driver by @un-def in #2603
- Don't collect Prometheus metrics on container-based backends by @un-def in #2605
- Support Nebius InfiniBand clusters by @jvstme in #2604
- Add ARM64 support by @un-def in #2595
- Allow to configure Nebius InfiniBand fabrics by @jvstme in #2607
- Support vm_managed_identity for Azure by @r4victor in #2608
- Fix API quota hitting when provisioning many A3 instances by @r4victor in #2610
New contributors
- @vuyelwadr made their first contribution in #2599
Full changelog: 0.19.7...0.19.8
0.19.7
0.19.6
Plugins
Run configurations have many options. While dstack aims to simplify them and provide sensible defaults, teams may sometimes want to enforce their own defaults and configurations across projects.
To support this, we're introducing a plugin system that allows such enforcements to be defined programmatically. You can now define a plugin using dstack's Python SDK and bundle it with the dstack server.
For example, you can create your own plugin to override run configuration options—e.g., to prepend commands, set policies, and more.
For more information on plugin development, see the documentation and example.
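As a rough sketch, a bundled plugin might then be enabled in server/config.yml (the plugin name is illustrative; see the plugins documentation for the exact setup):

plugins:
  - my_dstack_plugin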
Note
Plugins are currently an experimental feature. Backward compatibility is not guaranteed between releases.
Tenstorrent
The new update introduces initial support for Tenstorrent's Wormhole accelerators.
Now, if you create SSH fleets with hosts that have N150 or N300 PCIe boards, dstack will automatically detect them and allow you to use such a fleet for running dev environments, tasks, and services.
Dedicated examples for using dstack with Tenstorrent's accelerators will be published soon.
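For reference, a minimal sketch of an SSH fleet configuration (the hosts, user, and key path are placeholders):

type: fleet
name: tt-fleet
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.100.10
    - 192.168.100.11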
Warning
Ensure you update to 0.19.7, which includes a critical hot-fix for GCP.
What's changed
- Fix client backward compatibility when reapplying runs by @r4victor in #2558
- Add A3 High example by @r4victor in #2559
- Document GCP firewall allowing inter-VPC traffic by @r4victor in #2563
- [CI] Build dstack-{shim,runner} for ARM64 by @un-def in #2561
- Implement /api/project/{project_name}/fleets/apply by @r4victor in #2577
- Introduce effective_spec for runs and fleets by @r4victor in #2579
- Support Nebius tenancies with multiple projects by @jvstme in #2575
- [UX] Shorter resource syntax for dstack apply, dstack offer, and dstack ps by @peterschmidt85 in #2572
- Fix missing /fleets/apply for old servers by @r4victor in #2582
- Updated runner and shim contributing guide by @peterschmidt85 in #2534
- Mount volumes at /mnt/disks by @r4victor in #2584
- Use gVNIC for GCP A3 VMs by @r4victor in #2585
- [Bug] Several issues with vastai provider #142 #2566 by @peterschmidt85 in #2567
- [Feature] Support Tenstorrent's Wormhole accelerators #2573 by @peterschmidt85 in #2574
- Implement plugins by @r4victor in #2581
- [Feature] Support Tenstorrent's Wormhole accelerators #2573 by @peterschmidt85 in #2589
Full changelog: 0.19.5...0.19.6
0.19.5
CLI
Offers
You can now list available offers (hardware configurations) from the configured backends using the CLI—without needing to define a run configuration. Just run dstack offer and specify the resource requirements. The CLI will output available offers, including backend, region, instance type, resources, spot availability, and pricing:
$ dstack offer --gpu H100:1.. --max-offers 10
# BACKEND REGION INSTANCE TYPE RESOURCES SPOT PRICE
1 datacrunch FIN-01 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
2 datacrunch FIN-02 1H100.80S.30V 30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
3 datacrunch FIN-02 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
4 datacrunch ICE-01 1H100.80S.32V 32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk) no $2.19
5 runpod US-KS-2 NVIDIA H100 PCIe 16xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.39
6 runpod CA NVIDIA H100 80GB HBM3 24xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.69
7 nebius eu-north1 gpu-h100-sxm 16xCPU, 200GB, 1xH100 (80GB), 100.0GB (disk) no $2.95
8 runpod AP-JP-1 NVIDIA H100 80GB HBM3 20xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.99
9 runpod CA-MTL-1 NVIDIA H100 80GB HBM3 28xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk) no $2.99
10 runpod CA-MTL-2 NVIDIA H100 80GB HBM3 26xCPU, 125GB, 1xH100 (80GB), 100.0GB (disk) no $2.99
...
 Shown 10 of 99 offers, $127.816 max

Learn more about how the new CLI works in the reference.
Configuration
Resource tags
It's now possible to set custom resource-level tags using the new tags property:
type: dev-environment
ide: vscode

tags:
  my_custom_tag: some_value
  another_tag: another_value_123

The tags property is supported by all configuration types: runs, fleets, volumes, gateways, and profiles. Tags are propagated to the underlying cloud resources on backends that support them: currently AWS, Azure, and GCP.
Shell configuration
With the new shell property you can specify the shell used to run commands (or init for dev environments):
type: task
image: ubuntu
shell: bash
commands:
  # now we can use Bash features, e.g., arrays:
  - words=(dstack is)
  - words+=(awesome)
  - echo ${words[@]} # prints "dstack is awesome"

GCP
A3 High and A3 Edge
dstack now automatically sets up GCP A3 High and A3 Edge instances with GPUDirect-TCPX optimized NCCL communication.
An example of how to provision an A3 High cluster and run NCCL tests on it using dstack is coming soon!
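In the meantime, here is a minimal sketch of a fleet targeting such instances (the name is illustrative; A3 High nodes come with 8xH100):

type: fleet
name: a3-high-fleet
nodes: 2
placement: cluster
backends: [gcp]
resources:
  gpu: H100:8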
Volumes
Total cost
The UI now shows the total cost and termination date of volumes alongside the volume price. Previously, only the price information was available.
What's changed
- Update Axolotl Examples by @Bihan in #2502
- Update TGI Example with Llama 4 Scout by @Bihan in #2529
- Implement custom per-resource tags by @r4victor in #2533
- Add try_advisory_lock_ctx by @r4victor in #2537
- [chore]: Drop is_core_model_instance by @jvstme in #2536
- [runner] Rework env variables exporting by @un-def in #2535
- Fix ruff version discrepancy by @jvstme in #2539
- Add volume cost by @r4victor in #2541
- Add shell run property by @un-def in #2542
- [Feature]: Support dstack offer #2142 by @peterschmidt85 in #2540
- Update Llama4 Readme with Axolotl fine-tuning example by @Bihan in #2545
- [Docs] Document dstack offer by @peterschmidt85 in #2546
- [Docs]: Replace vRAM -> VRAM by @jvstme in #2548
- Include statics as artifacts in both wheel and sdist by @r4victor in #2544
- Support A3 High/Edge GCP clusters with GPUDirect-TCPX by @r4victor in #2549
Full changelog: 0.19.4...0.19.5
0.19.4
Services
Rate limits
You can now configure rate limits for your services running behind a gateway.
type: service
image: my-app:latest
port: 80

rate_limits:
  # For /api/auth/* - 1 request per second, no bursts
  - prefix: /api/auth/
    rps: 1
  # For other URLs - 4 requests per second + bursts of up to 9 requests
  - rps: 4
    burst: 9

Examples
TensorRT-LLM
We added a new example on TensorRT-LLM that shows how to deploy both DeepSeek R1 and its distilled version using TensorRT-LLM and dstack.
Llama 4
The Llama example was updated to demonstrate the deployment of Llama 4 Scout using dstack.
Contributing
We continue to make contributing to dstack easier and to improve the dev experience. Since the last release, we moved from pip to uv in CI and dev pipelines. Dependency installation time went from ~70 seconds to less than 10 seconds. The Development guide was updated to show how to set up a dstack development environment with uv. The CI build pipeline triggered on pull requests was optimized from 9 minutes down to 4 minutes.
We also documented uv as one of the recommended installation options for dstack.
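For example (assuming the dstack[all] extra used with pip installs):

$ uv tool install "dstack[all]"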
What's changed
- [Landing] Refactoring (WIP) by @peterschmidt85 in #2495
- Fix CloudWatchLogStorage with sparse logs by @un-def in #2501
- Migrate to uv by @colinjc in #2455
- Fix docs build with uv by @r4victor in #2504
- [Example] Update Llama 4 Examples by @Bihan in #2508
- Move to uv in dstack-server Docker image by @r4victor in #2509
- Fix dstack dependency for gateway by @r4victor in #2511
- [Docs] Add uv to Installation; Minor improvements by @peterschmidt85 in #2510
- Validate usernames by @r4victor in #2514
- Run pytest in parallel with pytest-xdist by @r4victor in #2515
- Add Llama4 AMD example by @Bihan in #2513
- Use exponentially increasing retry delays for pending runs by @r4victor in #2519
- Speed up frontend CI by @r4victor in #2520
- Service rate limits by @jvstme in #2517
- Set no-guess-dev for dev package versions by @r4victor in #2521
- Detect dstack version from file instead of git by @r4victor in #2524
- Add TensorRT-LLM Example by @Bihan in #2444
- Fix Nginx upstream name conflicts by @jvstme in #2526
- Fix detaching from dstack attach by @jvstme in #2528
Full changelog: 0.19.3...0.19.4
0.19.3
GCP
A3 Mega
dstack now automatically sets up GCP A3 Mega instances with GPUDirect-TCPXO optimized NCCL communication to take advantage of the 1800Gbps maximum network bandwidth. Here are NCCL test results on an A3 Mega cluster provisioned with dstack:
✗ dstack apply -f examples/misc/a3mega-clusters/nccl-tests.dstack.yml
nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0
out-of-place in-place
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8388608 131072 float none -1 166.6 50.34 47.19 N/A 164.1 51.11 47.92 N/A
16777216 262144 float none -1 204.6 82.01 76.89 N/A 203.8 82.30 77.16 N/A
33554432 524288 float none -1 284.0 118.17 110.78 N/A 281.7 119.12 111.67 N/A
67108864 1048576 float none -1 447.4 150.00 140.62 N/A 443.5 151.31 141.86 N/A
134217728 2097152 float none -1 808.3 166.05 155.67 N/A 801.9 167.38 156.92 N/A
268435456 4194304 float none -1 1522.1 176.36 165.34 N/A 1518.7 176.76 165.71 N/A
536870912 8388608 float none -1 2892.3 185.62 174.02 N/A 2894.4 185.49 173.89 N/A
1073741824 16777216 float none -1 5532.7 194.07 181.94 N/A 5530.7 194.14 182.01 N/A
2147483648 33554432 float none -1 10863 197.69 185.34 N/A 10837 198.17 185.78 N/A
4294967296 67108864 float none -1 21481 199.94 187.45 N/A 21466 200.08 187.58 N/A
8589934592 134217728 float none -1 42713 201.11 188.54 N/A 42701 201.16 188.59 N/A
Out of bounds values : 0 OK
Avg bus bandwidth : 146.948
Done

For more information on how to provision and use A3 Mega clusters with GPUDirect-TCPXO, see the A3 Mega example.
DataCrunch
H200 and B200 support
You can now provision H200 and B200 instances on DataCrunch. DataCrunch is the first dstack backend to support B200:
✗ dstack apply --gpu B200
Project main
User admin
Configuration .dstack.yml
Type dev-environment
Resources 1..xCPU, 2GB.., 1xB200, 100GB.. (disk)
Max price -
Max duration -
Inactivity duration -
Spot policy auto
Retry policy -
Creation policy reuse-or-create
Idle duration 5m
Reservation -
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 datacrunch FIN-03 1B200.31V 31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk) yes $1.3
2 datacrunch FIN-03 1B200.31V 31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk) no $4.49
3 datacrunch FIN-01 1B200.31V 31xCPU, 250GB, 1xB200 (180GB), 100.0GB (disk) yes $1.3 not available
...
Shown 3 of 8 offers, $4.49 max
 Submit a new run? [y/n]:

CUDO
The CUDO backend has been updated to support H100, A100, A40, and all other GPUs currently offered by CUDO.
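For example, to browse CUDO offers (assuming the --backend filter of dstack offer):

$ dstack offer --backend cudo --gpu A100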
Fleets
With the new fleets property and the --fleet option of dstack apply, it's now possible to restrict the set of fleets considered for reuse:
type: task
fleets: [my-fleet-1, my-fleet-2]

or

dstack apply --fleet my-fleet-1 --fleet my-fleet-2

What's changed
- [Blog] Built-in UI for monitoring basic GPU metrics by @peterschmidt85 in #2470
- Fix Nebius project discovery by @jvstme in #2473
- Support A3 Mega GCP clusters with GPUDirect-TCPXO by @r4victor in #2469
- Fix Nebius private networks with non-default CIDR by @jvstme in #2475
- Add region for Lambda by @HSaddiq in #2471
- Fix relative date in CLI for weeks and months by @jvstme in #2481
- Fix terminating TensorDock instances by @jvstme in #2480
- Use all Lambda regions by default by @jvstme in #2478
- Allow mounting volumes into /workflow by @r4victor in #2483
- Improve Datacrunch backend by @r4victor in #2487
- UI improvements by @olgenn in #2489
- Add fleets property to run configurations and CLI by @un-def in #2488
- Fix GitIgnore by @un-def in #2491
- Remove hardcoded cudo regions by @r4victor in #2493
- Optimize GCP list usable subnets across regions by @r4victor in #2494
- Make regions filtering case insensitive by @r4victor in #2499
Full changelog: 0.19.2...0.19.3
0.19.2
Nebius
This update introduces an integration with Nebius, a cloud provider offering top-tier NVIDIA GPUs at competitive prices.
$ dstack apply
# BACKEND REGION RESOURCES SPOT PRICE
1 nebius eu-north1 8xCPU, 32GB, 1xL40S (48GB) no $1.5484
2 nebius eu-north1 16xCPU, 200GB, 1xH100 (80GB) no $2.95
3 nebius eu-north1 16xCPU, 200GB, 1xH200 (141GB) no $3.5
4 nebius eu-north1 64xCPU, 384GB, 2xL40S (48GB) no $4.5688
5 nebius eu-north1 128xCPU, 768GB, 4xL40S (48GB) no $9.1376
6 nebius eu-north1 128xCPU, 1600GB, 8xH100 (80GB) no $23.6
 7 nebius eu-north1 128xCPU, 1600GB, 8xH200 (141GB) no $28

The new nebius backend supports CPU and GPU instances, fleets, distributed tasks, and more. Support for network volumes and enhanced inter-node connectivity is coming in future releases. See the docs for instructions on configuring Nebius in your dstack project.
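A minimal sketch of a backend entry in server/config.yml; the credential field names are an assumption based on the backend reference, and the values are placeholders:

projects:
- name: main
  backends:
  - type: nebius
    creds:
      type: service_account
      service_account_id: <service account ID>
      public_key_id: <public key ID>
      private_key_file: <path to the private key>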
Metrics
This release brings a long-awaited feature — the Metrics page in the UI.
In addition, the dstack stats command was renamed to dstack metrics and updated: previously, the max value of CPU utilization depended on the number of CPUs (for example, it was 400% for a 4-core CPU); now it's normalized to 100%.
$ dstack metrics nccl-tests
NAME CPU MEMORY GPU
nccl-tests 81% 2754MB/1638400MB #0 100740MB/144384MB 100% Util
#1 100740MB/144384MB 100% Util
#2 100740MB/144384MB 99% Util
#3 100740MB/144384MB 99% Util
#4 100740MB/144384MB 99% Util
#5 100740MB/144384MB 99% Util
#6 100740MB/144384MB 99% Util
#7 100740MB/144384MB 100% Util
What's Changed
- [Docs] Update the home page to include an updated diagram by @peterschmidt85 in #2450
- Update ChatCompletionsChunk for Deepseek-R1 response by @Bihan in #2452
- [Blog] Minor blog refactoring by @peterschmidt85 in #2457
- [Blog] Accessing dev environments with Cursor by @peterschmidt85 in #2456
- Move cachetools to base deps by @r4victor in #2459
- [Blog] Prometheus by @peterschmidt85 in #2458
- Add optional bearer auth to metrics endpoint by @un-def in #2460
- [CLI] Rename stats command to metrics by @un-def in #2462
- Add SgLang Example by @Bihan in #2461
- Update NIM example with DeepSeek-R1-Distill by @Bihan in #2454
- [Blog] Supporting MPI and NCCL/RCCL tests by @peterschmidt85 in #2465
- Add Nebius backend by @jvstme in #2463
- [Feature] Show Run metrics on the UI by @olgenn in #2446
- [CLI] Divide CPU util by the number of vCPUs in dstack metrics by @un-def in #2466
Full Changelog: 0.19.1...0.19.2
0.19.1
Metrics
With this update, we've added more metrics that you can export to Prometheus. The new metrics allow tracking job CPU and system memory utilization, user and project usage stats, success/error rate, and more.
Runs
| Name | Type | Description | Examples |
|---|---|---|---|
| dstack_run_count_total | counter | The total number of runs | 537 |
| dstack_run_count_terminated_total | counter | The number of terminated runs | 118 |
| dstack_run_count_failed_total | counter | The number of failed runs | 27 |
| dstack_run_count_done_total | counter | The number of successful runs | 218 |
Run jobs
| Name | Type | Description | Examples |
|---|---|---|---|
| dstack_job_cpu_count | gauge | Job CPU count | 32.0 |
| dstack_job_cpu_time_seconds_total | counter | Total CPU time consumed by the job, seconds | 11.727975 |
| dstack_job_memory_total_bytes | gauge | Total memory allocated for the job, bytes | 4009754624.0 |
| dstack_job_memory_usage_bytes | gauge | Memory used by the job (including cache), bytes | 339017728.0 |
| dstack_job_memory_working_set_bytes | gauge | Memory used by the job (not including cache), bytes | 147251200.0 |
For more details on metrics, check the Metrics documentation.
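To pull these metrics into Prometheus, a minimal scrape config sketch may look as follows; the server address is a placeholder, the /metrics path follows the metrics documentation, and the authorization block applies only if the optional bearer auth (see #2460) is enabled:

scrape_configs:
  - job_name: dstack
    metrics_path: /metrics
    static_configs:
      - targets: ["<dstack server address>:3000"]
    authorization:
      type: Bearer
      credentials: <token>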
Major bugfixes
Fixed a bug introduced in 0.19.0 where the working directory in the container was incorrectly set by default to / instead of /workflow.
What's changed
- Fix trying fleet instance offers by @jvstme in #2443
- Add job system metrics, run metrics by @un-def in #2445
- Fix default working dir in containers by @jvstme in #2449
- [Examples] Update nccl-tests by @un-def in #2451
Full changelog: 0.19.0...0.19.1
0.19.0
Contributing
Simplified backend integration
To provide the best multi-cloud experience and GPU availability, dstack integrates with many cloud GPU providers, including AWS, Azure, GCP, RunPod, Lambda, Vultr, and others. As we'd like to see even more GPU providers supported by dstack, this release comes with a major internal refactoring aimed at simplifying the process of adding new integrations. See the Backend integration guide for more details. Join our Discord if you have any questions about the integration process.
Examples
MPI workloads and NCCL tests
dstack now configures internode SSH connectivity for distributed tasks. You can log in to any node from any node via SSH with a simple ssh <node_ip> command. The out-of-the-box SSH connectivity also allows running mpirun. See the NCCL Tests example.
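For instance, a minimal distributed-task sketch, assuming the DSTACK_NODE_RANK and DSTACK_NODES_IPS environment variables that dstack sets for distributed tasks:

type: task
name: ssh-smoke-test
nodes: 2
commands:
  - |
    if [ "$DSTACK_NODE_RANK" -eq 0 ]; then
      # from the first node, SSH into each node and print its hostname
      for ip in $DSTACK_NODES_IPS; do
        ssh "$ip" hostname
      done
    fi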
Monitoring
Cost and usage metrics
In addition to DCGM metrics, dstack now exports a set of Prometheus metrics for cost and usage tracking. Here's how it may look in a Grafana dashboard.
See the documentation for a full list of metrics and labels.
Cursor IDE support
dstack can now launch Cursor dev environments. Just specify ide: cursor in the run configuration:
type: dev-environment
ide: cursor

Deprecations
- The Python API methods get_plan(), exec_plan(), and submit() are deprecated in favor of get_run_plan(), apply_plan(), and apply_configuration(). The deprecated methods had clumsy signatures with many top-level parameters. The new signatures align better with the CLI and HTTP API.
Breaking changes
The 0.19.0 release drops several previously deprecated or undocumented features. There are no other significant breaking changes. The 0.19.0 server continues to support 0.18.x CLI versions, but the 0.19.0 CLI does not work with older 0.18.x servers, so you should update the server first, or update the server and the clients simultaneously.
- Drop the dstack run CLI command.
- Drop the --attach mode for the dstack logs CLI command.
- Drop Pools functionality:
  - The dstack pool CLI commands.
  - The /api/project/{project_name}/runs/get_offers, /api/project/{project_name}/runs/create_instance, /api/pools/list_instances, and /api/project/{project_name}/pool/* API endpoints.
  - The pool_name and instance_name parameters in profiles and run configurations.
- Remove retry_policy from profiles.
- Remove termination_idle_time and termination_policy from profiles and fleet configurations.
- Drop the RUN_NAME and REPO_ID run environment variables.
- Drop the /api/backends/config_values endpoint used for interactive configuration.
- The API accepts and returns azure_config["regions"] instead of azure_config["locations"] (unified with server/config.yml).
What's Changed
- Fix gateways with a previously used IP address by @jvstme in #2388
- Simplify backend configurators and models by @r4victor in #2389
- Store BackendType as string instead of enum in the DB by @r4victor in #2393
- Introduce ComputeWith classes to detect compute features by @r4victor in #2392
- Move backend/compute configs from config.py to models.py by @r4victor in #2395
- Provide default run_job implementation for VM backends by @r4victor in #2396
- Configure inter-node SSH on multi-node tasks by @un-def in #2394
- [Blog] Using SSH fleets with TensorWave's private AMD cloud by @peterschmidt85 in #2391
- Add script to generate boilerplate code for new backend by @r4victor in #2397
- Add datacenter-gpu-manager-4-proprietary to CUDA images by @un-def in #2399
- Drop pools by @r4victor in #2401
- Transition high-level Python runs API to new methods by @r4victor in #2403
- Drop dstack run by @r4victor in #2404
- Drop dstack logs --attach by @r4victor in #2405
- Remove retry_policy from profiles by @r4victor in #2406
- Remove termination_idle_time and termination_policy by @r4victor in #2407
- Clean up models backward compatibility code by @r4victor in #2408
- Restore removed models fields for compatibility with 0.18 clients by @r4victor in #2414
- Clean up legacy repo fields by @jvstme in #2411
- Switch AWS gateways from t2.micro to t3.micro by @r4victor in #2416
- Remove old client excludes by @r4victor in #2417
- Use new JobTerminationReason values by @r4victor in #2418
- Drop RUN_NAME and REPO_ID env vars by @r4victor in #2419
- Drop irrelevant Nebius backend implementation by @jvstme in #2421
- [Feature]: Support the cursor IDE #2412 by @peterschmidt85 in #2413
- Simplify implementation of new backends #2372 by @olgenn in #2423
- Support multiple domains with Entra login by @r4victor in #2424
- Support setting project members by email by @r4victor in #2429
- Fix json schema reference and invalid properties errors by @r4victor in #2433
- [Blog]: DeepSeek R1 inference performance: MI300X vs. H200 by @peterschmidt85 in #2425
- Add new metrics by @un-def in #2434
- Add instance and job cost/usage Prometheus metrics by @un-def in #2432
- [Docker] Add dstackai/efa image by @un-def in #2422
- Restore fleet termination_policy for 0.18 backward compatibility by @r4victor in #2436
- [Bug]: Search over users doesn't work by @olgenn in #2439
- [Feature]: Support activating/deactivating users via the UI by @olgenn in #2440
- [Feature]: Display Assigned Gateway Information on Run Pages by @olgenn in #2438
- [Docs]: Update the Metrics guide by @peterschmidt85 in #2441
- [Examples] Update nccl-tests by @un-def in #2415
Full Changelog: 0.18.44...0.19.0


