Tags: dstackai/dstack

0.20.17

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Disable go-integration-tests for release (#3791)

0.20.16

Verda: make startup script and SSH key lifecycle per-instance with reliable cleanup (#3718)

* Make Verda startup scripts and SSH keys lifecycle symmetric

* Fix Verda test imports for Python 3.9 collection

* Update src/dstack/_internal/core/backends/verda/compute.py

Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>

* Update src/dstack/_internal/core/backends/verda/compute.py

Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>

* Update src/dstack/_internal/core/backends/verda/compute.py

Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>

* Update src/dstack/_internal/core/backends/verda/compute.py

Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>

* Fix Verda terminate tests for merge-base API args

---------

Co-authored-by: Andrey Cheptsov <andrey.cheptsov@github.com>
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>

0.20.16rc2

Tests: bump pytest-asyncio>=0.25.2 (#3733)

Fixes "coroutine method 'aclose' of <async_generator> was never awaited"
warnings in pytest logs

See: pytest-dev/pytest-asyncio#759
See: pytest-dev/pytest-asyncio#1034

0.20.16rc1

Fix SELinux denials and "Text file busy" on SSH fleet provisioning (#3712)

The shim binary download uses cp to copy from /tmp to /usr/local/bin/.
This causes two issues:

1. "Text file busy" (ETXTBSY) when re-provisioning without cleanup,
   because cp tries to write to a running executable. Revert to mv
   which atomically replaces the directory entry.

2. On SELinux-enforcing hosts (RHEL, Rocky), mv from /tmp preserves
   the user_tmp_t context. Add chcon to set the correct bin_t context.
   No-op on non-SELinux systems via 2>/dev/null || true.
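
The two fixes above can be illustrated with a minimal Python sketch. It is not the provisioning script from the commit (which uses shell `mv` and `chcon`); `install_binary` is a hypothetical helper, and it assumes source and destination are on the same filesystem so that the rename is atomic.

```python
import os
import subprocess
import tempfile


def install_binary(src: str, dst: str) -> None:
    """Hypothetical helper: rename src over dst, then try to relabel it."""
    # A rename swaps the directory entry instead of writing into the
    # existing file, so a running copy of the old binary is left untouched
    # (an in-place copy is what triggers ETXTBSY).
    # Assumption: src and dst live on the same filesystem.
    os.replace(src, dst)
    try:
        # Best effort: set the bin_t type; failures are ignored so this is
        # effectively a no-op on hosts without SELinux.
        subprocess.run(
            ["chcon", "-t", "bin_t", dst],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        pass  # chcon is missing or cannot be executed


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    new_path = os.path.join(workdir, "shim.new")
    final_path = os.path.join(workdir, "shim")
    with open(final_path, "w") as f:
        f.write("old build")
    with open(new_path, "w") as f:
        f.write("new build")
    install_binary(new_path, final_path)
    with open(final_path) as f:
        print(f.read())
```

Unlike `mv`, `os.replace` does not fall back to copy-and-delete across filesystems, which is why the sketch states the same-filesystem assumption.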

Co-authored-by: Andrey Cheptsov <andrey.cheptsov@github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

0.20.15

Respect top-level `blocks` in SSH fleet configuration (#3700)

Fixes: #3278

0.20.14

[Azure] Add support for H100 NVL and H200 VM series; refactor instance creation methods to cleanup failed instances (#3699)

Co-authored-by: Andrey Cheptsov <andrey.cheptsov@github.com>

0.20.13

Fix CLI compatibility with older servers (#3664)

0.20.12

Add Crusoe Cloud backend (#3602)

* Add Crusoe Cloud backend

Add a VM-based Crusoe Cloud backend supporting single-node and
multi-node (cluster) provisioning with InfiniBand.

Key features:
- gpuhunt online provider for offers with project quota filtering
- HMAC-SHA256 authenticated REST API client
- Image selection based on GPU type (SXM/PCIe/ROCm/CPU)
- Storage: persistent data disk for types without ephemeral NVMe;
  auto-detects and RAID-0s NVMe for types with ephemeral storage;
  moves containerd storage so containers get the full disk space
- Cluster support via IB partitions
- Two-phase termination with data disk cleanup
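
As a generic illustration of HMAC-SHA256 request signing, here is a minimal Python sketch. The commit message does not describe Crusoe's actual signing scheme, so the canonical string layout, the function names, and the base64 encoding below are all assumptions made for the example.

```python
import base64
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, timestamp: str) -> str:
    """Return a base64-encoded HMAC-SHA256 signature (hypothetical layout)."""
    # Assumption: the signed message is method, path, and timestamp joined
    # by newlines. A real API defines its own canonical form.
    message = "\n".join([method.upper(), path, timestamp]).encode()
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    return base64.b64encode(digest).decode()


def verify_request(
    secret: bytes, method: str, path: str, timestamp: str, signature: str
) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_request(secret, method, path, timestamp)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    # All values here are made up for the demonstration.
    sig = sign_request(b"demo-secret", "get", "/example/path", "2024-01-01T00:00:00Z")
    print(verify_request(b"demo-secret", "GET", "/example/path", "2024-01-01T00:00:00Z", sig))
```

Using `hmac.compare_digest` rather than `==` avoids leaking timing information when a server checks the signature.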

Tested end-to-end:
- L40S: fleet, dev env, GPU, configurable disk (200GB), clean termination
- A100-PCIe: fleet, dev env, GPU, NVMe auto-mount (880GB), clean termination
- A100-SXM-IB cluster: IB partition created, 1 node provisioned with IB
  and 8x NVMe RAID-0 (7TB); 2nd node failed on capacity (out_of_stock)
- Offers: quota enforcement, disk sizes correct per instance type

Not tested (no capacity/quota):
- H100-SXM-IB, MI300X-IB, MI355X-RoCE (no hardware available)
- CPU-only instances c1a/s1a (no quota)
- Spot provisioning (disabled in gpuhunt, see TODO)
- Full 2-node cluster with IB connectivity test

TODOs:
- Spot: disabled until Crusoe confirms how to request spot billing
  via the VM create API endpoint
- gpuhunt dependency: currently installed from PR branch; switch to
  pinned version after gpuhunt PR #211 is merged and released

AI Assistance: This implementation was developed with AI assistance.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fetch Crusoe locations dynamically instead of hardcoding

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix VM image selection for SXM instance types

The _get_image function checked gpu_type (e.g. 'A100') for 'SXM', but
gpuhunt normalizes GPU names and strips the SXM qualifier. Check the
instance type name instead (e.g. 'a100-80gb-sxm-ib.8x') which
preserves the '-sxm' indicator.

Without this fix, SXM-IB instances used the PCIe docker image which
lacks IB drivers, HPC-X, and NCCL topology files. Verified with a
2-node A100-SXM-IB NCCL all_reduce test: 193 GB/s bus bandwidth.
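
The idea behind this fix can be sketched as follows. This is not the actual `_get_image` function: `select_image_kind`, the `gpu_vendor` parameter, and the returned labels are hypothetical, and only the instance type name `a100-80gb-sxm-ib.8x` comes from the commit message.

```python
from typing import Optional


def select_image_kind(instance_type_name: str, gpu_vendor: Optional[str]) -> str:
    """Pick an image family from the instance type name (hypothetical helper)."""
    if gpu_vendor is None:
        return "cpu"
    if gpu_vendor.lower() == "amd":
        return "rocm"
    # The normalized GPU name (e.g. 'A100') no longer carries the SXM
    # qualifier, so look for '-sxm' in the instance type name instead.
    if "-sxm" in instance_type_name.lower():
        return "sxm"
    return "pcie"


if __name__ == "__main__":
    print(select_image_kind("a100-80gb-sxm-ib.8x", "nvidia"))
```

Keying the decision on the instance type name keeps it independent of how GPU names are normalized upstream.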

Made-with: Cursor

* Switch gpuhunt dependency from PR branch to main

Made-with: Cursor

* Add TODOs to pin gpuhunt and remove allow-direct-references before merging

Made-with: Cursor

* Pin gpuhunt==0.1.17 (matches master)

Made-with: Cursor

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

0.20.11

Fix concurrent indexes migration (#3591)

0.20.10

[runner] Check if repo dir exists before chown (#3589)

The check is added to avoid the following log message when
no repo is specified or the repo is empty:

> Error while walking repo dir path=/workflow err=lstat /workflow:
> no such file or directory

In addition, the log level of walk/chown errors is changed to warning to
highlight possible issues.
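
The behavior described above can be sketched in Python. The runner itself is not written in Python and this is not its code; `chown_repo_dir` and the logger name are hypothetical, and the sketch assumes a Unix host where `os.chown` is available.

```python
import logging
import os
import tempfile

logger = logging.getLogger("runner_sketch")  # hypothetical logger name


def chown_repo_dir(path: str, uid: int, gid: int) -> bool:
    """Chown a repo dir if it exists; return False when there is nothing to do."""
    # Check first, so a missing dir (no repo, or an empty one) is skipped
    # quietly instead of producing a walk error.
    if not os.path.isdir(path):
        return False
    for root, _dirs, files in os.walk(path):
        # Chown each visited directory and the files directly inside it.
        for entry in [root] + [os.path.join(root, name) for name in files]:
            try:
                os.chown(entry, uid, gid, follow_symlinks=False)
            except OSError as err:
                # Logged as a warning so real permission problems stand out.
                logger.warning("chown failed path=%s err=%s", entry, err)
    return True


if __name__ == "__main__":
    repo = tempfile.mkdtemp()
    with open(os.path.join(repo, "file.txt"), "w") as f:
        f.write("data")
    print(chown_repo_dir(repo, os.getuid(), os.getgid()))
    print(chown_repo_dir(os.path.join(repo, "missing"), os.getuid(), os.getgid()))
```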