Draft
26 commits
b7d0c5b
change version to 2.53.0 (#59487)
aslonnie Dec 16, 2025
3332314
add missing cuda 12.9 ray-extra (#59495)
aslonnie Dec 17, 2025
1736619
[Data][Cherry-pick] Fix bug where `AutoscalingCoordinator` crashes if…
bveeramani Dec 17, 2025
0de2118
[Data] Concurrency cap backpressure with tuning (Disabled) (#59519)
srinathk10 Dec 17, 2025
25da710
[docker] Update latest Docker dependencies for 2.53.0 release (#59606)
khluu Dec 20, 2025
39df1b0
add vanilla train driver
TimothySeah Dec 19, 2025
5ef6975
draft
TimothySeah Dec 23, 2025
7838c78
basic script to copy
TimothySeah Jan 8, 2026
f1fbfb3
draft: in place workers
TimothySeah Jan 9, 2026
5e5fd88
try new copier.sh
TimothySeah Jan 9, 2026
7894af1
remove env
TimothySeah Jan 9, 2026
1ad9a5d
fixes - almost there except for shutdown
TimothySeah Jan 9, 2026
5f32482
all print metrics
TimothySeah Jan 9, 2026
8cccf76
fix env vars and parameters
TimothySeah Jan 10, 2026
436def7
try saving and loading dataloader state too
TimothySeah Jan 10, 2026
cfbee80
Revert "try saving and loading dataloader state too"
TimothySeah Jan 10, 2026
99f5b4e
1 epoch + zero grad first
TimothySeah Jan 10, 2026
1693d4b
use step-based training termination as recommended by torchft team
TimothySeah Jan 13, 2026
bd88ae2
copier.sh now copies from repo to site packages
TimothySeah Jan 19, 2026
5cb09f1
break not exit
TimothySeah Jan 19, 2026
9c3e4d6
change torchft primitives only
TimothySeah Feb 4, 2026
27b1a83
convert to replicagroup based
TimothySeah Feb 7, 2026
699ff1a
Get model parallel example working!
TimothySeah Feb 8, 2026
e24f93f
confirm still works in other cases
TimothySeah Feb 8, 2026
814f464
correctly accumulate gradients across replica groups
TimothySeah Feb 8, 2026
9c5c53f
try with register - ddp._comm_hook looks more complicated
TimothySeah Feb 8, 2026
1 change: 1 addition & 0 deletions .buildkite/build.rayci.yml
@@ -101,6 +101,7 @@ steps:
--platform cu12.1.1-cudnn8 --platform cu12.3.2-cudnn9
--platform cu12.4.1-cudnn --platform cu12.5.1-cudnn
--platform cu12.6.3-cudnn --platform cu12.8.1-cudnn
--platform cu12.9.1-cudnn
--platform cpu
--image-type ray-extra --upload
depends_on:
2 changes: 0 additions & 2 deletions ci/lint/pydoclint-baseline.txt
@@ -1941,8 +1941,6 @@ python/ray/train/v2/_internal/execution/worker_group/worker_group.py
DOC103: Method `WorkerGroup.execute_single_async`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [**fn_kwargs: , *fn_args: , fn: Callable[..., T], rank: int].
DOC101: Method `WorkerGroup.execute_single`: Docstring contains fewer arguments than in function signature.
DOC103: Method `WorkerGroup.execute_single`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [**fn_kwargs: , *fn_args: , fn: Callable[..., T], rank: int].
DOC101: Method `WorkerGroup._assign_worker_ranks`: Docstring contains fewer arguments than in function signature.
DOC103: Method `WorkerGroup._assign_worker_ranks`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [workers: List[Worker]].
DOC101: Method `WorkerGroup._decorate_worker_log_file_paths`: Docstring contains fewer arguments than in function signature.
DOC103: Method `WorkerGroup._decorate_worker_log_file_paths`: Docstring arguments are different from function arguments. (Or could be other formatting issues: https://jsh9.github.io/pydoclint/violation_codes.html#notes-on-doc103 ). Arguments in the function signature but not in the docstring: [workers: List[Worker]].
--------------------
2 changes: 1 addition & 1 deletion ci/ray_ci/utils.py
@@ -19,7 +19,7 @@
GLOBAL_CONFIG_FILE = (
os.environ.get("RAYCI_GLOBAL_CONFIG") or "ci/ray_ci/oss_config.yaml"
)
-RAY_VERSION = "3.0.0.dev0"
+RAY_VERSION = "2.53.0"


def ci_init() -> None:
136 changes: 136 additions & 0 deletions copier.sh
@@ -0,0 +1,136 @@
#!/usr/bin/env bash
set -uo pipefail

# Source base: ray repo on head node
SRC_BASE="/home/ray/default/ray/python"

# Destination base: ray's site-packages location
DST_BASE=$(python -c "import ray; import os; print(os.path.dirname(ray.__path__[0]))")

# Parse arguments - if none provided, copy entire directory
if [[ $# -eq 0 ]]; then
    FILES=("") # Empty string means copy the whole SRC_BASE
else
    FILES=("$@")
fi

PORT=2222
SSH_OPTS=(-p "$PORT" -o StrictHostKeyChecking=accept-new)
SCP_OPTS=(-P "$PORT" -o StrictHostKeyChecking=accept-new)

echo "Source base: $SRC_BASE"
echo "Destination base: $DST_BASE"
echo "Files to copy:"
for f in "${FILES[@]}"; do
    if [[ -z "$f" ]]; then
        echo " - (entire directory)"
    else
        echo " - $f"
    fi
done
echo ""

# Get node IPs using Python/Ray
echo "==> Getting node IPs from Ray cluster..."
NODE_INFO=$(python -c "
import ray
ray.init(ignore_reinit_error=True)
nodes = ray.nodes()
head_ip = None
worker_ips = []
for node in nodes:
    if node['Alive']:
        ip = node['NodeManagerAddress']
        if node.get('Resources', {}).get('node:__internal_head__'):
            head_ip = ip
        else:
            worker_ips.append(ip)
# If no explicit head marker, use first node as head
if head_ip is None and worker_ips:
    head_ip = worker_ips.pop(0)
print(f'HEAD:{head_ip or \"\"}')
for ip in worker_ips:
    print(f'WORKER:{ip}')
")

if [[ -z "$NODE_INFO" ]]; then
    echo "Error: Could not get node info from Ray. Is Ray running?"
    exit 1
fi

echo "$NODE_INFO"
echo ""

# Parse head and worker IPs
HEAD_IP=$(echo "$NODE_INFO" | grep "^HEAD:" | cut -d: -f2)
WORKER_IPS=()
while IFS= read -r line; do
    ip=$(echo "$line" | cut -d: -f2)
    if [[ -n "$ip" ]]; then
        WORKER_IPS+=("$ip")
    fi
done <<< "$(echo "$NODE_INFO" | grep "^WORKER:")"

if [[ -z "$HEAD_IP" ]]; then
    echo "Error: Could not determine head node IP"
    exit 1
fi

echo "Head node IP: $HEAD_IP (this machine)"

echo "Worker node IPs: ${WORKER_IPS[*]:-none}"
echo ""

# Re-enable exit on error for the copy operations
set -e

# Copy from SRC to DST on head node (local)
echo "==> Copying to head node (local)"
for f in "${FILES[@]}"; do
    if [[ -z "$f" ]]; then
        src="$SRC_BASE"
        dst="$DST_BASE"
    else
        src="$SRC_BASE/$f"
        dst="$DST_BASE/$f"
    fi
    echo " $src -> $dst"
    if [[ -e "$dst" ]]; then
        cp -r "$dst" "${dst}.bak.$(date +%s)"
    fi
    mkdir -p "$(dirname "$dst")"
    cp -r "$src" "$dst"
done
echo " Done copying to head node"

# Copy from SRC to DST on each worker node
for worker_ip in "${WORKER_IPS[@]}"; do
    echo "==> Copying to worker ($worker_ip)"

    for f in "${FILES[@]}"; do
        if [[ -z "$f" ]]; then
            src="$SRC_BASE"
            dst="$DST_BASE"
        else
            src="$SRC_BASE/$f"
            dst="$DST_BASE/$f"
        fi
        echo " $src -> $worker_ip:$dst"

        # Note: $dst expands locally (intentional - we pass the value to remote)
        # shellcheck disable=SC2029
        ssh "${SSH_OPTS[@]}" "$worker_ip" "mkdir -p '$(dirname "$dst")'"

        # Note: $dst expands locally, \$(date +%s) expands on remote (intentional)
        # shellcheck disable=SC2029
        ssh "${SSH_OPTS[@]}" "$worker_ip" "test -e '$dst' && cp -r '$dst' '${dst}.bak.\$(date +%s)' || true"

        # Copy from head (local) to worker
        scp -r "${SCP_OPTS[@]}" "$src" "$worker_ip:$dst"
    done

    echo " Done copying to $worker_ip"
done

echo ""
echo "All copies complete."
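
Note on the node-discovery step in copier.sh: the head/worker split can be exercised without a live cluster by pulling it into a plain function. The sketch below is illustrative only and is not part of this PR. It assumes node records shaped like the dictionaries the script reads from `ray.nodes()` (`Alive`, `NodeManagerAddress`, `Resources`); the function name `split_head_and_workers` and the sample IPs are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Resource key that copier.sh checks to recognise the head node.
HEAD_MARKER = "node:__internal_head__"


def split_head_and_workers(nodes: List[Dict]) -> Tuple[Optional[str], List[str]]:
    """Return (head_ip, worker_ips) from records shaped like ray.nodes() output."""
    head_ip = None
    worker_ips = []
    for node in nodes:
        if not node.get("Alive"):
            continue  # dead nodes are skipped, as in the script
        ip = node["NodeManagerAddress"]
        if node.get("Resources", {}).get(HEAD_MARKER):
            head_ip = ip
        else:
            worker_ips.append(ip)
    # Same fallback as the script: with no head marker, the first node becomes head.
    if head_ip is None and worker_ips:
        head_ip = worker_ips.pop(0)
    return head_ip, worker_ips


# Hypothetical sample data: one worker, one head, one dead node.
sample = [
    {"Alive": True, "NodeManagerAddress": "10.0.0.2", "Resources": {"CPU": 8.0}},
    {"Alive": True, "NodeManagerAddress": "10.0.0.1",
     "Resources": {"CPU": 8.0, HEAD_MARKER: 1.0}},
    {"Alive": False, "NodeManagerAddress": "10.0.0.3", "Resources": {}},
]
head, workers = split_head_and_workers(sample)
print(f"HEAD:{head or ''}")
for ip in workers:
    print(f"WORKER:{ip}")
```

The printed lines use the same `HEAD:` / `WORKER:` prefixes that the shell code later parses with `grep` and `cut`.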
1 change: 1 addition & 0 deletions default_copier.sh
@@ -0,0 +1 @@
./copier.sh ray/train/v2/_internal/constants.py ray/train/v2/_internal/execution/controller/controller.py ray/train/v2/_internal/execution/worker_group/worker.py ray/train/v2/_internal/execution/worker_group/worker_group.py ray/train/v2/api/data_parallel_trainer.py ray/train/torch/config.py ray/train/v2/_internal/callbacks/backend_setup.py ray/train/v2/_internal/execution/context.py ray/train/v2/api/context.py
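
Note on how these relative paths are used: copier.sh joins each argument onto both the source base and the destination base, and backs up an existing destination under a `.bak.<epoch seconds>` suffix before overwriting it. The sketch below only computes that plan and copies nothing. It is an illustration under stated assumptions: `plan_copies` and the `/tmp/site-packages` destination are hypothetical names, and the real destination is whatever the script derives from `ray.__path__`.

```python
import time
from pathlib import PurePosixPath
from typing import List, Optional, Tuple


def plan_copies(src_base: str, dst_base: str, files: List[str],
                now: Optional[int] = None) -> List[Tuple[str, str, str]]:
    """Return (src, dst, backup) triples; an empty list means the whole tree."""
    stamp = int(time.time()) if now is None else now
    plan = []
    for rel in files or [""]:
        # An empty relative path stands for the base directory itself.
        src = str(PurePosixPath(src_base) / rel) if rel else src_base
        dst = str(PurePosixPath(dst_base) / rel) if rel else dst_base
        plan.append((src, dst, f"{dst}.bak.{stamp}"))
    return plan


# Source base is the one from copier.sh; the destination here is hypothetical.
plan = plan_copies(
    "/home/ray/default/ray/python",
    "/tmp/site-packages",
    ["ray/train/torch/config.py"],
    now=1700000000,
)
for src, dst, backup in plan:
    print(f"{src} -> {dst} (backup: {backup})")
```

Passing a fixed `now` keeps the backup name deterministic for testing; the script itself calls `date +%s` at copy time.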
2 changes: 1 addition & 1 deletion doc/source/ray-overview/installation.rst
@@ -423,7 +423,7 @@ We publish the dependencies that are installed in our ``ray`` Docker images for
.. tab-item:: ray (Python 3.10)
:sync: ray (Python 3.10)

-Ray version: nightly (`ae94ff4 <https://github.com/ray-project/ray/commit/ae94ff496a308c52100cd99b1857836b739498e0>`_)
+Ray version: 2.53.0 (`0de2118 <https://github.com/ray-project/ray/commit/0de211850589aea71f842873bc32574c702ab492>`_)

.. literalinclude:: ./pip_freeze_ray-py310-cpu.txt

51 changes: 28 additions & 23 deletions doc/source/ray-overview/pip_freeze_ray-py310-cpu.txt
@@ -4,6 +4,7 @@ aiohttp==3.11.16
aiohttp-cors==0.7.0
aiosignal==1.3.1
amqp==5.3.1
annotated-doc==0.0.4
annotated-types==0.6.0
anyio==3.7.1
archspec @ file:///home/conda/feedstock_root/build_artifacts/archspec_1737352602016/work
@@ -18,11 +19,11 @@ billiard==4.2.1
boltons @ file:///home/conda/feedstock_root/build_artifacts/boltons_1733827268945/work
boto3==1.29.7
botocore==1.32.7
Brotli @ file:///home/conda/feedstock_root/build_artifacts/brotli-split_1761591974641/work
Brotli @ file:///home/conda/feedstock_root/build_artifacts/brotli-split_1764016952863/work
cachetools==5.5.2
celery==5.5.3
certifi==2025.1.31
cffi==1.16.0
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1725560520483/work
charset-normalizer==3.3.2
click==8.1.7
click-didyoumean==0.3.1
@@ -31,20 +32,20 @@ click-repl==0.3.0
cloudpickle==2.2.0
colorama @ file:///home/conda/feedstock_root/build_artifacts/colorama_1733218098505/work
colorful==0.5.5
conda @ file:///home/conda/feedstock_root/build_artifacts/conda_1760108586898/work/conda-src
conda-libmamba-solver @ file:///home/conda/feedstock_root/build_artifacts/conda-libmamba-solver_1745834476052/work/src
conda @ file:///home/conda/feedstock_root/build_artifacts/conda_1765816446718/work/conda-src
conda-libmamba-solver @ file:///home/conda/feedstock_root/build_artifacts/conda-libmamba-solver_1764081326783/work/src
conda-package-handling @ file:///home/conda/feedstock_root/build_artifacts/conda-package-handling_1736345463896/work
conda_package_streaming @ file:///home/conda/feedstock_root/build_artifacts/conda-package-streaming_1729004031731/work
cryptography==44.0.3
cupy-cuda12x==13.1.0
cupy-cuda12x==13.4.0
Cython==0.29.37
distlib==0.3.7
distro @ file:///home/conda/feedstock_root/build_artifacts/distro_1734729835256/work
dm-tree==0.1.8
exceptiongroup==1.3.0
exceptiongroup==1.3.1
Farama-Notifications==0.0.4
fastapi==0.115.12
fastrlock==0.8.2
fastapi==0.121.0
fastrlock==0.8.3
filelock==3.17.0
flatbuffers==23.5.26
frozendict @ file:///home/conda/feedstock_root/build_artifacts/frozendict_1763082794572/work
@@ -74,17 +75,19 @@ isodate==0.6.1
Jinja2==3.1.6
jmespath==1.0.1
jsonpatch @ file:///home/conda/feedstock_root/build_artifacts/jsonpatch_1733814567314/work
jsonpointer @ file:///home/conda/feedstock_root/build_artifacts/jsonpointer_1756754132747/work
jsonpointer @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_jsonpointer_1765026384/work
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
kombu==5.5.4
libmambapy @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_libmambapy_1760729597/work/libmambapy
lz4==4.3.3
libmambapy @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_libmambapy_1764158555/work/libmambapy
linkify-it-py==2.0.3
lz4==4.4.5
markdown-it-py==2.2.0
MarkupSafe==2.1.3
mdit-py-plugins==0.3.5
mdurl==0.1.2
memray==1.10.0
menuinst @ file:///home/conda/feedstock_root/build_artifacts/menuinst_1761299740801/work
memray==1.19.1
menuinst @ file:///home/conda/feedstock_root/build_artifacts/menuinst_1765733081264/work
msal==1.28.1
msal-extensions==1.2.0b1
msgpack==1.0.7
@@ -98,7 +101,7 @@ opentelemetry-proto==1.27.0
opentelemetry-sdk==1.34.1
opentelemetry-semantic-conventions==0.55b1
ormsgpack==1.7.0
packaging==23.0
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1733203243479/work
pandas==1.5.3
platformdirs==3.11.0
pluggy @ file:///home/conda/feedstock_root/build_artifacts/pluggy_1733222765875/work
@@ -115,8 +118,8 @@ pyasn1==0.5.1
pyasn1-modules==0.3.0
pycosat @ file:///home/conda/feedstock_root/build_artifacts/pycosat_1757744612102/work
pycparser==2.21
pydantic==2.11.7
pydantic_core==2.33.2
pydantic==2.12.4
pydantic_core==2.41.5
Pygments==2.18.0
PyJWT==2.8.0
pyOpenSSL==25.0.0
@@ -125,11 +128,11 @@ PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_173321723672
python-dateutil==2.8.2
python-dotenv==1.2.1
pytz==2022.7.1
PyYAML==6.0.1
ray @ file:///home/ray/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl#sha256=c096db8428166779cf10817e3698d3fdf50e11e45911cc95f6feeb2b44cdc481
PyYAML==6.0.3
ray @ file:///home/ray/ray-2.53.0-cp310-cp310-manylinux2014_x86_64.whl#sha256=ec758f5aa71f01f090557a0fe8732689f7e2f8e49a1f39f4649ee9a7804c7514
referencing==0.36.2
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1733217035951/work
rich==13.3.2
requests==2.32.5
rich==13.7.1
rpds-py==0.22.3
rsa==4.7.2
ruamel.yaml @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml_1761160605807/work
@@ -139,13 +142,15 @@ scipy==1.11.4
six==1.16.0
smart-open==6.2.0
sniffio==1.3.1
starlette==0.46.2
starlette==0.49.1
tensorboardX==2.6.2.2
textual==4.0.0
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1735661334605/work
truststore @ file:///home/conda/feedstock_root/build_artifacts/truststore_1729762363021/work
typing-inspection==0.4.1
typing_extensions==4.12.2
typing-inspection==0.4.2
typing_extensions==4.15.0
tzdata==2025.2
uc-micro-py==1.0.3
uritemplate==4.1.1
urllib3==1.26.19
uvicorn==0.22.0
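
Note for reviewing the pip freeze change: because this listing interleaves old and new pins, it can help to compare the two freeze files programmatically. The sketch below is a minimal illustration, not part of this PR. It assumes only the two line shapes visible above (`name==version` and `name @ url`); `parse_freeze` and `diff_freeze` are hypothetical helper names, and the sample pins are taken from the listing, with their old/new assignment being my reading of the diff.

```python
from typing import Dict, List, Tuple


def parse_freeze(lines: List[str]) -> Dict[str, str]:
    """Map package name to its pin: the version for `==` lines, the URL for `@` lines."""
    pins = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if " @ " in line:
            name, ref = line.split(" @ ", 1)
        elif "==" in line:
            name, ref = line.split("==", 1)
        else:
            continue  # unrecognised line shape; ignored in this sketch
        pins[name.strip().lower()] = ref.strip()
    return pins


def diff_freeze(old: List[str], new: List[str]
                ) -> Tuple[List[str], List[str], Dict[str, Tuple[str, str]]]:
    """Return (added names, removed names, {name: (old pin, new pin)})."""
    old_pins, new_pins = parse_freeze(old), parse_freeze(new)
    added = sorted(set(new_pins) - set(old_pins))
    removed = sorted(set(old_pins) - set(new_pins))
    changed = {
        name: (old_pins[name], new_pins[name])
        for name in sorted(set(old_pins) & set(new_pins))
        if old_pins[name] != new_pins[name]
    }
    return added, removed, changed


old = ["cupy-cuda12x==13.1.0", "lz4==4.3.3", "starlette==0.46.2", "six==1.16.0"]
new = ["annotated-doc==0.0.4", "cupy-cuda12x==13.4.0", "lz4==4.4.5",
       "starlette==0.49.1", "six==1.16.0"]
added, removed, changed = diff_freeze(old, new)
print("added:", added)
print("removed:", removed)
for name, (before, after) in changed.items():
    print(f"{name}: {before} -> {after}")
```

Names are lower-cased before comparison so that a casing change alone is not reported as an add plus a remove.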