stress test working #55539
Conversation
Signed-off-by: avigyabb <[email protected]>
Code Review

This pull request addresses a race condition with duplicate GPU object references by changing `_gpu_object_store` from a simple dictionary to a dictionary of deques. This allows multiple copies of an object to be queued and processed correctly. The addition of `test_duplicate_objectref_transfer` and `test_duplicate_objectref_race_stress` is great for verifying this fix. However, the changes to `pop_object` have introduced a critical issue: a memory leak where primary-copy metadata is never cleaned up. I've also pointed out a couple of minor style issues.
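For readers skimming the diff, here is a minimal illustrative sketch of the dictionary-of-deques layout the review describes. The class and method names are simplified stand-ins, not Ray's actual internals:

```python
from collections import defaultdict, deque

class SketchGPUObjectStore:
    """Toy stand-in for the per-actor GPU object store described above."""

    def __init__(self):
        # obj_id -> deque of tensor lists, one entry per queued copy.
        self._gpu_object_store = defaultdict(deque)

    def add_object(self, obj_id, tensors):
        self._gpu_object_store[obj_id].append(tensors)

    def pop_object(self, obj_id):
        queue = self._gpu_object_store.get(obj_id)
        assert queue, f"obj_id={obj_id} not found in GPU object store"
        return queue.popleft()

store = SketchGPUObjectStore()
# Two transfers of the same ObjectRef queue two copies instead of clobbering one.
store.add_object("obj-1", ["tensor-a"])
store.add_object("obj-1", ["tensor-a"])
assert store.pop_object("obj-1") == ["tensor-a"]
assert store.pop_object("obj-1") == ["tensor-a"]
```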
```python
queue = self._gpu_object_store.get(obj_id)
assert queue and len(queue) > 0, f"obj_id={obj_id} not found in GPU object store"
return queue.popleft()
```
The new implementation of `pop_object` introduces a memory leak. The previous version of this method was responsible for cleaning up an object's entry from `_primary_gpu_object_ids` when it was popped. This cleanup logic has been removed, meaning that `_primary_gpu_object_ids` will grow indefinitely as primary copies are created but never removed.

Additionally, when the queue for an `obj_id` becomes empty, the key will persist in `_gpu_object_store` because it's a `defaultdict`. This should also be cleaned up.

I suggest restoring the cleanup logic. The updated implementation below will remove the object from `_primary_gpu_object_ids` and `_gpu_object_store` once its queue is empty.
```diff
-queue = self._gpu_object_store.get(obj_id)
-assert queue and len(queue) > 0, f"obj_id={obj_id} not found in GPU object store"
-return queue.popleft()
+queue = self._gpu_object_store.get(obj_id)
+assert queue and len(queue) > 0, f"obj_id={obj_id} not found in GPU object store"
+tensors = queue.popleft()
+if not queue:
+    # The queue is empty, clean up the entry.
+    del self._gpu_object_store[obj_id]
+    self._primary_gpu_object_ids.discard(obj_id)
+return tensors
```
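As a sanity check, here is a self-contained sketch (a simplified stand-in, not the actual Ray class) of the behavior the suggestion aims for: once the last queued copy is popped, both bookkeeping structures drop the key. Note that `discard` rather than `remove` keeps the pop from raising when the object was never a primary copy.

```python
from collections import defaultdict, deque

class SketchStore:
    def __init__(self):
        self._gpu_object_store = defaultdict(deque)
        self._primary_gpu_object_ids = set()

    def pop_object(self, obj_id):
        queue = self._gpu_object_store.get(obj_id)
        assert queue and len(queue) > 0, f"obj_id={obj_id} not found in GPU object store"
        tensors = queue.popleft()
        if not queue:
            # The queue is empty, clean up the entry.
            del self._gpu_object_store[obj_id]
            self._primary_gpu_object_ids.discard(obj_id)
        return tensors

store = SketchStore()
store._gpu_object_store["obj-1"].append(["tensor-a"])
store._primary_gpu_object_ids.add("obj-1")
store.pop_object("obj-1")
# After the last copy is popped, neither structure retains the key.
assert "obj-1" not in store._gpu_object_store
assert "obj-1" not in store._primary_gpu_object_ids
```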
```diff
@@ -68,6 +69,7 @@ def __ray_recv__(
 ):
     """Helper function that runs on the dst actor to receive tensors from the src actor."""
     from ray._private.worker import global_worker
+    # ~ signal that an object is arriving
```
Signed-off-by: avigyabb <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Screenshot: https://github.com/user-attachments/assets/8f3d6f4b-cc82-4a4b-8a01-6a15e9e628b7
- Adds pXX filters to reduce the number of panels
- Adds Task output backpressure time
- Fixes metrics for task completion time - wall time for task completion without backpressure

Reference PRs:
- iamjustinhsu#1
- ray-project#55025

Signed-off-by: iamjustinhsu <[email protected]>
Demo video: https://github.com/user-attachments/assets/402efd79-cda2-467e-b84a-774ccd69efa5

Signed-off-by: iamjustinhsu <[email protected]>
cpu bases above cuda bases. simpler case should be listed first. Signed-off-by: Lonnie Liu <[email protected]>
Filtering by Operator will be useful to see only a single operator at a time. Screenshot: https://github.com/user-attachments/assets/90b083ed-cf92-4071-9504-e1b214f20724

Signed-off-by: Alan Guo <[email protected]>
When working with search and recommendation systems, datasets often contain numerous columns, resulting in large metadata overhead in Parquet files (sometimes a few MBs or more per file). Currently, the driver first fetches all metadata, then simplifies and merges it to reduce memory usage. However, this process can cause memory peaks proportional to the number of fragments multiplied by their metadata size, potentially leading to OOM issues.

This PR addresses the problem by **simplifying** and **merging** the dataset metadata within each `_fetch_metadata` task before sending it back to the driver. This change helps lower memory consumption and reduces the risk of OOM errors.

Test script:
```
from pyarrow._fs import FileSystem
from ray.data.datasource.file_meta_provider import _get_file_infos
import os
import ray
import psutil

hdfs_path = "hdfs://path/to/dataset"

def list_files(remote_path):
    filesystem, remote_path = FileSystem.from_uri(remote_path)
    files = _get_file_infos(remote_path, filesystem)
    return filesystem, files

filesystem, files = list_files(hdfs_path)
files = [f"{fs[0]}" for fs in files if fs[0].endswith(".parquet")]
process = psutil.Process()
print(f"total file_num: {len(files)}")

start_mem = process.memory_info().rss / 1024**2
dataset = ray.data.read_parquet(paths=files[:500], filesystem=filesystem)
end_mem = process.memory_info().rss / 1024**2
print(f"datasize: {dataset.count()}, col number: {len(dataset.columns())}")
print(f"mem diff {end_mem - start_mem:.3f}MiB [start: {start_mem:.3f}MiB, end: {end_mem:.3f}MiB]")
```

Output before this PR:
```
total file_num: 2358
Metadata Fetch Progress 0: 100%|███████████████| 500/500 [02:13<00:00, 3.75 task/s]
Parquet Files Sample 0: 100%|███████████████| 5.00/5.00 [00:13<00:00, 2.62s/ file]
datasize: 22630452, col number: 1200
mem diff 13727.605MiB [start: 570.617MiB, end: 14298.223MiB]
```

Output after this PR:
```
total file_num: 2358
Metadata Fetch Progress 0: 100%|███████████████| 500/500 [01:55<00:00, 4.35 task/s]
Parquet Files Sample 0: 100%|███████████████| 5.00/5.00 [00:03<00:00, 1.56 file/s]
datasize: 22630452, col number: 1200
mem diff 69.113MiB [start: 575.820MiB, end: 644.934MiB]
```

The memory usage drops from roughly 13.7 GiB to about 69 MiB.

Note: this approach is most effective for large-scale datasets. If `len(fragments) < PARALLELIZE_META_FETCH_THRESHOLD`, there will be no performance improvements.

Signed-off-by: haotian <[email protected]>
Signed-off-by: Howie Tien <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <[email protected]>
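For intuition, here is a toy, illustrative-only sketch of the "merge metadata inside the fetch task" pattern that the commit above describes; the task body and field names are assumptions, not Ray Data's actual `_fetch_metadata` implementation:

```python
import ray

@ray.remote
def fetch_and_merge_metadata(fragment_paths):
    # Pretend each path yields a bulky per-file metadata blob (in Ray Data this
    # would come from Parquet file footers).
    per_file = [{"path": p, "num_rows": 1_000, "schema_cols": 1_200} for p in fragment_paths]
    # Reduce to one small summary *inside the task*, so the driver never holds
    # every per-file metadata object at the same time.
    return {
        "num_files": len(per_file),
        "num_rows": sum(m["num_rows"] for m in per_file),
        "schema_cols": per_file[0]["schema_cols"] if per_file else 0,
    }

if __name__ == "__main__":
    ray.init(ignore_reinit_error=True)
    print(ray.get(fetch_and_merge_metadata.remote(["a.parquet", "b.parquet"])))
```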
Thanks! Could you also update the PR title and description?
…t#55522) Perf can differ at various concurrencies. This makes it easy to configure / run at various concurrencies + max_ongoing_request combinations.

Signed-off-by: akyang-anyscale <[email protected]>
Signed-off-by: avigyabb <[email protected]>
Signed-off-by: avigyabb <[email protected]>
…tor object (ray-project#55538) Signed-off-by: Seiji Eicher <[email protected]>
…ject#55503) Currently, we use sum(shuffle_task.cpu_usage for all shuffling tasks). This can be very slow for a large number of shuffling tasks. We can mitigate this by calculating the usage on every task submission.

Signed-off-by: iamjustinhsu <[email protected]>
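A toy sketch of that optimization: maintain a running total that is updated on task submission and completion instead of re-summing every in-flight task's `cpu_usage` on each query. Class and method names here are illustrative, not Ray Data's real internals:

```python
class ShuffleUsageTracker:
    def __init__(self):
        self._total_cpu_usage = 0.0

    def on_task_submitted(self, task_cpu_usage: float) -> None:
        self._total_cpu_usage += task_cpu_usage

    def on_task_finished(self, task_cpu_usage: float) -> None:
        self._total_cpu_usage -= task_cpu_usage

    def current_usage(self) -> float:
        # O(1) instead of O(number of in-flight shuffle tasks).
        return self._total_cpu_usage

tracker = ShuffleUsageTracker()
for _ in range(3):
    tracker.on_task_submitted(1.0)
tracker.on_task_finished(1.0)
assert tracker.current_usage() == 2.0
```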
…project#55525) - Shortening bazel workspace name: `s/com_github_ray_project_ray/io_ray` - Set `--output_base=c:/bzl` instead of `--output_user_root=c:/raytmp`, which saves ~20 characters in the path. --------- Signed-off-by: Edward Oakes <[email protected]>
Currently, the test is allocated only 100 MB, while it's trying to store 2 GB worth of data in the Object Store.

Signed-off-by: Alexey Kudinkin <[email protected]>
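For context, the object store capacity of a local test cluster can be set explicitly via `ray.init`; a minimal sketch (the 2 GB figure simply mirrors the data size mentioned above, the exact value used by the test lives in the PR):

```python
import ray

# object_store_memory is specified in bytes; reserve ~2 GB so the test's data fits.
ray.init(object_store_memory=2 * 1024**3)
```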
Removing raylet_ip_address as it was never used (so it defaulted to node_ip_address) and is only used in private APIs. Originally introduced here: https://github.com/ray-project/ray/pull/7985/files

Signed-off-by: joshlee <[email protected]>
Currently the following pattern throws many lint errors, as `ActorDemoRay.options(name="demo_ray")` returns an instance of `ActorOptionWrapper`, which messes with the IDE's static type checker:

```python
import ray
from ray import ObjectRef
from ray.actor import ActorProxy, ActorClass


class DemoRay:
    def __init__(self, init: int):
        self.init = init

    @ray.method
    def calculate(self, v1: int, v2: int) -> int:
        return self.init + v1 + v2


ActorDemoRay: ActorClass[DemoRay] = ray.remote(DemoRay)


def main():
    p: ActorProxy[DemoRay] = ActorDemoRay.options(name="demo_ray").remote(1)
    actor: ActorProxy[DemoRay] = ray.get_actor("demo_ray")
    a = actor.calculate.remote(1, 2)
    print(ray.get(a))
    return


if __name__ == "__main__":
    main()
```

This PR changes `ActorClass[T].options(...)` to return a new instance of `ActorClass[T]` instead, allowing IDEs to correctly infer the type of subsequent `.remote(...)` calls. ray-project#54149

Signed-off-by: will.lin <[email protected]>
…oject#55564) These should always be built from the release test pipeline directly. We used to run release tests on postmerge, but we are no longer doing that. Also adds the oss tag for those steps.

Signed-off-by: Lonnie Liu <[email protected]>
…t#55013) Signed-off-by: will.lin <[email protected]> Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: Richard Liaw <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
…#55207) This PR builds off previous efforts to add a `JaxTrainer` and the [ray-tpu package](https://github.com/AI-Hypercomputer/ray-tpu/tree/main) to implement support for a `JaxTrainer` in RayTrain that supports SPMD workloads with TPUs. Support for more types of workloads (i.e. better support for CPU and GPU) can be added incrementally. In order to support SPMD locality-aware scheduling at the TPU slice level, we alter the `WorkerGroup` construction in V2 Ray Train to optionally accept multiple placement groups specs to apply to a range of workers. This enables us to reserve the "TPU head" using a placement group with label selectors, retrieve its unique `ray.io/tpu-slice-name`, and then schedule the remaining workers on that slice in a separate placement group. --------- Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Andrew Sy Kim <[email protected]> Co-authored-by: Andrew Sy Kim <[email protected]>
…etion of the execution (ray-project#55565) A change in 2.48 introduced debouncing that disallows downscaling of the Actor Pool for 30s after the latest upscaling, to give the Actor Pool operator enough time to start utilizing the upscaled actor. However, that affected the ability of the Actor Pool to downscale upon completion of the execution: when the operator completes execution, it should start downscaling immediately. This change addresses that.

Signed-off-by: Alexey Kudinkin <[email protected]>
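A toy sketch of the debounce rule described above, with the completion bypass; the constant and method names are assumptions, not Ray Data's actual autoscaler code:

```python
import time

UPSCALING_DEBOUNCE_S = 30.0

class ActorPoolSketch:
    def __init__(self):
        self._last_upscaled_at = None
        self.execution_completed = False

    def record_upscale(self):
        self._last_upscaled_at = time.monotonic()

    def can_downscale(self) -> bool:
        # Once execution has completed, downscale immediately, ignoring the debounce.
        if self.execution_completed:
            return True
        if self._last_upscaled_at is None:
            return True
        return time.monotonic() - self._last_upscaled_at >= UPSCALING_DEBOUNCE_S

pool = ActorPoolSketch()
pool.record_upscale()
assert not pool.can_downscale()   # still inside the 30s debounce window
pool.execution_completed = True
assert pool.can_downscale()       # completion bypasses the debounce
```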
…egator Agent (ray-project#55529) This PR improves the target HTTP endpoint handling in aggregator_agent.py:
- Merge the address and port into one env var that specifies the target HTTP endpoint
- Set the default value of the endpoint to empty; only when the endpoint is specified do we send the events out to it
- Update corresponding tests

Signed-off-by: Mengjin Yan <[email protected]>
Signed-off-by: myan <[email protected]>
…rt.md` (ray-project#55570) Signed-off-by: win5923 <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]> Co-authored-by: Kai-Hsun Chen <[email protected]>
added incorrectly in a past change Signed-off-by: Lonnie Liu <[email protected]>
for easier use on ray cluster hosters like anyscale. Signed-off-by: Lonnie Liu <[email protected]>
…eed (ray-project#55076) This adds a call `ray.experimental.wait_tensor_freed` that allows user code to check when a tensor that it put into Ray's GPU object store has been freed. Unlike the normal Ray object store, the GPU object store is just a Python data structure on the actor, which allows us to avoid copying. This means that the actor can keep a reference to an object in its store. The API call allows the actor to check when the object has been freed from the store, so that it can safely write to the tensor again. Closes ray-project#52341. --------- Signed-off-by: Stephanie wang <[email protected]> Signed-off-by: Stephanie Wang <[email protected]> Co-authored-by: Kai-Hsun Chen <[email protected]>
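Since the real `ray.experimental.wait_tensor_freed` signature isn't shown here, the following is a conceptual stand-in only: it mimics the semantics described above (the store holds the tensor by reference, and the producer waits for a "freed" signal before reusing the buffer) without claiming to be the actual Ray API:

```python
import threading

class ToyGpuObjectStore:
    """Stand-in: stores tensors by reference and signals when they are freed."""

    def __init__(self):
        self._freed_events = {}

    def add(self, obj_id, tensor):
        # Store by reference (no copy) and create an event to signal when freed.
        self._freed_events[obj_id] = threading.Event()
        return tensor

    def free(self, obj_id):
        self._freed_events[obj_id].set()

    def wait_tensor_freed(self, obj_id, timeout=None):
        return self._freed_events[obj_id].wait(timeout)

store = ToyGpuObjectStore()
buf = bytearray(16)               # stand-in for a GPU tensor
store.add("obj-1", buf)
store.free("obj-1")               # consumer is done with the tensor
assert store.wait_tensor_freed("obj-1", timeout=1.0)
buf[0] = 1                        # now safe to overwrite the buffer
```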
in addition to current tags. first step to migrate to use rayci build id tags to stop release test jobs from cross-talking to each other Signed-off-by: Lonnie Liu <[email protected]>
…t#55599) Signed-off-by: Mengjin Yan <[email protected]>
…es (ray-project#55355) This PR is a revert of ray-project#55333 and resolves conflicts with ray-project#55163.

Original description: Some frequently used metadata fields are missing in the export API schema. For both dataset and operator: state, execution start and end time. These fields are important for observing the lifecycle of the datasets and operators, and can be used to improve the accuracy of reported metrics, such as throughput, which relies on the duration.

Summary of change:
- Add state, execution start and end time to the export API schema
- Add a new state enum `PENDING` for dataset and operator, to represent the state when they are not running yet
- Refresh the metadata whenever the state of the dataset/operator gets updated; the event will always contain the latest snapshot of all the metadata

Signed-off-by: cong.qian <[email protected]>
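An illustrative-only sketch of a lifecycle state enum with an explicit `PENDING` state plus execution start/end timestamps; the names are assumptions, not the actual Ray Data export schema:

```python
import enum
import time
from dataclasses import dataclass
from typing import Optional

class OperatorState(enum.Enum):
    PENDING = "PENDING"     # created but not running yet
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"

@dataclass
class OperatorMetadata:
    name: str
    state: OperatorState = OperatorState.PENDING
    execution_start_time: Optional[float] = None
    execution_end_time: Optional[float] = None

    def start(self):
        self.state = OperatorState.RUNNING
        self.execution_start_time = time.time()

    def finish(self):
        self.state = OperatorState.FINISHED
        self.execution_end_time = time.time()

op = OperatorMetadata(name="ReadParquet")
op.start()
op.finish()
duration = op.execution_end_time - op.execution_start_time  # basis for throughput metrics
```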
…oject#55218) Signed-off-by: Seiji Eicher <[email protected]> Signed-off-by: Seiji Eicher <[email protected]> Co-authored-by: angelinalg <[email protected]>
- skipping test task processor for windows to unblock Signed-off-by: harshit <[email protected]>
…at (ray-project#55588) Signed-off-by: Seiji Eicher <[email protected]>
…ale images (ray-project#55580) rather than using the hard-coded filename Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: avigyabb <[email protected]>
Signed-off-by: avigyabb <[email protected]>
…gyabb/ray into avigyabb/object-ref-send-fix
Why are these changes needed?

Related issue number

Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.