[data] Add load_from_uris
#55554
Conversation
Signed-off-by: Matthew Owen <[email protected]>
Code Review

This pull request introduces a new `load_from_uris` method to the `Dataset` class, intended to load data from a column of URIs. My review focuses on several critical issues in the implementation: the use of `take_all()` can lead to driver OOM for large datasets, and the core logic for applying the `decode_fn` is incorrect due to a type mismatch with `Dataset.map()`. There are also several other medium- to high-severity issues related to potential runtime errors, style, and hardcoded values. I've provided suggestions to address these points.
# print(f"sampled_size: {sampled_size}") | ||
# print(f"in_memory_size_estimate: {in_memory_size_estimate}") | ||
# print(f"repartitioning to {num_partitions} partitions") | ||
return self.repartition(num_partitions).map(decode_fn) |
The `decode_fn` is being passed to `Dataset.map()`, which expects a function that operates on individual rows (i.e., `Dict[str, Any]`). However, the type hint and apparent intent of `decode_fn` are to operate on a file-like object (`io.RawIOBase`). This will cause a `TypeError` at runtime.

You likely need to use `map_batches()` and, within the provided function, open and read the file for each URI in the batch, then apply the `decode_fn`.
+1
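For illustration, a minimal sketch of what the `map_batches()` approach could look like. This is not the PR's code: the helper name, the use of `pyarrow.fs` to resolve URIs, and the assumption that `decode_fn` yields row dicts (rather than whole batches, as the `Iterator[DataBatch]` hint suggests) are all assumptions for the sketch.

```python
# Hypothetical sketch only, not the PR's implementation. Assumes decode_fn
# takes a file-like object and yields row dicts, and that pyarrow can resolve
# the URI schemes appearing in the column (s3://, local paths, etc.).
import pandas as pd
import pyarrow.fs as pafs


def make_uri_batch_decoder(uri_column, decode_fn):
    def _decode_uri_batch(batch: pd.DataFrame) -> pd.DataFrame:
        rows = []
        for uri in batch[uri_column]:
            fs, path = pafs.FileSystem.from_uri(uri)  # resolve filesystem + path
            with fs.open_input_stream(path) as f:      # file-like object for decode_fn
                rows.extend(decode_fn(f))
        return pd.DataFrame(rows)

    return _decode_uri_batch


# Usage (hypothetical):
# ds.map_batches(make_uri_batch_decoder("uri", decode_fn), batch_format="pandas")
```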
# Step 2: Sample paths to estimate total in memory size after reading.
sampled_paths = [
    row[uri_column] for row in random.sample(paths, RANDOM_SAMPLE_SIZE)
`random.sample(population, k)` will raise a `ValueError` if `k > len(population)`. This can happen here if the number of paths is less than `RANDOM_SAMPLE_SIZE`. You should guard against this by taking the minimum of the number of paths and the sample size.
Suggested change:
- row[uri_column] for row in random.sample(paths, RANDOM_SAMPLE_SIZE)
+ row[uri_column] for row in random.sample(paths, min(len(paths), RANDOM_SAMPLE_SIZE))
RANDOM_SAMPLE_SIZE = 10
import math
import random
def default_decode_fn(f: io.RawIOBase) -> Iterator[DataBatch]:
    return f
The default implementation of `decode_fn` returns the file-like object `f` itself. However, the type hint for `decode_fn` specifies a return type of `Iterator[DataBatch]`. This mismatch should be corrected. The default behavior should probably be to read the file and return it as a binary chunk in a `DataBatch`.
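As a hedged sketch of a type-consistent default: read the whole file and yield it as a single-row batch. The `"bytes"` column name and the dict-of-columns batch shape are assumptions for illustration, not something settled in this PR (the simplified dict annotation below stands in for `DataBatch`).

```python
import io
from typing import Dict, Iterator, List


def default_decode_fn(f: io.RawIOBase) -> Iterator[Dict[str, List[bytes]]]:
    # Read the whole file and yield it as a one-row, single-column batch.
    # The "bytes" column name is an assumption for illustration only.
    yield {"bytes": [f.read()]}
```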
num_partitions = math.ceil(
    in_memory_size_estimate / (1024 * 1024 * 1024)
)  # (Each task reads 1 GiB)
The number of partitions is calculated using a hardcoded 1 GiB block size. It's better to use `DataContext.get_current().target_max_block_size` to respect the configured block size for the dataset, making the behavior more consistent and configurable.
Suggested change:
- num_partitions = math.ceil(
-     in_memory_size_estimate / (1024 * 1024 * 1024)
- )  # (Each task reads 1 GiB)
+ target_max_block_size = DataContext.get_current().target_max_block_size or (
+     1024 * 1024 * 1024
+ )
+ num_partitions = math.ceil(
+     in_memory_size_estimate / target_max_block_size
+ )
For consistency with the `Datasource` implementation, agree `target_max_block_size` might be better here.
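One small note on the suggested snippet above: it assumes `DataContext` is already in scope in this module; if it isn't, an import such as `from ray.data.context import DataContext` would be needed.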
# print(f"num_rows: {len(paths)}") | ||
# print(f"total_sampled_size: {total_sampled_size}") | ||
# print(f"sampled_size: {sampled_size}") | ||
# print(f"in_memory_size_estimate: {in_memory_size_estimate}") | ||
# print(f"repartitioning to {num_partitions} partitions") |
self,
uri_column: str,
*,
decode_fn: Optional[Callable[[io.RawIOBase], Iterator[DataBatch]]] = None,
To avoid blocking this PR on discussion about `decode_fn`, I think it might be better to leave it out for now.
Think this is fine
import math
import random
Move imports to top of file?
total_sampled_size = (
    ray.data.read_binary_files(sampled_paths)
    .map(lambda r: {"size_bytes": len(r["bytes"])})
    .sum("size_bytes")
)
Can we use `FileMetadataProvider` to get the file sizes? It'll be much cheaper than actually reading the data.
And actually, if we're just listing all the file sizes with `FileMetadataProvider`, I think we can just take the mean.
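Roughly what a metadata-only estimate could look like. This sketch uses pyarrow's `get_file_info` directly rather than Ray's `FileMetadataProvider`, purely to illustrate taking the mean of sampled file sizes without reading any bytes; the actual provider-based code may look different.

```python
# Illustration only: estimate the mean file size from filesystem metadata,
# without downloading file contents.
import pyarrow.fs as pafs


def estimate_mean_file_size(sampled_uris):
    sizes = []
    for uri in sampled_uris:
        fs, path = pafs.FileSystem.from_uri(uri)
        sizes.append(fs.get_file_info(path).size)  # metadata lookup only
    return sum(sizes) / len(sizes) if sizes else 0
```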
"""Load binary data from a column of URIs. | ||
|
||
Args: | ||
uri_column: The name of the column containing the URIs. |
Please be sure to document what the returned column type is and what the new column is named.
Going to go with a different approach using the expressions API.
Why are these changes needed?
WIP, will add more description later.
Related issue number
Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.