Conversation

omatthew98
Contributor

Why are these changes needed?

WIP, will add more description later.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Matthew Owen <[email protected]>
@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new load_from_uris method to the Dataset class, intended to load data from a column of URIs. My review focuses on several critical issues in the implementation: the use of take_all() can lead to driver OOM for large datasets, and the core logic for applying the decode_fn is incorrect due to a type mismatch with Dataset.map(). There are also several medium- to high-severity issues related to potential runtime errors, style, and hardcoded values. I've provided suggestions to address these points.

# print(f"sampled_size: {sampled_size}")
# print(f"in_memory_size_estimate: {in_memory_size_estimate}")
# print(f"repartitioning to {num_partitions} partitions")
return self.repartition(num_partitions).map(decode_fn)
Contributor

critical

The decode_fn is being passed to Dataset.map(), which expects a function that operates on individual rows (i.e., Dict[str, Any]). However, the type hint and apparent intent of decode_fn are to operate on a file-like object (io.RawIOBase). This will cause a TypeError at runtime.

You likely need to use map_batches(); within the provided function, open and read the file for each URI in the batch, then apply the decode_fn.
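
For illustration, a minimal sketch of what that could look like; the helper name make_batch_reader and the use of pyarrow's filesystem API to open each URI are assumptions, not code from this PR:

import io
from typing import Any, Callable, Dict, Iterator

import pyarrow.fs

from ray.data.block import DataBatch


def make_batch_reader(
    uri_column: str,
    decode_fn: Callable[[io.RawIOBase], Iterator[DataBatch]],
):
    # Build a generator UDF for Dataset.map_batches(): open each URI in the
    # batch as a file-like object and stream decode_fn's batches back out.
    def read_and_decode(batch: Dict[str, Any]) -> Iterator[DataBatch]:
        for uri in batch[uri_column]:
            fs, path = pyarrow.fs.FileSystem.from_uri(uri)
            with fs.open_input_stream(path) as f:
                yield from decode_fn(f)

    return read_and_decode


# e.g. self.map_batches(make_batch_reader(uri_column, decode_fn))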

Member

+1


# Step 2: Sample paths to estimate total in memory size after reading.
sampled_paths = [
    row[uri_column] for row in random.sample(paths, RANDOM_SAMPLE_SIZE)
Contributor

high

random.sample(population, k) will raise a ValueError if k > len(population). This can happen here if the number of paths is less than RANDOM_SAMPLE_SIZE. You should guard against this by taking the minimum of the number of paths and the sample size.

Suggested change
-    row[uri_column] for row in random.sample(paths, RANDOM_SAMPLE_SIZE)
+    row[uri_column] for row in random.sample(paths, min(len(paths), RANDOM_SAMPLE_SIZE))

Comment on lines +472 to +474
RANDOM_SAMPLE_SIZE = 10
import math
import random
Contributor

medium

For better code style and to avoid re-declarations, RANDOM_SAMPLE_SIZE should be a module-level constant, and import math, import random should be at the top of the file.
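
A small sketch of that layout, assuming no other module-level naming conflicts:

# At the top of the file, alongside the other imports:
import math
import random

# Number of URIs to sample when estimating the in-memory dataset size.
RANDOM_SAMPLE_SIZE = 10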

Comment on lines +478 to +479
def default_decode_fn(f: io.RawIOBase) -> Iterator[DataBatch]:
    return f
Contributor

medium

The default implementation of decode_fn returns the file-like object f itself. However, the type hint for decode_fn specifies a return type of Iterator[DataBatch]. This mismatch should be corrected. The default behavior should probably be to read the file and return it as a binary chunk in a DataBatch.
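
For illustration, a minimal sketch of a default that matches the annotated return type; the "bytes" column name follows read_binary_files' convention and is an assumption here, not something this PR specifies:

import io
from typing import Iterator

import numpy as np

from ray.data.block import DataBatch


def default_decode_fn(f: io.RawIOBase) -> Iterator[DataBatch]:
    # Read the whole file and emit it as a single-row binary batch.
    yield {"bytes": np.array([f.read()], dtype=object)}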

Comment on lines +499 to +501
num_partitions = math.ceil(
    in_memory_size_estimate / (1024 * 1024 * 1024)
)  # (Each task reads 1 GiB)
Contributor

medium

The number of partitions is calculated using a hardcoded 1 GiB block size. It's better to use DataContext.get_current().target_max_block_size to respect the configured block size for the dataset, making the behavior more consistent and configurable.

Suggested change
-num_partitions = math.ceil(
-    in_memory_size_estimate / (1024 * 1024 * 1024)
-)  # (Each task reads 1 GiB)
+target_max_block_size = DataContext.get_current().target_max_block_size or (
+    1024 * 1024 * 1024
+)
+num_partitions = math.ceil(
+    in_memory_size_estimate / target_max_block_size
+)

Member

For consistency with the Datasource implementation, agree target_max_block_size might be better here

Comment on lines +504 to +508
# print(f"num_rows: {len(paths)}")
# print(f"total_sampled_size: {total_sampled_size}")
# print(f"sampled_size: {sampled_size}")
# print(f"in_memory_size_estimate: {in_memory_size_estimate}")
# print(f"repartitioning to {num_partitions} partitions")
Contributor

medium

These commented-out print statements appear to be for debugging and should be removed from the final code.
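
If the sizing diagnostics are worth keeping, one hedged alternative is a module-level logger instead of commented-out prints (the variable names below are taken from the PR's surrounding code):

import logging

logger = logging.getLogger(__name__)

logger.debug(
    "Estimated in-memory size: %d bytes; repartitioning to %d partitions",
    in_memory_size_estimate,
    num_partitions,
)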

self,
uri_column: str,
*,
decode_fn: Optional[Callable[[io.RawIOBase], Iterator[DataBatch]]] = None,
Member

To avoid blocking this PR on discussion about decode_fn, I think it might be better to leave it out for now

Contributor

Think this is fine

Comment on lines +473 to +474
import math
import random
Member

Move imports to top of file?

Comment on lines +490 to +494
total_sampled_size = (
    ray.data.read_binary_files(sampled_paths)
    .map(lambda r: {"size_bytes": len(r["bytes"])})
    .sum("size_bytes")
)
Member

Can we use FileMetadataProvider to get the file sizes? It'll be much cheaper than actually reading the data

Member

And actually, if we're just listing all the file sizes with FileMetadataProvider, I think we can just take the mean.
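
A hedged sketch of that metadata-only estimate, using pyarrow's filesystem API directly to stat files rather than read them (FileMetadataProvider relies on the same kind of calls; the helper name is illustrative):

import pyarrow.fs


def mean_file_size(uris):
    # Stat each file for its size instead of downloading its contents.
    sizes = []
    for uri in uris:
        fs, path = pyarrow.fs.FileSystem.from_uri(uri)
        sizes.append(fs.get_file_info(path).size)
    return sum(sizes) / len(sizes)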

@ray-project deleted a comment from gemini-code-assist bot, Aug 13, 2025
Comment on lines +462 to +465
"""Load binary data from a column of URIs.

Args:
    uri_column: The name of the column containing the URIs.
Contributor

Please be sure to document the returned column type and the name of the new column.
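
For illustration, one way the docstring could spell that out; the "bytes" column name and binary type follow read_binary_files' convention and are assumptions, since the PR leaves them unspecified:

"""Load binary data from a column of URIs.

Args:
    uri_column: The name of the column containing the URIs.

Returns:
    A Dataset with an added "bytes" column holding the raw binary
    contents of each URI.
"""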

@omatthew98
Contributor Author

Going to go with a different approach using the expressions API.

@omatthew98 closed this on Aug 13, 2025