@kunwp1 kunwp1 commented May 31, 2025

This PR introduces a new attribute type, LARGE_BINARY, to enable efficient handling of large binary fields by storing them externally in S3 rather than embedding them directly within tuples. It also includes the supporting changes this type requires across the system, including storage setup, Iceberg schema handling, reference counting, and result export.


Motivation

Storing entire tuples in external storage adds unnecessary complexity and hinders direct access to small fields. Instead, we now store only individual large fields externally, which simplifies implementation and improves flexibility. Users can still access smaller fields directly in memory while large binary data is managed separately and efficiently.


Design Overview

The new large_binary attribute type is distinct from the existing binary type:

  • binary: Stores raw byte arrays directly in the tuple.
  • large_binary: Stores a URI reference to an external binary object.
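
To make the distinction concrete, here is a minimal sketch in Scala. The names are purely illustrative; the actual PR extends the AttributeType enum in workflow-core rather than defining new value classes:

```scala
// Illustrative model of the two storage strategies, not the actual API.
sealed trait BinaryField
final case class InlineBinary(bytes: Array[Byte]) extends BinaryField // binary: raw bytes inside the tuple
final case class ExternalBinary(s3Uri: String) extends BinaryField    // large_binary: URI of an S3 object
```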

Lifecycle of large_binary fields

  • Creation: Before emitting a tuple, the operator uploads the binary object to external S3 storage.
  • Transfer: Tuples remain lightweight by storing only the URI.
  • Read: Downstream operators use a utility API to resolve the URI and fetch the binary content.
  • Deletion: Reference counting is used to manage deletion. When the count reaches zero, the binary object is deleted from S3.
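
Roughly, the lifecycle looks like the sketch below. This is a hedged illustration: LargeBinaryStore and its methods are hypothetical names standing in for the PR's utility API and reference-count bookkeeping.

```scala
import java.net.URI

// Hypothetical interface standing in for the PR's utility API.
trait LargeBinaryStore {
  def upload(bytes: Array[Byte]): URI  // Creation: put the object in S3, return its URI
  def download(uri: URI): Array[Byte]  // Read: resolve a URI back to the bytes
  def incRef(uri: URI): Unit           // bump the reference count for the URI
  def decRef(uri: URI): Unit           // decrement; the S3 object is deleted at zero
}

// Producer side: upload before emitting, so the tuple carries only the URI string.
def emitLargeBinaryField(store: LargeBinaryStore, payload: Array[Byte]): String = {
  val uri = store.upload(payload)
  store.incRef(uri)
  uri.toString
}

// Consumer side: fetch the content on demand from the URI stored in the tuple.
def readLargeBinaryField(store: LargeBinaryStore, field: String): Array[Byte] =
  store.download(URI.create(field))
```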

Implementation Details

  • S3 buckets for storing large binary objects are created when the computing unit master is launched. The bucket name is defined in storage.conf.
  • The S3StorageClient class was moved to core/workflow-core to be accessible from the computing unit master.
  • The LARGE_BINARY type is:
    • Stored as a string in Iceberg.
    • Distinguished from regular string attributes by a magic prefix, "TEXERA_LARGE_BINARY:", prepended to the attribute name (see the encoding sketch after this list).
    • Guarded during schema propagation: a plain string attribute whose name carries this prefix fails schema propagation, preventing accidental misinterpretation of types.
  • A new PostgreSQL table tracks the reference count for each URI.
  • Transactions provide concurrency control, keeping reference counts consistent under concurrent insertions and deletions (see the transaction sketch after this list).
  • The result export logic for cell data has been updated to resolve URIs back to actual binary content.
  • File uploads to S3 use multipart upload for speed and reliability; a 3GB test file uploaded in ~16 seconds, though timing varies with network conditions (a multipart sketch follows this list).
  • The FileScan operator throws an error if a BINARY type is used with files larger than 2GB.
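
The Iceberg name-prefix scheme can be sketched as follows; the prefix string comes from this PR, while the object and method names are assumptions:

```scala
object LargeBinaryNameCodec {
  // Magic prefix from this PR; everything else here is illustrative.
  private val Prefix = "TEXERA_LARGE_BINARY:"

  // Applied when writing the schema to Iceberg: tag the attribute name.
  def encode(attributeName: String): String = Prefix + attributeName

  // Applied when reading back: a prefixed string field is really LARGE_BINARY.
  def isLargeBinary(fieldName: String): Boolean = fieldName.startsWith(Prefix)

  def decode(fieldName: String): String = fieldName.stripPrefix(Prefix)
}
```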
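The transactional reference counting could look roughly like the JDBC sketch below. The table and column names (large_binary_ref_count, uri, ref_count) are hypothetical and not necessarily what 08.sql creates:

```scala
import java.sql.Connection

// Atomically add a reference for a URI (insert the row or bump the count).
def addReference(conn: Connection, uri: String): Unit = {
  conn.setAutoCommit(false)
  try {
    val stmt = conn.prepareStatement(
      """INSERT INTO large_binary_ref_count (uri, ref_count) VALUES (?, 1)
        |ON CONFLICT (uri) DO UPDATE
        |SET ref_count = large_binary_ref_count.ref_count + 1""".stripMargin)
    stmt.setString(1, uri)
    stmt.executeUpdate()
    conn.commit()
  } catch { case e: Exception => conn.rollback(); throw e }
}

// Atomically drop a reference; returns true when the count hit zero,
// meaning the caller should delete the S3 object.
def removeReference(conn: Connection, uri: String): Boolean = {
  conn.setAutoCommit(false)
  try {
    val stmt = conn.prepareStatement(
      """UPDATE large_binary_ref_count
        |SET ref_count = ref_count - 1
        |WHERE uri = ? RETURNING ref_count""".stripMargin)
    stmt.setString(1, uri)
    val rs = stmt.executeQuery()
    val shouldDelete = rs.next() && rs.getInt("ref_count") <= 0
    conn.commit()
    shouldDelete
  } catch { case e: Exception => conn.rollback(); throw e }
}
```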
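Finally, a condensed sketch of multipart upload with the AWS SDK v2 for Java; bucket, key, and part size are placeholders, and the real S3StorageClient implementation may differ (e.g., streaming from disk rather than holding a byte array in memory):

```scala
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model._
import scala.jdk.CollectionConverters._

// Upload `data` in fixed-size parts; S3 requires parts of at least 5 MB (except the last).
def multipartUpload(s3: S3Client, bucket: String, key: String,
                    data: Array[Byte], partSize: Int = 8 * 1024 * 1024): Unit = {
  val uploadId = s3.createMultipartUpload(
    CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()).uploadId()

  val completedParts = data.grouped(partSize).zipWithIndex.map { case (part, idx) =>
    val partNumber = idx + 1 // part numbers are 1-based
    val response = s3.uploadPart(
      UploadPartRequest.builder()
        .bucket(bucket).key(key).uploadId(uploadId).partNumber(partNumber).build(),
      RequestBody.fromBytes(part))
    CompletedPart.builder().partNumber(partNumber).eTag(response.eTag()).build()
  }.toList

  s3.completeMultipartUpload(
    CompleteMultipartUploadRequest.builder()
      .bucket(bucket).key(key).uploadId(uploadId)
      .multipartUpload(CompletedMultipartUpload.builder().parts(completedParts.asJava).build())
      .build())
}
```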

Scope

This PR currently supports only the FileScan Java Native Operator.


Migration Notice

After merging this PR, you must run the following SQL script to apply the necessary schema changes for reference count tracking:

core/scripts/sql/updates/08.sql

This script creates the required PostgreSQL table for storing reference counts of large binary URIs. Failing to run this script will result in runtime errors when handling LARGE_BINARY attributes.


TODOs

  • Add support for UDF operators
  • Investigate and update other Java Native Operators as needed
[Three screenshots attached]

@kunwp1 kunwp1 requested a review from bobbai00 May 31, 2025 08:07
@kunwp1 kunwp1 self-assigned this May 31, 2025
@kunwp1 kunwp1 marked this pull request as ready for review May 31, 2025 08:57

@bobbai00 bobbai00 left a comment


Left some comments


kunwp1 commented Jun 6, 2025

@bobbai00 I've addressed your comments and also added support for concurrency control. Additionally, the UI has been updated to hide the URI from the user. The PR description has been updated accordingly. Here are the key highlights:

  • A new PostgreSQL table is used to track the reference count for each URI.
  • Transactions are employed to ensure concurrency control and maintain reference count consistency during concurrent insertions and deletions.

Please review the PR once more. Thanks.
