Add LARGE_BINARY Attribute Type for External Storage of Large Fields #3457
Left some comments
core/amber/src/main/scala/edu/uci/ics/texera/web/ComputingUnitMaster.scala
core/workflow-core/src/main/scala/edu/uci/ics/amber/core/tuple/AttributeType.java
core/amber/src/main/scala/edu/uci/ics/texera/web/service/ResultExportService.scala
core/workflow-core/src/main/scala/edu/uci/ics/amber/util/IcebergUtil.scala
...ow-operator/src/main/scala/edu/uci/ics/amber/operator/source/scan/FileScanSourceOpExec.scala
@bobbai00 I've addressed your comments and also added support for concurrency control. Additionally, the UI has been updated to hide the URI from the user. The PR description has been updated accordingly. Here are the key highlights:
Please review the PR once more. Thanks.
This PR introduces a new attribute type, LARGE_BINARY, to enable efficient handling of large binary fields by storing them externally in S3 rather than embedding them directly within tuples. It also includes various enhancements to support this new data type across the system.
Motivation
Storing entire tuples in external storage adds unnecessary complexity and hinders direct access to small fields. Instead, we now store only individual large fields externally, which simplifies implementation and improves flexibility. Users can still access smaller fields directly in memory while large binary data is managed separately and efficiently.
Design Overview
The new large_binary attribute type is distinct from the existing binary type:
binary: Stores raw byte arrays directly in the tuple.
large_binary: Stores a URI reference to an external binary object.
Lifecycle of large_binary fields
Implementation Details
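To make the binary versus large_binary distinction from the design overview concrete, here is a minimal Java sketch. It is not code from this PR: the class names, the bytesHeldInTuple method, and the example S3 URI are all hypothetical, chosen only to show that a binary field carries its payload inside the tuple while a large_binary field carries just a reference.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Illustrative only: these class names are hypothetical and not taken from the PR.

// binary: the raw bytes travel inside the tuple.
final class BinaryField {
    final byte[] bytes;

    BinaryField(byte[] bytes) {
        this.bytes = bytes;
    }

    // The tuple holds the whole payload.
    int bytesHeldInTuple() {
        return bytes.length;
    }
}

// large_binary: only a URI pointing at the external object travels inside the tuple.
final class LargeBinaryField {
    final URI uri;

    LargeBinaryField(URI uri) {
        this.uri = uri;
    }

    // The tuple holds only the encoded reference, regardless of the object's size.
    int bytesHeldInTuple() {
        return uri.toString().getBytes(StandardCharsets.UTF_8).length;
    }
}

final class LargeBinarySketch {
    public static void main(String[] args) {
        BinaryField inline = new BinaryField(new byte[] {1, 2, 3});
        // Hypothetical bucket and key, used only for illustration.
        LargeBinaryField external =
            new LargeBinaryField(URI.create("s3://example-bucket/objects/video-001"));

        System.out.println("binary field bytes in tuple: " + inline.bytesHeldInTuple());
        System.out.println("large_binary field bytes in tuple: " + external.bytesHeldInTuple());
    }
}
```

Under these assumptions, the footprint of a large_binary field inside the tuple depends only on the length of the reference, which is why smaller fields in the same tuple can stay directly accessible in memory.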
... storage.conf.
... S3StorageClient class was moved to core/workflow-core to be accessible from the computing unit master.
... LARGE_BINARY type is:
... "TEXERA_LARGE_BINARY:" added to the attribute name.
... "TEXERA_LARGE_BINARY:" prefix, it will fail schema propagation, preventing accidental misinterpretation of types.
... BINARY type is used with files larger than 2GB.
Scope
This PR currently supports only the FileScan Java Native Operator.
Migration Notice
After merging this PR, you must run the following SQL script to apply the necessary schema changes for reference count tracking:
This script creates the required PostgreSQL table for storing reference counts of large binary URIs. Failing to run this script will result in runtime errors when handling LARGE_BINARY attributes.
TODOs