Feature/s3 tables support #682
Open
NikhilSuthar wants to merge 19 commits into duckdb:master from NikhilSuthar:feature/s3-tables-support
Conversation

No description provided.
… support
- Implement PyIceberg upsert for incremental models on partitioned tables
- Support partition transforms: identity, day, month, year, hour (bucket and truncate unsupported by AWS S3 Tables)
- Add automatic PyIceberg path detection when partition_by is configured
- Add manual override with use_pyiceberg_writes=true flag
- Preserve fast DuckDB SQL path for non-partitioned tables
- Update documentation with partition transform examples and limitations
- Add pyiceberg_incremental_write() wrapper method to adapter
- Update incremental.sql with conditional PyIceberg logic
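To make the routing concrete, here is a minimal sketch of the path detection this commit describes. Only the partition_by and use_pyiceberg_writes config keys and the list of supported transforms come from the commit message; the helper name and the model_config dict shape are assumptions.

```python
# AWS S3 Tables accepts only these partition transforms per the commit;
# bucket and truncate are rejected.
SUPPORTED_TRANSFORMS = {"identity", "day", "month", "year", "hour"}


def should_use_pyiceberg(model_config: dict) -> bool:
    """Route writes through PyIceberg when needed, otherwise keep the
    fast DuckDB SQL path. Sketch only; config shape is illustrative."""
    if model_config.get("use_pyiceberg_writes") in (True, "true"):
        return True  # manual override always wins
    # Automatic detection: partitioned tables take the PyIceberg path,
    # since a plain SQL INSERT cannot apply Iceberg partition transforms.
    return bool(model_config.get("partition_by"))
```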
- Use cursor() method to get cursor from DuckDBConnectionWrapper
- Fix "'DuckDBConnectionWrapper' object has no attribute 'execute'" error
- Properly access cursor.execute().arrow() for PyIceberg upsert
- Call read_all() on RecordBatchReader to get PyArrow Table
- Fix "'RecordBatchReader' object has no attribute 'num_rows'" error
- Properly convert DuckDB arrow result to PyArrow Table for PyIceberg
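The two fixes above amount to a short two-step dance on the Python side. A sketch, assuming a connection_wrapper variable holding the DuckDBConnectionWrapper and an illustrative staging query:

```python
import pyarrow as pa

# The wrapper exposes cursor(), not execute().
cursor = connection_wrapper.cursor()
result = cursor.execute("SELECT * FROM staging_relation").arrow()

# Depending on the driver version, .arrow() can hand back a streaming
# RecordBatchReader, which has no num_rows; read_all() materializes it
# into a pyarrow.Table before it goes to PyIceberg.
if isinstance(result, pa.RecordBatchReader):
    result = result.read_all()

assert isinstance(result, pa.Table)
print(result.num_rows)  # now valid
```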
- Create case-insensitive mapping of PyArrow column names
- Match unique_key columns case-insensitively
- Fix 'CustomerID' not found when column is 'customerid' in PyArrow
- Provide better error message with available columns
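A sketch of the case-insensitive resolution; the helper name is hypothetical, but the behavior (lowercased lookup plus an error listing the available columns) is what the commit describes.

```python
import pyarrow as pa


def resolve_unique_keys(arrow_table: pa.Table, unique_key) -> list:
    """Map configured unique_key names onto the actual PyArrow column
    names, ignoring case."""
    keys = unique_key if isinstance(unique_key, list) else [unique_key]
    by_lower = {name.lower(): name for name in arrow_table.column_names}
    resolved = []
    for key in keys:
        actual = by_lower.get(key.lower())
        if actual is None:
            raise ValueError(
                f"unique_key column '{key}' not found in result set; "
                f"available columns: {arrow_table.column_names}"
            )
        resolved.append(actual)
    return resolved
```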
- Check if table has identifier_field_ids set
- Use update_schema().set_identifier_fields() to set them if missing
- Reload table after schema update
- Fix 'Join columns could not be found' error
- Support upsert on tables created by DuckDB without identifier fields
- Use case-insensitive search to find fields in Iceberg table schema
- Get actual schema column names (e.g., 'customerid' not 'CustomerID')
- Pass correct schema column names to set_identifier_fields()
- Fix 'Could not find field with name CustomerID, case_sensitive=True' error
- Handle mismatch between PyArrow column names and Iceberg schema names
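Combining the two commits above: check identifier_field_ids, resolve the configured key names against the Iceberg schema case-insensitively, then set the identifier fields and reload. set_identifier_fields() is the call the commits name; the helper itself and its error handling are illustrative.

```python
def ensure_identifier_fields(table, unique_key_columns):
    """Set identifier fields on an Iceberg table when they are missing,
    resolving names case-insensitively (the schema may say 'customerid'
    while the dbt config says 'CustomerID')."""
    schema = table.schema()
    if schema.identifier_field_ids:
        return table  # already configured, nothing to do

    by_lower = {field.name.lower(): field.name for field in schema.fields}
    actual_names = []
    for col in unique_key_columns:
        actual = by_lower.get(col.lower())
        if actual is None:
            raise ValueError(f"Field '{col}' not found in Iceberg schema")
        actual_names.append(actual)

    with table.update_schema() as update:
        update.set_identifier_fields(*actual_names)

    table.refresh()  # reload so the subsequent upsert sees the new schema
    return table
```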
…them
- PyIceberg requires identifier fields to be marked as required (NOT NULL)
- Added logic to check if unique_key columns are required in Iceberg schema
- Automatically call make_column_required() for non-required identifier fields
- Then set identifier fields using set_identifier_fields()
- Fixes validation error: 'Identifier field invalid: not a required field'
- Added detailed logging for troubleshooting
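make_column_required() here is presumably a local helper, not a public PyIceberg call; the closest public route I know of is update_column(..., required=True), which is an incompatible schema change in Iceberg and has to be opted into explicitly. A hedged sketch, reusing actual_names from the previous snippet:

```python
# Optional -> required is an incompatible change, hence the flag.
with table.update_schema(allow_incompatible_changes=True) as update:
    for name in actual_names:
        if not table.schema().find_field(name).required:
            update.update_column(name, required=True)
    update.set_identifier_fields(*actual_names)
```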
- Added setup_iceberg_identifier_fields() function to set identifier fields after table creation
- Added adapter method setup_s3_tables_identifier_fields() wrapper
- Updated table.sql to set identifier fields when unique_key is provided
- Updated incremental.sql to set identifier fields during full refresh
- Simplified pyiceberg_incremental_write() to just set identifier fields if missing
- Identifier fields are now set up front during table creation, not during upsert
- This matches PyIceberg's recommended pattern from documentation
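The adapter wrapper is what makes the Python helper callable from the Jinja materializations. A sketch of the pattern: the @available decorator is standard dbt adapter machinery, while the method body and _load_iceberg_table are hypothetical.

```python
from dbt.adapters.base.meta import available
from dbt.adapters.sql import SQLAdapter


class DuckDBAdapter(SQLAdapter):  # illustrative; the real class lives in dbt-duckdb
    @available
    def setup_s3_tables_identifier_fields(self, relation, unique_key):
        # The Jinja side (table.sql / incremental.sql) can then call:
        #   {% do adapter.setup_s3_tables_identifier_fields(target_relation, unique_key) %}
        table = self._load_iceberg_table(relation)  # hypothetical helper
        ensure_identifier_fields(table, unique_key)
```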
… tables
- Remove setup_iceberg_identifier_fields() function and all calls to it
- Implement pyiceberg_incremental_write() using DELETE + INSERT instead of upsert
- DELETE uses PyIceberg expressions (In for single key, EqualTo/And for composite keys)
- INSERT uses PyIceberg append() method
- No longer requires identifier fields or NOT NULL constraints
- Works with tables created via DuckDB's CREATE TABLE AS SELECT
- Handles both single and composite unique keys
- Returns rows_deleted and rows_inserted counts
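A sketch of the DELETE + INSERT strategy with the expression shapes the commit describes: In for a single key, an EqualTo/And conjunction per row for composite keys. table.delete() and table.append() are PyIceberg Table methods in recent releases; the function name matches the commit, the body is illustrative.

```python
from pyiceberg.expressions import And, EqualTo, In, Or


def pyiceberg_incremental_write(table, arrow_table, unique_keys):
    """Delete rows matching the incoming batch's keys, then append the
    batch. Needs no identifier fields or NOT NULL constraints."""
    if len(unique_keys) == 1:
        key = unique_keys[0]
        delete_filter = In(key, set(arrow_table.column(key).to_pylist()))
    else:
        # Composite key: one EqualTo/And conjunction per incoming row,
        # OR-ed together. Fine for modest batch sizes.
        row_filters = []
        for row in arrow_table.select(unique_keys).to_pylist():
            conj = EqualTo(unique_keys[0], row[unique_keys[0]])
            for key in unique_keys[1:]:
                conj = And(conj, EqualTo(key, row[key]))
            row_filters.append(conj)
        delete_filter = row_filters[0] if len(row_filters) == 1 else Or(*row_filters)

    table.delete(delete_filter)
    table.append(arrow_table)
    return {"rows_inserted": arrow_table.num_rows}
```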
- Align PyArrow table schema with Iceberg schema before append
- Reorder columns to match Iceberg schema order
- Rename columns to match Iceberg schema names (case-sensitive)
- Handle missing columns by adding NULL arrays
- Fixes error: 'PyArrow table contains more columns... Update the schema first'
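A sketch of the _align_arrow_schema() helper the later refactor extracts, assuming PyIceberg's Schema.as_arrow() is available to obtain the target arrow schema:

```python
import pyarrow as pa


def _align_arrow_schema(arrow_table: pa.Table, iceberg_schema) -> pa.Table:
    """Reorder and rename PyArrow columns to match the Iceberg schema,
    padding columns missing from the batch with NULLs."""
    target = iceberg_schema.as_arrow()  # pyiceberg Schema -> pyarrow schema
    by_lower = {name.lower(): name for name in arrow_table.column_names}

    arrays = []
    for field in target:
        source = by_lower.get(field.name.lower())
        if source is not None:
            arrays.append(arrow_table.column(source).cast(field.type))
        else:
            # Column absent from the incoming batch: fill with NULLs.
            arrays.append(pa.nulls(arrow_table.num_rows, type=field.type))
    # from_arrays applies the target names and order, i.e. the rename+reorder.
    return pa.Table.from_arrays(arrays, schema=target)
```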
…output
1. Transaction Safety:
   - Wrap DELETE + INSERT in PyIceberg transaction for atomicity
   - If INSERT fails, DELETE is automatically rolled back
   - Ensures data consistency
2. Row Count Accuracy:
   - Estimate deleted rows based on unique key count
   - PyIceberg delete() doesn't return count, so we estimate
   - Shows 'X deleted, Y inserted' in logs
3. Better Compiled SQL Output:
   - Replace dummy query with meaningful SQL representation
   - Shows what operation was performed (as SQL comments)
   - Includes actual row counts in comments
   - Helps with debugging and documentation
4. Code Organization:
   - Extract schema alignment to _align_arrow_schema() helper function
   - Cleaner transaction handling with context manager
   - Better error messages with rollback notification
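The transaction and the estimated counts, sketched together. Transaction-level delete/append availability depends on the PyIceberg version; delete_filter and the aligned table come from the earlier sketches, and the comment-style compiled output is illustrative.

```python
# Atomic DELETE + INSERT: either both commit or neither does, so a failed
# append cannot leave the table with its old rows already deleted.
with table.transaction() as tx:
    tx.delete(delete_filter)
    tx.append(aligned)  # output of _align_arrow_schema() above

# delete() reports no row count, so estimate from the distinct keys in
# the batch and surface both numbers as SQL comments in compiled output.
rows_deleted = len(set(arrow_table.column(unique_keys[0]).to_pylist()))
compiled_sql = (
    f"-- pyiceberg incremental write on {table.name()}\n"
    f"-- ~{rows_deleted} rows deleted, {arrow_table.num_rows} rows inserted"
)
```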
- Added comprehensive configuration parameter table for S3 Tables
- Documented parameters for both table and incremental materializations
- Included partition transform reference (identity, day, month, year, hour)
- Added 4 practical examples covering common use cases
- Clarified mandatory vs optional parameters