@@ -7,7 +7,7 @@ parent: Developer Overview
77
88# Substrait Proto Diff Analysis
99
10- ** Date:** 2025-11-05
10+ ** Date:** 2025-11-05 (Updated: 2025-12-12)
1111** Gluten Base Version:** Substrait v0.23.0 (with custom modifications)
1212** Official Latest Version:** v0.77.0
1313** Version Gap:** 54 releases behind
@@ -16,9 +16,17 @@ parent: Developer Overview
1616
1717This document provides a detailed, line-by-line analysis of the differences between Gluten's forked Substrait proto files and the official Substrait v0.23.0 release. This serves as the baseline for any unfork effort.
1818
19+ ### Migration Progress
20+
21+ ** Completed Migrations:**
22+ - ✅ ParquetReadOptions.enable_row_group_maxmin_index → Removed (PR #11277 )
23+ - ✅ RelRoot.output_schema → Replaced with ProjectRel (PR #11278 )
24+
25+ ** Remaining Custom Modifications:** ~ 200 lines
26+
1927### Files Modified
2028
21- - ✅ ** algebra.proto** : ** 222 lines of diff** - Significant custom modifications
29+ - ✅ ** algebra.proto** : ** ~ 200 lines of diff remaining ** (down from 222)
2230- ✅ ** type.proto** : ** 40 lines of diff** - Medium custom modifications
2331- ➕ ** ddl.proto** : ** NEW FILE** - Not in official v0.23.0
2432- ✅ ** plan.proto** : No changes (identical to official)
@@ -34,7 +42,7 @@ This document provides a detailed, line-by-line analysis of the differences betw
3442
3543### 1. algebra.proto Modifications
3644
37- #### 1.1 Enhanced ParquetReadOptions (Lines 121-134)
45+ #### 1.1 ~~ Enhanced ParquetReadOptions~~ ✅ COMPLETED
3846``` diff
3947+ message ParquetReadOptions {
4048+ bool enable_row_group_maxmin_index = 1;
@@ -43,6 +51,7 @@ This document provides a detailed, line-by-line analysis of the differences betw
4351** Purpose:** Enable row group min/max index for better Parquet filtering
4452** Impact:** Low - Optional optimization field
4553** Upstream Status:** Not in v0.77.0
54+ ** Migration Status:** ✅ ** REMOVED in PR #11277 ** - Field was unused and has been deleted
4655
4756#### 1.2 Added TextReadOptions (Lines 138-148)
4857``` diff
@@ -145,7 +154,7 @@ This document provides a detailed, line-by-line analysis of the differences betw
145154** Impact:** MEDIUM - Required for advanced GROUP BY operations
146155** Upstream Status:** ✅ ** UPSTREAMED to v0.77.0!** This should work directly.
147156
148- #### 1.8 Added output_schema in RelRoot (Lines 418-423)
157+ #### 1.8 ~~ Added output_schema in RelRoot~~ ✅ COMPLETED
149158``` diff
150159 message RelRoot {
151160 Rel input = 1;
@@ -156,6 +165,7 @@ This document provides a detailed, line-by-line analysis of the differences betw
156165** Purpose:** Provide complete type schema at root, not just field names
157166** Impact:** MEDIUM - Useful for schema validation
158167** Upstream Status:** Not in v0.77.0
168+ ** Migration Status:** ✅ ** REPLACED in PR #11278 ** - Now uses explicit ProjectRel for type enforcement instead of implicit output_schema field
159169
160170#### 1.9 Added GenerateRel (Lines 1252+)
161171``` diff
@@ -274,62 +284,135 @@ message Dll {
274284
275285---
276286
277- ## Migration Priority Matrix
287+ ## Migration Progress Tracker
278288
279- ### Critical (Must Address)
280- - ** TextReadOptions/JsonReadOptions** → Use DelimiterSeparatedTextReadOptions or AdvancedExtension
281- - ** column_types in NamedStruct** → Strong candidate for upstreaming to Substrait
282- - ** WindowRel** → Evaluate ConsistentPartitionWindowRel or use AdvancedExtension
283- - ** GenerateRel** → Propose upstreaming or use AdvancedExtension
284- - ** partition_columns in FileOrFiles** → Propose upstreaming or use AdvancedExtension
289+ ### ✅ Completed (2 items)
290+ - ** ParquetReadOptions.enable_row_group_maxmin_index** → REMOVED (PR #11277 )
291+ - ** output_schema in RelRoot** → REPLACED with ProjectRel (PR #11278 )
285292
286- ### Medium (Should Address)
287- - ** output_schema in RelRoot** → May not be needed, or use AdvancedExtension
288- - ** Nothing type** → Evaluate if actually used, may not be needed
289- - ** WindowFunction modifications** → Review if needed in modern Substrait
293+ ### 🔄 Next Steps - Updated Priority Matrix
290294
291- ### Low (Nice to Have)
292- - ** enable_row_group_maxmin_index** → Optional optimization
293- - ** window metadata fields** → Optional metadata
294- - ** ddl.proto** → May be deprecated in favor of WriteRel
295+ ### Critical (Must Address) - 5 items
296+ 1 . ** TextReadOptions/JsonReadOptions** → Use DelimiterSeparatedTextReadOptions or AdvancedExtension
297+ 2 . ** WindowRel** → Evaluate ConsistentPartitionWindowRel or use AdvancedExtension
298+ 3 . ** GenerateRel** → Propose upstreaming or use AdvancedExtension
299+ 4 . ** partition_columns in FileOrFiles** → Propose upstreaming or use AdvancedExtension
300+ 5 . ** column_types in NamedStruct** → Strong candidate for upstreaming to Substrait
295301
296- ---
302+ ### Medium (Should Address) - 4 items
303+ 1 . ** schema field in FileOrFiles** → May be redundant with ReadRel.base_schema
304+ 2 . ** WindowFunction modifications (window_type, column_name, Unbounded)** → Used by ClickHouse backend
305+ 3 . ** Nothing type** → Used in multiple places, needs investigation
306+ 4 . ** ddl.proto** → Used for INSERT operations
297307
298- ## Recommended First Step
308+ ### Low (Nice to Have) - 1 item
309+ 1 . ** names in Struct** → Already upstreamed to v0.77.0, get free by upgrading
299310
300- The ** easiest and most impactful first step ** is:
311+ ---
301312
302- ### ✅ Document Current Usage
313+ ## Updated Recommendations (Post PR # 11277 , # 11278 )
303314
304- Create an inventory of which Gluten code actually uses each custom field:
315+ ### 🎯 Best Next Steps (Ranked by Effort/Impact)
305316
306- ``` bash
307- # Search for WindowRel usage
308- grep -r " WindowRel " --include= " *.scala " --include= " *.java "
317+ #### Option 1: Upgrade to Substrait v0.77.0 (HIGHEST IMPACT)
318+ ** Effort: ** 6-8 hours
319+ ** Impact: ** Eliminates 2 modifications for free
309320
310- # Search for GenerateRel usage
311- grep -r " GenerateRel" --include=" *.scala" --include=" *.java"
321+ ** What you get:**
322+ - ✅ ExpandRel (already upstreamed)
323+ - ✅ names in Struct (already upstreamed)
324+ - ✅ Better foundation for future migrations
325+ - ✅ Reduces diff from ~ 200 lines to ~ 170 lines
312326
313- # Search for TextReadOptions usage
314- grep -r " TextReadOptions" --include=" *.scala" --include=" *.java"
315- ```
327+ ** Recommendation:** Do this first to maximize compatibility
316328
317- This will:
318- 1 . Be completely non-invasive (no code changes)
319- 2 . Show which features are actually critical vs unused
320- 3 . Help prioritize the migration work
321- 4 . Take only 1-2 hours
322- 5 . Provide concrete data for planning
329+ ---
330+
331+ #### Option 2: Tackle Individual Modifications (INCREMENTAL)
332+
333+ Listed in order of easiest → hardest:
334+
335+ ** 1. JOIN_TYPE enum verification (30 min)**
336+ - Verify if LEFT_SEMI/RIGHT_SEMI changes are actually custom or already in v0.23.0
337+ - May not need any migration
338+
339+ ** 2. column_types in NamedStruct (2-3 hours)**
340+ - Single enum + field
341+ - Clear purpose (partition vs data columns)
342+ - Good candidate for upstreaming to Substrait
343+ - Can use AdvancedExtension as interim
344+
345+ ** 3. WindowFunction metadata fields (2-3 hours)**
346+ - ` window_type ` , ` column_name ` fields
347+ - ` Unbounded_Preceding/Following ` split
348+ - Isolated to window function handling
349+ - Can use AdvancedExtension
350+
351+ ** 4. Nothing type (3-4 hours)**
352+ - Used in multiple places but not pervasive
353+ - Investigation needed on whether it's truly required
354+ - May be able to use existing nullable semantics
355+
356+ ** 5. schema field in FileOrFiles (3-4 hours)**
357+ - Currently used by ExcelTextFormatFile
358+ - May be redundant with ReadRel.base_schema
359+ - Needs careful analysis
360+
361+ ** 6. ddl.proto (4-6 hours)**
362+ - Entire custom file
363+ - Used for INSERT operations
364+ - May be replaceable with WriteRel
365+ - Requires coordination across backends
366+
367+ ** 7. partition_columns in FileOrFiles (4-6 hours)**
368+ - Critical for Hive-style partitioning
369+ - Strong upstreaming candidate
370+ - Complex due to wide usage
371+
372+ ** 8. TextReadOptions/JsonReadOptions (6-8 hours)**
373+ - High complexity, widely used
374+ - May align with DelimiterSeparatedTextReadOptions
375+ - Critical for CSV/JSON file support
376+
377+ ** 9. WindowRel (8-12 hours)**
378+ - Large custom message
379+ - Check if ConsistentPartitionWindowRel can replace it
380+ - High complexity migration
381+
382+ ** 10. GenerateRel (8-12 hours)**
383+ - Large custom message
384+ - Table-generating functions (EXPLODE, etc.)
385+ - High complexity, propose upstreaming
323386
324387---
325388
326- ## Next Steps
389+ ### 🚀 Recommended Path Forward
390+
391+ ** Phase 1: Quick Wins (8-10 hours)**
392+ 1 . Upgrade to v0.77.0
393+ 2 . Verify JOIN_TYPE changes
394+ 3 . Migrate column_types to AdvancedExtension
395+
396+ ** Phase 2: Medium Effort (12-16 hours)**
397+ 4 . Migrate WindowFunction metadata
398+ 5 . Investigate Nothing type
399+ 6 . Analyze schema field redundancy
400+
401+ ** Phase 3: Major Migrations (20-30 hours)**
402+ 7 . Propose upstreaming: column_types, partition_columns, GenerateRel
403+ 8 . Migrate file format options (Text/Json)
404+ 9 . Evaluate WindowRel vs ConsistentPartitionWindowRel
405+ 10 . Handle ddl.proto
406+
407+ ---
327408
328- After documenting current usage:
409+ ### Progress Metrics
329410
330- 1 . ** Short term:** Update SubstraitModifications.md with this detailed analysis
331- 2 . ** Medium term:** Propose upstreaming high-value features to Substrait community
332- 3 . ** Long term:** Migrate to official Substrait with AdvancedExtension for non-upstreamed features
411+ ** Original Diff:** 262 lines
412+ ** After PR #11277 , #11278 :** ~ 200 lines
413+ ** After v0.77.0 upgrade:** ~ 170 lines (estimated)
414+ ** After Phase 1:** ~ 150 lines (estimated)
415+ ** Target:** < 100 lines or all in AdvancedExtension
333416
334417---
335418
0 commit comments