Skip to content

Commit 836199f

Browse files
committed
[DOCS] Update Substrait unfork status and provide next steps
Updates documentation to reflect completed PRs: - PR apache#11277: Remove unused enable_row_group_maxmin_index - PR apache#11278: Replace output_schema with ProjectRel Changes: 1. SubstraitDiffAnalysis.md: - Mark completed migrations with ✅ - Update diff count: 262 → ~200 lines - Reorganize priority matrix into completed/pending - Add updated recommendations post-PR completion - Add progress metrics tracking 2. SubstraitUnfork-NextSteps.md (NEW): - Actionable next steps ranked by effort/impact - Recommended path: Upgrade to v0.77.0 first - Incremental alternatives with time estimates - 10 specific tasks with step-by-step guidance - Decision framework for upgrade vs incremental - Progress tracker table - Success criteria checklist Next recommended actions: 1. Verify JOIN_TYPE changes (30 min quick win) 2. Upgrade to v0.77.0 for free wins (6-8 hours) 3. Migrate column_types to AdvancedExtension (2-3 hours) Estimated remaining effort: 40-60 hours for complete unfork Target: <100 line diff or all modifications in AdvancedExtension
1 parent 3574f1b commit 836199f

File tree

2 files changed

+413
-43
lines changed

2 files changed

+413
-43
lines changed

docs/developers/SubstraitDiffAnalysis.md

Lines changed: 126 additions & 43 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ parent: Developer Overview
77

88
# Substrait Proto Diff Analysis
99

10-
**Date:** 2025-11-05
10+
**Date:** 2025-11-05 (Updated: 2025-12-12)
1111
**Gluten Base Version:** Substrait v0.23.0 (with custom modifications)
1212
**Official Latest Version:** v0.77.0
1313
**Version Gap:** 54 releases behind
@@ -16,9 +16,17 @@ parent: Developer Overview
1616

1717
This document provides a detailed, line-by-line analysis of the differences between Gluten's forked Substrait proto files and the official Substrait v0.23.0 release. This serves as the baseline for any unfork effort.
1818

19+
### Migration Progress
20+
21+
**Completed Migrations:**
22+
- ✅ ParquetReadOptions.enable_row_group_maxmin_index → Removed (PR #11277)
23+
- ✅ RelRoot.output_schema → Replaced with ProjectRel (PR #11278)
24+
25+
**Remaining Custom Modifications:** ~200 lines
26+
1927
### Files Modified
2028

21-
-**algebra.proto**: **222 lines of diff** - Significant custom modifications
29+
-**algebra.proto**: **~200 lines of diff remaining** (down from 222)
2230
-**type.proto**: **40 lines of diff** - Medium custom modifications
2331
-**ddl.proto**: **NEW FILE** - Not in official v0.23.0
2432
-**plan.proto**: No changes (identical to official)
@@ -34,7 +42,7 @@ This document provides a detailed, line-by-line analysis of the differences betw
3442

3543
### 1. algebra.proto Modifications
3644

37-
#### 1.1 Enhanced ParquetReadOptions (Lines 121-134)
45+
#### 1.1 ~~Enhanced ParquetReadOptions~~ ✅ COMPLETED
3846
```diff
3947
+ message ParquetReadOptions {
4048
+ bool enable_row_group_maxmin_index = 1;
@@ -43,6 +51,7 @@ This document provides a detailed, line-by-line analysis of the differences betw
4351
**Purpose:** Enable row group min/max index for better Parquet filtering
4452
**Impact:** Low - Optional optimization field
4553
**Upstream Status:** Not in v0.77.0
54+
**Migration Status:****REMOVED in PR #11277** - Field was unused and has been deleted
4655

4756
#### 1.2 Added TextReadOptions (Lines 138-148)
4857
```diff
@@ -145,7 +154,7 @@ This document provides a detailed, line-by-line analysis of the differences betw
145154
**Impact:** MEDIUM - Required for advanced GROUP BY operations
146155
**Upstream Status:****UPSTREAMED to v0.77.0!** This should work directly.
147156

148-
#### 1.8 Added output_schema in RelRoot (Lines 418-423)
157+
#### 1.8 ~~Added output_schema in RelRoot~~ ✅ COMPLETED
149158
```diff
150159
message RelRoot {
151160
Rel input = 1;
@@ -156,6 +165,7 @@ This document provides a detailed, line-by-line analysis of the differences betw
156165
**Purpose:** Provide complete type schema at root, not just field names
157166
**Impact:** MEDIUM - Useful for schema validation
158167
**Upstream Status:** Not in v0.77.0
168+
**Migration Status:****REPLACED in PR #11278** - Now uses explicit ProjectRel for type enforcement instead of implicit output_schema field
159169

160170
#### 1.9 Added GenerateRel (Lines 1252+)
161171
```diff
@@ -274,62 +284,135 @@ message Dll {
274284

275285
---
276286

277-
## Migration Priority Matrix
287+
## Migration Progress Tracker
278288

279-
### Critical (Must Address)
280-
- **TextReadOptions/JsonReadOptions** → Use DelimiterSeparatedTextReadOptions or AdvancedExtension
281-
- **column_types in NamedStruct** → Strong candidate for upstreaming to Substrait
282-
- **WindowRel** → Evaluate ConsistentPartitionWindowRel or use AdvancedExtension
283-
- **GenerateRel** → Propose upstreaming or use AdvancedExtension
284-
- **partition_columns in FileOrFiles** → Propose upstreaming or use AdvancedExtension
289+
### ✅ Completed (2 items)
290+
- **ParquetReadOptions.enable_row_group_maxmin_index** → REMOVED (PR #11277)
291+
- **output_schema in RelRoot** → REPLACED with ProjectRel (PR #11278)
285292

286-
### Medium (Should Address)
287-
- **output_schema in RelRoot** → May not be needed, or use AdvancedExtension
288-
- **Nothing type** → Evaluate if actually used, may not be needed
289-
- **WindowFunction modifications** → Review if needed in modern Substrait
293+
### 🔄 Next Steps - Updated Priority Matrix
290294

291-
### Low (Nice to Have)
292-
- **enable_row_group_maxmin_index** → Optional optimization
293-
- **window metadata fields** → Optional metadata
294-
- **ddl.proto** → May be deprecated in favor of WriteRel
295+
### Critical (Must Address) - 5 items
296+
1. **TextReadOptions/JsonReadOptions** → Use DelimiterSeparatedTextReadOptions or AdvancedExtension
297+
2. **WindowRel** → Evaluate ConsistentPartitionWindowRel or use AdvancedExtension
298+
3. **GenerateRel** → Propose upstreaming or use AdvancedExtension
299+
4. **partition_columns in FileOrFiles** → Propose upstreaming or use AdvancedExtension
300+
5. **column_types in NamedStruct** → Strong candidate for upstreaming to Substrait
295301

296-
---
302+
### Medium (Should Address) - 4 items
303+
1. **schema field in FileOrFiles** → May be redundant with ReadRel.base_schema
304+
2. **WindowFunction modifications (window_type, column_name, Unbounded)** → Used by ClickHouse backend
305+
3. **Nothing type** → Used in multiple places, needs investigation
306+
4. **ddl.proto** → Used for INSERT operations
297307

298-
## Recommended First Step
308+
### Low (Nice to Have) - 1 item
309+
1. **names in Struct** → Already upstreamed to v0.77.0, get free by upgrading
299310

300-
The **easiest and most impactful first step** is:
311+
---
301312

302-
### ✅ Document Current Usage
313+
## Updated Recommendations (Post PR #11277, #11278)
303314

304-
Create an inventory of which Gluten code actually uses each custom field:
315+
### 🎯 Best Next Steps (Ranked by Effort/Impact)
305316

306-
```bash
307-
# Search for WindowRel usage
308-
grep -r "WindowRel" --include="*.scala" --include="*.java"
317+
#### Option 1: Upgrade to Substrait v0.77.0 (HIGHEST IMPACT)
318+
**Effort:** 6-8 hours
319+
**Impact:** Eliminates 2 modifications for free
309320

310-
# Search for GenerateRel usage
311-
grep -r "GenerateRel" --include="*.scala" --include="*.java"
321+
**What you get:**
322+
- ✅ ExpandRel (already upstreamed)
323+
- ✅ names in Struct (already upstreamed)
324+
- ✅ Better foundation for future migrations
325+
- ✅ Reduces diff from ~200 lines to ~170 lines
312326

313-
# Search for TextReadOptions usage
314-
grep -r "TextReadOptions" --include="*.scala" --include="*.java"
315-
```
327+
**Recommendation:** Do this first to maximize compatibility
316328

317-
This will:
318-
1. Be completely non-invasive (no code changes)
319-
2. Show which features are actually critical vs unused
320-
3. Help prioritize the migration work
321-
4. Take only 1-2 hours
322-
5. Provide concrete data for planning
329+
---
330+
331+
#### Option 2: Tackle Individual Modifications (INCREMENTAL)
332+
333+
Listed in order of easiest → hardest:
334+
335+
**1. JOIN_TYPE enum verification (30 min)**
336+
- Verify if LEFT_SEMI/RIGHT_SEMI changes are actually custom or already in v0.23.0
337+
- May not need any migration
338+
339+
**2. column_types in NamedStruct (2-3 hours)**
340+
- Single enum + field
341+
- Clear purpose (partition vs data columns)
342+
- Good candidate for upstreaming to Substrait
343+
- Can use AdvancedExtension as interim
344+
345+
**3. WindowFunction metadata fields (2-3 hours)**
346+
- `window_type`, `column_name` fields
347+
- `Unbounded_Preceding/Following` split
348+
- Isolated to window function handling
349+
- Can use AdvancedExtension
350+
351+
**4. Nothing type (3-4 hours)**
352+
- Used in multiple places but not pervasive
353+
- Investigation needed on whether it's truly required
354+
- May be able to use existing nullable semantics
355+
356+
**5. schema field in FileOrFiles (3-4 hours)**
357+
- Currently used by ExcelTextFormatFile
358+
- May be redundant with ReadRel.base_schema
359+
- Needs careful analysis
360+
361+
**6. ddl.proto (4-6 hours)**
362+
- Entire custom file
363+
- Used for INSERT operations
364+
- May be replaceable with WriteRel
365+
- Requires coordination across backends
366+
367+
**7. partition_columns in FileOrFiles (4-6 hours)**
368+
- Critical for Hive-style partitioning
369+
- Strong upstreaming candidate
370+
- Complex due to wide usage
371+
372+
**8. TextReadOptions/JsonReadOptions (6-8 hours)**
373+
- High complexity, widely used
374+
- May align with DelimiterSeparatedTextReadOptions
375+
- Critical for CSV/JSON file support
376+
377+
**9. WindowRel (8-12 hours)**
378+
- Large custom message
379+
- Check if ConsistentPartitionWindowRel can replace it
380+
- High complexity migration
381+
382+
**10. GenerateRel (8-12 hours)**
383+
- Large custom message
384+
- Table-generating functions (EXPLODE, etc.)
385+
- High complexity, propose upstreaming
323386

324387
---
325388

326-
## Next Steps
389+
### 🚀 Recommended Path Forward
390+
391+
**Phase 1: Quick Wins (8-10 hours)**
392+
1. Upgrade to v0.77.0
393+
2. Verify JOIN_TYPE changes
394+
3. Migrate column_types to AdvancedExtension
395+
396+
**Phase 2: Medium Effort (12-16 hours)**
397+
4. Migrate WindowFunction metadata
398+
5. Investigate Nothing type
399+
6. Analyze schema field redundancy
400+
401+
**Phase 3: Major Migrations (20-30 hours)**
402+
7. Propose upstreaming: column_types, partition_columns, GenerateRel
403+
8. Migrate file format options (Text/Json)
404+
9. Evaluate WindowRel vs ConsistentPartitionWindowRel
405+
10. Handle ddl.proto
406+
407+
---
327408

328-
After documenting current usage:
409+
### Progress Metrics
329410

330-
1. **Short term:** Update SubstraitModifications.md with this detailed analysis
331-
2. **Medium term:** Propose upstreaming high-value features to Substrait community
332-
3. **Long term:** Migrate to official Substrait with AdvancedExtension for non-upstreamed features
411+
**Original Diff:** 262 lines
412+
**After PR #11277, #11278:** ~200 lines
413+
**After v0.77.0 upgrade:** ~170 lines (estimated)
414+
**After Phase 1:** ~150 lines (estimated)
415+
**Target:** < 100 lines or all in AdvancedExtension
333416

334417
---
335418

0 commit comments

Comments
 (0)