
Support from_protobuf expression #14354

Draft

thirtiseven wants to merge 54 commits into NVIDIA:main from thirtiseven:from_protobuf_nested

Conversation


@thirtiseven commented Mar 3, 2026

Fixes #14069.
Depends on NVIDIA/spark-rapids-jni#4107

The code is ready, but the diff is long. Feel free to review it directly, or start with the first part of it in #14419.

Description (⚠️ from AI)

This PR adds GPU acceleration for Spark's from_protobuf() expression by replacing ProtobufDataToCatalyst with GpuFromProtobuf at query planning time. It bridges the Spark SQL catalyst layer to the CUDA protobuf decode kernels (provided by the companion JNI PR in spark-rapids-jni) through a clean three-layer architecture: reflection-based compatibility, typed metadata extraction/validation, and GPU expression execution.

The implementation spans ~7,800 lines of new Scala/Python/Shell code across 27 files, including ~2,900 lines of Scala plugin code, ~4,400 lines of Python integration tests and data generators, and ~450 lines of shell/proto infrastructure.

Key capabilities

  • Full from_protobuf() replacement: Transparent GPU override of ProtobufDataToCatalyst — no user-facing API changes
  • All scalar protobuf types: int32, int64, uint32, uint64, sint32/sint64 (zigzag), fixed32/sfixed32/fixed64/sfixed64, float, double, bool, string, bytes
  • Nested messages: Up to 10 levels deep, with recursive schema flattening
  • Repeated fields: Both packed and unpacked encoding, repeated scalars and repeated messages (ArrayType(StructType))
  • Enum-as-string: Configurable via enums.as.ints option — integer mode or validated string name mode
  • Default values: Per-field defaults for all scalar types, strings, bytes, and enum (preserving both numeric and display name)
  • Required field validation: Proto2-style required field checks
  • PERMISSIVE / FAILFAST modes: Configurable error handling via mode option
  • Schema projection: Two-level pruning — top-level field pruning + nested child pruning — reduces GPU decode work to only fields referenced by downstream operators
  • Post-project batch coalesce: Optional post-decode coalesce to avoid small-batch overhead from schema projection
  • Spark 3.4+ / 3.5+ compatibility: Reflection-based compat layer handles path-based and bytes-based descriptor APIs across Spark versions
  • Proto2 only: Proto3 and editions syntax are explicitly rejected with CPU fallback
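Two of the wire encodings listed above (varints, and zigzag for sint32/sint64) are easy to misread, so here is a minimal Python reference sketch of both — illustrative only, not the PR's CUDA/Scala implementation:

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def zigzag32(value: int) -> int:
    """sint32 zigzag: interleaves signed values so small magnitudes encode
    short (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...)."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF

# sint32 -1 encodes as zigzag 1, a single wire byte; int32 300 needs two bytes
assert encode_varint(zigzag32(-1)) == b"\x01"
assert encode_varint(300) == b"\xac\x02"
```

Packed repeated scalar fields are these per-element encodings concatenated inside one length-delimited field, which is why packed and unpacked data can share the same per-element decode logic.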

Architecture

Spark from_protobuf (ProtobufDataToCatalyst)
    │
    │ GpuOverrides rule registration (Spark340PlusNonDBShims)
    ▼
ProtobufExprShims.fromProtobufRule (tagExprForGpu → convertToGpu)
    │
    │ reflection
    ▼
SparkProtobufCompat (extractExprInfo, resolveMessageDescriptor, parsePlannerOptions)
    │
    │ typed metadata
    ▼
ProtobufSchemaExtractor.analyzeAllFields → Map[String, ProtobufFieldInfo]
    │
    │ schema projection analysis
    ▼
analyzeRequiredFields → nestedFieldRequirements → prunedFieldsMap
    │
    │ flatten + validate
    ▼
ProtobufSchemaValidator (toFlattenedFieldDescriptor → validateFlattenedSchema → toFlattenedSchemaArrays)
    │
    │ parallel arrays
    ▼
GpuFromProtobuf(decodedSchema, flatArrays..., child)
    │
    │ JNI call
    ▼
Protobuf.decodeToStruct → CUDA decode → cuDF STRUCT column

File structure

New files (core)

File Lines Description
GpuFromProtobuf.scala 203 GPU expression: JNI call, null propagation, type mapping, value equality
ProtobufExprShims.scala 845 GpuOverrides rule, schema projection analysis, flatten orchestration, convertToGpu
SparkProtobufCompat.scala 371 Reflection layer: extract expr info, resolve descriptors, parse options, Spark 3.4/3.5 compat
ProtobufSchemaModel.scala 182 Typed metadata: ProtobufExprInfo, ProtobufFieldInfo, ProtobufDefaultValue, FlattenedFieldDescriptor, etc.
ProtobufSchemaExtractor.scala 236 Field analysis: type/encoding mapping, support checks, wire type resolution
ProtobufSchemaValidator.scala 183 Flatten-time validation: enum metadata, defaults, parent-child consistency, JNI array construction

New files (tests)

File Lines Description
protobuf_test.py 3,989 57 Python integration tests: GPU vs CPU correctness for all protobuf features
ProtobufExprShimsSuite.scala 616 22 Scala unit tests: compat layer, extractor, validator, ordinal remapping, semantic equality
ProtobufBatchMergeSuite.scala 115 4 Scala unit tests: post-project coalesce detection and config
data_gen.py (additions) ~385 Protobuf wire-format encoder: ProtobufMessageGen, PbScalar/PbNested/PbRepeated/PbRepeatedMessage

New files (infrastructure)

File Lines Description
main_log.proto 103 Main test proto: enums, required fields, multi-level nesting, cross-file imports
module_a_res.proto 92 External proto: repeated messages, defaults, nested repeated
module_b_res.proto 29 External proto: repeated scalars, block structures
predictor_schema.proto 82 External proto: deep multi-level nesting, empty messages
device_req.proto 11 External proto: bytes field
gen_nested_proto_data.sh 34 Proto compilation script
main_log.desc (binary) Compiled FileDescriptorSet

Modified files

File Change summary
GpuOverrides.scala GetStructField → GpuGetStructFieldMeta, GetArrayStructFields → GpuGetArrayStructFieldsMeta
complexTypeExtractors.scala New GpuStructFieldOrdinalTag, GpuGetStructFieldMeta, GpuGetArrayStructFieldsMeta with PRUNED_ORDINAL_TAG support
basicPhysicalOperators.scala GpuProjectExecMeta detects protobuf extraction, new forcePostProjectCoalesce parameter on GpuProjectExec
RapidsConf.scala New config spark.rapids.sql.protobuf.batchMergeAfterProject.enabled
Spark340PlusNonDBShims.scala Merges ProtobufExprShims.exprs into expression rules
GpuBoundAttribute.scala Minor: pruned struct types propagate correctly through binding
DeltaProviderBase.scala Pattern match updates for new GpuProjectExec signature (forcePostProjectCoalesce param)
run_pyspark_from_build.sh Auto-download spark-protobuf + protobuf-java JARs, driver classpath setup
spark_init_internal.py Driver classpath support for optional protobuf module
supported_ops.md Remove BINARY from unsupported child types for struct extractors

Design decisions

1. Reflection-isolated compatibility layer

All Spark expression and protobuf-java descriptor reflection is confined to SparkProtobufCompat.scala. This isolates version-specific API differences (Spark 3.4 path-based vs 3.5+ bytes-based buildDescriptor) from planning logic. Reflection failures produce explicit CPU fallback reasons rather than silent degradation.

2. Typed metadata over raw reflection

Instead of passing raw Option[Any] protobuf defaults or untyped descriptor fields through the planning pipeline, the code uses a typed metadata model (ProtobufExprInfo, ProtobufFieldInfo, ProtobufDefaultValue, ProtobufEnumMetadata). Enum defaults preserve both numeric value (defaultInt) and display name (defaultString), avoiding the ClassCastException: EnumValueDescriptor cannot be cast to String pitfall.
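A Python sketch of that idea (the class and field names are hypothetical, not the PR's Scala types): carrying both forms of the enum default through planning means neither output mode has to reinterpret a descriptor object later.

```python
from dataclasses import dataclass

# Hypothetical analog of the typed enum-default metadata: capture both the
# numeric value and the display name once, at extraction time.
@dataclass(frozen=True)
class EnumDefault:
    default_int: int      # used when the enums.as.ints option is set
    default_string: str   # used in enum-as-string mode

d = EnumDefault(default_int=2, default_string="LEVEL_WARN")
assert d.default_int == 2 and d.default_string == "LEVEL_WARN"
```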

3. Two-level schema projection

Schema projection reduces GPU decode work by only processing fields referenced downstream:

  • Top-level pruning: Only decode top-level fields referenced in downstream ProjectExec/FilterExec/AggregateExec/SortExec/WindowExec
  • Nested pruning: Only decode children of nested messages that are actually accessed — applies uniformly to both StructType (non-repeated) and ArrayType(StructType) (repeated) nested fields

The analysis walks GetStructField / GetArrayStructFields chains upward through the plan to determine the required fields. Expression identity uses semantic equality (not just reference equality) to handle the duplicate expression instances the Catalyst optimizer can create.
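A minimal Python sketch of the pruning idea (hypothetical helper, not the Scala implementation): given the full schema as a nested map and the field paths that downstream operators access, keep only the accessed subtrees. A path ending at a struct keeps that struct's whole subtree, which covers both the top-level and nested cases.

```python
import copy

def prune(schema: dict, accessed_paths: list) -> dict:
    """schema: {field: nested-dict for structs, None for leaves}.
    Keep only fields on some accessed path; a path ending at a struct
    keeps that struct's entire subtree."""
    kept: dict = {}
    for path in accessed_paths:
        node, src = kept, schema
        for i, name in enumerate(path):
            child = src[name]
            if i == len(path) - 1:
                node[name] = copy.deepcopy(child)  # whole-field access
            else:
                node.setdefault(name, {})          # partial struct access
                node, src = node[name], child
    return kept

schema = {"a": {"x": None, "y": None}, "b": None, "c": {"z": None}}
assert prune(schema, [["a", "x"], ["b"]]) == {"a": {"x": None}, "b": None}
```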

4. Ordinal remapping via TreeNodeTag

When schema projection prunes nested struct children, downstream GetStructField / GetArrayStructFields expressions must use remapped ordinals pointing into the pruned struct. This is done via PRUNED_ORDINAL_TAG (TreeNodeTag[Int]), set during convertToGpu and read by GpuGetStructFieldMeta.convertToGpu / GpuGetArrayStructFieldsMeta.convertToGpu.

Design principle: Operator-specific logic (ordinal remapping) stays in the Meta layer. The runtime classes (GpuGetStructField, GpuGetArrayStructFields, GpuCanonicalize) remain generic and untouched.
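The remapping itself is simple once the pruned field list is known; a Python sketch (illustrative only, not the Scala TreeNodeTag mechanics, and the field names are made up):

```python
# After pruning, a downstream GetStructField bound to an ordinal in the
# full struct must be rebound to the field's position in the pruned struct.
full_fields = ["id", "name", "payload", "ts"]     # hypothetical full schema
kept_fields = ["name", "ts"]                      # schema-projection result

# original ordinal -> pruned ordinal: the value a PRUNED_ORDINAL_TAG-style
# side channel would carry to the extractor's convertToGpu
remap = {full_fields.index(f): i for i, f in enumerate(kept_fields)}
assert remap == {1: 0, 3: 1}
assert remap[3] == 1   # an extractor that read "ts" at ordinal 3 now reads 1
```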

5. Post-project batch coalesce

Schema projection can produce narrower batches. When enabled via spark.rapids.sql.protobuf.batchMergeAfterProject.enabled, GpuProjectExecMeta detects projects that extract from protobuf decode and sets forcePostProjectCoalesce=true, which inserts a post-project coalesce and prevents the optimizer from removing it (via outputBatching = null).

6. Value equality for array-carrying types

Any type storing raw arrays that participates in expression equality (GpuFromProtobuf, FlattenedFieldDescriptor, ProtobufDescriptorSource.DescriptorBytes, ProtobufDefaultValue.BinaryValue) overrides equals/hashCode with java.util.Arrays content-based semantics. This prevents semantically identical metadata from comparing unequal by JVM identity.
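The JVM pitfall here is that arrays compare by reference, so equals/hashCode must go through the contents (Arrays.equals). A Python analog of the pattern (hypothetical class, not the PR's code):

```python
class FieldDesc:
    """Array-carrying metadata with content-based equality."""
    def __init__(self, name, enum_names):
        self.name = name
        self.enum_names = list(enum_names)

    def __eq__(self, other):
        # compare element-by-element (the Arrays.equals analog),
        # never by object identity
        return (isinstance(other, FieldDesc)
                and self.name == other.name
                and self.enum_names == other.enum_names)

    def __hash__(self):
        # hash an immutable view of the contents so equal objects collide
        return hash((self.name, tuple(self.enum_names)))

a = FieldDesc("level", ["INFO", "WARN"])
b = FieldDesc("level", ["INFO", "WARN"])
assert a == b and hash(a) == hash(b)   # distinct instances, same contents
```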

7. Optional integration with graceful degradation

ProtobufExprShims.exprs loads ProtobufDataToCatalyst by reflection. If the class is not on the classpath (no spark-protobuf JAR), it returns an empty map — no error, no GPU override. Class-loading failures (ExceptionInInitializerError, LinkageError) are caught at the Error level to prevent crashing query planning.
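A Python analog of the load-or-degrade pattern (the module name and rule map below are placeholders, not real plugin APIs):

```python
import importlib

def load_protobuf_rules() -> dict:
    """Return expression-override rules only if the optional dependency is
    present; any load failure degrades to 'no override' instead of failing
    query planning."""
    try:
        importlib.import_module("pyspark_protobuf_placeholder")  # hypothetical
    except Exception:          # covers ImportError and broken-module errors
        return {}
    return {"ProtobufDataToCatalyst": "GpuFromProtobuf"}  # illustrative map

# with the dependency absent, planning proceeds with no GPU override
assert load_protobuf_rules() == {}
```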

Test coverage

Python integration tests (57 tests)

Category Tests What is covered
Scalar types 5 All scalar types, random scalars, bytes, duplicate fields, all null input
Integer encodings 2 Signed integers (zigzag), fixed integers
Bool encoding 2 Non-canonical varint for bool (scalar + repeated)
Nested messages 6 1-level, 3-level deep, 5-level deep, nested field access, batch merge, nested-with-repeated
Repeated fields 6 int32, string, all types, large array, packed, repeated-with-nested
Repeated messages 1 Repeated message decode
Enum 10 Enum cases, nested enum PERMISSIVE, sibling null propagation, defaults, repeated enum, enum-in-repeated-message, nested repeated enum-as-string
Required fields 3 Present, missing FAILFAST, nested missing PERMISSIVE
Default values 2 Scalar defaults, nested child defaults
Schema projection 7 Simple field pruning, parametrized cases, alias boundary, withColumn boundary, deep pruning (3/5 level, mixed, sibling, whole-struct)
Error handling 3 FAILFAST malformed, PERMISSIVE malformed null, all-null input
Complex / customer 2 Heavy nested proto (customer-realistic), Parquet round-trip
Bug regressions 4 Name collision, filter jump, unrelated struct collision, max depth
Cross-expression 1 Different messages on same binary column
API / options 2 Legacy signature preservation, packed repeated fixed encoding

Scala unit tests (26 tests)

Suite Tests What is covered
ProtobufExprShimsSuite 22 SparkProtobufCompat reflection (path/bytes/Spark 3.4/3.5), planner options, unsupported options, proto3 rejection, ProtobufSchemaExtractor (typed enum defaults, reflection failures, type mismatch, FLOAT/DOUBLE widening), ProtobufSchemaValidator (enum-string encoding, missing metadata rejection, incompatible defaults, non-STRUCT parent), ordinal remapping, GpuFromProtobuf semantic equality, binary default equality, FlattenedFieldDescriptor equality
ProtobufBatchMergeSuite 4 ProjectExec protobuf extraction detection (child project, same project), config default/enable, output batching drop for post-project merge

Total: 83 tests

Configurations

Config key Default Description
spark.rapids.sql.protobuf.batchMergeAfterProject.enabled false Enable post-project coalesce for projects extracting from schema-pruned from_protobuf decode

Review Guide

This PR is large (~7,800 lines) but has a well-defined layered architecture. New Scala production code is ~2,020 lines across 6 files; the rest is tests, proto schemas, and shell infrastructure. This guide provides a recommended reading order and key areas to focus on.

Recommended reading order

Read bottom-up from the metadata model to the planning orchestration:

Order File Focus Time
1 ProtobufSchemaModel.scala Data types: ProtobufExprInfo, ProtobufFieldInfo, ProtobufDefaultValue, FlattenedFieldDescriptor, FlattenedSchemaArrays. Value equality for array fields. 10 min
2 ProtobufSchemaExtractor.scala Type/encoding mapping: checkScalarEncoding maps Spark types to protobuf encodings. Cross-precision FLOAT/DOUBLE rejection. Wire type resolution. 15 min
3 ProtobufSchemaValidator.scala Flatten-time validation: toFlattenedFieldDescriptor converts typed metadata to JNI format. validateFlattenedSchema checks parent-child consistency, enum metadata. toFlattenedSchemaArrays builds parallel arrays. 15 min
4 SparkProtobufCompat.scala Reflection layer: PbReflect cached method lookups, extractExprInfo reads expression fields, resolveMessageDescriptor with Spark 3.4/3.5 retry logic, ReflectiveMessageDescriptor/ReflectiveFieldDescriptor wrappers. 20 min
5 GpuFromProtobuf.scala GPU expression: doColumnar JNI call + null propagation, sparkTypeToCudfIdOpt type mapping, equals/hashCode with deep array equality. 10 min
6 complexTypeExtractors.scala (changes) GpuStructFieldOrdinalTag, GpuGetStructFieldMeta, GpuGetArrayStructFieldsMeta — PRUNED_ORDINAL_TAG read + effective ordinal computation. 10 min
7 ProtobufExprShims.scala §1: rule + tag Lines 1–200: exprs, fromProtobufRule, tagExprForGpu top half — extract info, validate options, resolve descriptor, analyze fields. 20 min
8 ProtobufExprShims.scala §2: schema projection Lines 200–500: analyzeRequiredFields, collectStructFieldReferences, resolveFieldAccessChain, isProtobufStructReference — plan traversal and field reference collection. 25 min
9 ProtobufExprShims.scala §3: flatten + convert Lines 500–845: addFieldWithChildren, addChildFieldsFromStruct, buildPrunedFieldsMap, buildDecodedSchema, registerPrunedOrdinals, convertToGpu. 20 min
10 basicPhysicalOperators.scala (changes) GpuProjectExecMeta: protobuf detection, shouldCoalesceAfterProject, forcePostProjectCoalesce wiring. GpuProjectExecLike trait: coalesceAfter, outputBatching. 15 min
11 GpuOverrides.scala + RapidsConf.scala + Spark340PlusNonDBShims.scala (changes) Expression registration, config, rule merging — small diffs. 5 min
12 ProtobufExprShimsSuite.scala Unit tests — skim by section: compat, extractor, validator, equality. 15 min
13 ProtobufBatchMergeSuite.scala Batch merge tests — straightforward. 5 min
14 protobuf_test.py Integration tests — skim by category. Focus on enum, schema projection, deep pruning, error handling tests. 20 min
15 data_gen.py (additions) Wire-format encoder: ProtobufMessageGen, encoding functions. 10 min
16 Proto files + shell scripts Test schema structure, JAR download automation. 10 min

Total estimated review time: ~3.5–4 hours for a thorough review.

Key review areas by priority

P0: Correctness-critical
  1. ProtobufExprShims.tagExprForGpu (ProtobufExprShims.scala): This is the most complex function in the PR. It orchestrates the full planning pipeline: extract expression info → validate options → resolve descriptor → analyze fields → compute required fields → flatten schema → validate → set ordinal tags → override data type. Key things to verify:

    • All willNotWorkOnGpu paths produce meaningful fallback reasons
    • If step 5 (flatten) encounters an error, later steps (validate, registerPrunedOrdinals, overrideDataType) are NOT executed on partial state
    • analyzeAllFields failure triggers CPU fallback, not silent empty schema
  2. analyzeRequiredFields + collectStructFieldReferences (ProtobufExprShims.scala): Plan traversal for schema projection. Verify:

    • Upward walk handles ProjectExec, FilterExec, AggregateExec, SortExec, WindowExec; unknown nodes disable pruning
    • isProtobufStructReference uses semantic equality as fallback (class + semanticEquals on children), not just eq
    • resolveFieldAccessChain correctly walks GetStructField chains to the protobuf root expression
    • GetArrayStructFields with empty parent path disables pruning rather than registering a fake top-level requirement
  3. registerPrunedOrdinals (ProtobufExprShims.scala): Sets PRUNED_ORDINAL_TAG on Spark expressions. Verify:

    • Tags are set only during convertToGpu (not during analysis/tagging when CPU fallback is still possible)
    • The remapped ordinal is correctly computed from the pruned schema's child position
    • Both GetStructField and GetArrayStructFields are handled
  4. GpuGetStructFieldMeta.convertToGpu / GpuGetArrayStructFieldsMeta.convertToGpu (complexTypeExtractors.scala): Read PRUNED_ORDINAL_TAG and pass effective ordinal. Verify:

    • Falls back to expr.ordinal when tag is not present (non-protobuf usage)
    • effectiveNumFields for GetArrayStructFields uses the child struct's actual field count after pruning
  5. GpuFromProtobuf.doColumnar (GpuFromProtobuf.scala): JNI call + null propagation. Verify:

    • Input null mask is merged via mergeAndSetValidity(BinaryOp.BITWISE_AND, input.getBase) — only when input has nulls
    • FAILFAST wraps CudfException in SparkException; PERMISSIVE logs and rethrows
    • ProtobufSchemaDescriptor is lazily initialized (transient) for serialization safety

P1: Robustness

  1. SparkProtobufCompat (SparkProtobufCompat.scala): Reflection layer. Verify:

    • extractExprInfo extracts messageName, descriptorSource, options with proper error handling
    • resolveMessageDescriptor handles both path-based (Spark 3.4) and bytes-based (Spark 3.5+) descriptor construction, with retry-on-ClassCastException for Spark 3.5 path→bytes conversion
    • isGpuSupportedProtoSyntax rejects proto3, editions, null, and empty strings
    • Reflection failures always return Left/None, never throw through to the optimizer
  2. ProtobufSchemaValidator (ProtobufSchemaValidator.scala): Flatten-time validation. Verify:

    • ENC_ENUM_STRING fields must have non-empty enumValidValues and enumNames with matching lengths
    • Repeated fields with defaults are rejected
    • Parent index validity: parent must be a STRUCT field with depth = child.depth - 1
    • encodeDefaultValue handles all ProtobufDefaultValue variants including EnumValue (both numeric and string)
  3. Value equality (multiple files): Verify equals/hashCode overrides use Arrays.equals/Arrays.deepEquals for:

    • GpuFromProtobuf (all 16 schema arrays)
    • FlattenedFieldDescriptor (defaultString, enumValidValues, enumNames)
    • ProtobufDescriptorSource.DescriptorBytes
    • ProtobufDefaultValue.BinaryValue

P2: Performance & Integration

  1. Post-project coalesce (basicPhysicalOperators.scala): Verify:

    • GpuProjectExecMeta.shouldCoalesceAfterProject correctly detects ProtobufDataToCatalyst by class name via isProtobufDecodeExpr
    • forcePostProjectCoalesce=true causes outputBatching to return null (not TargetSize), preventing removal by transition rules
    • Config isProtobufBatchMergeAfterProjectEnabled defaults to false
  2. DeltaProviderBase.scala changes: Pattern match updates for new GpuProjectExec 4-parameter signature. Verify the mergeIdenticalProjects correctly OR's forcePostProjectCoalesce flags when merging.

  3. Integration test infrastructure (run_pyspark_from_build.sh): Verify:

    • spark-protobuf and protobuf-java JAR download uses correct Maven coordinates
    • Version detection from $SPARK_HOME/jars/protobuf-java-*.jar with fallback mapping
    • PROTOBUF_JARS_AVAILABLE is exported so protobuf_test.py can skip when deps are missing
    • Driver classpath is correctly set via PYSP_TEST_spark_driver_extraClassPath

Things to watch for

  • Expression identity: isProtobufStructReference must handle both reference equality (eq) and semantic equality for duplicate Catalyst instances created by SimplifyExtractValueOps or PySpark-JVM serialization. Without this, schema projection splits one decode into N separate single-field decodes.
  • Optional integration: ProtobufExprShims.exprs catches Error (not just Exception) to handle ExceptionInInitializerError from missing protobuf JAR. Returning empty map means no GPU override attempt.
  • Cross-precision rejection: ProtobufSchemaExtractor.checkScalarEncoding explicitly rejects DoubleType mapped to protobuf FLOAT (and vice versa) rather than silently coercing.
  • Proto2 only: isGpuSupportedProtoSyntax returns true only for "proto2" — proto3 and editions are not yet supported on GPU, triggering CPU fallback with an explicit reason.
  • Ordinal tag timing: PRUNED_ORDINAL_TAG is set only in convertToGpu, never during tagExprForGpu. If it were set during tagging, a subsequent CPU fallback would leave stale tags on Spark expressions.

Mapping review to test coverage

If you're reviewing... Verify these tests pass
tagExprForGpu planning pipeline All 57 integration tests (they all flow through planning)
Schema projection analysis test_from_protobuf_schema_projection_*, test_deep_pruning_*, test_from_protobuf_projection_across_*
Ordinal remapping test_from_protobuf_nested_message_field_access*, test_deep_pruning_*, array struct field meta uses pruned child field count
SparkProtobufCompat reflection compat extracts *, compat invokes *, compat retries *, compat distinguishes *
ProtobufSchemaExtractor extractor preserves *, extractor records *, extractor gives *
ProtobufSchemaValidator validator encodes *, validator rejects *, validator returns *
Enum-as-string test_from_protobuf_enum_cases, test_from_protobuf_*_enum_* (10 tests)
Default values test_from_protobuf_default_values_cases, test_from_protobuf_nested_child_default_values
Error handling test_from_protobuf_failfast_malformed_data, test_from_protobuf_permissive_malformed_returns_null
Post-project coalesce ProtobufBatchMergeSuite (4 tests), test_from_protobuf_nested_message_field_access_with_batch_merge
Value equality GpuFromProtobuf semantic equality *, protobuf binary defaults *, flattened field descriptor *
JNI integration All integration tests (they all call Protobuf.decodeToStruct through JNI)

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven self-assigned this Mar 3, 2026
@thirtiseven

@greptileai full review


greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR adds GPU acceleration for Spark's from_protobuf() expression (ProtobufDataToCatalyst) through a three-layer architecture: reflection-based Spark/protobuf-java compatibility (SparkProtobufCompat), typed metadata extraction and validation (ProtobufSchemaExtractor / ProtobufSchemaValidator), and GPU expression execution (GpuFromProtobuf via JNI). It implements proto2-only support with schema projection pruning, ordinal remapping via PRUNED_ORDINAL_TAG, and an optional post-project coalesce for schema-pruned batches. The implementation is ~2,900 lines of new Scala backed by 57 Python integration tests and 26 Scala unit tests.

Previous-thread fixes confirmed in this diff:

  • willNotWorkOnGpu guards throughout addFieldWithChildren/addChildFieldsFromStruct (no silent partial-schema issues)
  • step5Failed flag short-circuits the flatten loop and prevents registerPrunedOrdinals / overrideDataType on partial state
  • analyzeRequiredFields traverses through ProjectExec without stopping, enabling two-level alias pruning
  • Proto3/editions/empty-string/"null" syntax all correctly rejected
  • encodeDefaultValue returns Either (no throws escaping tagExprForGpu)
  • toDefaultValue returns Left for unknown types (no throws)
  • hasDefaultValue correctly uses defaultValue.isDefined instead of the raw descriptor flag
  • PRUNED_ORDINAL_TAG is set only after all willNotWorkOnGpu paths pass
  • Error subtypes caught in exprs loading path
  • count=1 and lambda replacement in re.sub; comma-only JAR path splitting; deduplication against existing classpath
  • PROTOBUF_JARS_AVAILABLE exported only when both JARs are confirmed present

Remaining concerns (flagged inline):

  • FlattenedSchemaArrays still lacks content-based equals/hashCode overrides (developer acknowledged as "not yet pushed"), inconsistent with the other four array-carrying types in this PR
  • buildDecodedSchema forces nullable = true only on top-level fields; nested required fields remain nullable = false, which can cause downstream null-handling failures in PERMISSIVE mode
  • When enablePreSplit = true and forcePostProjectCoalesce = true, the pre-split batch-size cap is silently bypassed, potentially causing memory pressure for large-row, wide-schema protobuf decodes

Confidence Score: 3/5

  • This PR is a large, complex GPU feature with multiple correctness risks remaining before it is safe to merge.
  • The previous review round caught and the author addressed ~25 substantive bugs (proto3 bypass, encodeDefaultValue throws, hasDefaultValue flag, willNotWorkOnGpu gaps, reflection failures, PERMISSIVE-mode safety, etc.). The current diff correctly implements the majority of those fixes. However, three issues remain open: (1) FlattenedSchemaArrays equality is acknowledged as not yet pushed, (2) nested-field nullable = true propagation in buildDecodedSchema was acknowledged as addressed but the fix is not visible in this diff, and (3) the pre-split bypass for forcePostProjectCoalesce is a latent memory-pressure risk. Given the complexity (~2,900 lines of new Scala, JNI integration, reflection across Spark 3.4–4.1) and the open items, confidence is moderate.
  • sql-plugin/.../ProtobufExprShims.scala (buildDecodedSchema nested nullable), sql-plugin/.../ProtobufSchemaModel.scala (FlattenedSchemaArrays equality), sql-plugin/.../basicPhysicalOperators.scala (pre-split bypass)

Important Files Changed

Filename Overview
sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shims/ProtobufExprShims.scala Core planning orchestration: 845 lines implementing tagExprForGpu, schema projection analysis, flatten orchestration, and convertToGpu. Most previous-thread bugs addressed (step5Failed guard, willNotWorkOnGpu fallbacks, registerPrunedOrdinals placement); a few deferred fixes (FlattenedSchemaArrays equality, BinaryValue equality, nested nullable propagation) acknowledged but not yet pushed into this diff.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFromProtobuf.scala GPU expression: JNI call, null propagation, content-based equals/hashCode for all 16 array fields, Logging mixin for PERMISSIVE-mode CudfException, lazy transient ProtobufSchemaDescriptor with explanatory comment. Well-constructed; no new issues identified.
sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shims/SparkProtobufCompat.scala Reflection compatibility layer: handles Spark 3.4 (path-based) and 3.5+ (bytes-based) descriptor APIs, rejects proto3/editions/empty syntax, null-safe typeName helper, invokeBuildDescriptor retry with proper error context. extractBytes/extractNumber throw in their fallback branches but this is safely caught by the outer Try in defaultValueResult.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/protobuf/ProtobufSchemaModel.scala Typed metadata model: DescriptorBytes and BinaryValue have content-based equals/hashCode; FlattenedFieldDescriptor has content-based equals/hashCode. FlattenedSchemaArrays still lacks equals/hashCode overrides (acknowledged by developer as "not yet pushed"), creating a latent test-equality pitfall. ProtobufDefaultValue.BinaryValue equality fix also acknowledged but not yet in this diff.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/protobuf/ProtobufSchemaExtractor.scala Field analysis layer: analyzeAllFields uses soft-failure (unsupported field info) instead of hard-fail for reflection errors, preserving primary unsupported reason from checkFieldSupport. Cross-precision rejections (DoubleType/FLOAT, FloatType/DOUBLE) now emit actionable messages. getWireType returns Left for unknown types.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/protobuf/ProtobufSchemaValidator.scala Flatten-time validation: toFlattenedFieldDescriptor returns Either, encodeDefaultValue returns Either (no more throws), validateFlattenedSchema checks non-STRUCT parent invariant, enum-string metadata consistency, and parent-child depth consistency. Well-structured.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/basicPhysicalOperators.scala GpuProjectExecMeta: adds outputTypeMetas override that applies to ALL project nodes (not just protobuf), forcePostProjectCoalesce flag, shouldCoalesceAfterProject detection. GpuProjectExec and GpuProjectAstExec updated with the new parameter. When enablePreSplit=true AND forcePostProjectCoalesce=true, pre-split is skipped (intentional but noteworthy).
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/complexTypeExtractors.scala New GpuGetStructFieldMeta and GpuGetArrayStructFieldsMeta with PRUNED_ORDINAL_TAG support; effectiveNumFields derived from post-pruning child schema for GpuGetArrayStructFields. GpuGetStructField formatting refactored. Clean implementation.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala GetStructField now uses GpuGetStructFieldMeta; Alias.typeMeta delegates to child's typeMeta to propagate schema-projection overrides; AttributeReference.convertToGpuImpl resolves type from child plan output. Note added that matched.dataType may carry un-pruned type at plan time; correct propagation is deferred to GpuBoundAttribute.
integration_tests/src/main/python/spark_init_internal.py _add_driver_classpath: merges existing --driver-class-path in PYSPARK_SUBMIT_ARGS using regex + lambda replacement (backslash-safe), splits on comma only, deduplicates against both PYSP_TEST_spark_driver_extraClassPath and the current PYSPARK_SUBMIT_ARGS value, count=1 for re.sub.
integration_tests/run_pyspark_from_build.sh Protobuf JAR download: version auto-detected from SPARK_HOME bundled jar (sort -V

Sequence Diagram

sequenceDiagram
    participant Spark as Spark Optimizer
    participant Shims as ProtobufExprShims
    participant Compat as SparkProtobufCompat
    participant Extractor as ProtobufSchemaExtractor
    participant Validator as ProtobufSchemaValidator
    participant GPU as GpuFromProtobuf
    participant JNI as Protobuf.decodeToStruct (JNI)

    Spark->>Shims: tagExprForGpu(ProtobufDataToCatalyst)
    Shims->>Compat: extractExprInfo(expr)
    Compat-->>Shims: ProtobufExprInfo (messageName, descriptorSource, options)
    Shims->>Compat: parsePlannerOptions(options)
    Compat-->>Shims: ProtobufPlannerOptions (enumsAsInts, failOnErrors)
    Shims->>Compat: resolveMessageDescriptor(exprInfo)
    Compat-->>Shims: ProtobufMessageDescriptor (proto2 syntax check)
    Shims->>Extractor: analyzeAllFields(schema, msgDesc, enumsAsInts)
    Extractor-->>Shims: Map[String, ProtobufFieldInfo]
    Shims->>Shims: analyzeRequiredFields(allFieldNames)<br/>(walk ProjectExec/FilterExec/Agg/Sort/Window upward)
    Shims->>Shims: addFieldWithChildren + addChildFieldsFromStruct<br/>(flatten schema to FlattenedFieldDescriptor list)
    Shims->>Validator: validateFlattenedSchema(flatFields)
    Validator-->>Shims: Right(()) or Left(reason)
    Shims->>Validator: toFlattenedSchemaArrays(flatFields)
    Validator-->>Shims: FlattenedSchemaArrays (16 parallel arrays)
    Shims->>Shims: registerPrunedOrdinals → PRUNED_ORDINAL_TAG on GetStructField/GetArrayStructFields
    Shims->>Shims: overrideDataType(buildDecodedSchema)
    Spark->>Shims: convertToGpu(child)
    Shims-->>Spark: GpuFromProtobuf(decodedSchema, flatArrays..., failOnErrors, child)

    Note over Spark,JNI: Runtime (per batch)
    Spark->>GPU: doColumnar(GpuColumnVector[BinaryType])
    GPU->>JNI: Protobuf.decodeToStruct(input, ProtobufSchemaDescriptor, failOnErrors)
    JNI-->>GPU: cudf STRUCT ColumnVector (pruned schema)
    GPU->>GPU: mergeAndSetValidity(BITWISE_AND, input) if hasNulls
    GPU-->>Spark: GpuColumnVector[StructType(pruned)]

Last reviewed commit: "apply jni refactor"

@thirtiseven (Collaborator, Author)
@greptileai full review

@thirtiseven (Collaborator, Author)
@greptileai full review again

@thirtiseven (Collaborator, Author)
@greptileai full review again

@thirtiseven (Collaborator, Author)
@greptileai full review again

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven marked this pull request as ready for review March 16, 2026 07:20
@thirtiseven marked this pull request as draft March 16, 2026 07:20

@thirtiseven (Collaborator, Author)
@greptile review


Development

Successfully merging this pull request may close these issues.

[FEA] Support from_protobuf

2 participants