
Support from_protobuf expression #14354

Draft

thirtiseven wants to merge 54 commits into NVIDIA:main from thirtiseven:from_protobuf_nested

Conversation


@thirtiseven commented Mar 3, 2026

Fixes #14069.
Depends on NVIDIA/spark-rapids-jni#4107

The code is ready, but the diff is long. Feel free to review it directly, or start with the first part of it in #14419.

Description (⚠️ from AI)

This PR adds GPU acceleration for Spark's from_protobuf() expression by replacing ProtobufDataToCatalyst with GpuFromProtobuf at query planning time. It bridges the Spark SQL catalyst layer to the CUDA protobuf decode kernels (provided by the companion JNI PR in spark-rapids-jni) through a clean three-layer architecture: reflection-based compatibility, typed metadata extraction/validation, and GPU expression execution.

The implementation spans ~7,800 lines of new Scala/Python/Shell code across 27 files, including ~2,900 lines of Scala plugin code, ~4,400 lines of Python integration tests and data generators, and ~450 lines of shell/proto infrastructure.

Key capabilities

  • Full from_protobuf() replacement: Transparent GPU override of ProtobufDataToCatalyst — no user-facing API changes
  • All scalar protobuf types: int32, int64, uint32, uint64, sint32/sint64 (zigzag), fixed32/sfixed32/fixed64/sfixed64, float, double, bool, string, bytes
  • Nested messages: Up to 10 levels deep, with recursive schema flattening
  • Repeated fields: Both packed and unpacked encoding, repeated scalars and repeated messages (ArrayType(StructType))
  • Enum-as-string: Configurable via enums.as.ints option — integer mode or validated string name mode
  • Default values: Per-field defaults for all scalar types, strings, bytes, and enum (preserving both numeric and display name)
  • Required field validation: Proto2-style required field checks
  • PERMISSIVE / FAILFAST modes: Configurable error handling via mode option
  • Schema projection: Two-level pruning — top-level field pruning + nested child pruning — reduces GPU decode work to only fields referenced by downstream operators
  • Post-project batch coalesce: Optional post-decode coalesce to avoid small-batch overhead from schema projection
  • Spark 3.4+ / 3.5+ compatibility: Reflection-based compat layer handles path-based and bytes-based descriptor APIs across Spark versions
  • Proto2 only: Proto3 and editions syntax are explicitly rejected with CPU fallback
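Two of the wire encodings listed above (varints, and zigzag for sint32/sint64) are easy to misread, so here is a minimal Python reference sketch of both — illustrative only, not the PR's CUDA/Scala implementation:

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def zigzag32(value: int) -> int:
    """sint32 zigzag: interleaves signed values so small magnitudes encode
    short (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...)."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF

# sint32 -1 encodes as zigzag 1, a single wire byte; int32 300 needs two bytes
assert encode_varint(zigzag32(-1)) == b"\x01"
assert encode_varint(300) == b"\xac\x02"
```

Packed repeated scalar fields are these per-element encodings concatenated inside one length-delimited field, which is why packed and unpacked data can share the same per-element decode logic.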

Architecture

Spark from_protobuf (ProtobufDataToCatalyst)
    │
    │ GpuOverrides rule registration (Spark340PlusNonDBShims)
    ▼
ProtobufExprShims.fromProtobufRule (tagExprForGpu → convertToGpu)
    │
    │ reflection
    ▼
SparkProtobufCompat (extractExprInfo, resolveMessageDescriptor, parsePlannerOptions)
    │
    │ typed metadata
    ▼
ProtobufSchemaExtractor.analyzeAllFields → Map[String, ProtobufFieldInfo]
    │
    │ schema projection analysis
    ▼
analyzeRequiredFields → nestedFieldRequirements → prunedFieldsMap
    │
    │ flatten + validate
    ▼
ProtobufSchemaValidator (toFlattenedFieldDescriptor → validateFlattenedSchema → toFlattenedSchemaArrays)
    │
    │ parallel arrays
    ▼
GpuFromProtobuf(decodedSchema, flatArrays..., child)
    │
    │ JNI call
    ▼
Protobuf.decodeToStruct → CUDA decode → cuDF STRUCT column

File structure

New files (core)

File Lines Description
GpuFromProtobuf.scala 203 GPU expression: JNI call, null propagation, type mapping, value equality
ProtobufExprShims.scala 845 GpuOverrides rule, schema projection analysis, flatten orchestration, convertToGpu
SparkProtobufCompat.scala 371 Reflection layer: extract expr info, resolve descriptors, parse options, Spark 3.4/3.5 compat
ProtobufSchemaModel.scala 182 Typed metadata: ProtobufExprInfo, ProtobufFieldInfo, ProtobufDefaultValue, FlattenedFieldDescriptor, etc.
ProtobufSchemaExtractor.scala 236 Field analysis: type/encoding mapping, support checks, wire type resolution
ProtobufSchemaValidator.scala 183 Flatten-time validation: enum metadata, defaults, parent-child consistency, JNI array construction

New files (tests)

File Lines Description
protobuf_test.py 3,989 57 Python integration tests: GPU vs CPU correctness for all protobuf features
ProtobufExprShimsSuite.scala 616 22 Scala unit tests: compat layer, extractor, validator, ordinal remapping, semantic equality
ProtobufBatchMergeSuite.scala 115 4 Scala unit tests: post-project coalesce detection and config
data_gen.py (additions) ~385 Protobuf wire-format encoder: ProtobufMessageGen, PbScalar/PbNested/PbRepeated/PbRepeatedMessage

New files (infrastructure)

File Lines Description
main_log.proto 103 Main test proto: enums, required fields, multi-level nesting, cross-file imports
module_a_res.proto 92 External proto: repeated messages, defaults, nested repeated
module_b_res.proto 29 External proto: repeated scalars, block structures
predictor_schema.proto 82 External proto: deep multi-level nesting, empty messages
device_req.proto 11 External proto: bytes field
gen_nested_proto_data.sh 34 Proto compilation script
main_log.desc (binary) Compiled FileDescriptorSet

Modified files

File Change summary
GpuOverrides.scala GetStructField → GpuGetStructFieldMeta, GetArrayStructFields → GpuGetArrayStructFieldsMeta
complexTypeExtractors.scala New GpuStructFieldOrdinalTag, GpuGetStructFieldMeta, GpuGetArrayStructFieldsMeta with PRUNED_ORDINAL_TAG support
basicPhysicalOperators.scala GpuProjectExecMeta detects protobuf extraction, new forcePostProjectCoalesce parameter on GpuProjectExec
RapidsConf.scala New config spark.rapids.sql.protobuf.batchMergeAfterProject.enabled
Spark340PlusNonDBShims.scala Merges ProtobufExprShims.exprs into expression rules
GpuBoundAttribute.scala Minor: pruned struct types propagate correctly through binding
DeltaProviderBase.scala Pattern match updates for new GpuProjectExec signature (forcePostProjectCoalesce param)
run_pyspark_from_build.sh Auto-download spark-protobuf + protobuf-java JARs, driver classpath setup
spark_init_internal.py Driver classpath support for optional protobuf module
supported_ops.md Remove BINARY from unsupported child types for struct extractors

Design decisions

1. Reflection-isolated compatibility layer

All Spark expression and protobuf-java descriptor reflection is confined to SparkProtobufCompat.scala. This isolates version-specific API differences (Spark 3.4 path-based vs 3.5+ bytes-based buildDescriptor) from planning logic. Reflection failures produce explicit CPU fallback reasons rather than silent degradation.

2. Typed metadata over raw reflection

Instead of passing raw Option[Any] protobuf defaults or untyped descriptor fields through the planning pipeline, the code uses a typed metadata model (ProtobufExprInfo, ProtobufFieldInfo, ProtobufDefaultValue, ProtobufEnumMetadata). Enum defaults preserve both numeric value (defaultInt) and display name (defaultString), avoiding the ClassCastException: EnumValueDescriptor cannot be cast to String pitfall.
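A Python sketch of that idea (the class and field names are hypothetical, not the PR's Scala types): carrying both forms of the enum default through planning means neither output mode has to reinterpret a descriptor object later.

```python
from dataclasses import dataclass

# Hypothetical analog of the typed enum-default metadata: capture both the
# numeric value and the display name once, at extraction time.
@dataclass(frozen=True)
class EnumDefault:
    default_int: int      # used when the enums.as.ints option is set
    default_string: str   # used in enum-as-string mode

d = EnumDefault(default_int=2, default_string="LEVEL_WARN")
assert d.default_int == 2 and d.default_string == "LEVEL_WARN"
```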

3. Two-level schema projection

Schema projection reduces GPU decode work by only processing fields referenced downstream:

  • Top-level pruning: Only decode top-level fields referenced in downstream ProjectExec/FilterExec/AggregateExec/SortExec/WindowExec
  • Nested pruning: Only decode children of nested messages that are actually accessed — applies uniformly to both StructType (non-repeated) and ArrayType(StructType) (repeated) nested fields

The analysis walks GetStructField / GetArrayStructFields chains upward through the plan to determine the required fields. Expression identity uses semantic equality (not just reference equality) to handle the duplicate expression instances the Catalyst optimizer can create.
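A minimal Python sketch of the pruning idea (hypothetical helper, not the Scala implementation): given the full schema as a nested map and the field paths that downstream operators access, keep only the accessed subtrees. A path ending at a struct keeps that struct's whole subtree, which covers both the top-level and nested cases.

```python
import copy

def prune(schema: dict, accessed_paths: list) -> dict:
    """schema: {field: nested-dict for structs, None for leaves}.
    Keep only fields on some accessed path; a path ending at a struct
    keeps that struct's entire subtree."""
    kept: dict = {}
    for path in accessed_paths:
        node, src = kept, schema
        for i, name in enumerate(path):
            child = src[name]
            if i == len(path) - 1:
                node[name] = copy.deepcopy(child)  # whole-field access
            else:
                node.setdefault(name, {})          # partial struct access
                node, src = node[name], child
    return kept

schema = {"a": {"x": None, "y": None}, "b": None, "c": {"z": None}}
assert prune(schema, [["a", "x"], ["b"]]) == {"a": {"x": None}, "b": None}
```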

4. Ordinal remapping via TreeNodeTag

When schema projection prunes nested struct children, downstream GetStructField / GetArrayStructFields expressions must use remapped ordinals pointing into the pruned struct. This is done via PRUNED_ORDINAL_TAG (TreeNodeTag[Int]), set during convertToGpu and read by GpuGetStructFieldMeta.convertToGpu / GpuGetArrayStructFieldsMeta.convertToGpu.

Design principle: Operator-specific logic (ordinal remapping) stays in the Meta layer. The runtime classes (GpuGetStructField, GpuGetArrayStructFields, GpuCanonicalize) remain generic and untouched.
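The remapping itself is simple once the pruned field list is known; a Python sketch (illustrative only, not the Scala TreeNodeTag mechanics, and the field names are made up):

```python
# After pruning, a downstream GetStructField bound to an ordinal in the
# full struct must be rebound to the field's position in the pruned struct.
full_fields = ["id", "name", "payload", "ts"]     # hypothetical full schema
kept_fields = ["name", "ts"]                      # schema-projection result

# original ordinal -> pruned ordinal: the value a PRUNED_ORDINAL_TAG-style
# side channel would carry to the extractor's convertToGpu
remap = {full_fields.index(f): i for i, f in enumerate(kept_fields)}
assert remap == {1: 0, 3: 1}
assert remap[3] == 1   # an extractor that read "ts" at ordinal 3 now reads 1
```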

5. Post-project batch coalesce

Schema projection can produce narrower batches. When enabled via spark.rapids.sql.protobuf.batchMergeAfterProject.enabled, GpuProjectExecMeta detects projects that extract from protobuf decode and sets forcePostProjectCoalesce=true, which inserts a post-project coalesce and prevents the optimizer from removing it (via outputBatching = null).

6. Value equality for array-carrying types

Any type storing raw arrays that participates in expression equality (GpuFromProtobuf, FlattenedFieldDescriptor, ProtobufDescriptorSource.DescriptorBytes, ProtobufDefaultValue.BinaryValue) overrides equals/hashCode with java.util.Arrays content-based semantics. This prevents semantically identical metadata from comparing unequal by JVM identity.
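The JVM pitfall here is that arrays compare by reference, so equals/hashCode must go through the contents (Arrays.equals). A Python analog of the pattern (hypothetical class, not the PR's code):

```python
class FieldDesc:
    """Array-carrying metadata with content-based equality."""
    def __init__(self, name, enum_names):
        self.name = name
        self.enum_names = list(enum_names)

    def __eq__(self, other):
        # compare element-by-element (the Arrays.equals analog),
        # never by object identity
        return (isinstance(other, FieldDesc)
                and self.name == other.name
                and self.enum_names == other.enum_names)

    def __hash__(self):
        # hash an immutable view of the contents so equal objects collide
        return hash((self.name, tuple(self.enum_names)))

a = FieldDesc("level", ["INFO", "WARN"])
b = FieldDesc("level", ["INFO", "WARN"])
assert a == b and hash(a) == hash(b)   # distinct instances, same contents
```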

7. Optional integration with graceful degradation

ProtobufExprShims.exprs loads ProtobufDataToCatalyst by reflection. If the class is not on the classpath (no spark-protobuf JAR), it returns an empty map — no error, no GPU override. Class-loading failures (ExceptionInInitializerError, LinkageError) are caught at the Error level to prevent crashing query planning.
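A Python analog of the load-or-degrade pattern (the module name and rule map below are placeholders, not real plugin APIs):

```python
import importlib

def load_protobuf_rules() -> dict:
    """Return expression-override rules only if the optional dependency is
    present; any load failure degrades to 'no override' instead of failing
    query planning."""
    try:
        importlib.import_module("pyspark_protobuf_placeholder")  # hypothetical
    except Exception:          # covers ImportError and broken-module errors
        return {}
    return {"ProtobufDataToCatalyst": "GpuFromProtobuf"}  # illustrative map

# with the dependency absent, planning proceeds with no GPU override
assert load_protobuf_rules() == {}
```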

Test coverage

Python integration tests (57 tests)

Category Tests What is covered
Scalar types 5 All scalar types, random scalars, bytes, duplicate fields, all null input
Integer encodings 2 Signed integers (zigzag), fixed integers
Bool encoding 2 Non-canonical varint for bool (scalar + repeated)
Nested messages 6 1-level, 3-level deep, 5-level deep, nested field access, batch merge, nested-with-repeated
Repeated fields 6 int32, string, all types, large array, packed, repeated-with-nested
Repeated messages 1 Repeated message decode
Enum 10 Enum cases, nested enum PERMISSIVE, sibling null propagation, defaults, repeated enum, enum-in-repeated-message, nested repeated enum-as-string
Required fields 3 Present, missing FAILFAST, nested missing PERMISSIVE
Default values 2 Scalar defaults, nested child defaults
Schema projection 7 Simple field pruning, parametrized cases, alias boundary, withColumn boundary, deep pruning (3/5 level, mixed, sibling, whole-struct)
Error handling 3 FAILFAST malformed, PERMISSIVE malformed null, all-null input
Complex / customer 2 Heavy nested proto (customer-realistic), Parquet round-trip
Bug regressions 4 Name collision, filter jump, unrelated struct collision, max depth
Cross-expression 1 Different messages on same binary column
API / options 2 Legacy signature preservation, packed repeated fixed encoding

Scala unit tests (26 tests)

Suite Tests What is covered
ProtobufExprShimsSuite 22 SparkProtobufCompat reflection (path/bytes/Spark 3.4/3.5), planner options, unsupported options, proto3 rejection, ProtobufSchemaExtractor (typed enum defaults, reflection failures, type mismatch, FLOAT/DOUBLE widening), ProtobufSchemaValidator (enum-string encoding, missing metadata rejection, incompatible defaults, non-STRUCT parent), ordinal remapping, GpuFromProtobuf semantic equality, binary default equality, FlattenedFieldDescriptor equality
ProtobufBatchMergeSuite 4 ProjectExec protobuf extraction detection (child project, same project), config default/enable, output batching drop for post-project merge

Total: 83 tests

Configurations

Config key Default Description
spark.rapids.sql.protobuf.batchMergeAfterProject.enabled false Enable post-project coalesce for projects extracting from schema-pruned from_protobuf decode

Review Guide

This PR is large (~7,800 lines) but has a well-defined layered architecture. New Scala production code is ~2,020 lines across 6 files; the rest is tests, proto schemas, and shell infrastructure. This guide provides a recommended reading order and key areas to focus on.

Recommended reading order

Read bottom-up from the metadata model to the planning orchestration:

Order File Focus Time
1 ProtobufSchemaModel.scala Data types: ProtobufExprInfo, ProtobufFieldInfo, ProtobufDefaultValue, FlattenedFieldDescriptor, FlattenedSchemaArrays. Value equality for array fields. 10 min
2 ProtobufSchemaExtractor.scala Type/encoding mapping: checkScalarEncoding maps Spark types to protobuf encodings. Cross-precision FLOAT/DOUBLE rejection. Wire type resolution. 15 min
3 ProtobufSchemaValidator.scala Flatten-time validation: toFlattenedFieldDescriptor converts typed metadata to JNI format. validateFlattenedSchema checks parent-child consistency, enum metadata. toFlattenedSchemaArrays builds parallel arrays. 15 min
4 SparkProtobufCompat.scala Reflection layer: PbReflect cached method lookups, extractExprInfo reads expression fields, resolveMessageDescriptor with Spark 3.4/3.5 retry logic, ReflectiveMessageDescriptor/ReflectiveFieldDescriptor wrappers. 20 min
5 GpuFromProtobuf.scala GPU expression: doColumnar JNI call + null propagation, sparkTypeToCudfIdOpt type mapping, equals/hashCode with deep array equality. 10 min
6 complexTypeExtractors.scala (changes) GpuStructFieldOrdinalTag, GpuGetStructFieldMeta, GpuGetArrayStructFieldsMeta — PRUNED_ORDINAL_TAG read + effective ordinal computation. 10 min
7 ProtobufExprShims.scala §1: rule + tag Lines 1–200: exprs, fromProtobufRule, tagExprForGpu top half — extract info, validate options, resolve descriptor, analyze fields. 20 min
8 ProtobufExprShims.scala §2: schema projection Lines 200–500: analyzeRequiredFields, collectStructFieldReferences, resolveFieldAccessChain, isProtobufStructReference — plan traversal and field reference collection. 25 min
9 ProtobufExprShims.scala §3: flatten + convert Lines 500–845: addFieldWithChildren, addChildFieldsFromStruct, buildPrunedFieldsMap, buildDecodedSchema, registerPrunedOrdinals, convertToGpu. 20 min
10 basicPhysicalOperators.scala (changes) GpuProjectExecMeta: protobuf detection, shouldCoalesceAfterProject, forcePostProjectCoalesce wiring. GpuProjectExecLike trait: coalesceAfter, outputBatching. 15 min
11 GpuOverrides.scala + RapidsConf.scala + Spark340PlusNonDBShims.scala (changes) Expression registration, config, rule merging — small diffs. 5 min
12 ProtobufExprShimsSuite.scala Unit tests — skim by section: compat, extractor, validator, equality. 15 min
13 ProtobufBatchMergeSuite.scala Batch merge tests — straightforward. 5 min
14 protobuf_test.py Integration tests — skim by category. Focus on enum, schema projection, deep pruning, error handling tests. 20 min
15 data_gen.py (additions) Wire-format encoder: ProtobufMessageGen, encoding functions. 10 min
16 Proto files + shell scripts Test schema structure, JAR download automation. 10 min

Total estimated review time: ~3.5–4 hours for a thorough review.

Key review areas by priority

P0: Correctness-critical
  1. ProtobufExprShims.tagExprForGpu (ProtobufExprShims.scala): This is the most complex function in the PR. It orchestrates the full planning pipeline: extract expression info → validate options → resolve descriptor → analyze fields → compute required fields → flatten schema → validate → set ordinal tags → override data type. Key things to verify:

    • All willNotWorkOnGpu paths produce meaningful fallback reasons
    • If step 5 (flatten) encounters an error, later steps (validate, registerPrunedOrdinals, overrideDataType) are NOT executed on partial state
    • analyzeAllFields failure triggers CPU fallback, not silent empty schema
  2. analyzeRequiredFields + collectStructFieldReferences (ProtobufExprShims.scala): Plan traversal for schema projection. Verify:

    • Upward walk handles ProjectExec, FilterExec, AggregateExec, SortExec, WindowExec; unknown nodes disable pruning
    • isProtobufStructReference uses semantic equality as fallback (class + semanticEquals on children), not just eq
    • resolveFieldAccessChain correctly walks GetStructField chains to the protobuf root expression
    • GetArrayStructFields with empty parent path disables pruning rather than registering a fake top-level requirement
  3. registerPrunedOrdinals (ProtobufExprShims.scala): Sets PRUNED_ORDINAL_TAG on Spark expressions. Verify:

    • Tags are set only during convertToGpu (not during analysis/tagging when CPU fallback is still possible)
    • The remapped ordinal is correctly computed from the pruned schema's child position
    • Both GetStructField and GetArrayStructFields are handled
  4. GpuGetStructFieldMeta.convertToGpu / GpuGetArrayStructFieldsMeta.convertToGpu (complexTypeExtractors.scala): Read PRUNED_ORDINAL_TAG and pass effective ordinal. Verify:

    • Falls back to expr.ordinal when tag is not present (non-protobuf usage)
    • effectiveNumFields for GetArrayStructFields uses the child struct's actual field count after pruning
  5. GpuFromProtobuf.doColumnar (GpuFromProtobuf.scala): JNI call + null propagation. Verify:

    • Input null mask is merged via mergeAndSetValidity(BinaryOp.BITWISE_AND, input.getBase) — only when input has nulls
    • FAILFAST wraps CudfException in SparkException; PERMISSIVE logs and rethrows
    • ProtobufSchemaDescriptor is lazily initialized (transient) for serialization safety

P1: Robustness

  1. SparkProtobufCompat (SparkProtobufCompat.scala): Reflection layer. Verify:

    • extractExprInfo extracts messageName, descriptorSource, options with proper error handling
    • resolveMessageDescriptor handles both path-based (Spark 3.4) and bytes-based (Spark 3.5+) descriptor construction, with retry-on-ClassCastException for Spark 3.5 path→bytes conversion
    • isGpuSupportedProtoSyntax rejects proto3, editions, null, and empty strings
    • Reflection failures always return Left/None, never throw through to the optimizer
  2. ProtobufSchemaValidator (ProtobufSchemaValidator.scala): Flatten-time validation. Verify:

    • ENC_ENUM_STRING fields must have non-empty enumValidValues and enumNames with matching lengths
    • Repeated fields with defaults are rejected
    • Parent index validity: parent must be a STRUCT field with depth = child.depth - 1
    • encodeDefaultValue handles all ProtobufDefaultValue variants including EnumValue (both numeric and string)
  3. Value equality (multiple files): Verify equals/hashCode overrides use Arrays.equals/Arrays.deepEquals for:

    • GpuFromProtobuf (all 16 schema arrays)
    • FlattenedFieldDescriptor (defaultString, enumValidValues, enumNames)
    • ProtobufDescriptorSource.DescriptorBytes
    • ProtobufDefaultValue.BinaryValue

P2: Performance & Integration

  1. Post-project coalesce (basicPhysicalOperators.scala): Verify:

    • GpuProjectExecMeta.shouldCoalesceAfterProject correctly detects ProtobufDataToCatalyst by class name via isProtobufDecodeExpr
    • forcePostProjectCoalesce=true causes outputBatching to return null (not TargetSize), preventing removal by transition rules
    • Config isProtobufBatchMergeAfterProjectEnabled defaults to false
  2. DeltaProviderBase.scala changes: Pattern match updates for new GpuProjectExec 4-parameter signature. Verify the mergeIdenticalProjects correctly OR's forcePostProjectCoalesce flags when merging.

  3. Integration test infrastructure (run_pyspark_from_build.sh): Verify:

    • spark-protobuf and protobuf-java JAR download uses correct Maven coordinates
    • Version detection from $SPARK_HOME/jars/protobuf-java-*.jar with fallback mapping
    • PROTOBUF_JARS_AVAILABLE is exported so protobuf_test.py can skip when deps are missing
    • Driver classpath is correctly set via PYSP_TEST_spark_driver_extraClassPath

Things to watch for

  • Expression identity: isProtobufStructReference must handle both reference equality (eq) and semantic equality for duplicate Catalyst instances created by SimplifyExtractValueOps or PySpark-JVM serialization. Without this, schema projection splits one decode into N separate single-field decodes.
  • Optional integration: ProtobufExprShims.exprs catches Error (not just Exception) to handle ExceptionInInitializerError from missing protobuf JAR. Returning empty map means no GPU override attempt.
  • Cross-precision rejection: ProtobufSchemaExtractor.checkScalarEncoding explicitly rejects DoubleType mapped to protobuf FLOAT (and vice versa) rather than silently coercing.
  • Proto2 only: isGpuSupportedProtoSyntax returns true only for "proto2" — proto3 and editions are not yet supported on GPU, triggering CPU fallback with an explicit reason.
  • Ordinal tag timing: PRUNED_ORDINAL_TAG is set only in convertToGpu, never during tagExprForGpu. If it were set during tagging, a subsequent CPU fallback would leave stale tags on Spark expressions.

Mapping review to test coverage

If you're reviewing... Verify these tests pass
tagExprForGpu planning pipeline All 57 integration tests (they all flow through planning)
Schema projection analysis test_from_protobuf_schema_projection_*, test_deep_pruning_*, test_from_protobuf_projection_across_*
Ordinal remapping test_from_protobuf_nested_message_field_access*, test_deep_pruning_*, array struct field meta uses pruned child field count
SparkProtobufCompat reflection compat extracts *, compat invokes *, compat retries *, compat distinguishes *
ProtobufSchemaExtractor extractor preserves *, extractor records *, extractor gives *
ProtobufSchemaValidator validator encodes *, validator rejects *, validator returns *
Enum-as-string test_from_protobuf_enum_cases, test_from_protobuf_*_enum_* (10 tests)
Default values test_from_protobuf_default_values_cases, test_from_protobuf_nested_child_default_values
Error handling test_from_protobuf_failfast_malformed_data, test_from_protobuf_permissive_malformed_returns_null
Post-project coalesce ProtobufBatchMergeSuite (4 tests), test_from_protobuf_nested_message_field_access_with_batch_merge
Value equality GpuFromProtobuf semantic equality *, protobuf binary defaults *, flattened field descriptor *
JNI integration All integration tests (they all call Protobuf.decodeToStruct through JNI)

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven self-assigned this Mar 3, 2026
@thirtiseven

@greptileai full review


greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR adds GPU acceleration for Spark's from_protobuf() expression (ProtobufDataToCatalyst) through a three-layer architecture: reflection-based Spark/protobuf-java compatibility (SparkProtobufCompat), typed metadata extraction and validation (ProtobufSchemaExtractor / ProtobufSchemaValidator), and GPU expression execution (GpuFromProtobuf via JNI). It implements proto2-only support with schema projection pruning, ordinal remapping via PRUNED_ORDINAL_TAG, and an optional post-project coalesce for schema-pruned batches. The implementation is ~2,900 lines of new Scala backed by 57 Python integration tests and 26 Scala unit tests.

Previous-thread fixes confirmed in this diff:

  • willNotWorkOnGpu guards throughout addFieldWithChildren/addChildFieldsFromStruct (no silent partial-schema issues)
  • step5Failed flag short-circuits the flatten loop and prevents registerPrunedOrdinals / overrideDataType on partial state
  • analyzeRequiredFields traverses through ProjectExec without stopping, enabling two-level alias pruning
  • Proto3/editions/empty-string/"null" syntax all correctly rejected
  • encodeDefaultValue returns Either (no throws escaping tagExprForGpu)
  • toDefaultValue returns Left for unknown types (no throws)
  • hasDefaultValue correctly uses defaultValue.isDefined instead of the raw descriptor flag
  • PRUNED_ORDINAL_TAG is set only after all willNotWorkOnGpu paths pass
  • Error subtypes caught in exprs loading path
  • count=1 and lambda replacement in re.sub; comma-only JAR path splitting; deduplication against existing classpath
  • PROTOBUF_JARS_AVAILABLE exported only when both JARs are confirmed present

Remaining concerns (flagged inline):

  • FlattenedSchemaArrays still lacks content-based equals/hashCode overrides (developer acknowledged as "not yet pushed"), inconsistent with the other four array-carrying types in this PR
  • buildDecodedSchema forces nullable = true only on top-level fields; nested required fields remain nullable = false, which can cause downstream null-handling failures in PERMISSIVE mode
  • When enablePreSplit = true and forcePostProjectCoalesce = true, the pre-split batch-size cap is silently bypassed, potentially causing memory pressure for large-row, wide-schema protobuf decodes

Confidence Score: 3/5

  • This PR is a large, complex GPU feature with multiple correctness risks remaining before it is safe to merge.
  • The previous review round caught and the author addressed ~25 substantive bugs (proto3 bypass, encodeDefaultValue throws, hasDefaultValue flag, willNotWorkOnGpu gaps, reflection failures, PERMISSIVE-mode safety, etc.). The current diff correctly implements the majority of those fixes. However, three issues remain open: (1) FlattenedSchemaArrays equality is acknowledged as not yet pushed, (2) nested-field nullable = true propagation in buildDecodedSchema was acknowledged as addressed but the fix is not visible in this diff, and (3) the pre-split bypass for forcePostProjectCoalesce is a latent memory-pressure risk. Given the complexity (~2,900 lines of new Scala, JNI integration, reflection across Spark 3.4–4.1) and the open items, confidence is moderate.
  • sql-plugin/.../ProtobufExprShims.scala (buildDecodedSchema nested nullable), sql-plugin/.../ProtobufSchemaModel.scala (FlattenedSchemaArrays equality), sql-plugin/.../basicPhysicalOperators.scala (pre-split bypass)

Important Files Changed

Filename Overview
sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shims/ProtobufExprShims.scala Core planning orchestration: 845 lines implementing tagExprForGpu, schema projection analysis, flatten orchestration, and convertToGpu. Most previous-thread bugs addressed (step5Failed guard, willNotWorkOnGpu fallbacks, registerPrunedOrdinals placement); a few deferred fixes (FlattenedSchemaArrays equality, BinaryValue equality, nested nullable propagation) acknowledged but not yet pushed into this diff.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFromProtobuf.scala GPU expression: JNI call, null propagation, content-based equals/hashCode for all 16 array fields, Logging mixin for PERMISSIVE-mode CudfException, lazy transient ProtobufSchemaDescriptor with explanatory comment. Well-constructed; no new issues identified.
sql-plugin/src/main/spark340/scala/com/nvidia/spark/rapids/shims/SparkProtobufCompat.scala Reflection compatibility layer: handles Spark 3.4 (path-based) and 3.5+ (bytes-based) descriptor APIs, rejects proto3/editions/empty syntax, null-safe typeName helper, invokeBuildDescriptor retry with proper error context. extractBytes/extractNumber throw in their fallback branches but this is safely caught by the outer Try in defaultValueResult.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/protobuf/ProtobufSchemaModel.scala Typed metadata model: DescriptorBytes and BinaryValue have content-based equals/hashCode; FlattenedFieldDescriptor has content-based equals/hashCode. FlattenedSchemaArrays still lacks equals/hashCode overrides (acknowledged by developer as "not yet pushed"), creating a latent test-equality pitfall. ProtobufDefaultValue.BinaryValue equality fix also acknowledged but not yet in this diff.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/protobuf/ProtobufSchemaExtractor.scala Field analysis layer: analyzeAllFields uses soft-failure (unsupported field info) instead of hard-fail for reflection errors, preserving primary unsupported reason from checkFieldSupport. Cross-precision rejections (DoubleType/FLOAT, FloatType/DOUBLE) now emit actionable messages. getWireType returns Left for unknown types.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/protobuf/ProtobufSchemaValidator.scala Flatten-time validation: toFlattenedFieldDescriptor returns Either, encodeDefaultValue returns Either (no more throws), validateFlattenedSchema checks non-STRUCT parent invariant, enum-string metadata consistency, and parent-child depth consistency. Well-structured.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/basicPhysicalOperators.scala GpuProjectExecMeta: adds outputTypeMetas override that applies to ALL project nodes (not just protobuf), forcePostProjectCoalesce flag, shouldCoalesceAfterProject detection. GpuProjectExec and GpuProjectAstExec updated with the new parameter. When enablePreSplit=true AND forcePostProjectCoalesce=true, pre-split is skipped (intentional but noteworthy).
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/complexTypeExtractors.scala New GpuGetStructFieldMeta and GpuGetArrayStructFieldsMeta with PRUNED_ORDINAL_TAG support; effectiveNumFields derived from post-pruning child schema for GpuGetArrayStructFields. GpuGetStructField formatting refactored. Clean implementation.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala GetStructField now uses GpuGetStructFieldMeta; Alias.typeMeta delegates to child's typeMeta to propagate schema-projection overrides; AttributeReference.convertToGpuImpl resolves type from child plan output. Note added that matched.dataType may carry un-pruned type at plan time; correct propagation is deferred to GpuBoundAttribute.
integration_tests/src/main/python/spark_init_internal.py _add_driver_classpath: merges existing --driver-class-path in PYSPARK_SUBMIT_ARGS using regex + lambda replacement (backslash-safe), splits on comma only, deduplicates against both PYSP_TEST_spark_driver_extraClassPath and the current PYSPARK_SUBMIT_ARGS value, count=1 for re.sub.
integration_tests/run_pyspark_from_build.sh Protobuf JAR download: version auto-detected from SPARK_HOME bundled jar (sort -V

Sequence Diagram

sequenceDiagram
    participant Spark as Spark Optimizer
    participant Shims as ProtobufExprShims
    participant Compat as SparkProtobufCompat
    participant Extractor as ProtobufSchemaExtractor
    participant Validator as ProtobufSchemaValidator
    participant GPU as GpuFromProtobuf
    participant JNI as Protobuf.decodeToStruct (JNI)

    Spark->>Shims: tagExprForGpu(ProtobufDataToCatalyst)
    Shims->>Compat: extractExprInfo(expr)
    Compat-->>Shims: ProtobufExprInfo (messageName, descriptorSource, options)
    Shims->>Compat: parsePlannerOptions(options)
    Compat-->>Shims: ProtobufPlannerOptions (enumsAsInts, failOnErrors)
    Shims->>Compat: resolveMessageDescriptor(exprInfo)
    Compat-->>Shims: ProtobufMessageDescriptor (proto2 syntax check)
    Shims->>Extractor: analyzeAllFields(schema, msgDesc, enumsAsInts)
    Extractor-->>Shims: Map[String, ProtobufFieldInfo]
    Shims->>Shims: analyzeRequiredFields(allFieldNames)<br/>(walk ProjectExec/FilterExec/Agg/Sort/Window upward)
    Shims->>Shims: addFieldWithChildren + addChildFieldsFromStruct<br/>(flatten schema to FlattenedFieldDescriptor list)
    Shims->>Validator: validateFlattenedSchema(flatFields)
    Validator-->>Shims: Right(()) or Left(reason)
    Shims->>Validator: toFlattenedSchemaArrays(flatFields)
    Validator-->>Shims: FlattenedSchemaArrays (16 parallel arrays)
    Shims->>Shims: registerPrunedOrdinals → PRUNED_ORDINAL_TAG on GetStructField/GetArrayStructFields
    Shims->>Shims: overrideDataType(buildDecodedSchema)
    Spark->>Shims: convertToGpu(child)
    Shims-->>Spark: GpuFromProtobuf(decodedSchema, flatArrays..., failOnErrors, child)

    Note over Spark,JNI: Runtime (per batch)
    Spark->>GPU: doColumnar(GpuColumnVector[BinaryType])
    GPU->>JNI: Protobuf.decodeToStruct(input, ProtobufSchemaDescriptor, failOnErrors)
    JNI-->>GPU: cudf STRUCT ColumnVector (pruned schema)
    GPU->>GPU: mergeAndSetValidity(BITWISE_AND, input) if hasNulls
    GPU-->>Spark: GpuColumnVector[StructType(pruned)]

Last reviewed commit: "apply jni refactor"

@thirtiseven (Collaborator, Author)
@greptileai full review

@thirtiseven (Collaborator, Author)
@greptileai full review again

@thirtiseven (Collaborator, Author)
@greptileai full review again

@thirtiseven (Collaborator, Author)
@greptileai full review again

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven (Collaborator, Author)
@greptile review

@thirtiseven marked this pull request as ready for review March 16, 2026 07:20
@thirtiseven marked this pull request as draft March 16, 2026 07:20

@thirtiseven (Collaborator, Author)
@greptile review


Development

Successfully merging this pull request may close these issues.

[FEA] Support from_protobuf

2 participants