
from_protobuf (part 0): Add integration tests marked as XFAIL#14419

Open
thirtiseven wants to merge 18 commits into NVIDIA:main from thirtiseven:protobuf_integration_tests

Conversation

@thirtiseven
Collaborator

Part 0 of #14354

Description

This PR adds integration tests for from_protobuf. Since #14354 will be merged in parts, each small PR can enable some of these tests.

At the framework level:

  • from_protobuf needs an external jar, spark-protobuf, to run, so I edited run_pyspark_from_build.sh to download the jar when it is not found.
  • I also added a ProtobufMessageGen to generate random protocol buffer data.
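The jar handling described above can be illustrated with a minimal sketch of the validation the review below attributes to run_pyspark_from_build.sh: check each supplied path exists and require at least one spark-protobuf jar. Variable names here are guesses, not the script's exact ones, and the demo jar is created locally just so the sketch runs standalone:

```shell
# Hypothetical sketch of PROTOBUF_JARS validation; names are illustrative.
# Create a throwaway jar so the sketch is self-contained.
demo_jar="$(mktemp -d)/spark-protobuf_2.12-3.5.0.jar"
touch "$demo_jar"
PROTOBUF_JARS="$demo_jar"

VALID_PROTOBUF_JARS=()
FOUND_SPARK_PROTOBUF=false
# Note: unquoted expansion word-splits on spaces -- the same limitation
# the review flags for jar paths containing spaces.
for jar in $PROTOBUF_JARS; do
    if [[ ! -f "$jar" ]]; then
        echo "ERROR: jar not found: $jar" >&2
        exit 1
    fi
    VALID_PROTOBUF_JARS+=("$jar")
    [[ "$(basename "$jar")" == spark-protobuf_* ]] && FOUND_SPARK_PROTOBUF=true
done

if [[ "$FOUND_SPARK_PROTOBUF" != true ]]; then
    echo "ERROR: no spark-protobuf_* jar in PROTOBUF_JARS" >&2
    exit 1
fi
PROTOBUF_JARS_AVAILABLE=true
```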

Everything is generated by Cursor.

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

@thirtiseven thirtiseven self-assigned this Mar 16, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR adds a comprehensive integration test suite for the from_protobuf Spark SQL function as part of the staged GPU implementation (spark-rapids#14354). All tests are marked @_xfail_gpu_protobuf (or CPU-only) so they act as a baseline verification suite that can be progressively un-XFAIL'd as each GPU feature lands.

Key additions:

  • protobuf_data_gen.py — a new pure-Python schema-first binary encoder and ProtobufRowGen DataGen that generates paired (logical Spark columns + serialized proto bytes) rows for round-trip testing. Covers all scalar wire types, zigzag, fixed-width, packed/unpacked repeated, enums, nested messages, and proto2 defaults.
  • protobuf_test.py — ~2900-line test file with ~50 parameterised test functions covering scalars, nested structs, repeated fields, enums (int/string/unknown), proto2 required fields, default values, schema projection/pruning, plan-boundary projection, FAILFAST/PERMISSIVE error modes, and several named regression tests (name collision, filter-jump, bool uint8 truncation, deep nesting fallback).
  • spark_session.py — adds is_protobuf_runtime_available() which gates the test fixture on JVM class-loading success so tests skip cleanly when the spark-protobuf jar is absent.
  • run_pyspark_from_build.sh — adds PROTOBUF_JARS env-var handling: validates each path, requires at least one spark-protobuf_* jar, and folds them into ALL_JARS.
  • Documentation and marker registration updated consistently.
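The wire-format pieces listed for protobuf_data_gen.py can be illustrated with a minimal pure-Python sketch; the function names here are illustrative, not the module's actual API:

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, MSB set on all but the last."""
    if value < 0:
        # Mirrors the negative-value guard the review mentions.
        raise ValueError("varint expects a non-negative value")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit
        else:
            out.append(byte)
            return bytes(out)

def zigzag(value: int) -> int:
    """ZigZag maps signed ints to unsigned so small negatives stay small:
    0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... (assuming 64-bit values)."""
    return (value << 1) ^ (value >> 63)
```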

Findings:

  • is_protobuf_runtime_available() contains a redundant final fallback (jvm.org.apache.spark.sql.protobuf.functions.from_protobuf attribute access) that is unreachable once both Class.forName candidates fail.
  • PROTOBUF_EXTRA_JARS="${VALID_PROTOBUF_JARS[*]}" space-joins the array, which would mishandle jar paths containing spaces (same pre-existing limitation as AVRO_JARS).
  • test_from_protobuf_bug4_max_depth only validates the CPU baseline with a weak not-null assertion; GPU fallback verification is deferred to follow-up PRs, which is fine given the XFAIL framing but worth tracking.
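One way to avoid the space-join pitfall flagged above (a sketch only, assuming the jars ultimately feed a comma-separated Spark `--jars` list) is to join the array on commas instead of the default space:

```shell
# Hypothetical fix sketch: join validated jar paths with commas rather than
# spaces, so a path containing a space survives intact.
VALID_PROTOBUF_JARS=("/opt/jars/spark-protobuf_2.12-3.5.0.jar" "/path with space/extra.jar")

join_by_comma() {
    local IFS=','   # "$*" joins positional parameters with the first IFS char
    echo "$*"
}

PROTOBUF_EXTRA_JARS="$(join_by_comma "${VALID_PROTOBUF_JARS[@]}")"
```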

Confidence Score: 4/5

  • Safe to merge — all tests are XFAIL or CPU-only, so they cannot regress the build, and the infrastructure changes (shell script, session helper) are additive and guarded.
  • All new tests are marked XFAIL pending the GPU implementation, meaning they document expected-failure baselines rather than asserting currently-working behaviour. The supporting data-generator and descriptor-builder code is self-contained pure Python with no JVM dependencies at import time. Shell script changes are guarded by the PROTOBUF_JARS env var. Minor issues found (redundant fallback in is_protobuf_runtime_available, weak assertion in the max-depth test, space-join limitation in bash) are non-blocking.
  • integration_tests/src/main/python/spark_session.py (redundant fallback block) and integration_tests/run_pyspark_from_build.sh (space-joined jar array).

Important Files Changed

  • integration_tests/src/main/python/protobuf_test.py: New 2872-line integration test file covering from_protobuf across scalar types, nested/repeated messages, enums, defaults, schema projection, edge cases, and regression tests — all XFAIL pending GPU implementation. Well-structured with clear descriptor helpers. Minor: test_from_protobuf_bug4_max_depth only validates the CPU path with a weak assertion.
  • integration_tests/src/main/python/protobuf_data_gen.py: New pure-Python protobuf schema-first data generator. Implements varint/zigzag/fixed encoders, a PbMessageSpec DSL via the pb builder singleton, and ProtobufRowGen (a DataGen subclass) for generating paired (logical columns + binary blob) Spark rows. A guard against negative varints was added. Logic is correct.
  • integration_tests/src/main/python/spark_session.py: Adds the is_protobuf_runtime_available() helper, which checks for the spark-protobuf jar at both the PySpark import and JVM class-loading levels. The final fallback via jvm.org.apache.spark.sql.protobuf.functions.from_protobuf is redundant given the preceding Class.forName loop.
  • integration_tests/run_pyspark_from_build.sh: Adds optional PROTOBUF_JARS env-var handling: validates jar paths, confirms a spark-protobuf jar is present, and appends them to ALL_JARS for executor/local-mode distribution. Space-joining array elements could mishandle paths with spaces (the same limitation as the existing AVRO_JARS).
  • integration_tests/README.md: Documents how to supply PROTOBUF_JARS and run the new protobuf_test marker; accurately reflects the updated script behaviour (no auto-download).
  • integration_tests/src/main/python/marks.py: Adds protobuf_test = pytest.mark.protobuf_test alongside the matching entry in pytest.ini; a straightforward one-liner addition.
  • integration_tests/src/test/resources/protobuf_test/gen_nested_proto_data.sh: Convenience script to compile the checked-in .proto files into main_log.desc using protoc. Straightforward and well-commented.

Sequence Diagram

sequenceDiagram
    participant User as Test Runner
    participant Script as run_pyspark_from_build.sh
    participant pytest
    participant Fixture as from_protobuf_fn fixture
    participant Helper as is_protobuf_runtime_available()
    participant Test as protobuf_test.py tests
    participant CPU as CPU Spark
    participant GPU as GPU Spark (XFAIL)

    User->>Script: PROTOBUF_JARS=... ./run_pyspark_from_build.sh -m protobuf_test
    Script->>Script: Validate jars, check spark-protobuf_* present
    Script->>Script: Set PROTOBUF_JARS_AVAILABLE=true
    Script->>Script: Add jars to ALL_JARS (--jars / --driver-class-path)
    Script->>pytest: Launch pytest with PROTOBUF_JARS_AVAILABLE env var

    pytest->>Test: Collect module (pytestmark skipif checks env var)
    pytest->>Fixture: Invoke from_protobuf_fn (scope=module)
    Fixture->>Helper: is_protobuf_runtime_available()
    Helper->>Helper: import pyspark.sql.protobuf.functions
    Helper->>Helper: Class.forName("...functions$")
    Helper-->>Fixture: True
    Fixture->>Fixture: import from_protobuf, return fn

    loop Each XFAIL test
        pytest->>Test: Run test
        Test->>Test: _setup_protobuf_desc → build descriptor bytes via JVM
        Test->>Test: write descriptor to HDFS temp path
        Test->>Test: encode test rows via ProtobufRowGen / encode_pb_message
        Test->>CPU: assert_gpu_and_cpu_are_equal_collect (CPU path)
        CPU-->>Test: CPU result
        Test->>GPU: assert_gpu_and_cpu_are_equal_collect (GPU path)
        GPU-->>Test: XFAIL (GPU plugin not merged yet)
    end
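The env-var gating at the top of the diagram follows a standard pytest pattern; a sketch only, with hypothetical names (the real fixture also consults is_protobuf_runtime_available() before importing the function):

```python
import os
import pytest

# Module-level gate: skip the whole file when the launcher script did not
# export PROTOBUF_JARS_AVAILABLE=true (i.e. the jars are not on the classpath).
pytestmark = pytest.mark.skipif(
    os.environ.get("PROTOBUF_JARS_AVAILABLE") != "true",
    reason="spark-protobuf jars not on the classpath",
)

@pytest.fixture(scope="module")
def from_protobuf_fn():
    # In the real suite this also calls is_protobuf_runtime_available()
    # to verify the JVM can actually load the spark-protobuf classes.
    from pyspark.sql.protobuf.functions import from_protobuf
    return from_protobuf
```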

Last reviewed commit: "style"

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven force-pushed the protobuf_integration_tests branch from da03440 to 5213199 on March 17, 2026 at 07:01