[SPARK-51654][BUILD] Add a dev script to compare SBT and Maven builds #54371

Open
fangchenli wants to merge 21 commits into apache:master from fangchenli:compare-builds-script
@fangchenli
Contributor

What changes were proposed in this pull request?

Add a dev script to compare SBT and Maven builds. The script is pure Python and has no external dependencies.

Why are the changes needed?

Currently, the JARs produced by Maven and SBT differ, and we need a way to inspect those differences. This is also a precursor to a native SBT build. With the script, we can answer the question raised in the original Jira issue:

python dev/compare-builds.py --compare \
  assembly/target/scala-2.13/jars/connect-repl/spark-connect-client-jvm_2.13-4.2.0-SNAPSHOT.jar \
  assembly/target/scala-2.13/jars/connect-repl/spark-connect-client-jvm-assembly-4.2.0-SNAPSHOT.jar

Output:

Comparing JARs
────────────────────────────────────────────────────────────────────────
  JAR 1: assembly/target/scala-2.13/jars/connect-repl/spark-connect-client-jvm_2.13-4.2.0-SNAPSHOT.jar
         26,035,401 bytes, 12182 classes, 95 resources, 5 services
  JAR 2: assembly/target/scala-2.13/jars/connect-repl/spark-connect-client-jvm-assembly-4.2.0-SNAPSHOT.jar
         63,882,318 bytes, 21289 classes, 4392 resources, 20 services
  Size:  JAR 2 is 2x larger
────────────────────────────────────────────────────────────────────────
Summary: 3119 identical, 9052 matched after de-shading, 11 only in JAR 1, 9118 only in JAR 2, 23 service diffs, 1 service contents differ

De-shading Analysis
────────────────────────────────────────────────────────────────────────
  ✓ 9052 classes are the same original class under different shading prefixes

  Classes truly only in JAR 1 (11):
    org/sparkproject/com/google/gson/internal/bind/ (5 classes)
    org/sparkproject/com/google/gson/internal/ (4 classes)
    org/sparkproject/com/google/gson/ (1 classes)
    org/apache/spark/unused/ (1 classes)

  Classes truly only in JAR 2 (9118):
    com/ibm/icu/text/ (544 classes)
    com/ibm/icu/impl/ (388 classes)
    org/json4s/ (278 classes)
    com/ibm/icu/util/ (190 classes)
    com/esotericsoftware/kryo/serializers/ (143 classes)
    com/twitter/chill/ (123 classes)
    org/apache/logging/log4j/core/pattern/ (113 classes)
    org/json4s/scalap/scalasig/ (105 classes)
    org/apache/logging/log4j/layout/template/json/resolver/ (93 classes)
    org/apache/logging/log4j/core/appender/ (91 classes)
    com/fasterxml/jackson/databind/deser/std/ (90 classes)
    com/fasterxml/jackson/module/scala/deser/ (88 classes)
    org/antlr/v4/runtime/atn/ (87 classes)
    org/apache/commons/lang3/ (82 classes)
    scala/xml/ (82 classes)
    org/apache/logging/log4j/core/tools/picocli/ (80 classes)
    com/fasterxml/jackson/databind/ser/std/ (79 classes)
    org/apache/logging/log4j/core/layout/ (79 classes)
    org/apache/ivy/ant/ (77 classes)
    com/fasterxml/jackson/databind/introspect/ (76 classes)
    ... and 399 more packages

  Services only in JAR 1 (4):
    META-INF/services/org.sparkproject.io.grpc.LoadBalancerProvider
    META-INF/services/org.sparkproject.io.grpc.ManagedChannelProvider
    META-INF/services/org.sparkproject.io.grpc.NameResolverProvider
    META-INF/services/org.sparkproject.io.grpc.ServerProvider

  Services only in JAR 2 (19):
    META-INF/services/com.fasterxml.jackson.core.JsonFactory
    META-INF/services/com.fasterxml.jackson.core.ObjectCodec
    META-INF/services/com.fasterxml.jackson.databind.Module
    META-INF/services/exec
    META-INF/services/ffm
    META-INF/services/jansi
    META-INF/services/javax.annotation.processing.Processor
    META-INF/services/jna
    META-INF/services/jni
    META-INF/services/org.apache.commons.logging.LogFactory
    META-INF/services/org.apache.logging.log4j.core.util.ContextDataProvider
    META-INF/services/org.apache.logging.log4j.message.ThreadDumpMessage$ThreadInfoFactory
    META-INF/services/org.apache.logging.log4j.spi.Provider
    META-INF/services/org.apache.logging.log4j.util.PropertySource
    META-INF/services/org.slf4j.spi.SLF4JServiceProvider
    META-INF/services/org.sparkproject.connect.client.io.grpc.LoadBalancerProvider
    META-INF/services/org.sparkproject.connect.client.io.grpc.ManagedChannelProvider
    META-INF/services/org.sparkproject.connect.client.io.grpc.NameResolverProvider
    META-INF/services/org.sparkproject.connect.client.io.grpc.ServerProvider

  Services with different content (1):
    META-INF/services/reactor.blockhound.integration.BlockHoundIntegration
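
The class-level part of the report above can be approximated with a short sketch. This is not the code in `dev/compare-builds.py`; it is a minimal illustration that assumes classes are matched by entry name only and that the shading prefixes are the two visible in the service names above (`org/sparkproject/` and `org/sparkproject/connect/client/`). The names `SHADE_PREFIXES`, `deshade`, `class_names`, and `compare_class_names` are hypothetical.

```python
import zipfile

# Assumed shading prefixes, inferred from the service names in the report.
# The longer prefix must come first so it is stripped in preference.
SHADE_PREFIXES = (
    "org/sparkproject/connect/client/",
    "org/sparkproject/",
)


def deshade(path):
    """Strip the first matching shading prefix from a class entry path."""
    for prefix in SHADE_PREFIXES:
        if path.startswith(prefix):
            return path[len(prefix):]
    return path


def class_names(jar_path):
    """List the .class entries of a JAR (a JAR is a ZIP archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return [n for n in jar.namelist() if n.endswith(".class")]


def compare_class_names(names1, names2):
    """Bucket class entries by name: same path, matched after de-shading,
    or present in only one of the two JARs."""
    set1, set2 = set(names1), set(names2)
    same_path = set1 & set2
    # Key the leftovers by their de-shaded path. If two entries in one JAR
    # de-shade to the same path, this simple sketch keeps only one of them.
    rest1 = {deshade(n): n for n in set1 - same_path}
    rest2 = {deshade(n): n for n in set2 - same_path}
    matched = rest1.keys() & rest2.keys()
    return {
        "same_path": sorted(same_path),
        "deshaded": sorted(matched),
        "only_in_1": sorted(rest1[k] for k in rest1.keys() - matched),
        "only_in_2": sorted(rest2[k] for k in rest2.keys() - matched),
    }
```

A call such as `compare_class_names(class_names(jar1), class_names(jar2))` would produce the four buckets for two JAR paths. Comparing by name alone cannot tell whether two classes with the same path have the same bytecode, so the real script may well do more than this.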

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The script includes a partial self-test. Testing it further requires more user feedback and an investigation of the differences it reports.
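
As an illustration of what a self-test of this kind could look like, the sketch below builds two tiny JARs in memory and checks a service-file comparison against them. It is not taken from the script; `build_jar`, `read_services`, `diff_services`, and `self_test` are hypothetical names, and the comparison rule (provider lines compared as a sorted list, with comments and blank lines ignored) is an assumption.

```python
import io
import zipfile

SERVICES_DIR = "META-INF/services/"


def build_jar(entries):
    """Build an in-memory JAR (a ZIP archive) from a {path: text} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        for path, text in entries.items():
            jar.writestr(path, text)
    buf.seek(0)
    return buf


def read_services(jar_file):
    """Map each service file to its sorted provider lines.

    Comments (after '#') and blank lines are dropped.
    """
    services = {}
    with zipfile.ZipFile(jar_file) as jar:
        for name in jar.namelist():
            if name.startswith(SERVICES_DIR) and not name.endswith("/"):
                lines = jar.read(name).decode("utf-8").splitlines()
                providers = (line.split("#", 1)[0].strip() for line in lines)
                services[name] = sorted(p for p in providers if p)
    return services


def diff_services(jar1, jar2):
    """Report service files unique to each JAR and those whose content differs."""
    s1, s2 = read_services(jar1), read_services(jar2)
    return {
        "only_in_1": sorted(s1.keys() - s2.keys()),
        "only_in_2": sorted(s2.keys() - s1.keys()),
        "different": sorted(k for k in s1.keys() & s2.keys() if s1[k] != s2[k]),
    }


def self_test():
    """Check diff_services against two hand-built JARs with known differences."""
    jar1 = build_jar({
        SERVICES_DIR + "a.Provider": "impl.One\n",
        SERVICES_DIR + "shared.Service": "# comment\nimpl.X\n",
    })
    jar2 = build_jar({
        SERVICES_DIR + "b.Provider": "impl.Two\n",
        SERVICES_DIR + "shared.Service": "impl.X\nimpl.Y\n",
    })
    result = diff_services(jar1, jar2)
    assert result["only_in_1"] == [SERVICES_DIR + "a.Provider"]
    assert result["only_in_2"] == [SERVICES_DIR + "b.Provider"]
    assert result["different"] == [SERVICES_DIR + "shared.Service"]
    return result
```

Building the fixtures in memory keeps such a test independent of any real Spark build output, which fits a script that has no external dependencies.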

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.6

@Yicong-Huang
Contributor

That's a great finding!
