Replies: 3 comments
-
Thanks for the report @alephonea. You mentioned a cuDF example, mind sharing that one as well? Also, if you wouldn't mind doing a …
-
For the cuDF case (https://gist.github.com/alephonea/85c455918e6930e1f65ca55ad8d912de), I'd like to know how you are launching cuDF itself (RMM pool size, if any). For the Spark-RAPIDS case, did you measure a single run, or several runs with the runtimes averaged? Because Spark runs on the JVM, it needs time to JIT-compile code, and that hits the first (cold) iteration pretty hard. Spark and Spark-RAPIDS are designed for scale-out and are not optimized for the use case described here. A larger dataset (we usually run 1 TB+) read from the filesystem with ~16 threads is the type of workload that Spark and Spark-RAPIDS are optimized for.
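For reference, here is a minimal sketch of the kind of launch details being asked about, assuming a hypothetical file path and an arbitrary pool size (neither comes from the report):

```python
import time

import cudf
import rmm

# Pre-allocate an RMM memory pool so cuDF allocations do not hit
# cudaMalloc on every call; the 32 GiB size is an arbitrary example.
rmm.reinitialize(pool_allocator=True, initial_pool_size=32 * 2**30)

# Time several iterations and report each one separately, so the cold
# first run is visible instead of being averaged away.
for i in range(5):
    start = time.perf_counter()
    df = cudf.read_parquet("lineitem.parquet")  # hypothetical path
    print(f"run {i}: {time.perf_counter() - start:.3f}s, rows={len(df)}")
```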
-
What is the HEAD commit hash you built from? PerfIO is only for S3 and is disabled by default. If the input path is indeed a literal … Do you observe a difference in your measurements after rebuilding with this patch?

```diff
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/GpuParquetScan.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/GpuParquetScan.scala
index da80757e74..c4f0d70c0f 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/GpuParquetScan.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/GpuParquetScan.scala
@@ -551,8 +551,7 @@ private case class GpuParquetFileFilterHandler(
   private def readFooterBuffer(
       filePath: Path,
       conf: Configuration): HostMemoryBuffer = {
-    PerfIO.readParquetFooterBuffer(filePath, conf, verifyParquetMagic)
-      .getOrElse(readFooterBufUsingHadoop(filePath, conf))
+    readFooterBufUsingHadoop(filePath, conf)
   }

   private def readFooterBufUsingHadoop(filePath: Path, conf: Configuration): HostMemoryBuffer = {
@@ -1869,10 +1868,7 @@ trait ParquetPartitionReaderBase extends Logging with ScanWithMetrics
     val coalescedRanges = coalesceReads(remoteCopies)

-    val totalBytesCopied = PerfIO.readToHostMemory(
-      conf, out.buffer, filePath.toUri,
-      coalescedRanges.map(r => IntRangeWithOffset(r.offset, r.length, r.outputOffset))
-    ).getOrElse {
+    val totalBytesCopied = {
       withResource(filePath.getFileSystem(conf).open(filePath)) { in =>
         val copyBuffer: Array[Byte] = new Array[Byte](copyBufferSize)
         coalescedRanges.foldLeft(0L) { (acc, blockCopy) =>
```
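For the measurement side of that question, a minimal sketch of per-run timing from PySpark; the table path and the query are illustrative stand-ins, not the reporter's actual workload:

```python
import time

from pyspark.sql import SparkSession

# Reuses (or creates) a session; assumes the plugin jar and a parquet
# `lineitem` table path are supplied elsewhere; names are placeholders.
spark = SparkSession.builder.getOrCreate()
spark.read.parquet("tpch/lineitem.parquet").createOrReplaceTempView("lineitem")

# Per-iteration timings keep the cold (JIT-compiling) first run visible
# instead of folding it into an average.
for i in range(5):
    start = time.perf_counter()
    spark.sql(
        "SELECT l_returnflag, l_linestatus, SUM(l_quantity) "
        "FROM lineitem GROUP BY l_returnflag, l_linestatus"
    ).collect()
    print(f"run {i}: {time.perf_counter() - start:.3f}s")
```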
-
Hello,
I tested spark-rapids against a synthetic TPC-H dataset and noticed that the performance isn't great. Specifically, query time is about 10 times higher than with cuDF used directly on the same data.
Looking at the execution graph and logs, it seems that loading data from parquet files is the issue: compared to the cuDF loader, reading the parquet files and decoding them into GPU buffers takes 10 times longer. The input parquet files are fully cached in the OS page cache.
I've instrumented the code with some log lines and found that decoding into the GPU representation is performed by ai.rapids.cudf.ParquetChunkReader. Calls to this class take about 2 seconds to copy and decode the data needed for TPC-H query 1 (SF=20), compared to about 200 ms for the cuDF loader.
Also, by instrumenting the code I found that reading files is done by PerfIO.readToHostMemory(). All calls to this library together take about 2 seconds to read 1.2 GB of data, which is 10 times slower than expected when reading file data from the OS page cache.
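For context on that expectation, a minimal sketch of measuring raw page-cache read throughput with plain Python I/O; the file path is a placeholder:

```python
import time

# Time a plain buffered read of an (assumed) already-cached file; reads
# from the OS page cache typically sustain several GB/s, so 1.2 GB
# should take on the order of a couple hundred milliseconds.
path = "tpch/lineitem.parquet"  # placeholder path
start = time.perf_counter()
total = 0
with open(path, "rb") as f:
    while chunk := f.read(8 << 20):  # 8 MiB per read
        total += len(chunk)
elapsed = time.perf_counter() - start
print(f"read {total / 1e9:.2f} GB in {elapsed:.3f} s "
      f"({total / 1e9 / elapsed:.2f} GB/s)")
```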
Do you know what could be the reason for such performance effects?
Here's the setup:
Spark version 3.5.6
plugin is built from branch-25.08 using the following command: … Then,
dist/target/rapids-4-spark_2.12-25.08.0-SNAPSHOT-cuda12.jar
is provided to a pyspark session that does the following: … (a sketch of such a session is given after the configuration list)
Configuration:
1x NVIDIA H200
44-core AMD EPYC 9654
178 GB RAM
OpenJDK 17
Scala 2.12
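The actual build command and session code were not captured above. As a stand-in, a minimal sketch of how a pyspark session might attach the jar; all paths and settings are assumptions, not the reporter's actual configuration:

```python
from pyspark.sql import SparkSession

# Hypothetical session setup, not the reporter's actual code: the jar
# path matches the artifact named above; other settings are guesses.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.jars",
            "dist/target/rapids-4-spark_2.12-25.08.0-SNAPSHOT-cuda12.jar")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .getOrCreate()
)
print(spark.version)
```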