Support Hadoop SequenceFiles Scan#14061
Conversation
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
This reverts commit 2ab5557.
There was a problem hiding this comment.
Pull request overview
This PR adds support for reading Hadoop SequenceFiles in the RAPIDS Accelerator for Apache Spark. It registers a new file format sequencefilebinary that reads SequenceFile key/value pairs as raw BinaryType columns on the GPU.
Key Changes
- Introduces
SequenceFileBinaryFileFormatas a new DataSource that reads SequenceFiles and exposes key/value as BinaryType columns - Implements GPU-accelerated reading via
GpuReadSequenceFileBinaryFormatand associated partition readers - Integrates the new format into
GpuFileSourceScanExecfor GPU execution path routing
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala | CPU-side FileFormat implementation with row-based reader for SequenceFiles |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuReadSequenceFileBinaryFormat.scala | GPU-enabled FileFormat wrapper with metadata support and multi-file reader factory |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala | Core GPU partition readers with host-side buffering and device column materialization |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileSourceScanExec.scala | Integration point registering SequenceFileBinary format in GPU scan execution |
| tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala | Test suite in tests module for wildcard discovery |
| sql-plugin/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala | Duplicate test suite in sql-plugin module |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Show resolved
Hide resolved
sql-plugin/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Show resolved
Hide resolved
sql-plugin/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@Greptile full review |
Greptile SummaryThis PR introduces GPU-accelerated reading of Hadoop SequenceFiles by replacing Key issues found:
Confidence Score: 2/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Planner as Spark Planner
participant Meta as GpuSequenceFileSerializeFromObjectExecMeta
participant Exec as GpuSequenceFileSerializeFromObjectExec
participant Factory as GpuSequenceFileMultiFilePartitionReaderFactory
participant Reader as MultiFileCloudSequenceFilePartitionReader
participant HadoopFS as Hadoop FileSystem
participant GPU as GPU (cuDF)
Planner->>Meta: tagPlanForGpu()
Meta->>Meta: isSimpleSequenceFileRDD() via reflection
Meta->>HadoopFS: findAnyFile() — sample first file per path
HadoopFS-->>Meta: first file
Meta->>HadoopFS: isCompressedSequenceFile(firstFile)
HadoopFS-->>Meta: compressed? → fall back to CPU
Meta->>Meta: collectInputPaths() via reflection
Meta-->>Planner: convertToGpu() → GpuSequenceFileSerializeFromObjectExec
Planner->>Exec: internalDoExecuteColumnar()
Exec->>HadoopFS: filePartitions — glob + list all files
HadoopFS-->>Exec: FilePartition[]
Exec->>Factory: create GpuSequenceFileMultiFilePartitionReaderFactory
Factory->>Reader: buildBaseColumnarReaderForCloud(files)
loop Per file (multithreaded)
Reader->>HadoopFS: SequenceFile.Reader.nextRaw()
HadoopFS-->>Reader: key/value bytes
Reader->>Reader: HostBinaryListBufferer.addBytes()
Reader->>Reader: SequenceFileChunk (host pinned memory)
end
Reader->>Reader: combineHMBs() — zero-copy chunk collection
Reader->>GPU: buildDeviceColumnFromChunks() — H2D transfer
GPU->>GPU: ColumnVector.concatenate() (combine mode)
GPU-->>Exec: ColumnarBatch (LIST<UINT8>)
|
There was a problem hiding this comment.
Additional Comments (4)
-
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala, line 125-128 (link)logic: check for potential INT32 overflow before it happens
the check happens after
dataLocationhas already grown beyondInt.MaxValue, which could cause issues during the buffer growth operations inaddBytesoraddValueBytes. move the overflow check earlier in those methods before updatingdataLocation. -
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala, line 73-83 (link)logic: potential integer overflow in row capacity calculation
rowsAllocated * 2can overflow whenrowsAllocatedis close toInt.MaxValue / 2. this causes the allocation to wrap to negative or small values. -
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala, line 32-136 (link)style: missing test coverage for key scenarios
add tests for:
- compressed SequenceFiles (should throw
UnsupportedOperationException) - multi-file reads to verify the multi-file reader path
- large batches that exceed
maxRowsPerBatchormaxBytesPerBatch - partition columns
- reading only key or only value (not both)
- empty files
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
- compressed SequenceFiles (should throw
-
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala, line 102-136 (link)style: tests only verify the CPU reader path, not GPU
this test uses
SequenceFileBinaryFileFormatwhich is the CPU fallback. to test the GPU path withGpuReadSequenceFileBinaryFormat, you'd need to enable the Rapids plugin configuration and verify GPU execution.
5 files reviewed, 4 comments
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuReadSequenceFileBinaryFormat.scala
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
105cf2a to
9b1162e
Compare
|
@greptileai full review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
6fd0076 to
3fe2cd6
Compare
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptileai full review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptileai full review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptileai full review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptileai full review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptile review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptile review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptile review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptile review |
1 similar comment
|
@greptile review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptile review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptile review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Closes #14065
Description
This PR adds GPU-accelerated reading of Hadoop SequenceFiles in the RAPIDS Accelerator for Apache Spark. SequenceFile is commonly used as the storage format for serialized protobuf payloads (via
from_protobuf). Previously, reading SequenceFiles was only possible through Spark's CPU-based RDD API (sc.newAPIHadoopFile), which becomes the I/O bottleneck when downstream decode (e.g., protobuf) runs on the GPU.The implementation introduces a physical plan replacement strategy: when Spark's plan contains a
SerializeFromObjectExecover anExternalRDDScanExecbacked by a simple SequenceFile RDD, the plugin replaces it withGpuSequenceFileSerializeFromObjectExec, which bypasses the original RDD and reads files directly using a multi-threaded reader with combine mode.Key Design Decisions
SequenceFile.Readeron the CPU (the format is not amenable to GPU parsing). The decoded binary payloads are buffered in pinned host memory and transferred to the GPU asLIST<UINT8>columns. This meansCOALESCINGandPERFILEreader modes are not supported (no benefit), andMULTITHREADEDis the default.NewHadoopRDD/HadoopRDDto confirm that the RDD lineage is a simple SequenceFile scan withBytesWritablekey/value. Complex lineages (e.g., filtered/mapped RDDs) automatically fall back to CPU.sc.newAPIHadoopFile()andsc.hadoopFile()paths are supported.data/year=2024/*).spark.sql.files.ignoreMissingFilesandspark.sql.files.ignoreCorruptFilesconfigurations.New Files
GpuSequenceFileSerializeFromObjectExecMeta.scalaGpuSequenceFileSerializeFromObjectExec.scalasequencefile/GpuSequenceFileReaders.scalaHostBinaryListBufferer,SequenceFileChunk,MultiFileCloudSequenceFilePartitionReader, factoryNew Configs
spark.rapids.sql.format.sequencefile.reader.typeMULTITHREADEDMULTITHREADEDandAUTOsupported)spark.rapids.sql.format.sequencefile.multiThreadedRead.maxNumFilesParallelInteger.MAX_VALUEspark.rapids.sql.format.sequencefile.rddScan.physicalReplace.enabledtruePerformance tests
val NUM_FILES = 200
val RECORDS_PER_FILE = 50000
val VALUE_SIZE = 1024
val ITERATIONS = 5
I ran performance tests on 200 files with 50,000 records and a 1 MB size per value.
script
Since the decode happens on CPU we got similar perf numbers with CPU file format and we need to copy data to GPU for GPU file format. But it's about 2 times faster than CPU RDD scan.
Test Coverage
Scala unit tests (
SequenceFilePhysicalReplaceSuite,SequenceFileBinaryFileFormatSuite):hadoopRDD) supportignoreMissingFilesbehaviorPython integration tests (
sequencefile_test.py):Checklists
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)