Description
I have been trying to run the RAPIDS ML benchmarks since last year with moderate success. I am using RAPIDS ML 24.08 with Spark 3.5.1.
I can run 4 workloads without any issues (i.e., linear regression, logistic regression*, random_forest_classifier, random_forest_regressor), but I am having issues running the rest of the workloads.
For 3 workloads (i.e., KMeans, PCA, UMAP), I can run datasets up to ~5GB, but they start to fail for >=8GB datasets on a 16GB GPU with UVM enabled, with "GPU OutOfMemory: could not split inputs and retry" and occasionally a CPU OoM. This was mentioned in #680 and NVIDIA/spark-rapids#10567. I tried setting the number of concurrent GPU tasks to just 1, but that didn't help.
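For reference, this is roughly what that reduced-concurrency attempt looks like when expressed as PySpark session config (a minimal sketch; the app name and the batch size value are illustrative, not the exact settings from every run):

from pyspark.sql import SparkSession

# Minimal sketch: force a single concurrent GPU task and smaller batches while
# keeping UVM on, to check whether "could not split inputs and retry" goes away.
spark = (
    SparkSession.builder
    .appName("kmeans-uvm-oom-check")                      # hypothetical app name
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.ml.uvm.enabled", "true")        # UVM on, as in my runs
    .config("spark.rapids.sql.concurrentGpuTasks", "1")   # one task on the GPU at a time
    .config("spark.task.resource.gpu.amount", "1")        # one task per GPU instead of 0.25
    .config("spark.rapids.sql.batchSizeBytes", "64m")     # smaller batches (assumed value)
    .getOrCreate()
)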
For KNN, I get an error when I use multiple GPUs, mainly "ucp._libs.exceptions.UCXUnreachable: <stream_recv>:", even when I set the shuffle manager mode to multi-threaded. This does not happen with the other workloads despite the same configuration.
In 24.06, I was able to run KNN on 2-4 GPUs, but I had issues with datasets that require UVM on 3 and 4 GPUs. With 3 GPUs, I got "ucp._libs.exceptions.UCXUnreachable: <stream_recv>:" only when UVM was required, but the application was able to finish after multiple retries. With 4 GPUs, I got KeyError: '.' when UVM was required.
For DBSCAN, I am getting java.lang.ClassCastException: class com.nvidia.spark.rapids.SerializedTableColumn cannot be cast to class com.nvidia.spark.rapids.GpuColumnVector (com.nvidia.spark.rapids.SerializedTableColumn and com.nvidia.spark.rapids.GpuColumnVector are in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @2fb266a)
I haven't tried out ANN yet.
My config depends on the dataset size and the workload; this is the file I used for one of the workloads.
Spark conf
# spark.master yarn
spark.master spark://master:7077
# spark.rapids.sql.concurrentGpuTasks 1
spark.rapids.sql.concurrentGpuTasks 2
spark.driver.memory 100g
spark.executor.memory 50g
spark.executor.cores 4
spark.executor.resource.gpu.amount 1
spark.task.cpus 1
spark.task.resource.gpu.amount 0.25
spark.rapids.memory.pinnedPool.size 50G
# spark.sql.files.maxPartitionBytes 128m
spark.sql.files.maxPartitionBytes 512m
spark.plugins com.nvidia.spark.SQLPlugin
spark.executor.resource.gpu.discoveryScript ./get_gpus_resources.rb
spark.executorEnv.PYTHONPATH /home/ysan/fr/jars/rapids-4-spark_2.12-24.08.1.jar
spark.jars /home/ysan/fr/jars/rapids-4-spark_2.12-24.08.1.jar
spark.rapids.ml.uvm.enabled true
spark.dynamicAllocation.enabled false
spark.executor.extraJavaOptions "-Duser.timezone=UTC"
spark.driver.extraJavaOptions "-Duser.timezone=UTC"
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.rapids.memory.gpu.pool NONE
spark.sql.execution.sortBeforeRepartition false
spark.rapids.sql.python.gpu.enabled true
spark.rapids.memory.pinnedPool.size 50g
spark.rapids.sql.batchSizeBytes 128m
spark.sql.adaptive.enabled false
spark.executorEnv.UCX_ERROR_SIGNALS ""
spark.executorEnv.UCX_MEMTYPE_CACHE n
spark.executorEnv.UCX_IB_RX_QUEUE_LEN 1024
spark.executorEnv.UCX_TLS cuda_copy,cuda_ipc,tcp,rc
spark.executorEnv.UCX_RNDV_SCHEME put_zcopy
spark.executorEnv.UCX_MAX_RNDV_RAILS 1
spark.executorEnv.NCCL_DEBUG INFO
spark.rapids.shuffle.manager com.nvidia.spark.rapids.spark351.RapidsShuffleManager
spark.rapids.shuffle.mode UCX
spark.shuffle.service.enabled false
spark.executorEnv.CUPY_CACHE_DIR /tmp/cupy_cache
spark.executorEnv.NCCL_DEBUG INFO
spark.submit.deployMode client
# The next four YARN container settings do nothing here; leftover from an unsuccessful attempt to use containers with YARN
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE docker
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE docker
spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.hadoop.fs.s3a.access.key [ACCESS KEY]
spark.hadoop.fs.s3a.secret.key [SECRET KEY]
spark.hadoop.fs.s3a.endpoint [ENDPOINT]
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.attempts.maximum 1
spark.hadoop.fs.s3a.connection.establish.timeout 1000000
spark.hadoop.fs.s3a.connection.timeout 1000000
spark.hadoop.fs.s3a.connection.request.timeout 0
spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDiretory file:///tmp/spark-events
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.logConf true
spark.executor.heartbeatInterval 1000000s
spark.network.timeout 10000001s
spark.sql.broadcastTimeout 1000000
spark.executorEnv.NCCL_DEBUG WARN # does not really do anything
spark.driverEnv.NCCL_DEBUG WARN # does not really do anything
spark.pyspark.python [PATH]/.venv/bin/python
spark.pyspark.driver.python [PATH]/.venv/bin/python
spark.sql.execution.arrow.maxRecordsPerBatch <value> # value depends on dataset (a sizing sketch follows this conf listing)
spark.driver.maxResultSize 0 # only for blob datasets
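On the arrow.maxRecordsPerBatch line above: one way to size it (an illustrative sketch, not necessarily how every run was configured) is from the row width and a target Arrow batch size:

def arrow_rows_per_batch(num_features: int,
                         bytes_per_value: int = 8,              # float64; use 4 for float32
                         target_batch_bytes: int = 128 * 1024 * 1024) -> int:
    """Pick a row count so a single Arrow batch stays near the target size."""
    return max(1, target_batch_bytes // (num_features * bytes_per_value))

# e.g. a dense 3000-feature float64 dataset:
print(arrow_rows_per_batch(3000))   # ~5592 rows per batch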
* I mentioned this on Slack: the number of iterations varies only when UVM is required, which does not happen with CPU-based Spark MLlib. From the answers I got on Slack, this is inevitable due to floating-point approximation, so I'll consider this workload as having no issues.
EDIT 1: I realized that the final Spark conf used by Spark differs from my conf file, since RAPIDS and Spark add some internal configs. This is what the final conf looks like for the failed KNN application running on 3 GPUs.
Final spark conf generated by RAPIDS and Spark
Name Value
spark.app.id app-20241118084727-0039
spark.app.initial.file.urls spark://master:37523/files/get_gpus_resources.rb
spark.app.initial.jar.urls spark://master:37523/jars/rapids-4-spark_2.12-24.08.1.jar
spark.app.name benchmark_runner.py
spark.app.startTime 1731887245183
spark.app.submitTime 1731887241110
spark.driver.extraJavaOptions -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false "-Duser.timezone=UTC"
spark.driver.host master
spark.driver.maxResultSize 0
spark.driver.memory 100g
spark.driver.port 37523
spark.driverEnv.NCCL_DEBUG WARN
spark.dynamicAllocation.enabled false
spark.eventLog.dir file:///tmp/spark-events
spark.eventLog.enabled true
spark.executor.cores 4
spark.executor.extraJavaOptions -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false "-Duser.timezone=UTC"
spark.executor.heartbeatInterval 1000000s
spark.executor.id driver
spark.executor.memory 50g
spark.executor.resource.gpu.amount 1
spark.executor.resource.gpu.discoveryScript ./get_gpus_resources.rb
spark.executorEnv.CUPY_CACHE_DIR /tmp/cupy_cache
spark.executorEnv.NCCL_DEBUG WARN
spark.executorEnv.PYTHONPATH [PATH]/rapids-4-spark_2.12-24.08.1.jar
spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE docker
spark.files file:///usr/lib/spark/scripts/gpu/get_gpus_resources.rb
spark.hadoop.fs.s3a.access.key *********(redacted)
spark.hadoop.fs.s3a.attempts.maximum 1
spark.hadoop.fs.s3a.connection.establish.timeout 1000000
spark.hadoop.fs.s3a.connection.request.timeout 0
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.connection.timeout 1000000
spark.hadoop.fs.s3a.endpoint http://[ENDPOINT]:9000
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.secret.key *********(redacted)
spark.history.fs.logDiretory file:///tmp/spark-events
spark.history.fs.update.interval 10s
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.jars [PATH]/rapids-4-spark_2.12-24.08.1.jar
spark.logConf true
spark.master spark://master:7077
spark.network.timeout 10000001s
spark.plugins com.nvidia.spark.SQLPlugin
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.driver.user.timezone Z
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.gpu.pool NONE
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.pinnedPool.size 50g
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.ml.uvm.enabled true
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.shuffle.manager com.nvidia.spark.rapids.spark351.RapidsShuffleManager
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.shuffle.mode MULTITHREADED
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.batchSizeBytes 128m
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.concurrentGpuTasks 2
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.multiThreadedRead.numThreads 20
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.python.gpu.enabled true
spark.pyspark.driver.python [PATH]/.venv/bin/python
spark.pyspark.python [PATH]/.venv/bin/python
spark.rapids.driver.user.timezone Z
spark.rapids.memory.gpu.pool NONE
spark.rapids.memory.pinnedPool.size 50g
spark.rapids.ml.uvm.enabled true
spark.rapids.shuffle.manager com.nvidia.spark.rapids.spark351.RapidsShuffleManager
spark.rapids.shuffle.mode MULTITHREADED
spark.rapids.sql.batchSizeBytes 128m
spark.rapids.sql.concurrentGpuTasks 2
spark.rapids.sql.multiThreadedRead.numThreads 20
spark.rapids.sql.python.gpu.enabled true
spark.rdd.compress True
spark.repl.local.jars file:///home/ysan/fr/jars/rapids-4-spark_2.12-24.08.1.jar
spark.scheduler.mode FIFO
spark.serializer.objectStreamReset 100
spark.shuffle.service.enabled false
spark.sql.adaptive.enabled false
spark.sql.broadcastTimeout 1000000
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.sql.execution.arrow.maxRecordsPerBatch 39993
spark.sql.execution.sortBeforeRepartition false
spark.sql.extensions com.nvidia.spark.rapids.SQLExecPlugin,com.nvidia.spark.udf.Plugin,com.nvidia.spark.rapids.optimizer.SQLOptimizerPlugin
spark.sql.files.maxPartitionBytes 512m
spark.submit.deployMode client
spark.submit.pyFiles
spark.task.cpus 1
spark.task.resource.gpu.amount 0.25
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE docker
EDITS 2-4: I am currently trying to see if upgrading to Apache Spark 3.5.3 (EDIT 3: Spark 3.5.2, since 24.10 does not support Spark 3.5.3) and RAPIDS 24.10 resolves some of the issues. I have just tried it on KNN and DBSCAN. KNN still gives me "ucp._libs.exceptions.UCXUnreachable: <stream_recv>:", but DBSCAN now gives a different error (java.lang.OutOfMemoryError: Java heap space). Two of the three workloads that failed at 8GB seem to work now, since they can handle 18GB datasets, but I still have to confirm.
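For completeness, the conf entries that change for the Spark 3.5.2 + RAPIDS 24.10 attempt look roughly like this (shown as a Python dict; the 24.10 jar file name is an assumption, and the shuffle manager class name simply follows the same per-Spark-version pattern as the spark351 one above):

# Conf keys changed when moving from 24.08 / Spark 3.5.1 to 24.10 / Spark 3.5.2.
# Jar path and version suffix are illustrative.
upgraded_conf = {
    "spark.jars": "/home/ysan/fr/jars/rapids-4-spark_2.12-24.10.1.jar",
    "spark.executorEnv.PYTHONPATH": "/home/ysan/fr/jars/rapids-4-spark_2.12-24.10.1.jar",
    "spark.rapids.shuffle.manager": "com.nvidia.spark.rapids.spark352.RapidsShuffleManager",
}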
Also, I forgot to mention this before: for PCA, generating the dataset can fail with a dimension-mismatch error for certain numbers of rows. Adding a message indicating that the failure is due to the number of rows might be a good idea.
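The kind of up-front check I have in mind would look something like this (a sketch only; the parameter names and the exact rows/columns relationship that triggers the failure are assumptions, since I have not pinned down the root cause):

def validate_pca_gen_shape(num_rows: int, num_cols: int, num_partitions: int) -> None:
    # Hypothetical check: fail early with a clear message when the requested row
    # count is too small for the requested dimensionality, instead of surfacing a
    # shape-mismatch error from the generator.
    rows_per_partition = num_rows // num_partitions
    if rows_per_partition < num_cols:
        raise ValueError(
            f"num_rows={num_rows} yields only {rows_per_partition} rows per partition, "
            f"fewer than num_cols={num_cols}; increase num_rows or reduce partitions."
        )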