Description
I have been trying to run the RAPIDS ML benchmarks since last year with moderate success. I am using RAPIDS ML 24.08 with Spark 3.5.1.
I can run 4 workloads without any issues (i.e., linear regression, logistic regression*, random_forest_classifier, random_forest_regressor), but I am having issues running the rest of the workloads.
For 3 workloads (i.e., KMeans, PCA, UMAP), I can run datasets up to ~5GB, but they start to fail for >=8GB datasets on a 16GB GPU with UVM enabled, with "GPU OutOfMemory: could not split inputs and retry" and occasionally a CPU OoM. This was mentioned in #680 and NVIDIA/spark-rapids#10567. I tried setting the number of concurrent GPU tasks to just 1, but that didn't help.
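For reference, this is roughly what that reduced-concurrency attempt looks like when expressed as PySpark session config (a minimal sketch; the app name and the batch size value are illustrative, not the exact settings from every run):

from pyspark.sql import SparkSession

# Minimal sketch: force a single concurrent GPU task and smaller batches while
# keeping UVM on, to check whether "could not split inputs and retry" goes away.
spark = (
    SparkSession.builder
    .appName("kmeans-uvm-oom-check")                      # hypothetical app name
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.ml.uvm.enabled", "true")        # UVM on, as in my runs
    .config("spark.rapids.sql.concurrentGpuTasks", "1")   # one task on the GPU at a time
    .config("spark.task.resource.gpu.amount", "1")        # one task per GPU instead of 0.25
    .config("spark.rapids.sql.batchSizeBytes", "64m")     # smaller batches (assumed value)
    .getOrCreate()
)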
For KNN, I get an error when I use multiple GPUs, mainly "ucp._libs.exceptions.UCXUnreachable: <stream_recv>:", even when I set the shuffle manager mode to multi-threaded. This does not happen with the other workloads despite the same configuration.
In 24.06, I was able to run KNN on 2-4 GPUs, but I had issues with datasets that require UVM on 3 and 4 GPUs. With 3 GPUs, I got "ucp._libs.exceptions.UCXUnreachable: <stream_recv>:" only when UVM was required, but the application was able to finish after multiple retries. With 4 GPUs, I got KeyError: '.' when UVM was required.
For DBSCAN, I am getting java.lang.ClassCastException: class com.nvidia.spark.rapids.SerializedTableColumn cannot be cast to class com.nvidia.spark.rapids.GpuColumnVector (com.nvidia.spark.rapids.SerializedTableColumn and com.nvidia.spark.rapids.GpuColumnVector are in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @2fb266a)
I haven't tried out ANN yet.
My config depends on the dataset size and the workload; this is the file I used for one of the workloads.
Spark conf
# spark.master yarn
spark.master spark://master:7077
# spark.rapids.sql.concurrentGpuTasks 1
spark.rapids.sql.concurrentGpuTasks 2
spark.driver.memory 100g
spark.executor.memory 50g
spark.executor.cores 4
spark.executor.resource.gpu.amount 1
spark.task.cpus 1
spark.task.resource.gpu.amount 0.25
spark.rapids.memory.pinnedPool.size 50G
# spark.sql.files.maxPartitionBytes 128m
spark.sql.files.maxPartitionBytes 512m
spark.plugins com.nvidia.spark.SQLPlugin
spark.executor.resource.gpu.discoveryScript ./get_gpus_resources.rb
spark.executorEnv.PYTHONPATH /home/ysan/fr/jars/rapids-4-spark_2.12-24.08.1.jar
spark.jars /home/ysan/fr/jars/rapids-4-spark_2.12-24.08.1.jar
spark.rapids.ml.uvm.enabled true
spark.dynamicAllocation.enabled false
spark.executor.extraJavaOptions "-Duser.timezone=UTC"
spark.driver.extraJavaOptions "-Duser.timezone=UTC"
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.rapids.memory.gpu.pool NONE
spark.sql.execution.sortBeforeRepartition false
spark.rapids.sql.python.gpu.enabled true
spark.rapids.memory.pinnedPool.size 50g
spark.rapids.sql.batchSizeBytes 128m
spark.sql.adaptive.enabled false
spark.executorEnv.UCX_ERROR_SIGNALS ""
spark.executorEnv.UCX_MEMTYPE_CACHE n
spark.executorEnv.UCX_IB_RX_QUEUE_LEN 1024
spark.executorEnv.UCX_TLS cuda_copy,cuda_ipc,tcp,rc
spark.executorEnv.UCX_RNDV_SCHEME put_zcopy
spark.executorEnv.UCX_MAX_RNDV_RAILS 1
spark.executorEnv.NCCL_DEBUG INFO
spark.rapids.shuffle.manager com.nvidia.spark.rapids.spark351.RapidsShuffleManager
spark.rapids.shuffle.mode UCX
spark.shuffle.service.enabled false
spark.executorEnv.CUPY_CACHE_DIR /tmp/cupy_cache
spark.executorEnv.NCCL_DEBUG INFO
spark.submit.deployMode client
# The next four YARN container settings do nothing here; leftover from an unsuccessful attempt to use containers with YARN
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE docker
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE docker
spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.hadoop.fs.s3a.access.key [ACCESS KEY]
spark.hadoop.fs.s3a.secret.key [SECRET KEY]
spark.hadoop.fs.s3a.endpoint [ENDPOINT]
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.attempts.maximum 1
spark.hadoop.fs.s3a.connection.establish.timeout 1000000
spark.hadoop.fs.s3a.connection.timeout 1000000
spark.hadoop.fs.s3a.connection.request.timeout 0
spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDiretory file:///tmp/spark-events
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.logConf true
spark.executor.heartbeatInterval 1000000s
spark.network.timeout 10000001s
spark.sql.broadcastTimeout 1000000
spark.executorEnv.NCCL_DEBUG WARN # does not really do anything
spark.driverEnv.NCCL_DEBUG WARN # does not really do anything
spark.pyspark.python [PATH]/.venv/bin/python
spark.pyspark.driver.python [PATH]/.venv/bin/python
spark.sql.execution.arrow.maxRecordsPerBatch <value> # value depends on dataset (a sizing sketch follows this conf listing)
spark.driver.maxResultSize 0 # only for blob datasets
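On the arrow.maxRecordsPerBatch line above: one way to size it (an illustrative sketch, not necessarily how every run was configured) is from the row width and a target Arrow batch size:

def arrow_rows_per_batch(num_features: int,
                         bytes_per_value: int = 8,              # float64; use 4 for float32
                         target_batch_bytes: int = 128 * 1024 * 1024) -> int:
    """Pick a row count so a single Arrow batch stays near the target size."""
    return max(1, target_batch_bytes // (num_features * bytes_per_value))

# e.g. a dense 3000-feature float64 dataset:
print(arrow_rows_per_batch(3000))   # ~5592 rows per batch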
* I mentioned this on Slack: the number of iterations varies only when UVM is required, which does not happen with CPU-based Spark MLlib. From the answers I got on Slack, this is inevitable due to floating-point approximation, so I'll consider this workload as having no issues.
EDIT 1: I realized that the final Spark conf used by Spark differs from my conf file, since RAPIDS and Spark add some internal configs. This is what the final conf looks like for the failed KNN application running on 3 GPUs.
Final spark conf generated by RAPIDS and Spark
Name Value
spark.app.id app-20241118084727-0039
spark.app.initial.file.urls spark://master:37523/files/get_gpus_resources.rb
spark.app.initial.jar.urls spark://master:37523/jars/rapids-4-spark_2.12-24.08.1.jar
spark.app.name benchmark_runner.py
spark.app.startTime 1731887245183
spark.app.submitTime 1731887241110
spark.driver.extraJavaOptions -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false "-Duser.timezone=UTC"
spark.driver.host master
spark.driver.maxResultSize 0
spark.driver.memory 100g
spark.driver.port 37523
spark.driverEnv.NCCL_DEBUG WARN
spark.dynamicAllocation.enabled false
spark.eventLog.dir file:///tmp/spark-events
spark.eventLog.enabled true
spark.executor.cores 4
spark.executor.extraJavaOptions -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false "-Duser.timezone=UTC"
spark.executor.heartbeatInterval 1000000s
spark.executor.id driver
spark.executor.memory 50g
spark.executor.resource.gpu.amount 1
spark.executor.resource.gpu.discoveryScript ./get_gpus_resources.rb
spark.executorEnv.CUPY_CACHE_DIR /tmp/cupy_cache
spark.executorEnv.NCCL_DEBUG WARN
spark.executorEnv.PYTHONPATH [PATH]/rapids-4-spark_2.12-24.08.1.jar
spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE docker
spark.files file:///usr/lib/spark/scripts/gpu/get_gpus_resources.rb
spark.hadoop.fs.s3a.access.key *********(redacted)
spark.hadoop.fs.s3a.attempts.maximum 1
spark.hadoop.fs.s3a.connection.establish.timeout 1000000
spark.hadoop.fs.s3a.connection.request.timeout 0
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.connection.timeout 1000000
spark.hadoop.fs.s3a.endpoint http://[ENDPOINT]:9000
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.secret.key *********(redacted)
spark.history.fs.logDiretory file:///tmp/spark-events
spark.history.fs.update.interval 10s
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.jars [PATH]/rapids-4-spark_2.12-24.08.1.jar
spark.logConf true
spark.master spark://master:7077
spark.network.timeout 10000001s
spark.plugins com.nvidia.spark.SQLPlugin
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.driver.user.timezone Z
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.gpu.pool NONE
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.memory.pinnedPool.size 50g
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.ml.uvm.enabled true
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.shuffle.manager com.nvidia.spark.rapids.spark351.RapidsShuffleManager
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.shuffle.mode MULTITHREADED
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.batchSizeBytes 128m
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.concurrentGpuTasks 2
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.multiThreadedRead.numThreads 20
spark.plugins.internal.conf.com.nvidia.spark.SQLPlugin.spark.rapids.sql.python.gpu.enabled true
spark.pyspark.driver.python [PATH]/.venv/bin/python
spark.pyspark.python [PATH]/.venv/bin/python
spark.rapids.driver.user.timezone Z
spark.rapids.memory.gpu.pool NONE
spark.rapids.memory.pinnedPool.size 50g
spark.rapids.ml.uvm.enabled true
spark.rapids.shuffle.manager com.nvidia.spark.rapids.spark351.RapidsShuffleManager
spark.rapids.shuffle.mode MULTITHREADED
spark.rapids.sql.batchSizeBytes 128m
spark.rapids.sql.concurrentGpuTasks 2
spark.rapids.sql.multiThreadedRead.numThreads 20
spark.rapids.sql.python.gpu.enabled true
spark.rdd.compress True
spark.repl.local.jars file:///home/ysan/fr/jars/rapids-4-spark_2.12-24.08.1.jar
spark.scheduler.mode FIFO
spark.serializer.objectStreamReset 100
spark.shuffle.service.enabled false
spark.sql.adaptive.enabled false
spark.sql.broadcastTimeout 1000000
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.sql.execution.arrow.maxRecordsPerBatch 39993
spark.sql.execution.sortBeforeRepartition false
spark.sql.extensions com.nvidia.spark.rapids.SQLExecPlugin,com.nvidia.spark.udf.Plugin,com.nvidia.spark.rapids.optimizer.SQLOptimizerPlugin
spark.sql.files.maxPartitionBytes 512m
spark.submit.deployMode client
spark.submit.pyFiles
spark.task.cpus 1
spark.task.resource.gpu.amount 0.25
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE docker
EDITS 2-4: I am currently trying to see if upgrading to Apache Spark 3.5.3 (EDIT 3: Spark 3.5.2, since 24.10 does not support Spark 3.5.3) and RAPIDS 24.10 resolves some of the issues. I have just tried it on KNN and DBSCAN. KNN still gives me "ucp._libs.exceptions.UCXUnreachable: <stream_recv>:", but DBSCAN now gives a different error (java.lang.OutOfMemoryError: Java heap space). Two of the three workloads that failed at 8GB seem to work now, since they can handle 18GB datasets, but I still have to confirm.
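For completeness, the conf entries that change for the Spark 3.5.2 + RAPIDS 24.10 attempt look roughly like this (shown as a Python dict; the 24.10 jar file name is an assumption, and the shuffle manager class name simply follows the same per-Spark-version pattern as the spark351 one above):

# Conf keys changed when moving from 24.08 / Spark 3.5.1 to 24.10 / Spark 3.5.2.
# Jar path and version suffix are illustrative.
upgraded_conf = {
    "spark.jars": "/home/ysan/fr/jars/rapids-4-spark_2.12-24.10.1.jar",
    "spark.executorEnv.PYTHONPATH": "/home/ysan/fr/jars/rapids-4-spark_2.12-24.10.1.jar",
    "spark.rapids.shuffle.manager": "com.nvidia.spark.rapids.spark352.RapidsShuffleManager",
}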
Also, I forgot to mention this before: for PCA, generating the dataset can fail with a dimension-mismatch error for certain numbers of rows. Adding a message indicating that the failure is due to the number of rows might be a good idea.
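The kind of up-front check I have in mind would look something like this (a sketch only; the parameter names and the exact rows/columns relationship that triggers the failure are assumptions, since I have not pinned down the root cause):

def validate_pca_gen_shape(num_rows: int, num_cols: int, num_partitions: int) -> None:
    # Hypothetical check: fail early with a clear message when the requested row
    # count is too small for the requested dimensionality, instead of surfacing a
    # shape-mismatch error from the generator.
    rows_per_partition = num_rows // num_partitions
    if rows_per_partition < num_cols:
        raise ValueError(
            f"num_rows={num_rows} yields only {rows_per_partition} rows per partition, "
            f"fewer than num_cols={num_cols}; increase num_rows or reduce partitions."
        )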