
Commit aa30ce1

trivialfish and hcho3 authored
[backport][pyspark] Improve tutorial on enabling GPU support. (dmlc#8385) [skip ci] (dmlc#8391)
- Quote the databricks doc on how to manage dependencies.
- Some wording changes.

Co-authored-by: Philip Hyunsu Cho <[email protected]>
1 parent 153d995 commit aa30ce1

File tree: 1 file changed, +60 −35 lines


doc/tutorials/spark_estimator.rst

Lines changed: 60 additions & 35 deletions
@@ -83,17 +83,52 @@ generate result dataset with 3 new columns:
 XGBoost PySpark GPU support
 ***************************

-XGBoost PySpark supports GPU training and prediction. To enable GPU support, first you
-need to install the XGBoost and the `cuDF <https://docs.rapids.ai/api/cudf/stable/>`_
-package. Then you can set `use_gpu` parameter to `True`.
+XGBoost PySpark fully supports GPU acceleration. Users are not only able to enable
+efficient training but also utilize their GPUs for the whole PySpark pipeline including
+ETL and inference. In the sections below, we walk through an example of training on a
+Spark standalone GPU cluster. To get started, we first need to install some additional
+packages; then we can set the ``use_gpu`` parameter to ``True``.

-Below tutorial demonstrates how to train a model with XGBoost PySpark GPU on Spark
-standalone cluster.
+Prepare the necessary packages
+==============================
+
+Aside from the PySpark and XGBoost modules, we also need the `cuDF
+<https://docs.rapids.ai/api/cudf/stable/>`_ package for handling Spark DataFrames. We
+recommend using either Conda or Virtualenv to manage Python dependencies for PySpark
+jobs. Please refer to `How to Manage Python Dependencies in PySpark
+<https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_
+for more details on PySpark dependency management.
+
+In short, to create a Python environment that can be sent to a remote cluster using
+virtualenv and pip:
+
+.. code-block:: bash
+
+  python -m venv xgboost_env
+  source xgboost_env/bin/activate
+  pip install pyarrow pandas venv-pack xgboost
+  # https://rapids.ai/pip.html#install
+  pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+  venv-pack -o xgboost_env.tar.gz
+
+With Conda:
+
+.. code-block:: bash
+
+  conda create -y -n xgboost_env -c conda-forge conda-pack python=3.9
+  conda activate xgboost_env
+  # use conda when the supported version of xgboost (1.7) is released on conda-forge
+  pip install xgboost
+  conda install cudf pyarrow pandas -c rapids -c nvidia -c conda-forge
+  conda pack -f -o xgboost_env.tar.gz


 Write your PySpark application
 ==============================

+The snippet below is a small example of training an XGBoost model with PySpark. Notice
+that we are using a list of feature names and the additional parameter ``use_gpu``:
+
 .. code-block:: python

   from xgboost.spark import SparkXGBRegressor
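The Python snippet referenced by the new paragraph is cut off at the hunk boundary above. As a hedged sketch only, a training script along those lines might look as follows; the column names, toy DataFrame, and ``num_workers`` value are assumptions for illustration, while ``features_col`` (accepting a list of column names), ``label_col``, and ``use_gpu`` are the parameters the tutorial highlights:

.. code-block:: python

  from pyspark.sql import SparkSession
  from xgboost.spark import SparkXGBRegressor

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical toy data: three numeric feature columns and a label column.
  train_df = spark.createDataFrame(
      [(1.0, 2.0, 3.0, 10.0), (4.0, 5.0, 6.0, 20.0)],
      ["f1", "f2", "f3", "label"],
  )
  test_df = train_df

  regressor = SparkXGBRegressor(
      features_col=["f1", "f2", "f3"],  # a list of feature column names
      label_col="label",
      num_workers=1,
      use_gpu=True,  # the parameter this tutorial section is about
  )
  model = regressor.fit(train_df)
  predict_df = model.transform(test_df)
  predict_df.show()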
@@ -127,26 +162,11 @@ Write your PySpark application
   predict_df = model.transform(test_df)
   predict_df.show()

-Prepare the necessary packages
-==============================
-
-We recommend using Conda or Virtualenv to manage python dependencies
-in PySpark. Please refer to
-`How to Manage Python Dependencies in PySpark <https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_.
-
-.. code-block:: bash
-
-  conda create -y -n xgboost-env -c conda-forge conda-pack python=3.9
-  conda activate xgboost-env
-  pip install xgboost
-  conda install cudf -c rapids -c nvidia -c conda-forge
-  conda pack -f -o xgboost-env.tar.gz
-

 Submit the PySpark application
 ==============================

-Assuming you have configured your Spark cluster with GPU support, if not yet, please
+We assume you have configured your Spark cluster with GPU support; otherwise, please
 refer to `spark standalone configuration with GPU support <https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster>`_.

 .. code-block:: bash
@@ -158,10 +178,13 @@ refer to `spark standalone configuration with GPU support <https://nvidia.github
   --master spark://<master-ip>:7077 \
   --conf spark.executor.resource.gpu.amount=1 \
   --conf spark.task.resource.gpu.amount=1 \
-  --archives xgboost-env.tar.gz#environment \
+  --archives xgboost_env.tar.gz#environment \
   xgboost_app.py


+The submit command sends the Python environment created by pip or conda along with the
+specification of GPU allocation. We will revisit this command later on.
+
 Model Persistence
 =================

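The first lines of the submit command fall outside the hunk above. A plausible complete invocation, assuming the packed archive is unpacked on the executors as ``environment`` and used as their Python interpreter (the two ``export`` lines are a common venv-pack/conda-pack convention and an assumption here, not part of the diff):

.. code-block:: bash

  export PYSPARK_DRIVER_PYTHON=python
  export PYSPARK_PYTHON=./environment/bin/python

  spark-submit \
    --master spark://<master-ip>:7077 \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --archives xgboost_env.tar.gz#environment \
    xgboost_app.py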
@@ -186,26 +209,27 @@ To export the underlying booster model used by XGBoost:
   # the same booster object returned by xgboost.train
   booster: xgb.Booster = model.get_booster()
   booster.predict(...)
-  booster.save_model("model.json")
+  booster.save_model("model.json")  # or model.ubj, depending on your choice of format.

-This booster is shared by other Python interfaces and can be used by other language
-bindings like the C and R packages. Lastly, one can extract a booster file directly from
-saved spark estimator without going through the getter:
+This booster is not only shared by other Python interfaces but also used by all the
+XGBoost bindings, including the C, Java, and R packages. Lastly, one can extract the
+booster file directly from a saved spark estimator without going through the getter:

 .. code-block:: python

   import xgboost as xgb
   bst = xgb.Booster()
+  # Load the model saved in the previous snippet.
   bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")

-Accelerate the whole pipeline of xgboost pyspark
-================================================

-With `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_,
-you can accelerate the whole pipeline (ETL, Train, Transform) for xgboost pyspark
-without any code change by leveraging GPU.
+Accelerate the whole pipeline for xgboost pyspark
+=================================================

-Below is a simple example submit command for enabling GPU acceleration:
+With `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_, you
+can leverage GPUs to accelerate the whole pipeline (ETL, Train, Transform) for xgboost
+pyspark without any Python code change. An example submit command is shown below with
+additional spark configurations and dependencies:

 .. code-block:: bash

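The load path in the hunk above implies that the trained Spark model was saved beforehand. A minimal sketch of producing and then reading that file, assuming the standard MLlib-style writer on the fitted model and that the booster is written under ``model/part-00000`` as the tutorial's path suggests:

.. code-block:: python

  import xgboost as xgb

  # Persist the fitted SparkXGBRegressorModel with the MLlib-style writer
  # (path mirrors the one used in the tutorial).
  model.write().overwrite().save("/tmp/xgboost-pyspark-model")

  # Read the raw booster file back with the core XGBoost API.
  bst = xgb.Booster()
  bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")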
@@ -219,8 +243,9 @@ Below is a simple example submit command for enabling GPU acceleration:
   --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
   --conf spark.plugins=com.nvidia.spark.SQLPlugin \
   --conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
-  --archives xgboost-env.tar.gz#environment \
+  --archives xgboost_env.tar.gz#environment \
   xgboost_app.py

-When rapids plugin is enabled, both of the JVM rapids plugin and the cuDF Python are
-required for the acceleration.
+When the RAPIDS plugin is enabled, both the JVM RAPIDS plugin and the cuDF Python package
+are required. More configuration options, along with details on the plugin, can be found
+in the RAPIDS link above.
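As with the earlier submit command, only the tail of the RAPIDS-enabled invocation appears in the hunk. A hedged sketch of the complete command, combining the GPU resource settings shown earlier with the RAPIDS options from this hunk (the head of the command is an assumption, not part of the diff):

.. code-block:: bash

  spark-submit \
    --master spark://<master-ip>:7077 \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
    --archives xgboost_env.tar.gz#environment \
    xgboost_app.py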
