
Commit aa30ce1

trivialfish and hcho3 authored
[backport][pyspark] Improve tutorial on enabling GPU support. (dmlc#8385) [skip ci] (dmlc#8391)
- Quote the databricks doc on how to manage dependencies.
- Some wording changes.

Co-authored-by: Philip Hyunsu Cho <[email protected]>
1 parent 153d995 commit aa30ce1

File tree: 1 file changed, +60 −35 lines


doc/tutorials/spark_estimator.rst

Lines changed: 60 additions & 35 deletions
@@ -83,17 +83,52 @@ generate result dataset with 3 new columns:
 XGBoost PySpark GPU support
 ***************************

-XGBoost PySpark supports GPU training and prediction. To enable GPU support, first you
-need to install the XGBoost and the `cuDF <https://docs.rapids.ai/api/cudf/stable/>`_
-package. Then you can set `use_gpu` parameter to `True`.
+XGBoost PySpark fully supports GPU acceleration. Users are not only able to enable
+efficient training but also utilize their GPUs for the whole PySpark pipeline including
+ETL and inference. In the sections below, we walk through an example of training on a
+Spark standalone GPU cluster. To get started, we first need to install some additional
+packages; then we can set the ``use_gpu`` parameter to ``True``.

-Below tutorial demonstrates how to train a model with XGBoost PySpark GPU on Spark
-standalone cluster.
+Prepare the necessary packages
+==============================
+
+Aside from the PySpark and XGBoost modules, we also need the `cuDF
+<https://docs.rapids.ai/api/cudf/stable/>`_ package for handling Spark DataFrames. We
+recommend using either Conda or Virtualenv to manage Python dependencies for PySpark
+jobs. Please refer to `How to Manage Python Dependencies in PySpark
+<https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_
+for more details on PySpark dependency management.
+
+In short, to create a Python environment that can be sent to a remote cluster using
+virtualenv and pip:
+
+.. code-block:: bash
+
+  python -m venv xgboost_env
+  source xgboost_env/bin/activate
+  pip install pyarrow pandas venv-pack xgboost
+  # https://rapids.ai/pip.html#install
+  pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+  venv-pack -o xgboost_env.tar.gz
+
+With Conda:
+
+.. code-block:: bash
+
+  conda create -y -n xgboost_env -c conda-forge conda-pack python=3.9
+  conda activate xgboost_env
+  # use conda when the supported version of xgboost (1.7) is released on conda-forge
+  pip install xgboost
+  conda install cudf pyarrow pandas -c rapids -c nvidia -c conda-forge
+  conda pack -f -o xgboost_env.tar.gz


 Write your PySpark application
 ==============================

+The snippet below is a small example of training an XGBoost model with PySpark. Notice
+that we are using a list of feature names and the additional parameter ``use_gpu``:
+
 .. code-block:: python

   from xgboost.spark import SparkXGBRegressor
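The Python snippet referenced by the new paragraph is cut off at the hunk boundary above. As a hedged sketch only, a training script along those lines might look as follows; the column names, toy DataFrame, and ``num_workers`` value are assumptions for illustration, while ``features_col`` (accepting a list of column names), ``label_col``, and ``use_gpu`` are the parameters the tutorial highlights:

.. code-block:: python

  from pyspark.sql import SparkSession
  from xgboost.spark import SparkXGBRegressor

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical toy data: three numeric feature columns and a label column.
  train_df = spark.createDataFrame(
      [(1.0, 2.0, 3.0, 10.0), (4.0, 5.0, 6.0, 20.0)],
      ["f1", "f2", "f3", "label"],
  )
  test_df = train_df

  regressor = SparkXGBRegressor(
      features_col=["f1", "f2", "f3"],  # a list of feature column names
      label_col="label",
      num_workers=1,
      use_gpu=True,  # the parameter this tutorial section is about
  )
  model = regressor.fit(train_df)
  predict_df = model.transform(test_df)
  predict_df.show()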
@@ -127,26 +162,11 @@ Write your PySpark application
   predict_df = model.transform(test_df)
   predict_df.show()

-Prepare the necessary packages
-==============================
-
-We recommend using Conda or Virtualenv to manage python dependencies
-in PySpark. Please refer to
-`How to Manage Python Dependencies in PySpark <https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_.
-
-.. code-block:: bash
-
-  conda create -y -n xgboost-env -c conda-forge conda-pack python=3.9
-  conda activate xgboost-env
-  pip install xgboost
-  conda install cudf -c rapids -c nvidia -c conda-forge
-  conda pack -f -o xgboost-env.tar.gz
-

 Submit the PySpark application
 ==============================

-Assuming you have configured your Spark cluster with GPU support, if not yet, please
+We assume you have configured your Spark cluster with GPU support; otherwise, please
 refer to `spark standalone configuration with GPU support <https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster>`_.

 .. code-block:: bash
@@ -158,10 +178,13 @@ refer to `spark standalone configuration with GPU support <https://nvidia.github
   --master spark://<master-ip>:7077 \
   --conf spark.executor.resource.gpu.amount=1 \
   --conf spark.task.resource.gpu.amount=1 \
-  --archives xgboost-env.tar.gz#environment \
+  --archives xgboost_env.tar.gz#environment \
   xgboost_app.py


+The submit command sends the Python environment created by pip or conda along with the
+specification of GPU allocation. We will revisit this command later on.
+
 Model Persistence
 =================

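The first lines of the submit command fall outside the hunk above. A plausible complete invocation, assuming the packed archive is unpacked on the executors as ``environment`` and used as their Python interpreter (the two ``export`` lines are a common venv-pack/conda-pack convention and an assumption here, not part of the diff):

.. code-block:: bash

  export PYSPARK_DRIVER_PYTHON=python
  export PYSPARK_PYTHON=./environment/bin/python

  spark-submit \
    --master spark://<master-ip>:7077 \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --archives xgboost_env.tar.gz#environment \
    xgboost_app.py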
@@ -186,26 +209,27 @@ To export the underlying booster model used by XGBoost:
   # the same booster object returned by xgboost.train
   booster: xgb.Booster = model.get_booster()
   booster.predict(...)
-  booster.save_model("model.json")
+  booster.save_model("model.json")  # or model.ubj, depending on your choice of format.

-This booster is shared by other Python interfaces and can be used by other language
-bindings like the C and R packages. Lastly, one can extract a booster file directly from
-saved spark estimator without going through the getter:
+This booster is not only shared by other Python interfaces but also used by all the
+XGBoost bindings, including the C, Java, and R packages. Lastly, one can extract the
+booster file directly from a saved spark estimator without going through the getter:

 .. code-block:: python

   import xgboost as xgb
   bst = xgb.Booster()
+  # Load the model saved in the previous snippet.
   bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")

-Accelerate the whole pipeline of xgboost pyspark
-================================================

-With `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_,
-you can accelerate the whole pipeline (ETL, Train, Transform) for xgboost pyspark
-without any code change by leveraging GPU.
+Accelerate the whole pipeline for xgboost pyspark
+=================================================

-Below is a simple example submit command for enabling GPU acceleration:
+With `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_, you
+can leverage GPUs to accelerate the whole pipeline (ETL, Train, Transform) for xgboost
+pyspark without any Python code change. An example submit command is shown below with
+additional spark configurations and dependencies:

 .. code-block:: bash

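The load path in the hunk above implies that the trained Spark model was saved beforehand. A minimal sketch of producing and then reading that file, assuming the standard MLlib-style writer on the fitted model and that the booster is written under ``model/part-00000`` as the tutorial's path suggests:

.. code-block:: python

  import xgboost as xgb

  # Persist the fitted SparkXGBRegressorModel with the MLlib-style writer
  # (path mirrors the one used in the tutorial).
  model.write().overwrite().save("/tmp/xgboost-pyspark-model")

  # Read the raw booster file back with the core XGBoost API.
  bst = xgb.Booster()
  bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")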
@@ -219,8 +243,9 @@ Below is a simple example submit command for enabling GPU acceleration:
   --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
   --conf spark.plugins=com.nvidia.spark.SQLPlugin \
   --conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
-  --archives xgboost-env.tar.gz#environment \
+  --archives xgboost_env.tar.gz#environment \
   xgboost_app.py

-When rapids plugin is enabled, both of the JVM rapids plugin and the cuDF Python are
-required for the acceleration.
+When the RAPIDS plugin is enabled, both the JVM RAPIDS plugin and the cuDF Python package
+are required. More configuration options, along with details on the plugin, can be found
+in the RAPIDS link above.
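As with the earlier submit command, only the tail of the RAPIDS-enabled invocation appears in the hunk. A hedged sketch of the complete command, combining the GPU resource settings shown earlier with the RAPIDS options from this hunk (the head of the command is an assumption, not part of the diff):

.. code-block:: bash

  spark-submit \
    --master spark://<master-ip>:7077 \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
    --archives xgboost_env.tar.gz#environment \
    xgboost_app.py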
