
Commit 337ee78

wbo4958 and trivialfis authored
[jvm-packages] Supports external memory (dmlc#11186)
--------- Co-authored-by: Jiaming Yuan <[email protected]>
1 parent 688c2f5 commit 337ee78

File tree

37 files changed: +1273 -322 lines


CMakeLists.txt

Lines changed: 11 additions & 2 deletions
@@ -112,10 +112,19 @@ option(ADD_PKGCONFIG "Add xgboost.pc into system." ON)
 if(USE_DEBUG_OUTPUT AND (NOT (CMAKE_BUILD_TYPE MATCHES Debug)))
   message(SEND_ERROR "Do not enable `USE_DEBUG_OUTPUT' with release build.")
 endif()
-if(USE_NCCL AND NOT (USE_CUDA))
+if(USE_NVTX AND (NOT USE_CUDA))
+  message(SEND_ERROR "`USE_NVTX` must be enabled with `USE_CUDA` flag.")
+endif()
+if(USE_NVTX)
+  if(CMAKE_VERSION VERSION_LESS "3.25.0")
+    # CUDA:nvtx3 target is added in 3.25
+    message("cmake >= 3.25 is required for NVTX.")
+  endif()
+endif()
+if(USE_NCCL AND (NOT USE_CUDA))
   message(SEND_ERROR "`USE_NCCL` must be enabled with `USE_CUDA` flag.")
 endif()
-if(USE_DEVICE_DEBUG AND NOT (USE_CUDA))
+if(USE_DEVICE_DEBUG AND (NOT USE_CUDA))
   message(SEND_ERROR "`USE_DEVICE_DEBUG` must be enabled with `USE_CUDA` flag.")
 endif()
 if(BUILD_WITH_SHARED_NCCL AND (NOT USE_NCCL))

demo/rmm_plugin/README.rst

Lines changed: 2 additions & 1 deletion
@@ -58,7 +58,8 @@ Since with RMM the memory pool is pre-allocated on a specific device, changing the
 device ordinal in XGBoost can result in memory error ``cudaErrorIllegalAddress``. Use the
 ``CUDA_VISIBLE_DEVICES`` environment variable instead of the ``device="cuda:1"`` parameter
 for selecting device. For distributed training, the distributed computing frameworks like
-``dask-cuda`` are responsible for device management.
+``dask-cuda`` are responsible for device management. For Scala-Spark, see
+:doc:`/jvm/xgboost4j_spark_gpu_tutorial` for more info.
 
 ************************
 Memory Over-Subscription

doc/build.rst

Lines changed: 4 additions & 2 deletions
@@ -394,7 +394,8 @@ Additional System-dependent Features
 - OpenMP on MacOS: See :ref:`running_cmake_and_build` for installing ``openmp``. The flag
   ``mvn -Duse.openmp=OFF`` can be used to disable OpenMP support.
 - GPU support can be enabled by passing an additional flag to maven ``mvn -Duse.cuda=ON
-  install``. See :ref:`build_gpu_support` for more info.
+  install``. See :ref:`build_gpu_support` for more info. In addition, ``-Dplugin.rmm=ON``
+  can enable the optional RMM support.
 
 **************************
 Building the Documentation
@@ -414,4 +415,5 @@ build it locally, you need a installed XGBoost with all its dependencies along with
 
 Under ``xgboost/doc`` directory, run ``make <format>`` with ``<format>`` replaced by the
 format you want. For a list of supported formats, run ``make help`` under the same
-directory.
+directory. This builds a partial document for Python but not other language bindings. To
+build the full document, see :doc:`/contrib/docs`.

doc/jvm/java_intro.rst

Lines changed: 2 additions & 2 deletions
@@ -127,7 +127,7 @@ With parameters and data, you are able to train a booster model.
 
 .. code-block:: java
 
-  booster.saveModel("model.bin");
+  booster.saveModel("model.json");
 
 * Generating model dump with feature map
 
@@ -142,7 +142,7 @@ With parameters and data, you are able to train a booster model.
 
 .. code-block:: java
 
-  Booster booster = XGBoost.loadModel("model.bin");
+  Booster booster = XGBoost.loadModel("model.json");
 
 **********
 Prediction
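
For context, XGBoost's native library picks the model serialization format from the file suffix, so the tutorial's switch from ``model.bin`` to ``model.json`` selects the JSON format. A minimal round-trip sketch using the xgboost4j Scala binding; the training file path and parameters are illustrative, not part of this commit:

  import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

  object SaveLoadRoundTrip {
    def main(args: Array[String]): Unit = {
      // Hypothetical LIBSVM training file; replace with a real path.
      val train = new DMatrix("train.libsvm?format=libsvm")
      val params = Map("objective" -> "binary:logistic", "max_depth" -> 3)
      val booster = XGBoost.train(train, params, 10)

      // A ".json" suffix selects the JSON model format, hence the
      // tutorial's change from "model.bin" to "model.json".
      booster.saveModel("model.json")

      // Reload and predict with the restored booster.
      val restored = XGBoost.loadModel("model.json")
      val preds: Array[Array[Float]] = restored.predict(train)
    }
  }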

doc/jvm/xgboost4j_spark_gpu_tutorial.rst

Lines changed: 23 additions & 0 deletions
@@ -259,3 +259,26 @@ For details about other ``RAPIDS Accelerator`` other configurations, please refer
 
 For ``RAPIDS Accelerator Frequently Asked Questions``, please refer to the
 `frequently-asked-questions <https://docs.nvidia.com/spark-rapids/user-guide/latest/faq.html>`_.
+
+***********
+RMM Support
+***********
+
+.. versionadded:: 3.0
+
+When compiled with the RMM plugin (see :doc:`/build`), the XGBoost spark package can reuse
+the RMM memory pool automatically based on `spark.rapids.memory.gpu.pooling.enabled` and
+`spark.rapids.memory.gpu.pool`. Please note that both submit options need to be set
+accordingly. In addition, XGBoost employs NCCL for GPU communication, which requires some
+GPU memory for communication buffers, so one should not let RMM take all the available
+memory. Example configuration related to the memory pool:
+
+.. code-block:: bash
+
+  spark-submit \
+    --master $master \
+    --conf spark.rapids.memory.gpu.allocFraction=0.5 \
+    --conf spark.rapids.memory.gpu.maxAllocFraction=0.8 \
+    --conf spark.rapids.memory.gpu.pool=ARENA \
+    --conf spark.rapids.memory.gpu.pooling.enabled=true \
+    ...

doc/jvm/xgboost4j_spark_tutorial.rst

Lines changed: 27 additions & 0 deletions
@@ -561,3 +561,30 @@ An equivalent way is to pass in parameters in XGBoostClassifier's constructor:
 
 If the training failed during these 100 rounds, the next run of training would start by reading the latest checkpoint
 file in ``/checkpoints_path`` and start from the iteration when the checkpoint was built until to next failure or the specified 100 rounds.
+
+
+***************
+External Memory
+***************
+
+.. versionadded:: 3.0
+
+.. warning::
+
+  The feature is experimental.
+
+Here we refer to the iterator-based external memory instead of the one that uses special
+URL parameters. XGBoost-Spark has experimental support for GPU-based external memory
+training (:doc:`/jvm/xgboost4j_spark_gpu_tutorial`) since 3.0. When it's used in
+combination with GPU-based training, data is first cached on disk and then staged in CPU
+memory. See :doc:`/tutorials/external_memory` for the general concept and best practices
+for external memory training. In addition, see the doc string of the estimator parameter
+``useExternalMemory``. With Spark estimators:
+
+.. code-block:: scala
+
+  val xgbClassifier = new XGBoostClassifier(xgbParam)
+    .setFeaturesCol(featuresNames)
+    .setLabelCol(labelName)
+    .setUseExternalMemory(true)
+    .setDevice("cuda") // CPU is not yet supported

jvm-packages/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ build.sh
 xgboost4j-tester/pom.xml
 xgboost4j-tester/iris.csv
 dependency-reduced-pom.xml
+.factorypath

jvm-packages/create_jni.py

Lines changed: 7 additions & 0 deletions
@@ -73,6 +73,10 @@ def native_build(cli_args: argparse.Namespace) -> None:
         os.environ["JAVA_HOME"] = (
             subprocess.check_output("/usr/libexec/java_home").strip().decode()
         )
+    if cli_args.use_debug == "ON":
+        CONFIG["CMAKE_BUILD_TYPE"] = "Debug"
+    CONFIG["USE_NVTX"] = cli_args.use_nvtx
+    CONFIG["PLUGIN_RMM"] = cli_args.plugin_rmm
 
     print("building Java wrapper", flush=True)
     with cd(".."):
@@ -187,5 +191,8 @@ def native_build(cli_args: argparse.Namespace) -> None:
     )
     parser.add_argument("--use-cuda", type=str, choices=["ON", "OFF"], default="OFF")
     parser.add_argument("--use-openmp", type=str, choices=["ON", "OFF"], default="ON")
+    parser.add_argument("--use-debug", type=str, choices=["ON", "OFF"], default="OFF")
+    parser.add_argument("--use-nvtx", type=str, choices=["ON", "OFF"], default="OFF")
+    parser.add_argument("--plugin-rmm", type=str, choices=["ON", "OFF"], default="OFF")
     cli_args = parser.parse_args()
     native_build(cli_args)

jvm-packages/pom.xml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -57,6 +57,9 @@
5757
<log.capi.invocation>OFF</log.capi.invocation>
5858
<use.cuda>OFF</use.cuda>
5959
<use.openmp>ON</use.openmp>
60+
<use.debug>OFF</use.debug>
61+
<use.nvtx>OFF</use.nvtx>
62+
<plugin.rmm>OFF</plugin.rmm>
6063
<cudf.version>24.10.0</cudf.version>
6164
<spark.rapids.version>24.10.0</spark.rapids.version>
6265
<spark.rapids.classifier>cuda12</spark.rapids.classifier>

jvm-packages/xgboost4j-spark-gpu/src/main/java/ml/dmlc/xgboost4j/java/CudfColumnBatch.java

Lines changed: 11 additions & 0 deletions
@@ -86,6 +86,17 @@ private List<CudfColumn> initializeCudfColumns(Table table) {
         .collect(Collectors.toList());
   }
 
+  // visible for testing
+  public Table getFeatureTable() {
+    return featureTable;
+  }
+
+  // visible for testing
+  public Table getLabelTable() {
+    return labelTable;
+  }
+
   public List<CudfColumn> getFeatures() {
     return features;
   }
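
The two getters above exist so tests can reach the underlying cudf ``Table`` objects. A sketch of how a test might exercise them; the four-argument constructor shape (features, label, weight, base margin) is an assumption carried over from earlier releases, the ``Float`` overload of ``Table.TestBuilder.column`` is likewise assumed, and building tables requires a GPU at runtime:

  import ai.rapids.cudf.Table
  import ml.dmlc.xgboost4j.java.CudfColumnBatch

  // Tiny single-column tables built with the cudf-java test helper.
  val features: Table = new Table.TestBuilder().column(1f: java.lang.Float, 2f, 3f).build()
  val labels: Table = new Table.TestBuilder().column(0f: java.lang.Float, 1f, 0f).build()

  // Assumed constructor shape: (featureTable, labelTable, weightTable, baseMarginTable).
  val batch = new CudfColumnBatch(features, labels, null, null)

  // The new accessors return the exact tables the batch was built from.
  assert(batch.getFeatureTable eq features)
  assert(batch.getLabelTable eq labels)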
