Does DJL XGBoost support GPU-based inference? #3789

@suryakant261

Description

I tried running DJL XGBoost inference on GPU; however, it internally seems to be falling back to the CPU predictor.

Expected Behavior

It should have run on the GPU. The following flame graph shows it running on the CPU:

[flame graph screenshot]

How to Reproduce?

POM dependencies:

<dependency>
    <groupId>ai.djl.ml.xgboost</groupId>
    <artifactId>xgboost-gpu</artifactId>
    <version>0.29.0</version>
</dependency>
<dependency>
    <groupId>ai.djl.pytorch</groupId>
    <artifactId>pytorch-native-cu121</artifactId>
    <classifier>linux-x86_64</classifier>
    <scope>runtime</scope>
    <version>2.1.2</version>
</dependency>

Code block used for model initialization:

Path modelDir = Paths.get(modelInternalParametersFile).getParent();

try {
    Device device;
    Translator<DenseFeature[], BatchModelOutput> translator = new XGBoostTranslator();

    try {
        // Attempt to get the GPU device.
        device = Device.gpu();
        log.info("GPU device found. Initializing XGBoost model on GPU.");
    } catch (Exception e) {
        log.error("No GPU device found or DJL CUDA engine not configured. Falling back to CPU.", e);
        e.printStackTrace();
        device = Device.cpu();
    }

    // Use the Criteria API to load the model with our custom translator.
    Criteria<DenseFeature[], BatchModelOutput> criteria = Criteria.builder()
            .setTypes(DenseFeature[].class, BatchModelOutput.class)
            .optDevice(device)
            .optEngine("XGBoost")
            .optModelPath(modelDir)
            .optTranslator(translator)
            .optModelName(MODEL_INTERNAL_PARAMETER)
            .build();

    // Load the model and create a single, reusable, thread-safe predictor.
    this.model = criteria.loadModel();
    this.predictor = model.newPredictor(translator);
    log.info("DjlXGBoostModel initialized successfully on device: {}.", model.getNDManager().getDevice());

} catch (Exception e) {
    log.error("Failed to load DJL XGBoost model from path: {}", modelDir, e);
    throw new RuntimeException("Could not initialize DjlXGBoostModel", e);
}

This init block does not throw any exception or log any error message, which suggests the GPU device is being detected; a couple of extra checks I would add to confirm this are sketched below.
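For reference, a minimal diagnostic to confirm that the JVM actually sees the GPU and that the XGBoost engine is registered (assuming the standard ai.djl.engine.Engine and ai.djl.util.cuda.CudaUtils APIs) would look like this:

import ai.djl.engine.Engine;
import ai.djl.util.cuda.CudaUtils;

// Diagnostic only, not part of the original init code.
log.info("Registered engines: {}", Engine.getAllEngines());      // should contain "XGBoost"
log.info("GPUs visible via CudaUtils: {}", CudaUtils.getGpuCount());
log.info("GPUs visible to the XGBoost engine: {}", Engine.getEngine("XGBoost").getGpuCount());

If CudaUtils.getGpuCount() already reports 0 here, a fallback to the CPU predictor would be expected.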

Environment Info

CUDA toolkit: cuda-repo-debian12-12-8-local_12.8.0-570.86.10-1_amd64.deb.1
In the Dockerfile:
ENV PATH /usr/local/cuda-12.8/bin:/usr/local/nvidia/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/cuda-12.8/lib64:/usr/local/nvidia/lib64:${LD_LIBRARY_PATH}
nvidia-smi output:

[nvidia-smi output screenshot]

Within the same environment I am able to run the DJL PyTorch engine on GPU; however, XGBoost seems to be falling back to CPU.

I wanted to understand why I am not able to schedule XGBoost inference on the GPU. Does this indicate a potential bug, or a misconfiguration on our part?
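For completeness, in case the GPU predictor also needs to be requested through booster parameters (the XGBoost device / predictor parameters; I am not sure whether the DJL XGBoost engine forwards Criteria options to the booster), this is a sketch of what I would try:

// Hypothetical sketch: passing XGBoost booster parameters through Criteria options.
// Whether the DJL XGBoost engine forwards these options to the booster is an assumption.
Criteria<DenseFeature[], BatchModelOutput> criteria = Criteria.builder()
        .setTypes(DenseFeature[].class, BatchModelOutput.class)
        .optDevice(Device.gpu())
        .optEngine("XGBoost")
        .optModelPath(modelDir)
        .optTranslator(translator)
        .optModelName(MODEL_INTERNAL_PARAMETER)
        .optOption("device", "cuda")             // XGBoost >= 2.0 parameter name
        .optOption("predictor", "gpu_predictor") // legacy parameter name
        .build();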

Labels: bug (Something isn't working)