@@ -221,6 +221,31 @@ see the `DJL Serving Documentation on Python Mode. <https://docs.djl.ai/docs/ser

For more information about DJL Serving, see the `DJL Serving documentation <https://docs.djl.ai/docs/serving/index.html>`_.

+ **************************
+ Ahead-of-time partitioning
+ **************************
+
+ To optimize the deployment of large models that do not fit in a single GPU, the model’s tensor weights are partitioned at
+ runtime and each partition is loaded onto an individual GPU. However, runtime partitioning adds a significant amount of time
+ and memory overhead when the model is loaded. To avoid this, DJLModel offers an ahead-of-time partitioning capability for the
+ DeepSpeed and FasterTransformer engines, which lets you partition your model weights and save them before deployment. The
+ HuggingFace engine does not support tensor parallelism, so ahead-of-time partitioning cannot be done for it. In our experiment
+ with the GPT-J model, loading the model from partitioned checkpoints reduced the model loading time by 40%.
+
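+ As a rough sketch of how this fits together, the example below creates a ``DJLModel`` configured for tensor parallelism
+ before partitioning. The S3 path, IAM role, and keyword arguments (``dtype``, ``number_of_partitions``) are illustrative
+ assumptions; check the ``DJLModel`` API reference for the exact parameters supported by your SDK version.
+
+ .. code::
+
+     from sagemaker.djl_inference import DJLModel
+
+     # Illustrative values only: replace the S3 location and role with your own.
+     djl_model = DJLModel(
+         "s3://my-bucket/gpt-j-6b/",       # uncompressed model artifacts in S3 (assumed path)
+         "my-sagemaker-execution-role",    # IAM role with SageMaker permissions (assumed)
+         dtype="fp16",                     # load the weights in half precision (assumed keyword)
+         number_of_partitions=4,           # split the tensor weights across 4 GPUs (assumed keyword)
+     )
+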
+ The ``partition`` method invokes an Amazon SageMaker training job to partition the model and uploads the partitioned
+ checkpoints to an S3 bucket. You can either provide your desired S3 bucket for the partitioned checkpoints, or they will be
+ uploaded to the default SageMaker S3 bucket. Note that this S3 location is remembered for deployment: when you call the
+ ``deploy`` method after partitioning, DJL Serving downloads the partitioned model checkpoints directly from the uploaded
+ S3 URL, if available.
+
+ .. code::
+
+     # Partition the model using an Amazon SageMaker training job.
+     djl_model.partition("ml.g5.12xlarge")
+
+     # Deploy the model from the partitioned checkpoints.
+     predictor = djl_model.deploy("ml.g5.12xlarge",
+                                  initial_instance_count=1)
+
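+ If you want the partitioned checkpoints uploaded to a specific S3 location instead of the default SageMaker bucket, you
+ can pass that location to ``partition``. The ``s3_output_uri`` keyword and the bucket path below are assumptions for this
+ sketch; consult the ``DJLModel.partition`` API reference for the exact parameter name in your SDK version.
+
+ .. code::
+
+     # Assumed keyword and bucket path, shown to illustrate choosing where the checkpoints are uploaded.
+     djl_model.partition(
+         "ml.g5.12xlarge",
+         s3_output_uri="s3://my-bucket/aot-partitioned-checkpoints/",
+     )
+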
***********************
SageMaker DJL Classes
***********************