Currently, Triton requires that a specially patched version of
PyTorch be used with the PyTorch backend. The full source for
these PyTorch versions is available as Docker images from
[NGC](https://ngc.nvidia.com). For example, the PyTorch version
compatible with the 25.09 release of Triton is available as
`nvcr.io/nvidia/pytorch:25.09-py3`.

Copy over the LibTorch and Torchvision headers and libraries from the
[PyTorch NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)

complex execution modes and dynamic shapes. If not specified, all are enabled by default.

    `ENABLE_JIT_PROFILING`

### PyTorch 2.0 Models

The model repository should look like:

```bash
model_repository/
`-- model_directory
    |-- 1
    |   |-- model.py
    |   `-- [model.pt]
    `-- config.pbtxt
```
 | 261 | + | 
 | 262 | +The `model.py` contains the class definition of the PyTorch model.  | 
 | 263 | +The class should extend the  | 
 | 264 | +[`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module).  | 
 | 265 | +The `model.pt` may be optionally provided which contains the saved  | 
 | 266 | +[`state_dict`](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-for-inference)  | 
 | 267 | +of the model.  | 
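
As an illustration, a minimal `model.py` could look like the sketch below; the
class name, its two-input signature, and the toy arithmetic are assumptions
made for the example, since the backend only requires a class extending
`torch.nn.Module`:

```python
# model.py -- hypothetical minimal PyTorch 2.0 model definition.
# AddSubNet and its signature are illustrative; the only stated
# requirement is that the class extends torch.nn.Module.
import torch


class AddSubNet(torch.nn.Module):
    """Toy module returning the sum and difference of two tensors."""

    def forward(self, input0, input1):
        return input0 + input1, input0 - input1
```

If the model carries trained weights, the optional `model.pt` can be produced
offline with the standard `state_dict` workflow, e.g.
`torch.save(AddSubNet().state_dict(), "model.pt")`.
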
### TorchScript Models

The model repository should look like:

```bash
model_repository/
`-- model_directory
    |-- 1
    |   `-- model.pt
    `-- config.pbtxt
```

The `model.pt` is the TorchScript model file.
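
For reference, the TorchScript file is typically produced offline with
`torch.jit.script` or `torch.jit.trace`; a minimal sketch, reusing the
hypothetical toy module from the previous example:

```python
import torch


class AddSubNet(torch.nn.Module):  # toy module from the earlier sketch
    def forward(self, input0, input1):
        return input0 + input1, input0 - input1


# Compile the module to TorchScript and write the model.pt file that
# goes into the version directory.
scripted = torch.jit.script(AddSubNet())
scripted.save("model.pt")
```
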
### Customization

The following PyTorch settings may be customized by setting parameters in
`config.pbtxt`.

[`torch.set_num_threads(int)`](https://pytorch.org/docs/stable/generated/torch.set_num_threads.html#torch.set_num_threads)

* Key: `NUM_THREADS`
* Value: The number of threads used for intra-op parallelism on CPU.

[`torch.set_num_interop_threads(int)`](https://pytorch.org/docs/stable/generated/torch.set_num_interop_threads.html#torch.set_num_interop_threads)

* Key: `NUM_INTEROP_THREADS`
* Value: The number of threads used for inter-op parallelism (e.g. in the JIT
  interpreter) on CPU.
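
These two keys correspond to the standard PyTorch threading calls; a sketch of
the assumed effect of setting `NUM_THREADS` to `"4"` and `NUM_INTEROP_THREADS`
to `"2"` (the backend applies the settings itself at model load time):

```python
import torch

# Assumed effect of NUM_THREADS = "4" and NUM_INTEROP_THREADS = "2"
# in config.pbtxt; shown only to illustrate the underlying calls.
torch.set_num_interop_threads(2)  # must run before any inter-op work starts
torch.set_num_threads(4)
```
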
[`torch.compile()` parameters](https://pytorch.org/docs/stable/generated/torch.compile.html#torch-compile)

* Key: `TORCH_COMPILE_OPTIONAL_PARAMETERS`
* Value: Any of the following parameters, encoded as a JSON object.
  * `fullgraph` (`bool`): Whether to require the entire model to compile into a
    single graph, i.e. whether graph breaks are treated as errors.
  * `dynamic` (`bool`): Use dynamic shape tracing.
  * `backend` (`str`): The backend to be used.
  * `mode` (`str`): Can be either `"default"`, `"reduce-overhead"`, or `"max-autotune"`.
  * `options` (`dict`): A dictionary of options to pass to the backend.
  * `disable` (`bool`): Turn `torch.compile()` into a no-op for testing.

For example:

```proto
parameters: {
  key: "NUM_THREADS"
  value: { string_value: "4" }
}
parameters: {
  key: "TORCH_COMPILE_OPTIONAL_PARAMETERS"
  value: { string_value: "{\"disable\": true}" }
}
```
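
Conceptually, the JSON value of `TORCH_COMPILE_OPTIONAL_PARAMETERS` is decoded
into keyword arguments for `torch.compile()`; a hedged sketch of the assumed
equivalence for the example above (the backend's actual wiring is not shown
here):

```python
import json

import torch

# Decode the config value from the example above and pass it to
# torch.compile() as keyword arguments (assumed mapping).
params = json.loads('{"disable": true}')
compiled = torch.compile(torch.nn.Linear(4, 4), **params)
```
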
### Support

#### Model Instance Group Kind

to ensure that the model instance and the tensors used for inference are
assigned to the same GPU device as the one on which the model was traced.
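
As an illustration of keeping the traced device aligned with the instance
device, a hypothetical export step might pin everything to one GPU (the device
id, module, and shapes here are assumptions for the example):

```python
import torch

# Hypothetical export: trace on the same CUDA device the Triton model
# instance will use, so the device recorded in the trace matches it.
device = torch.device("cuda:0")
net = torch.nn.Linear(4, 2).to(device).eval()
example_input = torch.zeros(1, 4, device=device)
traced = torch.jit.trace(net, example_input)
traced.save("model.pt")
```
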
* Python functions optimizable by `torch.compile` may not be served directly in
  the `model.py` file; they need to be enclosed in a class extending
  [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module).
* Model weights cannot be shared across multiple instances on the same GPU
  device.
* When using `KIND_MODEL` as the model instance kind, the default device of the
  first parameter of the model is used (see the sketch below).
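
To make the last bullet concrete, the standard PyTorch idiom for reading the
device of a model's first parameter is shown below; the helper name is
hypothetical, only the idiom itself is standard:

```python
import torch


# Hypothetical helper: with KIND_MODEL, inputs are placed on the default
# device of the model's first parameter, i.e. the device returned here.
def first_param_device(model: torch.nn.Module) -> torch.device:
    return next(model.parameters()).device
```
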