
Commit fc71104

tye1, ganyi1996ppo, and jingxu10 authored
[Doc] Update releases.md for 1.13.10+xpu (#2142)
* Update releases.md
* Add known issue md link
* Add GPU known issues

Co-authored-by: Pleaplusone <[email protected]>
Co-authored-by: Jing Xu <[email protected]>
1 parent 7585497 commit fc71104

4 files changed: +119 -7 lines changed

docs/tutorials/AOT.md

Lines changed: 8 additions & 5 deletions
@@ -7,14 +7,17 @@ Ahead of Time (AOT) Compilation
 
 ## Use case
 
-Intel® Extension for PyTorch\* provides build option `USE_AOT_DEVLIST` for users who install Intel® Extension for PyTorch\* via source compilation to configure device list for AOT compilation. The target device in device list is specified by DEVICE type of the target. Multi-target AOT compilation is supported by using a comma (,) as a delimiter in device list. See below table for the AOT setting targeting [Intel® Data Center GPU Flex Series](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/data-center-gpu/flex-series/overview.html) & Intel® Arc™ series GPUs.
+Intel® Extension for PyTorch\* provides the build option `USE_AOT_DEVLIST` for users who install Intel® Extension for PyTorch\* via source compilation to configure the device list for AOT compilation. The target device in the device list is specified by the DEVICE type of the target. Multi-target AOT compilation is supported by using a comma (,) as a delimiter in the device list. See the table below for the AOT settings targeting [Intel® Data Center GPU Flex Series](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/data-center-gpu/flex-series/overview.html) & Intel® Arc™ A-Series GPUs.
 
 | Supported HW | AOT Setting |
-| ------------ |---------------------|
-| Intel® Data Center GPU Flex Series 170 and <BR> Intel® Data Center GPU Max Series | USE_AOT_DEVLIST='ats-m150,pvc' |
-| Intel® Arc™ Series | USE_AOT_DEVLIST='dg2-g10'.<br />Depending on the driver version, the AOT devlist string might be `dg2-g10-c0` or `dg2`.<br />Please try `dg2-g10` first. If you encounter an AOT-related build error, try one of the other two strings. |
+| ------------ | ----------- |
+| Intel® Data Center GPU Flex Series 170 | USE_AOT_DEVLIST='ats-m150' |
+| Intel® Data Center GPU Max Series | USE_AOT_DEVLIST='pvc' |
+| Intel® Arc™ A-Series | USE_AOT_DEVLIST='ats-m150' |
 
-Intel® Extension for PyTorch\* enables AOT compilation for Intel GPU target devices in prebuilt wheel files. Intel® Data Center GPU Flex Series 170 and Intel® Data Center GPU Max Series are the enabled target devices in the current release, with Intel® Arc™ series GPUs having experimental support. If Intel® Extension for PyTorch\* is executed on a device which is not pre-configured in `USE_AOT_DEVLIST`, the application can still run because JIT compilation will be triggered automatically to allow execution on the current device. This causes additional compilation time during execution.
+**Note:** Multiple AOT settings can be used together by separating the setting texts with a comma (,) so that the compiled wheel file supports multiple AOT targets. E.g. a wheel file built with `USE_AOT_DEVLIST='ats-m150,pvc'` has both `ats-m150` and `pvc` AOT enabled.
+
+Intel® Extension for PyTorch\* enables AOT compilation for Intel GPU target devices in prebuilt wheel files. Intel® Data Center GPU Flex Series 170 and Intel® Data Center GPU Max Series are the enabled target devices in the current release, with Intel® Arc™ A-Series GPUs having experimental support. If Intel® Extension for PyTorch\* is executed on a device which is not pre-configured in `USE_AOT_DEVLIST`, the application can still run because JIT compilation will be triggered automatically to allow execution on the current device. This causes additional compilation time during execution.
 
 For more GPU platforms, please refer to [Use AOT for Integrated Graphics (Intel GPU)](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top/compilation/ahead-of-time-compilation.html).
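As a small illustration of the JIT fallback described above, here is a minimal sketch, assuming the `torch.xpu` namespace the extension exposes after import. It only prints the detected device; if that device was not covered by `USE_AOT_DEVLIST` at build time, kernels are JIT-compiled on first use rather than failing.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the xpu device)

# Assumption: torch.xpu is available after importing the extension.
# If the device below was not in USE_AOT_DEVLIST when the wheel was built,
# the first kernel launches pay a one-time JIT compilation cost.
print(torch.xpu.get_device_name(0))
```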

docs/tutorials/features/DPC++_Extension.md

Lines changed: 0 additions & 2 deletions
@@ -204,9 +204,7 @@ Let’s go through the DPC++ code step by step:
 
 ```
 #include <torch/extension.h>
-
 #include <ipex.h>
-
 #include <vector>
 
 template <typename scalar_t>

docs/tutorials/performance_tuning/known_issues.md

Lines changed: 84 additions & 0 deletions
@@ -1,6 +1,90 @@
Known Issues
============

## GPU-Specific Known Issues

- [CRITICAL ERROR] Kernel 'XXX' removed due to usage of FP64 instructions unsupported by the targeted hardware

  FP64 is not natively supported by the [Intel® Data Center GPU Flex Series](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/data-center-gpu/flex-series/overview.html) platform. If you run an AI workload on that platform and receive this error message, it means a kernel requiring FP64 instructions was removed and not executed, so the accuracy of the whole workload is incorrect.
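The docs give no workaround here; as a hypothetical mitigation sketch (not from the original text), keeping the workload entirely in FP32 avoids emitting FP64 kernels on hardware without native FP64 support:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

# Hypothetical mitigation: keep modules and tensors in float32 so no
# FP64 kernels are required on platforms without native FP64 support.
model = torch.nn.Linear(64, 64).to("xpu", dtype=torch.float32)
x = torch.randn(8, 64, device="xpu", dtype=torch.float32)
y = model(x)  # runs entirely in FP32
```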
- Undefined symbol caused by `_GLIBCXX_USE_CXX11_ABI`

  ```bash
  ImportError: undefined symbol: _ZNK5torch8autograd4Node4nameB5cxx11Ev
  ```

  DPC++ does not support `_GLIBCXX_USE_CXX11_ABI=0`; Intel® Extension for PyTorch\* is always compiled with `_GLIBCXX_USE_CXX11_ABI=1`. This undefined-symbol issue appears when PyTorch\* is compiled with `_GLIBCXX_USE_CXX11_ABI=0`. Update the PyTorch\* CMake files to set `_GLIBCXX_USE_CXX11_ABI=1` and compile PyTorch\* with a compiler that supports `_GLIBCXX_USE_CXX11_ABI=1`. We recommend using the prebuilt wheels from the [download server](https://developer.intel.com/ipex-whl-stable-xpu) to avoid this issue.
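Before building the extension from source, you can check which ABI an installed PyTorch\* build uses; a small sketch using PyTorch\*'s public `torch.compiled_with_cxx11_abi()` helper:

```python
import torch

# True means PyTorch* was built with _GLIBCXX_USE_CXX11_ABI=1, which the
# DPC++-based extension requires; False predicts the undefined-symbol error.
print(torch.compiled_with_cxx11_abi())
```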
- Can't find the oneMKL library when building Intel® Extension for PyTorch\* without oneMKL

  ```bash
  /usr/bin/ld: cannot find -lmkl_sycl
  /usr/bin/ld: cannot find -lmkl_intel_ilp64
  /usr/bin/ld: cannot find -lmkl_core
  /usr/bin/ld: cannot find -lmkl_tbb_thread
  dpcpp: error: linker command failed with exit code 1 (use -v to see invocation)
  ```

  When PyTorch\* is built with the oneMKL library and Intel® Extension for PyTorch\* is built without it, this linker issue may occur. Resolve it by setting:

  ```bash
  export USE_ONEMKL=OFF
  export MKL_DPCPP_ROOT=${PATH_To_Your_oneMKL}/__release_lnx/mkl
  ```

  Then clean build Intel® Extension for PyTorch\*.

- undefined symbol: `mkl_lapack_dspevd`. Intel MKL FATAL ERROR: cannot load `libmkl_vml_avx512.so.2` or `libmkl_vml_def.so.2`

  This issue may occur when Intel® Extension for PyTorch\* is built with the oneMKL library and PyTorch\* is not built with any MKL library. The oneMKL kernel may incorrectly run on the CPU backend and trigger this issue. Resolve it by installing the MKL library from conda:

  ```bash
  conda install mkl
  conda install mkl-include
  ```

  Then clean build PyTorch\*.

- OSError: `libmkl_intel_lp64.so.1`: cannot open shared object file: No such file or directory

  The wrong MKL library is used when multiple MKL libraries exist in the system. Preload oneMKL by:

  ```bash
  export LD_PRELOAD=${MKL_DPCPP_ROOT}/lib/intel64/libmkl_intel_lp64.so.2:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_intel_ilp64.so.2:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_gnu_thread.so.2:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_core.so.2:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_sycl.so.2
  ```

  If you continue to see similar issues for other shared object files, add the corresponding files under `${MKL_DPCPP_ROOT}/lib/intel64/` to `LD_PRELOAD`. Note that the suffix of the libraries may change (e.g. from .1 to .2) if more than one oneMKL library is installed on the system.

- RuntimeError: Number of dpcpp devices should be greater than zero!

  Running some AI models (e.g. 3D-Unet inference) on Ubuntu 22.04 may trigger this runtime error, as oneAPI Base Toolkit 2023.0 fails to return an available GPU device on Ubuntu 22.04 in such a scenario. The workaround is to update the model script to make sure `import torch` and `import intel_extension_for_pytorch` happen before importing other libraries, as in the sketch below.
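A minimal sketch of that import-order workaround:

```python
# Workaround sketch: import torch and the extension first, before any
# other libraries the model script needs, so the GPU device stays visible.
import torch
import intel_extension_for_pytorch  # noqa: F401

import numpy as np  # noqa: F401  any other third-party imports come afterwards

print(torch.xpu.device_count())  # should be greater than zero
```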
- OpenMP library could not be found

  Building Intel® Extension for PyTorch\* on SLES15 SP3 with the default GCC 7.5, or on CentOS 8 with the default GCC 8.5, may trigger this build error.

  ```bash
  CMake Error at third_party/ideep/mkl-dnn/third_party/oneDNN/cmake/OpenMP.cmake:118 (message):
    OpenMP library could not be found.  Proceeding might lead to highly
    sub-optimal performance.
  Call Stack (most recent call first):
    third_party/ideep/mkl-dnn/third_party/oneDNN/CMakeLists.txt:117 (include)
  ```

  The root cause is that GCC 7.5 and 8.5 do not support the `-Wno-error=redundant-move` option. Uplifting to GCC version >= 9 solves this issue.

- Unit test failures on Intel® Data Center GPU Flex Series 170

  The following unit tests fail on Intel® Data Center GPU Flex Series 170 but pass on Intel® Data Center GPU Max Series. The root cause of the failures is under investigation.

  ```
  test_groupnorm.py::TestTorchMethod::test_group_norm_backward
  test_groupnorm_channels_last.py::TestTorchMethod::test_group_norm_backward
  test_fusion.py::TestNNMethod::test_conv_binary_mul
  ```

## CPU-Specific Known Issues

- If you find that a workload run with Intel® Extension for PyTorch\* occupies a remarkably large amount of memory, you can try to reduce the memory footprint by setting the `weights_prepack` parameter of the `ipex.optimize()` function to `False` (see the sketch after this list).

- Support of EmbeddingBag with INT8 when bag size > 1 is a work in progress.
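A minimal sketch of the `weights_prepack` setting referenced in the first item above, assuming the standard `ipex.optimize()` API:

```python
import torch
import intel_extension_for_pytorch as ipex

# Sketch: disable weight prepacking to reduce the memory footprint,
# at a possible cost in performance.
model = torch.nn.Linear(1024, 1024).eval()
model = ipex.optimize(model, weights_prepack=False)
```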

docs/tutorials/releases.md

Lines changed: 27 additions & 0 deletions
@@ -1,6 +1,33 @@
Releases
=============

## 1.13.10+xpu

Intel® Extension for PyTorch\* v1.13.10+xpu extends PyTorch\* 1.13 with up-to-date features and optimizations on `xpu` for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs, as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through the PyTorch\* `xpu` device, Intel® Extension for PyTorch\* provides easy GPU acceleration for Intel discrete GPUs with PyTorch\*.

### Highlights

This release introduces specific XPU solution optimizations on Intel discrete GPUs, which include Intel® Data Center GPU Flex Series 170 and Intel® Data Center GPU Max Series. Optimized operators and kernels are implemented and registered through the PyTorch\* dispatching mechanism for the `xpu` device. These operators and kernels are accelerated on Intel GPU hardware by the corresponding native vectorization and matrix calculation features. In graph mode, additional operator fusions are supported to reduce operator/kernel invocation overhead and thus increase performance.
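As a small illustration of that dispatching (a sketch, assuming a working `xpu` device): tensors placed on `xpu` route eager operators to the registered GPU kernels.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

a = torch.randn(128, 128, device="xpu")
b = torch.randn(128, 128, device="xpu")
c = torch.matmul(a, b)  # dispatched to the kernel registered for xpu
print(c.device)         # xpu:0
```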
This release provides the following features:
- Usability and Performance Features listed in the [Intel® Extension for PyTorch\* v1.13.0+cpu release](https://intel.github.io/intel-extension-for-pytorch/cpu/1.13.0+cpu/tutorials/releases.html#id1)
- Distributed Training
  - support of distributed training with DistributedDataParallel (DDP) on Intel GPU hardware
  - support of distributed training with Horovod (experimental) on Intel GPU hardware
- DLPack Solution
  - mechanism to share tensor data without copy when interoperating with other libraries on Intel GPU hardware (see the sketch after this list)
- Legacy Profiler Tool
  - an extension of the PyTorch\* legacy profiler for profiling operators' overhead on Intel GPU hardware
- Simple Trace Tool
  - built-in debugging tool to print out the call stack for a piece of code
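A minimal sketch of the DLPack mechanism mentioned in the list above, assuming the standard `torch.utils.dlpack` helpers work for `xpu` tensors in this release:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401
from torch.utils.dlpack import from_dlpack, to_dlpack

x = torch.randn(4, 4, device="xpu")
capsule = to_dlpack(x)    # export as a DLPack capsule, no copy
y = from_dlpack(capsule)  # a consumer reimports the same memory
y[0, 0] = 42.0
print(x[0, 0].item())     # 42.0 -- both views share storage
```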
This release adds the following fusion patterns in PyTorch\* JIT mode for Intel GPU:
- `Conv2D` + UnaryOp(`abs`, `sqrt`, `square`, `exp`, `log`, `round`, `GeLU`, `Log_Sigmoid`, `Hardswish`, `Mish`, `HardSigmoid`, `Tanh`, `Pow`, `ELU`, `hardtanh`)
- `Linear` + UnaryOp(`abs`, `sqrt`, `square`, `exp`, `log`, `round`, `Log_Sigmoid`, `Hardswish`, `HardSigmoid`, `Pow`, `ELU`, `SiLU`, `hardtanh`, `Leaky_relu`)
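A hedged sketch of how such a pattern is exercised: tracing a `Conv2D` + `ELU` block in JIT mode so the graph passes can fold the pair (whether the fusion actually fires depends on the device and build):

```python
import torch
import intel_extension_for_pytorch  # noqa: F401

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.ELU(),
).to("xpu").eval()

with torch.no_grad():
    example = torch.randn(1, 3, 32, 32, device="xpu")
    traced = torch.jit.trace(model, example)  # JIT mode enables fusion passes
    out = traced(example)
print(out.shape)
```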
### Known Issues

Please refer to the [Known Issues webpage](./performance_tuning/known_issues.md).

## 1.10.200+gpu

Intel® Extension for PyTorch\* v1.10.200+gpu extends PyTorch\* 1.10 with up-to-date features and optimizations on XPU for an extra performance boost on Intel Graphics cards. XPU is a user visible device that is a counterpart of the well-known CPU and CUDA in the PyTorch\* community. XPU represents an Intel-specific kernel and graph optimizations for various “concrete” devices. The XPU runtime will choose the actual device when executing AI workloads on the XPU device. The default selected device is Intel GPU. XPU kernels from Intel® Extension for PyTorch\* are written in [DPC++](https://github.com/intel/llvm#oneapi-dpc-compiler) that supports [SYCL language](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html) and also a number of [DPC++ extensions](https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions).
