Siwasaki/pr/libstdcxxfs by shintaro-iwasaki · Pull Request #17 · shintaro-iwasaki/triton

shintaro-iwasaki · 2023-01-05T22:33:34Z

No description provided.

I suspect this was the cause of the "new compiles even on a warm cache" behavior I was seeing, though I haven't 100% confirmed it. Python `set()` iteration order can differ each time you create a new process, so the same args could produce different `instance_descriptor`s and cause false cache misses.
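A minimal sketch of the effect described above, assuming the set holds strings (whose hashes depend on `PYTHONHASHSEED`). The argument names and the helper `run_with_seed` are hypothetical and are not taken from Triton's code; the point is only that an order-dependent key can vary between processes while a sorted key cannot.

```python
import os
import subprocess
import sys

# Child program: prints the raw set order and a sorted (stable) order.
# The names in the set are made up for illustration.
CHILD = (
    "names = {'x_ptr', 'y_ptr', 'out_ptr', 'n_elements', 'stride_m', 'stride_n'}\n"
    "print(tuple(names))\n"
    "print(tuple(sorted(names)))\n"
)


def run_with_seed(seed):
    """Run CHILD in a fresh interpreter with a fixed string-hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    lines = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return lines[0], lines[1]


raw_keys, stable_keys = set(), set()
for seed in range(1, 6):
    raw, stable = run_with_seed(seed)
    raw_keys.add(raw)
    stable_keys.add(stable)

# The raw order usually differs between seeds; the sorted order never does.
print("distinct raw orders:", len(raw_keys))
print("distinct sorted orders:", len(stable_keys))
```

Building a cache key from `sorted(...)` (or from an ordered container) is one way to keep it stable across processes.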


Fix two problems in libdevice and external dispatch:

1. Use static triton types (e.g., `tl.int32`) instead of creating new types. Otherwise, `tl.int32` and `tl.dtype('int32')` are not the same thing.
2. The name of an extern inst should be empty rather than the symbol name of the inst; the TTIR generator will assign names automatically. Otherwise, multiple identical extern insts end up with the same variable name.

Before the PR:

```bash
__nv_exp = extern_elementwise f64<1024> %11;
__nv_exp = extern_elementwise f64<1024> %11;
```

After the PR:

```bash
%12 = extern_elementwise f64<1024> %11;
%13 = extern_elementwise f64<1024> %11;
```


- Unifying several interfaces with different types into a single one, e.g. `fsub_ru` and `dsub_ru` -> `sub_ru`;
- Minor bug fix: `fast_pow` was incorrectly classified into the `pow` interface, whose arguments are the same as `powf`'s;
- Explicit interfaces for casting functions, e.g. decoupling `ll2float_ru` into `ll2float_ru` and `ull2float_ru`;
- Removing interfaces that are not in NVIDIA's official documents, e.g. `fmaf_ieee_rn`, which is confusing alongside `fmaf_rn`.

Note that this PR for the master branch is different from triton-lang#829, which is for the MLIR branch.

This PR clarifies which features are supported on P100 via its tests, though Pascal is not officially and fully supported by Triton.

## What this PR does

- Skip unsupported tests on P100.
  - Atomic RMW
  - `tl.dot()` (perhaps not all patterns, but basically most `tl.dot()` tests do not work on P100).
- Add an explicit error if shared memory size >= 64K on P100.
  - Otherwise it causes `Invalid CUDA argument` error at `cuLaunchKernel()`, but this error is not very straightforward to understand. Instead of this generic CUDA argument error, this PR makes Triton show an error during codegen when `sm < 70`. This check happens in C/C++ so won't add an overhead in Triton's Python runtime.
  - 3 tests (see below) are currently failing, but these are not marked as skipped because any codegen update in the future can change the kernel size of the other tests.
  - This change won't affect Triton-MLIR. Hopefully Triton-MLIR's generic `tl.dot()` implementation would support P100.

Importantly, Triton passed all the other tests on P100. Though this support is not official, it is great for, for example, PyTorch's TorchDynamo/Inductor, which can use Triton (without `tl.dot()`) for its backend (https://github.com/pytorch/torchdynamo/issues/1591).

### Results on P100 (Google Cloud)

```sh
$ pytest test/unit
...
================================ short test summary info ================================
FAILED test/unit/language/test_core.py::test_reduce2d[argmin-float32-shape99-1] - RuntimeError: Device does not support shared memory of 65536bytes
FAILED test/unit/language/test_core.py::test_reduce2d[argmax-float32-shape113-1] - RuntimeError: Device does not support shared memory of 65536bytes
FAILED test/unit/language/test_core.py::test_permute[float32-shape5-perm5] - RuntimeError: Device does not support shared memory of 67584bytes
================ 3 failed, 3824 passed, 952 skipped in 470.90s (0:07:50) ================
```

<details><summary> <b>Environment Details (collapsed)</b></summary>
<p>

### VM details (Google Cloud)

https://cloud.google.com/

```
# You need a paid account (free trial does not cover GPUs)
Google Cloud -> New Project -> Compute-Engine -> VM Instance
  Machine:
    GPU: NVIDIA Tesla P100 x 1
    CPU: 2 vCPUs, 7.5GB memory
  Boot disk:
    OS: Ubuntu 18.04 LTS
    Disk: 40GB (cannot build Triton on the default 10GB disk)
- When I tried, about $1.2 per hour.
- US instances were full when I tried. I used Asia or Australia.
- Needed a paid account (GPU is not covered by free trial)
- Needed quota request for any GPU instance (by default, no GPU instance is allowed). Needed to wait an hour for approval
```

### Reproducer

```sh
## 1. Install CUDA and a driver
# Update the apt key (https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/)
sudo apt-key del 7fa2af80
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-keyring_1.0-1_all.deb
# Download CUDA as instructed
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
# Are you using P100?
nvidia-smi | grep "Tesla P100"

## 2. Setup the build environment
sudo apt update
sudo apt install -y build-essential wget git libz-dev
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh -b -p $(pwd)/anaconda3
eval "$($(pwd)/anaconda3/bin/conda shell.bash hook)"
conda create -y --name triton_base
conda activate triton_base
conda install -y cmake setuptools

## 3. Build Triton
git clone https://github.com/openai/triton.git
cd triton/python
pip3 install -e '.[tests]'

## 4. Test
pytest test/unit
```

### Environment

```sh
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    25W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```

</p></details>
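The shared-memory guard described above can be sketched in Python as follows. This is only an illustration: the PR states that the real check lives in C/C++ codegen, and the function name `check_shared_memory` and its parameters are hypothetical. The message format mirrors the `RuntimeError` text in the test summary.

```python
# Threshold taken from the PR description: sizes >= 64K are rejected when sm < 70.
SHARED_MEMORY_LIMIT_BYTES = 64 * 1024


def check_shared_memory(requested_bytes, sm):
    """Raise early instead of failing later at kernel launch.

    `sm` is the compute capability as an integer (e.g. 60 for P100).
    Hypothetical helper; not Triton's actual implementation.
    """
    if sm < 70 and requested_bytes >= SHARED_MEMORY_LIMIT_BYTES:
        raise RuntimeError(
            f"Device does not support shared memory of {requested_bytes}bytes"
        )


check_shared_memory(49152, sm=60)       # below the limit: accepted
try:
    check_shared_memory(65536, sm=60)   # the size reported by the failing tests
except RuntimeError as err:
    print(err)
```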

For stupid reasons, ops on int8 are 3 times slower than on int, and for another set of stupid reasons we are not using cudaMemset for `zero_`, so using an `int8` buffer in `do_bench` makes it slow.

Co-authored-by: Philippe Tillet <phil@openai.com>


…atch (triton-lang#1004)

This PR merges the `triton-mlir` branch, in which we have been quietly rewriting the Triton backend from scratch to increase maintainability, stability and ultimately performance. Changes to the runtime are minimal, and this new version aims to remain backward-compatible with the previous commit. The legacy backend is now officially deprecated, but can still be accessed via the `legacy-backend` tag.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
Co-authored-by: Yan Da <dyanab@connect.ust.hk>
Co-authored-by: Jun Yang <yangjunpro@gmail.com>
Co-authored-by: Ian Bearman <ianb@microsoft.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
Co-authored-by: Qingyi Liu <qingyil@nvidia.com>
Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <lyricz@yeah.net>
Co-authored-by: ben-zhang-609 <benzh609@gmail.com>
Co-authored-by: dongdongl <dongdongl@nvidia.com>


…ang#1014) Continues the work in triton-lang#990.

# Background

The `versionMinor` in MmaEncodingAttr holds some states of DotOp's operands in Volta, while such operands will be modified by some patterns, making the states out-of-date. This PR helps to correct the states.

# Implementation

It adds three new patterns:

1. `CollectMmaToUpdateForVolta` helps to collect and build a map holding the MmaEncodingAttr instances with wrong states and create new correct ones for them,
2. `UpdateMMAVersionMinorForVolta` helps to replace the Ops generating the wrong MmaEncodingAttr instances with new correct ones. Currently it supports the following Ops:
   a. `convert_layout[X -> mma]`
   b. `arith.constant SplatAttr : !tensor<mma>`
   c. `dot ... : !tensor<mma>`

# Limitation

This PR chooses the mapping approach to bypass the IR walk complexity arising from the circular dependency between dot_operand[parent] and mma. We use the MmaEncodingAttr instance as the mapping key, but there might be multiple DotOps holding different DotOperand(IsMMAv1Row) values that have the same wrong MmaEncodingAttr instance. To make each DotOp's (wrong) MmaEncodingAttr unique, we might need to add an ID field to MmaEncodingAttr.

@LyricZhao

…finement (triton-lang#1018)

1. Add an explicit value cache when emitting the indices calculation;
2. Move the logic that emits the indices calculation into ConvertTritonGPUOpToLLVMPatternBase to avoid the redundant build cost caused by templates.

Refer to the discussion in this thread by @LyricZhao: https://triton-lang.slack.com/archives/C042VBSQWNS/p1671336755922969


jansel and others added 30 commits September 23, 2022 21:44

[BUILD] Download pybind11 in setup.py (triton-lang#703)

3ac929b

Based on the discussion in triton-lang#700, this PR enables downloading pybind11 in `setup.py` without `git submodule` instead of copy-pasting pybind11 code. The downloaded pybind11 will be in `~/.triton/pybind` (like `llvm`).

[FRONTEND] Make triton.compile work without a cuda context (triton-lang#708)

998fd5f

This allows compiling in a subprocess. I'm not seeing a ton of speedup from this, but figure it is a good change anyway.

[TESTING] allclose fixup (triton-lang#724)

dad9752

[FRONTEND] Make sure to hold the gil when creating python objects (triton-lang#726)

f45e31b

Without this patch, a debug version of python complains that:

```
Fatal Python error: Python memory allocator called without holding the GIL
Python runtime state: initialized
```

[CODEGEN] Fix an inliner to call a function with a phi-node (triton-lang#727)

ae59f51

[CODEGEN] Fix extract_N_bufferable in layout analysis (triton-lang#728)

7b61303

[TUTORIALS] Attention tutorial fixup

b244db0

[FRONTEND] fix broadcasting for where (triton-lang#729)

f55960e

Fixes triton-lang#532: all three inputs to `where` have to be broadcast together.
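To illustrate what "broadcast together" means for three operands, here is a small shape-only sketch. It assumes NumPy-style broadcasting rules for illustration; `broadcast_shape` is a hypothetical helper and not part of Triton's API.

```python
from itertools import zip_longest


def broadcast_shape(*shapes):
    """Return the common shape of all operands under NumPy-style rules."""
    result = []
    # Align shapes on their trailing dimension; missing leading dims act as 1.
    for dims in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        sizes = {d for d in dims if d != 1}
        if len(sizes) > 1:
            raise ValueError(f"cannot broadcast shapes {shapes}")
        result.append(sizes.pop() if sizes else 1)
    return tuple(reversed(result))


# condition, x and y are resolved jointly rather than pairwise:
# a (4, 1) condition, a (1, 8) x and a scalar y all end up as (4, 8).
print(broadcast_shape((4, 1), (1, 8), ()))  # (4, 8)
```

Resolving only `x` against `y` here would give `(1, 8)`, which no longer matches the shape the condition needs; computing one common shape for all three avoids that mismatch.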

[RUNTIME] Dump llvm, ttir, and sass to help debugging (triton-lang#732)

4a2d3b7

[RUNTIME] Support environments with multiple cudalibs (triton-lang#733)

2b0f877

[FRONTEND] properly broadcast scalar where condition (triton-lang#736)

d3c925d

[RUNTIME] remove fixed cu_include_dir (triton-lang#739)

77c752d

Use the environment variable `CUDA_HOME`, with default value `/usr/local/cuda`, for `cu_include_dir` (triton-lang#731).
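The lookup described above amounts to an environment read with a fallback. A minimal sketch, assuming the include directory is the `include` subfolder of the CUDA root (that suffix and the function signature are assumptions, not taken from the commit):

```python
import os


def cu_include_dir(environ=os.environ):
    """Resolve the CUDA include directory from CUDA_HOME, with a default."""
    cuda_home = environ.get("CUDA_HOME", "/usr/local/cuda")
    # Assumption: headers live under <CUDA_HOME>/include.
    return os.path.join(cuda_home, "include")


print(cu_include_dir({}))                               # falls back to the default root
print(cu_include_dir({"CUDA_HOME": "/opt/cuda-11.8"}))  # honors the override
```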

[RUNTIME] Fixed JIT bug that led some constexpr values to be overridden by specialization parameters (triton-lang#742)

bdfdb9a

[RUNTIME] Add callback functions for external tools (triton-lang#738)

11345e9

[DOCS] Fixed typos in 01-vector-add.py (triton-lang#751)

9a11a56

[RUNTIME] support multiple devices in the same process (triton-lang#757)

5d4b26d

[FRONTEND] Fix a bool tensor storing problem (triton-lang#746)

09cc2d4

[RUNTIME] Make entry point cache key depend on triton version hash (triton-lang#765)

af76c98

[DRIVER] Bumped CUDA requirement to 11.4+. This is to avoid bad performance surprises as older `ptxas` are much slower. (triton-lang#769)

33e6f0d

This also makes codegen simpler by avoiding special handling of eviction policies.

[IR] Added special-purpose dequantize instruction (triton-lang#759)

71b46ac

It is currently necessary for optimal performance in quantized workloads to add a special-purpose instruction in the IR. Backward compatibility with this instruction is *NOT* guaranteed.

[Backend] Fix for mov.u8 (triton-lang#766)

bc98aea

Initial potential fix for `mov.u8`, which PTX does not support for now: use `mov.u16` instead and cast the result to u8.

[DOCS] Update installation.rst to fix windows build error (triton-lang#747)

ddae106

Add bf16/fp16/fp64 support for ty_to_cpp (triton-lang#800)

5ca1ed0

In `torch._inductor`, we [convert 0d CPU tensors to scalars during Triton codegen](pytorch/pytorch#87329), so we need to add the missing Triton support for bf16/fp16/fp64.

[Frontend] Return a scalar if all input args are scalar (triton-lang#816)

3ca667d

[BUILD] Now using cibuildwheel default

584086f

Revert "[BUILD] Now using cibuildwheel default"

6311d70

This reverts commit 584086f.

[DOCS] Add install from source instructions to README (triton-lang#821)

578ada7

shintaro-iwasaki and others added 19 commits November 3, 2022 00:11

[FRONTEND] Fix ExternLibrary(format=) bug; type annotate build_extern.py (triton-lang#883)

0e4691e

Ran mypy over `build_extern.py` and cleaned up type annotations. Found and fixed a bug where `ExternLibrary(format=)` was being ignored.

Fix format double substitution bug: {i} => {{i}} (triton-lang#886)

44f5779

The previous `{i}` was silently expanding to the `i` from the enumeration loop on `regular_args` (when it wasn't empty).
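The effect can be reproduced in a few lines. This sketch assumes the template was an f-string evaluated while a loop variable `i` was still in scope; the argument names are made up, and only the `{i}` versus `{{i}}` behavior is the point.

```python
regular_args = ["x_ptr", "y_ptr"]  # hypothetical argument names
for i, arg in enumerate(regular_args):
    pass  # after the loop, i == 1 is still in scope

# Single braces are substituted immediately with the leftover loop variable.
buggy = f"arg{i}"
# Doubled braces are an escape: the literal text "{i}" survives.
fixed = f"arg{{i}}"

print(buggy)              # arg1
print(fixed)              # arg{i}
print(fixed.format(i=7))  # arg7, filled in by the later, intended substitution
```

When `regular_args` is empty, `i` is never bound and the single-brace version raises `NameError` instead, which matches the commit's note that the silent expansion only happened when the list was not empty.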

[DRIVER] Fix typos (triton-lang#939)

8650b4d

[FRONTEND] Fix 3d indexing (triton-lang#1006)

b5aafb0

[FRONTEND] Export broadcast and broadcast_to in triton.language (triton-lang#1007)

925d3d7

[BACKEND] Support splat constant on the DotOperandLayout (triton-lang#1008)

fd2da4a

[BUILD] Fixed error when compiling in systems with multiple versions of python installed (triton-lang#1019)

194ba10

[BACKEND] Add generic support of convert_layout from distributed to shared (triton-lang#1025)

0e8590f

[Backend] Use post-order traversal for liveness numbering (triton-lang#1027)

678b9f5

Also add tests for `tt.trans`.

[Frontend] Fix import for libdevice (triton-lang#1028)

8460ea3

This is a hotfix for issue 1 in triton-lang#1017

[FRONTEND] Fix argmin/max output type (triton-lang#1012)

bc73bbb

Currently Triton returns tensors with the input types rather than i32 when doing reduce argmax/argmin.

[FRONTEND] Add logical operations on constexprs (triton-lang#1033)

411bacb

[Example] Fix the compile function in copy_strided.py (triton-lang#1029)

2193bee

[Frontend] Convert constexpr to value for store and load ops (triton-lang#1030)

4023149

Fixing problem 2 in triton-lang#1017.

Co-authored-by: Philippe Tillet <phil@openai.com>

shintaro-iwasaki force-pushed the siwasaki/pr/libstdcxxfs branch from 295dd59 to 09a9986 on January 5, 2023 22:43

[CMake] Explicitly link against lstdc++fs for old GCC

08068d7

shintaro-iwasaki force-pushed the siwasaki/pr/libstdcxxfs branch from 09a9986 to 08068d7 on January 5, 2023 22:46

shintaro-iwasaki wants to merge 50 commits into master from siwasaki/pr/libstdcxxfs
