Skip to content

Conversation

@jorickert
Copy link

No description provided.

brad0 and others added 30 commits January 29, 2025 03:21
This makes some other problems show up like the fact that we didn't
suppress diagnostics during __builtin_constant_p evaluation.
This patch implements support for the UNROLL directive to control how
many times a loop should be unrolled.
It must be placed immediately before a `DO LOOP` and applies only to the
loop that follows. N is an integer that specifying the unrolling factor.
This is done by adding an attribute to the branch into the loop in LLVM
to indicate that the loop should unrolled.
The code pushed to support the directive `VECTOR ALWAYS` has been
modified to take account of the fact that several directives can be used
before a `DO LOOP`.
)

This commit restricts the use of scalar types in vector math builtins,
particularly the `__builtin_elementwise_*` builtins.

Previously, small scalar integer types would be promoted to `int`, as
per the usual conversions. This would silently do the wrong thing for
certain operations, such as `add_sat`, `popcount`, `bitreverse`, and
others. Similarly, since unsigned integer types were promoted to `int`,
something like `add_sat(unsigned char, unsigned char)` would perform a
*signed* operation.

With this patch, promotable scalar integer types are not promoted to
int, and are kept intact. If any of the types differ in the binary and
ternary builtins, an error is issued. Similarly an error is issued if
builtins are supplied integer types of different signs. Mixing enums of
different types in binary/ternary builtins now consistently raises an
error in all language modes.

This brings the behaviour surrounding scalar types more in line with
that of vector types. No change is made to vector types, which are both
not promoted and whose element types must match.

Fixes llvm#84047.

RFC:
https://discourse.llvm.org/t/rfc-change-behaviour-of-elementwise-builtins-on-scalar-integer-types/83725
)

This resolves the same issue addressed in
llvm#124286, but for invoke
operations. The issue arose from duplicated logic for both imports. This
PR also refactors the common import code for call and invoke
instructions to mitigate issues in the future.
As decided on
https://discourse.llvm.org/t/rfc-lets-document-and-enforce-a-minimum-python-version-for-lldb/82731.

LLDB 20 recommended `>= 3.8` but did not remove support for anything
earlier. Now we are in what will become LLDB 21, so I'm removing that
support and making
`>= 3.8` required.

See https://docs.python.org/3/c-api/apiabiversion.html#c.PY_VERSION_HEX
for the format of PY_VERSION_HEX.
Was flagged in llvm#124735
but done separately so it didn't get in the way of that.
This patch moves up the checks that verify if it is legal to replace the
atomic load/store with memcpy. Currently these checks are done after we
determine to convert the load/store to memcpy/memmove, which makes the
logic a bit confusing.

This patch is a prelude to llvm#50892
…subdirectory (llvm#124744)

I left these alone in llvm#124463 but I think it makes sense to clean these
up as well (which Philip also noted in llvm#124464).
…vm#124789)

These were based off instruction count, not throughput - we can probably improve these further, but these throughput numbers match the worse expanded shuffles we see in the vector-shuffle-128-v* codegen tests.
LoopInterchange have converted `DVEntry::LE` and `DVEntry::GE` in
direction vectors to '<' and '>' respectively. This handling is
incorrect because the information about the '=' it lost. This leads to
miscompilation in some cases. To resolve this issue, convert them to '*'
instead.

Resolve llvm#123920
…123675)

The PR addresses issues with the filters of 1 x r and of r x 1 and with
the tiling.

---------

Signed-off-by: Dmitriy Smirnov <[email protected]>
Thread-local code generation requires constant pools because most of the
relocations needed for it operate on data, so it cannot be used with
-mexecute-only (or -mpure-code, which is aliased in the driver).

Without this we hit an assertion in the backend when trying to generate
a constant pool.
…124775)

When using PAuthLR, the PAUTH_PROLOGUE expands into a sequence of
instructions which takes the address of one of those instructions, and
uses that address to compute the return address signature. If this is
duplicated, there will be two different addresses used in calculating
the signature, so the epilogue will only be correct for (at most) one of
them.

This change also restricts code generation when using v8.3-A return
address signing, without PAuthLR. This isn't strictly needed, as
duplicating the prologue there would be valid. We could fix this by
having two copies of PAUTH_PROLOGUE, with and without isNotDuplicable,
but I don't think it's worth adding the extra complexity to a security
feature for that.
…#123640)

Add new runPass helpers to run a VPlan transformation. This makes it
easier to add additional checks/functionality for each transform run. In
this patch, an option is added to run the verifier after each VPlan
transform.

Follow-ups will use the same helper to also support printing VPlans
after each transform.

Note that the verifier at the moment requires there to be a canonical IV
and vector loop region, so the final lowering transforms aren't run via
runPass yet.

PR: llvm#123640
Add the implementation of the IERRNO intrinsic to get the last system
error number, as given by the C errno variable.
This intrinsic is also used in RAMSES
(https://github.com/ramses-organisation/ramses/).
Using the `__builtin_elementwise_(add|sub)_sat` functions allows us to
directly optimize to the desired intrinsic, and avoid scalarization for
vector types.
Adds AVX512 bf16 dot-product operation and defines lowering to LLVM
intrinsics.

AVX512 intrinsic operation definition is extended with an optional
extension field that allows specifying necessary LLVM mnemonic suffix
e.g., `"bf16"` for `x86_avx512bf16_` intrinsics.
…egister shuffles

Patch adds usage of processShuffleMasks in TTI for RISCV. This function is already used for X86
shuffles estimations and in DAGTypeLegalizer::SplitVecRes_VECTOR_SHUFFLE
functions and in RISCV codegen.
Patch allows better cost estimation for sparse masks and unifies
cost/codegen between different targets/passes

Reviewers: preames

Reviewed By: preames

Pull Request: llvm#118103
…4770)

The tablgen definition `TypedStrAttr` is an attribute constraints that
is meant to restrict the type of a `StringAttr` to the type given as
parameter. However, the definition did not previously restrict the type;
any `StringAttr` was accepted. This PR makes the definition actually
enforce the type.

To test the constraints, the PR also changes the test op that was
previously used to test this constraint such that the enforced type is
`AnyInteger` instead of `AnyType`. The latter allowed any type, so not
enforcing that constraint had no observable effect. The PR then adds a
test case with a wrong type and ensures that diagnostics are produced.

Signed-off-by: Ingo Müller <[email protected]>
Infer mnemonics from the names of the records.
…#124799)

After we fall back from GlobalISel to SDAG, the verifier gets called,
which calls getReservedRegs which uses SIMachineFunctionInfo::usesAGPRs
which caches the result of UsesAGPRs. Because we have just fallen-back
the function is empty and it incorrectly gets cached to false. This
patch makes sure we don't try to run the verifier whilst the function is
empty.
Add support for the `-fopenmp-version=60` command line argument. It is
needed for llvm#119891 (`#pragma
omp stripe`) which will be the first OpenMP 6.0 directive implemented.
Add regression tests for Clang in `-fopenmp-version=60` mode.
Requires x86 target for the lit test to ensure required instructions are
available.
This enables more projects in the CMake cache to add them to the
buildbot coverage in the AMDGPU buildbots.
kmitropoulou and others added 30 commits January 29, 2025 09:00
…nches. (llvm#124028)

- **[NFC] Use GCNPat instead of Pat.**
- **[AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point
branches.**

---------

Co-authored-by: Konstantina Mitropoulou <[email protected]>
Add v16i16 coverage and "reverse order hadd/hsub" tests
A new op that allows for representing arbitrary contractions on operands
of arbitrary rank, with arbitrary transposes and arbitrary broadcasts
specified through its indexing_maps attribute.

Supports the expected lowerings to linalg.generic and to
vector.contract.

Corresponding RFC is here:
https://discourse.llvm.org/t/mlir-rfc-introduce-linalg-contract/83589
…02944)

This patch implements an LLVM IR pass, named kernel-info, that reports
various statistics for codes compiled for GPUs. The ultimate goal of
these statistics to help identify bad code patterns and ways to mitigate
them. The pass operates at the LLVM IR level so that it can, in theory,
support any LLVM-based compiler for programming languages supporting
GPUs. It has been tested so far with LLVM IR generated by Clang for
OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs.

By default, the pass runs at the end of LTO, and options like
``-Rpass=kernel-info`` enable its remarks. Example `opt` and `clang`
command lines appear in `llvm/docs/KernelInfo.rst`. Remarks include
summary statistics (e.g., total size of static allocas) and individual
occurrences (e.g., source location of each alloca). Examples of its
output appear in tests in `llvm/test/Analysis/KernelInfo`.
… NFC (llvm#124658)

Use the Flags argument to add FrameSetup directly instead of walking
backwards to add the flag after the call.
…ve (llvm#122900)

A trait poperty can be one of serveral alternatives (name, expression,
etc.), and each property in a list was parsed as if it could be any of
these alternatives independently from other properties. This made the
parsing vulnerable to certain ambiguities in the trait grammar (provided
in the OpenMP spec).
At the same time the OpenMP spec gives the expected types of properties
for almost every trait: all properties listed for a given trait are
usually of the same type, e.g. names, clauses, etc.

Incorporate these restrictions into the parser, and additionally use
property extensions as the fallback if the parsing of the expected
property type failed. This is intended to allow the parser to succeed,
and instead let the semantic-checking code emit a more user-friendly
message.
We can take advantage of the fact that we subsequently only clone cold
allocation contexts, since not cold behavior is the default, and
significantly reduce the amount of metadata (and later ThinLTO summary
and MemProfContextDisambiguation graph nodes) by pruning unnecessary not
cold contexts when building metadata from the trie.

Specifically, we only need to keep notcold contexts that overlap the
longest with cold allocations, to know how deeply to clone those
contexts to expose the cold allocation behavior.

For a large target this reduced ThinLTO bitcode object sizes by about
35%. It reduced the ThinLTO indexing time by about half and the peak
ThinLTO indexing memory by about 20%.
…vm#124948)

VectorCombine

Since the transformation which is the subject of the
'fold-binop-of-reductions.ll` test will be in VectorCombine move the
test there.
Added an extension point after vectorizer passes in the PassBuilder.
Additionally, added extension points before and after vectorizer passes
in `buildLTODefaultPipeline`. Credit goes to @mshockwave for guiding me
through my first LLVM contribution (and my first open source
contribution in general!) :)
- Implemented `registerVectorizerEndEPCallback`
- Implemented `invokeVectorizerEndEPCallbacks`
- Added `VectorizerEndEPCallbacks` SmallVector
- Added a command line option `passes-ep-vectorizer-end` to
`NewPMDriver.cpp`
- `buildModuleOptimizationPipeline` now calls
`invokeVectorizerEndEPCallbacks`
- `buildO0DefaultPipeline` now calls `invokeVectorizerEndEPCallbacks`
- `buildLTODefaultPipeline` now calls BOTH
`invokeVectorizerStartEPCallbacks` and `invokeVectorizerEndEPCallbacks`
- Added LIT tests to `new-pm-defaults.ll`, `new-pm-lto-defaults.ll`,
`new-pm-O0-ep-callbacks.ll`, and `pass-pipeline-parsing.ll`
- Renamed `CHECK-EP-Peephole` to `CHECK-EP-PEEPHOLE` in
`new-pm-lto-defaults.ll` for consistency.

This code is intended for developers that wish to implement and run
custom passes after the vectorizer passes in the PassBuilder pipeline.
For example, in llvm#91796, a pass was created that changed the induction
variables of vectorized code. This is right after the vectorization
passes.
llvm#111894)

This flag has been deprecated since Clang 19, having been the default
since then.

It has remained because its negation was still useful to work around
backwards compatibility breaking changes from P0522.

However, in Clang 20 we have landed various changes which implemented
P3310R2 and beyond, which solve almost all of the expected issues, the
known remaining few being a bit obscure.

So this change removes the flag completely and all of its implementation
and support code.

Hopefully any remaining users can just stop using the flag. If there are
still important issues remaining, this removal will also shake the tree
and help us know.
…length. (llvm#124827)

The length might be any integer, so hlfir.get_length lowering
should explicitly cast it to `index`.
…lvm#124867)

An hlfir.elemental with a shape `(0, HUGE)` still runs `HUGE`
number of iterations when expanded into a loop nest.
HLFIR transformational operations inlined as hlfir.elemental
may execute slower comparing to Fortran runtime implementation.
This patch adds an option for BufferizeHLFIR pass to reset all
upper bounds in the elemental loop nests to zero, if the result
is an empty array.

A separate patch will enable this option in the driver after I do
more performance testing. The option is off by default now.
…12241)

This fixes instantiation of definition for friend function templates,
when the declaration found and the one containing the definition have
different template contexts.

In these cases, the the function declaration corresponding to the
definition is not available; it may not even be instantiated at all.

So this patch adds a bit which tracks which function template
declaration was instantiated from the member template. It's used to find
which primary template serves as a context for the purpose of
obtainining the template arguments needed to instantiate the definition.

Fixes llvm#55509

Relanding patch, with no changes, after it was reverted due to revert of
commit this patch depended on.
This refactors the standard stream implementation in multiple ways:
- The streams are now `stream_data` structs, which contain all the data
required for a stream
- The windows mangling is generated via a macro instead of having magic
strings for the different streams. (i.e. it's now only partially magic)
This is an implementation of P1061 Structure Bindings Introduce a Pack
without the ability to use packs outside of templates. There is a couple
of ways the AST could have been sliced so let me know what you think.
The only part of this change that I am unsure of is the
serialization/deserialization stuff. I followed the implementation of
other Exprs, but I do not really know how it is tested. Thank you for
your time considering this.

---------

Co-authored-by: Yanzuo Liu <[email protected]>
…VE (llvm#121817)

Parse METADIRECTIVE as a standalone executable directive at the moment.
This will allow testing the parser code.

There is no lowering, not even clause conversion yet. There is also no
verification of the allowed values for trait sets, trait properties.
…it (llvm#124965)

This patch addresses post commit review comments from llvm#124859. 

The extra compile definition is not necessary and goes against the
effort to separate the runtimes from the flang compiler itself. The
function declaration for `CUFInit` can be accessed anyway since the
header are always present. The insertion of the call is only based on
the language feature options from the folding context.

A program compiled with cuda enabled but no cufruntime would just fail
at link time as expected.
Modify the DA pretty printer to match the output of other analysis
passes. This enables update_analyze_test_checks.py to also work on DA
tests. Auto generate all the Dependence Analysis tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.