Add Sifive Xsfvqmacc extension intrinsics #5
yulong18 wants to merge 4 commits into sifive-rvv-intrinsic from
Conversation
Hi, @kito-cheng and @monkchiang:

vint32m1_t test1(vint32m1_t vd, vint8m1_t vs1, vint8mf2_t vs2, size_t vl) {
{"xsfcease", &gcc_options::x_riscv_sifive_subext, MASK_XSFCEASE},
{"xsfvqmaccqoq", &gcc_options::x_riscv_sifive_subext, MASK_XSFVQMACCQOQ},
{"xsfvqmaccdod", &gcc_options::x_riscv_sifive_subext, MASK_XSFVQMACCDOD},
gcc/config/riscv/riscv-c.cc
Outdated
builtin_define_with_int_value ("__riscv_th_v_intrinsic",
                               riscv_ext_version_value (0, 11));
SHAPE(crypto_vv, crypto_vv)
SHAPE(crypto_vi, crypto_vi)
SHAPE(crypto_vv_no_op_type, crypto_vv_no_op_type)
SHAPE(sf_vqmacc,sf_vqmacc)

Suggested change:
- SHAPE(sf_vqmacc,sf_vqmacc)
+ SHAPE(sf_vqmacc, sf_vqmacc)
kito-cheng
left a comment
Could you add testcases?
Also, you can copy contrib/clang-format to .clang-format, then use clang-format or git clang-format to format your code.
return e.use_exact_insn (
  code_for_pred_fnr_clip (ZERO_EXTEND, e.vector_mode ()));
How to distinguish between x and xu here?
gcc_unreachable ();
}
}

Suggested change:
- gcc_unreachable ();
- }
- }
+ }
+ gcc_unreachable ();
+ }
The gcc_unreachable is at the wrong nesting level, I think?
static CONSTEXPR const vfnrclip x_obj;
static CONSTEXPR const vfnrclip xu_obj;
sf_vfnrclip_x_obj;
sf_vfnrclip_xu_obj;
if (e.op_info->op == OP_TYPE_4x8x4)
  return e.use_widen_ternop_insn (
    code_for_pred_quad_mul_plusus_qoq (e.vector_mode ()));
static CONSTEXPR const vqmacc vqmacc_obj;
static CONSTEXPR const vqmaccu vqmaccu_obj;
static CONSTEXPR const vqmaccsu vqmaccsu_obj;
static CONSTEXPR const vqmaccsu vqmaccus_obj;
/* vop_v --> vop_v_<type>. */
b.append_name (type_suffixes[instance.type.index].vector);
if (overloaded_p && (instance.pred == PRED_TYPE_tu || instance.pred == PRED_TYPE_mu
                     || instance.pred == PRED_TYPE_tumu))
This patch folds svindex with constant arguments into a vector series.
We implemented this in svindex_impl::fold using the function build_vec_series.
For example,
svuint64_t f1 ()
{
return svindex_u64 (10, 3);
}
compiled with -O2 -march=armv8.2-a+sve, is folded to {10, 13, 16, ...}
in the gimple pass lower.
This optimization benefits cases where svindex is used in combination with
other gimple-level optimizations.
For example,
svuint64_t f2 ()
{
return svmul_x (svptrue_b64 (), svindex_u64 (10, 3), 5);
}
has previously been compiled to
f2:
index z0.d, #10, #3
mul z0.d, z0.d, #5
ret
Now, it is compiled to
f2:
mov x0, 50
index z0.d, x0, #15
ret
We added test cases checking
- the application of the transform during gimple for constant arguments,
- the interaction with another gimple-level optimization.
The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?
Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com>
gcc/
* config/aarch64/aarch64-sve-builtins-base.cc
(svindex_impl::fold): Add constant folding.
gcc/testsuite/
* gcc.target/aarch64/sve/index_const_fold.c: New test.
We can make use of the integrated rotate step of the XAR instruction
to implement most vector integer rotates, as long we zero out one
of the input registers for it. This allows for a lower-latency sequence
than the fallback SHL+USRA, especially when we can hoist the zeroing operation
away from loops and hot parts. This should be safe to do for 64-bit vectors
as well, even though the XAR instructions operate on 128-bit values, as the
bottom 64 bits of the result are later accessed through the right subregs.
This strategy is used whenever we have XAR instructions; the logic
in aarch64_emit_opt_vec_rotate is adjusted to resort to
expand_rotate_as_vec_perm only when it is expected to generate a single REV*
instruction or when XAR instructions are not present.
With this patch we generate, for the input:
v4si
G1 (v4si r)
{
return (r >> 23) | (r << 9);
}
v8qi
G2 (v8qi r)
{
return (r << 3) | (r >> 5);
}
the assembly for +sve2:
G1:
movi v31.4s, 0
xar z0.s, z0.s, z31.s, #23
ret
G2:
movi v31.4s, 0
xar z0.b, z0.b, z31.b, #5
ret
instead of the current:
G1:
shl v31.4s, v0.4s, 9
usra v31.4s, v0.4s, 23
mov v0.16b, v31.16b
ret
G2:
shl v31.8b, v0.8b, 3
usra v31.8b, v0.8b, 5
mov v0.8b, v31.8b
ret
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Add
generation of XAR sequences when possible.
gcc/testsuite/
* gcc.target/aarch64/rotate_xar_1.c: New test.
/* Return true if intrinsics maybe require qfrm operand. */
virtual bool may_require_qfrm_p () const;

/* We choose to return false by default since most of the intrinsics does
   not need qfrm operand. */
inline bool
function_base::may_require_qfrm_p () const
{
  return false;
}
#include "riscv_vector.h"

vint8mf8_t test1(float vs1, vfloat32mf2_t vs2, size_t vl) {
  return __riscv_sf_vfnrclip_x_f_qf_i8mf8(vs2, vs1, vl);
}
Could you reference this file https://github.com/gcc-mirror/gcc/blob/master/gcc/testsuite/gcc.target/riscv/target-attr-01.c and add the correct test directives?
e.g.
/* { dg-do compile } */
/* { dg-skip-if "" { *-*-* } { "-flto" } { "" } } */
/* { dg-options "-march=rv64gcv_xsfvfnrclipxfqf -O2 -mabi=lp64d" } */
/* { dg-final { check-function-bodies "**" "" } } */

and add some checks within the testcase like:
/*
** foo:
** ...
** vsetvli\s*x0, a0, e32, m1, ta, ta
** sf.vfnrclip.x.f.qf\s*fa0,v8
** ...
*/

And don't forget sf_vqmacc.c.
vint8m2_t test2(float vs1, vfloat32m8_t vs2, size_t vl) {
  return __riscv_sf_vfnrclip_x_f_qf_i8m2(vs2, vs1, vl);
}
Also add testcases for __riscv_sf_vfnrclip_xu_f_qf_*; not every combination is needed, but please add a few to improve the coverage.
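A minimal xu testcase might look like the following. This is a sketch only: the intrinsic name, the unsigned return type, and the scan pattern are assumed from the naming used elsewhere in this PR and should be checked against the actual generated code.

```c
/* { dg-do compile } */
/* { dg-skip-if "" { *-*-* } { "-flto" } { "" } } */
/* { dg-options "-march=rv64gcv_xsfvfnrclipxfqf -O2 -mabi=lp64d" } */

#include "riscv_vector.h"

/* Assumed unsigned counterpart of test1 above.  */
vuint8mf8_t test_xu (float vs1, vfloat32mf2_t vs2, size_t vl) {
  return __riscv_sf_vfnrclip_xu_f_qf_u8mf8 (vs2, vs1, vl);
}

/* { dg-final { scan-assembler {sf\.vfnrclip\.xu\.f\.qf} } } */
```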
gcc/config/riscv/riscv.md
Outdated
;; vidiv vector single-width integer divide instructions
;; viwmul vector widening integer multiply instructions
;; vimuladd vector single-width integer multiply-add instructions
;; vsfmuladd vector matrix integer multiply-add instructions
Minor comment: please put this at the end of the comment block; don't mix it with the standard instructions.
gcc/config/riscv/riscv.md
Outdated
;; vsmul vector single-width fractional multiply with rounding and saturation instructions
;; vsshift vector single-width scaling shift instructions
;; vnclip vector narrowing fixed-point clip instructions
;; vsfclip vector fp32 to int8 ranged clip instructions
Hi, @kito-cheng:
/* According to SIFIVE vector-intrinsic-doc, it adds suffixes
   for vop_m C++ overloaded API. */
The comment seems not to have been updated?
return e.use_exact_insn (
  code_for_pred_sf_vfnrclip_x_f_qf (UNSPEC, e.vector_mode ()));
gcc_unreachable ();

Suggested change:
- return e.use_exact_insn (
-   code_for_pred_sf_vfnrclip_x_f_qf (UNSPEC, e.vector_mode ()));
- gcc_unreachable ();
+ return e.use_exact_insn (
+   code_for_pred_sf_vfnrclip_x_f_qf (UNSPEC, e.vector_mode ()));
};

/* Implements vqmacc. */
class vqmacc : public function_base

Suggested change:
- class vqmacc : public function_base
+ class sf_vqmacc : public function_base
};

/* Implements vqmaccu. */
class vqmaccu : public function_base

Suggested change:
- class vqmaccu : public function_base
+ class sf_vqmaccu : public function_base
};

/* Implements vqmaccus. */
class vqmaccus : public function_base

Suggested change:
- class vqmaccus : public function_base
+ class sf_vqmaccus : public function_base
extern const function_base *const sf_vqmacc;
extern const function_base *const sf_vqmaccu;
extern const function_base *const sf_vqmaccsu;
extern const function_base *const sf_vqmaccus;

Could you create a new file sifive-vector-builtins-bases.h to hold the SiFive intrinsics?
/* Implements vfnrclip. */
template <int UNSPEC, enum frm_op_type FRM_OP = NO_FRM>
class vfnrclip_x_f_qf : public function_base

Create sifive-vector-builtins-bases.cc to hold those intrinsics.
gcc/config/riscv/riscv.md
Outdated
;; vsmul vector single-width fractional multiply with rounding and saturation instructions
;; vsshift vector single-width scaling shift instructions
;; vnclip vector narrowing fixed-point clip instructions
;; vfnrclip vector fp32 to int8 ranged clip instructions

They are not standard instructions, so please create a new section in the comment rather than mixing them into the standard section; also please prefix them with sf_.
gcc/config/riscv/riscv.md
Outdated
;; vfncvtbf16 vector narrowing single floating-point to brain floating-point instruction
;; vfwcvtbf16 vector widening brain floating-point to single floating-point instruction
;; vfwmaccbf16 vector BF16 widening multiply-accumulate
;; vqmacc vector matrix integer multiply-add instructions
Hi, @kito-cheng and @pz9115:
Hi, @kito-cheng:
…o_debug_section [PR116614]
cat abc.C
#define A(n) struct T##n {} t##n;
#define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
#define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
#define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
#define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
E(1) E(2) E(3)
int main () { return 0; }
./xg++ -B ./ -o abc{.o,.C} -flto -flto-partition=1to1 -O2 -g -fdebug-types-section -c
./xgcc -B ./ -o abc{,.o} -flto -flto-partition=1to1 -O2
(not included in testsuite as it takes a while to compile) FAILs with
lto-wrapper: fatal error: Too many copied sections: Operation not supported
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
The following patch fixes that. Most of the 64K+ section support for
reading and writing was already there years ago (and the reading side is
used quite often already), and a further bug in it was fixed by the
PR104617 fix.
Yet, the fix isn't solely about removing the
if (new_i - 1 >= SHN_LORESERVE)
{
*err = ENOTSUP;
return "Too many copied sections";
}
5 lines, the missing part was that the function only handled reading of
the .symtab_shndx section but not copying/updating of it.
If the result has fewer than 64K-epsilon sections, that actually wasn't
needed, but e.g. with -fdebug-types-section one can exceed that pretty
easily (reported to us on a WebKitGtk build on ppc64le).
Updating the section is slightly more complicated, because it basically
needs to be done in lock step with updating the .symtab section: if one
doesn't need to use SHN_XINDEX in there, the section should contain (or
should be updated to contain) an SHN_UNDEF entry, otherwise it needs to
hold whatever would otherwise be stored but couldn't fit. But repeating
all the symtab decisions about what to discard and how to rewrite it
just for that would be ugly.
So, the patch instead emits the .symtab_shndx section (or sections) last
and prepares the content during the .symtab processing and in a second
pass when going just through .symtab_shndx sections just uses the saved
content.
2024-09-07 Jakub Jelinek <jakub@redhat.com>
PR lto/116614
* simple-object-elf.c (SHN_COMMON): Align comment with neighbouring
comments.
(SHN_HIRESERVE): Use uppercase hex digits instead of lowercase for
consistency.
(simple_object_elf_find_sections): Formatting fixes.
(simple_object_elf_fetch_attributes): Likewise.
(simple_object_elf_attributes_merge): Likewise.
(simple_object_elf_start_write): Likewise.
(simple_object_elf_write_ehdr): Likewise.
(simple_object_elf_write_shdr): Likewise.
(simple_object_elf_write_to_file): Likewise.
(simple_object_elf_copy_lto_debug_section): Likewise. Don't fail for
new_i - 1 >= SHN_LORESERVE, instead arrange in that case to copy
over .symtab_shndx sections, though emit those last and compute their
section content when processing associated .symtab sections. Handle
simple_object_internal_read failure even in the .symtab_shndx reading
case.
(cherry picked from commit bb8dd09)