
memcpy in loop with known-zero source not optimized well #31516

@llvmbot

Description

Bugzilla Link 32168
Version trunk
OS Linux
Reporter LLVM Bugzilla Contributor
CC @majnemer,@hfinkel,@rotateright

Extended Description

IR that does a memcpy in a loop out of a memset(0) is optimized quite badly due to a bad interaction between several passes.
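At the source level, this pattern typically comes from code like the following C sketch (the struct and field names are assumptions chosen to match the 72-byte %Biquad in the IR; the original came from Rust, not C):

```c
#include <string.h>

/* Hypothetical struct matching the 72-byte %Biquad in the IR below:
   nine doubles, 8-byte aligned. */
typedef struct {
    double b0, b1, b2, a1, a2, x1, x2, y1, y2;
} Biquad;

/* Zero a temporary once, then copy it into each array element.
   The memset lands in the entry block while the memcpy sits in the
   loop body, so the copy's known-zero source is only visible across
   a basic-block boundary. */
void initme(Biquad arr[5]) {
    Biquad temp;
    memset(&temp, 0, sizeof temp);
    for (int i = 0; i < 5; i++)
        memcpy(&arr[i], &temp, sizeof temp);
}
```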

For example, this IR:

%Biquad = type { double, double, double, double, double, double, double, double, double }
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i32, i1)
declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i32, i1)

define void @initme([5 x %Biquad]*) {
entry-block:
  %temp = alloca %Biquad, align 8
  %temp_i8 = bitcast %Biquad* %temp to i8*
  call void @llvm.memset.p0i8.i64(i8* %temp_i8, i8 0, i64 72, i32 8, i1 false)

  %p0 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64 0
  %pN = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64 5
  br label %slice_loop_body
slice_loop_body:
  %p = phi %Biquad* [ %p0, %entry-block ], [ %p_next, %slice_loop_body ]
  %p_i8 = bitcast %Biquad* %p to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %p_i8, i8* %temp_i8, i64 72, i32 8, i1 false)
  %p_next = getelementptr inbounds %Biquad, %Biquad* %p, i64 1
  %cond = icmp eq %Biquad* %p_next, %pN
  br i1 %cond, label %exit, label %slice_loop_body
exit:
  ret void
}

is equivalent to a single memset (as can be demonstrated by running the pass sequence "-loop-unroll -simplifycfg -instcombine -memcpyopt -instcombine"):
define void @initme([5 x %Biquad]*) {
entry-block:
  %1 = bitcast [5 x %Biquad]* %0 to i8*
  call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 360, i32 8, i1 false)
  ret void
}

However, if you use the standard -O2 optimization sequence, you get this terrible result:
define void @initme([5 x %Biquad]*) local_unnamed_addr #1 {
entry-block:
  %p_next.1 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64 2
  %p_i8.2 = bitcast %Biquad* %p_next.1 to i8*
  %1 = bitcast [5 x %Biquad]* %0 to i8*
  call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 144, i32 8, i1 false)
  call void @llvm.memset.p0i8.i64(i8* %p_i8.2, i8 0, i64 72, i32 8, i1 false)
  %p_next.2 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64 3
  %p_i8.3 = bitcast %Biquad* %p_next.2 to i8*
  call void @llvm.memset.p0i8.i64(i8* %p_i8.3, i8 0, i64 72, i32 8, i1 false)
  %p_next.3 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64 4
  %p_i8.4 = bitcast %Biquad* %p_next.3 to i8*
  call void @llvm.memset.p0i8.i64(i8* %p_i8.4, i8 0, i64 72, i32 8, i1 false)
  ret void
}

If the array were larger (say, 100 elements), the generated code would be proportionally worse.

The reason this happens is:

  1. There is no optimization that turns a memcpy whose source was zero-initialized (by a memset) in a different basic block into a memset; in particular, LoopIdiomRecognize cannot optimize this.
  2. Loop unrolling happily unrolls this loop, and generates a "chain" of geps:
    %p_i8 = bitcast %Biquad* %p0 to i8*
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %p_i8, i8* %temp_i8, i64 72, i32 8, i1 false)
    %p_next = getelementptr inbounds %Biquad, %Biquad* %p0, i64 1
    %p_i8.1 = bitcast %Biquad* %p_next to i8*
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %p_i8.1, i8* %temp_i8, i64 72, i32 8, i1 false)
    %p_next.1 = getelementptr inbounds %Biquad, %Biquad* %p_next, i64 1
    ...
  3. MemCpyOpt does not handle GEP chains well (https://github.com/llvm-mirror/llvm/blob/f33a6990794fc06d1e54c1cbecca0afa0a3d7d7a/lib/Transforms/Scalar/MemCpyOptimizer.cpp#L429), so it merges only the first two memcpys. This is normally fine because InstCombine, which collapses GEP chains, runs before MemCpyOpt; here, however, unrolling introduces new chains after InstCombine has already run.
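A rough model of why step 3 fails (a simplification for illustration, not LLVM's actual MemsetRanges code): merging requires proving two ranges adjacent as constant offsets from a shared base pointer, and a chained GEP makes each unrolled copy's destination look like it has a different base:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a memset destination: an opaque base pointer id
   plus a constant byte offset and length. */
typedef struct { int base; size_t offset, len; } MemsetRange;

/* Two memsets can be folded into one only when they share a base and
   the second starts exactly where the first ends. With a GEP chain,
   each unrolled iteration's destination is expressed relative to the
   PREVIOUS gep rather than the array base, so the base ids differ and
   the merge is rejected even though the ranges are in fact adjacent. */
bool can_merge(MemsetRange a, MemsetRange b) {
    return a.base == b.base && b.offset == a.offset + a.len;
}
```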

The terrible generated code causes seemingly random slowdowns in real programs, e.g. rust-lang/rust#40267.
