Skip to content

Conversation

@klausler
Copy link
Contributor

Rework derived type initialization in the runtime to just initialize the first element of any array, and then memcpy it to the others, rather than exercising the per-component paths for each element.

Reword derived type destruction in the runtime to detect and exploit a fast path for allocatable components whose types themselves don't need nested destruction.

Small tweaks were made in hot paths exposed by profiling in descriptor operations and derived type assignment.

@klausler klausler requested a review from vzakhari July 10, 2025 23:51
@klausler klausler force-pushed the wq-speedup branch 2 times, most recently from 7597a02 to b9be690 Compare July 11, 2025 23:53
Copy link
Contributor

@vzakhari vzakhari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thank you, Peter!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// address all elements. It genernalizes contiguity by also allowing
// address all elements. It generalizes contiguity by also allowing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I no longer always see this sort of thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mark such methods with force-inline attribute. I am not suggesting doing it in this PR, but I wonder if you tried it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't try it, and I've run into other trouble because something somewhere has a link-time unresolved external reference now to Descriptor::Elements(). Updating soon to leave Elements out-of-line with a new inline InlineElements() that it can call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no clue how this change could have caused a linking issue..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't track it down either. The unresolved reference is not within flang_rt.runtime or the generated code for the failing test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the chunk size grows, I guess, we can hit cache conflicts depending on the aliasing implementation. Do you know if there is any non-temporal memcpy implementation that we can try?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should use

char *to{rawInstance + *stride};
char *from{rawInstance};
for (std::size_t bytes{...}; bytes--; ) {
  *to++ = *from++;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about that. It will probably be the same as memcpy if the compiler vectorizes it. I think memcpy may do better with rep movs on x86. Let's keep it simple with memcpy. If the cache issue pops up anywhere I will experiment with https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memcpy is not defined for overlapping regions, unfortunately.

Rework derived type initialization in the runtime to just
initialize the first element of any array, and then memcpy
it to the others, rather than exercising the per-component
paths for each element.

Reword derived type destruction in the runtime to detect and
exploit a fast path for allocatable components whose types
themselves don't need nested destruction.

Small tweaks were made in hot paths exposed by profiling in
descriptor operations and derived type assignment.
@klausler klausler merged commit 2e53a68 into llvm:main Jul 14, 2025
9 checks passed
@klausler klausler deleted the wq-speedup branch July 14, 2025 18:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants