Skip to content

Conversation

@cj401-amd
Copy link

@cj401-amd cj401-amd commented Dec 5, 2025

Motivation

backport rocprofiler-sdk (v3) and roctracer (v1) to v0.8.0

Technical Details

  • support both v3 and v1, building to use v1 as profiling backend is via --bazel_options="--define=xla_rocm_profiler=v1"
  • support adjusting the maximum number of trace event, via export XLA_FLAGS="--xla_gpu_rocm_max_trace_events=30000000" , default to 4M
  • support tracing hipGraphLaunch , export export XLA_FLAGS="--xla_gpu_graph_min_graph_size=1" (--xla_gpu_graph_level has been retired now)

no kernel details in the trace file to be fixed.

@Eetusjo
Copy link

Eetusjo commented Dec 8, 2025

I created a PR to fix the missing buffer flush in upstream: openxla#34956. @cj401-amd I think you could pick that commit here so the fix gets merged at the same time.

RocmTracerOptions GpuTracer::GetRocmTracerOptions() {
// TODO(rocm-profiler): We need support for context similar to CUDA
RocmTracerOptions options;
#if defined(XLA_GPU_ROCM_TRACER_BACKEND) && (XLA_GPU_ROCM_TRACER_BACKEND == 1)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JFYI, you can skip defined() and use simply

#if XLA_GPU_ROCM_TRACER_BACKEND == 1

After all replacements due to macro expansion and evaluations of defined-macro-expressions ... have been performed, all remaining identifiers and keywords, except for true and false, are replaced with the pp-number 0 ....
https://timsong-cpp.github.io/cppwp/n4659/cpp.cond#9

// Number of GPUs involved.
uint32_t num_gpus;
friend bool operator==(const RocmDeviceOccupancyParams& a,
const RocmDeviceOccupancyParams& b) noexcept {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of having a noexcept here?

noexcept is an exception firewall that might be expensive to maintain, hard to diagnose & debug (the app just terminates without a message), and more importantly, not much useful, as in many cases a compiler could see on its own that nothing throws from below the stack. So I'm not recommending to use it unless there are real reasons for it (but then for diagnosing reasons it's better to start with generic exception handler that could at least output some info right before termination.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could copy constructors of any the objects throw?


inline auto GetCallbackTracingNames() {
return rocprofiler::sdk::get_callback_tracing_names();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume these, or at least the function, are expected to be used from this translation unit only, - is this correct?

If so, please put them, and all other similar private definitions of this file into an anonymous namespace to prevent a linker from seeing it.

} else if (event.type == RocmTracerEventType::MemoryAlloc) {
VLOG(7) << "Add MemAlloc stat";
std::string value =
// TODO(rocm-profiler): we need to discover the memory kind similar
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// TODO(rocm-profiler):

TODO comments don't reference who should implement it, or the topic to which the TODO applies. Instead, they should reference a person (or a well defined team as a stretch of a person concept) who knows the most about the problem.

https://google.github.io/styleguide/cppguide.html#TODO_Comments

});
events.clear();
bool PerDeviceCollector::IsHostEvent(const RocmTracerEvent& event,
tsl::int64* line_id) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You never check if line_id is a nullptr hence assuming it's always valid. Why not pass it a reference instead tsl::int64& line_id ? This has a clearer "not null" semantic. Otherwise it's better to put a nullptr check here.

Copy link

@Arech8 Arech8 Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same applies to myriads of other unguarded pointer uses in the PR code...

if (hipGetDeviceProperties(&device_properties_, device_ordinal) != hipSuccess)
return;

auto clock_rate_in_khz = device_properties_.clockRate; // in KHz
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is redundant here. You've already properly named the var. The var could be even shorter: clock_rate_khz and still remain understandable.

for (auto& event : aggregated_events) {
auto id = event.device_id - min_device_id;
if (id < num_gpus_) {
per_device_collector_[id].AddEvent(std::move(event));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow-wow-wow, since you're passing the object here into AddEvent(), you should declare it properly in the loop definition:

 for (auto&& event : aggregated_events) {

You can still use it as lvalue in }else{ block, but that way you at least declare that you can steal the object from the container.

Comment on lines +561 to +562
end_gputime_ns, &device_plane,
&host_plane);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminding about passing by reference, probably even as a reference to a const object, since Export semantic typically shouldn't modify the objects it exports.

}
void PerDeviceCollector::AddEvent(RocmTracerEvent&& event) {
absl::MutexLock lock(&events_mutex_);
events_.emplace_back(std::move(event));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, push_back() seems more appropriate here, since you adding an existing object to a collection of such objects.

There might be other instances of emplace_back misuse, please check the code.

Comment on lines +289 to +298
trace_event->kernel_info = KernelDetails{
.private_segment_size = kinfo.private_segment_size,
.group_segment_size = kinfo.group_segment_size,
.workgroup_x = kinfo.workgroup_size.x,
.workgroup_y = kinfo.workgroup_size.y,
.workgroup_z = kinfo.workgroup_size.z,
.grid_x = kinfo.grid_size.x,
.grid_y = kinfo.grid_size.y,
.grid_z = kinfo.grid_size.z,
.func_ptr = nullptr,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eeeeerm, designated initializers are C++20 feature. Aren't XLA uses C++17?

size_t num_headers, uint64_t drop_count) {
if (collector() == nullptr) return;
if (num_headers == 0) return;
assert(drop_count == 0 && "drop count should be zero for lossless policy");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps LOG(WARNING) could be better here? It's unlikely that a user will profile something under a debug build, or at least without NDEBUG...

ProfilerKernelInfo{tsl::port::MaybeAbiDemangle(data->kernel_name),
*data});
} else if (record.phase == ROCPROFILER_CALLBACK_PHASE_UNLOAD) {
// FIXME: clear these? At minimum need kernel names at shutdown, async
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just like with TODO, I think, FIXME needs to mention someone who knows the most why this needs to be fixed.


PerDeviceCollector() = default;

void AddEvent(RocmTracerEvent&& event);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be
void AddEvent(RocmTracerEvent event)

it shall not restrict the user to move objects in?
You do call it as AddEvent(std::move(obj)); if you want to.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the point of copying the object if it could just be moved from the source? (I don't know if it's could be moved from there, so I'm genuinely asking)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will not be copying it, but what will you do is decided by the user of that api.
if you use std::move() to pass the argument it will be moved, if not it will be copied.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well.... that depends on how expensive the move is (might be, though typically not), and if the move operation is defined the first hand for the type. Passing by rvalue ref guarantees you'll have at most one copy (upon attempting to move the object into destination) in the worst case.

Can you explain what do you think is wrong with passing as rvalue ref?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume you have an object you want to add into that collection but you also want to use it later on. You just can't do that with this api. With a different api you can remote std::move and do it!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but does this assumption holds in reality?

And what could you do with the object after you've added a copy of it into a collection? If you're interested in querying its properties only (a.k.a. const this access), they couldn't this function just return a const reference to the object added into the collection?

This code don't need that feature, what's the value of having it then? If someone would need it, they could add an override, couldn't they?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but does this assumption holds in reality?

And what could you do with the object after you've added a copy of it into a collection? If you're interested in querying its properties only (a.k.a. const this access), they couldn't this function just return a const reference to the object added into the collection?

This code don't need that feature, what's the value of having it then? If someone would need it, they could add an override, couldn't they?

It is more about the code simplicity and readability. It is always better to read MyType param than MyType&& param less symbols less confusion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"And what could you do with the object after you've added a copy of it into a collection? If you're interested in querying its properties only (a.k.a. const this access), they couldn't this function just return a const reference to the object added into the collection?"
You can add it so some other container, copy it, modify it whatever. It is about the interface design and code readability.
In this particular case maybe we don't do so but why do we make our interface more complex than it should be? Why interface has to have a move semantics on it?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you notice you have suddenly fully changed reasons to amend this?

And I fully disagree with the new rationale... There's just nothing wrong with Type&& argument, it's a perfectly normal C++. Did you ever read high-performance code? You must always think of even tiniest moves there, it's riddled with move semantic for that reason. I totally can't buy the "simplicity and readability" argument, especially given that I absolutely can't expect people using std::move(event) to prevent copying. This rarely happens even with very experienced people, most just never cares, so it's good to write code the way it just won't happen. For those who want something different - function overloads can handle this nicely.

int isValid = 0;
rocprofiler_context_is_valid(context_, &isValid);
if (isValid == 0) {
context_.handle = 0; // Leak on failure.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a failure path? Shouldn't the whole initialization be reverted or it remains in a stable non-leaking state?

Comment on lines +526 to +528
rocprofiler_query_available_agents(ROCPROFILER_AGENT_INFO_VERSION_0,
iterate_cb, sizeof(rocprofiler_agent_t),
static_cast<void*>(&agents));
Copy link

@Arech8 Arech8 Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eeeeeeerm.. Why do you passing std::vector<rocprofiler_agent_v0_t> agents; as void*? Is rocprofiler_query_available_agents implemented in a different compilation unit binary? If yes, what guarantees it uses exactly the same standard lib with the same implementation as dynamic library?
C++ have no standardized ABI on top of that...

Comment on lines +564 to +566
static rocprofiler_tool_configure_result_t cfg{
sizeof(rocprofiler_tool_configure_result_t), &toolInitStatic,
&RocmTracer::toolFinalize, nullptr};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so what would happen if rocprofiler_configure() is called the second time now with different arguments?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eeeeeerrm, but you aren't actually using any of the arguments. Why do you output them to the log then, if they mean nothing? Don't you think you're confusing users that way, since they would expect the arguments matter somehow when confirmed by the log?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants