Commit e430427

cboss6 and Lu Teng authored
[Feat] Enable plugin profiler for SYCL backend (#377)
Co-authored-by: Lu Teng <[email protected]>
1 parent bda51ee commit e430427

22 files changed

+7828
-55
lines changed

docs/images/perfetto_profile.png

655 KB

docs/images/tensorboard_profile.png

71.2 KB
163 KB
260 KB
336 KB

docs/profiler.md

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
# GPU Profiler

Intel® Extension for OpenXLA* provides a profiler to track the performance of workloads running on Intel GPUs. [PJRT_C_API](https://github.com/openxla/xla/blob/main/xla/backends/profiler/plugin/profiler_c_api.h) lets a third-party plugin communicate profiling data in XLA's native profiling format: the plugin takes a serialized XSpace object and fills it with runtime data obtained through the oneAPI [Level Zero](https://www.intel.com/content/www/us/en/developer/articles/technical/using-oneapi-level-zero-interface.html) low-level device interface.

Users can display the execution profile of specific XLA modules, XLA ops, GPU kernels, and so on with a profiling visualizer such as TensorBoard or Perfetto.

## How to use
* Necessary environment variables:

| env | functionality |
| --- | --- |
| ZE_ENABLE_TRACING_LAYER | Set to 1 to enable the tracing layer for Level Zero API tracing. See [L0 loader APIs](https://github.com/oneapi-src/level-zero/blob/77d092e314365cc54b9b873a47210a799ed5a77c/doc/loader_api.md?plain=1#L40) for more details. |
| UseCyclesPerSecondTimer | Set to 1 to help libraries transition to the new timer resolution: the resolution returned in the Level Zero device properties has been changed to cycles/second. |
| NEOReadDebugKeys | Set to 1 to read debug environment variables on Linux release builds. See [NEOReadDebugKeys](https://github.com/intel/compute-runtime/blob/master/FAQ.md#how-can-i-enable-reading-debug-environment-variables-on-linux-release-builds) for more details. |

* Script:
```
import jax
import jax.numpy as jnp

print("jax.local_devices(): ", jax.local_devices())

@jax.jit
def lax_conv():
    key = jax.random.PRNGKey(0)
    lhs = jax.random.uniform(key, (2,1,9,9), jnp.float32)
    rhs = jax.random.uniform(key, (1,1,4,4), jnp.float32)
    side = jax.random.uniform(key, (1,1,1,1), jnp.float32)
    out = jax.lax.conv_with_general_padding(lhs, rhs, (1,1), ((0,0),(0,0)), (1,1), (1,1))
    out = jax.nn.relu(out)
    out = jnp.multiply(out, side)
    return out

jax.profiler.start_trace("./profile_tmp")
print(lax_conv())
jax.profiler.stop_trace()
```

* Run:
```
$ export ZE_ENABLE_TRACING_LAYER=1
$ export UseCyclesPerSecondTimer=1
$ export NEOReadDebugKeys=1
$ python jax_conv.py
```
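
The variables from the table can also be set from Python rather than the shell. The following is a minimal sketch, not part of the extension: `PROFILER_ENV` and `apply_profiler_env` are hypothetical names, and it assumes the variables only need to be present in the process environment before the GPU runtime is loaded, so the call should come before `import jax`.

```python
import os

# The three variables listed in the table above; each is enabled with "1".
PROFILER_ENV = {
    "ZE_ENABLE_TRACING_LAYER": "1",
    "UseCyclesPerSecondTimer": "1",
    "NEOReadDebugKeys": "1",
}


def apply_profiler_env(environ=os.environ):
    """Set each profiler variable unless it is already present.

    Values chosen by the caller (for example in the shell) are left
    untouched. Returns the names that were newly set, in table order.
    """
    newly_set = []
    for name, value in PROFILER_ENV.items():
        if name not in environ:
            environ[name] = value
            newly_set.append(name)
    return newly_set


# Intended use (assumption: must run before the runtime is loaded):
#   apply_profiler_env()
#   import jax
```
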
When the script finishes, it writes the collected profiling data to the `profile_tmp` directory. Choose one of the following tools to visualize it.

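Before opening a visualizer, it can help to confirm that the run actually produced profile files. The sketch below rests on an assumption: the JAX profiler is expected to write `*.xplane.pb` and `*.trace.json.gz` files somewhere below the log directory, but the exact layout may vary between versions, so the hypothetical helper `find_profile_files` simply searches recursively.

```python
from pathlib import Path

# File name patterns the profiler is assumed to emit (not confirmed here).
PROFILE_PATTERNS = ("*.xplane.pb", "*.trace.json.gz")


def find_profile_files(logdir):
    """Return a sorted list of profile files found anywhere below logdir."""
    root = Path(logdir)
    found = []
    for pattern in PROFILE_PATTERNS:
        found.extend(root.rglob(pattern))
    return sorted(found)


if __name__ == "__main__":
    for path in find_profile_files("./profile_tmp"):
        print(path)
```

An empty result suggests that tracing was never started or that the script exited before `jax.profiler.stop_trace()` ran.
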
### TensorBoard profiling
TensorBoard's profiler can be used to profile JAX or TensorFlow programs. TensorBoard lets you acquire and visualize performance traces and profiles of your program.

The end result looks something like this:
<p align="center">
<img src="images/tensorboard_profile.png" alt="tensorboard_profile.png" />
</p>


* Requirement:
```
pip install -U tensorboard-plugin-profile
```

* Run TensorBoard:
After running the Python script above, you will find the log files in ./profile_tmp. Then run TensorBoard with the following command:
```
tensorboard --logdir=./profile_tmp --bind_all
```

* Analyze the result from the Profile tab:
The GPU profiler supports the following profiling items:
    * kernel_stats:
    ![image](images/tensorboard_profile_kernelstats.png)
    * framework_op_stats:
    ![image](images/tensorboard_profile_opstats.png)

    * trace_viewer:
    ![image](images/tensorboard_profile_traceview.png)


### Perfetto profiling
[Perfetto](https://ui.perfetto.dev/) is a high-performance system tracing and analysis tool, primarily used to capture and analyze performance events on Linux or Android systems. It can also visualize profiling data generated by the JAX profiler.
After running the Python script above, you will find the log files in ./profile_tmp. Then follow the steps below:
* Preparation
Decompress the .gz file in ./profile_tmp:
```
gunzip xxx.trace.json.gz
```
This produces xxx.trace.json.

* Open the trace file within Perfetto:
![image](images/perfetto_profile.png)

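The preparation step can also be done from Python, which avoids depending on a particular command-line tool. This is a minimal sketch under two assumptions that this document does not state: the `.gz` file is an ordinary gzip archive, and the decompressed JSON has a top-level `traceEvents` list as in the Chrome trace event format. `decompress_trace` and `count_events` are hypothetical helper names.

```python
import gzip
import json
import shutil
from collections import Counter
from pathlib import Path


def decompress_trace(gz_path):
    """Write xxx.trace.json next to xxx.trace.json.gz and return its path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def count_events(json_path):
    """Count trace events by name (assumes a top-level "traceEvents" list)."""
    with open(json_path, "r", encoding="utf-8") as f:
        trace = json.load(f)
    events = trace.get("traceEvents", [])
    return Counter(event.get("name", "") for event in events)
```

`count_events(decompress_trace(path)).most_common(10)` gives a quick list of the most frequent event names before the file is loaded into Perfetto.
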
## FAQ
1. If you see "No dashboards are activated for the current data set." the first time you open TensorBoard in the browser:

   Refresh the page, and the profile should be shown.

third_party/openxla.patch

Lines changed: 3 additions & 2 deletions
@@ -404,16 +404,17 @@ index e23dcc3a4..aaaf22ed8 100644
      system_link_files = {
          "//third_party/systemlibs:BUILD": "bazel/BUILD",
 diff --git a/xla/backends/profiler/plugin/BUILD b/xla/backends/profiler/plugin/BUILD
-index 169a4eaa4..1b8c0bae0 100644
+index 169a4eaa4..161e4e045 100644
 --- a/xla/backends/profiler/plugin/BUILD
 +++ b/xla/backends/profiler/plugin/BUILD
-@@ -62,6 +62,9 @@ cc_library(
+@@ -62,6 +62,10 @@ cc_library(
      deps = [
          ":profiler_c_api_hdrs",
          ":profiler_error",
 +        "//xla/backends/profiler/cpu:host_tracer",
 +        "//xla/backends/profiler/cpu:metadata_collector",
 +        "//xla/backends/profiler/cpu:python_tracer",
++        "@intel_extension_for_openxla//xla/profiler:sycl_device_tracer",
          "@tsl//tsl/platform:logging",
          "@tsl//tsl/profiler/lib:profiler_collection",
          "@tsl//tsl/profiler/lib:profiler_factory",

xla/profiler/BUILD

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
package(default_visibility = ["//visibility:public"])

cc_library(
    name = "sycl_device_tracer",
    srcs = ["device_tracer_sycl.cc"],
    linkstatic = 1,
    visibility = ["//visibility:public"],
    deps = [
        ":ze_tracer",
        "//xla/stream_executor/sycl:hw_info",
        "//xla/stream_executor/sycl:sycl_gpu_runtime",
        "@tsl//tsl/profiler/backends/cpu:annotation_stack",
        "@tsl//tsl/profiler/lib:profiler_factory",
        "@tsl//tsl/profiler/lib:profiler_interface",
        "@tsl//tsl/profiler/protobuf:trace_events_proto_cc",
        "@tsl//tsl/profiler/protobuf:xplane_proto_cc",
        "@tsl//tsl/profiler/protobuf:profiler_options_proto_cc",
        "@tsl//tsl/profiler/utils:parse_annotation",
        "@tsl//tsl/profiler/utils:trace_utils",
        "@tsl//tsl/profiler/utils:tf_xplane_visitor",
        "@tsl//tsl/profiler/utils:xplane_builder",
        "@tsl//tsl/profiler/utils:xplane_utils",
        "@tsl//tsl/profiler/utils:xplane_schema",
    ],
    alwayslink = True,
)

cc_library(
    name = "ze_tracer",
    hdrs = [
        "trace_options.h",
        "tracing.h",
        "ze_api_collector.h",
        "ze_kernel_collector.h",
        "ze_tracer.h",
        "ze_utils.h",
    ],
    srcs = [
        ":profiler_utils",
    ],
    visibility = ["//visibility:public"],
    deps = [
        ":ze_correlator",
        "@tsl//tsl/platform:abi",
    ],
)

cc_library(
    name = "ze_correlator",
    hdrs = [
        "correlator.h",
    ],
    srcs = [
        "correlator.cc",
        ":profiler_utils",
    ],
    visibility = ["//visibility:public"],
    deps = [
        "@com_google_absl//absl/time",
        "@tsl//tsl/profiler/backends/cpu:annotation_stack",
    ],
)

filegroup(
    name = "profiler_utils",
    srcs = [
        "utils.h",
    ],
    visibility = ["//visibility:public"],
)

xla/profiler/correlator.cc

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
/* Copyright (c) 2024 Intel Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

//==============================================================
// Copyright (C) Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include "xla/profiler/correlator.h"

thread_local uint64_t Correlator::kernel_id_ = 0;
namespace xla {
namespace profiler {
// Returns the current CPU wallclock time in nanoseconds.
int64_t GetCurrentTimeNanos() {
  // absl::GetCurrentTimeNanos() is much faster than EnvTime::NowNanos().
  // It is wrapped under xla::profiler::GetCurrentTimeNanos to avoid ODR
  // violation and to allow switching to yet another implementation if required.
  return absl::GetCurrentTimeNanos();
}
}  // namespace profiler
}  // namespace xla

xla/profiler/correlator.h

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
/* Copyright (c) 2021-2022 Intel Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

//==============================================================
// Copyright (C) Intel Corporation
//
// SPDX-License-Identifier: MIT
//==============================================================
#ifndef XLA_PROFILER_CORRELATOR_H_
#define XLA_PROFILER_CORRELATOR_H_

#include <level_zero/ze_api.h>

#include <map>
#include <vector>
#include <iostream>

#include "absl/time/clock.h"
#include "xla/profiler/utils.h"

namespace xla{
namespace profiler {
// Returns the current CPU wallclock time in nanoseconds.
int64_t GetCurrentTimeNanos();

}  // namespace profiler
}  // namespace xla

struct ApiCollectorOptions {
  bool call_tracing;
  bool need_tid;
  bool need_pid;
};

class Correlator {
 public:
  Correlator() : base_time_(xla::profiler::GetCurrentTimeNanos()) {
  }

  uint64_t GetTimestamp() const {
    return xla::profiler::GetCurrentTimeNanos() - base_time_;
  }

  uint64_t GetStartPoint() const { return base_time_; }

  uint64_t GetKernelIdVector() const { return kernel_id_; }

  void SetKernelId(uint64_t kernel_id) { kernel_id_ = kernel_id; }

  std::vector<uint64_t> GetKernelIdVector (ze_command_list_handle_t command_list) {
    if (kernel_id_map_.count(command_list) > 0) {
      return kernel_id_map_[command_list];
    } else {
      return std::vector<uint64_t>();
    }
  }

  void CreateKernelIdList(ze_command_list_handle_t command_list) {
    kernel_id_map_[command_list] = std::vector<uint64_t>();
  }

  void RemoveKernelIdList(ze_command_list_handle_t command_list) {
    kernel_id_map_.erase(command_list);
  }

  void ResetKernelIdList(ze_command_list_handle_t command_list) {
    kernel_id_map_[command_list].clear();
  }

  void AddKernelId(ze_command_list_handle_t command_list, uint64_t kernel_id) {
    kernel_id_map_[command_list].push_back(kernel_id);
  }

  std::vector<uint64_t> GetCallIdVector(ze_command_list_handle_t command_list) {
    if (call_id_map_.count(command_list) > 0) {
      return call_id_map_[command_list];
    } else {
      return std::vector<uint64_t>();
    }
  }

  void CreateCallIdList(ze_command_list_handle_t command_list) {
    call_id_map_[command_list] = std::vector<uint64_t>();
  }

  void RemoveCallIdList(ze_command_list_handle_t command_list) {
    call_id_map_.erase(command_list);
  }

  void ResetCallIdList(ze_command_list_handle_t command_list) {
    call_id_map_[command_list].clear();
  }

  void AddCallId(ze_command_list_handle_t command_list, uint64_t call_id) {
    call_id_map_[command_list].push_back(call_id);
  }

 private:
  uint64_t base_time_;
  std::map<ze_command_list_handle_t, std::vector<uint64_t> > kernel_id_map_;
  std::map<ze_command_list_handle_t, std::vector<uint64_t> > call_id_map_;

  static thread_local uint64_t kernel_id_;
};

#endif  // XLA_PROFILER_CORRELATOR_H_
