
Commit cb43691

Kan Liang authored and Peter Zijlstra committed
perf: Save PMU specific data in task_struct
Some PMU specific data has to be saved/restored during a context switch, e.g. LBR call stack data. Currently, the data is saved in the event context structure, but only for per-process events. For system-wide events, the LBR call stack data is lost across context switches, so LBR call stacks are always shorter than in per-process mode. For example:

Per-process mode:
$ perf record --call-graph lbr -- taskset -c 0 ./tchain_edit

-   99.90%    99.86%  tchain_edit  tchain_edit  [.] f3
     99.86% _start
        __libc_start_main
        generic_start_main
        main
        f1
      - f2
           f3

System-wide mode:
$ perf record --call-graph lbr -a -- taskset -c 0 ./tchain_edit

-   99.88%    99.82%  tchain_edit  tchain_edit  [.] f3
   - 62.02% main
        f1
        f2
        f3
   - 28.83% f1
      - f2
           f3
   - 28.83% f1
      - f2
           f3
   - 8.88% generic_start_main
        main
        f1
        f2
        f3

It isn't practical to simply allocate the data for system-wide events in the CPU context structure for all tasks: we have no idea which CPU a task will be scheduled to, so duplicated LBR data would have to be maintained in every CPU context structure, which is a huge waste. Otherwise, the LBR data would still be lost if the task is scheduled to another CPU.

Save the PMU specific data in task_struct instead. The size of the PMU specific data is 788 bytes for the LBR call stack. Usually, the overall number of threads doesn't exceed a few thousand; for 10K threads, keeping the LBR data would consume an additional ~8MB. The additional space is only allocated during LBR call stack monitoring and is released when the monitoring is finished.

Furthermore, moving task_ctx_data from perf_event_context to task_struct reduces complexity and makes things clearer. E.g. perf no longer needs to swap task_ctx_data on the optimized context switch path.

This patch set is just the first step. There could be other optimizations/extensions on top of it. E.g. for cgroup profiling, perf only needs to save/restore the LBR call stack information for tasks in a specific cgroup, which could reduce the additional space. The LBR call stack could also be made available for software events, or even enable debugging use cases, like LBRs on crash, later.

Because of the alignment requirement of Intel Arch LBR, a Kmem cache is used to allocate the PMU specific data. It is required when a child task allocates the space. Save it in struct perf_ctx_data. The refcount in struct perf_ctx_data is used to track the users of the PMU specific data.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Alexey Budankov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
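To make the allocation scheme described above concrete, here is a minimal, hypothetical sketch (not part of this commit) of how a PMU driver might attach the per-task data: the buffer comes from a kmem_cache (to satisfy the Arch LBR alignment requirement), the refcount starts at one for the first user, and the pointer is published with rcu_assign_pointer(). The name attach_task_ctx_data() is illustrative only.

/*
 * Illustrative sketch only -- not taken from this commit.
 * Assumes <linux/slab.h>, <linux/rcupdate.h> and <linux/refcount.h>.
 */
static int attach_task_ctx_data(struct task_struct *task,
				struct kmem_cache *ctx_cache)
{
	struct perf_ctx_data *cd;

	cd = kzalloc(sizeof(*cd), GFP_KERNEL);
	if (!cd)
		return -ENOMEM;

	/* The kmem_cache provides the Arch LBR alignment guarantee. */
	cd->data = kmem_cache_zalloc(ctx_cache, GFP_KERNEL);
	if (!cd->data) {
		kfree(cd);
		return -ENOMEM;
	}

	cd->ctx_cache = ctx_cache;
	refcount_set(&cd->refcount, 1);		/* first user */

	rcu_assign_pointer(task->perf_ctx_data, cd);
	return 0;
}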
1 parent c53e14f commit cb43691

3 files changed: +38 −0 lines


include/linux/perf_event.h

Lines changed: 35 additions & 0 deletions
@@ -1021,6 +1021,41 @@ struct perf_event_context {
 	local_t				nr_no_switch_fast;
 };
 
+/**
+ * struct perf_ctx_data - PMU specific data for a task
+ * @rcu_head:  To avoid the race on free PMU specific data
+ * @refcount:  To track users
+ * @global:    To track system-wide users
+ * @ctx_cache: Kmem cache of PMU specific data
+ * @data:      PMU specific data
+ *
+ * Currently, the struct is only used in Intel LBR call stack mode to
+ * save/restore the call stack of a task on context switches.
+ *
+ * The rcu_head is used to prevent the race on freeing the data.
+ * The data is only allocated when Intel LBR call stack mode is enabled.
+ * The data will be freed when the mode is disabled.
+ * The content of the data will only be accessed in context switch, which
+ * should be protected by rcu_read_lock().
+ *
+ * Because of the alignment requirement of Intel Arch LBR, the Kmem cache
+ * is used to allocate the PMU specific data. The ctx_cache is to track
+ * the Kmem cache.
+ *
+ * Careful: Struct perf_ctx_data is added as a pointer in struct task_struct.
+ * When system-wide Intel LBR call stack mode is enabled, a buffer with
+ * constant size will be allocated for each task.
+ * Also, system memory consumption can further grow when the size of
+ * struct perf_ctx_data enlarges.
+ */
+struct perf_ctx_data {
+	struct rcu_head			rcu_head;
+	refcount_t			refcount;
+	int				global;
+	struct kmem_cache		*ctx_cache;
+	void				*data;
+};
+
 struct perf_cpu_pmu_context {
 	struct perf_event_pmu_context	epc;
 	struct perf_event_pmu_context	*task_epc;
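The kernel-doc above states that the data is only accessed on context switch, under rcu_read_lock(). A minimal sketch of such a reader might look as follows; perf_save_task_ctx_data() and save_lbr_callstack() are hypothetical names used for illustration, not functions added by this diff.

/* Hypothetical reader on the context-switch path (sketch only). */
static void perf_save_task_ctx_data(struct task_struct *prev)
{
	struct perf_ctx_data *cd;

	rcu_read_lock();
	cd = rcu_dereference(prev->perf_ctx_data);
	if (cd)
		save_lbr_callstack(cd->data);	/* placeholder for PMU work */
	rcu_read_unlock();
}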

include/linux/sched.h

Lines changed: 2 additions & 0 deletions
@@ -65,6 +65,7 @@ struct mempolicy;
 struct nameidata;
 struct nsproxy;
 struct perf_event_context;
+struct perf_ctx_data;
 struct pid_namespace;
 struct pipe_inode_info;
 struct rcu_node;
@@ -1311,6 +1312,7 @@ struct task_struct {
 	struct perf_event_context	*perf_event_ctxp;
 	struct mutex			perf_event_mutex;
 	struct list_head		perf_event_list;
+	struct perf_ctx_data __rcu	*perf_ctx_data;
 #endif
 #ifdef CONFIG_DEBUG_PREEMPT
 	unsigned long			preempt_disable_ip;
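Since other CPUs may still be dereferencing task->perf_ctx_data through the __rcu pointer added above, releasing the buffer has to wait for a grace period. Below is a hedged sketch of how the last user might detach and free the data using the rcu_head and ctx_cache fields; detach_task_ctx_data(), free_perf_ctx_data_rcu() and the locking assumption are illustrative, not part of this commit.

/* Sketch only: free after a grace period once the last user is gone. */
static void free_perf_ctx_data_rcu(struct rcu_head *rcu_head)
{
	struct perf_ctx_data *cd;

	cd = container_of(rcu_head, struct perf_ctx_data, rcu_head);
	kmem_cache_free(cd->ctx_cache, cd->data);
	kfree(cd);
}

static void detach_task_ctx_data(struct task_struct *task)
{
	struct perf_ctx_data *cd;

	/* Assumes the caller serializes updaters of task->perf_ctx_data. */
	cd = rcu_dereference_protected(task->perf_ctx_data, true);
	if (!cd || !refcount_dec_and_test(&cd->refcount))
		return;

	rcu_assign_pointer(task->perf_ctx_data, NULL);
	call_rcu(&cd->rcu_head, free_perf_ctx_data_rcu);
}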

kernel/events/core.c

Lines changed: 1 addition & 0 deletions
@@ -14070,6 +14070,7 @@ int perf_event_init_task(struct task_struct *child, u64 clone_flags)
 	child->perf_event_ctxp = NULL;
 	mutex_init(&child->perf_event_mutex);
 	INIT_LIST_HEAD(&child->perf_event_list);
+	child->perf_ctx_data = NULL;
 
 	ret = perf_event_init_context(child, clone_flags);
 	if (ret) {
