
Commit 8449d32

Merge tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - Defer task cgroup unlink until after the dying task's final context
   switch so that controllers see the cgroup properly populated until
   the task is truly gone

 - cpuset cleanups and simplifications. Enforce that domain isolated
   CPUs stay in root or isolated partitions and fail if isolated +
   nohz_full would leave no housekeeping CPU. Fix sched/deadline root
   domain handling during CPU hot-unplug and a race for tasks in
   attaching cpusets

 - Misc fixes including memory reclaim protection documentation and
   selftest KTAP conformance

* tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
  cpuset: Treat cpusets in attaching as populated
  sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
  cgroup/cpuset: Introduce cpuset_cpus_allowed_locked()
  docs: cgroup: No special handling of unpopulated memcgs
  docs: cgroup: Note about sibling relative reclaim protection
  docs: cgroup: Explain reclaim protection target
  selftests/cgroup: conform test to KTAP format output
  cpuset: remove need_rebuild_sched_domains
  cpuset: remove global remote_children list
  cpuset: simplify node setting on error
  cgroup: include missing header for struct irq_work
  cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
  cgroup/cpuset: Globally track isolated_cpus update
  cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition
  cgroup/cpuset: Move up prstate_housekeeping_conflict() helper
  cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping
  cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()
  cgroup: Defer task cgroup unlink until after the task is done switching out
  cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
  cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
  ...
2 parents 2b60145 + b1bcaed commit 8449d32

File tree

20 files changed: +436, -206 lines

Documentation/admin-guide/cgroup-v2.rst

Lines changed: 25 additions & 6 deletions
@@ -53,7 +53,8 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
      5-2. Memory
        5-2-1. Memory Interface Files
        5-2-2. Usage Guidelines
-       5-2-3. Memory Ownership
+       5-2-3. Reclaim Protection
+       5-2-4. Memory Ownership
      5-3. IO
        5-3-1. IO Interface Files
        5-3-2. Writeback
@@ -1317,7 +1318,7 @@ PAGE_SIZE multiple when read back.
 	smaller overages.
 
 	Effective min boundary is limited by memory.min values of
-	all ancestor cgroups. If there is memory.min overcommitment
+	ancestor cgroups. If there is memory.min overcommitment
 	(child cgroup or cgroups are requiring more protected memory
 	than parent will allow), then each child cgroup will get
 	the part of parent's protection proportional to its
@@ -1326,9 +1327,6 @@ PAGE_SIZE multiple when read back.
 	Putting more memory than generally available under this
 	protection is discouraged and may lead to constant OOMs.
 
-	If a memory cgroup is not populated with processes,
-	its memory.min is ignored.
-
   memory.low
 	A read-write single value file which exists on non-root
 	cgroups. The default is "0".
@@ -1343,7 +1341,7 @@ PAGE_SIZE multiple when read back.
 	smaller overages.
 
 	Effective low boundary is limited by memory.low values of
-	all ancestor cgroups. If there is memory.low overcommitment
+	ancestor cgroups. If there is memory.low overcommitment
 	(child cgroup or cgroups are requiring more protected memory
 	than parent will allow), then each child cgroup will get
 	the part of parent's protection proportional to its
@@ -1934,6 +1932,27 @@ memory - is necessary to determine whether a workload needs more
 memory; unfortunately, memory pressure monitoring mechanism isn't
 implemented yet.
 
+Reclaim Protection
+~~~~~~~~~~~~~~~~~~
+
+The protection configured with "memory.low" or "memory.min" applies relatively
+to the target of the reclaim (i.e. any of memory cgroup limits, proactive
+memory.reclaim or global reclaim apparently located in the root cgroup).
+The protection value configured for B applies unchanged to the reclaim
+targeting A (i.e. caused by competition with the sibling E)::
+
+	root - ... - A - B - C
+	              \    ` D
+	               ` E
+
+When the reclaim targets ancestors of A, the effective protection of B is
+capped by the protection value configured for A (and any other intermediate
+ancestors between A and the target).
+
+To express indifference about relative sibling protection, it is suggested to
+use memory_recursiveprot. Configuring all descendants of a parent with finite
+protection to "max" works but it may unnecessarily skew memory.events:low
+field.
 
 Memory Ownership
 ~~~~~~~~~~~~~~~~
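
The overcommitment rule restated above (each child gets a share of the parent's protection proportional to its claim) reduces to simple arithmetic. Below is a minimal C sketch of that proportional split, assuming a single level of children; the function and parameter names are illustrative, and the flat integer math ignores overflow for brevity. This is not the kernel's implementation, which lives in mm/memcontrol.c.

#include <stdint.h>

/*
 * parent_protection: effective protection the parent can pass down
 * child_claim:       min(child's usage, child's configured memory.min/low)
 * siblings_claim:    sum of that quantity over all of the parent's children
 */
static uint64_t child_effective_protection(uint64_t parent_protection,
					   uint64_t child_claim,
					   uint64_t siblings_claim)
{
	/* No overcommitment: every child's claim is honored as configured. */
	if (siblings_claim <= parent_protection)
		return child_claim;

	/*
	 * Overcommitment: scale each child's claim down by the fraction of
	 * the combined claims that the parent's protection actually covers.
	 */
	return child_claim * parent_protection / siblings_claim;
}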

include/linux/cgroup.h

Lines changed: 8 additions & 6 deletions
@@ -137,9 +137,10 @@ extern void cgroup_cancel_fork(struct task_struct *p,
 			       struct kernel_clone_args *kargs);
 extern void cgroup_post_fork(struct task_struct *p,
 			     struct kernel_clone_args *kargs);
-void cgroup_exit(struct task_struct *p);
-void cgroup_release(struct task_struct *p);
-void cgroup_free(struct task_struct *p);
+void cgroup_task_exit(struct task_struct *p);
+void cgroup_task_dead(struct task_struct *p);
+void cgroup_task_release(struct task_struct *p);
+void cgroup_task_free(struct task_struct *p);
 
 int cgroup_init_early(void);
 int cgroup_init(void);
@@ -680,9 +681,10 @@ static inline void cgroup_cancel_fork(struct task_struct *p,
 				      struct kernel_clone_args *kargs) {}
 static inline void cgroup_post_fork(struct task_struct *p,
 				    struct kernel_clone_args *kargs) {}
-static inline void cgroup_exit(struct task_struct *p) {}
-static inline void cgroup_release(struct task_struct *p) {}
-static inline void cgroup_free(struct task_struct *p) {}
+static inline void cgroup_task_exit(struct task_struct *p) {}
+static inline void cgroup_task_dead(struct task_struct *p) {}
+static inline void cgroup_task_release(struct task_struct *p) {}
+static inline void cgroup_task_free(struct task_struct *p) {}
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
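
Read together with the merge description, the renamed hooks map onto four stages of task teardown. A hedged sketch of the ordering follows; the cgroup_task_dead() call site is confirmed by the comment in the cgroup.c diff below, while the other call sites are inferred from the hooks' old names and are not verified against kernel/exit.c:

/*
 * Inferred teardown sequence for a dying task (call sites assumed):
 *
 *   do_exit()            -> cgroup_task_exit()     run subsys ->exit() callbacks
 *   finish_task_switch() -> cgroup_task_dead()     unlink from the css_set,
 *                                                  deferred until after the
 *                                                  task's final context switch
 *   release_task()       -> cgroup_task_release()  run subsys ->release() callbacks
 *   free_task()          -> cgroup_task_free()     drop cg_list linkage and the
 *                                                  css_set reference
 */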

include/linux/cpuset.h

Lines changed: 8 additions & 1 deletion
@@ -74,6 +74,7 @@ extern void inc_dl_tasks_cs(struct task_struct *task);
 extern void dec_dl_tasks_cs(struct task_struct *task);
 extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
+extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern bool cpuset_cpu_is_isolated(int cpu);
@@ -195,10 +196,16 @@ static inline void dec_dl_tasks_cs(struct task_struct *task) { }
 static inline void cpuset_lock(void) { }
 static inline void cpuset_unlock(void) { }
 
+static inline void cpuset_cpus_allowed_locked(struct task_struct *p,
+					      struct cpumask *mask)
+{
+	cpumask_copy(mask, task_cpu_possible_mask(p));
+}
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
-	cpumask_copy(mask, task_cpu_possible_mask(p));
+	cpuset_cpus_allowed_locked(p, mask);
 }
 
 static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
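
The split suggests the kernel's usual locked/unlocked pairing: the new _locked variant is for callers that already hold the cpuset lock (e.g. the sched/deadline hot-unplug path named in the merge description), while cpuset_cpus_allowed() presumably acquires the lock itself. A hedged illustration of that assumed convention; example_query_mask() is hypothetical, not a real call site:

#include <linux/cpuset.h>
#include <linux/cpumask.h>
#include <linux/sched.h>

/* Hypothetical caller, sketching the assumed locking convention. */
static void example_query_mask(struct task_struct *p, struct cpumask *mask)
{
	/* Section that already holds the cpuset lock: use the _locked variant. */
	cpuset_lock();
	cpuset_cpus_allowed_locked(p, mask);	/* assumes the lock is held */
	cpuset_unlock();

	/* Ordinary context: the unlocked wrapper handles locking internally. */
	cpuset_cpus_allowed(p, mask);
}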

include/linux/sched.h

Lines changed: 4 additions & 1 deletion
@@ -1324,7 +1324,10 @@ struct task_struct {
 	struct css_set __rcu		*cgroups;
 	/* cg_list protected by css_set_lock and tsk->alloc_lock: */
 	struct list_head		cg_list;
-#endif
+#ifdef CONFIG_PREEMPT_RT
+	struct llist_node		cg_dead_lnode;
+#endif /* CONFIG_PREEMPT_RT */
+#endif /* CONFIG_CGROUPS */
 #ifdef CONFIG_X86_CPU_RESCTRL
 	u32				closid;
 	u32				rmid;

kernel/cgroup/cgroup.c

Lines changed: 76 additions & 15 deletions
@@ -60,6 +60,7 @@
 #include <linux/sched/deadline.h>
 #include <linux/psi.h>
 #include <linux/nstree.h>
+#include <linux/irq_work.h>
 #include <net/sock.h>
 
 #define CREATE_TRACE_POINTS
@@ -287,6 +288,7 @@ static void kill_css(struct cgroup_subsys_state *css);
 static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 			      struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
+static void cgroup_rt_init(void);
 
 #ifdef CONFIG_DEBUG_CGROUP_REF
 #define CGROUP_REF_FN_ATTRS	noinline
@@ -941,7 +943,8 @@ static void css_set_move_task(struct task_struct *task,
 		/*
 		 * We are synchronized through cgroup_threadgroup_rwsem
 		 * against PF_EXITING setting such that we can't race
-		 * against cgroup_exit()/cgroup_free() dropping the css_set.
+		 * against cgroup_task_dead()/cgroup_task_free() dropping
+		 * the css_set.
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);
 
@@ -6354,6 +6357,7 @@ int __init cgroup_init(void)
 	BUG_ON(ss_rstat_init(NULL));
 
 	get_user_ns(init_cgroup_ns.user_ns);
+	cgroup_rt_init();
 
 	cgroup_lock();
 
@@ -6967,19 +6971,29 @@ void cgroup_post_fork(struct task_struct *child,
 }
 
 /**
- * cgroup_exit - detach cgroup from exiting task
+ * cgroup_task_exit - detach cgroup from exiting task
  * @tsk: pointer to task_struct of exiting process
  *
  * Description: Detach cgroup from @tsk.
  *
  */
-void cgroup_exit(struct task_struct *tsk)
+void cgroup_task_exit(struct task_struct *tsk)
 {
 	struct cgroup_subsys *ss;
-	struct css_set *cset;
 	int i;
 
-	spin_lock_irq(&css_set_lock);
+	/* see cgroup_post_fork() for details */
+	do_each_subsys_mask(ss, i, have_exit_callback) {
+		ss->exit(tsk);
+	} while_each_subsys_mask();
+}
+
+static void do_cgroup_task_dead(struct task_struct *tsk)
+{
+	struct css_set *cset;
+	unsigned long flags;
+
+	spin_lock_irqsave(&css_set_lock, flags);
 
 	WARN_ON_ONCE(list_empty(&tsk->cg_list));
 	cset = task_css_set(tsk);
@@ -6997,34 +7011,81 @@ void cgroup_post_fork(struct task_struct *child,
 	    test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
 		cgroup_update_frozen(task_dfl_cgroup(tsk));
 
-	spin_unlock_irq(&css_set_lock);
+	spin_unlock_irqrestore(&css_set_lock, flags);
+}
 
-	/* see cgroup_post_fork() for details */
-	do_each_subsys_mask(ss, i, have_exit_callback) {
-		ss->exit(tsk);
-	} while_each_subsys_mask();
+#ifdef CONFIG_PREEMPT_RT
+/*
+ * cgroup_task_dead() is called from finish_task_switch() which doesn't allow
+ * scheduling even in RT. As the task_dead path requires grabbing css_set_lock,
+ * this leads to a sleeping-in-invalid-context warning. css_set_lock is too
+ * big to become a raw_spinlock. The task_dead path doesn't need to run
+ * synchronously but can't be delayed indefinitely either as the dead task pins
+ * the cgroup and task_struct can be pinned indefinitely. Bounce through lazy
+ * irq_work to allow batching while ensuring timely completion.
+ */
+static DEFINE_PER_CPU(struct llist_head, cgrp_dead_tasks);
+static DEFINE_PER_CPU(struct irq_work, cgrp_dead_tasks_iwork);
+
+static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
+{
+	struct llist_node *lnode;
+	struct task_struct *task, *next;
+
+	lnode = llist_del_all(this_cpu_ptr(&cgrp_dead_tasks));
+	llist_for_each_entry_safe(task, next, lnode, cg_dead_lnode) {
+		do_cgroup_task_dead(task);
+		put_task_struct(task);
+	}
+}
+
+static void __init cgroup_rt_init(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
+		per_cpu(cgrp_dead_tasks_iwork, cpu) =
+			IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);
+	}
+}
+
+void cgroup_task_dead(struct task_struct *task)
+{
+	get_task_struct(task);
+	llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
+	irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
 }
+#else /* CONFIG_PREEMPT_RT */
+static void __init cgroup_rt_init(void) {}
 
-void cgroup_release(struct task_struct *task)
+void cgroup_task_dead(struct task_struct *task)
+{
+	do_cgroup_task_dead(task);
+}
+#endif /* CONFIG_PREEMPT_RT */
+
+void cgroup_task_release(struct task_struct *task)
 {
 	struct cgroup_subsys *ss;
 	int ssid;
 
 	do_each_subsys_mask(ss, ssid, have_release_callback) {
 		ss->release(task);
 	} while_each_subsys_mask();
+}
+
+void cgroup_task_free(struct task_struct *task)
+{
+	struct css_set *cset = task_css_set(task);
 
 	if (!list_empty(&task->cg_list)) {
 		spin_lock_irq(&css_set_lock);
 		css_set_skip_task_iters(task_css_set(task), task);
 		list_del_init(&task->cg_list);
 		spin_unlock_irq(&css_set_lock);
 	}
-}
 
-void cgroup_free(struct task_struct *task)
-{
-	struct css_set *cset = task_css_set(task);
 	put_css_set(cset);
 }

kernel/cgroup/cpuset-internal.h

Lines changed: 7 additions & 6 deletions
@@ -155,12 +155,16 @@ struct cpuset {
 	/* for custom sched domain */
 	int relax_domain_level;
 
-	/* number of valid local child partitions */
-	int nr_subparts;
-
 	/* partition root state */
 	int partition_root_state;
 
+	/*
+	 * Whether cpuset is a remote partition.
+	 * It used to be a list anchoring all remote partitions; we can switch
+	 * back to a list if we need to iterate over the remote partitions.
+	 */
+	bool remote_partition;
+
 	/*
 	 * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
 	 * know when to rebuild associated root domain bandwidth information.
@@ -175,9 +179,6 @@ struct cpuset {
 	/* Handle for cpuset.cpus.partition */
 	struct cgroup_file partition_file;
 
-	/* Remote partition silbling list anchored at remote_children */
-	struct list_head remote_sibling;
-
 	/* Used to merge intersecting subsets for generate_sched_domains */
 	struct uf_node node;
 };
