2019-05-19 15:08:55 +03:00
// SPDX-License-Identifier: GPL-2.0-only
2005-04-17 02:20:36 +04:00
/*
2010-09-10 18:51:36 +04:00
* kernel/workqueue.c - generic async execution with shared worker pool
2005-04-17 02:20:36 +04:00
*
2010-09-10 18:51:36 +04:00
* Copyright (C) 2002 Ingo Molnar
2005-04-17 02:20:36 +04:00
*
2010-09-10 18:51:36 +04:00
* Derived from the taskqueue/keventd code by:
* David Woodhouse <dwmw2@infradead.org>
* Andrew Morton
* Kai Petzke <wpp@marie.physik.tu-berlin.de>
* Theodore Ts'o <tytso@mit.edu>
2005-04-17 02:20:36 +04:00
*
2010-09-10 18:51:36 +04:00
* Made to use alloc_percpu by Christoph Lameter.
2005-04-17 02:20:36 +04:00
*
2010-09-10 18:51:36 +04:00
* Copyright (C) 2010 SUSE Linux Products GmbH
* Copyright (C) 2010 Tejun Heo <tj@kernel.org>
2005-10-31 02:01:59 +03:00
*
2010-09-10 18:51:36 +04:00
* This is the generic async execution mechanism. Work items are
* executed in process context. The worker pool is shared and
2013-08-21 04:50:39 +04:00
* automatically managed. There are two worker pools for each CPU (one for
* normal work items and the other for high priority ones) and some extra
* pools for workqueues which are not bound to any specific CPU - the
* number of these backing pools is dynamic.
2010-09-10 18:51:36 +04:00
*
2017-08-07 05:33:22 +03:00
* Please read Documentation/core-api/workqueue.rst for details.
2005-04-17 02:20:36 +04:00
*/
2011-05-23 22:51:41 +04:00
#include <linux/export.h>
2005-04-17 02:20:36 +04:00
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/init.h>
2024-02-05 00:28:06 +03:00
#include <linux/interrupt.h>
2005-04-17 02:20:36 +04:00
#include <linux/signal.h>
#include <linux/completion.h>
#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/kthread.h>
2006-02-23 21:43:43 +03:00
#include <linux/hardirq.h>
2006-10-11 12:21:26 +04:00
#include <linux/mempolicy.h>
2006-12-07 07:34:49 +03:00
#include <linux/freezer.h>
2006-12-07 07:37:26 +03:00
#include <linux/debug_locks.h>
2007-10-19 10:39:55 +04:00
#include <linux/lockdep.h>
2010-06-29 12:07:11 +04:00
#include <linux/idr.h>
2013-03-12 22:30:03 +04:00
#include <linux/jhash.h>
2012-12-17 19:01:23 +04:00
#include <linux/hashtable.h>
2013-03-12 22:30:00 +04:00
#include <linux/rculist.h>
2013-04-01 22:23:32 +04:00
#include <linux/nodemask.h>
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 22:23:36 +04:00
#include <linux/moduleparam.h>
2013-05-01 02:27:22 +04:00
#include <linux/uaccess.h>
2017-11-03 18:27:50 +03:00
#include <linux/sched/isolation.h>
2023-03-07 15:53:35 +03:00
#include <linux/sched/debug.h>
2018-01-11 03:53:35 +03:00
#include <linux/nmi.h>
2021-05-20 13:14:22 +03:00
#include <linux/kvm_para.h>
2023-07-18 01:50:02 +03:00
#include <linux/delay.h>
2024-02-14 21:33:55 +03:00
#include <linux/irq_work.h>
2010-06-29 12:07:14 +04:00
2013-01-19 02:05:55 +04:00
#include "workqueue_internal.h"
2005-04-17 02:20:36 +04:00
2024-01-27 00:55:50 +03:00
enum worker_pool_flags {
2013-01-24 23:01:33 +04:00
/*
* worker_pool flags
2012-07-17 23:39:27 +04:00
*
2013-01-24 23:01:33 +04:00
* A bound pool is either associated or disassociated with its CPU.
2012-07-17 23:39:27 +04:00
* While associated (!DISASSOCIATED), all workers are bound to the
* CPU and none has %WORKER_UNBOUND set and concurrency management
* is in effect.
*
* While DISASSOCIATED, the cpu may be offline and all workers have
* %WORKER_UNBOUND set and concurrency management disabled, and may
2013-01-24 23:01:33 +04:00
* be executing on any CPU. The pool behaves as an unbound one.
2012-07-17 23:39:27 +04:00
*
2013-03-14 06:47:39 +04:00
* Note that DISASSOCIATED should be flipped only while holding
2018-05-18 18:47:13 +03:00
* wq_pool_attach_mutex to avoid changing binding state while
2014-05-20 13:46:35 +04:00
* worker_attach_to_pool() is in progress.
2024-02-05 00:28:06 +03:00
*
* As there can only be one concurrent BH execution context per CPU, a
* BH pool is per-CPU and always DISASSOCIATED.
2012-07-17 23:39:27 +04:00
*/
2024-02-05 00:28:06 +03:00
POOL_BH = 1 << 0, /* is a BH pool */
POOL_MANAGER_ACTIVE = 1 << 1, /* being managed */
2013-01-24 23:01:33 +04:00
POOL_DISASSOCIATED = 1 << 2, /* cpu can't serve workers */
2024-02-27 04:38:55 +03:00
POOL_BH_DRAINING = 1 << 3, /* draining after CPU offline */
2024-01-27 00:55:50 +03:00
};
2010-06-29 12:07:12 +04:00
2024-01-27 00:55:50 +03:00
enum worker_flags {
2010-06-29 12:07:12 +04:00
/* worker flags */
WORKER_DIE = 1 << 1, /* die die die */
WORKER_IDLE = 1 << 2, /* is idle */
2010-06-29 12:07:14 +04:00
WORKER_PREP = 1 << 3, /* preparing to run works */
2010-06-29 12:07:15 +04:00
WORKER_CPU_INTENSIVE = 1 << 6, /* cpu intensive */
2010-07-02 12:03:51 +04:00
WORKER_UNBOUND = 1 << 7, /* worker is unbound */
workqueue: directly restore CPU affinity of workers from CPU_ONLINE
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.
As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr(). Due
to the various states a worker can be in, this led to three different
paths a worker may be rebound. worker->rebind_work is queued to busy
workers. Idle ones are signaled by unlinking worker->entry and call
idle_worker_rebind(). The manager isn't covered by either and
implements its own mechanism.
PF_THREAD_BOUND has been replaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself now can manipulate CPU affinity of workers. This patch
replaces the existing rebind mechanism with direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.
There are a couple subtleties. All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up. This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.
Another subtlety stems from matching @pool->nr_running with the
number of running unbound workers. While DISASSOCIATED, all workers
are unbound and nr_running is zero. As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals. Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution. The only change
needed for the workers is clearing REBOUND along with PREP.
* This patch leaves for_each_busy_worker() without any user. Removed.
* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
and rebind logic in manager_workers() removed.
* worker_thread() now looks at WORKER_DIE instead of testing whether
@worker->entry is empty to determine whether it needs to do
something special as dying is the only special thing now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
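The UNBOUND-to-REBOUND swap described in the commit message above can be illustrated with a small userspace sketch. The flag values are copied from enum worker_flags in this file; everything prefixed sketch_ is a hypothetical, simplified stand-in and not the kernel's code. The assumption is that nr_running only changes when a worker crosses between "no NOT_RUNNING flag set" and "at least one set", so swapping one NOT_RUNNING flag for another leaves the counter alone until the worker clears REBOUND together with PREP.

```c
#include <assert.h>

/* Flag values copied from enum worker_flags in this file. */
enum {
	WORKER_PREP          = 1 << 3,
	WORKER_CPU_INTENSIVE = 1 << 6,
	WORKER_UNBOUND       = 1 << 7,
	WORKER_REBOUND       = 1 << 8,
	WORKER_NOT_RUNNING   = WORKER_PREP | WORKER_CPU_INTENSIVE |
			       WORKER_UNBOUND | WORKER_REBOUND,
};

/* Hypothetical, stripped-down stand-ins for worker_pool and worker. */
struct sketch_pool {
	int nr_running;
};

struct sketch_worker {
	unsigned int flags;
	struct sketch_pool *pool;
};

/* Set flags; a worker that just stopped counting as running drops nr_running. */
static void sketch_set_flags(struct sketch_worker *w, unsigned int flags)
{
	if ((flags & WORKER_NOT_RUNNING) && !(w->flags & WORKER_NOT_RUNNING))
		w->pool->nr_running--;
	w->flags |= flags;
}

/* Clear flags; once the last NOT_RUNNING flag is gone, count the worker again. */
static void sketch_clr_flags(struct sketch_worker *w, unsigned int flags)
{
	unsigned int oflags = w->flags;

	w->flags &= ~flags;
	if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING) &&
	    !(w->flags & WORKER_NOT_RUNNING))
		w->pool->nr_running++;
}

/*
 * Per-worker step of rebinding as described above: replace UNBOUND with
 * REBOUND in one assignment, so nr_running is not touched here.
 */
static void sketch_rebind(struct sketch_worker *w)
{
	assert(w->flags & WORKER_UNBOUND);
	w->flags = (w->flags | WORKER_REBOUND) & ~WORKER_UNBOUND;
}
```

A worker would later call sketch_clr_flags(w, WORKER_PREP | WORKER_REBOUND) when it prepares for the next work item, which is the point where nr_running is adjusted.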
2013-03-20 00:45:21 +04:00
WORKER_REBOUND = 1 << 8, /* worker was rebound */
2010-06-29 12:07:14 +04:00
2013-03-20 00:45:21 +04:00
WORKER_NOT_RUNNING = WORKER_PREP | WORKER_CPU_INTENSIVE |
WORKER_UNBOUND | WORKER_REBOUND,
2024-01-27 00:55:50 +03:00
};
2010-06-29 12:07:12 +04:00
2024-02-21 08:36:14 +03:00
enum work_cancel_flags {
WORK_CANCEL_DELAYED = 1 << 0, /* canceling a delayed_work */
2024-03-25 20:21:03 +03:00
WORK_CANCEL_DISABLE = 1 << 1, /* canceling to disable */
2024-02-21 08:36:14 +03:00
};
2024-01-27 00:55:50 +03:00
enum wq_internal_consts {
2013-01-24 23:01:33 +04:00
NR_STD_WORKER_POOLS = 2, /* # standard pools per cpu */
2012-07-14 09:16:44 +04:00
2013-03-12 22:30:03 +04:00
UNBOUND_POOL_HASH_ORDER = 6, /* hashed by pool->attrs */
2010-06-29 12:07:12 +04:00
BUSY_WORKER_HASH_ORDER = 6, /* 64 pointers */
2010-06-29 12:07:12 +04:00
2010-06-29 12:07:14 +04:00
MAX_IDLE_WORKERS_RATIO = 4, /* 1/4 of busy can be idle */
IDLE_WORKER_TIMEOUT = 300 * HZ, /* keep idle ones for 5 mins */
2011-02-16 20:10:19 +03:00
MAYDAY_INITIAL_TIMEOUT = HZ / 100 >= 2 ? HZ / 100 : 2,
/* call for help after 10ms
(min two ticks) */
2010-06-29 12:07:14 +04:00
MAYDAY_INTERVAL = HZ / 10, /* and then every 100ms */
CREATE_COOLDOWN = HZ, /* time to breathe after fail */
/*
* Rescue workers are used only on emergencies and shared by
2014-03-11 14:09:12 +04:00
* all cpus. Give MIN_NICE.
2010-06-29 12:07:14 +04:00
*/
2014-03-11 14:09:12 +04:00
RESCUER_NICE_LEVEL = MIN_NICE,
HIGHPRI_NICE_LEVEL = MIN_NICE,
2013-04-01 22:23:34 +04:00
2024-01-15 20:08:22 +03:00
WQ_NAME_LEN = 32,
2010-06-29 12:07:12 +04:00
};
2005-04-17 02:20:36 +04:00
2024-02-05 00:28:06 +03:00
/*
* We don't want to trap softirq for too long. See MAX_SOFTIRQ_TIME and
* MAX_SOFTIRQ_RESTART in kernel/softirq.c. These are macros because
* msecs_to_jiffies() can't be an initializer.
*/
#define BH_WORKER_JIFFIES msecs_to_jiffies(2)
#define BH_WORKER_RESTARTS 10
2005-04-17 02:20:36 +04:00
/*
2010-06-29 12:07:10 +04:00
* Structure fields follow one of the following exclusion rules.
*
2010-08-24 16:22:47 +04:00
* I: Modifiable by initialization/destruction paths and read-only for
* everyone else.
2010-06-29 12:07:10 +04:00
*
2010-06-29 12:07:14 +04:00
* P: Preemption protected. Disabling preemption is enough and should
* only be modified and accessed from the local cpu.
*
2013-01-24 23:01:33 +04:00
* L: pool->lock protected. Access with pool->lock held.
2010-06-29 12:07:10 +04:00
*
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice more
online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_weight than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
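The per-node split of max_active described in the commit message above can be worked through with a small userspace sketch. It assumes the proportional share is rounded up and then clamped between min_active and max_active; the rounding choice and every sketch_ name are assumptions made for illustration, not taken from the text above. Only the value 8 for the default min_active comes from the commit message.

```c
#include <assert.h>

/* The commit message gives WQ_DFL_MIN_ACTIVE as 8; the name here is local. */
#define SKETCH_DFL_MIN_ACTIVE 8

static int sketch_min(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical helper: per-node share of max_active for a node owning
 * node_cpus of the workqueue's total_cpus online effective CPUs.
 */
static int sketch_node_max_active(int max_active, int node_cpus, int total_cpus)
{
	/* min_active is the smaller of max_active and the default minimum. */
	int min_active = sketch_min(max_active, SKETCH_DFL_MIN_ACTIVE);
	int share;

	assert(max_active > 0 && total_cpus > 0);
	assert(node_cpus >= 0 && node_cpus <= total_cpus);

	/* Proportional share, rounded up (the rounding is an assumption). */
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;

	/* Never below min_active, never above the configured max_active. */
	if (share < min_active)
		share = min_active;
	if (share > max_active)
		share = max_active;
	return share;
}
```

With max_active = 256 on a machine whose two nodes hold 16 and 32 of the workqueue's 48 CPUs, this yields 86 and 171. With max_active = 16 and a node holding 2 of 64 CPUs, the share of 1 is raised to the min_active of 8, which is how the sum across nodes can end up higher than the configured limit.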
2024-01-29 21:11:25 +03:00
* LN: pool->lock and wq_node_nr_active->lock protected for writes. Either for
* reads.
*
2023-05-18 06:02:08 +03:00
* K: Only modified by worker while holding pool->lock. Can be safely read by
* self, while holding pool->lock or from IRQ context if %current is the
* kworker.
*
* S: Only modified by worker self.
*
2018-05-18 18:47:13 +03:00
* A: wq_pool_attach_mutex protected.
2013-03-20 00:45:21 +04:00
*
2013-03-26 03:57:17 +04:00
* PL: wq_pool_mutex protected.
2013-03-14 06:47:40 +04:00
*
2019-03-13 19:55:47 +03:00
* PR: wq_pool_mutex protected for writes. RCU protected for reads.
2013-03-12 22:30:00 +04:00
*
2015-05-12 15:32:29 +03:00
* PW: wq_pool_mutex and wq->mutex protected for writes. Either for reads.
*
* PWR: wq_pool_mutex and wq->mutex protected for writes. Either or
2019-03-13 19:55:47 +03:00
* RCU for reads.
2015-05-12 15:32:29 +03:00
*
2013-03-26 03:57:17 +04:00
* WQ: wq->mutex protected.
*
2019-03-13 19:55:47 +03:00
* WR: wq->mutex protected for writes. RCU protected for reads.
2013-03-14 06:47:40 +04:00
*
2024-01-29 21:11:24 +03:00
* WO: wq->mutex protected for writes. Updated with WRITE_ONCE() and can be read
* with READ_ONCE() without locking.
*
2013-03-14 06:47:40 +04:00
* MD: wq_mayday_lock protected.
2023-03-07 15:53:35 +03:00
*
* WD: Used internally by the watchdog.
2005-04-17 02:20:36 +04:00
*/
2013-01-19 02:05:55 +04:00
/* struct worker is defined in workqueue_internal.h */
2010-06-29 12:07:11 +04:00
2012-07-13 01:46:37 +04:00
struct worker_pool {
2020-05-27 22:46:33 +03:00
raw_spinlock_t lock; /* the pool lock */
2013-03-12 22:29:59 +04:00
int cpu; /* I: the associated cpu */
2013-04-01 22:23:34 +04:00
int node; /* I: the associated node ID */
2013-01-24 23:01:33 +04:00
int id; /* I: pool ID */
2023-08-08 04:57:22 +03:00
unsigned int flags; /* L: flags */
2012-07-13 01:46:37 +04:00
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements workqueue lockup
detector. It periodically monitors all worker_pools and,
if any pool failed to make forward progress longer than the threshold
duration, triggers warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through the kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
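The stall check described in the commit message above boils down to comparing a pool's last-progress timestamp against a threshold. The following userspace sketch shows that comparison with a wraparound-safe tick test. The sketch_ names, the tick units, and the treatment of a zero threshold as "detector disabled" are assumptions made for illustration and are not stated in the text above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wraparound-safe "a is later than b" for a free-running tick counter. */
static bool sketch_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical stand-in holding only the field the check needs. */
struct sketch_wd_pool {
	unsigned long watchdog_ts;	/* tick of last forward progress */
};

/*
 * A pool counts as stalled once more than thresh_ticks have passed since
 * it last made progress. A threshold of 0 is assumed to disable the check.
 */
static bool sketch_pool_stalled(const struct sketch_wd_pool *pool,
				unsigned long now, unsigned long thresh_ticks)
{
	if (!thresh_ticks)
		return false;
	return sketch_time_after(now, pool->watchdog_ts + thresh_ticks);
}
```

Casting the unsigned difference to a signed value keeps the comparison correct when the tick counter wraps, as long as the two timestamps are less than half the counter range apart.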
2015-12-08 19:28:04 +03:00
unsigned long watchdog_ts; /* L: watchdog timestamp */
2023-03-07 15:53:35 +03:00
bool cpu_stall; /* WD: stalled cpu bound pool */
2015-12-08 19:28:04 +03:00
2021-12-23 15:31:40 +03:00
/*
* The counter is incremented in a process context on the associated CPU
* w/ preemption disabled, and decremented or reset in the same context
* but w/ pool->lock held. The readers grab pool->lock and are
* guaranteed to see if the counter reached zero.
*/
int nr_running;
2021-12-07 10:35:42 +03:00
2012-07-13 01:46:37 +04:00
struct list_head worklist; /* L: list of pending works */
workqueue: reimplement idle worker rebinding
Currently rebind_workers() rebinds idle workers synchronously
before proceeding to requesting busy workers to rebind. This is
necessary because all workers on @worker_pool->idle_list must be bound
before concurrency management local wake-ups from the busy workers
take place.
Unfortunately, the synchronous idle rebinding is quite complicated.
This patch reimplements idle rebinding to simplify the code path.
Rather than trying to make all idle workers bound before rebinding
busy workers, we simply remove all to-be-bound idle workers from the
idle list and let them add themselves back after completing rebinding
(successful or not).
As only workers which finished rebinding can be on the idle worker
list, the idle worker list is guaranteed to have only bound workers
unless CPU went down again and local wake-ups are safe.
After the change, @worker_pool->nr_idle may deviate from the actual
number of idle workers on @worker_pool->idle_list. More specifically,
nr_idle may be non-zero while ->idle_list is empty. All users of
->nr_idle and ->idle_list are audited. The only affected one is
too_many_workers() which is updated to check %false if ->idle_list is
empty regardless of ->nr_idle.
After this patch, rebind_workers() no longer performs the nasty
idle-rebind retries which require temporary release of gcwq->lock, and
both unbinding and rebinding are atomic w.r.t. global_cwq->lock.
worker->idle_rebind and global_cwq->rebind_hold are now unnecessary
and removed along with the definition of struct idle_rebind.
Changed from V1:
1) remove unlikely from too_many_workers(), ->idle_list can be empty
anytime, even before this patch, no reason to use unlikely.
2) fix a small rebasing mistake.
(which is from rebasing the original fixing patch to for-next)
3) add a lot of comments.
4) clear WORKER_REBIND unconditionally in idle_worker_rebind()
tj: Updated comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-09-18 20:59:22 +04:00
2018-03-20 12:24:05 +03:00
int nr_workers; /* L: total number of workers */
int nr_idle; /* L: currently idle workers */
2012-07-13 01:46:37 +04:00
2021-12-23 15:31:38 +03:00
struct list_head idle_list; /* L: list of idle workers */
2012-07-13 01:46:37 +04:00
struct timer_list idle_timer; /* L: worker idle timeout */
2023-01-12 19:14:29 +03:00
struct work_struct idle_cull_work; /* L: worker idle cleanup */
struct timer_list mayday_timer; /* L: SOS timer for workers */
2012-07-13 01:46:37 +04:00
2013-03-14 03:51:36 +04:00
/* a worker is either on busy_hash or idle_list, or the manager */
2013-01-24 23:01:33 +04:00
DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
/* L: hash of busy workers */
2015-03-09 16:22:28 +03:00
struct worker *manager; /* L: purely informational */
2014-05-20 13:46:34 +04:00
struct list_head workers; /* A: attached workers */
2023-01-12 19:14:31 +03:00
struct list_head dying_workers; /* A: workers about to die */
workqueue: async worker destruction
worker destruction includes these parts of code:
adjust pool's stats
remove the worker from idle list
detach the worker from the pool
kthread_stop() to wait for the worker's task exit
free the worker struct
We can find out that there is no essential work to do after
kthread_stop(), which means destroy_worker() doesn't need to wait for
the worker's task exit, so we can remove kthread_stop() and free the
worker struct in the worker exiting path.
However, put_unbound_pool() still needs to sync all the workers'
destruction before destroying the pool; otherwise, the workers may
access the invalid pool when they are exiting.
So we also move the code of "detach the worker" to the exiting
path and let put_unbound_pool() to sync with this code via
detach_completion.
The code of "detach the worker" is wrapped in a new function
"worker_detach_from_pool()" although worker_detach_from_pool() is only
called once (in worker_thread()) after this patch, but we need to wrap
it for these reasons:
1) The code of "detach the worker" is not short enough to unfold them
in worker_thread().
2) the name of "worker_detach_from_pool()" is self-comment, and we add
some comments above the function.
3) it will be shared by rescuer in later patch which allows rescuer
and normal thread use the same attach/detach frameworks.
The worker id is freed when detaching which happens before the worker
is fully dead, but this id of the dying worker may be re-used for a
new worker, so the dying worker's task name is changed to
"worker/dying" to avoid two or several workers having the same name.
Since "detach the worker" is moved out from destroy_worker(),
destroy_worker() doesn't require manager_mutex, so the
"lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is
removed, and destroy_worker() is not protected by manager_mutex in
put_unbound_pool().
tj: Minor description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 13:46:29 +04:00
struct completion * detach_completion ; /* all workers detached */
2013-01-24 23:39:44 +04:00
2014-05-20 13:46:32 +04:00
struct ida worker_ida ; /* worker IDs for task name */
2013-01-24 23:39:44 +04:00
2013-03-12 22:30:00 +04:00
struct workqueue_attrs * attrs ; /* I: worker attributes */
2013-03-26 03:57:17 +04:00
struct hlist_node hash_node ; /* PL: unbound_pool_hash node */
int refcnt ; /* PL: refcnt for unbound pools */
2013-03-12 22:30:00 +04:00
2013-03-12 22:30:03 +04:00
/*
2019-03-13 19:55:47 +03:00
* Destruction of pool is RCU protected to allow dereferences
2013-03-12 22:30:03 +04:00
* from get_work_pool ( ) .
*/
struct rcu_head rcu ;
2021-12-07 10:35:42 +03:00
} ;
2010-06-29 12:07:12 +04:00
2023-05-18 06:02:08 +03:00
/*
* Per - pool_workqueue statistics . These can be monitored using
* tools / workqueue / wq_monitor . py .
*/
enum pool_workqueue_stats {
PWQ_STAT_STARTED , /* work items started execution */
PWQ_STAT_COMPLETED , /* work items completed execution */
2023-05-18 06:02:09 +03:00
PWQ_STAT_CPU_TIME , /* total CPU time consumed */
2023-05-18 06:02:08 +03:00
PWQ_STAT_CPU_INTENSIVE , /* wq_cpu_intensive_thresh_us violations */
2023-05-18 06:02:08 +03:00
PWQ_STAT_CM_WAKEUP , /* concurrency-management worker wakeups */
workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools
each serving one L3 cache domain.
While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.
While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.
This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside, i.e. work items start executing within their affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.
After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:
* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
->__pod_cpumask so that the workers are allowed to run on any CPU that
the associated workqueues allow.
* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
the field to a CPU within the pod.
This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.
There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.
While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.
v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-08 04:57:25 +03:00
PWQ_STAT_REPATRIATED , /* unbound workers brought back into scope */
2023-05-18 06:02:08 +03:00
PWQ_STAT_MAYDAY , /* maydays to rescuer */
PWQ_STAT_RESCUED , /* linked work items executed by rescuer */
PWQ_NR_STATS ,
} ;
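/*
 * Illustrative sketch, not part of workqueue.c: the enum above ends with a
 * sentinel (PWQ_NR_STATS) that sizes a per-pwq counter array indexed by the
 * other enumerators. The userspace example below shows that pattern only.
 * All demo_* names are hypothetical, and the real update sites, locking and
 * units used by the kernel are not reproduced here.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of counters; the last entry only counts the others. */
enum demo_stat {
	DEMO_STAT_STARTED,	/* work items started execution */
	DEMO_STAT_COMPLETED,	/* work items completed execution */
	DEMO_STAT_CPU_TIME,	/* total CPU time consumed (assumed: usecs) */
	DEMO_NR_STATS,		/* sentinel: number of counters */
};

struct demo_pwq {
	uint64_t stats[DEMO_NR_STATS];	/* one slot per enumerator */
};

/* Account the start of one work item. */
void demo_work_started(struct demo_pwq *pwq)
{
	pwq->stats[DEMO_STAT_STARTED]++;
}

/* Account the completion of one work item and the CPU time it used. */
void demo_work_completed(struct demo_pwq *pwq, uint64_t cpu_time_us)
{
	pwq->stats[DEMO_STAT_COMPLETED]++;
	pwq->stats[DEMO_STAT_CPU_TIME] += cpu_time_us;
}

/* Items that have started but not yet completed. */
uint64_t demo_in_flight(const struct demo_pwq *pwq)
{
	return pwq->stats[DEMO_STAT_STARTED] - pwq->stats[DEMO_STAT_COMPLETED];
}
```

/*
 * Adding a counter in this pattern means adding one enumerator before the
 * sentinel; the array grows automatically and existing indices stay valid.
 */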
2005-04-17 02:20:36 +04:00
/*
2024-02-21 08:36:14 +03:00
* The per - pool workqueue . While queued , bits below WORK_PWQ_SHIFT
2013-02-14 07:29:12 +04:00
* of work_struct - > data are used for flags and the remaining high bits
* point to the pwq ; thus , pwqs need to be aligned at two ' s power of the
* number of flag bits .
2005-04-17 02:20:36 +04:00
*/
2013-02-14 07:29:12 +04:00
struct pool_workqueue {
2012-07-13 01:46:37 +04:00
struct worker_pool * pool ; /* I: the associated pool */
2010-06-29 12:07:10 +04:00
struct workqueue_struct * wq ; /* I: the owning workqueue */
2010-06-29 12:07:11 +04:00
int work_color ; /* L: current color */
int flush_color ; /* L: flushing color */
2013-03-12 22:30:04 +04:00
int refcnt ; /* L: reference count */
2010-06-29 12:07:11 +04:00
int nr_in_flight [ WORK_NR_COLORS ] ;
/* L: nr of in_flight works */
2024-02-08 22:12:20 +03:00
bool plugged ; /* L: execution suspended */
2021-08-17 04:32:37 +03:00
/*
* nr_active management and WORK_STRUCT_INACTIVE :
*
* When pwq - > nr_active > = max_active , new work item is queued to
* pwq - > inactive_works instead of pool - > worklist and marked with
* WORK_STRUCT_INACTIVE .
*
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time, i.e. the acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice more
online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
* All work items marked with WORK_STRUCT_INACTIVE do not participate in
* nr_active and all work items in pwq - > inactive_works are marked with
* WORK_STRUCT_INACTIVE . But not all WORK_STRUCT_INACTIVE work items are
* in pwq - > inactive_works . Some of them are ready to run in
* pool - > worklist or worker - > scheduled . Those work items are only struct
* wq_barrier which is used for flush_work ( ) and should not participate
* in nr_active . For non - barrier work item , it is marked with
* WORK_STRUCT_INACTIVE iff it is in pwq - > inactive_works .
2021-08-17 04:32:37 +03:00
*/
2010-06-29 12:07:12 +04:00
int nr_active ; /* L: nr of active works */
2021-08-17 04:32:34 +03:00
struct list_head inactive_works ; /* L: inactive works */
2024-01-29 21:11:25 +03:00
struct list_head pending_node ; /* LN: node on wq_node_nr_active->pending_pwqs */
2013-03-26 03:57:17 +04:00
struct list_head pwqs_node ; /* WR: node on wq->pwqs */
2013-03-14 06:47:40 +04:00
struct list_head mayday_node ; /* MD: node on wq->maydays */
2013-03-12 22:30:04 +04:00
2023-05-18 06:02:08 +03:00
u64 stats [ PWQ_NR_STATS ] ;
2013-03-12 22:30:04 +04:00
/*
2023-08-08 04:57:23 +03:00
* Release of unbound pwq is punted to a kthread_worker . See put_pwq ( )
2023-08-08 04:57:23 +03:00
* and pwq_release_workfn ( ) for details . pool_workqueue itself is also
* RCU protected so that the first pwq can be determined without
2023-08-08 04:57:23 +03:00
* grabbing wq - > mutex .
2013-03-12 22:30:04 +04:00
*/
2023-08-08 04:57:23 +03:00
struct kthread_work release_work ;
2013-03-12 22:30:04 +04:00
struct rcu_head rcu ;
2024-02-21 08:36:14 +03:00
} __aligned ( 1 < < WORK_STRUCT_PWQ_SHIFT ) ;
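/*
 * Illustrative sketch, not part of workqueue.c: the comment above struct
 * pool_workqueue says the low bits of work_struct->data carry flags while
 * the high bits point to the pwq, which is why the struct is aligned to a
 * power of two. The userspace example below shows that packing trick under
 * an assumed flag width of 8 bits. DEMO_FLAG_BITS and all demo_* names are
 * hypothetical; the real shift value and flag layout are defined elsewhere
 * in the kernel headers.
 */

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FLAG_BITS	8	/* assumed number of low flag bits */
#define DEMO_FLAG_MASK	((uintptr_t)(1u << DEMO_FLAG_BITS) - 1)

/*
 * Aligning the struct to 1 << DEMO_FLAG_BITS guarantees that the low
 * DEMO_FLAG_BITS bits of any valid pointer to it are zero.
 */
struct demo_pwq {
	_Alignas(1 << DEMO_FLAG_BITS) int nr_active;
};

/* Combine a pwq pointer and flag bits into one word. */
uintptr_t demo_pack(struct demo_pwq *pwq, unsigned int flags)
{
	uintptr_t addr = (uintptr_t)pwq;

	assert((addr & DEMO_FLAG_MASK) == 0);	/* pointer must be aligned */
	assert((flags & ~DEMO_FLAG_MASK) == 0);	/* flags must fit low bits */
	return addr | flags;
}

/* Recover the pointer by masking off the flag bits. */
struct demo_pwq *demo_unpack_pwq(uintptr_t data)
{
	return (struct demo_pwq *)(data & ~DEMO_FLAG_MASK);
}

/* Recover the flags by keeping only the low bits. */
unsigned int demo_unpack_flags(uintptr_t data)
{
	return (unsigned int)(data & DEMO_FLAG_MASK);
}
```

/*
 * The cost of this scheme is padding: every instance occupies a multiple of
 * the alignment, so the flag width directly bounds the minimum object size.
 */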
2005-04-17 02:20:36 +04:00
2010-06-29 12:07:11 +04:00
/*
* Structure used to wait for workqueue flush .
*/
struct wq_flusher {
2013-03-26 03:57:17 +04:00
struct list_head list ; /* WQ: list of flushers */
int flush_color ; /* WQ: flush color waiting for */
2010-06-29 12:07:11 +04:00
struct completion done ; /* flush completion */
} ;
2013-03-12 22:30:05 +04:00
struct wq_device ;
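/*
 * Illustrative sketch, not part of workqueue.c: the commit message quoted
 * earlier describes splitting an unbound workqueue's max_active across NUMA
 * nodes in proportion to each node's online effective CPUs, with a floor of
 * min_active (the smaller of max_active and 8). The arithmetic below is one
 * way to express that. Rounding up and the exact clamping order are
 * assumptions, and demo_node_max_active() and DEMO_DFL_MIN_ACTIVE are
 * hypothetical names.
 */

```c
#include <assert.h>

#define DEMO_DFL_MIN_ACTIVE	8	/* floor quoted in the commit message */

/*
 * Per-node share of max_active for a node owning node_cpus of the
 * workqueue's total_cpus online effective CPUs.
 */
int demo_node_max_active(int max_active, int node_cpus, int total_cpus)
{
	int min_active = max_active < DEMO_DFL_MIN_ACTIVE ?
			 max_active : DEMO_DFL_MIN_ACTIVE;
	int share;

	if (total_cpus <= 0)
		return min_active;	/* nothing online: fall back to floor */

	/* proportional share, rounded up (assumption) */
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;

	/* clamp to [min_active, max_active] */
	if (share < min_active)
		share = min_active;
	if (share > max_active)
		share = max_active;
	return share;
}
```

/*
 * With max_active = 256 on a 16 + 8 CPU machine this yields 171 and 86.
 * With max_active = 16 and a node holding 1 of 32 CPUs, the floor lifts the
 * share from 1 to 8, which is how the per-node sum can exceed the
 * configured value, as the commit message warns.
 */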
2024-01-29 21:11:24 +03:00
/*
* Unlike in a per - cpu workqueue where max_active limits its concurrency level
* on each CPU , in an unbound workqueue , max_active applies to the whole system .
* As sharing a single nr_active across multiple sockets can be very expensive ,
* the counting and enforcement is per NUMA node .
2024-01-29 21:11:25 +03:00
*
* The following struct is used to enforce per - node max_active . When a pwq wants
* to start executing a work item , it should increment - > nr using
* tryinc_node_nr_active ( ) . If acquisition fails due to - > nr already being over
* - > max , the pwq is queued on - > pending_pwqs . As in - flight work items finish
* and decrement - > nr , node_activate_pending_pwq ( ) activates the pending pwqs in
* round - robin order .
2024-01-29 21:11:24 +03:00
*/
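/*
 * Illustrative sketch, not part of workqueue.c: the comment above describes
 * a pwq incrementing ->nr only while it stays within ->max, and falling
 * back to a pending list otherwise. The userspace example below shows a
 * bounded increment of that kind using C11 atomics and a compare-and-swap
 * loop. It is not the kernel's tryinc_node_nr_active(); the demo_* names
 * are hypothetical and the pending_pwqs queueing and round-robin activation
 * are left out.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_node_nr_active {
	int max;		/* per-node limit */
	atomic_int nr;		/* currently active work items */
};

/*
 * Try to take one slot. Returns true if ->nr was incremented, false if the
 * node is already at its limit (the caller would then queue itself as
 * pending).
 */
bool demo_tryinc(struct demo_node_nr_active *nna)
{
	int old = atomic_load(&nna->nr);

	while (old < nna->max) {
		/* on failure, old is refreshed with the current value */
		if (atomic_compare_exchange_weak(&nna->nr, &old, old + 1))
			return true;
	}
	return false;
}

/* Release one slot when a work item finishes. */
void demo_dec(struct demo_node_nr_active *nna)
{
	atomic_fetch_sub(&nna->nr, 1);
}
```

/*
 * The loop re-checks the limit after every failed exchange, so concurrent
 * callers cannot push ->nr above ->max in this sketch.
 */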
struct wq_node_nr_active {
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- One max_active enforcement decouples from pool boundaires, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice more
online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
int max ; /* per-node max_active */
atomic_t nr ; /* per-node nr_active */
raw_spinlock_t lock ; /* nests inside pool locks */
struct list_head pending_pwqs ; /* LN: pwqs with inactive works */
2024-01-29 21:11:24 +03:00
} ;
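The per-node counter above is consulted before a work item is allowed to become active. A minimal userspace sketch of that "take a slot only while below max" step follows, assuming C11 atomics in place of the kernel's atomic_t. The struct and function names are hypothetical, and the sketch leaves out the lock and the pending_pwqs list that the real structure carries.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-node counter: only max and nr are modeled. */
struct node_nr_active_sketch {
	int max;        /* per-node share of max_active */
	atomic_int nr;  /* number of currently active work items on the node */
};

/*
 * Try to take one active slot. Succeeds only while nr < max. On failure
 * the caller would park the pwq on a pending list (not modeled here).
 */
static bool try_get_active(struct node_nr_active_sketch *nna)
{
	int old = atomic_load(&nna->nr);

	while (old < nna->max) {
		/* on failure, 'old' is refreshed with the current value */
		if (atomic_compare_exchange_weak(&nna->nr, &old, old + 1))
			return true;
	}
	return false;
}

/* Release a slot when a work item finishes. */
static void put_active(struct node_nr_active_sketch *nna)
{
	atomic_fetch_sub(&nna->nr, 1);
}
```

The compare-and-exchange loop never pushes nr above max, even when several CPUs race for the last slot. A plain increment followed by a check could overshoot.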
2005-04-17 02:20:36 +04:00
/*
2013-03-14 03:51:36 +04:00
* The externally visible workqueue . It relays the issued work items to
* the appropriate worker_pool through its pool_workqueues .
2005-04-17 02:20:36 +04:00
*/
struct workqueue_struct {
2013-03-26 03:57:17 +04:00
struct list_head pwqs ; /* WR: all pwqs of this wq */
2015-03-09 16:22:28 +03:00
struct list_head list ; /* PR: list of all workqueues */
2010-06-29 12:07:11 +04:00
2013-03-26 03:57:17 +04:00
struct mutex mutex ; /* protects this wq */
int work_color ; /* WQ: current work color */
int flush_color ; /* WQ: current flush color */
2013-02-14 07:29:12 +04:00
atomic_t nr_pwqs_to_flush ; /* flush in progress */
2013-03-26 03:57:17 +04:00
struct wq_flusher * first_flusher ; /* WQ: first flusher */
struct list_head flusher_queue ; /* WQ: flush waiters */
struct list_head flusher_overflow ; /* WQ: flush overflow list */
2010-06-29 12:07:11 +04:00
2013-03-14 06:47:40 +04:00
struct list_head maydays ; /* MD: pwqs requesting rescue */
2019-09-21 00:09:14 +03:00
struct worker * rescuer ; /* MD: rescue worker */
2010-06-29 12:07:14 +04:00
2013-03-26 03:57:18 +04:00
int nr_drainers ; /* WQ: drain in progress */
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time, i.e. the acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice more
online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
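The proportional split and the min_active floor described in the first two items can be sketched in plain userspace C. This is an illustration under stated assumptions, not the kernel code: the function and helper names are hypothetical, rounding up is an assumption, and giving every node min_active when no effective CPU is online is also an assumption.

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

static int clamp_int(int v, int lo, int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * node_cpus[i]: the workqueue's online effective CPUs on node i.
 * node_max[i]:  receives node i's share of max_active.
 */
static void split_max_active(int max_active, int min_active,
			     const int *node_cpus, int nr_nodes,
			     int *node_max)
{
	int total = 0;
	int i;

	assert(min_active <= max_active);

	for (i = 0; i < nr_nodes; i++)
		total += node_cpus[i];

	for (i = 0; i < nr_nodes; i++) {
		if (total == 0) {
			/* assumption: fall back to the floor when nothing is online */
			node_max[i] = min_active;
			continue;
		}
		/* proportional share, never below min_active or above max_active */
		node_max[i] = clamp_int(DIV_ROUND_UP(max_active * node_cpus[i], total),
					min_active, max_active);
	}
}
```

With max_active of 16, min_active of 8 and nodes holding 30 and 2 effective CPUs, the shares come out as 15 and 8. Their sum is 23, which is the "higher effective max_active than configured" case mentioned above.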
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
/* See alloc_workqueue() function comment for info on min/max_active */
2024-01-29 21:11:24 +03:00
int max_active ; /* WO: max active works */
2024-01-29 21:11:25 +03:00
int min_active ; /* WO: min active works */
2024-01-29 21:11:24 +03:00
int saved_max_active ; /* WQ: saved max_active */
2024-01-29 21:11:25 +03:00
int saved_min_active ; /* WQ: saved min_active */
2013-03-12 22:30:05 +04:00
2015-05-12 15:32:29 +03:00
struct workqueue_attrs * unbound_attrs ; /* PW: only for unbound wqs */
2024-01-29 21:11:24 +03:00
struct pool_workqueue __rcu * dfl_pwq ; /* PW: only for unbound wqs */
2013-04-01 22:23:34 +04:00
2013-03-12 22:30:05 +04:00
# ifdef CONFIG_SYSFS
struct wq_device * wq_dev ; /* I: for sysfs interface */
# endif
2007-10-19 10:39:55 +04:00
# ifdef CONFIG_LOCKDEP
2019-02-15 02:00:54 +03:00
char * lock_name ;
struct lock_class_key key ;
2010-06-29 12:07:10 +04:00
struct lockdep_map lockdep_map ;
2007-10-19 10:39:55 +04:00
# endif
2013-04-01 22:23:34 +04:00
char name [ WQ_NAME_LEN ] ; /* I: workqueue name */
2013-04-01 22:23:35 +04:00
2015-03-09 16:22:28 +03:00
/*
2019-03-13 19:55:47 +03:00
* Destruction of workqueue_struct is RCU protected to allow walking
* the workqueues list without grabbing wq_pool_mutex .
2015-03-09 16:22:28 +03:00
* This is used to dump all workqueues from sysrq .
*/
struct rcu_head rcu ;
2013-04-01 22:23:35 +04:00
/* hot fields used during command issue, aligned to cacheline */
unsigned int flags ____cacheline_aligned ; /* WQ: WQ_* flags */
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
struct pool_workqueue __percpu __rcu * * cpu_pwq ; /* I: per-cpu pwqs */
2024-01-29 21:11:24 +03:00
struct wq_node_nr_active * node_nr_active [ ] ; /* I: per-node nr_active */
2005-04-17 02:20:36 +04:00
} ;
2023-08-08 04:57:24 +03:00
/*
* Each pod type describes how CPUs should be grouped for unbound workqueues .
* See the comment above workqueue_attrs - > affn_scope .
*/
struct wq_pod_type {
int nr_pods ; /* number of pods */
cpumask_var_t * pod_cpus ; /* pod -> cpus */
int * pod_node ; /* pod -> node */
int * cpu_pod ; /* cpu -> pod */
} ;
2024-03-25 20:21:02 +03:00
struct work_offq_data {
u32 pool_id ;
2024-03-25 20:21:03 +03:00
u32 disable ;
2024-03-25 20:21:02 +03:00
u32 flags ;
} ;
2023-08-08 04:57:24 +03:00
static const char * wq_affn_names [ WQ_AFFN_NR_TYPES ] = {
2024-02-21 08:36:13 +03:00
[ WQ_AFFN_DFL ] = "default" ,
[ WQ_AFFN_CPU ] = "cpu" ,
[ WQ_AFFN_SMT ] = "smt" ,
[ WQ_AFFN_CACHE ] = "cache" ,
[ WQ_AFFN_NUMA ] = "numa" ,
[ WQ_AFFN_SYSTEM ] = "system" ,
2023-08-08 04:57:24 +03:00
} ;
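A name table indexed by enum value, like the one above, is typically paired with a reverse lookup that turns a user-supplied string back into a scope. The following is a minimal userspace sketch of such a lookup. The enum, table and function names are hypothetical stand-ins, and the case-insensitive match via POSIX strcasecmp() is an assumption for illustration.

```c
#include <stddef.h>
#include <strings.h>	/* strcasecmp(), POSIX */

/* Hypothetical mirror of the affinity scopes listed in the table above. */
enum affn_scope_sketch {
	AFFN_DFL,
	AFFN_CPU,
	AFFN_SMT,
	AFFN_CACHE,
	AFFN_NUMA,
	AFFN_SYSTEM,
	AFFN_NR_TYPES,
};

/* Designated initializers keep each name tied to its enum value. */
static const char *affn_names_sketch[AFFN_NR_TYPES] = {
	[AFFN_DFL]	= "default",
	[AFFN_CPU]	= "cpu",
	[AFFN_SMT]	= "smt",
	[AFFN_CACHE]	= "cache",
	[AFFN_NUMA]	= "numa",
	[AFFN_SYSTEM]	= "system",
};

/* Return the scope matching @val, or -1 if the name is unknown. */
static int parse_affn_scope_sketch(const char *val)
{
	int i;

	for (i = 0; i < AFFN_NR_TYPES; i++) {
		if (!strcasecmp(val, affn_names_sketch[i]))
			return i;
	}
	return -1;
}
```

Because the table is indexed by the enum, adding a scope only requires one new enum value and one new table entry. The lookup loop does not change.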
2013-04-01 22:23:32 +04:00
2023-05-18 06:02:08 +03:00
/*
* Per - cpu work items which run for longer than the following threshold are
* automatically considered CPU intensive and excluded from concurrency
* management to prevent them from noticeably delaying other per - cpu work items .
2023-07-18 01:50:02 +03:00
* ULONG_MAX indicates that the user hasn ' t overridden it with a boot parameter .
* The actual value is initialized in wq_cpu_intensive_thresh_init ( ) .
2023-05-18 06:02:08 +03:00
*/
2023-07-18 01:50:02 +03:00
static unsigned long wq_cpu_intensive_thresh_us = ULONG_MAX ;
2023-05-18 06:02:08 +03:00
module_param_named ( cpu_intensive_thresh_us , wq_cpu_intensive_thresh_us , ulong , 0644 ) ;
2024-02-22 10:28:08 +03:00
# ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
static unsigned int wq_cpu_intensive_warning_thresh = 4 ;
module_param_named ( cpu_intensive_warning_thresh , wq_cpu_intensive_warning_thresh , uint , 0644 ) ;
# endif
2023-05-18 06:02:08 +03:00
2013-04-08 15:15:40 +04:00
/* see the comment above the definition of WQ_POWER_EFFICIENT */
2015-05-27 04:39:39 +03:00
static bool wq_power_efficient = IS_ENABLED ( CONFIG_WQ_POWER_EFFICIENT_DEFAULT ) ;
2013-04-08 15:15:40 +04:00
module_param_named ( power_efficient , wq_power_efficient , bool , 0444 ) ;
2016-09-16 22:49:34 +03:00
static bool wq_online ; /* can kworkers be created yet? */
2024-02-21 08:36:13 +03:00
static bool wq_topo_initialized __read_mostly = false ;
static struct kmem_cache * pwq_cache ;
static struct wq_pod_type wq_pod_types [ WQ_AFFN_NR_TYPES ] ;
static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE ;
2016-09-16 22:49:32 +03:00
2023-08-08 04:57:23 +03:00
/* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
static struct workqueue_attrs * wq_update_pod_attrs_buf ;
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 22:23:36 +04:00
2013-03-26 03:57:17 +04:00
static DEFINE_MUTEX ( wq_pool_mutex ) ; /* protects pools and workqueues list */
2018-05-18 18:47:13 +03:00
static DEFINE_MUTEX ( wq_pool_attach_mutex ) ; /* protects worker attach/detach */
2020-05-27 22:46:33 +03:00
static DEFINE_RAW_SPINLOCK ( wq_mayday_lock ) ; /* protects wq->maydays list */
2020-05-27 22:46:32 +03:00
/* wait for manager to go away */
static struct rcuwait manager_wait = __RCUWAIT_INITIALIZER ( manager_wait ) ;
2013-03-14 06:47:40 +04:00
2015-03-09 16:22:28 +03:00
static LIST_HEAD ( workqueues ) ; /* PR: list of all workqueues */
2013-03-26 03:57:17 +04:00
static bool workqueue_freezing ; /* PL: have wqs started freezing? */
2013-03-14 06:47:40 +04:00
2023-01-12 19:14:27 +03:00
/* PL&A: allowable cpus for unbound wqs and work items */
2016-02-10 01:59:38 +03:00
static cpumask_var_t wq_unbound_cpumask ;
2023-10-25 21:25:52 +03:00
/* PL: user requested unbound cpumask via sysfs */
static cpumask_var_t wq_requested_unbound_cpumask ;
/* PL: isolated cpumask to be excluded from unbound cpumask */
static cpumask_var_t wq_isolated_cpumask ;
2023-06-29 06:50:50 +03:00
/* to further constrain wq_unbound_cpumask by cmdline parameter */
static struct cpumask wq_cmdline_cpumask __initdata ;
2016-02-10 01:59:38 +03:00
/* CPU where unbound work was last round robin scheduled from this CPU */
static DEFINE_PER_CPU ( int , wq_rr_cpu_last ) ;
2015-04-27 12:58:39 +03:00
2016-02-10 01:59:38 +03:00
/*
* Local execution of unbound work items is no longer guaranteed . The
* following always forces round - robin CPU selection on unbound work items
* to uncover usages which depend on it .
*/
# ifdef CONFIG_DEBUG_WQ_FORCE_RR_CPU
static bool wq_debug_force_rr_cpu = true ;
# else
static bool wq_debug_force_rr_cpu = false ;
# endif
module_param_named ( debug_force_rr_cpu , wq_debug_force_rr_cpu , bool , 0644 ) ;
2024-02-14 21:33:55 +03:00
/* to raise softirq for the BH worker pools on other CPUs */
static DEFINE_PER_CPU_SHARED_ALIGNED ( struct irq_work [ NR_STD_WORKER_POOLS ] ,
bh_pool_irq_works ) ;
2024-02-05 00:28:06 +03:00
/* the BH worker pools */
static DEFINE_PER_CPU_SHARED_ALIGNED ( struct worker_pool [ NR_STD_WORKER_POOLS ] ,
bh_worker_pools ) ;
2013-03-14 06:47:40 +04:00
/* the per-cpu worker pools */
2024-02-05 00:28:06 +03:00
static DEFINE_PER_CPU_SHARED_ALIGNED ( struct worker_pool [ NR_STD_WORKER_POOLS ] ,
cpu_worker_pools ) ;
2013-03-14 06:47:40 +04:00
2013-03-26 03:57:17 +04:00
static DEFINE_IDR ( worker_pool_idr ) ; /* PR: idr of all pools */
2013-03-14 06:47:40 +04:00
2013-03-26 03:57:17 +04:00
/* PL: hash of all unbound pools keyed by pool->attrs */
2013-03-12 22:30:03 +04:00
static DEFINE_HASHTABLE ( unbound_pool_hash , UNBOUND_POOL_HASH_ORDER ) ;
2013-03-14 03:51:36 +04:00
/* I: attributes used when instantiating standard unbound pools on demand */
2013-03-12 22:30:03 +04:00
static struct workqueue_attrs * unbound_std_wq_attrs [ NR_STD_WORKER_POOLS ] ;
2013-09-05 20:30:04 +04:00
/* I: attributes used when instantiating ordered pools on demand */
static struct workqueue_attrs * ordered_wq_attrs [ NR_STD_WORKER_POOLS ] ;
2023-08-08 04:57:23 +03:00
/*
* I : kthread_worker to release pwq ' s . pwq release needs to be bounced to a
* process context while holding a pool lock . Bounce to a dedicated kthread
* worker to avoid A - A deadlocks .
*/
2023-10-11 19:55:00 +03:00
static struct kthread_worker * pwq_release_worker __ro_after_init ;
2023-08-08 04:57:23 +03:00
2023-10-11 19:55:00 +03:00
struct workqueue_struct * system_wq __ro_after_init ;
2013-05-07 01:44:55 +04:00
EXPORT_SYMBOL ( system_wq ) ;
2023-10-11 19:55:00 +03:00
struct workqueue_struct * system_highpri_wq __ro_after_init ;
2012-08-15 18:25:39 +04:00
EXPORT_SYMBOL_GPL ( system_highpri_wq ) ;
2023-10-11 19:55:00 +03:00
struct workqueue_struct * system_long_wq __ro_after_init ;
2010-06-29 12:07:14 +04:00
EXPORT_SYMBOL_GPL ( system_long_wq ) ;
2023-10-11 19:55:00 +03:00
struct workqueue_struct * system_unbound_wq __ro_after_init ;
2010-07-02 12:03:51 +04:00
EXPORT_SYMBOL_GPL ( system_unbound_wq ) ;
2023-10-11 19:55:00 +03:00
struct workqueue_struct * system_freezable_wq __ro_after_init ;
2011-02-21 11:52:50 +03:00
EXPORT_SYMBOL_GPL ( system_freezable_wq ) ;
2023-10-11 19:55:00 +03:00
struct workqueue_struct * system_power_efficient_wq __ro_after_init ;
2013-04-24 15:42:54 +04:00
EXPORT_SYMBOL_GPL ( system_power_efficient_wq ) ;
2023-10-11 19:55:00 +03:00
struct workqueue_struct * system_freezable_power_efficient_wq __ro_after_init ;
2013-04-24 15:42:54 +04:00
EXPORT_SYMBOL_GPL ( system_freezable_power_efficient_wq ) ;
2024-02-05 00:28:06 +03:00
struct workqueue_struct * system_bh_wq ;
EXPORT_SYMBOL_GPL ( system_bh_wq ) ;
struct workqueue_struct * system_bh_highpri_wq ;
EXPORT_SYMBOL_GPL ( system_bh_highpri_wq ) ;
2010-06-29 12:07:14 +04:00
2013-03-14 06:47:40 +04:00
static int worker_thread ( void * __worker ) ;
2015-04-02 14:14:39 +03:00
static void workqueue_sysfs_unregister ( struct workqueue_struct * wq ) ;
2019-09-23 21:08:58 +03:00
static void show_pwq ( struct pool_workqueue * pwq ) ;
2021-10-20 06:09:00 +03:00
static void show_one_worker_pool ( struct worker_pool * pool ) ;
2013-03-14 06:47:40 +04:00
2010-10-05 12:41:14 +04:00
# define CREATE_TRACE_POINTS
# include <trace/events/workqueue.h>
2013-03-26 03:57:17 +04:00
# define assert_rcu_or_pool_mutex() \
2024-02-21 08:36:13 +03:00
RCU_LOCKDEP_WARN ( ! rcu_read_lock_any_held ( ) & & \
2015-06-19 01:50:02 +03:00
! lockdep_is_held ( & wq_pool_mutex ) , \
2019-03-13 19:55:47 +03:00
" RCU or wq_pool_mutex should be held " )
2013-03-14 06:47:40 +04:00
2015-05-12 15:32:29 +03:00
# define assert_rcu_or_wq_mutex_or_pool_mutex(wq) \
2024-02-21 08:36:13 +03:00
RCU_LOCKDEP_WARN ( ! rcu_read_lock_any_held ( ) & & \
2015-06-19 01:50:02 +03:00
! lockdep_is_held ( & wq - > mutex ) & & \
! lockdep_is_held ( & wq_pool_mutex ) , \
2019-03-13 19:55:47 +03:00
" RCU, wq->mutex or wq_pool_mutex should be held " )
2015-05-12 15:32:29 +03:00
2024-02-05 00:28:06 +03:00
# define for_each_bh_worker_pool(pool, cpu) \
for ( ( pool ) = & per_cpu ( bh_worker_pools , cpu ) [ 0 ] ; \
( pool ) < & per_cpu ( bh_worker_pools , cpu ) [ NR_STD_WORKER_POOLS ] ; \
( pool ) + + )
2013-03-12 22:30:03 +04:00
# define for_each_cpu_worker_pool(pool, cpu) \
for ( ( pool ) = & per_cpu ( cpu_worker_pools , cpu ) [ 0 ] ; \
( pool ) < & per_cpu ( cpu_worker_pools , cpu ) [ NR_STD_WORKER_POOLS ] ; \
2013-03-12 22:30:03 +04:00
( pool ) + + )
2012-07-14 09:16:44 +04:00
2013-03-12 22:29:58 +04:00
/**
* for_each_pool - iterate through all worker_pools in the system
* @ pool : iteration cursor
2013-03-14 03:51:36 +04:00
* @ pi : integer used for iteration
2013-03-12 22:30:00 +04:00
*
2019-03-13 19:55:47 +03:00
* This must be called either with wq_pool_mutex held or RCU read
2013-03-26 03:57:17 +04:00
* locked . If the pool needs to be used beyond the locking in effect , the
* caller is responsible for guaranteeing that the pool stays online .
2013-03-12 22:30:00 +04:00
*
* The if / else clause exists only for the lockdep assertion and can be
* ignored .
2013-03-12 22:29:58 +04:00
*/
2013-03-14 03:51:36 +04:00
# define for_each_pool(pool, pi) \
idr_for_each_entry ( & worker_pool_idr , pool , pi ) \
2013-03-26 03:57:17 +04:00
if ( ( { assert_rcu_or_pool_mutex ( ) ; false ; } ) ) { } \
2013-03-12 22:30:00 +04:00
else
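The if/else clause described in the for_each_pool() comment can be tried in isolation. The following is a minimal user-space sketch, assuming GCC or Clang (statement expressions are a GNU extension). The names for_each_item(), assert_lock_held(), lock_held and checks_run are hypothetical stand-ins and not kernel APIs; the point is only that the assertion runs on every iteration while the loop body still attaches to the macro through the trailing else.

```c
#include <assert.h>

static int checks_run;		/* counts how often the assertion ran */
static int lock_held = 1;	/* stand-in for "the required lock is held" */

/* Hypothetical replacement for a lockdep assertion. */
static void assert_lock_held(void)
{
	assert(lock_held);
	checks_run++;
}

/*
 * The statement expression always evaluates to 0, so the empty "if"
 * branch is never taken and the caller's loop body becomes the "else"
 * branch.  The assertion is evaluated once per iteration.
 */
#define for_each_item(i, n)					\
	for ((i) = 0; (i) < (n); (i)++)				\
		if (({ assert_lock_held(); 0; })) { }		\
		else

/* Sums 0 .. n-1 using the macro, to show the body executes normally. */
static int sum_first(int n)
{
	int i, sum = 0;

	for_each_item(i, n)
		sum += i;
	return sum;
}
```

Because the macro ends in else, a caller can follow it with either a single statement or a braced block, exactly as with a plain for loop.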
2013-03-12 22:29:58 +04:00
2013-03-20 00:45:21 +04:00
/**
* for_each_pool_worker - iterate through all workers of a worker_pool
* @ worker : iteration cursor
* @ pool : worker_pool to iterate workers of
*
2018-05-18 18:47:13 +03:00
* This must be called with wq_pool_attach_mutex .
2013-03-20 00:45:21 +04:00
*
* The if / else clause exists only for the lockdep assertion and can be
* ignored .
*/
2014-05-20 13:46:31 +04:00
# define for_each_pool_worker(worker, pool) \
list_for_each_entry ( ( worker ) , & ( pool ) - > workers , node ) \
2018-05-18 18:47:13 +03:00
if ( ( { lockdep_assert_held ( & wq_pool_attach_mutex ) ; false ; } ) ) { } \
2013-03-20 00:45:21 +04:00
else
2013-03-12 22:29:58 +04:00
/**
* for_each_pwq - iterate through all pool_workqueues of the specified workqueue
* @ pwq : iteration cursor
* @ wq : the target workqueue
2013-03-12 22:30:00 +04:00
*
2019-03-13 19:55:47 +03:00
* This must be called either with wq - > mutex held or RCU read locked .
2013-03-14 06:47:40 +04:00
* If the pwq needs to be used beyond the locking in effect , the caller is
* responsible for guaranteeing that the pwq stays online .
2013-03-12 22:30:00 +04:00
*
* The if / else clause exists only for the lockdep assertion and can be
* ignored .
2013-03-12 22:29:58 +04:00
*/
# define for_each_pwq(pwq, wq) \
2019-11-15 21:01:25 +03:00
list_for_each_entry_rcu ( ( pwq ) , & ( wq ) - > pwqs , pwqs_node , \
2019-08-15 17:18:42 +03:00
lockdep_is_held ( & ( wq - > mutex ) ) )
2010-07-02 12:03:51 +04:00
2009-11-15 19:09:48 +03:00
# ifdef CONFIG_DEBUG_OBJECTS_WORK
2020-08-15 03:40:27 +03:00
static const struct debug_obj_descr work_debug_descr ;
2009-11-15 19:09:48 +03:00
2011-03-07 11:58:33 +03:00
static void * work_debug_hint ( void * addr )
{
return ( ( struct work_struct * ) addr ) - > func ;
}
2016-05-20 03:09:41 +03:00
static bool work_is_static_object ( void * addr )
{
struct work_struct * work = addr ;
return test_bit ( WORK_STRUCT_STATIC_BIT , work_data_bits ( work ) ) ;
}
2009-11-15 19:09:48 +03:00
/*
* fixup_init is called when :
* - an active object is initialized
*/
2016-05-20 03:09:26 +03:00
static bool work_fixup_init ( void * addr , enum debug_obj_state state )
2009-11-15 19:09:48 +03:00
{
struct work_struct * work = addr ;
switch ( state ) {
case ODEBUG_STATE_ACTIVE :
cancel_work_sync ( work ) ;
debug_object_init ( work , & work_debug_descr ) ;
2016-05-20 03:09:26 +03:00
return true ;
2009-11-15 19:09:48 +03:00
default :
2016-05-20 03:09:26 +03:00
return false ;
2009-11-15 19:09:48 +03:00
}
}
/*
* fixup_free is called when :
* - an active object is freed
*/
2016-05-20 03:09:26 +03:00
static bool work_fixup_free ( void * addr , enum debug_obj_state state )
2009-11-15 19:09:48 +03:00
{
struct work_struct * work = addr ;
switch ( state ) {
case ODEBUG_STATE_ACTIVE :
cancel_work_sync ( work ) ;
debug_object_free ( work , & work_debug_descr ) ;
2016-05-20 03:09:26 +03:00
return true ;
2009-11-15 19:09:48 +03:00
default :
2016-05-20 03:09:26 +03:00
return false ;
2009-11-15 19:09:48 +03:00
}
}
2020-08-15 03:40:27 +03:00
static const struct debug_obj_descr work_debug_descr = {
2009-11-15 19:09:48 +03:00
. name = " work_struct " ,
2011-03-07 11:58:33 +03:00
. debug_hint = work_debug_hint ,
2016-05-20 03:09:41 +03:00
. is_static_object = work_is_static_object ,
2009-11-15 19:09:48 +03:00
. fixup_init = work_fixup_init ,
. fixup_free = work_fixup_free ,
} ;
static inline void debug_work_activate ( struct work_struct * work )
{
debug_object_activate ( work , & work_debug_descr ) ;
}
static inline void debug_work_deactivate ( struct work_struct * work )
{
debug_object_deactivate ( work , & work_debug_descr ) ;
}
void __init_work ( struct work_struct * work , int onstack )
{
if ( onstack )
debug_object_init_on_stack ( work , & work_debug_descr ) ;
else
debug_object_init ( work , & work_debug_descr ) ;
}
EXPORT_SYMBOL_GPL ( __init_work ) ;
void destroy_work_on_stack ( struct work_struct * work )
{
debug_object_free ( work , & work_debug_descr ) ;
}
EXPORT_SYMBOL_GPL ( destroy_work_on_stack ) ;
2014-03-23 18:20:44 +04:00
void destroy_delayed_work_on_stack ( struct delayed_work * work )
{
destroy_timer_on_stack ( & work - > timer ) ;
debug_object_free ( & work - > work , & work_debug_descr ) ;
}
EXPORT_SYMBOL_GPL ( destroy_delayed_work_on_stack ) ;
2009-11-15 19:09:48 +03:00
# else
static inline void debug_work_activate ( struct work_struct * work ) { }
static inline void debug_work_deactivate ( struct work_struct * work ) { }
# endif
2013-09-10 05:52:35 +04:00
/**
2021-07-31 03:01:29 +03:00
* worker_pool_assign_id - allocate ID and assign it to @ pool
2013-09-10 05:52:35 +04:00
* @ pool : the pool pointer of interest
*
* Returns 0 if ID in [ 0 , WORK_OFFQ_POOL_NONE ) is allocated and assigned
* successfully , - errno on failure .
*/
2013-01-24 23:01:33 +04:00
static int worker_pool_assign_id ( struct worker_pool * pool )
{
int ret ;
2013-03-26 03:57:17 +04:00
lockdep_assert_held ( & wq_pool_mutex ) ;
2013-03-14 06:47:40 +04:00
2013-09-10 05:52:35 +04:00
ret = idr_alloc ( & worker_pool_idr , pool , 0 , WORK_OFFQ_POOL_NONE ,
GFP_KERNEL ) ;
Linux 3.9-rc5
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJRWLTrAAoJEHm+PkMAQRiGe8oH/iMy48mecVWvxVZn74Tx3Cef
xmW/PnAIj28EhSPqK49N/Ow6AfQToFKf7AP0ge20KAf5teTq95AY+tH74DAANt8F
BjKXXTZiR5xwBvRkq7CR5wDcCvEcBAAz8fgTEd6SEDB2d2VXFf5eKdKUqt1avTCh
Z6Hup5kuwX+ddtwY2DCBXtp2n6fL0Rm5yLzY1A3OOBye1E7VyLTF7M5BR603Q44P
4kRLxn8+R7jy3hTuZIhAeoS8TKUoBwVk7DmKxEzrhTHZVOmvwE9lEHybRnIyOpd/
k1JnbRbiPsLsCVFOn10SQkGDAIk00lro3tuWP2C1ljERiD/OOh5Ui9nXYAhMkbI=
=q15K
-----END PGP SIGNATURE-----
Merge tag 'v3.9-rc5' into wq/for-3.10
Writeback conversion to workqueue will be based on top of wq/for-3.10
branch to take advantage of custom attrs and NUMA support for unbound
workqueues. Mainline currently contains two commits which result in
non-trivial merge conflicts with wq/for-3.10 and because
block/for-3.10/core is based on v3.9-rc3 which contains one of the
conflicting commits, we need a pre-merge-window merge anyway. Let's
pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer
from workqueue merge conflicts.
The two conflicts and their resolutions:
* e68035fb65 ("workqueue: convert to idr_alloc()") in mainline changes
worker_pool_assign_id() to use idr_alloc() instead of the old idr
interface. worker_pool_assign_id() goes through multiple locking
changes in wq/for-3.10 causing the following conflict.
static int worker_pool_assign_id(struct worker_pool *pool)
{
int ret;
<<<<<<< HEAD
lockdep_assert_held(&wq_pool_mutex);
do {
if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
return -ENOMEM;
ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
} while (ret == -EAGAIN);
=======
mutex_lock(&worker_pool_idr_mutex);
ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
if (ret >= 0)
pool->id = ret;
mutex_unlock(&worker_pool_idr_mutex);
>>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
return ret < 0 ? ret : 0;
}
We want locking from the former and idr_alloc() usage from the
latter, which can be combined to the following.
static int worker_pool_assign_id(struct worker_pool *pool)
{
int ret;
lockdep_assert_held(&wq_pool_mutex);
ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
if (ret >= 0) {
pool->id = ret;
return 0;
}
return ret;
}
* eb2834285c ("workqueue: fix possible pool stall bug in
wq_unbind_fn()") updated wq_unbind_fn() such that it has a single
larger for_each_std_worker_pool() loop instead of two separate loops
with a schedule() call in between. wq/for-3.10 renamed
pool->assoc_mutex to pool->manager_mutex causing the following
conflict (earlier function body and comments omitted for brevity).
static void wq_unbind_fn(struct work_struct *work)
{
...
spin_unlock_irq(&pool->lock);
<<<<<<< HEAD
mutex_unlock(&pool->manager_mutex);
}
=======
mutex_unlock(&pool->assoc_mutex);
>>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
schedule();
<<<<<<< HEAD
for_each_cpu_worker_pool(pool, cpu)
=======
>>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
atomic_set(&pool->nr_running, 0);
spin_lock_irq(&pool->lock);
wake_up_worker(pool);
spin_unlock_irq(&pool->lock);
}
}
The resolution is mostly trivial. We want the control flow of the
latter with the rename of the former.
static void wq_unbind_fn(struct work_struct *work)
{
...
spin_unlock_irq(&pool->lock);
mutex_unlock(&pool->manager_mutex);
schedule();
atomic_set(&pool->nr_running, 0);
spin_lock_irq(&pool->lock);
wake_up_worker(pool);
spin_unlock_irq(&pool->lock);
}
}
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-02 04:08:13 +04:00
if ( ret > = 0 ) {
2013-03-14 01:59:38 +04:00
pool - > id = ret ;
2013-04-02 04:08:13 +04:00
return 0 ;
}
2013-03-12 22:30:00 +04:00
return ret ;
2013-01-24 23:01:33 +04:00
}
2024-01-29 21:11:24 +03:00
static struct pool_workqueue __rcu * *
unbound_pwq_slot ( struct workqueue_struct * wq , int cpu )
{
if ( cpu > = 0 )
return per_cpu_ptr ( wq - > cpu_pwq , cpu ) ;
else
return & wq - > dfl_pwq ;
}
/* @cpu < 0 for dfl_pwq */
static struct pool_workqueue * unbound_pwq ( struct workqueue_struct * wq , int cpu )
{
return rcu_dereference_check ( * unbound_pwq_slot ( wq , cpu ) ,
lockdep_is_held ( & wq_pool_mutex ) | |
lockdep_is_held ( & wq - > mutex ) ) ;
}
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with an unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. i.e. the acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice as many
online effective CPUs will get twice the portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
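The per-node split described in the list above can be sketched as a small calculation. This is a user-space illustration only: node_max_active() and DEMO_DFL_MIN_ACTIVE are hypothetical names, the value 8 is the one the message gives for WQ_DFL_MIN_ACTIVE, and rounding the proportional share up is an assumption rather than something the message states.

```c
#include <assert.h>

#define DEMO_DFL_MIN_ACTIVE 8	/* value quoted for WQ_DFL_MIN_ACTIVE */

/*
 * Hypothetical helper: one node's share of a workqueue's max_active.
 * @node_cpus and @total_cpus count the workqueue's online effective
 * CPUs on the node and in the whole system respectively.
 */
static int node_max_active(int max_active, int node_cpus, int total_cpus)
{
	int min_active, share;

	assert(max_active > 0 && total_cpus > 0);
	assert(node_cpus >= 0 && node_cpus <= total_cpus);

	/* min_active is the smaller of max_active and the default floor */
	min_active = max_active < DEMO_DFL_MIN_ACTIVE ?
		     max_active : DEMO_DFL_MIN_ACTIVE;

	/* proportional share, rounded up (the rounding is an assumption) */
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;

	/* never below min_active, never above the configured max_active */
	if (share < min_active)
		share = min_active;
	if (share > max_active)
		share = max_active;
	return share;
}
```

With max_active of 16 and nodes of 8 and 4 CPUs, this sketch yields 11 and 8, which add up to 19. That is the "higher effective max_active than configured" case the message warns about: the min_active floor lifts the smaller node above its proportional share.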
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
/**
* unbound_effective_cpumask - effective cpumask of an unbound workqueue
* @ wq : workqueue of interest
*
* @ wq - > unbound_attrs - > cpumask contains the cpumask requested by the user which
* is masked with wq_unbound_cpumask to determine the effective cpumask . The
* default pwq is always mapped to the pool with the current effective cpumask .
*/
static struct cpumask * unbound_effective_cpumask ( struct workqueue_struct * wq )
{
return unbound_pwq ( wq , - 1 ) - > pool - > attrs - > __pod_cpumask ;
}
2010-06-29 12:07:11 +04:00
static unsigned int work_color_to_flags ( int color )
{
return color < < WORK_STRUCT_COLOR_SHIFT ;
}
2021-08-17 04:32:35 +03:00
static int get_work_color ( unsigned long work_data )
2010-06-29 12:07:11 +04:00
{
2021-08-17 04:32:35 +03:00
return ( work_data > > WORK_STRUCT_COLOR_SHIFT ) &
2010-06-29 12:07:11 +04:00
( ( 1 < < WORK_STRUCT_COLOR_BITS ) - 1 ) ;
}
static int work_next_color ( int color )
{
return ( color + 1 ) % WORK_NR_COLORS ;
}
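The three color helpers above form a shift-and-mask round trip plus a wrap-around increment. The sketch below reproduces that in user space; the DEMO_* constants are placeholder values chosen for illustration and are not the real WORK_STRUCT_COLOR_SHIFT, WORK_STRUCT_COLOR_BITS or WORK_NR_COLORS definitions.

```c
#include <assert.h>

/* Placeholder layout, assumed for this sketch only. */
#define DEMO_COLOR_SHIFT	4
#define DEMO_COLOR_BITS		4
#define DEMO_NR_COLORS		(1 << DEMO_COLOR_BITS)

/* Place a color into its bit field within the data word. */
static unsigned long demo_color_to_flags(int color)
{
	assert(color >= 0 && color < DEMO_NR_COLORS);
	return (unsigned long)color << DEMO_COLOR_SHIFT;
}

/* Extract the color field, ignoring all other bits of the data word. */
static int demo_get_color(unsigned long data)
{
	return (int)((data >> DEMO_COLOR_SHIFT) &
		     ((1UL << DEMO_COLOR_BITS) - 1));
}

/* Advance to the next color, wrapping back to 0 after the last one. */
static int demo_next_color(int color)
{
	return (color + 1) % DEMO_NR_COLORS;
}
```

Bits outside the color field, such as flag bits below the shift, do not disturb the decoded value because the mask is applied after the shift.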
2005-04-17 02:20:36 +04:00
2024-03-25 20:21:03 +03:00
static unsigned long pool_offq_flags ( struct worker_pool * pool )
{
return ( pool - > flags & POOL_BH ) ? WORK_OFFQ_BH : 0 ;
}
2007-05-24 00:57:57 +04:00
/*
2013-02-14 07:29:12 +04:00
* While queued , % WORK_STRUCT_PWQ is set and non flag bits of a work ' s data
* contain the pointer to the queued pwq . Once execution starts , the flag
2013-01-24 23:01:33 +04:00
* is cleared and the high bits contain OFFQ flags and pool ID .
2010-06-29 12:07:13 +04:00
*
2024-02-21 08:36:14 +03:00
* set_work_pwq ( ) , set_work_pool_and_clear_pending ( ) and mark_work_canceling ( )
* can be used to set the pwq , pool or clear work - > data . These functions should
* only be called while the work is owned - ie . while the PENDING bit is set .
2010-06-29 12:07:13 +04:00
*
2013-02-14 07:29:12 +04:00
* get_work_pool ( ) and get_work_pwq ( ) can be used to obtain the pool or pwq
2013-01-24 23:01:33 +04:00
* corresponding to a work . Pool is available once the work has been
2013-02-14 07:29:12 +04:00
* queued anywhere after initialization until it is sync canceled . pwq is
2013-01-24 23:01:33 +04:00
* available only while the work item is queued .
2007-05-24 00:57:57 +04:00
*/
2024-02-21 08:36:15 +03:00
static inline void set_work_data ( struct work_struct * work , unsigned long data )
2006-11-22 17:54:49 +03:00
{
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( ! work_pending ( work ) ) ;
2024-02-21 08:36:15 +03:00
atomic_long_set ( & work - > data , data | work_static ( work ) ) ;
2010-06-29 12:07:13 +04:00
}
2006-11-22 17:54:49 +03:00
2013-02-14 07:29:12 +04:00
static void set_work_pwq ( struct work_struct * work , struct pool_workqueue * pwq ,
2024-02-21 08:36:15 +03:00
unsigned long flags )
2010-06-29 12:07:13 +04:00
{
2024-02-21 08:36:15 +03:00
set_work_data ( work , ( unsigned long ) pwq | WORK_STRUCT_PENDING |
WORK_STRUCT_PWQ | flags ) ;
2006-11-22 17:54:49 +03:00
}
2013-02-07 06:04:53 +04:00
static void set_work_pool_and_keep_pending ( struct work_struct * work ,
2024-02-21 08:36:15 +03:00
int pool_id , unsigned long flags )
2013-02-07 06:04:53 +04:00
{
2024-02-21 08:36:15 +03:00
set_work_data ( work , ( ( unsigned long ) pool_id < < WORK_OFFQ_POOL_SHIFT ) |
WORK_STRUCT_PENDING | flags ) ;
2013-02-07 06:04:53 +04:00
}
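The comment above describes one word that holds either a pwq pointer plus flag bits (while queued) or a pool ID plus flag bits (otherwise). The following user-space sketch shows that kind of packing under assumed values: the DEMO_* bit positions, struct demo_pwq and all demo_* helpers are hypothetical, and the pointer variant only works because the structure is aligned so that its low bits are free for flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit layout for this sketch; not the kernel's actual values. */
#define DEMO_PENDING	((uintptr_t)1 << 0)
#define DEMO_PWQ	((uintptr_t)1 << 1)
#define DEMO_FLAG_BITS	8
#define DEMO_FLAG_MASK	(((uintptr_t)1 << DEMO_FLAG_BITS) - 1)

/* Aligned so that the low DEMO_FLAG_BITS of any pointer to it are zero. */
struct demo_pwq {
	_Alignas(1 << DEMO_FLAG_BITS) int id;
};

/* Queued state: pointer in the high bits, PENDING and PWQ flags set. */
static uintptr_t demo_pack_pwq(struct demo_pwq *pwq, uintptr_t flags)
{
	assert(((uintptr_t)pwq & DEMO_FLAG_MASK) == 0);
	return (uintptr_t)pwq | DEMO_PENDING | DEMO_PWQ | flags;
}

/* Off-queue state: pool ID in the high bits, PWQ flag clear. */
static uintptr_t demo_pack_pool(int pool_id, uintptr_t flags)
{
	return ((uintptr_t)pool_id << DEMO_FLAG_BITS) | flags;
}

/* Returns the pwq only while the PWQ flag says the word holds a pointer. */
static struct demo_pwq *demo_get_pwq(uintptr_t data)
{
	if (data & DEMO_PWQ)
		return (struct demo_pwq *)(data & ~DEMO_FLAG_MASK);
	return NULL;
}

/* Valid only when the word holds a pool ID rather than a pointer. */
static int demo_get_pool_id(uintptr_t data)
{
	assert(!(data & DEMO_PWQ));
	return (int)(data >> DEMO_FLAG_BITS);
}
```

The PWQ flag acts as the discriminator: readers check it first and then interpret the high bits as either a pointer or an ID.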
2013-01-24 23:01:33 +04:00
static void set_work_pool_and_clear_pending ( struct work_struct * work ,
2024-02-21 08:36:15 +03:00
int pool_id , unsigned long flags )
2010-06-29 12:07:13 +04:00
{
2012-08-14 04:08:19 +04:00
/*
* The following wmb is paired with the implied mb in
* test_and_set_bit ( PENDING ) and ensures all updates to @ work made
* here are visible to and precede any updates by the next PENDING
* owner .
*/
smp_wmb ( ) ;
2024-02-21 08:36:15 +03:00
set_work_data ( work , ( ( unsigned long ) pool_id < < WORK_OFFQ_POOL_SHIFT ) |
flags ) ;
workqueue: fix ghost PENDING flag while doing MQ IO
The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:
[ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[ 601.347574] Tainted: G O 4.4.5-1-storage+ #6
[ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000
[ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[ 601.350965] Call Trace:
[ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80
[ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50
The underlying device is a null_blk, with default parameters:
queue_mode = MQ
submit_queues = 1
Verification that nullb0 has something inflight:
root@pserver8:~# cat /sys/block/nullb0/inflight
0 1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
ffff8838038e2400
...
During debugging it became clear that the stalled request is always
inserted in the rq_list from the following path:
save_stack_trace_tsk + 34
blk_mq_insert_requests + 231
blk_mq_flush_plug_list + 281
blk_flush_plug_list + 199
wait_on_page_bit + 192
__filemap_fdatawait_range + 228
filemap_fdatawait_range + 20
filemap_write_and_wait_range + 63
blkdev_fsync + 27
vfs_fsync_range + 73
blkdev_write_iter + 202
__vfs_write + 170
vfs_write + 169
kernel_write + 56
So blk_flush_plug_list() was called with from_schedule == true.
If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().
That means that we race with another CPU, which is about to execute the
__blk_mq_run_hw_queue() work.
Further debugging shows the following traces from different CPUs:
CPU#0                                   CPU#1
----------------------------------      -------------------------------
request A inserted
STORE hctx->ctx_map[0] bit marked
kblockd_schedule...() returns 1
<schedule to kblockd workqueue>
                                        request B inserted
                                        STORE hctx->ctx_map[1] bit marked
                                        kblockd_schedule...() returns 0
*** WORK PENDING bit is cleared ***
flush_busy_ctxs() is executed, but
bit 1, set by CPU#1, is not observed
As a result, request B stays pending forever.
This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.
The proper fix is an explicit full barrier <mfence>, which guarantees
that the clear of the PENDING bit is executed before any speculative
LOADs or STOREs inside the actual work function.
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
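The ordering requirement described in this commit message can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: `pending`, `event_indicated`, `indicate_and_queue()` and `run_work()` are hypothetical names, and `atomic_thread_fence(memory_order_seq_cst)` stands in for the kernel's `smp_mb()`. It is not the workqueue implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the PENDING bit and the event flag. */
static atomic_bool pending;
static atomic_bool event_indicated;

/* Producer: publish the event, then try to claim PENDING. */
static bool indicate_and_queue(void)
{
	atomic_store(&event_indicated, true);
	/* Like test_and_set_bit(): true only if PENDING was clear. */
	return !atomic_exchange(&pending, true);
}

/* Worker: clear PENDING, issue a full fence, then look at the event. */
static bool run_work(void)
{
	/* A work item only runs after somebody queued it. */
	assert(atomic_load(&pending));

	atomic_store_explicit(&pending, false, memory_order_relaxed);
	/*
	 * Userspace analogue of smp_mb(): the load of the event below
	 * must not be satisfied before the clear above is visible.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	/* Consume the event; returns whether one was indicated. */
	return atomic_exchange(&event_indicated, false);
}
```

When `indicate_and_queue()` returns false, the caller relies on the execution that is already queued to observe its event. That is only safe if the worker's load cannot be satisfied before its clear of `pending` becomes visible, which is what the fence is for.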
2016-04-26 14:15:35 +03:00
/*
* The following mb guarantees that previous clear of a PENDING bit
* will not be reordered with any speculative LOADS or STORES from
* work - > current_func , which is executed afterwards . This possible
2019-02-19 18:53:27 +03:00
* reordering can lead to a missed execution on attempt to queue
2016-04-26 14:15:35 +03:00
* the same @ work . E . g . consider this case :
*
* CPU # 0                             CPU # 1
* ----------------------------       --------------------------------
*
* 1  STORE event_indicated
* 2  queue_work_on ( ) {
* 3    test_and_set_bit ( PENDING )
* 4  }                                set_ . . . _and_clear_pending ( ) {
* 5                                     set_work_data ( ) # clear bit
* 6                                     smp_mb ( )
* 7                                   work - > current_func ( ) {
* 8                                     LOAD event_indicated
*                                     }
*
* Without an explicit full barrier speculative LOAD on line 8 can
* be executed before CPU # 0 does STORE on line 1. If that happens ,
* CPU # 0 observes the PENDING bit is still set and new execution of
* a @ work is not queued in a hope , that CPU # 1 will eventually
* finish the queued @ work . Meanwhile CPU # 1 does not see
* event_indicated is set , because speculative LOAD was executed
* before actual STORE .
*/
smp_mb ( ) ;
2010-06-29 12:07:13 +04:00
}
2006-01-08 12:05:12 +03:00
workqueue: clean up WORK_* constant types, clarify masking
Dave Airlie reports that gcc-13.1.1 has started complaining about some
of the workqueue code in 32-bit arm builds:
kernel/workqueue.c: In function ‘get_work_pwq’:
kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
713 | return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
| ^
[ ... a couple of other cases ... ]
and while it's not immediately clear exactly why gcc started complaining
about it now, I suspect some C23-induced enum type handling fixup in
gcc-13 is the cause.
Whatever the reason for starting to complain, the code and data types
are indeed disgusting enough that the complaint is warranted.
The wq code ends up creating various "helper constants" (like that
WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of
confused. The mask needs to be 'unsigned long', not some unspecified
enum type.
To make matters worse, the actual "mask and cast to a pointer" is
repeated a couple of times, and the cast isn't even always done to the
right pointer, but - as the error case above - to a 'void *' with then
the compiler finishing the job.
That's not how we roll in the kernel.
So create the masks using the proper types rather than some ambiguous
enumeration, and use a nice helper that actually does the type
conversion in one well-defined place.
Incidentally, this magically makes clang generate better code. That,
admittedly, is really just a sign of clang having been seriously
confused before, and cleaning up the typing unconfuses the compiler too.
Reported-by: Dave Airlie <airlied@gmail.com>
Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
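The cleanup this message argues for can be sketched as follows: masks built as `unsigned long` constants and one helper that performs the mask-and-cast. The layout and the names `FLAG_BITS`, `FLAG_MASK`, `PWQ_MASK`, `pack_pwq()` and `data_to_pwq()` are hypothetical and chosen only for illustration. The sketch also assumes `unsigned long` is pointer-sized, as it is on the platforms the kernel supports.

```c
#include <assert.h>

struct pool_workqueue {
	int id;
};

/* Hypothetical layout: the low FLAG_BITS bits carry flags. */
#define FLAG_BITS	4
#define FLAG_MASK	((1UL << FLAG_BITS) - 1)	/* unsigned long */
#define PWQ_MASK	(~FLAG_MASK)			/* unsigned long */

/* Combine an aligned pointer with flag bits into one data word. */
static inline unsigned long pack_pwq(struct pool_workqueue *pwq,
				     unsigned long flags)
{
	/* The pointer must leave the flag bits free. */
	assert(((unsigned long)pwq & FLAG_MASK) == 0);
	return (unsigned long)pwq | (flags & FLAG_MASK);
}

/* The single place where the data word is turned back into a pointer. */
static inline struct pool_workqueue *data_to_pwq(unsigned long data)
{
	return (struct pool_workqueue *)(data & PWQ_MASK);
}
```

Keeping the cast in one helper means every caller gets the same, correctly typed pointer instead of a `void *` that the compiler converts implicitly.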
2023-06-23 22:08:14 +03:00
static inline struct pool_workqueue * work_struct_pwq ( unsigned long data )
{
2024-02-21 08:36:14 +03:00
return ( struct pool_workqueue * ) ( data & WORK_STRUCT_PWQ_MASK ) ;
2023-06-23 22:08:14 +03:00
}
2013-02-14 07:29:12 +04:00
static struct pool_workqueue * get_work_pwq ( struct work_struct * work )
2007-05-09 13:34:12 +04:00
{
2010-07-22 16:14:25 +04:00
unsigned long data = atomic_long_read ( & work - > data ) ;
2010-06-29 12:07:13 +04:00
2013-02-14 07:29:12 +04:00
if ( data & WORK_STRUCT_PWQ )
2023-06-23 22:08:14 +03:00
return work_struct_pwq ( data ) ;
2010-07-22 16:14:25 +04:00
else
return NULL ;
2010-04-23 19:40:40 +04:00
}
2013-01-24 23:01:33 +04:00
/**
* get_work_pool - return the worker_pool a given work was associated with
* @ work : the work item of interest
*
2013-03-26 03:57:17 +04:00
* Pools are created and destroyed under wq_pool_mutex , and allow read
2019-03-13 19:55:47 +03:00
* access under RCU read lock . As such , this function should be
* called under wq_pool_mutex or inside of a rcu_read_lock ( ) region .
2013-03-12 22:30:00 +04:00
*
* All fields of the returned pool are accessible as long as the above
* mentioned locking is in effect . If the returned pool needs to be used
* beyond the critical section , the caller is responsible for ensuring the
* returned pool is and stays online .
2013-08-01 01:59:24 +04:00
*
* Return : The worker_pool @ work was last associated with . % NULL if none .
2013-01-24 23:01:33 +04:00
*/
static struct worker_pool * get_work_pool ( struct work_struct * work )
2006-11-22 17:54:49 +03:00
{
2010-07-22 16:14:25 +04:00
unsigned long data = atomic_long_read ( & work - > data ) ;
2013-01-24 23:01:33 +04:00
int pool_id ;
2010-06-29 12:07:13 +04:00
2013-03-26 03:57:17 +04:00
assert_rcu_or_pool_mutex ( ) ;
2013-03-12 22:30:00 +04:00
2013-02-14 07:29:12 +04:00
if ( data & WORK_STRUCT_PWQ )
2023-06-23 22:08:14 +03:00
return work_struct_pwq ( data ) - > pool ;
2010-06-29 12:07:13 +04:00
2013-01-24 23:01:33 +04:00
pool_id = data > > WORK_OFFQ_POOL_SHIFT ;
if ( pool_id = = WORK_OFFQ_POOL_NONE )
2010-06-29 12:07:13 +04:00
return NULL ;
2013-03-12 22:30:00 +04:00
return idr_find ( & worker_pool_idr , pool_id ) ;
2013-01-24 23:01:33 +04:00
}
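The two-way decoding done by get_work_pool() can be sketched in userspace like this. The bit layout, `PWQ_FLAG`, `POOL_SHIFT`, `POOL_NONE`, the `pools[]` table and `lookup_pool()` are all hypothetical. A plain array lookup replaces `idr_find()`, and the locking requirements described in the comment above are not modelled.

```c
#include <assert.h>
#include <stddef.h>

static_assert(sizeof(unsigned long) == sizeof(void *),
	      "sketch assumes pointer-sized unsigned long");

struct worker_pool {
	int id;
};

struct pool_workqueue {
	struct worker_pool *pool;
};

/* Hypothetical layout: 4 low flag bits, bit 0 marks "holds a pwq". */
#define FLAG_BITS	4
#define PWQ_FLAG	(1UL << 0)
#define PWQ_MASK	(~((1UL << FLAG_BITS) - 1))
#define POOL_SHIFT	FLAG_BITS
#define POOL_NONE	(~0UL >> POOL_SHIFT)

#define NR_POOLS	4
static struct worker_pool pools[NR_POOLS] = { {0}, {1}, {2}, {3} };

/* Return the pool a data word refers to, or NULL if there is none. */
static struct worker_pool *lookup_pool(unsigned long data)
{
	unsigned long pool_id;

	/* Queued item: the word holds a pool_workqueue pointer. */
	if (data & PWQ_FLAG)
		return ((struct pool_workqueue *)(data & PWQ_MASK))->pool;

	/* Off-queue item: the high bits hold the last pool ID. */
	pool_id = data >> POOL_SHIFT;
	if (pool_id == POOL_NONE || pool_id >= NR_POOLS)
		return NULL;
	return &pools[pool_id];
}
```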
2024-03-25 20:21:02 +03:00
static unsigned long shift_and_mask ( unsigned long v , u32 shift , u32 bits )
2013-01-24 23:01:33 +04:00
{
2024-03-25 20:21:02 +03:00
return ( v > > shift ) & ( ( 1U < < bits ) - 1 ) ;
2013-01-24 23:01:33 +04:00
}
2024-03-25 20:21:02 +03:00
static void work_offqd_unpack ( struct work_offq_data * offqd , unsigned long data )
2012-08-03 21:30:46 +04:00
{
2024-03-25 20:21:02 +03:00
WARN_ON_ONCE ( data & WORK_STRUCT_PWQ ) ;
2012-08-03 21:30:46 +04:00
2024-03-25 20:21:02 +03:00
offqd - > pool_id = shift_and_mask ( data , WORK_OFFQ_POOL_SHIFT ,
WORK_OFFQ_POOL_BITS ) ;
2024-03-25 20:21:03 +03:00
offqd - > disable = shift_and_mask ( data , WORK_OFFQ_DISABLE_SHIFT ,
WORK_OFFQ_DISABLE_BITS ) ;
2024-03-25 20:21:02 +03:00
offqd - > flags = data & WORK_OFFQ_FLAG_MASK ;
2012-08-03 21:30:46 +04:00
}
2024-03-25 20:21:02 +03:00
static unsigned long work_offqd_pack_flags ( struct work_offq_data * offqd )
2012-08-03 21:30:46 +04:00
{
2024-03-25 20:21:03 +03:00
return ( ( unsigned long ) offqd - > disable < < WORK_OFFQ_DISABLE_SHIFT ) |
( ( unsigned long ) offqd - > flags ) ;
2012-08-03 21:30:46 +04:00
}
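A round trip through the off-queue packing helpers can be sketched as below. The bit widths and the names `struct offq_data`, `offq_pack()` and `offq_unpack()` are hypothetical. Unlike work_offqd_pack_flags(), this sketch also packs the pool ID so that the round trip is complete. The mask is built from an unsigned constant so that the computation stays well-defined for field widths up to 31 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen only for illustration. */
#define FLAG_BITS	4
#define FLAG_MASK	((1UL << FLAG_BITS) - 1)
#define DISABLE_SHIFT	FLAG_BITS
#define DISABLE_BITS	16
#define POOL_SHIFT	(DISABLE_SHIFT + DISABLE_BITS)
#define POOL_BITS	11

struct offq_data {
	uint32_t pool_id;
	uint32_t disable;
	uint32_t flags;
};

/* Extract a 'bits'-wide field that starts at bit 'shift'. */
static unsigned long shift_and_mask(unsigned long v, uint32_t shift,
				    uint32_t bits)
{
	assert(bits < 32);
	return (v >> shift) & ((1U << bits) - 1);
}

/* Split a data word into its three fields. */
static void offq_unpack(struct offq_data *d, unsigned long data)
{
	d->pool_id = shift_and_mask(data, POOL_SHIFT, POOL_BITS);
	d->disable = shift_and_mask(data, DISABLE_SHIFT, DISABLE_BITS);
	d->flags = data & FLAG_MASK;
}

/* Reassemble the data word from its fields. */
static unsigned long offq_pack(const struct offq_data *d)
{
	return ((unsigned long)d->pool_id << POOL_SHIFT) |
	       ((unsigned long)d->disable << DISABLE_SHIFT) |
	       (unsigned long)d->flags;
}
```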
2010-06-29 12:07:14 +04:00
/*
2012-07-14 09:16:45 +04:00
* Policy functions . These define the policies on how the global worker
* pools are managed . Unless noted otherwise , these functions assume that
2013-01-24 23:01:33 +04:00
* they ' re being called with pool - > lock held .
2010-06-29 12:07:14 +04:00
*/
[PATCH] WorkStruct: Use direct assignment rather than cmpxchg()
Use direct assignment rather than cmpxchg() as the latter is unavailable
and unimplementable on some platforms and is actually unnecessary.
The use of cmpxchg() was to guard against two possibilities, neither of
which can actually occur:
(1) The pending flag may have been unset or may be cleared. However, given
where it's called, the pending flag is _always_ set. I don't think it
can be unset whilst we're in set_wq_data().
Once the work is enqueued to be actually run, the only way off the queue
is for it to be actually run.
If it's a delayed work item, then the bit can't be cleared by the timer
because we haven't started the timer yet. Also, the pending bit can't be
cleared by cancelling the delayed work _until_ the work item has had its
timer started.
(2) The workqueue pointer might change. This can only happen in two cases:
(a) The work item has just been queued to actually run, and so we're
protected by the appropriate workqueue spinlock.
(b) A delayed work item is being queued, and so the timer hasn't been
started yet, and so no one else knows about the work item or can
access it (the pending bit protects us).
Besides, set_wq_data() _sets_ the workqueue pointer unconditionally, so
it can be assigned instead.
So, replacing the set_wq_data() with a straight assignment would be okay
in most cases.
The problem is where we end up tangling with test_and_set_bit() emulated
using spinlocks, and even then it's not a problem _provided_
test_and_set_bit() doesn't attempt to modify the word if the bit was
set.
If that's a problem, then a bitops-proofed assignment will be required -
equivalent to atomic_set() vs other atomic_xxx() ops.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 14:33:26 +03:00
/*
2010-06-29 12:07:14 +04:00
* Need to wake up a worker ? Called from anything but currently
* running workers .
2012-07-13 01:46:37 +04:00
*
* Note that , because unbound workers never contribute to nr_running , this
2013-01-24 23:01:34 +04:00
* function will always return % true for unbound pools as long as the
2012-07-13 01:46:37 +04:00
* worklist isn ' t empty .
2006-12-07 14:33:26 +03:00
*/
2012-07-13 01:46:37 +04:00
static bool need_more_worker ( struct worker_pool * pool )
2006-11-22 17:54:49 +03:00
{
2023-08-08 04:57:25 +03:00
return ! list_empty ( & pool - > worklist ) & & ! pool - > nr_running ;
2010-06-29 12:07:14 +04:00
}
2006-12-07 14:33:26 +03:00
2010-06-29 12:07:14 +04:00
/* Can I start working? Called from busy but !running workers. */
2012-07-13 01:46:37 +04:00
static bool may_start_working ( struct worker_pool * pool )
2010-06-29 12:07:14 +04:00
{
2012-07-13 01:46:37 +04:00
return pool - > nr_idle ;
2010-06-29 12:07:14 +04:00
}
/* Do I need to keep working? Called from currently running workers. */
2012-07-13 01:46:37 +04:00
static bool keep_working ( struct worker_pool * pool )
2010-06-29 12:07:14 +04:00
{
2021-12-23 15:31:40 +03:00
return ! list_empty ( & pool - > worklist ) & & ( pool - > nr_running < = 1 ) ;
2010-06-29 12:07:14 +04:00
}
/* Do we need a new worker? Called from manager. */
2012-07-13 01:46:37 +04:00
static bool need_to_create_worker ( struct worker_pool * pool )
2010-06-29 12:07:14 +04:00
{
2012-07-13 01:46:37 +04:00
return need_more_worker ( pool ) & & ! may_start_working ( pool ) ;
2010-06-29 12:07:14 +04:00
}
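The policy predicates above can be exercised in isolation with a small model. `struct pool_state` and its `nr_queued` field are hypothetical, with `nr_queued > 0` standing in for a non-empty worklist. Locking is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, lock-free model of the counters the policies read. */
struct pool_state {
	int nr_queued;	/* > 0 means the worklist is not empty */
	int nr_running;
	int nr_idle;
};

/* Work is waiting and nobody is running: wake somebody up. */
static bool need_more_worker(const struct pool_state *p)
{
	return p->nr_queued > 0 && p->nr_running == 0;
}

/* An idle worker exists, so a busy worker may start working. */
static bool may_start_working(const struct pool_state *p)
{
	return p->nr_idle > 0;
}

/* Work is waiting and at most one worker is running. */
static bool keep_working(const struct pool_state *p)
{
	return p->nr_queued > 0 && p->nr_running <= 1;
}

/* Work is waiting, nobody runs it, and no idle worker is left. */
static bool need_to_create_worker(const struct pool_state *p)
{
	/* Sanity check for the sketch: counters never go negative. */
	assert(p->nr_running >= 0 && p->nr_idle >= 0);
	return need_more_worker(p) && !may_start_working(p);
}
```

With one item queued and no running or idle workers, `need_to_create_worker()` is true; adding a single idle worker turns it false, because that worker can be woken instead.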
2006-11-22 17:54:49 +03:00
2010-06-29 12:07:14 +04:00
/* Do we have too many workers and should some go away? */
2012-07-13 01:46:37 +04:00
static bool too_many_workers ( struct worker_pool * pool )
2010-06-29 12:07:14 +04:00
{
2017-10-09 18:04:13 +03:00
bool managing = pool - > flags & POOL_MANAGER_ACTIVE ;
2012-07-13 01:46:37 +04:00
int nr_idle = pool - > nr_idle + managing ; /* manager is considered idle */
int nr_busy = pool - > nr_workers - nr_idle ;
2010-06-29 12:07:14 +04:00
return nr_idle > 2 & & ( nr_idle - 2 ) * MAX_IDLE_WORKERS_RATIO > = nr_busy ;
2006-11-22 17:54:49 +03:00
}
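The idle-ratio check in too_many_workers() is easier to follow with concrete numbers. The sketch below takes the counters as plain arguments instead of a pool, and assumes `MAX_IDLE_WORKERS_RATIO` is 4; that value is not shown in this part of the file and is an assumption here.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_IDLE_WORKERS_RATIO	4	/* assumed value */

/* Same arithmetic as above, with the counters passed in directly. */
static bool too_many_workers(int nr_workers, int nr_idle, bool managing)
{
	int idle = nr_idle + managing;	/* manager is considered idle */
	int busy = nr_workers - idle;

	assert(busy >= 0);
	return idle > 2 && (idle - 2) * MAX_IDLE_WORKERS_RATIO >= busy;
}
```

Under that assumption, two idle workers are always tolerated. With three idle workers the check fires once there are four or fewer busy workers, since (3 - 2) * 4 >= 4, and stays quiet with five busy workers.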
2023-05-18 06:02:08 +03:00
/**
* worker_set_flags - set worker flags and adjust nr_running accordingly
* @ worker : self
* @ flags : flags to set
*
* Set @ flags in @ worker - > flags and adjust nr_running accordingly .
*/
static inline void worker_set_flags ( struct worker * worker , unsigned int flags )
{
struct worker_pool * pool = worker - > pool ;
2023-08-08 04:57:22 +03:00
lockdep_assert_held ( & pool - > lock ) ;
2023-05-18 06:02:08 +03:00
/* If transitioning into NOT_RUNNING, adjust nr_running. */
if ( ( flags & WORKER_NOT_RUNNING ) & &
! ( worker - > flags & WORKER_NOT_RUNNING ) ) {
pool - > nr_running - - ;
}
worker - > flags | = flags ;
}
/**
* worker_clr_flags - clear worker flags and adjust nr_running accordingly
* @ worker : self
* @ flags : flags to clear
*
* Clear @ flags in @ worker - > flags and adjust nr_running accordingly .
*/
static inline void worker_clr_flags ( struct worker * worker , unsigned int flags )
{
struct worker_pool * pool = worker - > pool ;
unsigned int oflags = worker - > flags ;
2023-08-08 04:57:22 +03:00
lockdep_assert_held ( & pool - > lock ) ;
2023-05-18 06:02:08 +03:00
worker - > flags & = ~ flags ;
/*
* If transitioning out of NOT_RUNNING , increment nr_running . Note
* that the nested NOT_RUNNING is not a noop . NOT_RUNNING is mask
* of multiple flags , not a single flag .
*/
if ( ( flags & WORKER_NOT_RUNNING ) & & ( oflags & WORKER_NOT_RUNNING ) )
if ( ! ( worker - > flags & WORKER_NOT_RUNNING ) )
pool - > nr_running + + ;
}
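The note that NOT_RUNNING is a mask of several flags is the subtle part of these two helpers. A userspace sketch with two hypothetical flags, `FLAG_A` and `FLAG_B`, shows why nested set and clear calls still keep nr_running balanced. The structures are stripped down and the lock assertion is omitted.

```c
#include <assert.h>

/* Hypothetical flags; NOT_RUNNING is a mask of both. */
#define FLAG_A		(1U << 0)
#define FLAG_B		(1U << 1)
#define NOT_RUNNING	(FLAG_A | FLAG_B)

struct pool {
	int nr_running;
};

struct worker {
	unsigned int flags;
	struct pool *pool;
};

static void worker_set_flags(struct worker *w, unsigned int flags)
{
	/* Only the first NOT_RUNNING flag takes the worker out. */
	if ((flags & NOT_RUNNING) && !(w->flags & NOT_RUNNING)) {
		assert(w->pool->nr_running > 0);
		w->pool->nr_running--;
	}
	w->flags |= flags;
}

static void worker_clr_flags(struct worker *w, unsigned int flags)
{
	unsigned int oflags = w->flags;

	w->flags &= ~flags;
	/* Only clearing the last NOT_RUNNING flag puts it back. */
	if ((flags & NOT_RUNNING) && (oflags & NOT_RUNNING) &&
	    !(w->flags & NOT_RUNNING))
		w->pool->nr_running++;
}
```

Setting `FLAG_A` and then `FLAG_B` decrements nr_running once; clearing `FLAG_A` alone leaves it unchanged, and only clearing `FLAG_B` as well restores it.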
2023-08-08 04:57:23 +03:00
/* Return the first idle worker. Called with pool->lock held. */
static struct worker * first_idle_worker ( struct worker_pool * pool )
{
if ( unlikely ( list_empty ( & pool - > idle_list ) ) )
return NULL ;
return list_first_entry ( & pool - > idle_list , struct worker , entry ) ;
}
/**
* worker_enter_idle - enter idle state
* @ worker : worker which is entering idle state
*
* @ worker is entering idle state . Update stats and idle timer if
* necessary .
*
* LOCKING :
* raw_spin_lock_irq ( pool - > lock ) .
*/
static void worker_enter_idle ( struct worker * worker )
{
struct worker_pool * pool = worker - > pool ;
if ( WARN_ON_ONCE ( worker - > flags & WORKER_IDLE ) | |
WARN_ON_ONCE ( ! list_empty ( & worker - > entry ) & &
( worker - > hentry . next | | worker - > hentry . pprev ) ) )
return ;
/* can't use worker_set_flags(), also called from create_worker() */
worker - > flags | = WORKER_IDLE ;
pool - > nr_idle + + ;
worker - > last_active = jiffies ;
/* idle_list is LIFO */
list_add ( & worker - > entry , & pool - > idle_list ) ;
if ( too_many_workers ( pool ) & & ! timer_pending ( & pool - > idle_timer ) )
mod_timer ( & pool - > idle_timer , jiffies + IDLE_WORKER_TIMEOUT ) ;
/* Sanity check nr_running. */
WARN_ON_ONCE ( pool - > nr_workers = = pool - > nr_idle & & pool - > nr_running ) ;
}
/**
* worker_leave_idle - leave idle state
* @ worker : worker which is leaving idle state
*
* @ worker is leaving idle state . Update stats .
*
* LOCKING :
* raw_spin_lock_irq ( pool - > lock ) .
*/
static void worker_leave_idle ( struct worker * worker )
{
struct worker_pool * pool = worker - > pool ;
if ( WARN_ON_ONCE ( ! ( worker - > flags & WORKER_IDLE ) ) )
return ;
worker_clr_flags ( worker , WORKER_IDLE ) ;
pool - > nr_idle - - ;
list_del_init ( & worker - > entry ) ;
}
/**
* find_worker_executing_work - find worker which is executing a work
* @ pool : pool of interest
* @ work : work to find worker for
*
* Find a worker which is executing @ work on @ pool by searching
* @ pool - > busy_hash which is keyed by the address of @ work . For a worker
* to match , its current execution should match the address of @ work and
* its work function . This is to avoid unwanted dependency between
* unrelated work executions through a work item being recycled while still
* being executed .
*
* This is a bit tricky . A work item may be freed once its execution
* starts and nothing prevents the freed area from being recycled for
* another work item . If the same work item address ends up being reused
* before the original execution finishes , workqueue will identify the
* recycled work item as currently executing and make it wait until the
* current execution finishes , introducing an unwanted dependency .
*
* This function checks the work item address and work function to avoid
* false positives . Note that this isn ' t complete as one may construct a
* work function which can introduce dependency onto itself through a
* recycled work item . Well , if somebody wants to shoot oneself in the
* foot that badly , there ' s only so much we can do , and if such deadlock
* actually occurs , it should be easy to locate the culprit work function .
*
* CONTEXT :
* raw_spin_lock_irq ( pool - > lock ) .
*
* Return :
* Pointer to worker which is executing @ work if found , % NULL
* otherwise .
*/
static struct worker * find_worker_executing_work ( struct worker_pool * pool ,
struct work_struct * work )
{
struct worker * worker ;
hash_for_each_possible ( pool - > busy_hash , worker , hentry ,
( unsigned long ) work )
if ( worker - > current_work = = work & &
worker - > current_func = = work - > func )
return worker ;
return NULL ;
}
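/*
 * Illustrative userspace sketch, not kernel code: it mirrors the match rule
 * described above, where a worker only counts as "executing @work" if both
 * the work item address and the work function agree. The names work_fn,
 * busy_worker, find_executing, fn_a and fn_b are hypothetical, and a linear
 * scan stands in for the busy_hash lookup.
 */

```c
#include <assert.h>
#include <stddef.h>

typedef void (*work_fn)(void *);

struct work {
	work_fn func;
};

struct busy_worker {
	const struct work *current_work;	/* address of the item being run */
	work_fn current_func;			/* function it was started with */
};

/* Return the busy worker running @work, or NULL if nobody is. */
static struct busy_worker *find_executing(struct busy_worker *busy, size_t n,
					  const struct work *work)
{
	assert(work != NULL);
	for (size_t i = 0; i < n; i++)
		if (busy[i].current_work == work &&
		    busy[i].current_func == work->func)
			return &busy[i];
	return NULL;
}

/* Two distinct work functions used to model a recycled work item. */
static void fn_a(void *arg) { ++*(int *)arg; }
static void fn_b(void *arg) { --*(int *)arg; }
```

/*
 * If the memory of a running item is reused for an item with a different
 * function, the address still matches but the function does not, so the
 * lookup reports no collision.
 */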
/**
* move_linked_works - move linked works to a list
* @ work : start of series of works to be scheduled
* @ head : target list to append @ work to
* @ nextp : out parameter for nested worklist walking
*
workqueue: Factor out work to worker assignment and collision handling
The two work execution paths in worker_thread() and rescuer_thread() use
move_linked_works() to claim work items from @pool->worklist. Once claimed,
process_scheduled_works() is called, which invokes process_one_work() on each
work item. process_one_work() then uses find_worker_executing_work() to
detect and handle collisions - situations where the work item to be executed
is still running on another worker.
This works fine, but, to improve work execution locality, we want to
establish work to worker association earlier and know for sure that the
worker is going to execute the work once assigned, which requires performing
collision handling earlier while trying to assign the work item to the
worker.
This patch introduces assign_work() which assigns a work item to a worker
using move_linked_works() and then performs collision handling. As collisions
are handled earlier, process_one_work() no longer needs to worry
about them.
After this patch, collision checks for linked work items are skipped,
which should be fine as they can't be queued multiple times concurrently.
For work items running from rescuers, the timing of collision handling may
change but the invariant that the work items go through collision handling
before starting execution does not.
This patch shouldn't cause noticeable behavior changes, especially given
that worker_thread() behavior remains the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-08 04:57:25 +03:00
* Schedule linked works starting from @ work to @ head . Work series to be
* scheduled starts at @ work and includes any consecutive work with
* WORK_STRUCT_LINKED set in its predecessor . See assign_work ( ) for details on
* @ nextp .
2023-08-08 04:57:23 +03:00
*
* CONTEXT :
* raw_spin_lock_irq ( pool - > lock ) .
*/
static void move_linked_works ( struct work_struct * work , struct list_head * head ,
struct work_struct * * nextp )
{
struct work_struct * n ;
/*
* Linked worklist will always end before the end of the list ,
* use NULL for list head .
*/
list_for_each_entry_safe_from ( work , n , NULL , entry ) {
list_move_tail ( & work - > entry , head ) ;
if ( ! ( * work_data_bits ( work ) & WORK_STRUCT_LINKED ) )
break ;
}
/*
* If we ' re already inside safe list traversal and have moved
* multiple works to the scheduled queue , the next position
* needs to be updated .
*/
if ( nextp )
* nextp = n ;
}
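/*
 * Illustrative userspace sketch, not kernel code: it shows the idea of moving
 * a run of linked entries to the tail of another list and reporting where the
 * caller's traversal should resume. struct node, its "linked" flag (a
 * stand-in for WORK_STRUCT_LINKED) and move_linked_run() are hypothetical
 * names. As in the comment above, the sketch assumes a linked run always ends
 * before the list head.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *prev, *next;
	bool linked;		/* the following node belongs to the same run */
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/*
 * Move @first and every node linked after it to the tail of @dst.
 * Returns the node that followed the moved run in the source list.
 */
static struct node *move_linked_run(struct node *first, struct node *dst)
{
	struct node *cur = first, *next;

	assert(first != NULL && dst != NULL);
	for (;;) {
		bool more = cur->linked;

		next = cur->next;	/* sample before unlinking @cur */
		list_del(cur);
		list_add_tail(cur, dst);
		if (!more)
			break;
		cur = next;
	}
	return next;
}
```

/*
 * The returned node plays the role of *nextp: a caller that is in the middle
 * of a safe traversal continues from it instead of from a node that has
 * already been moved.
 */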
/**
* assign_work - assign a work item and its linked work items to a worker
* @ work : work to assign
* @ worker : worker to assign to
* @ nextp : out parameter for nested worklist walking
*
* Assign @ work and its linked work items to @ worker . If @ work is already being
* executed by another worker in the same pool , it ' ll be punted there .
*
* If @ nextp is not NULL , it ' s updated to point to the next work of the last
* scheduled work . This allows assign_work ( ) to be nested inside
* list_for_each_entry_safe ( ) .
*
* Returns % true if @ work was successfully assigned to @ worker . % false if @ work
* was punted to another worker already executing it .
*/
static bool assign_work ( struct work_struct * work , struct worker * worker ,
struct work_struct * * nextp )
{
struct worker_pool * pool = worker - > pool ;
struct worker * collision ;
lockdep_assert_held ( & pool - > lock ) ;
/*
* A single work shouldn ' t be executed concurrently by multiple workers .
* __queue_work ( ) ensures that @ work doesn ' t jump to a different pool
* while still running in the previous pool . Here , we should ensure that
* @ work is not executed concurrently by multiple workers from the same
* pool . Check whether anyone is already processing the work . If so ,
* defer the work to the currently executing one .
*/
collision = find_worker_executing_work ( pool , work ) ;
if ( unlikely ( collision ) ) {
move_linked_works ( work , & collision - > scheduled , nextp ) ;
return false ;
}
move_linked_works ( work , & worker - > scheduled , nextp ) ;
return true ;
}
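/*
 * Illustrative userspace sketch, not kernel code: it models the return value
 * contract of the assignment step above. If some worker is already running
 * the item, the item is queued behind that worker and false is returned;
 * otherwise it is queued on the requesting worker and true is returned.
 * struct worker here, push_scheduled(), assign() and noop_fn() are
 * hypothetical, and a fixed-size array replaces the scheduled list.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SCHEDULED 8

typedef void (*work_fn)(void *);

struct work {
	work_fn func;
};

struct worker {
	const struct work *current_work;	/* item being executed, if any */
	work_fn current_func;
	const struct work *scheduled[MAX_SCHEDULED];
	size_t nr_scheduled;
};

static void noop_fn(void *arg) { (void)arg; }

static void push_scheduled(struct worker *w, const struct work *work)
{
	assert(w->nr_scheduled < MAX_SCHEDULED);
	w->scheduled[w->nr_scheduled++] = work;
}

/* Assign @work to @self unless another worker in @workers is running it. */
static bool assign(struct worker *workers, size_t n, struct worker *self,
		   const struct work *work)
{
	for (size_t i = 0; i < n; i++) {
		struct worker *w = &workers[i];

		if (w->current_work == work && w->current_func == work->func) {
			push_scheduled(w, work);	/* punt to the running worker */
			return false;
		}
	}
	push_scheduled(self, work);
	return true;
}
```

/*
 * Punting keeps a single work item from running on two workers of the same
 * pool at once: the second execution is serialized behind the first.
 */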
2024-02-14 21:33:55 +03:00
static struct irq_work * bh_pool_irq_work ( struct worker_pool * pool )
{
int high = pool - > attrs - > nice = = HIGHPRI_NICE_LEVEL ? 1 : 0 ;
return & per_cpu ( bh_pool_irq_works , pool - > cpu ) [ high ] ;
}
2024-02-16 08:10:01 +03:00
static void kick_bh_pool ( struct worker_pool * pool )
{
# ifdef CONFIG_SMP
2024-02-27 04:38:55 +03:00
/* see drain_dead_softirq_workfn() for BH_DRAINING */
if ( unlikely ( pool - > cpu ! = smp_processor_id ( ) & &
! ( pool - > flags & POOL_BH_DRAINING ) ) ) {
2024-02-16 08:10:01 +03:00
irq_work_queue_on ( bh_pool_irq_work ( pool ) , pool - > cpu ) ;
return ;
}
# endif
if ( pool - > attrs - > nice = = HIGHPRI_NICE_LEVEL )
raise_softirq_irqoff ( HI_SOFTIRQ ) ;
else
raise_softirq_irqoff ( TASKLET_SOFTIRQ ) ;
}
2023-08-08 04:57:23 +03:00
/**
2023-08-08 04:57:25 +03:00
* kick_pool - wake up an idle worker if necessary
* @ pool : pool to kick
2023-08-08 04:57:23 +03:00
*
2023-08-08 04:57:25 +03:00
* @ pool may have pending work items . Wake up worker if necessary . Returns
* whether a worker was woken up .
2023-08-08 04:57:23 +03:00
*/
2023-08-08 04:57:25 +03:00
static bool kick_pool ( struct worker_pool * pool )
2023-08-08 04:57:23 +03:00
{
struct worker * worker = first_idle_worker ( pool ) ;
workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools,
each serving one L3 cache domain.
While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.
While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.
This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside, i.e. work items start executing within their affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.
After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:
* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
->__pod_cpumask so that the workers are allowed to run on any CPU that
the associated workqueues allow.
* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
the field to a CPU within the pod.
This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.
There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.
While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.
v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-08 04:57:25 +03:00
struct task_struct * p ;
2023-08-08 04:57:23 +03:00
2023-08-08 04:57:25 +03:00
lockdep_assert_held ( & pool - > lock ) ;
if ( ! need_more_worker ( pool ) | | ! worker )
return false ;
2024-02-05 00:28:06 +03:00
if ( pool - > flags & POOL_BH ) {
2024-02-16 08:10:01 +03:00
kick_bh_pool ( pool ) ;
2024-02-05 00:28:06 +03:00
return true ;
}
p = worker - > task ;
# ifdef CONFIG_SMP
/*
* Idle @ worker is about to execute @ work and waking up provides an
* opportunity to migrate @ worker at a lower cost by setting the task ' s
* wake_cpu field . Let ' s see if we want to move @ worker to improve
* execution locality .
*
* We ' re waking the worker that went idle the latest and there ' s some
* chance that @ worker is marked idle but hasn ' t gone off CPU yet . If
* so , setting the wake_cpu won ' t do anything . As this is a best - effort
* optimization and the race window is narrow , let ' s leave as - is for
* now . If this becomes pronounced , we can skip over workers which are
* still on cpu when picking an idle worker .
*
* If @ pool has non - strict affinity , @ worker might have ended up outside
* its affinity scope . Repatriate .
*/
if ( ! pool - > attrs - > affn_strict & &
! cpumask_test_cpu ( p - > wake_cpu , pool - > attrs - > __pod_cpumask ) ) {
struct work_struct * work = list_first_entry ( & pool - > worklist ,
struct work_struct , entry ) ;
2024-04-23 09:19:05 +03:00
int wake_cpu = cpumask_any_and_distribute ( pool - > attrs - > __pod_cpumask ,
cpu_online_mask ) ;
if ( wake_cpu < nr_cpu_ids ) {
p - > wake_cpu = wake_cpu ;
get_work_pwq ( work ) - > stats [ PWQ_STAT_REPATRIATED ] + + ;
}
}
# endif
wake_up_process ( p ) ;
2023-08-08 04:57:25 +03:00
return true ;
2023-08-08 04:57:23 +03:00
}
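/*
 * Illustrative userspace sketch, not kernel code: it models the repatriation
 * decision made before waking an idle worker. A 64-bit word stands in for a
 * cpumask, and pick_cpu() takes the lowest CPU set in both masks, which is a
 * simplification of cpumask_any_and_distribute(). NR_CPUS, pick_cpu() and
 * repatriate() are hypothetical names chosen for this example.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64

/* Lowest CPU present in both masks, or NR_CPUS if there is none. */
static int pick_cpu(uint64_t pod_mask, uint64_t online_mask)
{
	uint64_t both = pod_mask & online_mask;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (both & (UINT64_C(1) << cpu))
			return cpu;
	return NR_CPUS;
}

/*
 * Return the CPU the worker should be woken on. @nr_repatriated is bumped
 * only when the wake CPU is actually changed.
 */
static int repatriate(int wake_cpu, bool affn_strict, uint64_t pod_mask,
		      uint64_t online_mask, unsigned int *nr_repatriated)
{
	int target;

	assert(wake_cpu >= 0 && wake_cpu < NR_CPUS);

	/* Strict pools never stray; in-pod workers need no correction. */
	if (affn_strict || (pod_mask & (UINT64_C(1) << wake_cpu)))
		return wake_cpu;

	target = pick_cpu(pod_mask, online_mask);
	if (target >= NR_CPUS)
		return wake_cpu;	/* no online CPU in the pod */

	(*nr_repatriated)++;
	return target;
}
```

/*
 * The counter corresponds to the idea behind PWQ_STAT_REPATRIATED: it only
 * counts wakeups where the worker was pulled back into its pod.
 */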
2023-05-18 06:02:08 +03:00
# ifdef CONFIG_WQ_CPU_INTENSIVE_REPORT
/*
* Concurrency - managed per - cpu work items that hog CPU for longer than
* wq_cpu_intensive_thresh_us trigger the automatic CPU_INTENSIVE mechanism ,
* which prevents them from stalling other concurrency - managed work items . If a
* work function keeps triggering this mechanism , it ' s likely that the work item
* should be using an unbound workqueue instead .
*
* wq_cpu_intensive_report ( ) tracks work functions which trigger such conditions
* and reports them so that they can be examined and converted to use unbound
* workqueues as appropriate . To avoid flooding the console , each violating work
* function is tracked and reported with exponential backoff .
*/
# define WCI_MAX_ENTS 128
struct wci_ent {
work_func_t func ;
atomic64_t cnt ;
struct hlist_node hash_node ;
} ;
static struct wci_ent wci_ents [ WCI_MAX_ENTS ] ;
static int wci_nr_ents ;
static DEFINE_RAW_SPINLOCK ( wci_lock ) ;
static DEFINE_HASHTABLE ( wci_hash , ilog2 ( WCI_MAX_ENTS ) ) ;
static struct wci_ent * wci_find_ent ( work_func_t func )
{
struct wci_ent * ent ;
hash_for_each_possible_rcu ( wci_hash , ent , hash_node ,
( unsigned long ) func ) {
if ( ent - > func = = func )
return ent ;
}
return NULL ;
}
static void wq_cpu_intensive_report ( work_func_t func )
{
struct wci_ent * ent ;
restart :
ent = wci_find_ent ( func ) ;
if ( ent ) {
u64 cnt ;
/*
2024-02-22 10:28:08 +03:00
* Start reporting from the warning_thresh and back off
2023-05-18 06:02:08 +03:00
* exponentially .
*/
cnt = atomic64_inc_return_relaxed ( & ent - > cnt ) ;
2024-02-22 10:28:08 +03:00
if ( wq_cpu_intensive_warning_thresh & &
cnt > = wq_cpu_intensive_warning_thresh & &
is_power_of_2 ( cnt + 1 - wq_cpu_intensive_warning_thresh ) )
2023-05-18 06:02:08 +03:00
printk_deferred ( KERN_WARNING " workqueue: %ps hogged CPU for >%luus %llu times, consider switching to WQ_UNBOUND \n " ,
ent - > func , wq_cpu_intensive_thresh_us ,
atomic64_read ( & ent - > cnt ) ) ;
return ;
}
/*
* @ func is a new violation . Allocate a new entry for it . If wci_ents [ ]
* is exhausted , something went really wrong and we probably made enough
* noise already .
*/
if ( wci_nr_ents > = WCI_MAX_ENTS )
return ;
raw_spin_lock ( & wci_lock ) ;
if ( wci_nr_ents > = WCI_MAX_ENTS ) {
raw_spin_unlock ( & wci_lock ) ;
return ;
}
if ( wci_find_ent ( func ) ) {
raw_spin_unlock ( & wci_lock ) ;
goto restart ;
}
ent = & wci_ents [ wci_nr_ents + + ] ;
ent - > func = func ;
2024-02-22 10:28:08 +03:00
atomic64_set ( & ent - > cnt , 0 ) ;
2023-05-18 06:02:08 +03:00
hash_add_rcu ( wci_hash , & ent - > hash_node , ( unsigned long ) func ) ;
raw_spin_unlock ( & wci_lock ) ;
2024-02-22 10:28:08 +03:00
goto restart ;
2023-05-18 06:02:08 +03:00
}
# else /* CONFIG_WQ_CPU_INTENSIVE_REPORT */
static void wq_cpu_intensive_report ( work_func_t func ) { }
# endif /* CONFIG_WQ_CPU_INTENSIVE_REPORT */
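/*
 * Illustrative userspace sketch, not kernel code: it isolates the exponential
 * backoff test used when deciding whether to print a warning. The local
 * is_power_of_2() is a stand-in for the kernel helper of the same name, and
 * should_report() is a hypothetical wrapper around the condition.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * @cnt:    number of violations seen so far, including the current one
 * @thresh: first count that is reported; 0 disables reporting
 *
 * Reports happen when the distance past the threshold, counted from 1,
 * is a power of two: thresh, thresh + 1, thresh + 3, thresh + 7, ...
 */
static bool should_report(uint64_t cnt, uint64_t thresh)
{
	return thresh && cnt >= thresh && is_power_of_2(cnt + 1 - thresh);
}
```

/*
 * With a threshold of 4, violations number 4, 5, 7, 11 and 19 are reported,
 * so the gap between reports doubles each time.
 */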
2010-06-29 12:07:13 +04:00
/**
2019-03-13 19:55:48 +03:00
* wq_worker_running - a worker is running again
2010-06-29 12:07:14 +04:00
* @ task : task waking up
*
2019-03-13 19:55:48 +03:00
* This function is called when a worker returns from schedule ( ) .
2010-06-29 12:07:14 +04:00
*/
2019-03-13 19:55:48 +03:00
void wq_worker_running ( struct task_struct * task )
2010-06-29 12:07:14 +04:00
{
struct worker * worker = kthread_data ( task ) ;
workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()
Currently, pool->nr_running can be modified from the timer tick, which means
the timer tick can run nested inside a section that is not IRQ-protected and is
in the process of modifying nr_running. Consider the following scenario:
CPU0
kworker/0:2 (events)
worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
->pool->nr_running++; (1)
process_one_work()
->worker->current_func(work);
->schedule()
->wq_worker_sleeping()
->worker->sleeping = 1;
->pool->nr_running--; (0)
....
->wq_worker_running()
....
CPU0 by interrupt:
wq_worker_tick()
->worker_set_flags(worker, WORKER_CPU_INTENSIVE);
->pool->nr_running--; (-1)
->worker->flags |= WORKER_CPU_INTENSIVE;
....
->if (!(worker->flags & WORKER_NOT_RUNNING))
->pool->nr_running++; (will not execute)
->worker->sleeping = 0;
....
->worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
->pool->nr_running++; (0)
....
worker_set_flags(worker, WORKER_PREP);
->pool->nr_running--; (-1)
....
worker_enter_idle()
->WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
If nr_workers is equal to nr_idle, the non-zero nr_running
triggers WARN_ON_ONCE():
[ 2.460602] WARNING: CPU: 0 PID: 63 at kernel/workqueue.c:1999 worker_enter_idle+0xb2/0xc0
[ 2.462163] Modules linked in:
[ 2.463401] CPU: 0 PID: 63 Comm: kworker/0:2 Not tainted 6.4.0-rc2-next-20230519 #1
[ 2.463771] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[ 2.465127] Workqueue: 0x0 (events)
[ 2.465678] RIP: 0010:worker_enter_idle+0xb2/0xc0
...
[ 2.472614] Call Trace:
[ 2.473152] <TASK>
[ 2.474182] worker_thread+0x71/0x430
[ 2.474992] ? _raw_spin_unlock_irqrestore+0x28/0x50
[ 2.475263] kthread+0x103/0x120
[ 2.475493] ? __pfx_worker_thread+0x10/0x10
[ 2.476355] ? __pfx_kthread+0x10/0x10
[ 2.476635] ret_from_fork+0x2c/0x50
[ 2.477051] </TASK>
This commit therefore adds a check of worker->sleeping to wq_worker_tick():
if worker->sleeping is not zero, it returns directly.
tj: Updated comment and description.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Closes: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230519/testrun/17078554/suite/boot/test/clang-nightly-lkftconfig/log
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-24 06:53:39 +03:00
if ( ! READ_ONCE ( worker - > sleeping ) )
2019-03-13 19:55:48 +03:00
return ;
workqueue: Fix unbind_workers() VS wq_worker_running() race
At CPU-hotplug time, unbind_workers() may preempt a worker while it is
waking up. In that case the following scenario can happen:
unbind_workers()                      wq_worker_running()
----------------                      -------------------
                                      if (!(worker->flags & WORKER_NOT_RUNNING))
                                      //PREEMPTED by unbind_workers
worker->flags |= WORKER_UNBOUND;
[...]
atomic_set(&pool->nr_running, 0);
//resume to worker
                                      atomic_inc(&worker->pool->nr_running);
After unbind_workers() resets pool->nr_running, the value is expected to
remain 0 until the pool gets rebound, in case cpu_up() is called on
the target CPU in the future. But here the race leaves pool->nr_running
with a value of 1, triggering the following warning when the worker goes
idle:
WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
Modules linked in:
CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: 0x0 (rcu_par_gp)
RIP: 0010:worker_enter_idle+0x95/0xc0
Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0
RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
FS: 0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
worker_thread+0x89/0x3d0
? process_one_work+0x400/0x400
kthread+0x162/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Also, due to this incorrect "nr_running == 1", further queued work may
end up not being served, because no worker is awakened at work insert time.
This raises rcutorture writer stalls, for example.
Fix this by disabling preemption in the right place in
wq_worker_running().
It's worth noting that if the worker migrates and runs concurrently with
unbind_workers(), it is guaranteed to see the WORKER_UNBOUND flag update
due to set_cpus_allowed_ptr() acquiring/releasing rq->lock.
Fixes: 6d25be5782e4 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-12-01 18:19:44 +03:00
/*
* If preempted by unbind_workers ( ) between the WORKER_NOT_RUNNING check
* and the nr_running increment below , we may ruin the nr_running reset
* and leave with an unexpected pool - > nr_running = = 1 on the newly unbound
* pool . Protect against such race .
*/
preempt_disable ( ) ;
2019-03-13 19:55:48 +03:00
if ( ! ( worker - > flags & WORKER_NOT_RUNNING ) )
2021-12-23 15:31:40 +03:00
worker - > pool - > nr_running + + ;
preempt_enable ( ) ;
2023-05-18 06:02:08 +03:00
/*
* CPU intensive auto - detection cares about how long a work item hogged
* CPU without sleeping . Reset the starting timestamp on wakeup .
*/
worker - > current_at = worker - > task - > se . sum_exec_runtime ;
workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()
Currently, pool->nr_running can be modified from the timer tick, which means
the timer tick can run nested inside a not-irq-protected section that's in
the process of modifying nr_running. Consider the following scenario:
CPU0
kworker/0:2 (events)
worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
->pool->nr_running++; (1)
process_one_work()
->worker->current_func(work);
->schedule()
->wq_worker_sleeping()
->worker->sleeping = 1;
->pool->nr_running--; (0)
....
->wq_worker_running()
....
CPU0 by interrupt:
wq_worker_tick()
->worker_set_flags(worker, WORKER_CPU_INTENSIVE);
->pool->nr_running--; (-1)
->worker->flags |= WORKER_CPU_INTENSIVE;
....
->if (!(worker->flags & WORKER_NOT_RUNNING))
->pool->nr_running++; (will not execute)
->worker->sleeping = 0;
....
->worker_clr_flags(worker, WORKER_CPU_INTENSIVE);
->pool->nr_running++; (0)
....
worker_set_flags(worker, WORKER_PREP);
->pool->nr_running--; (-1)
....
worker_enter_idle()
->WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
If nr_workers is equal to nr_idle at this point, the non-zero nr_running
triggers the WARN_ON_ONCE().
[ 2.460602] WARNING: CPU: 0 PID: 63 at kernel/workqueue.c:1999 worker_enter_idle+0xb2/0xc0
[ 2.462163] Modules linked in:
[ 2.463401] CPU: 0 PID: 63 Comm: kworker/0:2 Not tainted 6.4.0-rc2-next-20230519 #1
[ 2.463771] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[ 2.465127] Workqueue: 0x0 (events)
[ 2.465678] RIP: 0010:worker_enter_idle+0xb2/0xc0
...
[ 2.472614] Call Trace:
[ 2.473152] <TASK>
[ 2.474182] worker_thread+0x71/0x430
[ 2.474992] ? _raw_spin_unlock_irqrestore+0x28/0x50
[ 2.475263] kthread+0x103/0x120
[ 2.475493] ? __pfx_worker_thread+0x10/0x10
[ 2.476355] ? __pfx_kthread+0x10/0x10
[ 2.476635] ret_from_fork+0x2c/0x50
[ 2.477051] </TASK>
This commit therefore adds a check of worker->sleeping in wq_worker_tick()
and returns directly if worker->sleeping is not zero.
tj: Updated comment and description.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Closes: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20230519/testrun/17078554/suite/boot/test/clang-nightly-lkftconfig/log
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-05-24 06:53:39 +03:00
WRITE_ONCE ( worker - > sleeping , 0 ) ;
2010-06-29 12:07:14 +04:00
}
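The nr_running scenario described in the commit message above can be replayed with a small userspace model. This is only a sketch under stated assumptions: the `model_*` and `m_*` names and the `M_*` flag values are hypothetical stand-ins for the kernel's worker flags, `worker->sleeping` and `pool->nr_running`, and no locking or preemption is modelled. It aims to show why the tick must skip a worker that is already marked as sleeping.

```c
#include <stdbool.h>

/* Hypothetical flag bits; only the "not running" grouping matters here. */
enum {
	M_PREP          = 1 << 0,
	M_CPU_INTENSIVE = 1 << 1,
	M_NOT_RUNNING   = M_PREP | M_CPU_INTENSIVE,
};

/* Hypothetical model: one worker and the counter of its pool. */
struct model_worker {
	unsigned int flags;
	bool sleeping;
	int nr_running;
};

/* Setting the first NOT_RUNNING flag drops the worker from nr_running. */
static void m_set_flags(struct model_worker *w, unsigned int f)
{
	if ((f & M_NOT_RUNNING) && !(w->flags & M_NOT_RUNNING))
		w->nr_running--;
	w->flags |= f;
}

/* Clearing the last NOT_RUNNING flag adds the worker back. */
static void m_clr_flags(struct model_worker *w, unsigned int f)
{
	unsigned int old = w->flags;

	w->flags &= ~f;
	if ((f & M_NOT_RUNNING) && (old & M_NOT_RUNNING) &&
	    !(w->flags & M_NOT_RUNNING))
		w->nr_running++;
}

/* Models the sleep path: mark sleeping, then decrement. */
static void m_sleeping(struct model_worker *w)
{
	if ((w->flags & M_NOT_RUNNING) || w->sleeping)
		return;
	w->sleeping = true;
	w->nr_running--;
}

/* Models the wakeup path: increment only if still counted as running. */
static void m_running(struct model_worker *w)
{
	if (!w->sleeping)
		return;
	if (!(w->flags & M_NOT_RUNNING))
		w->nr_running++;
	w->sleeping = false;
}

/* Models the tick; check_sleeping selects the fixed behaviour. */
static void m_tick(struct model_worker *w, bool check_sleeping)
{
	if (w->flags & M_NOT_RUNNING)
		return;
	if (check_sleeping && w->sleeping)
		return;
	m_set_flags(w, M_CPU_INTENSIVE);
}

/* Replays the sequence from the commit message; returns final nr_running. */
int model_scenario(bool check_sleeping)
{
	struct model_worker w = { .flags = M_PREP, .sleeping = false, .nr_running = 0 };

	m_clr_flags(&w, M_PREP);          /* worker starts a work item */
	m_sleeping(&w);                   /* work item calls schedule() */
	m_tick(&w, check_sleeping);       /* tick nests before the wakeup */
	m_running(&w);                    /* worker is back on the CPU */
	m_clr_flags(&w, M_CPU_INTENSIVE); /* work item finished */
	m_set_flags(&w, M_PREP);          /* worker heads towards idle */
	return w.nr_running;
}
```

In this model, running the sequence without the sleeping check leaves the counter unbalanced, while the checked variant returns it to its starting value.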
/**
* wq_worker_sleeping - a worker is going to sleep
* @ task : task going to sleep
*
2019-03-13 19:55:48 +03:00
* This function is called from schedule ( ) when a busy worker is
2021-12-07 10:35:37 +03:00
* going to sleep .
2010-06-29 12:07:14 +04:00
*/
2019-03-13 19:55:48 +03:00
void wq_worker_sleeping ( struct task_struct * task )
2010-06-29 12:07:14 +04:00
{
2021-12-23 15:31:39 +03:00
struct worker * worker = kthread_data ( task ) ;
2013-01-18 05:16:24 +04:00
struct worker_pool * pool ;
2010-06-29 12:07:14 +04:00
2013-01-18 05:16:24 +04:00
/*
* Rescuers , which may not have all the fields set up like normal
* workers , also reach here , let ' s not access anything before
* checking NOT_RUNNING .
*/
workqueue: It is likely that WORKER_NOT_RUNNING is true
Running the annotate branch profiler on three boxes, including my
main box that runs firefox, evolution, xchat, and is part of the distcc farm,
showed this with the likelys in the workqueue code:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
96 996253 99 wq_worker_sleeping workqueue.c 703
96 996247 99 wq_worker_waking_up workqueue.c 677
The likely()s in this case were assuming that WORKER_NOT_RUNNING will
most likely be false. But this is not the case. The reason is
(and shown by adding trace_printks and testing it) that most of the time
WORKER_PREP is set.
In worker_thread() we have:
worker_clr_flags(worker, WORKER_PREP);
[ do work stuff ]
worker_set_flags(worker, WORKER_PREP, false);
(that 'false' means not to wake up an idle worker)
wq_worker_sleeping() is called from schedule() when a worker thread
is putting itself to sleep, which happens most of the time outside
of that [ do work stuff ].
wq_worker_waking_up() is called by the wakeup worker code, which
is also called outside that [ do work stuff ].
Thus, the likely and unlikely used by those two functions are actually
backwards.
Remove the annotation and let gcc figure it out.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-04 07:12:33 +03:00
if ( worker - > flags & WORKER_NOT_RUNNING )
2019-03-13 19:55:48 +03:00
return ;
2010-06-29 12:07:14 +04:00
2013-01-18 05:16:24 +04:00
pool = worker - > pool ;
2020-03-28 02:29:59 +03:00
/* Return if preempted before wq_worker_running() was reached */
2023-05-24 06:53:39 +03:00
if ( READ_ONCE ( worker - > sleeping ) )
2019-03-13 19:55:48 +03:00
return ;
2023-05-24 06:53:39 +03:00
WRITE_ONCE ( worker - > sleeping , 1 ) ;
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
2010-06-29 12:07:14 +04:00
2021-12-01 18:19:45 +03:00
/*
* Recheck in case unbind_workers ( ) preempted us . We don ' t
* want to decrement nr_running after the worker is unbound
* and nr_running has been reset .
*/
if ( worker - > flags & WORKER_NOT_RUNNING ) {
raw_spin_unlock_irq ( & pool - > lock ) ;
return ;
}
2021-12-23 15:31:40 +03:00
pool - > nr_running - - ;
2023-08-08 04:57:25 +03:00
if ( kick_pool ( pool ) )
2023-05-18 06:02:08 +03:00
worker - > current_pwq - > stats [ PWQ_STAT_CM_WAKEUP ] + + ;
2023-08-08 04:57:25 +03:00
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-06-29 12:07:14 +04:00
}
2023-05-18 06:02:08 +03:00
/**
* wq_worker_tick - a scheduler tick occurred while a kworker is running
* @ task : task currently running
*
2024-03-08 14:18:08 +03:00
* Called from sched_tick ( ) . We ' re in the IRQ context and the current
2023-05-18 06:02:08 +03:00
* worker ' s fields which follow the ' K ' locking rule can be accessed safely .
*/
void wq_worker_tick ( struct task_struct * task )
{
struct worker * worker = kthread_data ( task ) ;
struct pool_workqueue * pwq = worker - > current_pwq ;
struct worker_pool * pool = worker - > pool ;
if ( ! pwq )
return ;
2023-05-18 06:02:09 +03:00
pwq - > stats [ PWQ_STAT_CPU_TIME ] + = TICK_USEC ;
2023-05-25 07:00:38 +03:00
if ( ! wq_cpu_intensive_thresh_us )
return ;
2023-05-18 06:02:08 +03:00
/*
* If the current worker is concurrency managed and hogged the CPU for
* longer than wq_cpu_intensive_thresh_us , it ' s automatically marked
* CPU_INTENSIVE to avoid stalling other concurrency - managed work items .
2023-05-24 06:53:39 +03:00
*
* Set @ worker - > sleeping means that @ worker is in the process of
* switching out voluntarily and won ' t be contributing to
* @ pool - > nr_running until it wakes up . As wq_worker_sleeping ( ) also
* decrements - > nr_running , setting CPU_INTENSIVE here can lead to
* double decrements . The task is releasing the CPU anyway . Let ' s skip .
* We probably want to make this prettier in the future .
2023-05-18 06:02:08 +03:00
*/
2023-05-24 06:53:39 +03:00
if ( ( worker - > flags & WORKER_NOT_RUNNING ) | | READ_ONCE ( worker - > sleeping ) | |
2023-05-18 06:02:08 +03:00
worker - > task - > se . sum_exec_runtime - worker - > current_at <
wq_cpu_intensive_thresh_us * NSEC_PER_USEC )
return ;
raw_spin_lock ( & pool - > lock ) ;
worker_set_flags ( worker , WORKER_CPU_INTENSIVE ) ;
2023-05-18 06:02:08 +03:00
wq_cpu_intensive_report ( worker - > current_func ) ;
2023-05-18 06:02:08 +03:00
pwq - > stats [ PWQ_STAT_CPU_INTENSIVE ] + + ;
2023-08-08 04:57:25 +03:00
if ( kick_pool ( pool ) )
2023-05-18 06:02:08 +03:00
pwq - > stats [ PWQ_STAT_CM_WAKEUP ] + + ;
raw_spin_unlock ( & pool - > lock ) ;
}
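The threshold comparison in wq_worker_tick() mixes units: the runtime counters are in nanoseconds while the threshold is in microseconds. A minimal sketch of that comparison, assuming a hypothetical helper name and a local `MODEL_NSEC_PER_USEC` constant rather than the kernel's own definitions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the kernel's NSEC_PER_USEC. */
#define MODEL_NSEC_PER_USEC 1000ULL

/*
 * Hypothetical helper: returns true when the runtime accumulated since
 * current_at_ns has reached the threshold. A threshold of zero models
 * auto-detection being disabled.
 */
bool model_exceeds_thresh(uint64_t sum_exec_runtime_ns,
			  uint64_t current_at_ns,
			  unsigned long thresh_us)
{
	if (!thresh_us)
		return false;
	return sum_exec_runtime_ns - current_at_ns >=
	       (uint64_t)thresh_us * MODEL_NSEC_PER_USEC;
}
```

Because the starting timestamp is reset on wakeup, only CPU time consumed without sleeping counts towards the threshold.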
psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.
Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again. This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)
Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep. To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.
What if the worker is also executing other items before or after?
Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.
If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation. If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-02 01:20:42 +03:00
/**
* wq_worker_last_func - retrieve worker ' s last work function
2019-03-19 20:45:09 +03:00
* @ task : Task to retrieve last work function of .
2019-02-02 01:20:42 +03:00
*
* Determine the last function a worker executed . This is called from
* the scheduler to get a worker ' s last known identity .
*
* CONTEXT :
2020-05-27 22:46:33 +03:00
* raw_spin_lock_irq ( rq - > lock )
2019-02-02 01:20:42 +03:00
*
2019-03-08 03:29:30 +03:00
* This function is called during schedule ( ) when a kworker is going
* to sleep . It ' s used by psi to identify aggregation workers during
* dequeuing , to allow periodic aggregation to shut - off when that
* worker is the last task in the system or cgroup to go to sleep .
*
* As this function doesn ' t involve any workqueue - related locking , it
* only returns stable values when called from inside the scheduler ' s
* queuing and dequeuing paths , when @ task , which must be a kworker ,
* is guaranteed to not be processing any works .
*
2019-02-02 01:20:42 +03:00
* Return :
* The last work function % current executed as a worker , NULL if it
* hasn ' t executed any work yet .
*/
work_func_t wq_worker_last_func ( struct task_struct * task )
{
struct worker * worker = kthread_data ( task ) ;
return worker - > last_func ;
}
2024-01-29 21:11:24 +03:00
/**
* wq_node_nr_active - Determine wq_node_nr_active to use
* @ wq : workqueue of interest
* @ node : NUMA node , can be % NUMA_NO_NODE
*
* Determine wq_node_nr_active to use for @ wq on @ node . Returns :
*
* - % NULL for per - cpu workqueues as they don ' t need to use shared nr_active .
*
* - node_nr_active [ nr_node_ids ] if @ node is % NUMA_NO_NODE .
*
* - Otherwise , node_nr_active [ @ node ] .
*/
static struct wq_node_nr_active * wq_node_nr_active ( struct workqueue_struct * wq ,
int node )
{
if ( ! ( wq - > flags & WQ_UNBOUND ) )
return NULL ;
if ( node = = NUMA_NO_NODE )
node = nr_node_ids ;
return wq - > node_nr_active [ node ] ;
}
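The lookup above relies on the array holding one extra slot past the last node ID, which serves requests that carry no node. A userspace sketch of that layout, assuming hypothetical `model_*` types and a `MODEL_NO_NODE` constant in place of the kernel's structures and NUMA_NO_NODE:

```c
#include <stddef.h>

#define MODEL_NO_NODE (-1)	/* stand-in for NUMA_NO_NODE */

/* Hypothetical per-node counter record. */
struct model_nna {
	int max;
	int nr;
};

/* Hypothetical workqueue: the table has nr_node_ids + 1 entries. */
struct model_wq {
	int unbound;
	int nr_node_ids;
	struct model_nna **node_nr_active;
};

/* Mirrors the selection rules listed in the comment above. */
struct model_nna *model_node_nr_active(struct model_wq *wq, int node)
{
	if (!wq->unbound)
		return NULL;		/* per-cpu: no shared counter */
	if (node == MODEL_NO_NODE)
		node = wq->nr_node_ids;	/* extra slot after the last node */
	return wq->node_nr_active[node];
}
```

With two nodes, for example, indices 0 and 1 select the per-node records and index 2 is the fallback record.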
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't
want to set max_active too low but as soon as we increase max_active a bit,
we can end up with an unreasonable number of in-flight work items when many
CPUs issue IOs at the same time, i.e. the acceptable lowest max_active is
higher than the acceptable highest max_active.
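To make the conflict concrete, here is a minimal userspace sketch in C. The
helper names and the numbers in the note below are illustrative assumptions,
not values taken from the report linked above. It only checks whether any
per-cpu max_active can satisfy both constraints at once.

```c
#include <stdbool.h>

/*
 * Highest per-cpu limit that still keeps the system-wide total at or
 * below system_ceiling when every CPU is saturated at the same time.
 */
static int highest_acceptable(int system_ceiling, int nr_cpus)
{
	return system_ceiling / nr_cpus;
}

/*
 * A usable per-cpu limit exists only if the amount a single busy CPU
 * needs (the lowest acceptable value) does not exceed the highest
 * acceptable value computed above.
 */
static bool per_cpu_limit_exists(int single_cpu_need, int system_ceiling,
				 int nr_cpus)
{
	return single_cpu_need <= highest_acceptable(system_ceiling, nr_cpus);
}
```

With a single CPU needing up to 32 in-flight items and a desired system-wide
ceiling of 256, a 4-CPU machine can use any per-cpu value from 32 to 64, but
on a 64-CPU machine the highest acceptable value is 4, so no per-cpu setting
works.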
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice as many
online effective CPUs will get twice the portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
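The split described in the first two bullets can be sketched in userspace C
as follows. This is a simplified model under stated assumptions:
split_max_active(), div_round_up() and clamp_int() are hypothetical stand-ins
written for this example, and the per-node CPU counts are passed in directly
rather than derived from cpumasks. The arithmetic mirrors the clamp
expression used by wq_update_node_max_active() in this file.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's DIV_ROUND_UP(). */
static int div_round_up(int n, int d)
{
	return (n + d - 1) / d;
}

/* Hypothetical stand-in for the kernel's clamp(). */
static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Give each node a share of max_active proportional to its CPU count,
 * but never less than min_active and never more than max_active.
 * Stores the per-node limits in node_max[] and returns their sum.
 */
static int split_max_active(int max_active, int min_active,
			    const int *node_cpus, size_t nr_nodes,
			    int *node_max)
{
	int total_cpus = 0, sum = 0;
	size_t i;

	for (i = 0; i < nr_nodes; i++)
		total_cpus += node_cpus[i];

	for (i = 0; i < nr_nodes; i++) {
		if (!total_cpus)
			/* no CPUs at all: fall back to the minimum */
			node_max[i] = min_active;
		else
			node_max[i] = clamp_int(
				div_round_up(max_active * node_cpus[i],
					     total_cpus),
				min_active, max_active);
		sum += node_max[i];
	}
	return sum;
}
```

With max_active = 16, min_active = 8 and two nodes of 12 and 4 CPUs, the
per-node limits come out as 12 and 8. The sum is 20, which is the "higher
effective max_active than configured" case mentioned above.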
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
/**
* wq_update_node_max_active - Update per - node max_actives to use
* @ wq : workqueue to update
* @ off_cpu : CPU that ' s going down , - 1 if a CPU is not going down
*
* Update @ wq - > node_nr_active [ ] - > max . @ wq must be unbound . max_active is
* distributed among nodes according to the proportions of numbers of online
* cpus . The result is always between @ wq - > min_active and max_active .
*/
static void wq_update_node_max_active ( struct workqueue_struct * wq , int off_cpu )
{
struct cpumask * effective = unbound_effective_cpumask ( wq ) ;
int min_active = READ_ONCE ( wq - > min_active ) ;
int max_active = READ_ONCE ( wq - > max_active ) ;
int total_cpus , node ;
lockdep_assert_held ( & wq - > mutex ) ;
workqueue: Avoid premature init of wq->node_nr_active[].max
System workqueues are allocated early during boot from
workqueue_init_early(). While allocating unbound workqueues,
wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
accesses NUMA topology to initialize wq->node_nr_active[].max.
However, topology information may not be set up at this point.
wq_update_node_max_active() is explicitly invoked from
workqueue_init_topology() later when topology information is known to be
available.
This doesn't seem to crash anything but it's doing useless work with dubious
data. Let's skip the premature and duplicate node_max_active updates by
initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
wq_update_node_max_active() noop until workqueue_init_topology().
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/workqueue.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9221a4c57ae1..a65081ec6780 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -386,6 +386,8 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
[WQ_AFFN_SYSTEM] = "system",
};
+static bool wq_topo_initialized = false;
+
/*
* Per-cpu work items which run for longer than the following threshold are
* automatically considered CPU intensive and excluded from concurrency
@@ -1510,6 +1512,9 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)
lockdep_assert_held(&wq->mutex);
+ if (!wq_topo_initialized)
+ return;
+
if (!cpumask_test_cpu(off_cpu, effective))
off_cpu = -1;
@@ -4356,6 +4361,7 @@ static void free_node_nr_active(struct wq_node_nr_active **nna_ar)
static void init_node_nr_active(struct wq_node_nr_active *nna)
{
+ nna->max = WQ_DFL_MIN_ACTIVE;
atomic_set(&nna->nr, 0);
raw_spin_lock_init(&nna->lock);
INIT_LIST_HEAD(&nna->pending_pwqs);
@@ -7400,6 +7406,8 @@ void __init workqueue_init_topology(void)
init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
+ wq_topo_initialized = true;
+
mutex_lock(&wq_pool_mutex);
/*
2024-01-31 08:06:43 +03:00
if ( ! wq_topo_initialized )
return ;
2024-01-31 07:55:55 +03:00
if ( off_cpu > = 0 & & ! cpumask_test_cpu ( off_cpu , effective ) )
2024-01-29 21:11:25 +03:00
off_cpu = - 1 ;
total_cpus = cpumask_weight_and ( effective , cpu_online_mask ) ;
if ( off_cpu > = 0 )
total_cpus - - ;
2024-04-24 16:51:54 +03:00
/* If all CPUs of the wq get offline, use the default values */
if ( unlikely ( ! total_cpus ) ) {
for_each_node ( node )
wq_node_nr_active ( wq , node ) - > max = min_active ;
wq_node_nr_active ( wq , NUMA_NO_NODE ) - > max = max_active ;
return ;
}
2024-01-29 21:11:25 +03:00
for_each_node ( node ) {
int node_cpus ;
node_cpus = cpumask_weight_and ( effective , cpumask_of_node ( node ) ) ;
if ( off_cpu > = 0 & & cpu_to_node ( off_cpu ) = = node )
node_cpus - - ;
wq_node_nr_active ( wq , node ) - > max =
clamp ( DIV_ROUND_UP ( max_active * node_cpus , total_cpus ) ,
min_active , max_active ) ;
}
2024-04-23 03:43:48 +03:00
wq_node_nr_active ( wq , NUMA_NO_NODE ) - > max = max_active ;
2024-01-29 21:11:25 +03:00
}
2013-03-12 22:30:04 +04:00
/**
* get_pwq - get an extra reference on the specified pool_workqueue
* @ pwq : pool_workqueue to get
*
* Obtain an extra reference on @ pwq . The caller should guarantee that
* @ pwq has positive refcnt and be holding the matching pool - > lock .
*/
static void get_pwq ( struct pool_workqueue * pwq )
{
lockdep_assert_held ( & pwq - > pool - > lock ) ;
WARN_ON_ONCE ( pwq - > refcnt < = 0 ) ;
pwq - > refcnt + + ;
}
/**
* put_pwq - put a pool_workqueue reference
* @ pwq : pool_workqueue to put
*
* Drop a reference of @ pwq . If its refcnt reaches zero , schedule its
* destruction . The caller should be holding the matching pool - > lock .
*/
static void put_pwq ( struct pool_workqueue * pwq )
{
lockdep_assert_held ( & pwq - > pool - > lock ) ;
if ( likely ( - - pwq - > refcnt ) )
return ;
/*
2023-08-08 04:57:23 +03:00
* @ pwq can ' t be released under pool - > lock , bounce to a dedicated
* kthread_worker to avoid A - A deadlocks .
2013-03-12 22:30:04 +04:00
*/
2023-08-08 04:57:23 +03:00
kthread_queue_work ( pwq_release_worker , & pwq - > release_work ) ;
2013-03-12 22:30:04 +04:00
}
2013-04-01 22:23:35 +04:00
/**
* put_pwq_unlocked - put_pwq ( ) with surrounding pool lock / unlock
* @ pwq : pool_workqueue to put ( can be % NULL )
*
* put_pwq ( ) with locking . This function also allows % NULL @ pwq .
*/
static void put_pwq_unlocked ( struct pool_workqueue * pwq )
{
if ( pwq ) {
/*
2019-03-13 19:55:47 +03:00
* As both pwqs and pools are RCU protected , the
2013-04-01 22:23:35 +04:00
* following lock operations are safe .
*/
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pwq - > pool - > lock ) ;
2013-04-01 22:23:35 +04:00
put_pwq ( pwq ) ;
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pwq - > pool - > lock ) ;
2013-04-01 22:23:35 +04:00
}
}
2024-01-29 21:11:24 +03:00
static bool pwq_is_empty ( struct pool_workqueue * pwq )
{
return ! pwq - > nr_active & & list_empty ( & pwq - > inactive_works ) ;
}
2024-01-29 21:11:24 +03:00
static void __pwq_activate_work ( struct pool_workqueue * pwq ,
struct work_struct * work )
2012-08-03 21:30:46 +04:00
{
2024-01-29 21:11:24 +03:00
unsigned long * wdb = work_data_bits ( work ) ;
WARN_ON_ONCE ( ! ( * wdb & WORK_STRUCT_INACTIVE ) ) ;
2012-08-03 21:30:46 +04:00
trace_workqueue_activate_work ( work ) ;
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements workqueue lockup
detector. It periodically monitors all worker_pools and,
if any pool failed to make forward progress for longer than the threshold
duration, triggers warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 19:28:04 +03:00
if ( list_empty ( & pwq - > pool - > worklist ) )
pwq - > pool - > watchdog_ts = jiffies ;
2013-02-14 07:29:12 +04:00
move_linked_works ( work , & pwq - > pool - > worklist , NULL ) ;
2024-01-29 21:11:24 +03:00
__clear_bit ( WORK_STRUCT_INACTIVE_BIT , wdb ) ;
2024-01-29 21:11:24 +03:00
}
/**
* pwq_activate_work - Activate a work item if inactive
* @ pwq : pool_workqueue @ work belongs to
* @ work : work item to activate
*
* Returns % true if activated . % false if already active .
*/
static bool pwq_activate_work ( struct pool_workqueue * pwq ,
struct work_struct * work )
{
struct worker_pool * pool = pwq - > pool ;
2024-01-29 21:11:24 +03:00
struct wq_node_nr_active * nna ;
2024-01-29 21:11:24 +03:00
lockdep_assert_held ( & pool - > lock ) ;
if ( ! ( * work_data_bits ( work ) & WORK_STRUCT_INACTIVE ) )
return false ;
2024-01-29 21:11:24 +03:00
nna = wq_node_nr_active ( pwq - > wq , pool - > node ) ;
if ( nna )
atomic_inc ( & nna - > nr ) ;
2013-02-14 07:29:12 +04:00
pwq - > nr_active + + ;
2024-01-29 21:11:24 +03:00
__pwq_activate_work ( pwq , work ) ;
return true ;
2012-08-03 21:30:46 +04:00
}
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. i.e. The acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice as many
online effective CPUs will get twice the portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
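The per-node split described in the commit message above (a proportional share of max_active per node, never below min_active) can be sketched in userspace C. This is only an illustration under stated assumptions: node_max_active() is a hypothetical helper, not the kernel's function, and rounding the share up before clamping is an assumption rather than something the message states.

```c
#include <assert.h>

/*
 * Hypothetical helper: share of a workqueue's max_active given to one node.
 * node_cpus  - the workqueue's online effective CPUs on this node
 * total_cpus - the workqueue's online effective CPUs across all nodes
 * The share is proportional to node_cpus / total_cpus (rounded up, which is
 * an assumption) and then clamped to [min_active, max_active].
 */
static int node_max_active(int max_active, int min_active,
			   int node_cpus, int total_cpus)
{
	int share;

	assert(total_cpus > 0 && node_cpus >= 0 && node_cpus <= total_cpus);
	assert(min_active <= max_active);

	/* proportional share, rounded up */
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;

	/* enforce the per-node floor and the workqueue-wide ceiling */
	if (share < min_active)
		share = min_active;
	if (share > max_active)
		share = max_active;
	return share;
}
```

With max_active = 16 and min_active = 8 on a machine whose two nodes hold 12 and 4 of the workqueue's CPUs, this sketch yields 12 and 8. The sum, 20, exceeds the configured 16, which is the "higher effective max_active than configured" case the message warns about.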
2024-01-29 21:11:25 +03:00
static bool tryinc_node_nr_active ( struct wq_node_nr_active * nna )
{
int max = READ_ONCE ( nna - > max ) ;
while ( true ) {
int old , tmp ;
old = atomic_read ( & nna - > nr ) ;
if ( old > = max )
return false ;
tmp = atomic_cmpxchg_relaxed ( & nna - > nr , old , old + 1 ) ;
if ( tmp = = old )
return true ;
}
}
2024-01-29 21:11:24 +03:00
/**
* pwq_tryinc_nr_active - Try to increment nr_active for a pwq
* @ pwq : pool_workqueue of interest
2024-01-29 21:11:25 +03:00
* @ fill : max_active may have increased , try to increase concurrency level
2024-01-29 21:11:24 +03:00
*
* Try to increment nr_active for @ pwq . Returns % true if an nr_active count is
* successfully obtained . % false otherwise .
*/
2024-01-29 21:11:25 +03:00
static bool pwq_tryinc_nr_active ( struct pool_workqueue * pwq , bool fill )
2024-01-29 21:11:24 +03:00
{
struct workqueue_struct * wq = pwq - > wq ;
struct worker_pool * pool = pwq - > pool ;
2024-01-29 21:11:24 +03:00
struct wq_node_nr_active * nna = wq_node_nr_active ( wq , pool - > node ) ;
2024-01-29 21:11:25 +03:00
bool obtained = false ;
2024-01-29 21:11:24 +03:00
lockdep_assert_held ( & pool - > lock ) ;
2024-01-29 21:11:25 +03:00
if ( ! nna ) {
2024-02-05 00:28:06 +03:00
/* BH or per-cpu workqueue, pwq->nr_active is sufficient */
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users that self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time, i.e. the acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice as many
online effective CPUs will get twice the portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
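The per-node split described in the commit message above can be sketched in plain C. This is a minimal userspace illustration, not the kernel's code: the names sketch_node_max_active(), sketch_clamp(), sketch_min() and SKETCH_DFL_MIN_ACTIVE are hypothetical, and rounding the proportional share up is an assumption made here so that small shares do not truncate to zero. Only the proportional split, the clamp to min_active, and the value 8 come from the message.

```c
#include <assert.h>

/* Hypothetical stand-in for WQ_DFL_MIN_ACTIVE, which the message says is 8. */
#define SKETCH_DFL_MIN_ACTIVE 8

static int sketch_min(int a, int b)
{
	return a < b ? a : b;
}

static int sketch_clamp(int v, int lo, int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Per-node limit for a node that holds node_cpus of the workqueue's
 * total_cpus online effective CPUs.
 */
static int sketch_node_max_active(int max_active, int node_cpus, int total_cpus)
{
	int min_active = sketch_min(max_active, SKETCH_DFL_MIN_ACTIVE);
	int share;

	assert(max_active > 0 && total_cpus > 0);
	assert(node_cpus >= 0 && node_cpus <= total_cpus);

	/* Proportional share, rounded up (an assumption of this sketch). */
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;

	/* Never go below min_active nor above the configured max_active. */
	return sketch_clamp(share, min_active, max_active);
}
```

With max_active = 16 on a two-node machine with 4 and 8 effective CPUs, the sketch yields 8 and 11, which sum to 19. That is the "higher effective max_active than configured" case the message mentions, caused by the min_active floor.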
		obtained = pwq->nr_active < READ_ONCE(wq->max_active);
		goto out;
	}

	if (unlikely(pwq->plugged))
		return false;

	/*
	 * Unbound workqueue uses per-node shared nr_active $nna. If @pwq is
	 * already waiting on $nna, pwq_dec_nr_active() will maintain the
	 * concurrency level. Don't jump the line.
	 *
	 * We need to ignore the pending test after max_active has increased as
	 * pwq_dec_nr_active() can only maintain the concurrency level but not
	 * increase it. This is indicated by @fill.
	 */
	if (!list_empty(&pwq->pending_node) && likely(!fill))
		goto out;

	obtained = tryinc_node_nr_active(nna);
	if (obtained)
		goto out;

	/*
	 * Lockless acquisition failed. Lock, add ourself to $nna->pending_pwqs
	 * and try again. The smp_mb() is paired with the implied memory barrier
	 * of atomic_dec_return() in pwq_dec_nr_active() to ensure that either
	 * we see the decremented $nna->nr or they see non-empty
	 * $nna->pending_pwqs.
	 */
	raw_spin_lock(&nna->lock);

	if (list_empty(&pwq->pending_node))
		list_add_tail(&pwq->pending_node, &nna->pending_pwqs);
	else if (likely(!fill))
		goto out_unlock;

	smp_mb();

	obtained = tryinc_node_nr_active(nna);

	/*
	 * If @fill, @pwq might have already been pending. Being spuriously
	 * pending in cold paths doesn't affect anything. Let's leave it be.
	 */
	if (obtained && likely(!fill))
		list_del_init(&pwq->pending_node);

out_unlock:
	raw_spin_unlock(&nna->lock);
out:
	if (obtained)
		pwq->nr_active++;
	return obtained;
}
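
The lockless fast path used by the function above (take a slot only while the per-node count is below its limit) can be illustrated in userspace with C11 atomics. This is a hedged sketch, not the kernel's tryinc_node_nr_active(): struct sketch_node_nr_active, sketch_tryinc() and sketch_dec() are hypothetical names, relaxed ordering on the increment is an assumption, and the locked slow path with pending_pwqs and smp_mb() is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a per-node nr_active counter. */
struct sketch_node_nr_active {
	int max;	/* per-node concurrency limit */
	atomic_int nr;	/* number of currently active work items */
};

/* Try to take one slot; returns true only if nr was below max. */
static bool sketch_tryinc(struct sketch_node_nr_active *nna)
{
	int old = atomic_load_explicit(&nna->nr, memory_order_relaxed);

	while (old < nna->max) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(&nna->nr, &old, old + 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

/* Release one slot; the default seq_cst ordering acts as a full barrier. */
static void sketch_dec(struct sketch_node_nr_active *nna)
{
	int prev = atomic_fetch_sub(&nna->nr, 1);

	assert(prev > 0);
	(void)prev;
}
```

The compare-and-exchange loop never pushes the counter past max even when several threads race, which is why a caller only needs the lock once this fast path fails.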

/**
 * pwq_activate_first_inactive - Activate the first inactive work item on a pwq
 * @pwq: pool_workqueue of interest
 * @fill: max_active may have increased, try to increase concurrency level
 *
 * Activate the first inactive work item of @pwq if available and allowed by
 * max_active limit.
 *
 * Returns %true if an inactive work item has been activated. %false if no
 * inactive work item is found or max_active limit is reached.
 */
static bool pwq_activate_first_inactive ( struct pool_workqueue * pwq , bool fill )
2024-01-29 21:11:24 +03:00
{
struct work_struct * work =
list_first_entry_or_null ( & pwq - > inactive_works ,
struct work_struct , entry ) ;
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't
want to set max_active too low, but as soon as we increase max_active a bit,
we can end up with an unreasonable number of in-flight work items when many
CPUs issue IOs at the same time, i.e. the lowest acceptable max_active is
higher than the highest acceptable max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement is decoupled from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we can no longer avoid decoupling max_active enforcement from
pool boundaries. This patch implements a system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node, e.g. a node with twice as many
online effective CPUs will get twice the portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
if ( work & & pwq_tryinc_nr_active ( pwq , fill ) ) {
2024-01-29 21:11:24 +03:00
__pwq_activate_work ( pwq , work ) ;
return true ;
} else {
return false ;
}
}
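The per-node split described in the commit message above can be illustrated with a small userspace sketch. This is not the kernel's implementation: the `sketch_*` names are hypothetical, rounding the proportional share up is an assumption, and only the value 8 for the default min_active and the "smaller of max_active and 8" rule come from the commit message.

```c
#include <assert.h>

/* Mirrors the value the commit message gives for WQ_DFL_MIN_ACTIVE. */
#define SKETCH_DFL_MIN_ACTIVE 8

/* min_active is the smaller of max_active and the default minimum. */
static int sketch_min_active(int max_active)
{
	return max_active < SKETCH_DFL_MIN_ACTIVE ?
		max_active : SKETCH_DFL_MIN_ACTIVE;
}

/*
 * Hypothetical helper: one node's portion of max_active, proportional to
 * the node's share of the workqueue's online effective CPUs, then clamped
 * to [min_active, max_active]. Rounding up is an assumption of this sketch.
 */
static int sketch_node_max_active(int max_active, int node_cpus,
				  int total_cpus)
{
	int min_active = sketch_min_active(max_active);
	int share;

	assert(total_cpus > 0 && node_cpus >= 0 && node_cpus <= total_cpus);

	/* proportional share, rounded up */
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;

	if (share < min_active)
		share = min_active;
	if (share > max_active)
		share = max_active;
	return share;
}
```

With max_active set to 16 on a machine whose two nodes have 4 and 12 online effective CPUs, the sketch gives the nodes 8 and 12. The sum exceeds the configured 16, which is the "higher effective max_active than configured" trade-off the commit message accepts in exchange for a per-node floor.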
2024-02-08 22:12:20 +03:00
/**
2024-02-09 20:06:11 +03:00
* unplug_oldest_pwq - unplug the oldest pool_workqueue
* @ wq : workqueue_struct where its oldest pwq is to be unplugged
2024-02-08 22:12:20 +03:00
*
2024-02-09 20:06:11 +03:00
* This function should only be called for ordered workqueues where only the
* oldest pwq is unplugged , the others are plugged to suspend execution to
* ensure proper work item ordering : :
2024-02-08 22:12:20 +03:00
*
* dfl_pwq - - - - - - - - - - - - - - + [ P ] - plugged
* |
* v
* pwqs - > A - > B [ P ] - > C [ P ] ( newest )
* | | |
* 1 3 5
* | | |
* 2 4 6
2024-02-09 20:06:11 +03:00
*
* When the oldest pwq is drained and removed , this function should be called
* to unplug the next oldest one to start its work item execution . Note that
* pwq ' s are linked into wq - > pwqs with the oldest first , so the first one in
* the list is the oldest .
2024-02-08 22:12:20 +03:00
*/
static void unplug_oldest_pwq ( struct workqueue_struct * wq )
{
struct pool_workqueue * pwq ;
lockdep_assert_held ( & wq - > mutex ) ;
/* Caller should make sure that pwqs isn't empty before calling */
pwq = list_first_entry_or_null ( & wq - > pwqs , struct pool_workqueue ,
pwqs_node ) ;
raw_spin_lock_irq ( & pwq - > pool - > lock ) ;
if ( pwq - > plugged ) {
pwq - > plugged = false ;
if ( pwq_activate_first_inactive ( pwq , true ) )
kick_pool ( pwq - > pool ) ;
}
raw_spin_unlock_irq ( & pwq - > pool - > lock ) ;
}
2024-01-29 21:11:25 +03:00
/**
* node_activate_pending_pwq - Activate a pending pwq on a wq_node_nr_active
* @ nna : wq_node_nr_active to activate a pending pwq for
* @ caller_pool : worker_pool the caller is locking
*
* Activate a pwq in @ nna - > pending_pwqs . Called with @ caller_pool locked .
* @ caller_pool may be unlocked and relocked to lock other worker_pools .
*/
static void node_activate_pending_pwq ( struct wq_node_nr_active * nna ,
struct worker_pool * caller_pool )
{
struct worker_pool * locked_pool = caller_pool ;
struct pool_workqueue * pwq ;
struct work_struct * work ;
lockdep_assert_held ( & caller_pool - > lock ) ;
raw_spin_lock ( & nna - > lock ) ;
retry :
pwq = list_first_entry_or_null ( & nna - > pending_pwqs ,
struct pool_workqueue , pending_node ) ;
if ( ! pwq )
goto out_unlock ;
/*
* If @ pwq is for a different pool than @ locked_pool , we need to lock
* @ pwq - > pool - > lock . Let ' s trylock first . If unsuccessful , do the unlock
* / lock dance . For that , we also need to release @ nna - > lock as it ' s
* nested inside pool locks .
*/
if ( pwq - > pool ! = locked_pool ) {
raw_spin_unlock ( & locked_pool - > lock ) ;
locked_pool = pwq - > pool ;
if ( ! raw_spin_trylock ( & locked_pool - > lock ) ) {
raw_spin_unlock ( & nna - > lock ) ;
raw_spin_lock ( & locked_pool - > lock ) ;
raw_spin_lock ( & nna - > lock ) ;
goto retry ;
}
}
/*
* $ pwq may not have any inactive work items due to e . g . cancellations .
* Drop it from pending_pwqs and see if there ' s another one .
*/
work = list_first_entry_or_null ( & pwq - > inactive_works ,
struct work_struct , entry ) ;
if ( ! work ) {
list_del_init ( & pwq - > pending_node ) ;
goto retry ;
}
/*
* Acquire an nr_active count and activate the inactive work item . If
* $ pwq still has inactive work items , rotate it to the end of the
* pending_pwqs so that we round - robin through them . This means that
* inactive work items are not activated in queueing order which is fine
* given that there has never been any ordering across different pwqs .
*/
if ( likely ( tryinc_node_nr_active ( nna ) ) ) {
pwq - > nr_active + + ;
__pwq_activate_work ( pwq , work ) ;
if ( list_empty ( & pwq - > inactive_works ) )
list_del_init ( & pwq - > pending_node ) ;
else
list_move_tail ( & pwq - > pending_node , & nna - > pending_pwqs ) ;
/* if activating a foreign pool, make sure it's running */
if ( pwq - > pool ! = caller_pool )
kick_pool ( pwq - > pool ) ;
}
out_unlock :
raw_spin_unlock ( & nna - > lock ) ;
if ( locked_pool ! = caller_pool ) {
raw_spin_unlock ( & locked_pool - > lock ) ;
raw_spin_lock ( & caller_pool - > lock ) ;
}
}
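The round-robin policy of node_activate_pending_pwq() can be seen in isolation in the following userspace sketch. It is a simplified model, not kernel code: all `sketch_*` names are hypothetical, the pending list is a small fixed array instead of a list_head, work items are reduced to counters, and all locking (pool locks, nna->lock, the trylock dance) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_PENDING 8

/* Stand-in for a pool_workqueue: only the counters the policy needs. */
struct sketch_pwq {
	int inactive;	/* queued but not yet activated work items */
	int active;	/* activated work items */
};

/* Stand-in for wq_node_nr_active: a budget plus a FIFO of pending pwqs. */
struct sketch_node {
	int nr_active;
	int max;
	struct sketch_pwq *pending[SKETCH_MAX_PENDING];
	size_t nr_pending;
};

static void sketch_push_back(struct sketch_node *nna, struct sketch_pwq *pwq)
{
	assert(nna->nr_pending < SKETCH_MAX_PENDING);
	nna->pending[nna->nr_pending++] = pwq;
}

static void sketch_pop_front(struct sketch_node *nna)
{
	size_t i;

	assert(nna->nr_pending > 0);
	for (i = 1; i < nna->nr_pending; i++)
		nna->pending[i - 1] = nna->pending[i];
	nna->nr_pending--;
}

/*
 * Activate one work item of the first pending pwq that still has inactive
 * work. Returns the pwq that was activated, or NULL if nothing is pending
 * or the node is already at its limit.
 */
static struct sketch_pwq *sketch_activate_pending(struct sketch_node *nna)
{
	struct sketch_pwq *pwq;

	while (nna->nr_pending > 0) {
		pwq = nna->pending[0];

		/* e.g. cancellations may have emptied it: drop and move on */
		if (pwq->inactive == 0) {
			sketch_pop_front(nna);
			continue;
		}

		/* node budget exhausted: leave the pwq pending */
		if (nna->nr_active >= nna->max)
			return NULL;

		nna->nr_active++;
		pwq->active++;
		pwq->inactive--;

		/* rotate to the tail if more work remains, else drop */
		sketch_pop_front(nna);
		if (pwq->inactive > 0)
			sketch_push_back(nna, pwq);
		return pwq;
	}
	return NULL;
}
```

Rotating a pwq to the tail after each activation means successive completions spread the node's budget across all pending pwqs, at the cost of not activating work items in queueing order, which the comment in the function above notes was never guaranteed across pwqs.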
2024-01-29 21:11:24 +03:00
/**
* pwq_dec_nr_active - Retire an active count
* @ pwq : pool_workqueue of interest
*
* Decrement @ pwq ' s nr_active and try to activate the first inactive work item .
2024-01-29 21:11:25 +03:00
* For unbound workqueues , this function may temporarily drop @ pwq - > pool - > lock .
2024-01-29 21:11:24 +03:00
*/
static void pwq_dec_nr_active ( struct pool_workqueue * pwq )
workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
Currently, when try_to_grab_pending() grabs a delayed work item, it
leaves its linked work items alone on the delayed_works. The linked
work items are always NO_COLOR and will cause a future
cwq_activate_first_delayed() to increase cwq->nr_active incorrectly, and
may cause the whole cwq to stall. For example,
state: cwq->max_active = 1, cwq->nr_active = 1
one work in cwq->pool, many in cwq->delayed_works.
step1: try_to_grab_pending() removes a work item from delayed_works
but leaves its NO_COLOR linked work items on it.
step2: Later on, cwq_activate_first_delayed() activates the linked
work item increasing ->nr_active.
step3: cwq->nr_active = 1, but all activated work items of the cwq are
NO_COLOR. When they finish, cwq->nr_active will not be
decreased due to NO_COLOR, and no further work items will be
activated from cwq->delayed_works. The cwq stalls.
Fix it by ensuring the target work item is activated before stealing
PENDING in try_to_grab_pending(). This ensures that all the linked
work items are activated without incorrectly bumping cwq->nr_active.
tj: Updated comment and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
2012-09-18 21:40:00 +04:00
{
2024-01-29 21:11:24 +03:00
struct worker_pool * pool = pwq - > pool ;
2024-01-29 21:11:24 +03:00
struct wq_node_nr_active * nna = wq_node_nr_active ( pwq - > wq , pool - > node ) ;
2012-09-18 21:40:00 +04:00
2024-01-29 21:11:24 +03:00
lockdep_assert_held ( & pool - > lock ) ;
2024-01-29 21:11:24 +03:00
/*
* @ pwq - > nr_active should be decremented for both percpu and unbound
* workqueues .
*/
2024-01-29 21:11:24 +03:00
pwq - > nr_active - - ;
2024-01-29 21:11:24 +03:00
/*
* For a percpu workqueue , it ' s simple . Just need to kick the first
* inactive work item on @ pwq itself .
*/
if ( ! nna ) {
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- One max_active enforcement decouples from pool boundaires, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice more
online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_weight than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
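The per-node split described in the first two bullets can be sketched in plain user-space C. This is only an illustration under stated assumptions: node_max_active(), div_round_up() and clamp_int() are hypothetical helper names, and the rounding-up division is an assumption about how fractional shares are handled. It shows a proportional share being clamped between min_active and max_active.

```c
#include <assert.h>

/* Integer division rounding up (assumed rounding mode for the sketch). */
static int div_round_up(int n, int d)
{
	return (n + d - 1) / d;
}

/* Clamp v into the inclusive range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Hypothetical helper: the share of max_active handed to one node.
 * node_cpus is the number of the workqueue's online effective CPUs on
 * the node, total_cpus is the same count summed over all nodes.
 */
static int node_max_active(int max_active, int min_active,
			   int node_cpus, int total_cpus)
{
	assert(total_cpus > 0 && node_cpus >= 0 && node_cpus <= total_cpus);
	assert(min_active <= max_active);

	return clamp_int(div_round_up(max_active * node_cpus, total_cpus),
			 min_active, max_active);
}
```

With max_active = 16, min_active = 8 and two nodes of 8 and 4 CPUs, the shares come out as 11 and 8. Their sum exceeds 16, which is the "higher effective limit than configured" case that the min_active bullet warns about.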
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
pwq_activate_first_inactive ( pwq , false ) ;
2024-01-29 21:11:24 +03:00
return ;
}
2024-01-29 21:11:25 +03:00
/*
* If @ pwq is for an unbound workqueue , it ' s more complicated because
* multiple pwqs and pools may be sharing the nr_active count . When a
* pwq needs to wait for an nr_active count , it puts itself on
* $ nna - > pending_pwqs . The following atomic_dec_return ( ) ' s implied
* memory barrier is paired with smp_mb ( ) in pwq_tryinc_nr_active ( ) to
* guarantee that either we see non - empty pending_pwqs or they see
* decremented $ nna - > nr .
*
* $ nna - > max may change as CPUs come online / offline and @ pwq - > wq ' s
* max_active gets updated . However , it is guaranteed to be equal to or
* larger than @ pwq - > wq - > min_active which is above zero unless freezing .
* This maintains the forward progress guarantee .
*/
if ( atomic_dec_return ( & nna - > nr ) > = READ_ONCE ( nna - > max ) )
return ;
if ( ! list_empty ( & nna - > pending_pwqs ) )
node_activate_pending_pwq ( nna , pool ) ;
workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
Currently, when try_to_grab_pending() grabs a delayed work item, it
leaves its linked work items alone on the delayed_works. The linked
work items are always NO_COLOR and will cause a future
cwq_activate_first_delayed() to increase cwq->nr_active incorrectly, and
may cause the whole cwq to stall. For example,
state: cwq->max_active = 1, cwq->nr_active = 1
one work in cwq->pool, many in cwq->delayed_works.
step1: try_to_grab_pending() removes a work item from delayed_works
but leaves its NO_COLOR linked work items on it.
step2: Later on, cwq_activate_first_delayed() activates the linked
work item increasing ->nr_active.
step3: cwq->nr_active = 1, but all activated work items of the cwq are
NO_COLOR. When they finish, cwq->nr_active will not be
decreased due to NO_COLOR, and no further work items will be
activated from cwq->delayed_works. The cwq stalls.
Fix it by ensuring the target work item is activated before stealing
PENDING in try_to_grab_pending(). This ensures that all the linked
work items are activated without incorrectly bumping cwq->nr_active.
tj: Updated comment and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
2012-09-18 21:40:00 +04:00
}
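The decrement-and-compare at the end of the function above pairs with a "try to take a slot" operation on the queueing side. A minimal user-space sketch with C11 atomics follows. The struct and function names are hypothetical, and the pending_pwqs list, pool locking and the kernel's own atomic primitives are left out. The default sequentially consistent C11 operations stand in for the implied full barriers mentioned in the comment.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-node counter. */
struct node_nr_active {
	atomic_int nr;	/* work items currently active on this node */
	int max;	/* this node's share of max_active */
};

/* Take one slot unless the node is already at its limit. */
static bool nna_tryinc(struct node_nr_active *nna)
{
	int old = atomic_load(&nna->nr);

	while (old < nna->max) {
		/* on failure, old is refreshed with the current value */
		if (atomic_compare_exchange_weak(&nna->nr, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Release one slot. Returns true when the new count is below the limit,
 * i.e. when a waiting pwq could be activated.
 */
static bool nna_dec(struct node_nr_active *nna)
{
	int now = atomic_fetch_sub(&nna->nr, 1) - 1;

	return now < nna->max;
}
```

Because max can shrink while items are in flight, nna_dec() may legitimately return false even though a slot was just released. That is the early-return case in the code above.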
2012-08-03 21:30:46 +04:00
/**
2013-02-14 07:29:12 +04:00
* pwq_dec_nr_in_flight - decrement pwq ' s nr_in_flight
* @ pwq : pwq of interest
2021-08-17 04:32:35 +03:00
* @ work_data : work_data of work which left the queue
2012-08-03 21:30:46 +04:00
*
* A work either has completed or is removed from pending queue ,
2013-02-14 07:29:12 +04:00
* decrement nr_in_flight of its pwq and handle workqueue flushing .
2012-08-03 21:30:46 +04:00
*
2024-01-29 21:11:24 +03:00
* NOTE :
* For unbound workqueues , this function may temporarily drop @ pwq - > pool - > lock
* and thus should be called after all other state updates for the in - flight
* work item are complete .
*
2012-08-03 21:30:46 +04:00
* CONTEXT :
2020-05-27 22:46:33 +03:00
* raw_spin_lock_irq ( pool - > lock ) .
2012-08-03 21:30:46 +04:00
*/
2021-08-17 04:32:35 +03:00
static void pwq_dec_nr_in_flight ( struct pool_workqueue * pwq , unsigned long work_data )
2012-08-03 21:30:46 +04:00
{
2021-08-17 04:32:35 +03:00
int color = get_work_color ( work_data ) ;
2024-01-29 21:11:24 +03:00
if ( ! ( work_data & WORK_STRUCT_INACTIVE ) )
pwq_dec_nr_active ( pwq ) ;
2021-08-17 04:32:37 +03:00
2013-02-14 07:29:12 +04:00
pwq - > nr_in_flight [ color ] - - ;
2012-08-03 21:30:46 +04:00
/* is flush in progress and are we at the flushing tip? */
2013-02-14 07:29:12 +04:00
if ( likely ( pwq - > flush_color ! = color ) )
2013-03-12 22:30:04 +04:00
goto out_put ;
2012-08-03 21:30:46 +04:00
/* are there still in-flight works? */
2013-02-14 07:29:12 +04:00
if ( pwq - > nr_in_flight [ color ] )
2013-03-12 22:30:04 +04:00
goto out_put ;
2012-08-03 21:30:46 +04:00
2013-02-14 07:29:12 +04:00
/* this pwq is done, clear flush_color */
pwq - > flush_color = - 1 ;
2012-08-03 21:30:46 +04:00
/*
2013-02-14 07:29:12 +04:00
* If this was the last pwq , wake up the first flusher . It
2012-08-03 21:30:46 +04:00
* will handle the rest .
*/
2013-02-14 07:29:12 +04:00
if ( atomic_dec_and_test ( & pwq - > wq - > nr_pwqs_to_flush ) )
complete ( & pwq - > wq - > first_flusher - > done ) ;
2013-03-12 22:30:04 +04:00
out_put :
put_pwq ( pwq ) ;
2012-08-03 21:30:46 +04:00
}
2012-08-03 21:30:46 +04:00
/**
2012-08-03 21:30:46 +04:00
* try_to_grab_pending - steal work item from worklist and disable irq
2012-08-03 21:30:46 +04:00
* @ work : work item to steal
2024-02-21 08:36:14 +03:00
* @ cflags : % WORK_CANCEL_ flags
2024-02-21 08:36:14 +03:00
* @ irq_flags : place to store irq state
2012-08-03 21:30:46 +04:00
*
* Try to grab PENDING bit of @ work . This function can handle @ work in any
2013-08-01 01:59:24 +04:00
* stable state - idle , on timer or on worklist .
2012-08-03 21:30:46 +04:00
*
2013-08-01 01:59:24 +04:00
* Return :
2020-09-29 14:12:51 +03:00
*
* = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
2012-08-03 21:30:46 +04:00
* 1 if @ work was pending and we successfully stole PENDING
* 0 if @ work was idle and we claimed PENDING
* - EAGAIN if PENDING couldn ' t be grabbed at the moment , safe to busy - retry
2020-09-29 14:12:51 +03:00
* = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
2012-08-03 21:30:46 +04:00
*
2013-08-01 01:59:24 +04:00
* Note :
2012-08-03 21:30:46 +04:00
* On > = 0 return , the caller owns @ work ' s PENDING bit . To avoid getting
2012-08-22 00:18:24 +04:00
* interrupted while holding PENDING and @ work off queue , irq must be
* disabled on entry . This , combined with delayed_work - > timer being
* irqsafe , ensures that we return - EAGAIN for finite short period of time .
2012-08-03 21:30:46 +04:00
*
* On successful return , > = 0 , irq is disabled and the caller is
2024-02-21 08:36:14 +03:00
* responsible for releasing it using local_irq_restore ( * @ irq_flags ) .
2012-08-03 21:30:46 +04:00
*
2012-08-22 00:18:24 +04:00
* This function is safe to call from any context including IRQ handler .
2012-08-03 21:30:46 +04:00
*/
2024-02-21 08:36:14 +03:00
static int try_to_grab_pending ( struct work_struct * work , u32 cflags ,
2024-02-21 08:36:14 +03:00
unsigned long * irq_flags )
2012-08-03 21:30:46 +04:00
{
2013-01-24 23:01:33 +04:00
struct worker_pool * pool ;
2013-02-14 07:29:12 +04:00
struct pool_workqueue * pwq ;
2012-08-03 21:30:46 +04:00
2024-02-21 08:36:14 +03:00
local_irq_save ( * irq_flags ) ;
2012-08-03 21:30:46 +04:00
2012-08-03 21:30:46 +04:00
/* try to steal the timer if it exists */
2024-02-21 08:36:14 +03:00
if ( cflags & WORK_CANCEL_DELAYED ) {
2012-08-03 21:30:46 +04:00
struct delayed_work * dwork = to_delayed_work ( work ) ;
2012-08-22 00:18:24 +04:00
/*
* dwork - > timer is irqsafe . If del_timer ( ) fails , it ' s
* guaranteed that the timer is not queued anywhere and not
* running on the local CPU .
*/
2012-08-03 21:30:46 +04:00
if ( likely ( del_timer ( & dwork - > timer ) ) )
return 1 ;
}
/* try to claim PENDING the normal way */
2012-08-03 21:30:46 +04:00
if ( ! test_and_set_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( work ) ) )
return 0 ;
2019-03-13 19:55:47 +03:00
rcu_read_lock ( ) ;
2012-08-03 21:30:46 +04:00
/*
* The queueing is in progress , or it is already queued . Try to
* steal it from - > worklist without clearing WORK_STRUCT_PENDING .
*/
2013-01-24 23:01:33 +04:00
pool = get_work_pool ( work ) ;
if ( ! pool )
2012-08-03 21:30:46 +04:00
goto fail ;
2012-08-03 21:30:46 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_lock ( & pool - > lock ) ;
workqueue: simplify is-work-item-queued-here test
Currently, determining whether a work item is queued on a locked pool
involves somewhat convoluted memory barrier dancing. It goes like the
following.
* When a work item is queued on a pool, work->data is updated before
work->entry is linked to the pending list with a wmb() in between.
* When trying to determine whether a work item is currently queued on
a pool pointed to by work->data, it locks the pool and looks at
work->entry. If work->entry is linked, we then do rmb() and then
check whether work->data points to the current pool.
This works because work->data can only point to a pool if it
currently is or was on the pool and,
* If it currently is on the pool, the tests would obviously succeed.
* If it left the pool, its work->entry was cleared under pool->lock,
so if we're seeing non-empty work->entry, it has to be from the work
item being linked on another pool. Because work->data is updated
before work->entry is linked with wmb() in between, work->data update
from another pool is guaranteed to be visible if we do rmb() after
seeing non-empty work->entry. So, we either see empty work->entry
or we see updated work->data pointing to another pool.
While this works, it's convoluted, to put it mildly. With recent
updates, it's now guaranteed that work->data points to cwq only while
the work item is queued and that updating work->data to point to cwq
or back to pool is done under pool->lock, so we can simply test
whether work->data points to cwq which is associated with the
currently locked pool instead of the convoluted memory barrier
dancing.
This patch replaces the memory barrier based "are you still here,
really?" test with a much simpler "does work->data point to me?" test -
if work->data points to a cwq which is associated with the currently
locked pool, the work item is guaranteed to be queued on the pool as
work->data can start and stop pointing to such cwq only under
pool->lock and the start and stop coincide with queue and dequeue.
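The simplified test can be illustrated with a toy user-space model. Everything below is hypothetical: the toy_* names, the single tag bit and the data encoding are assumptions made for the sketch, and pwq is used as the current name for what this message calls cwq. The caller is assumed to hold the pool's lock, which the toy does not model.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag bit: set when data holds a pwq pointer. */
#define TOY_WORK_PWQ ((uintptr_t)1)

struct toy_pool { int id; };
struct toy_pwq  { struct toy_pool *pool; };
struct toy_work { uintptr_t data; };

/* On queueing: point data at the pwq (done under the pool lock). */
static void toy_set_pwq(struct toy_work *w, struct toy_pwq *pwq)
{
	w->data = (uintptr_t)pwq | TOY_WORK_PWQ;
}

/* On dequeueing: fall back to recording only the pool id. */
static void toy_set_pool(struct toy_work *w, const struct toy_pool *pool)
{
	w->data = (uintptr_t)pool->id << 1;
}

/* Return the pwq if data currently points to one, else NULL. */
static struct toy_pwq *toy_get_pwq(const struct toy_work *w)
{
	if (w->data & TOY_WORK_PWQ)
		return (struct toy_pwq *)(w->data & ~TOY_WORK_PWQ);
	return NULL;
}

/* "Does work->data point to me?" for an already locked pool. */
static bool toy_queued_on(const struct toy_work *w,
			  const struct toy_pool *pool)
{
	struct toy_pwq *pwq = toy_get_pwq(w);

	return pwq && pwq->pool == pool;
}
```

No barriers appear in the test itself: its correctness rests entirely on both data transitions happening under the same lock the tester holds.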
tj: Rewrote the comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-07 06:04:53 +04:00
/*
2013-02-14 07:29:12 +04:00
* work - > data is guaranteed to point to pwq only while the work
* item is queued on pwq - > wq , and both updating work - > data to point
* to pwq on queueing and to pool on dequeueing are done under
* pwq - > pool - > lock . This in turn guarantees that , if work - > data
* points to pwq which is associated with a locked pool , the work
2013-02-07 06:04:53 +04:00
* item is currently queued on that pool .
*/
2013-02-14 07:29:12 +04:00
pwq = get_work_pwq ( work ) ;
if ( pwq & & pwq - > pool = = pool ) {
2024-02-05 00:14:21 +03:00
unsigned long work_data ;
2013-02-07 06:04:53 +04:00
debug_work_deactivate ( work ) ;
/*
2021-08-17 04:32:37 +03:00
* A cancelable inactive work item must be in the
* pwq - > inactive_works since a queued barrier can ' t be
* canceled ( see the comments in insert_wq_barrier ( ) ) .
*
2021-08-17 04:32:34 +03:00
* An inactive work item cannot be grabbed directly because
2021-08-17 04:32:38 +03:00
* it might have linked barrier work items which , if left
2021-08-17 04:32:34 +03:00
* on the inactive_works list , will confuse pwq - > nr_active
2013-02-07 06:04:53 +04:00
* management later on and cause stall . Make sure the work
* item is activated before grabbing .
*/
2024-01-29 21:11:24 +03:00
pwq_activate_work ( pwq , work ) ;
2013-02-07 06:04:53 +04:00
list_del_init ( & work - > entry ) ;
2024-02-05 00:14:21 +03:00
/*
* work - > data points to pwq iff queued . Let ' s point to pool . As
* this destroys work - > data needed by the next step , stash it .
*/
work_data = * work_data_bits ( work ) ;
2024-03-25 20:21:03 +03:00
set_work_pool_and_keep_pending ( work , pool - > id ,
pool_offq_flags ( pool ) ) ;
2013-02-07 06:04:53 +04:00
2024-01-29 21:11:24 +03:00
/* must be the last step, see the function comment */
2024-02-05 00:14:21 +03:00
pwq_dec_nr_in_flight ( pwq , work_data ) ;
2024-01-29 21:11:24 +03:00
2020-05-27 22:46:33 +03:00
raw_spin_unlock ( & pool - > lock ) ;
2019-03-13 19:55:47 +03:00
rcu_read_unlock ( ) ;
2013-02-07 06:04:53 +04:00
return 1 ;
2012-08-03 21:30:46 +04:00
}
2020-05-27 22:46:33 +03:00
raw_spin_unlock ( & pool - > lock ) ;
2012-08-03 21:30:46 +04:00
fail :
2019-03-13 19:55:47 +03:00
rcu_read_unlock ( ) ;
2024-02-21 08:36:14 +03:00
local_irq_restore ( * irq_flags ) ;
2012-08-03 21:30:46 +04:00
return - EAGAIN ;
2012-08-03 21:30:46 +04:00
}
2024-02-21 08:36:14 +03:00
/**
* work_grab_pending - steal work item from worklist and disable irq
* @ work : work item to steal
* @ cflags : % WORK_CANCEL_ flags
* @ irq_flags : place to store IRQ state
*
* Grab PENDING bit of @ work . @ work can be in any stable state - idle , on timer
* or on worklist .
*
workqueue: Remove WORK_OFFQ_CANCELING
cancel[_delayed]_work_sync() guarantees that it can shut down
self-requeueing work items. To achieve that, it grabs and then holds
WORK_STRUCT_PENDING bit set while flushing the currently executing instance.
As the PENDING bit is set, all queueing attempts including the
self-requeueing ones fail and once the currently executing instance is
flushed, the work item should be idle as long as someone else isn't actively
queueing it.
This means that the cancel_work_sync path may hold the PENDING bit set while
flushing the target work item. This isn't a problem for the queueing path -
it can just fail which is the desired effect. It doesn't affect flush. It
doesn't matter to cancel_work either as it can just report that the work
item has been successfully canceled. However, if there's another cancel_work_sync
attempt on the work item, it can't simply fail or report success as that
would breach the guarantee that it should provide. cancel_work_sync has to
wait for and grab that PENDING bit and go through the motions.
WORK_OFFQ_CANCELING and wq_cancel_waitq are what implement this
cancel_work_sync to cancel_work_sync wait mechanism. When a work item is
being canceled, WORK_OFFQ_CANCELING is also set on it and other
cancel_work_sync attempts wait on the bit to be cleared using the wait
queue.
While this works, it's an isolated wart which doesn't jibe with the rest of
flush and cancel mechanisms and forces enable_work() and disable_work() to
require a sleepable context, which hampers their usability.
Now that a work item can be disabled, we can use that to block queueing
while cancel_work_sync is in progress. Instead of holding the PENDING bit,
it can temporarily disable the work item, flush and then re-enable it as
that'd achieve the same end result of blocking queueings while canceling and
thus enable canceling of self-requeueing work items.
- WORK_OFFQ_CANCELING and the surrounding mechanisms are removed.
- work_grab_pending() is now simpler, no longer has to wait for a blocking
operation and thus can be called from any context.
- With work_grab_pending() simplified, no need to use try_to_grab_pending()
directly. All users are converted to use work_grab_pending().
- __cancel_work_sync() is updated to __cancel_work() with
WORK_CANCEL_DISABLE to cancel and plug racing queueing attempts. It then
flushes and re-enables the work item if necessary.
- These changes allow disable_work() and enable_work() to be called from any
context.
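The "disable, flush, re-enable" idea can be shown with a single-threaded toy model. The dw_* names and the fields below are hypothetical and do not reflect how the kernel encodes the disable count. The sketch only demonstrates that a non-zero disable depth blocks queueing, including self-requeueing, until the canceler re-enables the item.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy work item. */
struct dw_work {
	bool pending;			/* queued and not yet running */
	unsigned int disable_depth;	/* > 0 blocks all queueing */
};

/* Queue the item; fails if already pending or currently disabled. */
static bool dw_queue(struct dw_work *w)
{
	if (w->pending || w->disable_depth)
		return false;
	w->pending = true;
	return true;
}

/*
 * Cancel a pending instance and plug further queueing attempts.
 * Returns whether the item was pending.
 */
static bool dw_cancel_and_disable(struct dw_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	w->disable_depth++;
	return was_pending;
}

/* Undo one disable; queueing works again once the depth reaches zero. */
static void dw_enable(struct dw_work *w)
{
	assert(w->disable_depth > 0);
	w->disable_depth--;
}
```

A synchronous cancel in this model would call dw_cancel_and_disable(), wait for any running instance to finish, then call dw_enable(). Nothing in that sequence needs to sleep while holding PENDING.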
v2: Lai pointed out that mod_delayed_work_on() needs to check the disable
count before queueing the delayed work item. Added
clear_pending_if_disabled() call.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25 20:21:03 +03:00
* Can be called from any context . IRQ is disabled on return with IRQ state
2024-02-21 08:36:14 +03:00
* stored in * @ irq_flags . The caller is responsible for re - enabling it using
* local_irq_restore ( ) .
*
* Returns % true if @ work was pending . % false if idle .
*/
static bool work_grab_pending ( struct work_struct * work , u32 cflags ,
unsigned long * irq_flags )
{
int ret ;
2024-03-25 20:21:03 +03:00
while ( true ) {
ret = try_to_grab_pending ( work , cflags , irq_flags ) ;
if ( ret > = 0 )
return ret ;
cpu_relax ( ) ;
}
2024-02-21 08:36:14 +03:00
}
2010-06-29 12:07:10 +04:00
/**
2013-01-24 23:01:34 +04:00
* insert_work - insert a work into a pool
2013-02-14 07:29:12 +04:00
* @ pwq : pwq @ work belongs to
2010-06-29 12:07:10 +04:00
* @ work : work to insert
* @ head : insertion point
* @ extra_flags : extra WORK_STRUCT_ * flags to set
*
2013-02-14 07:29:12 +04:00
* Insert @ work which belongs to @ pwq after @ head . @ extra_flags is or ' d to
2013-01-24 23:01:34 +04:00
* work_struct flags .
2010-06-29 12:07:10 +04:00
*
* CONTEXT :
2020-05-27 22:46:33 +03:00
* raw_spin_lock_irq ( pool - > lock ) .
2010-06-29 12:07:10 +04:00
*/
2013-02-14 07:29:12 +04:00
static void insert_work ( struct pool_workqueue * pwq , struct work_struct * work ,
struct list_head * head , unsigned int extra_flags )
implement flush_work()
A basic problem with flush_scheduled_work() is that it blocks behind _all_
presently-queued works, rather than just the work which the caller wants to
flush. If the caller holds some lock, and if one of the queued works happens
to want that lock as well then accidental deadlocks can occur.
One example of this is the phy layer: it wants to flush work while holding
rtnl_lock(). But if a linkwatch event happens to be queued, the phy code will
deadlock because the linkwatch callback function takes rtnl_lock.
So we implement a new function which will flush a *single* work - just the one
which the caller wants to free up. Thus we avoid the accidental deadlocks
which can arise from unrelated subsystems' callbacks taking shared locks.
flush_work() non-blockingly dequeues the work_struct which we want to kill,
then it waits for its handler to complete on all CPUs.
Add ->current_work to the "struct cpu_workqueue_struct", it points to
currently running "struct work_struct". When flush_work(work) detects
->current_work == work, it inserts a barrier at the _head_ of ->worklist
(and thus right _after_ that work) and waits for completion. This means
that the next work fired on that CPU will be this barrier, or another
barrier queued by concurrent flush_work(), so the caller of flush_work()
will be woken before any "regular" work has a chance to run.
When wait_on_work() unlocks workqueue_mutex (or whatever we choose to protect
against CPU hotplug), CPU may go away. But in that case take_over_work() will
move a barrier we queued to another CPU, it will be fired sometime, and
wait_on_work() will be woken.
Actually, we are doing cleanup_workqueue_thread()->kthread_stop() before
take_over_work(), so cwq->thread should complete its ->worklist (and thus
the barrier), because currently we don't check kthread_should_stop() in
run_workqueue(). But even if we did, everything should be ok.
[akpm@osdl.org: cleanup]
[akpm@osdl.org: add flush_work_keventd() wrapper]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 13:33:52 +04:00
{
2023-08-08 04:57:22 +03:00
debug_work_activate ( work ) ;
2010-06-29 12:07:14 +04:00
2020-12-15 06:09:09 +03:00
/* record the work call stack in order to print it in KASAN reports */
workqueue, kasan: avoid alloc_pages() when recording stack
Shuah Khan reported:
| When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
| kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
| it tries to allocate memory attempting to acquire spinlock in page
| allocation code while holding workqueue pool raw_spinlock.
|
| There are several instances of this problem when block layer tries
| to __queue_work(). Call trace from one of these instances is below:
|
| kblockd_mod_delayed_work_on()
| mod_delayed_work_on()
| __queue_delayed_work()
| __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
| insert_work()
| kasan_record_aux_stack()
| kasan_save_stack()
| stack_depot_save()
| alloc_pages()
| __alloc_pages()
| get_page_from_freelist()
| rm_queue()
| rm_queue_pcplist()
| local_lock_irqsave(&pagesets.lock, flags);
| [ BUG: Invalid wait context triggered ]
The default kasan_record_aux_stack() calls stack_depot_save() with
GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...).
In general, however, it is not even possible to use either GFP_ATOMIC
or GFP_NOWAIT in certain non-preemptive contexts, including
raw_spin_locks (see gfp.h and commit ab00db216c9c7).
Fix it by instructing stackdepot to not expand stack storage via
alloc_pages() in case it runs out by using
kasan_record_aux_stack_noalloc().
While there is an increased risk of failing to insert the stack trace,
this is typically unlikely, especially if the same insertion had already
succeeded previously (stack depot hit).
For frequent calls from the same location, it therefore becomes
extremely unlikely that kasan_record_aux_stack_noalloc() fails.
Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
Link: https://lkml.kernel.org/r/20210913112609.2651084-7-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-05 23:35:50 +03:00
kasan_record_aux_stack_noalloc ( work ) ;
2020-12-15 06:09:09 +03:00
2010-06-29 12:07:10 +04:00
/* we own @work, set data and link */
2013-02-14 07:29:12 +04:00
set_work_pwq ( work , pwq , extra_flags ) ;
2008-07-25 12:47:47 +04:00
list_add_tail ( & work - > entry , head ) ;
2013-03-12 22:30:04 +04:00
get_pwq ( pwq ) ;
implement flush_work()
A basic problem with flush_scheduled_work() is that it blocks behind _all_
presently-queued works, rather than just the work which the caller wants to
flush. If the caller holds some lock, and if one of the queued works happens
to want that lock as well then accidental deadlocks can occur.
One example of this is the phy layer: it wants to flush work while holding
rtnl_lock(). But if a linkwatch event happens to be queued, the phy code will
deadlock because the linkwatch callback function takes rtnl_lock.
So we implement a new function which will flush a *single* work - just the one
which the caller wants to free up. Thus we avoid the accidental deadlocks
which can arise from unrelated subsystems' callbacks taking shared locks.
flush_work() non-blockingly dequeues the work_struct which we want to kill,
then it waits for its handler to complete on all CPUs.
Add ->current_work to the "struct cpu_workqueue_struct", it points to
currently running "struct work_struct". When flush_work(work) detects
->current_work == work, it inserts a barrier at the _head_ of ->worklist
(and thus right _after_ that work) and waits for completion. This means
that the next work fired on that CPU will be this barrier, or another
barrier queued by concurrent flush_work(), so the caller of flush_work()
will be woken before any "regular" work has a chance to run.
When wait_on_work() unlocks workqueue_mutex (or whatever we choose to protect
against CPU hotplug), CPU may go away. But in that case take_over_work() will
move a barrier we queued to another CPU, it will be fired sometime, and
wait_on_work() will be woken.
Actually, we are doing cleanup_workqueue_thread()->kthread_stop() before
take_over_work(), so cwq->thread should complete its ->worklist (and thus
the barrier), because currently we don't check kthread_should_stop() in
run_workqueue(). But even if we did, everything should be ok.
[akpm@osdl.org: cleanup]
[akpm@osdl.org: add flush_work_keventd() wrapper]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 13:33:52 +04:00
}
2010-12-20 21:32:04 +03:00
/*
* Test whether @ work is being queued from another work executing on the
2013-02-14 07:29:10 +04:00
* same workqueue .
2010-12-20 21:32:04 +03:00
*/
static bool is_chained_work ( struct workqueue_struct * wq )
{
2013-02-14 07:29:10 +04:00
struct worker * worker ;
worker = current_wq_worker ( ) ;
/*
2019-03-02 00:57:25 +03:00
* Return % true iff I ' m a worker executing a work item on @ wq . If
2013-02-14 07:29:10 +04:00
* I ' m @ worker , it ' s safe to dereference it without locking .
*/
2013-02-14 07:29:12 +04:00
return worker & & worker - > current_pwq - > wq = = wq ;
2010-12-20 21:32:04 +03:00
}
2016-02-10 01:59:38 +03:00
/*
* When queueing an unbound work item to a wq , prefer local CPU if allowed
* by wq_unbound_cpumask . Otherwise , round robin among the allowed ones to
* avoid perturbing sensitive tasks .
*/
static int wq_select_unbound_cpu ( int cpu )
{
int new_cpu ;
2016-02-10 01:59:38 +03:00
if ( likely ( ! wq_debug_force_rr_cpu ) ) {
if ( cpumask_test_cpu ( cpu , wq_unbound_cpumask ) )
return cpu ;
2023-02-26 19:53:20 +03:00
} else {
pr_warn_once ( " workqueue: round-robin CPU selection forced, expect performance impact \n " ) ;
2016-02-10 01:59:38 +03:00
}
2016-02-10 01:59:38 +03:00
new_cpu = __this_cpu_read ( wq_rr_cpu_last ) ;
new_cpu = cpumask_next_and ( new_cpu , wq_unbound_cpumask , cpu_online_mask ) ;
if ( unlikely ( new_cpu > = nr_cpu_ids ) ) {
new_cpu = cpumask_first_and ( wq_unbound_cpumask , cpu_online_mask ) ;
if ( unlikely ( new_cpu > = nr_cpu_ids ) )
return cpu ;
}
__this_cpu_write ( wq_rr_cpu_last , new_cpu ) ;
return new_cpu ;
}
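The selection policy of wq_select_unbound_cpu() can be sketched in plain
userspace C. This is only an illustration under stated assumptions, not the
kernel implementation: cpumasks are modelled as 64-bit bitmasks, NR_CPUS_TOY,
toy_next_and() and toy_select_unbound_cpu() are hypothetical names, and the
per-CPU wq_rr_cpu_last variable is replaced by a caller-supplied cursor.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_TOY 8	/* hypothetical CPU count for this sketch */

/*
 * Return the first CPU strictly after @prev that is set in both masks,
 * or NR_CPUS_TOY if there is none (mirrors the "return >= nr_cpu_ids"
 * convention of cpumask_next_and()).
 */
static int toy_next_and(int prev, uint64_t a, uint64_t b)
{
	uint64_t both = a & b;
	int cpu;

	for (cpu = prev + 1; cpu < NR_CPUS_TOY; cpu++)
		if (both & (1ULL << cpu))
			return cpu;
	return NR_CPUS_TOY;
}

/*
 * Prefer the local CPU when it is allowed; otherwise round-robin over
 * the allowed and online CPUs, remembering the last pick in @rr_last.
 * Falls back to @cpu when no allowed CPU is online.
 */
static int toy_select_unbound_cpu(int cpu, uint64_t allowed, uint64_t online,
				  int *rr_last, int force_rr)
{
	int new_cpu;

	assert(cpu >= 0 && cpu < NR_CPUS_TOY);

	if (!force_rr && (allowed & (1ULL << cpu)))
		return cpu;

	new_cpu = toy_next_and(*rr_last, allowed, online);
	if (new_cpu >= NR_CPUS_TOY) {
		/* ran off the end of the mask: wrap to the first candidate */
		new_cpu = toy_next_and(-1, allowed, online);
		if (new_cpu >= NR_CPUS_TOY)
			return cpu;
	}
	*rr_last = new_cpu;
	return new_cpu;
}
```

With CPUs 2, 4 and 5 allowed and all CPUs online, repeated calls from a
disallowed CPU cycle through 2, 4, 5 and then wrap back to 2, while a call
from CPU 2 itself returns 2 without touching the cursor.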
2013-03-12 22:29:59 +04:00
static void __queue_work ( int cpu , struct workqueue_struct * wq ,
2005-04-17 02:20:36 +04:00
struct work_struct * work )
{
2013-02-14 07:29:12 +04:00
struct pool_workqueue * pwq ;
2023-08-08 04:57:22 +03:00
struct worker_pool * last_pool , * pool ;
2010-08-25 12:33:56 +04:00
unsigned int work_flags ;
2012-08-15 18:25:37 +04:00
unsigned int req_cpu = cpu ;
2012-08-03 21:30:45 +04:00
/*
* While a work item is PENDING & & off queue , a task trying to
* steal the PENDING will busy - loop waiting for it to either get
* queued or lose PENDING . Grabbing PENDING and queueing should
* happen with IRQ disabled .
*/
2017-11-06 18:01:19 +03:00
lockdep_assert_irqs_disabled ( ) ;
2005-04-17 02:20:36 +04:00
2022-12-13 07:39:36 +03:00
/*
* For a draining wq , only works from the same workqueue are
* allowed . The __WQ_DESTROYING helps to spot the issue that
* queues a new work item to a wq after destroy_workqueue ( wq ) .
*/
if ( unlikely ( wq - > flags & ( __WQ_DESTROYING | __WQ_DRAINING ) & &
WARN_ON_ONCE ( ! is_chained_work ( wq ) ) ) )
2010-08-24 16:22:47 +04:00
return ;
2019-03-13 19:55:47 +03:00
rcu_read_lock ( ) ;
2013-03-12 22:30:04 +04:00
retry :
2013-03-12 22:30:04 +04:00
/* pwq which will be used unless @work is executing elsewhere */
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
if ( req_cpu = = WORK_CPU_UNBOUND ) {
if ( wq - > flags & WQ_UNBOUND )
2020-01-25 04:14:45 +03:00
cpu = wq_select_unbound_cpu ( raw_smp_processor_id ( ) ) ;
2023-08-08 04:57:23 +03:00
else
2020-01-25 04:14:45 +03:00
cpu = raw_smp_processor_id ( ) ;
}
workqueue: make all workqueues non-reentrant
By default, each per-cpu part of a bound workqueue operates separately
and a work item may be executing concurrently on different CPUs. The
behavior avoids some cross-cpu traffic but leads to subtle weirdities
and not-so-subtle contortions in the API.
* There's no sane usefulness in allowing a single work item to be
executed concurrently on multiple CPUs. People just get the
behavior unintentionally and get surprised after learning about it.
Most either explicitly synchronize or use non-reentrant/ordered
workqueue but this is error-prone.
* flush_work() can't wait for multiple instances of the same work item
on different CPUs. If a work item is executing on cpu0 and then
queued on cpu1, flush_work() can only wait for the one on cpu1.
Unfortunately, work items can easily cross CPU boundaries
unintentionally when the queueing thread gets migrated. This means
that if multiple queuers compete, flush_work() can't even guarantee
that the instance queued right before it is finished before
returning.
* flush_work_sync() was added to work around some of the deficiencies
of flush_work(). In addition to the usual flushing, it ensures that
all currently executing instances are finished before returning.
This operation is expensive as it has to walk all CPUs and at the
same time fails to address competing queuer case.
Incorrectly using flush_work() when flush_work_sync() is necessary
is an easy error to make and can lead to bugs which are difficult to
reproduce.
* Similar problems exist for flush_delayed_work[_sync]().
Other than the cross-cpu access concern, there's no benefit in
allowing parallel execution and it's plain silly to have this level of
contortion for workqueue which is widely used from core code to
extremely obscure drivers.
This patch makes all workqueues non-reentrant. If a work item is
executing on a different CPU when queueing is requested, it is always
queued to that CPU. This guarantees that any given work item can be
executing on one CPU at maximum and if a work item is queued and
executing, both are on the same CPU.
The only behavior change which may affect workqueue users negatively
is that non-reentrancy overrides the affinity specified by
queue_work_on(). On a reentrant workqueue, the affinity specified by
queue_work_on() is always followed. Now, if the work item is
executing on one of the CPUs, the work item will be queued there
regardless of the requested affinity. I've reviewed all workqueue
users which request explicit affinity, and, fortunately, none seems to
be crazy enough to exploit parallel execution of the same work item.
This adds an additional busy_hash lookup if the work item was
previously queued on a different CPU. This shouldn't be noticeable
under any sane workload. Work item queueing isn't a very
high-frequency operation and they don't jump across CPUs all the time.
In a micro benchmark to exaggerate this difference - measuring the
time it takes for two work items to repeatedly jump between two CPUs a
number (10M) of times with busy_hash table densely populated, the
difference was around 3%.
While the overhead is measurable, it is only visible in pathological
cases and the difference isn't huge. This change brings much needed
sanity to workqueue and makes its behavior consistent with timer. I
think this is the right tradeoff to make.
This enables significant simplification of workqueue API.
Simplification patches will follow.
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-21 01:51:23 +04:00
2023-08-08 04:57:23 +03:00
pwq = rcu_dereference ( * per_cpu_ptr ( wq - > cpu_pwq , cpu ) ) ;
2023-08-08 04:57:22 +03:00
pool = pwq - > pool ;
2013-03-12 22:30:04 +04:00
/*
* If @ work was previously on a different pool , it might still be
* running there , in which case the work needs to be queued on that
* pool to guarantee non - reentrancy .
*/
last_pool = get_work_pool ( work ) ;
2023-08-08 04:57:22 +03:00
if ( last_pool & & last_pool ! = pool ) {
2013-03-12 22:30:04 +04:00
struct worker * worker ;
2010-06-29 12:07:13 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_lock ( & last_pool - > lock ) ;
2010-06-29 12:07:13 +04:00
2013-03-12 22:30:04 +04:00
worker = find_worker_executing_work ( last_pool , work ) ;
2010-06-29 12:07:13 +04:00
2013-03-12 22:30:04 +04:00
if ( worker & & worker - > current_pwq - > wq = = wq ) {
pwq = worker - > current_pwq ;
2023-08-08 04:57:22 +03:00
pool = pwq - > pool ;
WARN_ON_ONCE ( pool ! = last_pool ) ;
2012-08-03 21:30:45 +04:00
} else {
2013-03-12 22:30:04 +04:00
/* meh... not running there, queue here */
2020-05-27 22:46:33 +03:00
raw_spin_unlock ( & last_pool - > lock ) ;
2023-08-08 04:57:22 +03:00
raw_spin_lock ( & pool - > lock ) ;
2012-08-03 21:30:45 +04:00
}
2010-07-02 12:03:51 +04:00
} else {
2023-08-08 04:57:22 +03:00
raw_spin_lock ( & pool - > lock ) ;
2010-06-29 12:07:13 +04:00
}
2013-03-12 22:30:04 +04:00
/*
2023-08-08 04:57:23 +03:00
* pwq is determined and locked . For unbound pools , we could have raced
* with pwq release and it could already be dead . If its refcnt is zero ,
* repeat pwq selection . Note that unbound pwqs never die without
* another pwq replacing it in cpu_pwq or while work items are executing
* on it , so the retrying is guaranteed to make forward - progress .
2013-03-12 22:30:04 +04:00
*/
if ( unlikely ( ! pwq - > refcnt ) ) {
if ( wq - > flags & WQ_UNBOUND ) {
2023-08-08 04:57:22 +03:00
raw_spin_unlock ( & pool - > lock ) ;
2013-03-12 22:30:04 +04:00
cpu_relax ( ) ;
goto retry ;
}
/* oops */
WARN_ONCE ( true , " workqueue: per-cpu pwq for %s on cpu%d has 0 refcnt " ,
wq - > name , cpu ) ;
}
2013-02-14 07:29:12 +04:00
/* pwq determined, queue */
trace_workqueue_queue_work ( req_cpu , pwq , work ) ;
2010-06-29 12:07:13 +04:00
2019-03-13 19:55:47 +03:00
if ( WARN_ON ( ! list_empty ( & work - > entry ) ) )
goto out ;
2010-06-29 12:07:12 +04:00
2013-02-14 07:29:12 +04:00
pwq - > nr_in_flight [ pwq - > work_color ] + + ;
work_flags = work_color_to_flags ( pwq - > work_color ) ;
2010-06-29 12:07:12 +04:00
2024-01-29 21:11:24 +03:00
/*
* Limit the number of concurrently active work items to max_active .
* @ work must also queue behind existing inactive work items to maintain
* ordering when max_active changes . See wq_adjust_max_active ( ) .
*/
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time, i.e. the acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice more
online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
if ( list_empty ( & pwq - > inactive_works ) & & pwq_tryinc_nr_active ( pwq , false ) ) {
2023-08-08 04:57:22 +03:00
if ( list_empty ( & pool - > worklist ) )
pool - > watchdog_ts = jiffies ;
2010-10-05 12:49:55 +04:00
trace_workqueue_activate_work ( work ) ;
2023-08-08 04:57:22 +03:00
insert_work ( pwq , work , & pool - > worklist , work_flags ) ;
2023-08-08 04:57:25 +03:00
kick_pool ( pool ) ;
2010-08-25 12:33:56 +04:00
} else {
2021-08-17 04:32:34 +03:00
work_flags | = WORK_STRUCT_INACTIVE ;
2023-08-08 04:57:22 +03:00
insert_work ( pwq , work , & pwq - > inactive_works , work_flags ) ;
2010-08-25 12:33:56 +04:00
}
2010-06-29 12:07:12 +04:00
2019-03-13 19:55:47 +03:00
out :
2023-08-08 04:57:22 +03:00
raw_spin_unlock ( & pool - > lock ) ;
2019-03-13 19:55:47 +03:00
rcu_read_unlock ( ) ;
2005-04-17 02:20:36 +04:00
}
2024-03-25 20:21:03 +03:00
static bool clear_pending_if_disabled ( struct work_struct * work )
{
unsigned long data = * work_data_bits ( work ) ;
struct work_offq_data offqd ;
if ( likely ( ( data & WORK_STRUCT_PWQ ) | |
! ( data & WORK_OFFQ_DISABLE_MASK ) ) )
return false ;
work_offqd_unpack ( & offqd , data ) ;
set_work_pool_and_clear_pending ( work , offqd . pool_id ,
work_offqd_pack_flags ( & offqd ) ) ;
return true ;
}
2006-07-30 14:03:42 +04:00
/**
2008-07-24 08:28:39 +04:00
* queue_work_on - queue work on specific cpu
* @ cpu : CPU number to execute work on
2006-07-30 14:03:42 +04:00
* @ wq : workqueue to use
* @ work : work to queue
*
2008-07-24 08:28:39 +04:00
* We queue the work to a specific CPU , the caller must ensure it
2021-12-01 04:00:30 +03:00
* can ' t go away . Callers that fail to ensure that the specified
* CPU cannot go away will execute on a randomly chosen CPU .
2023-04-29 02:47:07 +03:00
* But note well that callers specifying a CPU that never has been
* online will get a splat .
2013-08-01 01:59:24 +04:00
*
* Return : % false if @ work was already on a queue , % true otherwise .
2005-04-17 02:20:36 +04:00
*/
2012-08-03 21:30:44 +04:00
bool queue_work_on ( int cpu , struct workqueue_struct * wq ,
struct work_struct * work )
2005-04-17 02:20:36 +04:00
{
2012-08-03 21:30:44 +04:00
bool ret = false ;
2024-02-21 08:36:14 +03:00
unsigned long irq_flags ;
2008-07-25 12:47:53 +04:00
2024-02-21 08:36:14 +03:00
local_irq_save ( irq_flags ) ;
2008-07-24 08:28:39 +04:00
2024-03-25 20:21:03 +03:00
if ( ! test_and_set_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( work ) ) & &
! clear_pending_if_disabled ( work ) ) {
2010-06-29 12:07:10 +04:00
__queue_work ( cpu , wq , work ) ;
2012-08-03 21:30:44 +04:00
ret = true ;
2008-07-24 08:28:39 +04:00
}
2008-07-25 12:47:53 +04:00
2024-02-21 08:36:14 +03:00
local_irq_restore ( irq_flags ) ;
2005-04-17 02:20:36 +04:00
return ret ;
}
2013-05-07 01:44:55 +04:00
EXPORT_SYMBOL ( queue_work_on ) ;
2005-04-17 02:20:36 +04:00
2019-01-22 21:39:26 +03:00
/**
2023-08-08 04:57:23 +03:00
* select_numa_node_cpu - Select a CPU based on NUMA node
2019-01-22 21:39:26 +03:00
* @ node : NUMA node ID that we want to select a CPU from
*
* This function will attempt to find a " random " cpu available on a given
* node . If there are no CPUs available on the given node it will return
* WORK_CPU_UNBOUND indicating that we should just schedule to any
* available CPU if we need to schedule this work .
*/
2023-08-08 04:57:23 +03:00
static int select_numa_node_cpu ( int node )
2019-01-22 21:39:26 +03:00
{
int cpu ;
/* Delay binding to CPU if node is not valid or online */
if ( node < 0 | | node > = MAX_NUMNODES | | ! node_online ( node ) )
return WORK_CPU_UNBOUND ;
/* Use local node/cpu if we are already there */
cpu = raw_smp_processor_id ( ) ;
if ( node = = cpu_to_node ( cpu ) )
return cpu ;
/* Use "random", otherwise known as "first", online CPU of node */
cpu = cpumask_any_and ( cpumask_of_node ( node ) , cpu_online_mask ) ;
/* If CPU is valid return that, otherwise just defer */
return cpu < nr_cpu_ids ? cpu : WORK_CPU_UNBOUND ;
}
/**
* queue_work_node - queue work on a " random " cpu for a given NUMA node
* @ node : NUMA node that we are targeting the work for
* @ wq : workqueue to use
* @ work : work to queue
*
* We queue the work to a " random " CPU within a given NUMA node . The basic
* idea here is to provide a way to somehow associate work with a given
* NUMA node .
*
* This function will only make a best effort attempt at getting this onto
* the right NUMA node . If no node is requested or the requested node is
* offline then we just fall back to standard queue_work behavior .
*
* Currently the " random " CPU ends up being the first available CPU in the
* intersection of cpu_online_mask and the cpumask of the node , unless we
* are running on the node . In that case we just use the current CPU .
*
* Return : % false if @ work was already on a queue , % true otherwise .
*/
bool queue_work_node ( int node , struct workqueue_struct * wq ,
struct work_struct * work )
{
2024-02-21 08:36:14 +03:00
unsigned long irq_flags ;
2019-01-22 21:39:26 +03:00
bool ret = false ;
/*
* This current implementation is specific to unbound workqueues .
* Specifically we only return the first available CPU for a given
* node instead of cycling through individual CPUs within the node .
*
* If this is used with a per - cpu workqueue then the logic in
* select_numa_node_cpu ( ) would need to be updated to allow for
* some round robin type logic .
*/
WARN_ON_ONCE ( ! ( wq - > flags & WQ_UNBOUND ) ) ;
2024-02-21 08:36:14 +03:00
local_irq_save ( irq_flags ) ;
2019-01-22 21:39:26 +03:00
2024-03-25 20:21:03 +03:00
if ( ! test_and_set_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( work ) ) & &
! clear_pending_if_disabled ( work ) ) {
2023-08-08 04:57:23 +03:00
int cpu = select_numa_node_cpu ( node ) ;
2019-01-22 21:39:26 +03:00
__queue_work ( cpu , wq , work ) ;
ret = true ;
}
2024-02-21 08:36:14 +03:00
local_irq_restore ( irq_flags ) ;
2019-01-22 21:39:26 +03:00
return ret ;
}
EXPORT_SYMBOL_GPL ( queue_work_node ) ;
2017-10-05 02:27:07 +03:00
void delayed_work_timer_fn ( struct timer_list * t )
2005-04-17 02:20:36 +04:00
{
2017-10-05 02:27:07 +03:00
struct delayed_work * dwork = from_timer ( dwork , t , timer ) ;
2005-04-17 02:20:36 +04:00
2012-08-22 00:18:24 +04:00
/* should have been called from irqsafe timer with irq already off */
2013-02-07 06:04:53 +04:00
__queue_work ( dwork - > cpu , dwork - > wq , & dwork - > work ) ;
2005-04-17 02:20:36 +04:00
}
2013-01-24 16:36:31 +04:00
EXPORT_SYMBOL ( delayed_work_timer_fn ) ;
2005-04-17 02:20:36 +04:00
2012-08-03 21:30:46 +04:00
static void __queue_delayed_work ( int cpu , struct workqueue_struct * wq ,
struct delayed_work * dwork , unsigned long delay )
2005-04-17 02:20:36 +04:00
{
2012-08-03 21:30:46 +04:00
struct timer_list * timer = & dwork - > timer ;
struct work_struct * work = & dwork - > work ;
2017-03-06 23:33:42 +03:00
WARN_ON_ONCE ( ! wq ) ;
2022-09-09 00:54:56 +03:00
WARN_ON_ONCE ( timer - > function ! = delayed_work_timer_fn ) ;
2012-12-04 19:40:39 +04:00
WARN_ON_ONCE ( timer_pending ( timer ) ) ;
WARN_ON_ONCE ( ! list_empty ( & work - > entry ) ) ;
2012-08-03 21:30:46 +04:00
2012-12-02 04:23:42 +04:00
/*
* If @ delay is 0 , queue @ dwork - > work immediately . This is for
* both optimization and correctness . The earliest @ timer can
* expire is on the closest next tick and delayed_work users depend
* on there being no such delay when @ delay is 0.
*/
if ( ! delay ) {
__queue_work ( cpu , wq , & dwork - > work ) ;
return ;
}
2013-02-07 06:04:53 +04:00
dwork - > wq = wq ;
2012-08-08 20:38:42 +04:00
dwork - > cpu = cpu ;
2012-08-03 21:30:46 +04:00
timer - > expires = jiffies + delay ;
workqueue: Avoid using isolated cpus' timers on queue_delayed_work
When __queue_delayed_work() is called, it chooses a cpu for handling the
timer interrupt. As of today, it will pick either the cpu passed as
parameter or the last cpu used for this.
This is not good if a system does use CPU isolation, because it can take
away some valuable cpu time to:
1 - deal with the timer interrupt,
2 - schedule-out the desired task,
3 - queue work on a random workqueue, and
4 - schedule the desired task back to the cpu.
So to fix this, during __queue_delayed_work(), if cpu isolation is in
place, pick a random non-isolated cpu to handle the timer interrupt.
As an optimization, if the current cpu is not isolated, use it instead
of looking for another candidate.
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
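The CPU choice described in this message can be sketched as a small userspace C function. It is only an illustration: the bitmask representation, the SKETCH_* constants and the "lowest housekeeping CPU" fallback are assumptions made here, and the kernel's own housekeeping helpers may pick a different CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CPU_UNBOUND  (-1)  /* hypothetical stand-in for WORK_CPU_UNBOUND */
#define SKETCH_TIMER_GLOBAL (-2)  /* hypothetical marker: timer not pinned to a CPU */

/* True if @cpu is set in @mask (CPUs 0..63 only in this sketch). */
static bool mask_test(uint64_t mask, int cpu)
{
	return cpu >= 0 && cpu < 64 && ((mask >> cpu) & 1);
}

/* Lowest CPU set in @mask, or -1 if the mask is empty. */
static int mask_first(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask_test(mask, cpu))
			return cpu;
	return -1;
}

/*
 * Decide where the delayed-work timer should be armed.
 * @isolation_on:  CPU isolation is in effect for timers
 * @hk_mask:       CPUs allowed to handle timers (housekeeping CPUs)
 * @this_cpu:      CPU the caller is running on
 * @requested_cpu: CPU passed by the caller, or SKETCH_CPU_UNBOUND
 */
static int sketch_pick_timer_cpu(bool isolation_on, uint64_t hk_mask,
				 int this_cpu, int requested_cpu)
{
	if (isolation_on) {
		/* Optimization: stay local if the current CPU is not isolated. */
		if (mask_test(hk_mask, this_cpu))
			return this_cpu;
		/* Otherwise fall back to some non-isolated CPU. */
		return mask_first(hk_mask);
	}

	/* No isolation: unbound work uses a global timer, else the given CPU. */
	if (requested_cpu == SKETCH_CPU_UNBOUND)
		return SKETCH_TIMER_GLOBAL;
	return requested_cpu;
}
```

Note that with isolation in place the requested CPU only affects where the work item later runs, not where the timer fires.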
2024-01-30 04:00:46 +03:00
if ( housekeeping_enabled ( HK_TYPE_TIMER ) ) {
/* If the current cpu is a housekeeping cpu, use it. */
cpu = smp_processor_id ( ) ;
if ( ! housekeeping_test_cpu ( cpu , HK_TYPE_TIMER ) )
cpu = housekeeping_any_cpu ( HK_TYPE_TIMER ) ;
2016-02-10 00:11:26 +03:00
add_timer_on ( timer , cpu ) ;
2024-01-30 04:00:46 +03:00
} else {
if ( likely ( cpu = = WORK_CPU_UNBOUND ) )
A large set of updates and features for timers and timekeeping:
- The hierarchical timer pull model
When timer wheel timers are armed they are placed into the timer wheel
of a CPU which is likely to be busy at the time of expiry. This is done
to avoid wakeups on potentially idle CPUs.
This is wrong in several aspects:
1) The heuristics to select the target CPU are wrong by
definition as the chance to get the prediction right is close
to zero.
2) Due to #1 it is possible that timers are accumulated on a
single target CPU
3) The required computation in the enqueue path is just overhead for
dubious value especially under the consideration that the vast
majority of timer wheel timers are either canceled or rearmed
before they expire.
The timer pull model avoids the above by removing the target
computation on enqueue and queueing timers always on the CPU on which
they get armed.
This is achieved by having separate wheels for CPU pinned timers and
global timers which do not care about where they expire.
As long as a CPU is busy it handles both the pinned and the global
timers which are queued on the CPU local timer wheels.
When a CPU goes idle it evaluates its own timer wheels:
- If the first expiring timer is a pinned timer, then the global
timers can be ignored as the CPU will wake up before they expire.
- If the first expiring timer is a global timer, then the expiry time
is propagated into the timer pull hierarchy and the CPU makes sure
to wake up for the first pinned timer.
The timer pull hierarchy organizes CPUs in groups of eight at the
lowest level and at the next levels groups of eight groups up to the
point where no further aggregation of groups is required, i.e. the
number of levels is log8(NR_CPUS). The magic number of eight has been
established by experimentation, but can be adjusted if needed.
In each group one busy CPU acts as the migrator. It's only one CPU to
avoid lock contention on remote timer wheels.
The migrator CPU checks in its own timer wheel handling whether there
are other CPUs in the group which have gone idle and have global timers
to expire. If there are global timers to expire, the migrator locks the
remote CPU timer wheel and handles the expiry.
Depending on the group level in the hierarchy this handling can require
to walk the hierarchy downwards to the CPU level.
Special care is taken when the last CPU goes idle. At this point the
CPU is the systemwide migrator at the top of the hierarchy and it
therefore cannot delegate to the hierarchy. It needs to arm its own
timer device to expire either at the first expiring timer in the
hierarchy or at the first CPU local timer, whichever expires first.
This completely removes the overhead from the enqueue path, which is
e.g. for networking a true hotpath and trades it for a slightly more
complex idle path.
This has been in development for a couple of years and the final series
has been extensively tested by various teams from silicon vendors and
ran through extensive CI.
There have been slight performance improvements observed on network
centric workloads and an Intel team confirmed that this allows them to
power down a die completely on a multi-die socket for the first time in
a mostly idle scenario.
There is only one outstanding ~1.5% regression on a specific overloaded
netperf test which is currently being investigated, but the rest is either
positive or neutral performance wise and positive on the power
management side.
- Fixes for the timekeeping interpolation code for cross-timestamps:
cross-timestamps are used for PTP to get snapshots from hardware timers
and interpolate them back to clock MONOTONIC. The changes address a
few corner cases in the interpolation code which got the math and logic
wrong.
- Simplification of the clocksource watchdog retry logic to automatically
adjust to handle larger systems correctly instead of having more
incomprehensible command line parameters.
- Treewide consolidation of the VDSO data structures.
- The usual small improvements and cleanups all over the place.
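As a worked illustration of the "groups of eight" hierarchy in the timer pull model above, the number of levels for a given CPU count can be computed by repeatedly grouping by eight. This is a userspace sketch only; sketch_hierarchy_levels() and SKETCH_GROUP_SIZE are hypothetical names, and treating a single CPU as needing zero levels is an assumption.

```c
#include <assert.h>

/* Group fan-out described in the text ("groups of eight"). */
#define SKETCH_GROUP_SIZE 8

/*
 * Number of hierarchy levels needed until everything is aggregated
 * into a single top-level group, roughly log8(ncpus) rounded up.
 */
static int sketch_hierarchy_levels(int ncpus)
{
	int levels = 0;

	assert(ncpus > 0);

	/* Each pass folds up to eight children into one parent group. */
	while (ncpus > 1) {
		ncpus = (ncpus + SKETCH_GROUP_SIZE - 1) / SKETCH_GROUP_SIZE;
		levels++;
	}
	return levels;
}
```

For example, 64 CPUs fit into two levels, while a 65th CPU forces a third.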
Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
* tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
timer/migration: Fix quick check reporting late expiry
tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
vdso/datapage: Quick fix - use asm/page-def.h for ARM64
timers: Assert no next dyntick timer look-up while CPU is offline
tick: Assume timekeeping is correctly handed over upon last offline idle call
tick: Shut down low-res tick from dying CPU
tick: Split nohz and highres features from nohz_mode
tick: Move individual bit features to debuggable mask accesses
tick: Move got_idle_tick away from common flags
tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
tick: Start centralizing tick related CPU hotplug operations
tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
tick: Use IS_ENABLED() whenever possible
tick/sched: Remove useless oneshot ifdeffery
tick/nohz: Remove duplicate between lowres and highres handlers
tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
hrtimer: Select housekeeping CPU during migration
...
2024-03-12 00:38:26 +03:00
add_timer_global ( timer ) ;
2024-01-30 04:00:46 +03:00
else
add_timer_on ( timer , cpu ) ;
}
2005-04-17 02:20:36 +04:00
}
2006-07-30 14:03:42 +04:00
/**
* queue_delayed_work_on - queue work on specific CPU after delay
* @ cpu : CPU number to execute work on
* @ wq : workqueue to use
2006-12-22 12:06:52 +03:00
* @ dwork : work to queue
2006-07-30 14:03:42 +04:00
* @ delay : number of jiffies to wait before queueing
*
2013-08-01 01:59:24 +04:00
* Return : % false if @ dwork was already on a queue , % true otherwise . If
2012-08-03 21:30:46 +04:00
* @ delay is zero and @ dwork is idle , it will be scheduled for immediate
* execution .
2006-07-30 14:03:42 +04:00
*/
2012-08-03 21:30:44 +04:00
bool queue_delayed_work_on ( int cpu , struct workqueue_struct * wq ,
struct delayed_work * dwork , unsigned long delay )
2006-06-29 00:50:33 +04:00
{
2006-11-22 17:54:01 +03:00
struct work_struct * work = & dwork - > work ;
2012-08-03 21:30:44 +04:00
bool ret = false ;
2024-02-21 08:36:14 +03:00
unsigned long irq_flags ;
2006-06-29 00:50:33 +04:00
2012-08-03 21:30:45 +04:00
/* read the comment in __queue_work() */
2024-02-21 08:36:14 +03:00
local_irq_save ( irq_flags ) ;
2006-06-29 00:50:33 +04:00
2024-03-25 20:21:03 +03:00
if ( ! test_and_set_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( work ) ) & &
! clear_pending_if_disabled ( work ) ) {
2012-08-03 21:30:46 +04:00
__queue_delayed_work ( cpu , wq , dwork , delay ) ;
2012-08-03 21:30:44 +04:00
ret = true ;
2006-06-29 00:50:33 +04:00
}
2008-05-01 15:35:14 +04:00
2024-02-21 08:36:14 +03:00
local_irq_restore ( irq_flags ) ;
2006-06-29 00:50:33 +04:00
return ret ;
}
2013-05-07 01:44:55 +04:00
EXPORT_SYMBOL ( queue_delayed_work_on ) ;
2010-07-02 12:03:51 +04:00
2012-08-03 21:30:47 +04:00
/**
* mod_delayed_work_on - modify delay of or queue a delayed work on specific CPU
* @ cpu : CPU number to execute work on
* @ wq : workqueue to use
* @ dwork : work to queue
* @ delay : number of jiffies to wait before queueing
*
* If @ dwork is idle , equivalent to queue_delayed_work_on ( ) ; otherwise ,
* modify @ dwork ' s timer so that it expires after @ delay . If @ delay is
* zero , @ dwork is guaranteed to be scheduled immediately regardless of its
* current state .
*
2013-08-01 01:59:24 +04:00
* Return : % false if @ dwork was idle and queued , % true if @ dwork was
2012-08-03 21:30:47 +04:00
* pending and its timer was modified .
*
2012-08-22 00:18:24 +04:00
* This function is safe to call from any context including IRQ handler .
2012-08-03 21:30:47 +04:00
* See try_to_grab_pending ( ) for details .
*/
bool mod_delayed_work_on ( int cpu , struct workqueue_struct * wq ,
struct delayed_work * dwork , unsigned long delay )
{
2024-02-21 08:36:14 +03:00
unsigned long irq_flags ;
workqueue: Remove WORK_OFFQ_CANCELING
cancel[_delayed]_work_sync() guarantees that it can shut down
self-requeueing work items. To achieve that, it grabs and then holds
WORK_STRUCT_PENDING bit set while flushing the currently executing instance.
As the PENDING bit is set, all queueing attempts including the
self-requeueing ones fail and once the currently executing instance is
flushed, the work item should be idle as long as someone else isn't actively
queueing it.
This means that the cancel_work_sync path may hold the PENDING bit set while
flushing the target work item. This isn't a problem for the queueing path -
it can just fail which is the desired effect. It doesn't affect flush. It
doesn't matter to cancel_work either as it can just report that the work
item has been successfully canceled. However, if there's another cancel_work_sync
attempt on the work item, it can't simply fail or report success and that
would breach the guarantee that it should provide. cancel_work_sync has to
wait for and grab that PENDING bit and go through the motions.
WORK_OFFQ_CANCELING and wq_cancel_waitq are what implement this
cancel_work_sync to cancel_work_sync wait mechanism. When a work item is
being canceled, WORK_OFFQ_CANCELING is also set on it and other
cancel_work_sync attempts wait on the bit to be cleared using the wait
queue.
While this works, it's an isolated wart which doesn't jibe with the rest of
flush and cancel mechanisms and forces enable_work() and disable_work() to
require a sleepable context, which hampers their usability.
Now that a work item can be disabled, we can use that to block queueing
while cancel_work_sync is in progress. Instead of holding the PENDING bit,
it can temporarily disable the work item, flush and then re-enable it as
that'd achieve the same end result of blocking queueings while canceling and
thus enable canceling of self-requeueing work items.
- WORK_OFFQ_CANCELING and the surrounding mechanisms are removed.
- work_grab_pending() is now simpler, no longer has to wait for a blocking
operation and thus can be called from any context.
- With work_grab_pending() simplified, no need to use try_to_grab_pending()
directly. All users are converted to use work_grab_pending().
- __cancel_work_sync() is updated to __cancel_work() with
WORK_CANCEL_DISABLE to cancel and plug racing queueing attempts. It then
flushes and re-enables the work item if necessary.
- These changes allow disable_work() and enable_work() to be called from any
context.
v2: Lai pointed out that mod_delayed_work_on() needs to check the disable
count before queueing the delayed work item. Added
clear_pending_if_disabled() call.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
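The interaction between the PENDING bit and the disable count described in this message can be modelled in userspace C. This is a minimal single-threaded sketch under stated assumptions: struct sketch_work, the SKETCH_* bit layout and all sketch_*() helpers are hypothetical, and the locking and pending-grab steps of the real queueing and cancel paths are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit layout: bit 0 is PENDING, bits 8..15 hold the disable depth. */
#define SKETCH_PENDING      1UL
#define SKETCH_DISABLE_ONE  (1UL << 8)
#define SKETCH_DISABLE_MASK (0xffUL << 8)

struct sketch_work {
	atomic_ulong data;  /* PENDING bit plus disable depth */
	int queued;         /* how many times the item reached a queue */
};

static void sketch_init(struct sketch_work *w)
{
	atomic_init(&w->data, 0UL);
	w->queued = 0;
}

/* Try to queue: succeeds only if not already pending and not disabled. */
static bool sketch_queue(struct sketch_work *w)
{
	/* Atomically set PENDING and look at the previous value. */
	unsigned long old = atomic_fetch_or(&w->data, SKETCH_PENDING);

	if (old & SKETCH_PENDING)
		return false;   /* someone else already owns PENDING */

	if (old & SKETCH_DISABLE_MASK) {
		/* Disabled: drop the PENDING bit we just took and fail. */
		atomic_fetch_and(&w->data, ~SKETCH_PENDING);
		return false;
	}

	w->queued++;
	return true;
}

/* Block further queueing; nests via the depth counter. */
static void sketch_disable(struct sketch_work *w)
{
	atomic_fetch_add(&w->data, SKETCH_DISABLE_ONE);
}

/* Undo one sketch_disable(). */
static void sketch_enable(struct sketch_work *w)
{
	assert(atomic_load(&w->data) & SKETCH_DISABLE_MASK);
	atomic_fetch_sub(&w->data, SKETCH_DISABLE_ONE);
}

/* Model a worker picking the item up: PENDING is cleared before it runs. */
static void sketch_run(struct sketch_work *w)
{
	atomic_fetch_and(&w->data, ~SKETCH_PENDING);
}
```

In this model a self-requeueing item is stopped by calling sketch_disable() before flushing and sketch_enable() afterwards, so no caller has to sleep waiting for a separate canceling flag.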
2024-03-25 20:21:03 +03:00
bool ret ;
2010-07-02 12:03:51 +04:00
2024-03-25 20:21:03 +03:00
ret = work_grab_pending ( & dwork - > work , WORK_CANCEL_DELAYED , & irq_flags ) ;
2007-05-09 13:34:16 +04:00
workqueue: Remove WORK_OFFQ_CANCELING
cancel[_delayed]_work_sync() guarantees that it can shut down
self-requeueing work items. To achieve that, it grabs and then holds
WORK_STRUCT_PENDING bit set while flushing the currently executing instance.
As the PENDING bit is set, all queueing attempts including the
self-requeueing ones fail and once the currently executing instance is
flushed, the work item should be idle as long as someone else isn't actively
queueing it.
This means that the cancel_work_sync path may hold the PENDING bit set while
flushing the target work item. This isn't a problem for the queueing path -
it can just fail which is the desired effect. It doesn't affect flush. It
doesn't matter to cancel_work either as it can just report that the work
item has successfully canceled. However, if there's another cancel_work_sync
attempt on the work item, it can't simply fail or report success and that
would breach the guarantee that it should provide. cancel_work_sync has to
wait for and grab that PENDING bit and go through the motions.
WORK_OFFQ_CANCELING and wq_cancel_waitq are what implement this
cancel_work_sync to cancel_work_sync wait mechanism. When a work item is
being canceled, WORK_OFFQ_CANCELING is also set on it and other
cancel_work_sync attempts wait on the bit to be cleared using the wait
queue.
While this works, it's an isolated wart which doesn't jive with the rest of
flush and cancel mechanisms and forces enable_work() and disable_work() to
require a sleepable context, which hampers their usability.
Now that a work item can be disabled, we can use that to block queueing
while cancel_work_sync is in progress. Instead of holding PENDING the bit,
it can temporarily disable the work item, flush and then re-enable it as
that'd achieve the same end result of blocking queueings while canceling and
thus enable canceling of self-requeueing work items.
- WORK_OFFQ_CANCELING and the surrounding mechanims are removed.
- work_grab_pending() is now simpler, no longer has to wait for a blocking
operation and thus can be called from any context.
- With work_grab_pending() simplified, no need to use try_to_grab_pending()
directly. All users are converted to use work_grab_pending().
- __cancel_work_sync() is updated to __cancel_work() with
WORK_CANCEL_DISABLE to cancel and plug racing queueing attempts. It then
flushes and re-enables the work item if necessary.
- These changes allow disable_work() and enable_work() to be called from any
context.
v2: Lai pointed out that mod_delayed_work_on() needs to check the disable
count before queueing the delayed work item. Added
clear_pending_if_disabled() call.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
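The disable-flush-enable sequence described above can be sketched as a small single-threaded userspace model. This is only an illustration under stated assumptions: struct toy_work and every toy_*() helper are hypothetical names, there is no real concurrency, and the "flush" step is reduced to letting a running instance make one self-requeue attempt.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct toy_work {
	bool pending;		/* stands in for WORK_STRUCT_PENDING */
	unsigned int disable;	/* stands in for the disable count */
	bool running;		/* an instance is currently executing */
};

static void toy_disable(struct toy_work *w)
{
	w->disable++;
}

static void toy_enable(struct toy_work *w)
{
	assert(w->disable > 0);	/* enable must pair with an earlier disable */
	w->disable--;
}

/* Claim PENDING first, then back off if the item is disabled. */
static bool toy_queue(struct toy_work *w)
{
	if (w->pending)
		return false;		/* already queued */
	w->pending = true;
	if (w->disable) {
		w->pending = false;	/* same idea as clear_pending_if_disabled() */
		return false;
	}
	return true;
}

/* Cancel, plug queueing, "flush", then unplug. Returns true if it was pending. */
static bool toy_cancel_sync(struct toy_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;		/* steal the pending instance */
	toy_disable(w);			/* racing queueing attempts now fail */
	if (w->running) {
		(void)toy_queue(w);	/* self-requeue attempt is rejected */
		w->running = false;	/* running instance has finished */
	}
	toy_enable(w);
	return was_pending;
}
```

In this model a second canceller never has to wait for PENDING: it bumps the disable count as well, which is the property that removes the need for a dedicated wait queue.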
2024-03-25 20:21:03 +03:00
if ( ! clear_pending_if_disabled ( & dwork - > work ) )
2012-08-03 21:30:47 +04:00
__queue_delayed_work ( cpu , wq , dwork , delay ) ;
2024-03-25 20:21:03 +03:00
local_irq_restore ( irq_flags ) ;
2006-06-29 00:50:33 +04:00
return ret ;
}
2012-08-03 21:30:47 +04:00
EXPORT_SYMBOL_GPL ( mod_delayed_work_on ) ;
2018-03-14 22:45:13 +03:00
static void rcu_work_rcufn ( struct rcu_head * rcu )
{
struct rcu_work * rwork = container_of ( rcu , struct rcu_work , rcu ) ;
/* read the comment in __queue_work() */
local_irq_disable ( ) ;
__queue_work ( WORK_CPU_UNBOUND , rwork - > wq , & rwork - > work ) ;
local_irq_enable ( ) ;
}
/**
* queue_rcu_work - queue work after a RCU grace period
* @ wq : workqueue to use
* @ rwork : work to queue
*
* Return : % false if @ rwork was already pending , % true otherwise . Note
* that a full RCU grace period is guaranteed only after a % true return .
2019-03-02 00:57:25 +03:00
* While @ rwork is guaranteed to be executed after a % false return , the
2018-03-14 22:45:13 +03:00
* execution may happen before a full RCU grace period has passed .
*/
bool queue_rcu_work ( struct workqueue_struct * wq , struct rcu_work * rwork )
{
struct work_struct * work = & rwork - > work ;
2024-03-25 20:21:03 +03:00
/*
* rcu_work can ' t be canceled or disabled . Warn if the user reached
* inside @ rwork and disabled the inner work .
*/
if ( ! test_and_set_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( work ) ) & &
! WARN_ON_ONCE ( clear_pending_if_disabled ( work ) ) ) {
2018-03-14 22:45:13 +03:00
rwork - > wq = wq ;
workqueue: Make queue_rcu_work() use call_rcu_hurry()
Earlier commits in this series allow battery-powered systems to build
their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option.
This Kconfig option causes call_rcu() to delay its callbacks in order
to batch them. This means that a given RCU grace period covers more
callbacks, thus reducing the number of grace periods, in turn reducing
the amount of energy consumed, which increases battery lifetime which
can be a very good thing. This is not a subtle effect: In some important
use cases, the battery lifetime is increased by more than 10%.
This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.
Delaying callbacks is normally not a problem because most callbacks do
nothing but free memory. If the system is short on memory, a shrinker
will kick all currently queued lazy callbacks out of their laziness,
thus freeing their memory in short order. Similarly, the rcu_barrier()
function, which blocks until all currently queued callbacks are invoked,
will also kick lazy callbacks, thus enabling rcu_barrier() to complete
in a timely manner.
However, there are some cases where laziness is not a good option.
For example, synchronize_rcu() invokes call_rcu(), and blocks until
the newly queued callback is invoked. It would not be good for
synchronize_rcu() to block for ten seconds, even on an idle system.
Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of
call_rcu(). The arrival of a non-lazy call_rcu_hurry() callback on a
given CPU kicks any lazy callbacks that might be already queued on that
CPU. After all, if there is going to be a grace period, all callbacks
might as well get full benefit from it.
Yes, this could be done the other way around by creating a
call_rcu_lazy(), but earlier experience with this approach and
feedback at the 2022 Linux Plumbers Conference shifted the approach
to call_rcu() being lazy with call_rcu_hurry() for the few places
where laziness is inappropriate.
And another call_rcu() instance that cannot be lazy is the one
in queue_rcu_work(), given that callers to queue_rcu_work() are
not necessarily OK with long delays.
Therefore, make queue_rcu_work() use call_rcu_hurry() in order to revert
to the old behavior.
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
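The lazy versus hurry behavior described above can be illustrated with a toy callback queue. This is a userspace sketch, not RCU: struct toy_cb_queue and the toy_*() functions are hypothetical, a "grace period" is modeled as simply invoking everything queued so far, and timers, shrinkers and per-CPU queues are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CBS 16

typedef void (*toy_cb_t)(void *arg);

/* Hypothetical model of one CPU's callback queue. */
struct toy_cb_queue {
	toy_cb_t fn[TOY_MAX_CBS];
	void *arg[TOY_MAX_CBS];
	size_t nr;			/* callbacks currently queued */
	unsigned int grace_periods;	/* grace periods started so far */
};

static int toy_enqueue(struct toy_cb_queue *q, toy_cb_t fn, void *arg)
{
	assert(fn != NULL);
	if (q->nr == TOY_MAX_CBS)
		return -1;		/* queue full */
	q->fn[q->nr] = fn;
	q->arg[q->nr] = arg;
	q->nr++;
	return 0;
}

/* One grace period covers every callback queued so far. */
static void toy_grace_period(struct toy_cb_queue *q)
{
	size_t i;

	if (q->nr == 0)
		return;
	q->grace_periods++;
	for (i = 0; i < q->nr; i++)
		q->fn[i](q->arg[i]);
	q->nr = 0;
}

/* Lazy: just batch the callback, do not start a grace period. */
static int toy_call_lazy(struct toy_cb_queue *q, toy_cb_t fn, void *arg)
{
	return toy_enqueue(q, fn, arg);
}

/* Hurry: queue the callback and kick everything already batched. */
static int toy_call_hurry(struct toy_cb_queue *q, toy_cb_t fn, void *arg)
{
	int ret = toy_enqueue(q, fn, arg);

	if (ret == 0)
		toy_grace_period(q);
	return ret;
}

/* Example callback: count invocations. */
static void toy_count(void *arg)
{
	++*(int *)arg;
}
```

Two lazy calls followed by one hurry call end up sharing a single modeled grace period, which is the batching benefit the message describes, while the hurry caller is not left waiting behind the lazy ones.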
2022-10-16 19:23:03 +03:00
call_rcu_hurry ( & rwork - > rcu , rcu_work_rcufn ) ;
2018-03-14 22:45:13 +03:00
return true ;
}
return false ;
}
EXPORT_SYMBOL ( queue_rcu_work ) ;
2014-07-15 13:24:15 +04:00
static struct worker * alloc_worker ( int node )
2010-06-29 12:07:11 +04:00
{
struct worker * worker ;
2014-07-15 13:24:15 +04:00
worker = kzalloc_node ( sizeof ( * worker ) , GFP_KERNEL , node ) ;
2010-06-29 12:07:12 +04:00
if ( worker ) {
INIT_LIST_HEAD ( & worker - > entry ) ;
2010-06-29 12:07:12 +04:00
INIT_LIST_HEAD ( & worker - > scheduled ) ;
2014-05-20 13:46:31 +04:00
INIT_LIST_HEAD ( & worker - > node ) ;
2010-06-29 12:07:14 +04:00
/* on creation a worker is in !idle && prep state */
worker - > flags = WORKER_PREP ;
2010-06-29 12:07:12 +04:00
}
2010-06-29 12:07:11 +04:00
return worker ;
}
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, worker_pools need to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
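The CPU 0/2 example above can be worked through with plain bitmasks. The sketch below is a simplified userspace model, assuming two pods {0,1} and {2,3}: struct toy_attrs, toy_calc_pod_attrs() and toy_attrs_equal() are hypothetical stand-ins, CPU hotplug is ignored, and the model simply requires a non-empty intersection instead of reproducing whatever fallback the real code applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two masks carried by workqueue_attrs. */
struct toy_attrs {
	uint64_t cpumask;	/* CPUs the workqueue allows (bit N = CPU N) */
	uint64_t pod_cpumask;	/* the pod's subset of cpumask */
};

/* Derive per-pod attrs from a workqueue cpumask and the CPUs in one pod. */
static struct toy_attrs toy_calc_pod_attrs(uint64_t wq_cpumask, uint64_t pod_cpus)
{
	struct toy_attrs attrs = {
		.cpumask = wq_cpumask,
		.pod_cpumask = wq_cpumask & pod_cpus,
	};

	/* Model restriction: the workqueue must allow some CPU in the pod. */
	assert(attrs.pod_cpumask != 0);
	return attrs;
}

/* Two pools can be shared only if both masks match. */
static bool toy_attrs_equal(const struct toy_attrs *a, const struct toy_attrs *b)
{
	return a->cpumask == b->cpumask && a->pod_cpumask == b->pod_cpumask;
}
```

With a workqueue allowing CPUs 0 and 2 (mask 0x5) and another allowing 0, 2 and 3 (mask 0xd), both get a pod subset of 0x1 in the first pod, yet the attrs differ because the full cpumasks differ. That is the loss of sharing the message calls out as the only user-visible change.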
2023-08-08 04:57:25 +03:00
static cpumask_t * pool_allowed_cpus ( struct worker_pool * pool )
{
workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools
each serving one L3 cache domain.
While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.
While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.
This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside, i.e., work items start executing within their affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.
After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:
* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
->__pod_cpumask so that the workers are allowed to run on any CPU that
the associated workqueues allow.
* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
the field to a CPU within the pod.
This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.
There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.
While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.
v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
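The two decisions described above, which mask a pool's workers may run on and where an idle worker should be woken, can be sketched with bitmasks. This is a userspace model under stated assumptions: toy_allowed_cpus() and toy_pick_wake_cpu() are hypothetical names, CPUs are limited to 0..63, and the replacement wake CPU is simply the lowest CPU in the pod, which is a simplification and may not match how the kernel distributes the choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Which CPUs may a pool's workers run on?
 * pool_cpu < 0 models an unbound pool; per-CPU pools always use cpumask.
 */
static uint64_t toy_allowed_cpus(int pool_cpu, bool affn_strict,
				 uint64_t cpumask, uint64_t pod_cpumask)
{
	if (pool_cpu < 0 && affn_strict)
		return pod_cpumask;	/* strict: stay inside the pod */
	return cpumask;			/* non-strict: anywhere the wq allows */
}

/*
 * Best-effort wake-up placement: keep wake_cpu if it is already inside
 * the pod, otherwise redirect to a CPU in the pod (lowest one here).
 */
static int toy_pick_wake_cpu(int wake_cpu, uint64_t pod_cpumask)
{
	int cpu;

	assert(wake_cpu >= 0 && wake_cpu < 64);
	if (pod_cpumask == 0 || ((pod_cpumask >> wake_cpu) & 1))
		return wake_cpu;
	for (cpu = 0; cpu < 64; cpu++)
		if ((pod_cpumask >> cpu) & 1)
			return cpu;
	return wake_cpu;		/* not reached: pod_cpumask is non-zero */
}
```

The first helper captures the hard limit (the scheduler may never move the worker outside it), while the second is only a hint applied at wake-up, which is why the message calls the scheme a best-effort optimization.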
2023-08-08 04:57:25 +03:00
if ( pool - > cpu < 0 & & pool - > attrs - > affn_strict )
return pool - > attrs - > __pod_cpumask ;
else
return pool - > attrs - > cpumask ;
2023-08-08 04:57:25 +03:00
}
2014-05-20 13:46:35 +04:00
/**
* worker_attach_to_pool ( ) - attach a worker to a pool
* @ worker : worker to be attached
* @ pool : the target pool
*
* Attach @ worker to @ pool . Once attached , the % WORKER_UNBOUND flag and
* cpu - binding of @ worker are kept coordinated with the pool across
* cpu - [ un ] hotplugs .
*/
static void worker_attach_to_pool ( struct worker * worker ,
2024-02-05 00:28:06 +03:00
struct worker_pool * pool )
2014-05-20 13:46:35 +04:00
{
2018-05-18 18:47:13 +03:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2014-05-20 13:46:35 +04:00
/*
2024-02-05 00:28:06 +03:00
* The wq_pool_attach_mutex ensures % POOL_DISASSOCIATED remains stable
* across this function . See the comments above the flag definition for
* details . BH workers are , while per - CPU , always DISASSOCIATED .
2014-05-20 13:46:35 +04:00
*/
2024-02-05 00:28:06 +03:00
if ( pool - > flags & POOL_DISASSOCIATED ) {
2014-05-20 13:46:35 +04:00
worker - > flags | = WORKER_UNBOUND ;
2024-02-05 00:28:06 +03:00
} else {
WARN_ON_ONCE ( pool - > flags & POOL_BH ) ;
2021-01-12 13:26:49 +03:00
kthread_set_per_cpu ( worker - > task , pool - > cpu ) ;
2024-02-05 00:28:06 +03:00
}
2014-05-20 13:46:35 +04:00
2021-01-15 21:08:36 +03:00
if ( worker - > rescue_wq )
2023-08-08 04:57:25 +03:00
set_cpus_allowed_ptr ( worker - > task , pool_allowed_cpus ( pool ) ) ;
2021-01-15 21:08:36 +03:00
2014-05-20 13:46:35 +04:00
list_add_tail ( & worker - > node , & pool - > workers ) ;
2018-05-18 18:47:13 +03:00
worker - > pool = pool ;
2014-05-20 13:46:35 +04:00
2018-05-18 18:47:13 +03:00
mutex_unlock ( & wq_pool_attach_mutex ) ;
2014-05-20 13:46:35 +04:00
}
workqueue: async worker destruction
worker destruction includes these parts of code:
adjust pool's stats
remove the worker from idle list
detach the worker from the pool
kthread_stop() to wait for the worker's task exit
free the worker struct
We can find out that there is no essential work to do after
kthread_stop(), which means destroy_worker() doesn't need to wait for
the worker's task exit, so we can remove kthread_stop() and free the
worker struct in the worker exiting path.
However, put_unbound_pool() still needs to sync all the workers'
destruction before destroying the pool; otherwise, the workers may
access the invalid pool when they are exiting.
So we also move the code of "detach the worker" to the exiting
path and let put_unbound_pool() sync with this code via
detach_completion.
The code of "detach the worker" is wrapped in a new function
"worker_detach_from_pool()" although worker_detach_from_pool() is only
called once (in worker_thread()) after this patch, but we need to wrap
it for these reasons:
1) The code of "detach the worker" is not short enough to unfold them
in worker_thread().
2) the name of "worker_detach_from_pool()" is self-documenting, and we add
some comments above the function.
3) it will be shared by rescuer in later patch which allows rescuer
and normal thread use the same attach/detach frameworks.
The worker id is freed when detaching which happens before the worker
is fully dead, but this id of the dying worker may be re-used for a
new worker, so the dying worker's task name is changed to
"worker/dying" to avoid two or several workers having the same name.
Since "detach the worker" is moved out from destroy_worker(),
destroy_worker() doesn't require manager_mutex, so the
"lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is
removed, and destroy_worker() is not protected by manager_mutex in
put_unbound_pool().
tj: Minor description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
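The synchronization between exiting workers and pool destruction described above can be modeled without threads. The sketch below is a single-threaded illustration only: struct toy_pool, struct toy_worker and the toy_*() helpers are hypothetical, and the detach_done flag merely stands in for signalling detach_completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a pool that must outlive its attached workers. */
struct toy_pool {
	unsigned int nr_attached;	/* workers still attached */
	bool dying;			/* destruction has started */
	bool detach_done;		/* stands in for detach_completion */
};

struct toy_worker {
	struct toy_pool *pool;
};

static void toy_attach(struct toy_worker *w, struct toy_pool *p)
{
	w->pool = p;
	p->nr_attached++;
}

/* Runs in the worker's own exit path, not in the destroyer. */
static void toy_detach(struct toy_worker *w)
{
	struct toy_pool *p = w->pool;

	assert(p != NULL && p->nr_attached > 0);
	w->pool = NULL;			/* worker must not touch the pool again */
	p->nr_attached--;
	if (p->dying && p->nr_attached == 0)
		p->detach_done = true;	/* last one out wakes the destroyer */
}

/* Start destroying the pool; returns true if the caller has to wait. */
static bool toy_put_pool_begin(struct toy_pool *p)
{
	p->dying = true;
	if (p->nr_attached == 0) {
		p->detach_done = true;
		return false;
	}
	return true;
}
```

The destroyer no longer waits for each worker task to exit; it only waits until the last worker has detached, after which no worker can reach the pool.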
2014-05-20 13:46:29 +04:00
/**
* worker_detach_from_pool ( ) - detach a worker from its pool
* @ worker : worker which is attached to its pool
*
2014-05-20 13:46:35 +04:00
* Undo the attaching which had been done in worker_attach_to_pool ( ) . The
* caller worker shouldn ' t access the pool after being detached unless it has
* another reference to the pool .
2014-05-20 13:46:29 +04:00
*/
2018-05-18 18:47:13 +03:00
static void worker_detach_from_pool ( struct worker * worker )
2014-05-20 13:46:29 +04:00
{
2018-05-18 18:47:13 +03:00
struct worker_pool * pool = worker - > pool ;
2014-05-20 13:46:29 +04:00
struct completion * detach_completion = NULL ;
2024-02-05 00:28:06 +03:00
/* there is one permanent BH worker per CPU which should never detach */
WARN_ON_ONCE ( pool - > flags & POOL_BH ) ;
2018-05-18 18:47:13 +03:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2018-05-18 18:47:13 +03:00
2021-01-12 13:26:49 +03:00
kthread_set_per_cpu ( worker - > task , - 1 ) ;
2014-05-20 13:46:31 +04:00
list_del ( & worker - > node ) ;
2018-05-18 18:47:13 +03:00
worker - > pool = NULL ;
2023-01-12 19:14:31 +03:00
if ( list_empty ( & pool - > workers ) & & list_empty ( & pool - > dying_workers ) )
2014-05-20 13:46:29 +04:00
detach_completion = pool - > detach_completion ;
2018-05-18 18:47:13 +03:00
mutex_unlock ( & wq_pool_attach_mutex ) ;
2014-05-20 13:46:29 +04:00
2014-06-03 11:32:52 +04:00
/* clear leftover flags without pool->lock after it is detached */
worker - > flags & = ~ ( WORKER_UNBOUND | WORKER_REBOUND ) ;
2014-05-20 13:46:29 +04:00
if ( detach_completion )
complete ( detach_completion ) ;
}
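The detach path above can be approximated in userspace as follows. This is only a sketch under stated assumptions: pthread primitives stand in for the kernel mutex and completion, a plain counter replaces the workers and dying_workers lists, and every `sketch_*` name is hypothetical rather than kernel API. It shows the one property that matters here: only the last worker to detach picks up the completion, and it signals after dropping the attach mutex.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct completion. */
struct sketch_completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

/* Hypothetical pool: a counter replaces the workers/dying_workers lists. */
struct sketch_pool {
	pthread_mutex_t attach_mutex;	/* plays the role of wq_pool_attach_mutex */
	int nr_attached;
	struct sketch_completion *detach_completion;	/* set by the pool destroyer */
};

static void sketch_completion_init(struct sketch_completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

static void sketch_complete(struct sketch_completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void sketch_pool_init(struct sketch_pool *p, int nr_attached,
			     struct sketch_completion *c)
{
	pthread_mutex_init(&p->attach_mutex, NULL);
	p->nr_attached = nr_attached;
	p->detach_completion = c;
}

/* Returns true if this call detached the last worker and signalled the waiter. */
static bool sketch_worker_detach(struct sketch_pool *p)
{
	struct sketch_completion *to_signal = NULL;

	pthread_mutex_lock(&p->attach_mutex);
	if (--p->nr_attached == 0)
		to_signal = p->detach_completion;
	pthread_mutex_unlock(&p->attach_mutex);

	/* Signal outside the mutex, mirroring the order used above. */
	if (to_signal)
		sketch_complete(to_signal);
	return to_signal != NULL;
}
```

With two attached workers, the first detach leaves the completion untouched and the second one fires it.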
2010-06-29 12:07:11 +04:00
/**
* create_worker - create a new workqueue worker
2012-07-13 01:46:37 +04:00
* @ pool : pool the new worker will belong to
2010-06-29 12:07:11 +04:00
*
2014-07-22 09:03:02 +04:00
* Create and start a new worker which is attached to @ pool .
2010-06-29 12:07:11 +04:00
*
* CONTEXT :
* Might sleep . Does GFP_KERNEL allocations .
*
2013-08-01 01:59:24 +04:00
* Return :
2010-06-29 12:07:11 +04:00
* Pointer to the newly created worker .
*/
2012-07-17 23:39:27 +04:00
static struct worker * create_worker ( struct worker_pool * pool )
2010-06-29 12:07:11 +04:00
{
2021-08-04 06:50:36 +03:00
struct worker * worker ;
int id ;
2023-10-09 20:09:46 +03:00
char id_buf [ 23 ] ;
2010-06-29 12:07:11 +04:00
2014-05-20 13:46:32 +04:00
/* ID is needed to determine kthread name */
2021-08-04 06:50:36 +03:00
id = ida_alloc ( & pool - > worker_ida , GFP_KERNEL ) ;
2023-03-07 15:53:32 +03:00
if ( id < 0 ) {
pr_err_once ( " workqueue: Failed to allocate a worker ID: %pe \n " ,
ERR_PTR ( id ) ) ;
2021-08-04 06:50:36 +03:00
return NULL ;
2023-03-07 15:53:32 +03:00
}
2010-06-29 12:07:11 +04:00
2014-07-15 13:24:15 +04:00
worker = alloc_worker ( pool - > node ) ;
2023-03-07 15:53:32 +03:00
if ( ! worker ) {
pr_err_once ( " workqueue: Failed to allocate a worker \n " ) ;
2010-06-29 12:07:11 +04:00
goto fail ;
2023-03-07 15:53:32 +03:00
}
2010-06-29 12:07:11 +04:00
worker - > id = id ;
2024-02-05 00:28:06 +03:00
if ( ! ( pool - > flags & POOL_BH ) ) {
if ( pool - > cpu > = 0 )
snprintf ( id_buf , sizeof ( id_buf ) , " %d:%d%s " , pool - > cpu , id ,
pool - > attrs - > nice < 0 ? " H " : " " ) ;
else
snprintf ( id_buf , sizeof ( id_buf ) , " u%d:%d " , pool - > id , id ) ;
worker - > task = kthread_create_on_node ( worker_thread , worker ,
pool - > node , " kworker/%s " , id_buf ) ;
if ( IS_ERR ( worker - > task ) ) {
if ( PTR_ERR ( worker - > task ) = = - EINTR ) {
pr_err ( " workqueue: Interrupted when creating a worker thread \" kworker/%s \" \n " ,
id_buf ) ;
} else {
pr_err_once ( " workqueue: Failed to create a worker thread: %pe \n " ,
worker - > task ) ;
}
goto fail ;
2023-03-07 15:53:33 +03:00
}
2010-06-29 12:07:11 +04:00
2024-02-05 00:28:06 +03:00
set_user_nice ( worker - > task , pool - > attrs - > nice ) ;
kthread_bind_mask ( worker - > task , pool_allowed_cpus ( pool ) ) ;
}
2013-11-14 15:56:18 +04:00
2014-05-20 13:46:31 +04:00
/* successful, attach the worker to the pool */
2014-05-20 13:46:35 +04:00
worker_attach_to_pool ( worker , pool ) ;
2013-03-20 00:45:21 +04:00
2014-07-22 09:03:02 +04:00
/* start the newly created worker */
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
2023-08-08 04:57:25 +03:00
2014-07-22 09:03:02 +04:00
worker - > pool - > nr_workers + + ;
worker_enter_idle ( worker ) ;
2023-08-08 04:57:25 +03:00
/*
* @ worker is waiting on a completion in kthread ( ) and will trigger hung
2024-01-27 00:55:46 +03:00
* check if not woken up soon . As kick_pool ( ) is noop if @ pool is empty ,
* wake it up explicitly .
2023-08-08 04:57:25 +03:00
*/
2024-02-05 00:28:06 +03:00
if ( worker - > task )
wake_up_process ( worker - > task ) ;
2023-08-08 04:57:25 +03:00
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2014-07-22 09:03:02 +04:00
2010-06-29 12:07:11 +04:00
return worker ;
2013-03-20 00:45:21 +04:00
2010-06-29 12:07:11 +04:00
fail :
2021-08-04 06:50:36 +03:00
ida_free ( & pool - > worker_ida , id ) ;
2010-06-29 12:07:11 +04:00
kfree ( worker ) ;
return NULL ;
}
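The naming scheme used in create_worker() can be tried out in isolation. The helper below is a minimal sketch, not the kernel code: `sketch_worker_name` is a hypothetical name, a negative `cpu` is taken to mean an unbound pool, and the "kworker/" prefix is folded into a single snprintf() call instead of going through id_buf.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build a worker name following the scheme in create_worker():
 *   per-CPU pool -> "kworker/<cpu>:<id>", with an "H" suffix when nice < 0
 *   unbound pool -> "kworker/u<pool_id>:<id>"
 * Returns what snprintf() returns, so truncation can be detected.
 */
static int sketch_worker_name(char *buf, size_t len, int cpu, int pool_id,
			      int id, int nice)
{
	if (cpu >= 0)
		return snprintf(buf, len, "kworker/%d:%d%s", cpu, id,
				nice < 0 ? "H" : "");
	return snprintf(buf, len, "kworker/u%d:%d", pool_id, id);
}
```

A high-priority worker with id 1 on CPU 3 comes out as "kworker/3:1H", and worker 2 of unbound pool 8 as "kworker/u8:2".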
2023-01-12 19:14:28 +03:00
static void unbind_worker ( struct worker * worker )
{
lockdep_assert_held ( & wq_pool_attach_mutex ) ;
kthread_set_per_cpu ( worker - > task , - 1 ) ;
if ( cpumask_intersects ( wq_unbound_cpumask , cpu_active_mask ) )
WARN_ON_ONCE ( set_cpus_allowed_ptr ( worker - > task , wq_unbound_cpumask ) < 0 ) ;
else
WARN_ON_ONCE ( set_cpus_allowed_ptr ( worker - > task , cpu_possible_mask ) < 0 ) ;
}
2023-01-12 19:14:31 +03:00
static void wake_dying_workers ( struct list_head * cull_list )
{
struct worker * worker , * tmp ;
list_for_each_entry_safe ( worker , tmp , cull_list , entry ) {
list_del_init ( & worker - > entry ) ;
unbind_worker ( worker ) ;
/*
* If the worker was somehow already running , then it had to be
* in pool - > idle_list when set_worker_dying ( ) happened or we
* wouldn ' t have gotten here .
*
* Thus , the worker must either have observed the WORKER_DIE
* flag , or have set its state to TASK_IDLE . Either way , the
* below will be observed by the worker and is safe to do
* outside of pool - > lock .
*/
wake_up_process ( worker - > task ) ;
}
}
2010-06-29 12:07:11 +04:00
/**
2023-01-12 19:14:31 +03:00
* set_worker_dying - Tag a worker for destruction
2010-06-29 12:07:11 +04:00
* @ worker : worker to be destroyed
2023-01-12 19:14:31 +03:00
* @ list : transfer worker away from its pool - > idle_list and into list
2010-06-29 12:07:11 +04:00
*
2023-01-12 19:14:31 +03:00
* Tag @ worker for destruction and adjust its pool ' s stats accordingly . The worker
* should be idle .
2010-06-29 12:07:12 +04:00
*
* CONTEXT :
2020-05-27 22:46:33 +03:00
* raw_spin_lock_irq ( pool - > lock ) .
2010-06-29 12:07:11 +04:00
*/
2023-01-12 19:14:31 +03:00
static void set_worker_dying ( struct worker * worker , struct list_head * list )
2010-06-29 12:07:11 +04:00
{
2012-07-13 01:46:37 +04:00
struct worker_pool * pool = worker - > pool ;
2010-06-29 12:07:11 +04:00
2013-03-14 06:47:39 +04:00
lockdep_assert_held ( & pool - > lock ) ;
2023-01-12 19:14:31 +03:00
lockdep_assert_held ( & wq_pool_attach_mutex ) ;
2013-03-14 06:47:39 +04:00
2010-06-29 12:07:11 +04:00
/* sanity check frenzy */
2013-03-12 22:29:57 +04:00
if ( WARN_ON ( worker - > current_work ) | |
2014-05-20 13:46:28 +04:00
WARN_ON ( ! list_empty ( & worker - > scheduled ) ) | |
WARN_ON ( ! ( worker - > flags & WORKER_IDLE ) ) )
2013-03-12 22:29:57 +04:00
return ;
2010-06-29 12:07:11 +04:00
2014-05-20 13:46:28 +04:00
pool - > nr_workers - - ;
pool - > nr_idle - - ;
2014-02-15 18:02:28 +04:00
2010-07-02 12:03:50 +04:00
worker - > flags | = WORKER_DIE ;
2023-01-12 19:14:31 +03:00
list_move ( & worker - > entry , list ) ;
list_move ( & worker - > node , & pool - > dying_workers ) ;
2010-06-29 12:07:11 +04:00
}
2023-01-12 19:14:29 +03:00
/**
* idle_worker_timeout - check if some idle workers can now be deleted .
* @ t : The pool ' s idle_timer that just expired
*
* The timer is armed in worker_enter_idle ( ) . Note that it isn ' t disarmed in
* worker_leave_idle ( ) , as a worker flicking between idle and active while its
* pool is at the too_many_workers ( ) tipping point would cause too much timer
* housekeeping overhead . Since IDLE_WORKER_TIMEOUT is long enough , we just let
* it expire and re - evaluate things from there .
*/
2017-10-17 01:58:25 +03:00
static void idle_worker_timeout ( struct timer_list * t )
2010-06-29 12:07:14 +04:00
{
2017-10-17 01:58:25 +03:00
struct worker_pool * pool = from_timer ( pool , t , idle_timer ) ;
2023-01-12 19:14:29 +03:00
bool do_cull = false ;
if ( work_pending ( & pool - > idle_cull_work ) )
return ;
2010-06-29 12:07:14 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
2010-06-29 12:07:14 +04:00
2023-01-12 19:14:29 +03:00
if ( too_many_workers ( pool ) ) {
2010-06-29 12:07:14 +04:00
struct worker * worker ;
unsigned long expires ;
/* idle_list is kept in LIFO order, check the last one */
2024-03-08 12:42:53 +03:00
worker = list_last_entry ( & pool - > idle_list , struct worker , entry ) ;
2023-01-12 19:14:29 +03:00
expires = worker - > last_active + IDLE_WORKER_TIMEOUT ;
do_cull = ! time_before ( jiffies , expires ) ;
if ( ! do_cull )
mod_timer ( & pool - > idle_timer , expires ) ;
}
raw_spin_unlock_irq ( & pool - > lock ) ;
if ( do_cull )
queue_work ( system_unbound_wq , & pool - > idle_cull_work ) ;
}
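The expiry check in idle_worker_timeout() relies on a comparison that stays correct when the tick counter wraps. The sketch below reproduces that decision in userspace, assuming the usual signed-difference form of time_before(). `sketch_time_before` and `sketch_should_cull` are hypothetical names, and the cast assumes the two's-complement conversion that common compilers provide.

```c
#include <limits.h>
#include <stdbool.h>

/* Wraparound-safe "a is earlier than b", modelled on time_before(). */
static bool sketch_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Decide whether the longest-idle worker may be culled now.
 * If not, *rearm_at receives the tick at which to look again.
 */
static bool sketch_should_cull(unsigned long now, unsigned long last_active,
			       unsigned long timeout, unsigned long *rearm_at)
{
	unsigned long expires = last_active + timeout;	/* may wrap, by design */

	if (sketch_time_before(now, expires)) {
		*rearm_at = expires;
		return false;
	}
	return true;
}
```

Because the comparison looks at the signed difference, a worker that went idle shortly before the counter wrapped is still treated as not yet expired.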
/**
* idle_cull_fn - cull workers that have been idle for too long .
* @ work : the pool ' s work for handling these idle workers
*
* This goes through a pool ' s idle workers and gets rid of those that have been
* idle for at least IDLE_WORKER_TIMEOUT jiffies .
2023-01-12 19:14:31 +03:00
*
* We don ' t want to disturb isolated CPUs because of a pcpu kworker being
* culled , so this also resets worker affinity . This requires a sleepable
* context , hence the split between timer callback and work item .
2023-01-12 19:14:29 +03:00
*/
static void idle_cull_fn ( struct work_struct * work )
{
struct worker_pool * pool = container_of ( work , struct worker_pool , idle_cull_work ) ;
2023-08-04 06:22:15 +03:00
LIST_HEAD ( cull_list ) ;
2023-01-12 19:14:29 +03:00
2023-01-12 19:14:31 +03:00
/*
* Grabbing wq_pool_attach_mutex here ensures an already - running worker
* cannot proceed beyond worker_detach_from_pool ( ) in its self - destruct
* path . This is required as a previously - preempted worker could run after
* set_worker_dying ( ) has happened but before wake_dying_workers ( ) did .
*/
mutex_lock ( & wq_pool_attach_mutex ) ;
2023-01-12 19:14:29 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
while ( too_many_workers ( pool ) ) {
struct worker * worker ;
unsigned long expires ;
2024-03-08 12:42:53 +03:00
worker = list_last_entry ( & pool - > idle_list , struct worker , entry ) ;
2010-06-29 12:07:14 +04:00
expires = worker - > last_active + IDLE_WORKER_TIMEOUT ;
2014-05-20 13:46:30 +04:00
if ( time_before ( jiffies , expires ) ) {
2012-07-13 01:46:37 +04:00
mod_timer ( & pool - > idle_timer , expires ) ;
2014-05-20 13:46:30 +04:00
break ;
2006-12-07 07:37:26 +03:00
}
2014-05-20 13:46:30 +04:00
2023-01-12 19:14:31 +03:00
set_worker_dying ( worker , & cull_list ) ;
2010-06-29 12:07:14 +04:00
}
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2023-01-12 19:14:31 +03:00
wake_dying_workers ( & cull_list ) ;
mutex_unlock ( & wq_pool_attach_mutex ) ;
2010-06-29 12:07:14 +04:00
}
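The culling loop in idle_cull_fn() walks the idle list from its tail, which holds the longest-idle worker, and stops at the first one that has not expired. Here is a minimal userspace sketch of that loop. It rests on assumptions: an array replaces the list, a simple `max_idle` threshold replaces too_many_workers() (whose real policy is not shown in this section), and all `sketch_*` names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_WORKERS 8

struct sketch_idle_pool {
	/* index 0 = most recently idle, last entry = longest idle (LIFO) */
	unsigned long last_active[SKETCH_MAX_WORKERS];
	size_t nr_idle;
	size_t max_idle;	/* hypothetical stand-in for too_many_workers() */
};

/*
 * Cull expired workers from the tail. Returns the number culled.
 * If an unexpired worker stops the loop, *rearm is set and *rearm_at
 * holds the tick at which the timer should fire again.
 */
static size_t sketch_cull_idle(struct sketch_idle_pool *p, unsigned long now,
			       unsigned long timeout, bool *rearm,
			       unsigned long *rearm_at)
{
	size_t culled = 0;

	*rearm = false;
	while (p->nr_idle > p->max_idle) {
		unsigned long expires = p->last_active[p->nr_idle - 1] + timeout;

		if ((long)(now - expires) < 0) {	/* time_before(now, expires) */
			*rearm = true;
			*rearm_at = expires;
			break;
		}
		p->nr_idle--;	/* stands in for set_worker_dying() */
		culled++;
	}
	return culled;
}
```

Since the list is ordered by idle time, the first unexpired worker found at the tail means every worker ahead of it is unexpired too, so breaking out early loses nothing.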
2006-12-07 07:37:26 +03:00
2013-03-12 22:29:59 +04:00
static void send_mayday ( struct work_struct * work )
2010-06-29 12:07:14 +04:00
{
2013-02-14 07:29:12 +04:00
struct pool_workqueue * pwq = get_work_pwq ( work ) ;
struct workqueue_struct * wq = pwq - > wq ;
2013-03-12 22:29:59 +04:00
2013-03-14 06:47:40 +04:00
lockdep_assert_held ( & wq_mayday_lock ) ;
2010-06-29 12:07:14 +04:00
2013-03-12 22:30:03 +04:00
if ( ! wq - > rescuer )
2013-03-12 22:29:59 +04:00
return ;
2010-06-29 12:07:14 +04:00
/* mayday mayday mayday */
2013-03-12 22:29:59 +04:00
if ( list_empty ( & pwq - > mayday_node ) ) {
2014-04-18 19:04:16 +04:00
/*
* If @ pwq is for an unbound wq , its base ref may be put at
* any time due to an attribute change . Pin @ pwq until the
* rescuer is done with it .
*/
get_pwq ( pwq ) ;
2013-03-12 22:29:59 +04:00
list_add_tail ( & pwq - > mayday_node , & wq - > maydays ) ;
2010-06-29 12:07:14 +04:00
wake_up_process ( wq - > rescuer - > task ) ;
2023-05-18 06:02:08 +03:00
pwq - > stats [ PWQ_STAT_MAYDAY ] + + ;
2013-03-12 22:29:59 +04:00
}
2010-06-29 12:07:14 +04:00
}
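send_mayday() is idempotent: a pool_workqueue already queued for the rescuer is neither queued nor pinned a second time. The sketch below imitates that with a tiny intrusive list. It is an illustration under assumptions, not kernel code: `sketch_node`, `sketch_pwq` and the helper functions are hypothetical, and a plain integer replaces the real reference count.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list node; an unlinked node points to itself. */
struct sketch_node {
	struct sketch_node *prev, *next;
};

static void sketch_node_init(struct sketch_node *n)
{
	n->prev = n->next = n;
}

static bool sketch_node_unlinked(const struct sketch_node *n)
{
	return n->next == n;
}

static void sketch_add_tail(struct sketch_node *n, struct sketch_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

struct sketch_pwq {
	struct sketch_node mayday_node;
	int refcnt;
};

/* Returns true if a new distress call was queued, false if one was pending. */
static bool sketch_send_mayday(struct sketch_pwq *pwq, struct sketch_node *maydays)
{
	if (!sketch_node_unlinked(&pwq->mayday_node))
		return false;		/* already waiting for the rescuer */

	pwq->refcnt++;			/* pin until the rescuer is done with it */
	sketch_add_tail(&pwq->mayday_node, maydays);
	return true;
}
```

Testing the node's own linkage is what lets the timer callback call this for every pending work item without queueing the same entry twice.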
2017-10-17 01:58:25 +03:00
static void pool_mayday_timeout ( struct timer_list * t )
2010-06-29 12:07:14 +04:00
{
2017-10-17 01:58:25 +03:00
struct worker_pool * pool = from_timer ( pool , t , mayday_timer ) ;
2010-06-29 12:07:14 +04:00
struct work_struct * work ;
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
raw_spin_lock ( & wq_mayday_lock ) ; /* for wq->maydays */
2010-06-29 12:07:14 +04:00
2012-07-13 01:46:37 +04:00
if ( need_to_create_worker ( pool ) ) {
2010-06-29 12:07:14 +04:00
/*
* We ' ve been trying to create a new worker but
* haven ' t been successful . We might be hitting an
* allocation deadlock . Send distress signals to
* rescuers .
*/
2012-07-13 01:46:37 +04:00
list_for_each_entry ( work , & pool - > worklist , entry )
2010-06-29 12:07:14 +04:00
send_mayday ( work ) ;
2005-04-17 02:20:36 +04:00
}
2010-06-29 12:07:14 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_unlock ( & wq_mayday_lock ) ;
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-06-29 12:07:14 +04:00
2012-07-13 01:46:37 +04:00
mod_timer ( & pool - > mayday_timer , jiffies + MAYDAY_INTERVAL ) ;
2005-04-17 02:20:36 +04:00
}
2010-06-29 12:07:14 +04:00
/**
* maybe_create_worker - create a new worker if necessary
2012-07-13 01:46:37 +04:00
* @ pool : pool to create a new worker for
2010-06-29 12:07:14 +04:00
*
2012-07-13 01:46:37 +04:00
* Create a new worker for @ pool if necessary . @ pool is guaranteed to
2010-06-29 12:07:14 +04:00
* have at least one idle worker on return from this function . If
* creating a new worker takes longer than MAYDAY_INTERVAL , mayday is
2012-07-13 01:46:37 +04:00
* sent to all rescuers with works scheduled on @ pool to resolve
2010-06-29 12:07:14 +04:00
* possible allocation deadlock .
*
2013-03-14 03:51:36 +04:00
* On return , need_to_create_worker ( ) is guaranteed to be % false and
* may_start_working ( ) % true .
2010-06-29 12:07:14 +04:00
*
* LOCKING :
2020-05-27 22:46:33 +03:00
* raw_spin_lock_irq ( pool - > lock ) which may be released and regrabbed
2010-06-29 12:07:14 +04:00
* multiple times . Does GFP_KERNEL allocations . Called only from
* manager .
*/
workqueue: fix subtle pool management issue which can stall whole worker_pool
A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role to create more workers and
summon the rescuers if creating workers doesn't succeed in a timely
manner before proceeding to execute work items.
This manager role is implemented in manage_workers(), which indicates
whether the worker may proceed to work item execution with its return
value. This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, others should
proceed to work item execution.
Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function. need_to_create_worker() tests the following
conditions.
pending work items && !nr_running && !nr_idle
The first and third conditions are protected by pool->lock and thus
won't change while holding pool->lock; however, nr_running can change
asynchronously as other workers block and resume. While it's likely
to be zero, as someone woke this worker up in the first place, some
other workers could have become runnable in between, making it non-zero.
If this happens, manage_workers() could return false even with zero
nr_idle, making the worker, the last idle one, proceed to execute work
items. If all workers of the pool then end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one to create more
workers or summon the rescuers.
This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.
We could leave the early exit condition alone and just ignore the return
value, but the only reason it was put there is that
manage_workers() used to perform both creation and destruction of
workers and thus the function could be invoked while the pool was trying
to reduce the number of workers. Now that manage_workers() is called
only when more workers are needed, the only case in which this early exit
condition triggers is a rare race condition, rendering it pointless.
Tested with a simulated workload and modified workqueue code which
triggers the pool deadlock reliably without this patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Eric Sandeen <sandeen@sandeen.net>
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner <david@fromorbit.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: stable@vger.kernel.org
2015-01-16 22:21:16 +03:00
static void maybe_create_worker ( struct worker_pool * pool )
2013-01-24 23:01:33 +04:00
__releases ( & pool - > lock )
__acquires ( & pool - > lock )
2005-04-17 02:20:36 +04:00
{
2010-06-29 12:07:14 +04:00
restart :
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-07-14 13:31:20 +04:00
2010-06-29 12:07:14 +04:00
/* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */
2012-07-13 01:46:37 +04:00
mod_timer ( & pool - > mayday_timer , jiffies + MAYDAY_INITIAL_TIMEOUT ) ;
2010-06-29 12:07:14 +04:00
while ( true ) {
2014-07-22 09:03:02 +04:00
if ( create_worker ( pool ) | | ! need_to_create_worker ( pool ) )
2010-06-29 12:07:14 +04:00
break ;
2005-04-17 02:20:36 +04:00
2014-06-03 11:32:17 +04:00
schedule_timeout_interruptible ( CREATE_COOLDOWN ) ;
2010-07-14 13:31:20 +04:00
2012-07-13 01:46:37 +04:00
if ( ! need_to_create_worker ( pool ) )
2010-06-29 12:07:14 +04:00
break ;
}
2012-07-13 01:46:37 +04:00
del_timer_sync ( & pool - > mayday_timer ) ;
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
2014-07-22 09:03:02 +04:00
/*
* This is necessary even after a new worker was just successfully
* created as @ pool - > lock was dropped and the new worker might have
* already become busy .
*/
2012-07-13 01:46:37 +04:00
if ( need_to_create_worker ( pool ) )
2010-06-29 12:07:14 +04:00
goto restart ;
}
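The control flow of maybe_create_worker() can be simulated without locks or timers. The sketch below is a single-threaded illustration under assumptions: `fail_budget` is an invented knob that makes the first few creation attempts fail, a counter replaces schedule_timeout_interruptible(), and the need predicate uses the three conditions quoted in the commit message above. All `sim_*` names are hypothetical.

```c
#include <stdbool.h>

struct sim_pool {
	int pending;		/* queued work items */
	int nr_running;		/* workers currently running */
	int nr_idle;		/* idle workers */
	int fail_budget;	/* hypothetical: creation attempts that fail first */
	int attempts;		/* creation attempts made */
	int cooldowns;		/* times we backed off between attempts */
};

/* pending work items && !nr_running && !nr_idle */
static bool sim_need_to_create_worker(const struct sim_pool *p)
{
	return p->pending > 0 && p->nr_running == 0 && p->nr_idle == 0;
}

static bool sim_create_worker(struct sim_pool *p)
{
	p->attempts++;
	if (p->fail_budget > 0) {
		p->fail_budget--;
		return false;
	}
	p->nr_idle++;		/* a new worker starts out idle */
	return true;
}

static void sim_maybe_create_worker(struct sim_pool *p)
{
	do {
		/* the real code drops pool->lock and arms the mayday timer here */
		for (;;) {
			if (sim_create_worker(p) || !sim_need_to_create_worker(p))
				break;
			p->cooldowns++;	/* stands in for the CREATE_COOLDOWN sleep */
			if (!sim_need_to_create_worker(p))
				break;
		}
		/* lock retaken: the new worker may already be busy, so re-check */
	} while (sim_need_to_create_worker(p));
}
```

The outer re-check corresponds to the `goto restart` above. In this single-threaded sketch it never loops again, but in the kernel another CPU can change the pool between the creation and the relock.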
2010-06-29 12:07:11 +04:00
/**
2010-06-29 12:07:14 +04:00
* manage_workers - manage worker pool
* @ worker : self
2010-06-29 12:07:11 +04:00
*
2013-01-24 23:01:34 +04:00
* Assume the manager role and manage the worker pool @ worker belongs
2010-06-29 12:07:14 +04:00
* to . At any given time , there can be only zero or one manager per
2013-01-24 23:01:34 +04:00
* pool . The exclusion is handled automatically by this function .
2010-06-29 12:07:14 +04:00
*
* The caller can safely start processing works on false return . On
* true return , it ' s guaranteed that need_to_create_worker ( ) is false
* and may_start_working ( ) is true .
2010-06-29 12:07:11 +04:00
*
* CONTEXT :
2020-05-27 22:46:33 +03:00
* raw_spin_lock_irq ( pool - > lock ) which may be released and regrabbed
2010-06-29 12:07:14 +04:00
* multiple times . Does GFP_KERNEL allocations .
*
2013-08-01 01:59:24 +04:00
* Return :
2015-01-16 22:21:16 +03:00
* % false if the pool doesn ' t need management and the caller can safely
* start processing works , % true if management function was performed and
* the conditions that the caller verified before calling the function may
* no longer be true .
2010-06-29 12:07:11 +04:00
*/
2010-06-29 12:07:14 +04:00
static bool manage_workers ( struct worker * worker )
2010-06-29 12:07:11 +04:00
{
2012-07-13 01:46:37 +04:00
struct worker_pool * pool = worker - > pool ;
2010-06-29 12:07:11 +04:00
2017-10-09 18:04:13 +03:00
if ( pool - > flags & POOL_MANAGER_ACTIVE )
2015-01-16 22:21:16 +03:00
return false ;
2017-10-09 18:04:13 +03:00
pool - > flags | = POOL_MANAGER_ACTIVE ;
2015-03-09 16:22:28 +03:00
pool - > manager = worker ;
2010-06-29 12:07:12 +04:00
2015-01-16 22:21:16 +03:00
maybe_create_worker ( pool ) ;
2010-06-29 12:07:14 +04:00
2015-03-09 16:22:28 +03:00
pool - > manager = NULL ;
2017-10-09 18:04:13 +03:00
pool - > flags & = ~ POOL_MANAGER_ACTIVE ;
2020-05-27 22:46:32 +03:00
rcuwait_wake_up ( & manager_wait ) ;
2015-01-16 22:21:16 +03:00
return true ;
2010-06-29 12:07:11 +04:00
}
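The manager-role arbitration described in the commit message above can be sketched in user space. This is only an illustrative model, not kernel code: every `model_` name and the `MODEL_POOL_MANAGER_ACTIVE` constant are hypothetical stand-ins, locking is omitted, and "creating a worker" is reduced to bumping a counter. It shows the post-fix behaviour, where the function returns false only when another manager is already active.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for POOL_MANAGER_ACTIVE. */
#define MODEL_POOL_MANAGER_ACTIVE 0x1u

/* Hypothetical, heavily simplified stand-in for struct worker_pool. */
struct model_pool {
	unsigned int flags;
	int nr_pending;	/* queued work items */
	int nr_running;	/* workers currently runnable */
	int nr_idle;	/* idle workers */
	int nr_created;	/* workers created by the manager, for inspection */
};

/* The condition quoted in the commit message. */
static bool model_need_to_create_worker(const struct model_pool *p)
{
	return p->nr_pending > 0 && p->nr_running == 0 && p->nr_idle == 0;
}

/* "Create" workers until the condition above no longer holds. */
static void model_maybe_create_worker(struct model_pool *p)
{
	while (model_need_to_create_worker(p)) {
		p->nr_idle++;
		p->nr_created++;
	}
}

/*
 * Returns false only if somebody else already holds the manager role.
 * There is deliberately no early exit on !need_to_create_worker().
 */
static bool model_manage_workers(struct model_pool *p)
{
	assert(p != NULL);

	if (p->flags & MODEL_POOL_MANAGER_ACTIVE)
		return false;

	p->flags |= MODEL_POOL_MANAGER_ACTIVE;
	model_maybe_create_worker(p);
	p->flags &= ~MODEL_POOL_MANAGER_ACTIVE;
	return true;
}
```

In this model, a caller that sees `true` is expected to re-evaluate the pool state before executing work items, which corresponds to the `goto recheck` in worker_thread().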
2010-06-29 12:07:10 +04:00
/**
* process_one_work - process single work
2010-06-29 12:07:11 +04:00
* @ worker : self
2010-06-29 12:07:10 +04:00
* @ work : work to process
*
* Process @ work . This function contains all the logic necessary to
* process a single work including synchronization against and
* interaction with other workers on the same cpu , queueing and
* flushing . As long as context requirement is met , any worker can
* call this function to process a work .
*
* CONTEXT :
2020-05-27 22:46:33 +03:00
* raw_spin_lock_irq ( pool - > lock ) which is released and regrabbed .
2010-06-29 12:07:10 +04:00
*/
2010-06-29 12:07:11 +04:00
static void process_one_work ( struct worker * worker , struct work_struct * work )
2013-01-24 23:01:33 +04:00
__releases ( & pool - > lock )
__acquires ( & pool - > lock )
2010-06-29 12:07:10 +04:00
{
2013-02-14 07:29:12 +04:00
struct pool_workqueue * pwq = get_work_pwq ( work ) ;
2012-07-13 01:46:37 +04:00
struct worker_pool * pool = worker - > pool ;
2021-08-17 04:32:35 +03:00
unsigned long work_data ;
2024-02-05 00:28:06 +03:00
int lockdep_start_depth , rcu_start_depth ;
2024-02-27 04:38:55 +03:00
bool bh_draining = pool - > flags & POOL_BH_DRAINING ;
2010-06-29 12:07:10 +04:00
# ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
* inside the function that is called from it , this we need to
* take into account for lockdep too . To avoid bogus " held
* lock freed " warnings as well as problems when looking into
* work - > lockdep_map , make a copy and use that here .
*/
lockdep: fix oops in processing workqueue
Under memory load, on x86_64, with lockdep enabled, the workqueue's
process_one_work() has been seen to oops in __lock_acquire(), barfing
on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[].
Because it's permissible to free a work_struct from its callout function,
the map used is an onstack copy of the map given in the work_struct: and
that copy is made without any locking.
Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than
"rep movsq" for that structure copy: which might race with a workqueue
user's wait_on_work() doing lock_map_acquire() on the source of the
copy, putting a pointer into the class_cache[], but only in time for
the top half of that pointer to be copied to the destination map.
Boom when process_one_work() subsequently does lock_map_acquire()
on its onstack copy of the lockdep_map.
Fix this, and a similar instance in call_timer_fn(), with a
lockdep_copy_map() function which additionally NULLs the class_cache[].
Note: this oops was actually seen on 3.4-next, where flush_work() newly
does the racing lock_map_acquire(); but Tejun points out that 3.4 and
earlier are already vulnerable to the same through wait_on_work().
* Patch originally from Peter. Hugh modified it a bit and wrote the
description.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Hugh Dickins <hughd@google.com>
LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-05-15 19:06:19 +04:00
struct lockdep_map lockdep_map ;
lockdep_copy_map ( & lockdep_map , & work - > lockdep_map ) ;
2010-06-29 12:07:10 +04:00
# endif
2014-06-03 11:33:28 +04:00
/* ensure we're on the correct CPU */
2014-06-03 11:33:28 +04:00
WARN_ON_ONCE ( ! ( pool - > flags & POOL_DISASSOCIATED ) & &
2013-01-24 23:01:33 +04:00
raw_smp_processor_id ( ) ! = pool - > cpu ) ;
2012-07-17 23:39:27 +04:00
2012-08-03 21:30:45 +04:00
/* claim and dequeue */
2010-06-29 12:07:10 +04:00
debug_work_deactivate ( work ) ;
2013-01-24 23:01:33 +04:00
hash_add ( pool - > busy_hash , & worker - > hentry , ( unsigned long ) work ) ;
2010-06-29 12:07:11 +04:00
worker - > current_work = work ;
2012-12-18 22:35:02 +04:00
worker - > current_func = work - > func ;
2013-02-14 07:29:12 +04:00
worker - > current_pwq = pwq ;
2024-02-05 00:28:06 +03:00
if ( worker - > task )
worker - > current_at = worker - > task - > se . sum_exec_runtime ;
2021-08-17 04:32:35 +03:00
work_data = * work_data_bits ( work ) ;
2021-08-17 04:32:38 +03:00
worker - > current_color = get_work_color ( work_data ) ;
2010-06-29 12:07:13 +04:00
2018-05-18 18:47:13 +03:00
/*
* Record wq name for cmdline and debug reporting , may get
* overridden through set_worker_desc ( ) .
*/
strscpy ( worker - > desc , pwq - > wq - > name , WORKER_DESC_LEN ) ;
2010-06-29 12:07:10 +04:00
list_del_init ( & work - > entry ) ;
2010-06-29 12:07:15 +04:00
/*
2014-07-22 09:02:00 +04:00
* CPU intensive works don ' t participate in concurrency management .
* They ' re the scheduler ' s responsibility . This takes @ worker out
* of concurrency management and the next code block will chain
* execution of the pending work items .
2010-06-29 12:07:15 +04:00
*/
2023-05-18 06:02:08 +03:00
if ( unlikely ( pwq - > wq - > flags & WQ_CPU_INTENSIVE ) )
2014-07-22 09:02:00 +04:00
worker_set_flags ( worker , WORKER_CPU_INTENSIVE ) ;
2010-06-29 12:07:15 +04:00
2012-07-13 01:46:37 +04:00
/*
2023-08-08 04:57:25 +03:00
* Kick @ pool if necessary . It ' s always noop for per - cpu worker pools
* since nr_running would always be > = 1 at this point . This is used to
* chain execution of the pending work items for WORKER_NOT_RUNNING
* workers such as the UNBOUND and CPU_INTENSIVE ones .
2012-07-13 01:46:37 +04:00
*/
2023-08-08 04:57:25 +03:00
kick_pool ( pool ) ;
2012-07-13 01:46:37 +04:00
2012-08-03 21:30:45 +04:00
/*
2013-01-24 23:01:33 +04:00
* Record the last pool and clear PENDING which should be the last
2013-01-24 23:01:33 +04:00
* update to @ work . Also , do this inside @ pool - > lock so that
2012-08-14 04:08:19 +04:00
* PENDING and queued state changes happen together while IRQ is
* disabled .
2012-08-03 21:30:45 +04:00
*/
2024-03-25 20:21:03 +03:00
set_work_pool_and_clear_pending ( work , pool - > id , pool_offq_flags ( pool ) ) ;
2010-06-29 12:07:10 +04:00
2023-08-26 17:51:03 +03:00
pwq - > stats [ PWQ_STAT_STARTED ] + + ;
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-06-29 12:07:10 +04:00
2024-02-05 00:28:06 +03:00
rcu_start_depth = rcu_preempt_depth ( ) ;
lockdep_start_depth = lockdep_depth ( current ) ;
2024-02-27 04:38:55 +03:00
/* see drain_dead_softirq_workfn() */
if ( ! bh_draining )
lock_map_acquire ( & pwq - > wq - > lockdep_map ) ;
2010-06-29 12:07:10 +04:00
lock_map_acquire ( & lockdep_map ) ;
2017-08-23 14:23:30 +03:00
/*
2017-08-29 11:59:39 +03:00
* Strictly speaking we should mark the invariant state without holding
* any locks , that is , before these two lock_map_acquire ( ) ' s .
2017-08-23 14:23:30 +03:00
*
* However , that would result in :
*
* A ( W1 )
* WFC ( C )
* A ( W1 )
* C ( C )
*
* Which would create W1 - > C - > W1 dependencies , even though there is no
* actual deadlock possible . There are two solutions , using a
* read - recursive acquire on the work ( queue ) ' locks ' , but this will then
2017-08-29 11:59:39 +03:00
* hit the lockdep limitation on recursive locks , or simply discard
2017-08-23 14:23:30 +03:00
* these locks .
*
* AFAICT there is no possible deadlock scenario between the
* flush_work ( ) and complete ( ) primitives ( except for single - threaded
* workqueues ) , so hiding them isn ' t a problem .
*/
2017-08-29 11:59:39 +03:00
lockdep_invariant_state ( true ) ;
2010-08-22 00:07:26 +04:00
trace_workqueue_execute_start ( work ) ;
2012-12-18 22:35:02 +04:00
worker - > current_func ( work ) ;
2010-08-22 00:07:26 +04:00
/*
* While we must be careful to not use " work " after this , the trace
* point will only record its address .
*/
2020-01-14 01:52:39 +03:00
trace_workqueue_execute_end ( work , worker - > current_func ) ;
2023-05-18 06:02:08 +03:00
pwq - > stats [ PWQ_STAT_COMPLETED ] + + ;
2010-06-29 12:07:10 +04:00
lock_map_release ( & lockdep_map ) ;
2024-02-27 04:38:55 +03:00
if ( ! bh_draining )
lock_map_release ( & pwq - > wq - > lockdep_map ) ;
2010-06-29 12:07:10 +04:00
2024-02-05 00:28:06 +03:00
if ( unlikely ( ( worker - > task & & in_atomic ( ) ) | |
lockdep_depth ( current ) ! = lockdep_start_depth | |
rcu_preempt_depth ( ) ! = rcu_start_depth ) ) {
pr_err ( " BUG: workqueue leaked atomic, lock or RCU: %s[%d] \n "
" preempt=0x%08x lock=%d->%d RCU=%d->%d workfn=%ps \n " ,
current - > comm , task_pid_nr ( current ) , preempt_count ( ) ,
lockdep_start_depth , lockdep_depth ( current ) ,
rcu_start_depth , rcu_preempt_depth ( ) ,
worker - > current_func ) ;
2010-06-29 12:07:10 +04:00
debug_show_held_locks ( current ) ;
dump_stack ( ) ;
}
2013-08-29 01:33:37 +04:00
/*
2019-10-15 22:18:21 +03:00
* The following prevents a kworker from hogging CPU on ! PREEMPTION
2013-08-29 01:33:37 +04:00
* kernels , where a requeueing work item waiting for something to
* happen could deadlock with stop_machine as such work item could
* indefinitely requeue itself while all other CPUs are trapped in
2014-10-05 21:24:21 +04:00
* stop_machine . At the same time , report a quiescent RCU state so
* the same condition doesn ' t freeze RCU .
2013-08-29 01:33:37 +04:00
*/
2024-02-05 00:28:06 +03:00
if ( worker - > task )
cond_resched ( ) ;
2013-08-29 01:33:37 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
2010-06-29 12:07:10 +04:00
2023-05-18 06:02:08 +03:00
/*
* In addition to % WQ_CPU_INTENSIVE , @ worker may also have been marked
* CPU intensive by wq_worker_tick ( ) if @ work hogged CPU longer than
* wq_cpu_intensive_thresh_us . Clear it .
*/
worker_clr_flags ( worker , WORKER_CPU_INTENSIVE ) ;
2010-06-29 12:07:15 +04:00
psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.
Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again. This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)
Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep. To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.
What if the worker is also executing other items before or after?
Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.
If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation. If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-02 01:20:42 +03:00
/* tag the worker for identification in schedule() */
worker - > last_func = worker - > current_func ;
2010-06-29 12:07:10 +04:00
/* we're done with it, release */
2012-12-17 19:01:23 +04:00
hash_del ( & worker - > hentry ) ;
2010-06-29 12:07:11 +04:00
worker - > current_work = NULL ;
2012-12-18 22:35:02 +04:00
worker - > current_func = NULL ;
2013-02-14 07:29:12 +04:00
worker - > current_pwq = NULL ;
2021-08-17 04:32:38 +03:00
worker - > current_color = INT_MAX ;
2024-01-29 21:11:24 +03:00
/* must be the last step, see the function comment */
2021-08-17 04:32:35 +03:00
pwq_dec_nr_in_flight ( pwq , work_data ) ;
2010-06-29 12:07:10 +04:00
}
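process_one_work() snapshots lockdep_depth() and rcu_preempt_depth() before invoking the work function and compares them afterwards to detect a callback that returns with a lock or RCU read section still held. The same snapshot-and-compare pattern can be sketched in user space. All `model_` names below are hypothetical, and a plain counter stands in for the lockdep depth.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for lockdep_depth(current). */
static int model_lock_depth;

static void model_lock(void)
{
	model_lock_depth++;
}

static void model_unlock(void)
{
	model_lock_depth--;
}

typedef void (*model_work_fn)(void);

/* A well-behaved callback: every lock is paired with an unlock. */
static void balanced_work(void)
{
	model_lock();
	model_unlock();
}

/* A buggy callback: returns with the "lock" still held. */
static void leaky_work(void)
{
	model_lock();
}

/*
 * Run @fn and report whether it returned at a different depth than it
 * started with, i.e. whether it leaked (or over-released) a lock.
 */
static bool model_run_and_check(model_work_fn fn)
{
	int start_depth = model_lock_depth;

	fn();
	return model_lock_depth != start_depth;
}
```

Comparing against a snapshot rather than against zero keeps the check valid when the caller itself already holds locks at the time the callback is invoked.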
2010-06-29 12:07:12 +04:00
/**
* process_scheduled_works - process scheduled works
* @ worker : self
*
* Process all scheduled works . Please note that the scheduled list
* may change while processing a work , so this function repeatedly
* fetches a work from the top and executes it .
*
* CONTEXT :
2020-05-27 22:46:33 +03:00
* raw_spin_lock_irq ( pool - > lock ) which may be released and regrabbed
2010-06-29 12:07:12 +04:00
* multiple times .
*/
static void process_scheduled_works ( struct worker * worker )
2005-04-17 02:20:36 +04:00
{
2023-08-08 04:57:22 +03:00
struct work_struct * work ;
bool first = true ;
while ( ( work = list_first_entry_or_null ( & worker - > scheduled ,
struct work_struct , entry ) ) ) {
if ( first ) {
worker - > pool - > watchdog_ts = jiffies ;
first = false ;
}
2010-06-29 12:07:11 +04:00
process_one_work ( worker , work ) ;
2005-04-17 02:20:36 +04:00
}
}
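process_scheduled_works() re-reads the head of the scheduled list on every iteration because a work function may add entries while it runs. A minimal user-space sketch of that drain loop follows. The singly linked queue and all `model_` names are assumptions made for illustration; the kernel uses `struct list_head` and list_first_entry_or_null() under pool->lock.

```c
#include <assert.h>
#include <stddef.h>

struct model_worker;

/* Hypothetical stand-in for struct work_struct. */
struct model_work {
	struct model_work *next;
	void (*func)(struct model_worker *w, struct model_work *self);
};

/* Hypothetical stand-in for struct worker with its ->scheduled list. */
struct model_worker {
	struct model_work *head, *tail;
	int nr_processed;
};

/* Append @work to the tail of the worker's scheduled queue. */
static void model_schedule(struct model_worker *w, struct model_work *work)
{
	work->next = NULL;
	if (w->tail)
		w->tail->next = work;
	else
		w->head = work;
	w->tail = work;
}

/*
 * Always fetch the current head instead of walking a saved iterator, so
 * items queued by a running callback are picked up in the same pass.
 */
static void model_process_scheduled(struct model_worker *w)
{
	struct model_work *work;

	while ((work = w->head) != NULL) {
		/* dequeue before running, as process_one_work() does */
		w->head = work->next;
		if (!w->head)
			w->tail = NULL;
		work->next = NULL;

		work->func(w, work);
		w->nr_processed++;
	}
}

/* Example callbacks: chain_fn() queues follow_up while it is running. */
static struct model_work follow_up;

static void noop_fn(struct model_worker *w, struct model_work *self)
{
	(void)w;
	(void)self;
}

static void chain_fn(struct model_worker *w, struct model_work *self)
{
	(void)self;
	follow_up.func = noop_fn;
	model_schedule(w, &follow_up);
}
```

Queueing a single item whose callback is chain_fn() results in two items being processed by one call to model_process_scheduled(), which is the behaviour the function comment above warns about.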
2018-05-21 18:04:35 +03:00
static void set_pf_worker ( bool val )
{
mutex_lock ( & wq_pool_attach_mutex ) ;
if ( val )
current - > flags | = PF_WQ_WORKER ;
else
current - > flags & = ~ PF_WQ_WORKER ;
mutex_unlock ( & wq_pool_attach_mutex ) ;
}
2010-06-29 12:07:10 +04:00
/**
* worker_thread - the worker thread function
2010-06-29 12:07:11 +04:00
* @ __worker : self
2010-06-29 12:07:10 +04:00
*
2013-03-14 03:51:36 +04:00
* The worker thread function . All workers belong to a worker_pool -
* either a per - cpu one or dynamic unbound one . These workers process all
* work items regardless of their specific target workqueue . The only
* exception is work items which belong to workqueues with a rescuer which
* will be explained in rescuer_thread ( ) .
2013-08-01 01:59:24 +04:00
*
* Return : 0
2010-06-29 12:07:10 +04:00
*/
2010-06-29 12:07:11 +04:00
static int worker_thread ( void * __worker )
2005-04-17 02:20:36 +04:00
{
2010-06-29 12:07:11 +04:00
struct worker * worker = __worker ;
2012-07-13 01:46:37 +04:00
struct worker_pool * pool = worker - > pool ;
2005-04-17 02:20:36 +04:00
2010-06-29 12:07:14 +04:00
/* tell the scheduler that this is a workqueue worker */
2018-05-21 18:04:35 +03:00
set_pf_worker ( true ) ;
2010-06-29 12:07:12 +04:00
woke_up :
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
2005-04-17 02:20:36 +04:00
workqueue: directly restore CPU affinity of workers from CPU_ONLINE
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.
As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr(). Due
to the various states a worker can be in, this led to three different
paths a worker may be rebound. worker->rebind_work is queued to busy
workers. Idle ones are signaled by unlinking worker->entry and call
idle_worker_rebind(). The manager isn't covered by either and
implements its own mechanism.
PF_THREAD_BOUND has been replaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself now can manipulate CPU affinity of workers. This patch
replaces the existing rebind mechanism with a direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.
There are a couple subtleties. All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up. This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.
Another subtlety stems from matching @pool->nr_running with the
number of running unbound workers. While DISASSOCIATED, all workers
are unbound and nr_running is zero. As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals. Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution. The only change
needed for the workers is clearing REBOUND along with PREP.
* This patch leaves for_each_busy_worker() without any user. Removed.
* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
and rebind logic in manage_workers() removed.
* worker_thread() now looks at WORKER_DIE instead of testing whether
@worker->entry is empty to determine whether it needs to do
something special as dying is the only special thing now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-20 00:45:21 +04:00
/* am I supposed to die? */
if ( unlikely ( worker - > flags & WORKER_DIE ) ) {
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2018-05-21 18:04:35 +03:00
set_pf_worker ( false ) ;
workqueue: async worker destruction
worker destruction includes these parts of code:
adjust pool's stats
remove the worker from idle list
detach the worker from the pool
kthread_stop() to wait for the worker's task exit
free the worker struct
We can find out that there is no essential work to do after
kthread_stop(), which means destroy_worker() doesn't need to wait for
the worker's task exit, so we can remove kthread_stop() and free the
worker struct in the worker exiting path.
However, put_unbound_pool() still needs to sync with all the workers'
destruction before destroying the pool; otherwise, the workers may
access the invalid pool when they are exiting.
So we also move the "detach the worker" code to the exiting
path and let put_unbound_pool() sync with this code via
detach_completion.
The code of "detach the worker" is wrapped in a new function
"worker_detach_from_pool()" although worker_detach_from_pool() is only
called once (in worker_thread()) after this patch, but we need to wrap
it for these reasons:
1) The code of "detach the worker" is not short enough to unfold them
in worker_thread().
2) the name of "worker_detach_from_pool()" is self-documenting, and we add
some comments above the function.
3) it will be shared by rescuer in later patch which allows rescuer
and normal thread use the same attach/detach frameworks.
The worker id is freed when detaching which happens before the worker
is fully dead, but this id of the dying worker may be re-used for a
new worker, so the dying worker's task name is changed to
"kworker/dying" to avoid two or several workers having the same name.
Since "detach the worker" is moved out from destroy_worker(),
destroy_worker() doesn't require manager_mutex, so the
"lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is
removed, and destroy_worker() is not protected by manager_mutex in
put_unbound_pool().
tj: Minor description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 13:46:29 +04:00
set_task_comm ( worker - > task , " kworker/dying " ) ;
2021-08-04 06:50:36 +03:00
ida_free ( & pool - > worker_ida , worker - > id ) ;
2018-05-18 18:47:13 +03:00
worker_detach_from_pool ( worker ) ;
2023-01-12 19:14:31 +03:00
WARN_ON_ONCE ( ! list_empty ( & worker - > entry ) ) ;
2014-05-20 13:46:29 +04:00
kfree ( worker ) ;
2013-03-20 00:45:21 +04:00
return 0 ;
2010-06-29 12:07:12 +04:00
}
2010-06-29 12:07:12 +04:00
2010-06-29 12:07:12 +04:00
worker_leave_idle ( worker ) ;
2010-06-29 12:07:12 +04:00
recheck :
2010-06-29 12:07:14 +04:00
/* no more worker necessary? */
2012-07-13 01:46:37 +04:00
if ( ! need_more_worker ( pool ) )
2010-06-29 12:07:14 +04:00
goto sleep ;
/* do we need to manage? */
2012-07-13 01:46:37 +04:00
if ( unlikely ( ! may_start_working ( pool ) ) & & manage_workers ( worker ) )
2010-06-29 12:07:14 +04:00
goto recheck ;
2010-06-29 12:07:12 +04:00
/*
* - > scheduled list can only be filled while a worker is
* preparing to process a work or actually processing it .
* Make sure nobody diddled with it while I was sleeping .
*/
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( ! list_empty ( & worker - > scheduled ) ) ;
2010-06-29 12:07:12 +04:00
2010-06-29 12:07:14 +04:00
/*
2013-03-20 00:45:21 +04:00
* Finish PREP stage . We ' re guaranteed to have at least one idle
* worker or that someone else has already assumed the manager
* role . This is where @ worker starts participating in concurrency
* management if applicable and concurrency management is restored
* after being rebound . See rebind_workers ( ) for details .
2010-06-29 12:07:14 +04:00
*/
2013-03-20 00:45:21 +04:00
worker_clr_flags ( worker , WORKER_PREP | WORKER_REBOUND ) ;
2010-06-29 12:07:14 +04:00
do {
2010-06-29 12:07:12 +04:00
struct work_struct * work =
2012-07-13 01:46:37 +04:00
list_first_entry ( & pool - > worklist ,
2010-06-29 12:07:12 +04:00
struct work_struct , entry ) ;
workqueue: Factor out work to worker assignment and collision handling
The two work execution paths in worker_thread() and rescuer_thread() use
move_linked_works() to claim work items from @pool->worklist. Once claimed,
process_scheduled_works() is called which invokes process_one_work() on each
work item. process_one_work() then uses find_worker_executing_work() to
detect and handle collisions - situations where the work item to be executed
is still running on another worker.
This works fine, but, to improve work execution locality, we want to
establish work to worker association earlier and know for sure that the
worker is going to execute the work once assigned, which requires performing
collision handling earlier while trying to assign the work item to the
worker.
This patch introduces assign_work() which assigns a work item to a worker
using move_linked_works() and then performs collision handling. As collisions
are handled earlier, process_one_work() no longer needs to worry
about them.
After this patch, collision checks for linked work items are skipped,
which should be fine as they can't be queued multiple times concurrently.
For work items running from rescuers, the timing of collision handling may
change but the invariant that the work items go through collision handling
before starting execution does not.
This patch shouldn't cause noticeable behavior changes, especially given
that worker_thread() behavior remains the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
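The assignment and collision rule described above can be sketched as a small userspace model. This is an illustration under stated assumptions, not the kernel code: the fixed-size arrays stand in for the kernel's linked lists, linked work items are ignored, and find_worker_executing(), schedule_on() and assign_work_model() are hypothetical names. The point it shows is that a work item still running on another worker is handed to that worker, and the caller learns that it has nothing to execute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SCHEDULED 8
#define MAX_WORKERS   4

struct work {
	int id;
};

struct worker {
	const struct work *current_work;              /* item being executed, or NULL */
	const struct work *scheduled[MAX_SCHEDULED];  /* items claimed by this worker */
	size_t nr_scheduled;
};

struct pool {
	struct worker *workers[MAX_WORKERS];
	size_t nr_workers;
};

/* Return the worker currently executing @work, or NULL if there is none. */
static struct worker *find_worker_executing(struct pool *pool,
					    const struct work *work)
{
	for (size_t i = 0; i < pool->nr_workers; i++)
		if (pool->workers[i]->current_work == work)
			return pool->workers[i];
	return NULL;
}

/* Append @work to @w's scheduled array; the model assumes it never fills. */
static void schedule_on(struct worker *w, const struct work *work)
{
	assert(w->nr_scheduled < MAX_SCHEDULED);
	w->scheduled[w->nr_scheduled++] = work;
}

/*
 * Assign @work to @worker unless another worker is still executing it.
 * On collision the item is deferred to the colliding worker and false is
 * returned, so the caller must not try to process it.
 */
static bool assign_work_model(struct pool *pool, const struct work *work,
			      struct worker *worker)
{
	struct worker *collision = find_worker_executing(pool, work);

	if (collision) {
		schedule_on(collision, work);
		return false;
	}
	schedule_on(worker, work);
	return true;
}
```

A caller in this model would mirror the loop in worker_thread(): call assign_work_model() and process the worker's scheduled items only when it returns true.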
2023-08-08 04:57:25 +03:00
if ( assign_work ( work , worker , NULL ) )
process_scheduled_works ( worker ) ;
2012-07-13 01:46:37 +04:00
} while ( keep_working ( pool ) ) ;
2010-06-29 12:07:14 +04:00
2014-07-22 09:02:00 +04:00
worker_set_flags ( worker , WORKER_PREP ) ;
2010-07-02 12:03:51 +04:00
sleep :
2010-06-29 12:07:12 +04:00
/*
2013-01-24 23:01:33 +04:00
* pool - > lock is held and there ' s no work to process and no need to
* manage , sleep . Workers are woken up only while holding
* pool - > lock or from local cpu , so setting the current state
* before releasing pool - > lock is enough to prevent losing any
* event .
2010-06-29 12:07:12 +04:00
*/
worker_enter_idle ( worker ) ;
2017-08-23 14:58:44 +03:00
__set_current_state ( TASK_IDLE ) ;
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-06-29 12:07:12 +04:00
schedule ( ) ;
goto woke_up ;
2005-04-17 02:20:36 +04:00
}
2010-06-29 12:07:14 +04:00
/**
* rescuer_thread - the rescuer thread function
2013-01-18 05:16:24 +04:00
* @ __rescuer : self
2010-06-29 12:07:14 +04:00
*
* Workqueue rescuer thread function . There ' s one rescuer for each
2013-03-12 22:30:03 +04:00
* workqueue which has WQ_MEM_RECLAIM set .
2010-06-29 12:07:14 +04:00
*
2013-01-24 23:01:34 +04:00
* Regular work processing on a pool may block trying to create a new
2010-06-29 12:07:14 +04:00
* worker which uses GFP_KERNEL allocation which has a slight chance of
* developing into a deadlock if some works currently on the same queue
* need to be processed to satisfy the GFP_KERNEL allocation . This is
* the problem the rescuer solves .
*
2013-01-24 23:01:34 +04:00
* When such a condition is possible , the pool summons rescuers of all
* workqueues which have works queued on the pool and lets them process
2010-06-29 12:07:14 +04:00
* those works so that forward progress can be guaranteed .
*
* This should happen rarely .
2013-08-01 01:59:24 +04:00
*
* Return : 0
2010-06-29 12:07:14 +04:00
*/
2013-01-18 05:16:24 +04:00
static int rescuer_thread ( void * __rescuer )
2010-06-29 12:07:14 +04:00
{
2013-01-18 05:16:24 +04:00
struct worker * rescuer = __rescuer ;
struct workqueue_struct * wq = rescuer - > rescue_wq ;
2014-04-18 19:04:16 +04:00
bool should_stop ;
2010-06-29 12:07:14 +04:00
set_user_nice ( current , RESCUER_NICE_LEVEL ) ;
2013-01-18 05:16:24 +04:00
/*
* Mark rescuer as worker too . As WORKER_PREP is never cleared , it
* doesn ' t participate in concurrency management .
*/
2018-05-21 18:04:35 +03:00
set_pf_worker ( true ) ;
2010-06-29 12:07:14 +04:00
repeat :
2017-08-23 14:58:44 +03:00
set_current_state ( TASK_IDLE ) ;
2010-06-29 12:07:14 +04:00
2014-04-18 19:04:16 +04:00
/*
* By the time the rescuer is requested to stop , the workqueue
* shouldn ' t have any work pending , but @ wq - > maydays may still have
* pwq ( s ) queued . This can happen by non - rescuer workers consuming
* all the work items before the rescuer got to them . Go through
* @ wq - > maydays processing before acting on should_stop so that the
* list is always empty on exit .
*/
should_stop = kthread_should_stop ( ) ;
2010-06-29 12:07:14 +04:00
2013-03-12 22:29:59 +04:00
/* see whether any pwq is asking for help */
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & wq_mayday_lock ) ;
2013-03-12 22:29:59 +04:00
while ( ! list_empty ( & wq - > maydays ) ) {
struct pool_workqueue * pwq = list_first_entry ( & wq - > maydays ,
struct pool_workqueue , mayday_node ) ;
2013-02-14 07:29:12 +04:00
struct worker_pool * pool = pwq - > pool ;
2010-06-29 12:07:14 +04:00
struct work_struct * work , * n ;
__set_current_state ( TASK_RUNNING ) ;
2013-03-12 22:29:59 +04:00
list_del_init ( & pwq - > mayday_node ) ;
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & wq_mayday_lock ) ;
2010-06-29 12:07:14 +04:00
2014-05-20 13:46:36 +04:00
worker_attach_to_pool ( rescuer , pool ) ;
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
2010-06-29 12:07:14 +04:00
/*
* Slurp in all works issued via this workqueue and
* process ' em .
*/
2023-08-08 04:57:25 +03:00
WARN_ON_ONCE ( ! list_empty ( & rescuer - > scheduled ) ) ;
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
a missing WQ_MEM_RECLAIM flag or a concurrency-managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements a workqueue lockup
detector. It periodically monitors all worker_pools and, if any pool
fails to make forward progress for longer than the threshold
duration, triggers a warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through the kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
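The stall check described above can be sketched as a pure function over a snapshot of pool state. This is a simplified model, not the detector's actual code: struct pool_state, count_stalled_pools(), the use of whole seconds for timestamps, and treating a zero threshold as "disabled" are all assumptions made for the example. It also assumes that now is never earlier than last_progress and does not handle counter wraparound.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-pool snapshot used only by this sketch. */
struct pool_state {
	size_t nr_pending;           /* work items waiting on the pool */
	unsigned long last_progress; /* time of last forward progress, in seconds */
};

/*
 * Count pools that have pending work but have made no forward progress
 * for longer than @thresh seconds as of @now. A pool with nothing queued
 * is never considered stalled, however old its timestamp is.
 */
static size_t count_stalled_pools(const struct pool_state *pools, size_t n,
				  unsigned long now, unsigned long thresh)
{
	size_t stalled = 0;

	if (thresh == 0)	/* assumed convention: zero disables the check */
		return 0;

	for (size_t i = 0; i < n; i++) {
		if (pools[i].nr_pending == 0)
			continue;
		if (now - pools[i].last_progress > thresh)
			stalled++;
	}
	return stalled;
}
```

With a threshold of 30 in this model, a pool that has 17 pending items and last progressed 31 seconds ago is counted, which matches the shape of the "stuck for 31s" report in the dump above.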
2015-12-08 19:28:04 +03:00
list_for_each_entry_safe ( work , n , & pool - > worklist , entry ) {
2023-08-08 04:57:25 +03:00
if ( get_work_pwq ( work ) = = pwq & &
assign_work ( work , rescuer , & n ) )
2023-05-18 06:02:08 +03:00
pwq - > stats [ PWQ_STAT_RESCUED ] + + ;
2015-12-08 19:28:04 +03:00
}
2010-06-29 12:07:14 +04:00
2023-08-08 04:57:25 +03:00
if ( ! list_empty ( & rescuer - > scheduled ) ) {
2014-12-08 20:39:16 +03:00
process_scheduled_works ( rescuer ) ;
/*
* The above execution of rescued work items could
* have created more to rescue through
2021-08-17 04:32:34 +03:00
* pwq_activate_first_inactive ( ) or chained
2014-12-08 20:39:16 +03:00
* queueing . Let ' s put @ pwq back on mayday list so
* that such back - to - back work items , which may be
* being used to relieve memory pressure , don ' t
* incur MAYDAY_INTERVAL delay in between .
*/
2020-05-29 09:58:59 +03:00
if ( pwq - > nr_active & & need_to_create_worker ( pool ) ) {
2020-05-27 22:46:33 +03:00
raw_spin_lock ( & wq_mayday_lock ) ;
2019-09-25 16:59:15 +03:00
/*
* Queue iff we aren ' t racing destruction
* and somebody else hasn ' t queued it already .
*/
if ( wq - > rescuer & & list_empty ( & pwq - > mayday_node ) ) {
get_pwq ( pwq ) ;
list_add_tail ( & pwq - > mayday_node , & wq - > maydays ) ;
}
2020-05-27 22:46:33 +03:00
raw_spin_unlock ( & wq_mayday_lock ) ;
2014-12-08 20:39:16 +03:00
}
}
2011-02-14 16:04:46 +03:00
2014-04-18 19:04:16 +04:00
/*
* Put the reference grabbed by send_mayday ( ) . @ pool won ' t
2014-07-22 09:03:47 +04:00
* go away while we ' re still attached to it .
2014-04-18 19:04:16 +04:00
*/
put_pwq ( pwq ) ;
2011-02-14 16:04:46 +03:00
/*
2023-08-08 04:57:25 +03:00
* Leave this pool . Notify regular workers ; otherwise , we end up
* with 0 concurrency and stalling the execution .
2011-02-14 16:04:46 +03:00
*/
2023-08-08 04:57:25 +03:00
kick_pool ( pool ) ;
2011-02-14 16:04:46 +03:00
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2014-07-22 09:03:47 +04:00
2018-05-18 18:47:13 +03:00
worker_detach_from_pool ( rescuer ) ;
2014-07-22 09:03:47 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & wq_mayday_lock ) ;
2010-06-29 12:07:14 +04:00
}
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & wq_mayday_lock ) ;
2013-03-12 22:29:59 +04:00
2014-04-18 19:04:16 +04:00
if ( should_stop ) {
__set_current_state ( TASK_RUNNING ) ;
2018-05-21 18:04:35 +03:00
set_pf_worker ( false ) ;
2014-04-18 19:04:16 +04:00
return 0 ;
}
2013-01-18 05:16:24 +04:00
/* rescuers should never participate in concurrency management */
WARN_ON_ONCE ( ! ( rescuer - > flags & WORKER_NOT_RUNNING ) ) ;
2010-06-29 12:07:14 +04:00
schedule ( ) ;
goto repeat ;
2005-04-17 02:20:36 +04:00
}
2024-02-05 00:28:06 +03:00
static void bh_worker ( struct worker * worker )
{
struct worker_pool * pool = worker - > pool ;
int nr_restarts = BH_WORKER_RESTARTS ;
unsigned long end = jiffies + BH_WORKER_JIFFIES ;
raw_spin_lock_irq ( & pool - > lock ) ;
worker_leave_idle ( worker ) ;
/*
* This function follows the structure of worker_thread ( ) . See there for
* explanations on each step .
*/
if ( ! need_more_worker ( pool ) )
goto done ;
WARN_ON_ONCE ( ! list_empty ( & worker - > scheduled ) ) ;
worker_clr_flags ( worker , WORKER_PREP | WORKER_REBOUND ) ;
do {
struct work_struct * work =
list_first_entry ( & pool - > worklist ,
struct work_struct , entry ) ;
if ( assign_work ( work , worker , NULL ) )
process_scheduled_works ( worker ) ;
} while ( keep_working ( pool ) & &
- - nr_restarts & & time_before ( jiffies , end ) ) ;
worker_set_flags ( worker , WORKER_PREP ) ;
done :
worker_enter_idle ( worker ) ;
kick_pool ( pool ) ;
raw_spin_unlock_irq ( & pool - > lock ) ;
}
/*
* TODO : Convert all tasklet users to workqueue and use softirq directly .
*
* This is currently called from tasklet [ _hi ] action ( ) and thus is also called
* whenever there are tasklets to run . Let ' s do an early exit if there ' s nothing
* queued . Once conversion from tasklet is complete , the need_more_worker ( ) test
* can be dropped .
*
* After full conversion , we ' ll add worker - > softirq_action , directly use the
* softirq action and obtain the worker pointer from the softirq_action pointer .
*/
void workqueue_softirq_action ( bool highpri )
{
struct worker_pool * pool =
& per_cpu ( bh_worker_pools , smp_processor_id ( ) ) [ highpri ] ;
if ( need_more_worker ( pool ) )
bh_worker ( list_first_entry ( & pool - > workers , struct worker , node ) ) ;
}
2024-02-27 04:38:55 +03:00
struct wq_drain_dead_softirq_work {
struct work_struct work ;
struct worker_pool * pool ;
struct completion done ;
} ;
static void drain_dead_softirq_workfn ( struct work_struct * work )
{
struct wq_drain_dead_softirq_work * dead_work =
container_of ( work , struct wq_drain_dead_softirq_work , work ) ;
struct worker_pool * pool = dead_work - > pool ;
bool repeat ;
/*
* @ pool ' s CPU is dead and we want to execute its still pending work
* items from this BH work item which is running on a different CPU . As
* its CPU is dead , @ pool can ' t be kicked and , as work execution path
* will be nested , a lockdep annotation needs to be suppressed . Mark
* @ pool with % POOL_BH_DRAINING for the special treatments .
*/
raw_spin_lock_irq ( & pool - > lock ) ;
pool - > flags | = POOL_BH_DRAINING ;
raw_spin_unlock_irq ( & pool - > lock ) ;
bh_worker ( list_first_entry ( & pool - > workers , struct worker , node ) ) ;
raw_spin_lock_irq ( & pool - > lock ) ;
pool - > flags & = ~ POOL_BH_DRAINING ;
repeat = need_more_worker ( pool ) ;
raw_spin_unlock_irq ( & pool - > lock ) ;
/*
* bh_worker ( ) might hit consecutive execution limit and bail . If there
* still are pending work items , reschedule self and return so that we
* don ' t hog this CPU ' s BH .
*/
if ( repeat ) {
if ( pool - > attrs - > nice = = HIGHPRI_NICE_LEVEL )
queue_work ( system_bh_highpri_wq , work ) ;
else
queue_work ( system_bh_wq , work ) ;
} else {
complete ( & dead_work - > done ) ;
}
}
/*
* @ cpu is dead . Drain the remaining BH work items on the current CPU . It ' s
* possible to allocate dead_work per CPU and avoid flushing . However , then we
* have to worry about draining overlapping with CPU coming back online or
* nesting ( one CPU ' s dead_work queued on another CPU which is also dead and so
* on ) . Let ' s keep it simple and drain them synchronously . These are BH work
* items which shouldn ' t be requeued on the same pool . Shouldn ' t take long .
*/
void workqueue_softirq_dead ( unsigned int cpu )
{
int i ;
for ( i = 0 ; i < NR_STD_WORKER_POOLS ; i + + ) {
struct worker_pool * pool = & per_cpu ( bh_worker_pools , cpu ) [ i ] ;
struct wq_drain_dead_softirq_work dead_work ;
if ( ! need_more_worker ( pool ) )
continue ;
2024-03-08 12:42:50 +03:00
INIT_WORK_ONSTACK ( & dead_work . work , drain_dead_softirq_workfn ) ;
2024-02-27 04:38:55 +03:00
dead_work . pool = pool ;
init_completion ( & dead_work . done ) ;
if ( pool - > attrs - > nice = = HIGHPRI_NICE_LEVEL )
queue_work ( system_bh_highpri_wq , & dead_work . work ) ;
else
queue_work ( system_bh_wq , & dead_work . work ) ;
wait_for_completion ( & dead_work . done ) ;
2024-04-08 11:44:04 +03:00
destroy_work_on_stack ( & dead_work . work ) ;
2024-02-27 04:38:55 +03:00
}
}
2015-12-07 18:58:57 +03:00
/**
* check_flush_dependency - check for flush dependency sanity
* @ target_wq : workqueue being flushed
* @ target_work : work item being flushed ( NULL for workqueue flushes )
*
* % current is trying to flush the whole @ target_wq or @ target_work on it .
* If @ target_wq doesn ' t have % WQ_MEM_RECLAIM , verify that % current is not
* reclaiming memory or running on a workqueue which doesn ' t have
* % WQ_MEM_RECLAIM as that can break forward - progress guarantee leading to
* a deadlock .
*/
static void check_flush_dependency ( struct workqueue_struct * target_wq ,
struct work_struct * target_work )
{
work_func_t target_func = target_work ? target_work - > func : NULL ;
struct worker * worker ;
if ( target_wq - > flags & WQ_MEM_RECLAIM )
return ;
worker = current_wq_worker ( ) ;
WARN_ONCE ( current - > flags & PF_MEMALLOC ,
2019-03-25 22:32:28 +03:00
" workqueue: PF_MEMALLOC task %d(%s) is flushing !WQ_MEM_RECLAIM %s:%ps " ,
2015-12-07 18:58:57 +03:00
current - > pid , current - > comm , target_wq - > name , target_func ) ;
2016-01-29 13:59:46 +03:00
WARN_ONCE ( worker & & ( ( worker - > current_pwq - > wq - > flags &
( WQ_MEM_RECLAIM | __WQ_LEGACY ) ) = = WQ_MEM_RECLAIM ) ,
2019-03-25 22:32:28 +03:00
" workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s:%ps " ,
2015-12-07 18:58:57 +03:00
worker - > current_pwq - > wq - > name , worker - > current_func ,
target_wq - > name , target_func ) ;
}
2007-05-09 13:33:51 +04:00
struct wq_barrier {
struct work_struct work ;
struct completion done ;
2015-03-09 16:22:28 +03:00
struct task_struct * task ; /* purely informational */
2007-05-09 13:33:51 +04:00
} ;
static void wq_barrier_func ( struct work_struct * work )
{
struct wq_barrier * barr = container_of ( work , struct wq_barrier , work ) ;
complete ( & barr - > done ) ;
}
2010-06-29 12:07:10 +04:00
/**
* insert_wq_barrier - insert a barrier work
2013-02-14 07:29:12 +04:00
* @ pwq : pwq to insert barrier into
2010-06-29 12:07:10 +04:00
* @ barr : wq_barrier to insert
2010-06-29 12:07:12 +04:00
* @ target : target work to attach @ barr to
* @ worker : worker currently executing @ target , NULL if @ target is not executing
2010-06-29 12:07:10 +04:00
*
2010-06-29 12:07:12 +04:00
* @ barr is linked to @ target such that @ barr is completed only after
* @ target finishes execution . Please note that the ordering
* guarantee is observed only with respect to @ target and on the local
* cpu .
*
* Currently , a queued barrier can ' t be canceled . This is because
* try_to_grab_pending ( ) can ' t determine whether the work to be
* grabbed is at the head of the queue and thus can ' t clear LINKED
* flag of the previous work while there must be a valid next work
* after a work with LINKED flag set .
*
* Note that when @ worker is non - NULL , @ target may be modified
2013-02-14 07:29:12 +04:00
* underneath us , so we can ' t reliably determine pwq from @ target .
2010-06-29 12:07:10 +04:00
*
* CONTEXT :
2020-05-27 22:46:33 +03:00
* raw_spin_lock_irq ( pool - > lock ) .
2010-06-29 12:07:10 +04:00
*/
2013-02-14 07:29:12 +04:00
static void insert_wq_barrier ( struct pool_workqueue * pwq ,
2010-06-29 12:07:12 +04:00
struct wq_barrier * barr ,
struct work_struct * target , struct worker * worker )
2007-05-09 13:33:51 +04:00
{
2024-02-05 00:28:06 +03:00
static __maybe_unused struct lock_class_key bh_key , thr_key ;
2021-08-17 04:32:38 +03:00
unsigned int work_flags = 0 ;
unsigned int work_color ;
2010-06-29 12:07:12 +04:00
struct list_head * head ;
2009-11-15 19:09:48 +03:00
/*
2013-01-24 23:01:33 +04:00
* debugobject calls are safe here even with pool - > lock locked
2009-11-15 19:09:48 +03:00
* as we know for sure that this will not trigger any of the
* checks and call back into the fixup functions where we
* might deadlock .
2024-02-05 00:28:06 +03:00
*
* BH and threaded workqueues need separate lockdep keys to avoid
* spuriously triggering " inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W}
* usage " .
2009-11-15 19:09:48 +03:00
*/
2024-02-05 00:28:06 +03:00
INIT_WORK_ONSTACK_KEY ( & barr - > work , wq_barrier_func ,
( pwq - > wq - > flags & WQ_BH ) ? & bh_key : & thr_key ) ;
2010-06-29 12:07:10 +04:00
__set_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( & barr - > work ) ) ;
locking/lockdep: Explicitly initialize wq_barrier::done::map
With the new lockdep crossrelease feature, which checks completions usage,
a false positive is reported in the workqueue code:
> Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released
> Task B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released
> Task C : acquired of lock#3 -> wait for completion of barr->done
> (Task C is in lru_add_drain_all_cpuslocked())
> Worker D : wait for wfc.work to be released -> will complete barr->done
Such a deadlock cannot happen because Task C's barr->done and Worker D's
barr->done cannot be the same instance.
The reason for this false positive is that we initialize all wq_barrier::done
at insert_wq_barrier() via init_completion(), which makes them belong to
the same lock class; therefore, impossible cycles are reported.
To fix this, explicitly initialize the lockdep map for wq_barrier::done
in insert_wq_barrier(), so that the lock class key of wq_barrier::done
is a subkey of the corresponding work_struct. As a result, we won't build
a dependency between a wq_barrier and an unrelated work, and we can
differentiate wq barriers based on the related works, so the false positive
above is avoided.
Also define the empty lockdep_init_map_crosslock() for !CROSSRELEASE
to make the code simple and away from unnecessary #ifdefs.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 12:46:12 +03:00
2017-10-25 11:56:04 +03:00
init_completion_map ( & barr - > done , & target - > lockdep_map ) ;
2015-03-09 16:22:28 +03:00
barr - > task = current ;
2007-05-09 13:33:54 +04:00
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented a more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers, including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't want
to set max_active too low, but as soon as we increase max_active a bit, we
can end up with an unreasonable number of in-flight work items when many CPUs
issue IOs at the same time, i.e. the lowest acceptable max_active is higher
than the highest acceptable max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice as many
online effective CPUs will get twice the portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to a higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
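The per-node split described in the list above can be sketched in plain C. This is a user-space illustration under stated assumptions, not the kernel implementation: the helper name `node_max_active` and its parameters are hypothetical, and both the round-up division and the clamp to the range [min_active, max_active] are assumptions drawn from the description (a proportional share that never drops below min_active).

```c
#include <assert.h>

/*
 * Hypothetical helper: the portion of max_active handed to one node.
 * node_cpus is the number of the workqueue's online effective CPUs on
 * that node, total_cpus the number across all nodes.
 */
static int node_max_active(int max_active, int min_active,
                           int node_cpus, int total_cpus)
{
    int share;

    assert(total_cpus > 0 && node_cpus >= 0 && node_cpus <= total_cpus);
    assert(min_active <= max_active);

    /* Proportional share, rounded up (assumption). */
    share = (max_active * node_cpus + total_cpus - 1) / total_cpus;

    /* Never below min_active, never above the configured max_active. */
    if (share < min_active)
        share = min_active;
    if (share > max_active)
        share = max_active;
    return share;
}
```

With max_active = 16 and min_active = 8 on a machine whose two nodes hold 4 and 12 of 16 CPUs, the sketch yields 8 and 12. The sum, 20, exceeds the configured 16, which is the "higher effective max_active than configured" case the message warns about.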
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
/* The barrier work item does not participate in nr_active. */
2021-08-17 04:32:37 +03:00
work_flags | = WORK_STRUCT_INACTIVE ;
2010-06-29 12:07:12 +04:00
/*
* If @ target is currently being executed , schedule the
* barrier to the worker ; otherwise , put it after @ target .
*/
2021-08-17 04:32:38 +03:00
if ( worker ) {
2010-06-29 12:07:12 +04:00
head = worker - > scheduled . next ;
2021-08-17 04:32:38 +03:00
work_color = worker - > current_color ;
} else {
2010-06-29 12:07:12 +04:00
unsigned long * bits = work_data_bits ( target ) ;
head = target - > entry . next ;
/* there can already be other linked works, inherit and set */
2021-08-17 04:32:36 +03:00
work_flags | = * bits & WORK_STRUCT_LINKED ;
2021-08-17 04:32:38 +03:00
work_color = get_work_color ( * bits ) ;
2010-06-29 12:07:12 +04:00
__set_bit ( WORK_STRUCT_LINKED_BIT , bits ) ;
}
2021-08-17 04:32:38 +03:00
pwq - > nr_in_flight [ work_color ] + + ;
work_flags | = work_color_to_flags ( work_color ) ;
2021-08-17 04:32:36 +03:00
insert_work ( pwq , & barr - > work , head , work_flags ) ;
2007-05-09 13:33:51 +04:00
}
2010-06-29 12:07:11 +04:00
/**
2013-02-14 07:29:12 +04:00
* flush_workqueue_prep_pwqs - prepare pwqs for workqueue flushing
2010-06-29 12:07:11 +04:00
* @ wq : workqueue being flushed
* @ flush_color : new flush color , < 0 for no - op
* @ work_color : new work color , < 0 for no - op
*
2013-02-14 07:29:12 +04:00
* Prepare pwqs for workqueue flushing .
2010-06-29 12:07:11 +04:00
*
2013-02-14 07:29:12 +04:00
* If @ flush_color is non - negative , flush_color on all pwqs should be
* - 1. If no pwq has in - flight commands at the specified color , all
* pwq - > flush_color ' s stay at - 1 and % false is returned . If any pwq
* has in flight commands , its pwq - > flush_color is set to
* @ flush_color , @ wq - > nr_pwqs_to_flush is updated accordingly , pwq
2010-06-29 12:07:11 +04:00
* wakeup logic is armed and % true is returned .
*
* The caller should have initialized @ wq - > first_flusher prior to
* calling this function with non - negative @ flush_color . If
* @ flush_color is negative , no flush color update is done and % false
* is returned .
*
2013-02-14 07:29:12 +04:00
* If @ work_color is non - negative , all pwqs should have the same
2010-06-29 12:07:11 +04:00
* work_color which is previous to @ work_color and all will be
* advanced to @ work_color .
*
* CONTEXT :
2013-03-26 03:57:17 +04:00
* mutex_lock ( wq - > mutex ) .
2010-06-29 12:07:11 +04:00
*
2013-08-01 01:59:24 +04:00
* Return :
2010-06-29 12:07:11 +04:00
* % true if @ flush_color > = 0 and there ' s something to flush . % false
* otherwise .
*/
2013-02-14 07:29:12 +04:00
static bool flush_workqueue_prep_pwqs ( struct workqueue_struct * wq ,
2010-06-29 12:07:11 +04:00
int flush_color , int work_color )
2005-04-17 02:20:36 +04:00
{
2010-06-29 12:07:11 +04:00
bool wait = false ;
2013-03-12 22:29:58 +04:00
struct pool_workqueue * pwq ;
2005-04-17 02:20:36 +04:00
2010-06-29 12:07:11 +04:00
if ( flush_color > = 0 ) {
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( atomic_read ( & wq - > nr_pwqs_to_flush ) ) ;
2013-02-14 07:29:12 +04:00
atomic_set ( & wq - > nr_pwqs_to_flush , 1 ) ;
2005-04-17 02:20:36 +04:00
}
2009-04-03 03:58:24 +04:00
2013-03-12 22:29:58 +04:00
for_each_pwq ( pwq , wq ) {
2013-02-14 07:29:12 +04:00
struct worker_pool * pool = pwq - > pool ;
2007-05-09 13:33:51 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
2007-05-09 13:33:54 +04:00
2010-06-29 12:07:11 +04:00
if ( flush_color > = 0 ) {
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( pwq - > flush_color ! = - 1 ) ;
2007-05-09 13:33:51 +04:00
2013-02-14 07:29:12 +04:00
if ( pwq - > nr_in_flight [ flush_color ] ) {
pwq - > flush_color = flush_color ;
atomic_inc ( & wq - > nr_pwqs_to_flush ) ;
2010-06-29 12:07:11 +04:00
wait = true ;
}
}
2005-04-17 02:20:36 +04:00
2010-06-29 12:07:11 +04:00
if ( work_color > = 0 ) {
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( work_color ! = work_next_color ( pwq - > work_color ) ) ;
2013-02-14 07:29:12 +04:00
pwq - > work_color = work_color ;
2010-06-29 12:07:11 +04:00
}
2005-04-17 02:20:36 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2005-04-17 02:20:36 +04:00
}
2009-04-03 03:58:24 +04:00
2013-02-14 07:29:12 +04:00
if ( flush_color > = 0 & & atomic_dec_and_test ( & wq - > nr_pwqs_to_flush ) )
2010-06-29 12:07:11 +04:00
complete ( & wq - > first_flusher - > done ) ;
2007-05-24 00:57:57 +04:00
2010-06-29 12:07:11 +04:00
return wait ;
2005-04-17 02:20:36 +04:00
}
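The counter handling in flush_workqueue_prep_pwqs() follows a biased reference count: the counter starts at 1, each pwq with in-flight work adds one, and the bias is dropped only after the loop, so completion cannot fire while pwqs are still being armed. A minimal user-space sketch of that pattern follows. The names `flush_state`, `flush_put` and `flush_arm` are hypothetical, C11 atomics stand in for the kernel's atomic_t, and a plain flag stands in for complete().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the flush bookkeeping of one workqueue. */
struct flush_state {
    atomic_int nr_to_flush; /* outstanding references, including the bias */
    bool completed;         /* stands in for complete() on the first flusher */
};

/* Drop one reference; whoever drops the last one signals completion. */
static void flush_put(struct flush_state *fs)
{
    if (atomic_fetch_sub(&fs->nr_to_flush, 1) == 1)
        fs->completed = true;
}

/*
 * Arm a flush over nr_queues queues; in_flight[i] is the number of
 * in-flight items on queue i. Returns true if there is something to
 * wait for.
 */
static bool flush_arm(struct flush_state *fs, const int *in_flight,
                      int nr_queues)
{
    bool wait = false;

    atomic_store(&fs->nr_to_flush, 1); /* the bias */
    fs->completed = false;

    for (int i = 0; i < nr_queues; i++) {
        if (in_flight[i]) {
            atomic_fetch_add(&fs->nr_to_flush, 1);
            wait = true;
        }
    }

    flush_put(fs); /* drop the bias; completes at once if nothing was armed */
    return wait;
}
```

In this sketch each armed queue is expected to call `flush_put()` once when its in-flight count for the flushed color reaches zero; the last such call flips `completed`.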
2024-02-05 00:28:06 +03:00
static void touch_wq_lockdep_map ( struct workqueue_struct * wq )
{
2024-02-05 00:28:06 +03:00
# ifdef CONFIG_LOCKDEP
if ( wq - > flags & WQ_BH )
local_bh_disable ( ) ;
2024-02-05 00:28:06 +03:00
lock_map_acquire ( & wq - > lockdep_map ) ;
lock_map_release ( & wq - > lockdep_map ) ;
2024-02-05 00:28:06 +03:00
if ( wq - > flags & WQ_BH )
local_bh_enable ( ) ;
# endif
2024-02-05 00:28:06 +03:00
}
static void touch_work_lockdep_map ( struct work_struct * work ,
struct workqueue_struct * wq )
{
2024-02-05 00:28:06 +03:00
# ifdef CONFIG_LOCKDEP
if ( wq - > flags & WQ_BH )
local_bh_disable ( ) ;
2024-02-05 00:28:06 +03:00
lock_map_acquire ( & work - > lockdep_map ) ;
lock_map_release ( & work - > lockdep_map ) ;
2024-02-05 00:28:06 +03:00
if ( wq - > flags & WQ_BH )
local_bh_enable ( ) ;
# endif
2024-02-05 00:28:06 +03:00
}
2006-07-30 14:03:42 +04:00
/**
2022-06-01 10:32:47 +03:00
* __flush_workqueue - ensure that any scheduled work has run to completion .
2006-07-30 14:03:42 +04:00
* @ wq : workqueue to flush
2005-04-17 02:20:36 +04:00
*
2013-03-14 03:51:36 +04:00
* This function sleeps until all work items which were queued on entry
* have finished execution , but it is not livelocked by new incoming ones .
2005-04-17 02:20:36 +04:00
*/
2022-06-01 10:32:47 +03:00
void __flush_workqueue ( struct workqueue_struct * wq )
2005-04-17 02:20:36 +04:00
{
2010-06-29 12:07:11 +04:00
struct wq_flusher this_flusher = {
. list = LIST_HEAD_INIT ( this_flusher . list ) ,
. flush_color = - 1 ,
2017-10-25 11:56:04 +03:00
. done = COMPLETION_INITIALIZER_ONSTACK_MAP ( this_flusher . done , wq - > lockdep_map ) ,
2010-06-29 12:07:11 +04:00
} ;
int next_color ;
2005-04-17 02:20:36 +04:00
2016-09-16 22:49:32 +03:00
if ( WARN_ON ( ! wq_online ) )
return ;
2024-02-05 00:28:06 +03:00
touch_wq_lockdep_map ( wq ) ;
2018-08-22 12:49:04 +03:00
2013-03-26 03:57:17 +04:00
mutex_lock ( & wq - > mutex ) ;
2010-06-29 12:07:11 +04:00
/*
* Start - to - wait phase
*/
next_color = work_next_color ( wq - > work_color ) ;
if ( next_color ! = wq - > flush_color ) {
/*
* Color space is not full . The current work_color
* becomes our flush_color and work_color is advanced
* by one .
*/
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( ! list_empty ( & wq - > flusher_overflow ) ) ;
2010-06-29 12:07:11 +04:00
this_flusher . flush_color = wq - > work_color ;
wq - > work_color = next_color ;
if ( ! wq - > first_flusher ) {
/* no flush in progress, become the first flusher */
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( wq - > flush_color ! = this_flusher . flush_color ) ;
2010-06-29 12:07:11 +04:00
wq - > first_flusher = & this_flusher ;
2013-02-14 07:29:12 +04:00
if ( ! flush_workqueue_prep_pwqs ( wq , wq - > flush_color ,
2010-06-29 12:07:11 +04:00
wq - > work_color ) ) {
/* nothing to flush, done */
wq - > flush_color = next_color ;
wq - > first_flusher = NULL ;
goto out_unlock ;
}
} else {
/* wait in queue */
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( wq - > flush_color = = this_flusher . flush_color ) ;
2010-06-29 12:07:11 +04:00
list_add_tail ( & this_flusher . list , & wq - > flusher_queue ) ;
2013-02-14 07:29:12 +04:00
flush_workqueue_prep_pwqs ( wq , - 1 , wq - > work_color ) ;
2010-06-29 12:07:11 +04:00
}
} else {
/*
* Oops , color space is full , wait on overflow queue .
* The next flush completion will assign us
* flush_color and transfer to flusher_queue .
*/
list_add_tail ( & this_flusher . list , & wq - > flusher_overflow ) ;
}
2015-12-07 18:58:57 +03:00
check_flush_dependency ( wq , NULL ) ;
2013-03-26 03:57:17 +04:00
mutex_unlock ( & wq - > mutex ) ;
2010-06-29 12:07:11 +04:00
wait_for_completion ( & this_flusher . done ) ;
/*
* Wake - up - and - cascade phase
*
* First flushers are responsible for cascading flushes and
* handling overflow . Non - first flushers can simply return .
*/
2020-03-10 19:23:19 +03:00
if ( READ_ONCE ( wq - > first_flusher ) ! = & this_flusher )
2010-06-29 12:07:11 +04:00
return ;
2013-03-26 03:57:17 +04:00
mutex_lock ( & wq - > mutex ) ;
2010-06-29 12:07:11 +04:00
2010-07-02 12:03:51 +04:00
/* we might have raced, check again with mutex held */
if ( wq - > first_flusher ! = & this_flusher )
goto out_unlock ;
2020-03-10 19:23:19 +03:00
WRITE_ONCE ( wq - > first_flusher , NULL ) ;
2010-06-29 12:07:11 +04:00
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( ! list_empty ( & this_flusher . list ) ) ;
WARN_ON_ONCE ( wq - > flush_color ! = this_flusher . flush_color ) ;
2010-06-29 12:07:11 +04:00
while ( true ) {
struct wq_flusher * next , * tmp ;
/* complete all the flushers sharing the current flush color */
list_for_each_entry_safe ( next , tmp , & wq - > flusher_queue , list ) {
if ( next - > flush_color ! = wq - > flush_color )
break ;
list_del_init ( & next - > list ) ;
complete ( & next - > done ) ;
}
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( ! list_empty ( & wq - > flusher_overflow ) & &
wq - > flush_color ! = work_next_color ( wq - > work_color ) ) ;
2010-06-29 12:07:11 +04:00
/* this flush_color is finished, advance by one */
wq - > flush_color = work_next_color ( wq - > flush_color ) ;
/* one color has been freed, handle overflow queue */
if ( ! list_empty ( & wq - > flusher_overflow ) ) {
/*
* Assign the same color to all overflowed
* flushers , advance work_color and append to
* flusher_queue . This is the start - to - wait
* phase for these overflowed flushers .
*/
list_for_each_entry ( tmp , & wq - > flusher_overflow , list )
tmp - > flush_color = wq - > work_color ;
wq - > work_color = work_next_color ( wq - > work_color ) ;
list_splice_tail_init ( & wq - > flusher_overflow ,
& wq - > flusher_queue ) ;
2013-02-14 07:29:12 +04:00
flush_workqueue_prep_pwqs ( wq , - 1 , wq - > work_color ) ;
2010-06-29 12:07:11 +04:00
}
if ( list_empty ( & wq - > flusher_queue ) ) {
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( wq - > flush_color ! = wq - > work_color ) ;
2010-06-29 12:07:11 +04:00
break ;
}
/*
* Need to flush more colors . Make the next flusher
2013-02-14 07:29:12 +04:00
* the new first flusher and arm pwqs .
2010-06-29 12:07:11 +04:00
*/
2013-03-12 22:29:57 +04:00
WARN_ON_ONCE ( wq - > flush_color = = wq - > work_color ) ;
WARN_ON_ONCE ( wq - > flush_color ! = next - > flush_color ) ;
2010-06-29 12:07:11 +04:00
list_del_init ( & next - > list ) ;
wq - > first_flusher = next ;
2013-02-14 07:29:12 +04:00
if ( flush_workqueue_prep_pwqs ( wq , wq - > flush_color , - 1 ) )
2010-06-29 12:07:11 +04:00
break ;
/*
* Meh . . . this color is already done , clear first
* flusher and repeat cascading .
*/
wq - > first_flusher = NULL ;
}
out_unlock :
2013-03-26 03:57:17 +04:00
mutex_unlock ( & wq - > mutex ) ;
2005-04-17 02:20:36 +04:00
}
2022-06-01 10:32:47 +03:00
EXPORT_SYMBOL ( __flush_workqueue ) ;
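The start-to-wait phase above treats colors as a small ring: a flusher takes the current work_color as its flush_color unless advancing work_color would collide with the oldest color still being flushed. The sketch below illustrates that check in user space. `NR_COLORS`, `next_color` and `color_space_full` are hypothetical names, and the ring size of 16 is an assumption chosen for illustration.

```c
#include <stdbool.h>

#define NR_COLORS 16 /* assumption: size of the color ring */

/* Advance a color by one, wrapping around the ring. */
static int next_color(int color)
{
    return (color + 1) % NR_COLORS;
}

/*
 * True when a new flusher cannot be given a color of its own and has
 * to wait on the overflow list until a color is freed.
 */
static bool color_space_full(int work_color, int flush_color)
{
    return next_color(work_color) == flush_color;
}
```

When no flush is in progress, work_color equals flush_color and the ring is not full; it becomes full once NR_COLORS - 1 colors are outstanding.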
2005-04-17 02:20:36 +04:00
2011-04-05 20:01:44 +04:00
/**
* drain_workqueue - drain a workqueue
* @ wq : workqueue to drain
*
* Wait until the workqueue becomes empty . While draining is in progress ,
* only chain queueing is allowed . IOW , only currently pending or running
* work items on @ wq can queue further work items on it . @ wq is flushed
2015-05-13 13:10:05 +03:00
* repeatedly until it becomes empty . The number of flushes is determined
2011-04-05 20:01:44 +04:00
* by the depth of chaining and should be relatively short . Whine if it
* takes too long .
*/
void drain_workqueue ( struct workqueue_struct * wq )
{
unsigned int flush_cnt = 0 ;
2013-03-12 22:29:58 +04:00
struct pool_workqueue * pwq ;
2011-04-05 20:01:44 +04:00
/*
* __queue_work ( ) needs to test whether there are drainers , is much
* hotter than drain_workqueue ( ) and already looks at @ wq - > flags .
2013-03-12 22:30:04 +04:00
* Use __WQ_DRAINING so that queue doesn ' t have to check nr_drainers .
2011-04-05 20:01:44 +04:00
*/
2013-03-26 03:57:18 +04:00
mutex_lock ( & wq - > mutex ) ;
2011-04-05 20:01:44 +04:00
if ( ! wq - > nr_drainers + + )
2013-03-12 22:30:04 +04:00
wq - > flags | = __WQ_DRAINING ;
2013-03-26 03:57:18 +04:00
mutex_unlock ( & wq - > mutex ) ;
2011-04-05 20:01:44 +04:00
reflush :
2022-06-01 10:32:47 +03:00
__flush_workqueue ( wq ) ;
2011-04-05 20:01:44 +04:00
2013-03-26 03:57:18 +04:00
mutex_lock ( & wq - > mutex ) ;
2013-03-12 22:30:00 +04:00
2013-03-12 22:29:58 +04:00
for_each_pwq ( pwq , wq ) {
2011-09-15 03:22:28 +04:00
bool drained ;
2011-04-05 20:01:44 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pwq - > pool - > lock ) ;
2024-01-29 21:11:24 +03:00
drained = pwq_is_empty ( pwq ) ;
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pwq - > pool - > lock ) ;
2011-09-15 03:22:28 +04:00
if ( drained )
2011-04-05 20:01:44 +04:00
continue ;
if ( + + flush_cnt = = 10 | |
( flush_cnt % 100 = = 0 & & flush_cnt < = 1000 ) )
2021-01-23 11:04:00 +03:00
pr_warn ( " workqueue %s: %s() isn't complete after %u tries \n " ,
wq - > name , __func__ , flush_cnt ) ;
2013-03-12 22:30:00 +04:00
2013-03-26 03:57:18 +04:00
mutex_unlock ( & wq - > mutex ) ;
2011-04-05 20:01:44 +04:00
goto reflush ;
}
if ( ! - - wq - > nr_drainers )
2013-03-12 22:30:04 +04:00
wq - > flags & = ~ __WQ_DRAINING ;
2013-03-26 03:57:18 +04:00
mutex_unlock ( & wq - > mutex ) ;
2011-04-05 20:01:44 +04:00
}
EXPORT_SYMBOL_GPL ( drain_workqueue ) ;
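The warning in drain_workqueue() is rate limited by the retry count: it fires on the 10th retry and then on every 100th retry up to the 1000th. The predicate can be isolated as below; `drain_should_warn` is a hypothetical name, and the argument is the value of flush_cnt after the increment.

```c
#include <stdbool.h>

/* True on the retries where drain_workqueue() would print its warning. */
static bool drain_should_warn(unsigned int flush_cnt)
{
    return flush_cnt == 10 ||
           (flush_cnt % 100 == 0 && flush_cnt <= 1000);
}
```

Over any number of retries this yields at most 11 warnings (retry 10, then 100, 200, up to 1000), so a workqueue that never drains cannot flood the log.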
workqueue: skip lockdep wq dependency in cancel_work_sync()
In cancel_work_sync(), we can only have one of two cases, even
with an ordered workqueue:
* the work isn't running, just cancelled before it started
* the work is running, but then nothing else can be on the
workqueue before it
Thus, we need to skip the lockdep workqueue dependency handling,
otherwise we get false positive reports from lockdep saying that
we have a potential deadlock when the workqueue also has other
work items with locking, e.g.
work1_function() { mutex_lock(&mutex); ... }
work2_function() { /* nothing */ }
other_function() {
queue_work(ordered_wq, &work1);
queue_work(ordered_wq, &work2);
mutex_lock(&mutex);
cancel_work_sync(&work2);
}
As described above, this isn't a problem, but lockdep will
currently flag it as if cancel_work_sync() was flush_work(),
which *is* a problem.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-08-22 12:49:03 +03:00
static bool start_flush_work ( struct work_struct * work , struct wq_barrier * barr ,
bool from_cancel )
workqueues: implement flush_work()
Most users of flush_workqueue() can be changed to use cancel_work_sync(),
but sometimes we really need to wait for the completion and cancelling is not
an option. schedule_on_each_cpu() is a good example.
Add the new helper, flush_work(work), which waits for the completion of the
specific work_struct. More precisely, it "flushes" the result of the last
queue_work() which is visible to the caller.
For example, this code
queue_work(wq, work);
/* WINDOW */
queue_work(wq, work);
flush_work(work);
doesn't necessarily work "as expected". What can happen in the WINDOW above is
- wq starts the execution of work->func()
- the caller migrates to another CPU
now, after the 2nd queue_work() this work is active on the previous CPU, and
at the same time it is queued on another. In this case flush_work(work) may
return before the first work->func() completes.
It is trivial to add another helper
int flush_work_sync(struct work_struct *work)
{
return flush_work(work) || wait_on_work(work);
}
which works "more correctly", but it has to iterate over all CPUs and thus
it is much slower than flush_work().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Jarek Poplawski <jarkao2@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 12:47:49 +04:00
{
2010-06-29 12:07:12 +04:00
struct worker * worker = NULL ;
2013-01-24 23:01:33 +04:00
struct worker_pool * pool ;
2013-02-14 07:29:12 +04:00
struct pool_workqueue * pwq ;
2024-02-05 00:28:06 +03:00
struct workqueue_struct * wq ;
2008-07-25 12:47:49 +04:00
2019-03-13 19:55:47 +03:00
rcu_read_lock ( ) ;
2013-01-24 23:01:33 +04:00
pool = get_work_pool ( work ) ;
2013-03-12 22:30:00 +04:00
if ( ! pool ) {
2019-03-13 19:55:47 +03:00
rcu_read_unlock ( ) ;
2010-09-16 12:42:16 +04:00
return false ;
2013-03-12 22:30:00 +04:00
}
2008-07-25 12:47:49 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
workqueue: simplify is-work-item-queued-here test
Currently, determining whether a work item is queued on a locked pool
involves somewhat convoluted memory barrier dancing. It goes like the
following.
* When a work item is queued on a pool, work->data is updated before
work->entry is linked to the pending list with a wmb() in between.
* When trying to determine whether a work item is currently queued on
a pool pointed to by work->data, it locks the pool and looks at
work->entry. If work->entry is linked, we then do rmb() and then
check whether work->data points to the current pool.
This works because work->data can only point to a pool if it
currently is or was on the pool and,
* If it currently is on the pool, the tests would obviously succeed.
* If it left the pool, its work->entry was cleared under pool->lock,
so if we're seeing non-empty work->entry, it has to be from the work
item being linked on another pool. Because work->data is updated
before work->entry is linked with wmb() in between, work->data update
from another pool is guaranteed to be visible if we do rmb() after
seeing non-empty work->entry. So, we either see empty work->entry
or we see updated work->data pointing to another pool.
While this works, it's convoluted, to put it mildly. With recent
updates, it's now guaranteed that work->data points to cwq only while
the work item is queued and that updating work->data to point to cwq
or back to pool is done under pool->lock, so we can simply test
whether work->data points to cwq which is associated with the
currently locked pool instead of the convoluted memory barrier
dancing.
This patch replaces the memory barrier based "are you still here,
really?" test with the much simpler "does work->data point to me?" test -
if work->data points to a cwq which is associated with the currently
locked pool, the work item is guaranteed to be queued on the pool as
work->data can start and stop pointing to such cwq only under
pool->lock and the start and stop coincide with queue and dequeue.
tj: Rewrote the comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-07 06:04:53 +04:00
/* see the comment in try_to_grab_pending() with the same code */
2013-02-14 07:29:12 +04:00
pwq = get_work_pwq ( work ) ;
if ( pwq ) {
if ( unlikely ( pwq - > pool ! = pool ) )
2010-06-29 12:07:10 +04:00
goto already_gone ;
2012-08-21 01:51:23 +04:00
} else {
2013-01-24 23:01:33 +04:00
worker = find_worker_executing_work ( pool , work ) ;
2010-06-29 12:07:12 +04:00
if ( ! worker )
2010-06-29 12:07:10 +04:00
goto already_gone ;
2013-02-14 07:29:12 +04:00
pwq = worker - > current_pwq ;
2012-08-21 01:51:23 +04:00
}
2008-07-25 12:47:49 +04:00
2024-02-05 00:28:06 +03:00
wq = pwq - > wq ;
check_flush_dependency ( wq , work ) ;
2015-12-07 18:58:57 +03:00
2013-02-14 07:29:12 +04:00
insert_wq_barrier ( pwq , barr , work , worker ) ;
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-06-29 12:07:13 +04:00
2024-02-05 00:28:06 +03:00
touch_work_lockdep_map ( work , wq ) ;
2011-01-10 01:32:15 +03:00
/*
2017-08-23 13:52:32 +03:00
* Force a lock recursion deadlock when using flush_work ( ) inside a
* single - threaded or rescuer equipped workqueue .
*
* For single threaded workqueues the deadlock happens when the flushed
* work is queued behind the work issuing the flush_work ( ) . For rescuer
* equipped workqueues the deadlock happens when the rescuer stalls ,
* blocking forward progress .
2011-01-10 01:32:15 +03:00
*/
2024-02-05 00:28:06 +03:00
if ( ! from_cancel & & ( wq - > saved_max_active = = 1 | | wq - > rescuer ) )
touch_wq_lockdep_map ( wq ) ;
2019-03-13 19:55:47 +03:00
rcu_read_unlock ( ) ;
2010-09-16 12:36:00 +04:00
return true ;
2010-06-29 12:07:10 +04:00
already_gone :
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2019-03-13 19:55:47 +03:00
rcu_read_unlock ( ) ;
2010-09-16 12:36:00 +04:00
return false ;
2008-07-25 12:47:49 +04:00
}
2010-09-16 12:42:16 +04:00
workqueue: skip lockdep wq dependency in cancel_work_sync()
In cancel_work_sync(), we can only have one of two cases, even
with an ordered workqueue:
* the work isn't running, just cancelled before it started
* the work is running, but then nothing else can be on the
workqueue before it
Thus, we need to skip the lockdep workqueue dependency handling,
otherwise we get false positive reports from lockdep saying that
we have a potential deadlock when the workqueue also has other
work items with locking, e.g.
	work1_function() { mutex_lock(&mutex); ... }
	work2_function() { /* nothing */ }
	other_function() {
		queue_work(ordered_wq, &work1);
		queue_work(ordered_wq, &work2);
		mutex_lock(&mutex);
		cancel_work_sync(&work2);
	}
As described above, this isn't a problem, but lockdep will
currently flag it as if cancel_work_sync() was flush_work(),
which *is* a problem.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-08-22 12:49:03 +03:00
static bool __flush_work ( struct work_struct * work , bool from_cancel )
{
struct wq_barrier barr ;
2024-03-25 20:21:03 +03:00
unsigned long data ;
2018-08-22 12:49:03 +03:00
if ( WARN_ON ( ! wq_online ) )
return false ;
2019-01-23 03:44:12 +03:00
if ( WARN_ON ( ! work - > func ) )
return false ;
2024-03-25 20:21:03 +03:00
if ( ! start_flush_work ( work , & barr , from_cancel ) )
2018-08-22 12:49:03 +03:00
return false ;
2024-03-25 20:21:03 +03:00
/*
* start_flush_work ( ) returned % true . If @ from_cancel is set , we know
* that @ work must have been executing during start_flush_work ( ) and
* can ' t currently be queued . Its data must contain OFFQ bits . If @ work
* was queued on a BH workqueue , we also know that it was running in the
* BH context and thus can be busy - waited .
*/
data = * work_data_bits ( work ) ;
if ( from_cancel & &
! WARN_ON_ONCE ( data & WORK_STRUCT_PWQ ) & & ( data & WORK_OFFQ_BH ) ) {
/*
* On RT , prevent a livelock when % current has preempted soft
* interrupt processing or is keeping ksoftirqd from running , by
* repeatedly flipping BH . If the BH work item runs on a different
* CPU , this has no effect other than doing the BH
* disable / enable dance for nothing . This is copied from
* kernel / softirq . c : : tasklet_unlock_spin_wait ( ) .
*/
while ( ! try_wait_for_completion ( & barr . done ) ) {
if ( IS_ENABLED ( CONFIG_PREEMPT_RT ) ) {
local_bh_disable ( ) ;
local_bh_enable ( ) ;
} else {
cpu_relax ( ) ;
}
}
} else {
wait_for_completion ( & barr . done ) ;
2018-08-22 12:49:03 +03:00
}
2024-03-25 20:21:03 +03:00
destroy_work_on_stack ( & barr . work ) ;
return true ;
2018-08-22 12:49:03 +03:00
}
2010-09-16 12:42:16 +04:00
/**
* flush_work - wait for a work to finish executing the last queueing instance
* @ work : the work to flush
*
2012-08-21 01:51:23 +04:00
* Wait until @ work has finished execution . @ work is guaranteed to be idle
* on return if it hasn ' t been requeued since flush started .
2010-09-16 12:42:16 +04:00
*
2013-08-01 01:59:24 +04:00
* Return :
2010-09-16 12:42:16 +04:00
* % true if flush_work ( ) waited for the work to finish execution ,
* % false if it was already idle .
*/
bool flush_work ( struct work_struct * work )
{
2024-03-25 20:21:03 +03:00
might_sleep ( ) ;
2018-08-22 12:49:03 +03:00
return __flush_work ( work , false ) ;
make cancel_rearming_delayed_work() reliable
Thanks to Jarek Poplawski for the ideas and for spotting the bug in the
initial draft patch.
cancel_rearming_delayed_work() currently has many limitations, because it
requires that dwork always re-arms itself via queue_delayed_work(). So it
hangs forever if dwork doesn't do this, or cancel_rearming_delayed_work/
cancel_delayed_work was already called. It uses flush_workqueue() in a
loop, so it can't be used if the workqueue was frozen, and it is potentially
live-lockable on a busy system if the delay is small.
With this patch cancel_rearming_delayed_work() doesn't make any assumptions
about dwork, it can re-arm itself via queue_delayed_work(), or
queue_work(), or do nothing.
As a "side effect", cancel_work_sync() was changed to handle re-arming works
as well.
Disadvantages:
- this patch adds wmb() to insert_work().
- slows down the fast path (when del_timer() succeeds on entry) of
cancel_rearming_delayed_work(), because wait_on_work() is called
unconditionally. In that case, compared to the old version, we are
doing "unneeded" lock/unlock for each online CPU.
On the other hand, this means we don't need to use cancel_work_sync()
after cancel_rearming_delayed_work().
- complicates the code (.text grows by 130 bytes).
[akpm@linux-foundation.org: fix speling]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: David Chinner <dgc@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Acked-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 13:34:46 +04:00
}
2012-08-21 01:51:23 +04:00
EXPORT_SYMBOL_GPL ( flush_work ) ;
2007-05-09 13:34:46 +04:00
2024-02-21 08:36:14 +03:00
/**
* flush_delayed_work - wait for a dwork to finish executing the last queueing
* @ dwork : the delayed work to flush
*
* Delayed timer is cancelled and the pending work is queued for
* immediate execution . Like flush_work ( ) , this function only
* considers the last queueing instance of @ dwork .
*
* Return :
* % true if flush_work ( ) waited for the work to finish execution ,
* % false if it was already idle .
*/
bool flush_delayed_work ( struct delayed_work * dwork )
{
local_irq_disable ( ) ;
if ( del_timer_sync ( & dwork - > timer ) )
__queue_work ( dwork - > cpu , dwork - > wq , & dwork - > work ) ;
local_irq_enable ( ) ;
return flush_work ( & dwork - > work ) ;
}
EXPORT_SYMBOL ( flush_delayed_work ) ;
/**
* flush_rcu_work - wait for a rwork to finish executing the last queueing
* @ rwork : the rcu work to flush
*
* Return :
* % true if flush_rcu_work ( ) waited for the work to finish execution ,
* % false if it was already idle .
*/
bool flush_rcu_work ( struct rcu_work * rwork )
{
if ( test_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( & rwork - > work ) ) ) {
rcu_barrier ( ) ;
flush_work ( & rwork - > work ) ;
return true ;
} else {
return flush_work ( & rwork - > work ) ;
}
}
EXPORT_SYMBOL ( flush_rcu_work ) ;
2024-03-25 20:21:03 +03:00
static void work_offqd_disable ( struct work_offq_data * offqd )
{
const unsigned long max = ( 1lu < < WORK_OFFQ_DISABLE_BITS ) - 1 ;
if ( likely ( offqd - > disable < max ) )
offqd - > disable + + ;
else
WARN_ONCE ( true , " workqueue: work disable count overflowed \n " ) ;
}
static void work_offqd_enable ( struct work_offq_data * offqd )
{
if ( likely ( offqd - > disable > 0 ) )
offqd - > disable - - ;
else
WARN_ONCE ( true , " workqueue: work disable count underflowed \n " ) ;
}
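The two helpers above treat the disable depth as a saturating counter: it never grows past the largest value that fits in WORK_OFFQ_DISABLE_BITS and never drops below zero. A minimal userspace sketch of that idea follows. It is only an illustration: `DISABLE_BITS`, `struct offq_model`, `model_disable()` and `model_enable()` are hypothetical names, the bit width is an assumed value, and the kernel's WARN_ONCE() is reduced to a boolean return.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed width for illustration; the kernel uses WORK_OFFQ_DISABLE_BITS. */
#define DISABLE_BITS 4

/* Hypothetical stand-in for the disable field of struct work_offq_data. */
struct offq_model {
	unsigned long disable;
};

/* Raise the depth unless it is already at the maximum; report whether it changed. */
static bool model_disable(struct offq_model *m)
{
	const unsigned long max = (1lu << DISABLE_BITS) - 1;

	assert(m);
	if (m->disable < max) {
		m->disable++;
		return true;
	}
	return false;	/* the kernel helper warns once about overflow here */
}

/* Lower the depth unless it is already zero; report whether it changed. */
static bool model_enable(struct offq_model *m)
{
	assert(m);
	if (m->disable > 0) {
		m->disable--;
		return true;
	}
	return false;	/* the kernel helper warns once about underflow here */
}
```

Saturating instead of wrapping means an unbalanced caller cannot silently turn a disabled item back into an enabled one, or the reverse.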
2024-02-21 08:36:14 +03:00
static bool __cancel_work ( struct work_struct * work , u32 cflags )
2024-02-21 08:36:14 +03:00
{
2024-03-25 20:21:02 +03:00
struct work_offq_data offqd ;
2024-02-21 08:36:14 +03:00
unsigned long irq_flags ;
2024-02-21 08:36:14 +03:00
int ret ;
workqueue: Remove WORK_OFFQ_CANCELING
cancel[_delayed]_work_sync() guarantees that it can shut down
self-requeueing work items. To achieve that, it grabs and then holds
WORK_STRUCT_PENDING bit set while flushing the currently executing instance.
As the PENDING bit is set, all queueing attempts including the
self-requeueing ones fail and once the currently executing instance is
flushed, the work item should be idle as long as someone else isn't actively
queueing it.
This means that the cancel_work_sync path may hold the PENDING bit set while
flushing the target work item. This isn't a problem for the queueing path -
it can just fail which is the desired effect. It doesn't affect flush. It
doesn't matter to cancel_work either as it can just report that the work
item has been successfully canceled. However, if there's another cancel_work_sync
attempt on the work item, it can't simply fail or report success and that
would breach the guarantee that it should provide. cancel_work_sync has to
wait for and grab that PENDING bit and go through the motions.
WORK_OFFQ_CANCELING and wq_cancel_waitq are what implement this
cancel_work_sync to cancel_work_sync wait mechanism. When a work item is
being canceled, WORK_OFFQ_CANCELING is also set on it and other
cancel_work_sync attempts wait on the bit to be cleared using the wait
queue.
While this works, it's an isolated wart which doesn't jive with the rest of
flush and cancel mechanisms and forces enable_work() and disable_work() to
require a sleepable context, which hampers their usability.
Now that a work item can be disabled, we can use that to block queueing
while cancel_work_sync is in progress. Instead of holding the PENDING bit,
it can temporarily disable the work item, flush and then re-enable it as
that'd achieve the same end result of blocking queueings while canceling and
thus enable canceling of self-requeueing work items.
- WORK_OFFQ_CANCELING and the surrounding mechanisms are removed.
- work_grab_pending() is now simpler, no longer has to wait for a blocking
operation and thus can be called from any context.
- With work_grab_pending() simplified, no need to use try_to_grab_pending()
directly. All users are converted to use work_grab_pending().
- __cancel_work_sync() is updated to __cancel_work() with
WORK_CANCEL_DISABLE to cancel and plug racing queueing attempts. It then
flushes and re-enables the work item if necessary.
- These changes allow disable_work() and enable_work() to be called from any
context.
v2: Lai pointed out that mod_delayed_work_on() needs to check the disable
count before queueing the delayed work item. Added
clear_pending_if_disabled() call.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25 20:21:03 +03:00
ret = work_grab_pending ( work , cflags , & irq_flags ) ;
2024-02-21 08:36:14 +03:00
2024-03-25 20:21:02 +03:00
work_offqd_unpack ( & offqd , * work_data_bits ( work ) ) ;
2024-02-21 08:36:14 +03:00
2024-03-25 20:21:03 +03:00
if ( cflags & WORK_CANCEL_DISABLE )
work_offqd_disable ( & offqd ) ;
2024-03-25 20:21:02 +03:00
set_work_pool_and_clear_pending ( work , offqd . pool_id ,
work_offqd_pack_flags ( & offqd ) ) ;
2024-02-21 08:36:14 +03:00
local_irq_restore ( irq_flags ) ;
2024-02-21 08:36:14 +03:00
return ret ;
}
2024-02-21 08:36:14 +03:00
static bool __cancel_work_sync ( struct work_struct * work , u32 cflags )
2007-07-16 10:41:44 +04:00
{
2024-02-21 08:36:14 +03:00
bool ret ;
2007-07-16 10:41:44 +04:00
2024-03-25 20:21:03 +03:00
ret = __cancel_work ( work , cflags | WORK_CANCEL_DISABLE ) ;
2012-08-03 21:30:46 +04:00
2024-03-25 20:21:03 +03:00
if ( * work_data_bits ( work ) & WORK_OFFQ_BH )
WARN_ON_ONCE ( in_hardirq ( ) ) ;
else
might_sleep ( ) ;
2012-08-03 21:30:46 +04:00
2016-09-16 22:49:32 +03:00
/*
2024-02-21 08:36:13 +03:00
* Skip __flush_work ( ) during early boot when we know that @ work isn ' t
* executing . This allows canceling during early boot .
2016-09-16 22:49:32 +03:00
*/
if ( wq_online )
2018-08-22 12:49:03 +03:00
__flush_work ( work , true ) ;
2016-09-16 22:49:32 +03:00
2024-03-25 20:21:03 +03:00
if ( ! ( cflags & WORK_CANCEL_DISABLE ) )
enable_work ( work ) ;
2015-03-05 16:04:13 +03:00
2007-07-16 10:41:44 +04:00
return ret ;
}
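The sequence in __cancel_work_sync ( ) is: cancel with WORK_CANCEL_DISABLE, flush, then re-enable unless the caller asked for the disable to persist. It can be illustrated with a single-threaded userspace model. This is a sketch under heavy assumptions, not the kernel implementation: `struct work_model`, `MODEL_CANCEL_DISABLE` and the `model_*()` functions are hypothetical, and locking, worker pools and real concurrency are left out. It only shows why a queueing attempt made while the cancel is in progress fails.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_CANCEL_DISABLE 0x1u	/* hypothetical analogue of WORK_CANCEL_DISABLE */

/* Hypothetical, much simplified work item state. */
struct work_model {
	bool pending;		/* queued but not yet started */
	unsigned int disable;	/* disable depth; queueing fails while non-zero */
};

/* Queue the item; fails if it is disabled or already pending. */
static bool model_queue(struct work_model *w)
{
	assert(w);
	if (w->disable > 0 || w->pending)
		return false;
	w->pending = true;
	return true;
}

/* Analogue of __cancel_work(): clear pending and optionally raise the disable depth. */
static bool model_cancel(struct work_model *w, unsigned int cflags)
{
	bool was_pending;

	assert(w);
	was_pending = w->pending;
	w->pending = false;
	if (cflags & MODEL_CANCEL_DISABLE)
		w->disable++;
	return was_pending;
}

/*
 * Analogue of __cancel_work_sync(). The "flush" step is modelled as one
 * self-requeue attempt by the running instance; its outcome is stored in
 * *requeue_result so the caller can observe that it was rejected.
 */
static bool model_cancel_sync(struct work_model *w, unsigned int cflags,
			      bool *requeue_result)
{
	bool ret = model_cancel(w, cflags | MODEL_CANCEL_DISABLE);

	/* The item is disabled for the whole flush, so requeueing fails. */
	*requeue_result = model_queue(w);

	/* Drop the temporary disable unless the caller wants it to persist. */
	if (!(cflags & MODEL_CANCEL_DISABLE))
		w->disable--;
	return ret;
}
```

In this model, calling `model_cancel_sync()` with `cflags` of 0 leaves the item idle and enabled, while passing `MODEL_CANCEL_DISABLE` leaves the disable depth raised so later queueing attempts keep failing until it is lowered again.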
2024-02-21 08:36:14 +03:00
/*
* See cancel_delayed_work ( )
*/
bool cancel_work ( struct work_struct * work )
{
2024-02-21 08:36:14 +03:00
return __cancel_work ( work , 0 ) ;
2024-02-21 08:36:14 +03:00
}
EXPORT_SYMBOL ( cancel_work ) ;
2007-05-09 13:34:46 +04:00
/**
2010-09-16 12:36:00 +04:00
* cancel_work_sync - cancel a work and wait for it to finish
* @ work : the work to cancel
2007-05-09 13:34:46 +04:00
*
2024-03-25 20:21:03 +03:00
* Cancel @ work and wait for its execution to finish . This function can be used
* even if the work re - queues itself or migrates to another workqueue . On return
* from this function , @ work is guaranteed to be not pending or executing on any
* CPU as long as there aren ' t racing enqueues .
2007-07-16 10:41:44 +04:00
*
2024-03-25 20:21:03 +03:00
* cancel_work_sync ( & delayed_work - > work ) must not be used for delayed_work ' s .
* Use cancel_delayed_work_sync ( ) instead .
make cancel_rearming_delayed_work() reliable
Thanks to Jarek Poplawski for the ideas and for spotting the bug in the
initial draft patch.
cancel_rearming_delayed_work() currently has many limitations, because it
requires that dwork always re-arms itself via queue_delayed_work(). So it
hangs forever if dwork doesn't do this, or cancel_rearming_delayed_work/
cancel_delayed_work was already called. It uses flush_workqueue() in a
loop, so it can't be used if workqueue was freezed, and it is potentially
live- lockable on busy system if delay is small.
With this patch cancel_rearming_delayed_work() doesn't make any assumptions
about dwork, it can re-arm itself via queue_delayed_work(), or
queue_work(), or do nothing.
As a "side effect", cancel_work_sync() was changed to handle re-arming works
as well.
Disadvantages:
- this patch adds wmb() to insert_work().
- slows down the fast path (when del_timer() succeeds on entry) of
cancel_rearming_delayed_work(), because wait_on_work() is called
unconditionally. In that case, compared to the old version, we are
doing "unneeded" lock/unlock for each online CPU.
On the other hand, this means we don't need to use cancel_work_sync()
after cancel_rearming_delayed_work().
- complicates the code (.text grows by 130 bytes).
[akpm@linux-foundation.org: fix speling]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: David Chinner <dgc@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Acked-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 13:34:46 +04:00
*
2024-03-25 20:21:03 +03:00
* Must be called from a sleepable context if @ work was last queued on a non - BH
* workqueue . Can also be called from non - hardirq atomic contexts including BH
* if @ work was last queued on a BH workqueue .
2010-09-16 12:36:00 +04:00
*
2024-03-25 20:21:03 +03:00
* Returns % true if @ work was pending , % false otherwise .
2007-05-09 13:34:46 +04:00
*/
2010-09-16 12:36:00 +04:00
bool cancel_work_sync ( struct work_struct * work )
2007-05-09 13:34:46 +04:00
{
2024-02-21 08:36:14 +03:00
return __cancel_work_sync ( work , 0 ) ;
implement flush_work()
A basic problem with flush_scheduled_work() is that it blocks behind _all_
presently-queued works, rather than just the work which the caller wants to
flush. If the caller holds some lock, and if one of the queued works happens
to want that lock as well then accidental deadlocks can occur.
One example of this is the phy layer: it wants to flush work while holding
rtnl_lock(). But if a linkwatch event happens to be queued, the phy code will
deadlock because the linkwatch callback function takes rtnl_lock.
So we implement a new function which will flush a *single* work - just the one
which the caller wants to free up. Thus we avoid the accidental deadlocks
which can arise from unrelated subsystems' callbacks taking shared locks.
flush_work() non-blockingly dequeues the work_struct which we want to kill,
then it waits for its handler to complete on all CPUs.
Add ->current_work to the "struct cpu_workqueue_struct", it points to
currently running "struct work_struct". When flush_work(work) detects
->current_work == work, it inserts a barrier at the _head_ of ->worklist
(and thus right _after_ that work) and waits for completion. This means
that the next work fired on that CPU will be this barrier, or another
barrier queued by concurrent flush_work(), so the caller of flush_work()
will be woken before any "regular" work has a chance to run.
When wait_on_work() unlocks workqueue_mutex (or whatever we choose to protect
against CPU hotplug), CPU may go away. But in that case take_over_work() will
move a barrier we queued to another CPU, it will be fired sometime, and
wait_on_work() will be woken.
Actually, we are doing cleanup_workqueue_thread()->kthread_stop() before
take_over_work(), so cwq->thread should complete its ->worklist (and thus
the barrier), because currently we don't check kthread_should_stop() in
run_workqueue(). But even if we did, everything should be ok.
[akpm@osdl.org: cleanup]
[akpm@osdl.org: add flush_work_keventd() wrapper]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 13:33:52 +04:00
}
2007-05-09 13:34:22 +04:00
EXPORT_SYMBOL_GPL ( cancel_work_sync ) ;
2007-05-09 13:33:52 +04:00
2010-09-16 12:48:29 +04:00
/**
2012-08-22 00:18:24 +04:00
* cancel_delayed_work - cancel a delayed work
* @ dwork : delayed_work to cancel
2010-09-16 12:48:29 +04:00
*
2013-08-01 01:59:24 +04:00
* Kill off a pending delayed_work .
*
* Return : % true if @ dwork was pending and canceled ; % false if it wasn ' t
* pending .
*
* Note :
* The work callback function may still be running on return , unless
* this function returns % true and the work doesn ' t re - arm itself . Explicitly
* flush or use cancel_delayed_work_sync ( ) to wait on it .
2010-09-16 12:48:29 +04:00
*
2012-08-22 00:18:24 +04:00
* This function is safe to call from any context including IRQ handler .
2010-09-16 12:48:29 +04:00
*/
2012-08-22 00:18:24 +04:00
bool cancel_delayed_work ( struct delayed_work * dwork )
2010-09-16 12:48:29 +04:00
{
2024-02-21 08:36:14 +03:00
return __cancel_work ( & dwork - > work , WORK_CANCEL_DELAYED ) ;
2010-09-16 12:48:29 +04:00
}
2012-08-22 00:18:24 +04:00
EXPORT_SYMBOL ( cancel_delayed_work ) ;
2010-09-16 12:48:29 +04:00
2010-09-16 12:36:00 +04:00
/**
* cancel_delayed_work_sync - cancel a delayed work and wait for it to finish
* @ dwork : the delayed work to cancel
*
* This is cancel_work_sync ( ) for delayed works .
*
2013-08-01 01:59:24 +04:00
* Return :
2010-09-16 12:36:00 +04:00
* % true if @ dwork was pending , % false otherwise .
*/
bool cancel_delayed_work_sync ( struct delayed_work * dwork )
2007-05-09 13:34:46 +04:00
{
2024-02-21 08:36:14 +03:00
return __cancel_work_sync ( & dwork - > work , WORK_CANCEL_DELAYED ) ;
2007-05-09 13:34:46 +04:00
}
2007-07-16 10:41:44 +04:00
EXPORT_SYMBOL ( cancel_delayed_work_sync ) ;
2005-04-17 02:20:36 +04:00
2024-03-25 20:21:03 +03:00
/**
* disable_work - Disable and cancel a work item
* @ work : work item to disable
*
* Disable @ work by incrementing its disable count and cancel it if currently
* pending . As long as the disable count is non - zero , any attempt to queue @ work
* will fail and return % false . The maximum supported disable depth is 2 to the
* power of % WORK_OFFQ_DISABLE_BITS , currently 65536.
*
workqueue: Remove WORK_OFFQ_CANCELING
cancel[_delayed]_work_sync() guarantees that it can shut down
self-requeueing work items. To achieve that, it grabs and then holds
WORK_STRUCT_PENDING bit set while flushing the currently executing instance.
As the PENDING bit is set, all queueing attempts including the
self-requeueing ones fail and once the currently executing instance is
flushed, the work item should be idle as long as someone else isn't actively
queueing it.
This means that the cancel_work_sync path may hold the PENDING bit set while
flushing the target work item. This isn't a problem for the queueing path -
it can just fail which is the desired effect. It doesn't affect flush. It
doesn't matter to cancel_work either as it can just report that the work
item has successfully canceled. However, if there's another cancel_work_sync
attempt on the work item, it can't simply fail or report success and that
would breach the guarantee that it should provide. cancel_work_sync has to
wait for and grab that PENDING bit and go through the motions.
WORK_OFFQ_CANCELING and wq_cancel_waitq are what implement this
cancel_work_sync to cancel_work_sync wait mechanism. When a work item is
being canceled, WORK_OFFQ_CANCELING is also set on it and other
cancel_work_sync attempts wait on the bit to be cleared using the wait
queue.
While this works, it's an isolated wart which doesn't jive with the rest of
flush and cancel mechanisms and forces enable_work() and disable_work() to
require a sleepable context, which hampers their usability.
Now that a work item can be disabled, we can use that to block queueing
while cancel_work_sync is in progress. Instead of holding the PENDING bit,
it can temporarily disable the work item, flush and then re-enable it as
that'd achieve the same end result of blocking queueings while canceling and
thus enable canceling of self-requeueing work items.
- WORK_OFFQ_CANCELING and the surrounding mechanisms are removed.
- work_grab_pending() is now simpler, no longer has to wait for a blocking
operation and thus can be called from any context.
- With work_grab_pending() simplified, no need to use try_to_grab_pending()
directly. All users are converted to use work_grab_pending().
- __cancel_work_sync() is updated to __cancel_work() with
WORK_CANCEL_DISABLE to cancel and plug racing queueing attempts. It then
flushes and re-enables the work item if necessary.
- These changes allow disable_work() and enable_work() to be called from any
context.
v2: Lai pointed out that mod_delayed_work_on() needs to check the disable
count before queueing the delayed work item. Added
clear_pending_if_disabled() call.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-03-25 20:21:03 +03:00
* Can be called from any context . Returns % true if @ work was pending , % false
* otherwise .
2024-03-25 20:21:03 +03:00
*/
bool disable_work ( struct work_struct * work )
{
return __cancel_work ( work , WORK_CANCEL_DISABLE ) ;
}
EXPORT_SYMBOL_GPL ( disable_work ) ;
/**
* disable_work_sync - Disable , cancel and drain a work item
* @ work : work item to disable
*
* Similar to disable_work ( ) but also wait for @ work to finish if currently
* executing .
*
2024-03-25 20:21:03 +03:00
* Must be called from a sleepable context if @ work was last queued on a non - BH
* workqueue . Can also be called from non - hardirq atomic contexts including BH
* if @ work was last queued on a BH workqueue .
*
* Returns % true if @ work was pending , % false otherwise .
2024-03-25 20:21:03 +03:00
*/
bool disable_work_sync ( struct work_struct * work )
{
return __cancel_work_sync ( work , WORK_CANCEL_DISABLE ) ;
}
EXPORT_SYMBOL_GPL ( disable_work_sync ) ;
/**
* enable_work - Enable a work item
* @ work : work item to enable
*
* Undo disable_work [ _sync ] ( ) by decrementing @ work ' s disable count . @ work can
* only be queued if its disable count is 0.
*
2024-03-25 20:21:03 +03:00
* Can be called from any context . Returns % true if the disable count reached 0.
* Otherwise , % false .
2024-03-25 20:21:03 +03:00
*/
bool enable_work ( struct work_struct * work )
{
struct work_offq_data offqd ;
unsigned long irq_flags ;
work_grab_pending ( work , 0 , & irq_flags ) ;
work_offqd_unpack ( & offqd , * work_data_bits ( work ) ) ;
work_offqd_enable ( & offqd ) ;
set_work_pool_and_clear_pending ( work , offqd . pool_id ,
work_offqd_pack_flags ( & offqd ) ) ;
local_irq_restore ( irq_flags ) ;
return ! offqd . disable ;
}
EXPORT_SYMBOL_GPL ( enable_work ) ;
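The disable-count rules documented above for disable_work() and enable_work() can be illustrated with a small userspace model. This is only a sketch: struct model_work and the model_*() helpers are hypothetical names invented for illustration, and the model ignores locking, the off-queue data packing and the depth limit that the real code deals with.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_work {
	unsigned int disable;	/* disable depth, 0 means enabled */
	bool pending;		/* queued but not yet started */
};

/* Queueing succeeds only while enabled and not already pending. */
static bool model_queue(struct model_work *w)
{
	if (w->disable || w->pending)
		return false;
	w->pending = true;
	return true;
}

/* Bump the disable depth and cancel; report whether it was pending. */
static bool model_disable(struct model_work *w)
{
	bool was_pending = w->pending;

	w->disable++;
	w->pending = false;
	return was_pending;
}

/* Drop one disable level; true only when the depth reaches 0. */
static bool model_enable(struct model_work *w)
{
	assert(w->disable > 0);	/* an unbalanced enable is a caller bug */
	w->disable--;
	return w->disable == 0;
}
```

In this model, two nested model_disable() calls need two model_enable() calls before model_queue() succeeds again, which mirrors the nesting behaviour described in the comments above.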
/**
* disable_delayed_work - Disable and cancel a delayed work item
* @ dwork : delayed work item to disable
*
* disable_work ( ) for delayed work items .
*/
bool disable_delayed_work ( struct delayed_work * dwork )
{
return __cancel_work ( & dwork - > work ,
WORK_CANCEL_DELAYED | WORK_CANCEL_DISABLE ) ;
}
EXPORT_SYMBOL_GPL ( disable_delayed_work ) ;
/**
* disable_delayed_work_sync - Disable , cancel and drain a delayed work item
* @ dwork : delayed work item to disable
*
* disable_work_sync ( ) for delayed work items .
*/
bool disable_delayed_work_sync ( struct delayed_work * dwork )
{
return __cancel_work_sync ( & dwork - > work ,
WORK_CANCEL_DELAYED | WORK_CANCEL_DISABLE ) ;
}
EXPORT_SYMBOL_GPL ( disable_delayed_work_sync ) ;
/**
* enable_delayed_work - Enable a delayed work item
* @ dwork : delayed work item to enable
*
* enable_work ( ) for delayed work items .
*/
bool enable_delayed_work ( struct delayed_work * dwork )
{
return enable_work ( & dwork - > work ) ;
}
EXPORT_SYMBOL_GPL ( enable_delayed_work ) ;
2006-06-25 16:47:49 +04:00
/**
2010-10-19 13:14:49 +04:00
* schedule_on_each_cpu - execute a function synchronously on each online CPU
2006-06-25 16:47:49 +04:00
* @ func : the function to call
*
2010-10-19 13:14:49 +04:00
* schedule_on_each_cpu ( ) executes @ func on each online CPU using the
* system workqueue and blocks until all CPUs have completed .
2006-06-25 16:47:49 +04:00
* schedule_on_each_cpu ( ) is very slow .
2010-10-19 13:14:49 +04:00
*
2013-08-01 01:59:24 +04:00
* Return :
2010-10-19 13:14:49 +04:00
* 0 on success , - errno on failure .
2006-06-25 16:47:49 +04:00
*/
2006-11-22 17:55:48 +03:00
int schedule_on_each_cpu ( work_func_t func )
2006-01-08 12:00:43 +03:00
{
int cpu ;
2010-08-08 16:24:09 +04:00
struct work_struct __percpu * works ;
2006-01-08 12:00:43 +03:00
2006-06-25 16:47:49 +04:00
works = alloc_percpu ( struct work_struct ) ;
if ( ! works )
2006-01-08 12:00:43 +03:00
return - ENOMEM ;
2006-06-25 16:47:49 +04:00
2021-08-03 17:16:20 +03:00
cpus_read_lock ( ) ;
2009-11-18 01:06:20 +03:00
2006-01-08 12:00:43 +03:00
for_each_online_cpu ( cpu ) {
2006-12-18 22:05:09 +03:00
struct work_struct * work = per_cpu_ptr ( works , cpu ) ;
INIT_WORK ( work , func ) ;
2010-06-29 12:07:14 +04:00
schedule_work_on ( cpu , work ) ;
2009-10-14 08:22:47 +04:00
}
2009-11-18 01:06:20 +03:00
for_each_online_cpu ( cpu )
flush_work ( per_cpu_ptr ( works , cpu ) ) ;
2021-08-03 17:16:20 +03:00
cpus_read_unlock ( ) ;
2006-06-25 16:47:49 +04:00
free_percpu ( works ) ;
2006-01-08 12:00:43 +03:00
return 0 ;
}
2006-02-23 21:43:43 +03:00
/**
* execute_in_process_context - reliably execute the routine with user context
* @ fn : the function to execute
* @ ew : guaranteed storage for the execute work structure ( must
* be available when the work executes )
*
* Executes the function immediately if process context is available ,
* otherwise schedules the function for delayed execution .
*
2013-08-01 01:59:24 +04:00
* Return : 0 - function was executed
2006-02-23 21:43:43 +03:00
* 1 - function was scheduled for execution
*/
2006-11-22 17:55:48 +03:00
int execute_in_process_context ( work_func_t fn , struct execute_work * ew )
2006-02-23 21:43:43 +03:00
{
if ( ! in_interrupt ( ) ) {
2006-11-22 17:55:48 +03:00
fn ( & ew - > work ) ;
2006-02-23 21:43:43 +03:00
return 0 ;
}
2006-11-22 17:55:48 +03:00
INIT_WORK ( & ew - > work , fn ) ;
2006-02-23 21:43:43 +03:00
schedule_work ( & ew - > work ) ;
return 1 ;
}
EXPORT_SYMBOL_GPL ( execute_in_process_context ) ;
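The "run now if possible, otherwise defer" decision made by execute_in_process_context() can be sketched in userspace as follows. Everything here is an assumption made for illustration: model_in_interrupt stands in for in_interrupt(), a single-entry slot stands in for the system workqueue, and the model_*() names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*model_fn_t)(void *arg);

/* Hypothetical single-entry stand-in for a workqueue. */
struct model_deferred {
	model_fn_t fn;
	void *arg;
};

static int model_in_interrupt;			/* stand-in for in_interrupt() */
static struct model_deferred model_slot;	/* holds one deferred call */

/* Returns 0 if fn ran immediately, 1 if it was deferred. */
static int model_execute(model_fn_t fn, void *arg)
{
	if (!model_in_interrupt) {
		fn(arg);
		return 0;
	}
	model_slot.fn = fn;
	model_slot.arg = arg;
	return 1;
}

/* Plays the role of the worker that later runs the deferred call. */
static void model_run_deferred(void)
{
	if (model_slot.fn) {
		model_fn_t fn = model_slot.fn;

		model_slot.fn = NULL;
		fn(model_slot.arg);
	}
}

/* Example callback: increments the int that arg points to. */
static void model_bump(void *arg)
{
	++*(int *)arg;
}
```

As in the function above, the caller provides the storage for the deferred call, so that storage must stay valid until the deferred function has run.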
2015-04-02 14:14:39 +03:00
/**
* free_workqueue_attrs - free a workqueue_attrs
* @ attrs : workqueue_attrs to free
2013-03-12 22:30:05 +04:00
*
2015-04-02 14:14:39 +03:00
* Undo alloc_workqueue_attrs ( ) .
2013-03-12 22:30:05 +04:00
*/
2019-09-06 04:40:22 +03:00
void free_workqueue_attrs ( struct workqueue_attrs * attrs )
2013-03-12 22:30:05 +04:00
{
2015-04-02 14:14:39 +03:00
if ( attrs ) {
free_cpumask_var ( attrs - > cpumask ) ;
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, worker_pools need to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-08 04:57:25 +03:00
free_cpumask_var ( attrs - > __pod_cpumask ) ;
2015-04-02 14:14:39 +03:00
kfree ( attrs ) ;
}
2013-03-12 22:30:05 +04:00
}
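Because free_workqueue_attrs() accepts a NULL pointer, an allocator can route every failure to a single cleanup label. The userspace sketch below shows that pattern under stated assumptions: struct model_attrs and the model_*() helpers are hypothetical, calloc()/free() stand in for the kernel allocators, and calling the free helper from the fail label is an assumption, since the real allocator's fail path is not part of this excerpt.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a structure that owns two heap-allocated masks. */
struct model_attrs {
	unsigned long *cpumask;
	unsigned long *pod_cpumask;
};

/* Safe on NULL and on partially constructed objects. */
static void model_free_attrs(struct model_attrs *attrs)
{
	if (attrs) {
		free(attrs->cpumask);		/* free(NULL) is a no-op */
		free(attrs->pod_cpumask);
		free(attrs);
	}
}

/* Zeroed allocation lets every failure share one cleanup label. */
static struct model_attrs *model_alloc_attrs(size_t mask_words)
{
	struct model_attrs *attrs;

	attrs = calloc(1, sizeof(*attrs));
	if (!attrs)
		goto fail;
	attrs->cpumask = calloc(mask_words, sizeof(unsigned long));
	if (!attrs->cpumask)
		goto fail;
	attrs->pod_cpumask = calloc(mask_words, sizeof(unsigned long));
	if (!attrs->pod_cpumask)
		goto fail;
	return attrs;

fail:
	model_free_attrs(attrs);
	return NULL;
}
```

The zeroing allocation matters: members that were never allocated are still NULL when the cleanup helper runs, so it can free them unconditionally.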
2015-04-02 14:14:39 +03:00
/**
* alloc_workqueue_attrs - allocate a workqueue_attrs
*
* Allocate a new workqueue_attrs , initialize with default settings and
* return it .
*
* Return : The allocated new workqueue_attrs on success . % NULL on failure .
*/
2019-09-06 04:40:22 +03:00
struct workqueue_attrs * alloc_workqueue_attrs ( void )
2013-03-12 22:30:05 +04:00
{
2015-04-02 14:14:39 +03:00
struct workqueue_attrs * attrs ;
2013-03-12 22:30:05 +04:00
2019-06-26 17:52:38 +03:00
attrs = kzalloc ( sizeof ( * attrs ) , GFP_KERNEL ) ;
2015-04-02 14:14:39 +03:00
if ( ! attrs )
goto fail ;
2019-06-26 17:52:38 +03:00
if ( ! alloc_cpumask_var ( & attrs - > cpumask , GFP_KERNEL ) )
2015-04-02 14:14:39 +03:00
goto fail ;
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, worker_pools need to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see a very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-08 04:57:25 +03:00
if ( ! alloc_cpumask_var ( & attrs - > __pod_cpumask , GFP_KERNEL ) )
goto fail ;
2015-04-02 14:14:39 +03:00
cpumask_copy ( attrs - > cpumask , cpu_possible_mask ) ;
2023-08-08 04:57:25 +03:00
attrs - > affn_scope = WQ_AFFN_DFL ;
2015-04-02 14:14:39 +03:00
return attrs ;
fail :
free_workqueue_attrs ( attrs ) ;
return NULL ;
2013-03-12 22:30:05 +04:00
}
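The allocation above sends every failure to a single `fail` label and relies on the free routine tolerating a NULL or partially initialized object. The following is a minimal user-space sketch of that pattern, not the kernel code: `attrs_model`, `alloc_attrs_model()`, `free_attrs_model()` and `attrs_model_demo()` are hypothetical names, `calloc()` stands in for `kzalloc()`, and a single 64-bit word stands in for each cpumask.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct workqueue_attrs: each mask is one word. */
struct attrs_model {
	uint64_t *cpumask;
	uint64_t *pod_cpumask;
};

/* Safe on NULL and on partially allocated objects: free(NULL) is a no-op. */
static void free_attrs_model(struct attrs_model *attrs)
{
	if (attrs) {
		free(attrs->cpumask);
		free(attrs->pod_cpumask);
		free(attrs);
	}
}

static struct attrs_model *alloc_attrs_model(void)
{
	struct attrs_model *attrs;

	attrs = calloc(1, sizeof(*attrs));	/* zeroed, like kzalloc() */
	if (!attrs)
		goto fail;
	attrs->cpumask = calloc(1, sizeof(*attrs->cpumask));
	if (!attrs->cpumask)
		goto fail;
	attrs->pod_cpumask = calloc(1, sizeof(*attrs->pod_cpumask));
	if (!attrs->pod_cpumask)
		goto fail;

	*attrs->cpumask = UINT64_MAX;	/* default: every CPU allowed */
	return attrs;

fail:
	/* One exit path covers all three failure points. */
	free_attrs_model(attrs);
	return NULL;
}

/* Usage: allocate, check the default mask, release. */
static void attrs_model_demo(void)
{
	struct attrs_model *attrs = alloc_attrs_model();

	assert(attrs);
	assert(*attrs->cpumask == UINT64_MAX);
	free_attrs_model(attrs);
	free_attrs_model(NULL);	/* also fine */
}
```

Because the object is zero-filled before the first member allocation, the cleanup routine never sees an uninitialized pointer, which is what makes the single label sufficient.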
2015-04-02 14:14:39 +03:00
static void copy_workqueue_attrs ( struct workqueue_attrs * to ,
const struct workqueue_attrs * from )
2013-03-12 22:30:05 +04:00
{
2015-04-02 14:14:39 +03:00
to - > nice = from - > nice ;
cpumask_copy ( to - > cpumask , from - > cpumask ) ;
2023-08-08 04:57:25 +03:00
cpumask_copy ( to - > __pod_cpumask , from - > __pod_cpumask ) ;
workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools,
each serving one L3 cache domain.
While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.
While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.
This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside, i.e. work items start executing within their affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.
After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:
* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
->__pod_cpumask so that the workers are allowed to run on any CPU that
the associated workqueues allow.
* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
the field to a CPU within the pod.
This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.
There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.
While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.
v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-08 04:57:25 +03:00
to - > affn_strict = from - > affn_strict ;
2023-08-08 04:57:24 +03:00
2015-04-02 14:14:39 +03:00
/*
2023-08-08 04:57:24 +03:00
* Unlike hash and equality test , copying shouldn ' t ignore wq - only
* fields as copying is used for both pool and wq attrs . Instead ,
* get_unbound_pool ( ) explicitly clears the fields .
2015-04-02 14:14:39 +03:00
*/
2023-08-08 04:57:24 +03:00
to - > affn_scope = from - > affn_scope ;
2023-08-08 04:57:23 +03:00
to - > ordered = from - > ordered ;
2013-03-12 22:30:05 +04:00
}
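The strict versus non-strict behaviour described in the commit message above can be sketched in user space as follows. This is an illustrative model only: `pool_model`, `pool_allowed_cpus_model()`, `pick_wake_cpu_model()` and `pool_model_demo()` are hypothetical names, each mask is one 64-bit word (bit N means CPU N), and picking the lowest CPU in the pod is an assumption made for the example, since the commit message only says that a CPU within the pod is chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the pool attributes; bit N set means CPU N. */
struct pool_model {
	bool affn_strict;
	uint64_t cpumask;	/* CPUs the associated workqueues allow */
	uint64_t pod_cpumask;	/* the pod's subset of cpumask */
};

/* Strict pools confine workers to the pod; non-strict pools do not. */
static uint64_t pool_allowed_cpus_model(const struct pool_model *pool)
{
	return pool->affn_strict ? pool->pod_cpumask : pool->cpumask;
}

/*
 * Choose the CPU to wake an idle worker on. Assumption: the lowest CPU in
 * the pod is used when the previous CPU lies outside the pod.
 */
static int pick_wake_cpu_model(const struct pool_model *pool, int wake_cpu)
{
	int cpu;

	assert(wake_cpu >= 0 && wake_cpu < 64);
	if (pool->affn_strict || ((pool->pod_cpumask >> wake_cpu) & 1))
		return wake_cpu;
	for (cpu = 0; cpu < 64; cpu++)
		if ((pool->pod_cpumask >> cpu) & 1)
			return cpu;
	return wake_cpu;	/* empty pod mask: leave the hint alone */
}

/* Usage: a non-strict pool for pod {0,1} on a four-CPU machine. */
static void pool_model_demo(void)
{
	struct pool_model soft = { false, 0xF, 0x3 };

	assert(pool_allowed_cpus_model(&soft) == 0xF);	/* may run anywhere */
	assert(pick_wake_cpu_model(&soft, 3) == 0);	/* pulled back into pod */
	assert(pick_wake_cpu_model(&soft, 1) == 1);	/* already inside */
}
```

The two functions separate the hard limit (where a worker may run) from the soft preference (where it starts), which is the distinction the non-strict scope relies on.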
2023-08-08 04:57:24 +03:00
/*
* Some attrs fields are workqueue - only . Clear them for worker_pool ' s . See the
* comments in ' struct workqueue_attrs ' definition .
*/
static void wqattrs_clear_for_pool ( struct workqueue_attrs * attrs )
{
2023-08-08 04:57:24 +03:00
attrs - > affn_scope = WQ_AFFN_NR_TYPES ;
2023-08-08 04:57:24 +03:00
attrs - > ordered = false ;
2024-03-08 12:42:52 +03:00
if ( attrs - > affn_strict )
cpumask_copy ( attrs - > cpumask , cpu_possible_mask ) ;
2023-08-08 04:57:24 +03:00
}
2015-04-02 14:14:39 +03:00
/* hash value of the content of @attrs */
static u32 wqattrs_hash ( const struct workqueue_attrs * attrs )
2013-03-12 22:30:05 +04:00
{
2015-04-02 14:14:39 +03:00
u32 hash = 0 ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
hash = jhash_1word ( attrs - > nice , hash ) ;
2024-03-08 12:42:52 +03:00
hash = jhash_1word ( attrs - > affn_strict , hash ) ;
2023-08-08 04:57:25 +03:00
hash = jhash ( cpumask_bits ( attrs - > __pod_cpumask ) ,
BITS_TO_LONGS ( nr_cpumask_bits ) * sizeof ( long ) , hash ) ;
2024-03-08 12:42:52 +03:00
if ( ! attrs - > affn_strict )
hash = jhash ( cpumask_bits ( attrs - > cpumask ) ,
BITS_TO_LONGS ( nr_cpumask_bits ) * sizeof ( long ) , hash ) ;
2015-04-02 14:14:39 +03:00
return hash ;
2013-03-12 22:30:05 +04:00
}
2015-04-02 14:14:39 +03:00
/* content equality test */
static bool wqattrs_equal ( const struct workqueue_attrs * a ,
const struct workqueue_attrs * b )
2013-03-12 22:30:05 +04:00
{
2015-04-02 14:14:39 +03:00
if ( a - > nice ! = b - > nice )
return false ;
2024-03-08 12:42:52 +03:00
if ( a - > affn_strict ! = b - > affn_strict )
2015-04-02 14:14:39 +03:00
return false ;
2023-08-08 04:57:25 +03:00
if ( ! cpumask_equal ( a - > __pod_cpumask , b - > __pod_cpumask ) )
return false ;
2024-03-08 12:42:52 +03:00
if ( ! a - > affn_strict & & ! cpumask_equal ( a - > cpumask , b - > cpumask ) )
2023-08-08 04:57:25 +03:00
return false ;
2015-04-02 14:14:39 +03:00
return true ;
2013-03-12 22:30:05 +04:00
}
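wqattrs_hash() and wqattrs_equal() above must agree on which fields count: ->cpumask participates only when the affinity is not strict. The sketch below models that rule in user space. All names (`wqattrs_model`, `wqattrs_model_equal()`, `wqattrs_model_hash()`, `mix()`, `wqattrs_model_demo()`) are hypothetical, each mask is a single 64-bit word, and the hash is a simple FNV-style mix chosen for brevity rather than the kernel's jhash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct workqueue_attrs; bit N means CPU N. */
struct wqattrs_model {
	int nice;
	bool affn_strict;
	uint64_t cpumask;	/* CPUs the workqueue allows */
	uint64_t pod_cpumask;	/* the pod's subset of cpumask */
};

/* Same comparison order as wqattrs_equal() above. */
static bool wqattrs_model_equal(const struct wqattrs_model *a,
				const struct wqattrs_model *b)
{
	if (a->nice != b->nice)
		return false;
	if (a->affn_strict != b->affn_strict)
		return false;
	if (a->pod_cpumask != b->pod_cpumask)
		return false;
	/* cpumask only matters when the pod boundary is not strict */
	if (!a->affn_strict && a->cpumask != b->cpumask)
		return false;
	return true;
}

/* FNV-style mixing step; a toy replacement for jhash. */
static uint64_t mix(uint64_t h, uint64_t v)
{
	return (h ^ v) * 1099511628211ULL;
}

/* Hashes exactly the fields the equality test looks at. */
static uint32_t wqattrs_model_hash(const struct wqattrs_model *a)
{
	uint64_t h = 1469598103934665603ULL;

	h = mix(h, (uint64_t)a->nice);
	h = mix(h, a->affn_strict);
	h = mix(h, a->pod_cpumask);
	if (!a->affn_strict)
		h = mix(h, a->cpumask);
	return (uint32_t)(h ^ (h >> 32));
}

/* Usage: cpumasks {0,2} and {0,2,3} that share pod {0}. */
static void wqattrs_model_demo(void)
{
	struct wqattrs_model a = { 0, true, 0x5, 0x1 };
	struct wqattrs_model b = { 0, true, 0xD, 0x1 };

	assert(wqattrs_model_equal(&a, &b));	/* strict: pools can be shared */
	assert(wqattrs_model_hash(&a) == wqattrs_model_hash(&b));
	a.affn_strict = b.affn_strict = false;
	assert(!wqattrs_model_equal(&a, &b));	/* non-strict: they cannot */
}
```

Keeping the hash and the equality test keyed on the same fields is what lets equal attrs land in the same hash bucket during pool lookup.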
2023-08-08 04:57:24 +03:00
/* Update @attrs with actually available CPUs */
static void wqattrs_actualize_cpumask ( struct workqueue_attrs * attrs ,
const cpumask_t * unbound_cpumask )
{
/*
* Calculate the effective CPU mask of @ attrs given @ unbound_cpumask . If
* @ attrs - > cpumask doesn ' t overlap with @ unbound_cpumask , we fall back to
* @ unbound_cpumask .
*/
cpumask_and ( attrs - > cpumask , attrs - > cpumask , unbound_cpumask ) ;
if ( unlikely ( cpumask_empty ( attrs - > cpumask ) ) )
cpumask_copy ( attrs - > cpumask , unbound_cpumask ) ;
}
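The intersect-then-fall-back rule of wqattrs_actualize_cpumask() can be shown with plain integers. This is a small sketch under the assumption that each mask fits in one 64-bit word (bit N means CPU N); `actualize_cpumask_model()` and its demo are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Restrict @requested to the CPUs in @unbound. If nothing is left, fall
 * back to @unbound so the result is never empty (given a non-empty @unbound).
 */
static uint64_t actualize_cpumask_model(uint64_t requested, uint64_t unbound)
{
	uint64_t effective = requested & unbound;

	return effective ? effective : unbound;
}

/* Usage: unbound work is limited to CPUs 1 and 2 (mask 0x6). */
static void actualize_cpumask_demo(void)
{
	/* Request CPUs 0 and 2: only CPU 2 survives the intersection. */
	assert(actualize_cpumask_model(0x5, 0x6) == 0x4);
	/* Request CPU 0 only: no overlap, so the unbound mask is used. */
	assert(actualize_cpumask_model(0x1, 0x6) == 0x6);
}
```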
2023-08-08 04:57:24 +03:00
/* find wq_pod_type to use for @attrs */
static const struct wq_pod_type *
wqattrs_pod_type ( const struct workqueue_attrs * attrs )
{
2023-08-08 04:57:25 +03:00
enum wq_affn_scope scope ;
struct wq_pod_type * pt ;
/* to synchronize access to wq_affn_dfl */
lockdep_assert_held ( & wq_pool_mutex ) ;
if ( attrs - > affn_scope = = WQ_AFFN_DFL )
scope = wq_affn_dfl ;
else
scope = attrs - > affn_scope ;
pt = & wq_pod_types [ scope ] ;
2023-08-08 04:57:24 +03:00
if ( ! WARN_ON_ONCE ( attrs - > affn_scope = = WQ_AFFN_NR_TYPES ) & &
likely ( pt - > nr_pods ) )
return pt ;
/*
* Before workqueue_init_topology ( ) , only SYSTEM is available which is
* initialized in workqueue_init_early ( ) .
*/
pt = & wq_pod_types [ WQ_AFFN_SYSTEM ] ;
BUG_ON ( ! pt - > nr_pods ) ;
return pt ;
}
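The selection logic in wqattrs_pod_type() resolves the default scope first and falls back to the system-wide scope when the chosen one has no pods yet. A reduced user-space sketch follows. The enum `scope_model`, its members and their order, and `resolve_scope_model()` are hypothetical; the model also falls back silently on the sentinel value where the kernel code above additionally warns.

```c
#include <assert.h>

/* Hypothetical, reduced set of scopes; the order is an assumption. */
enum scope_model {
	SCOPE_DFL,	/* "use the system-wide default" */
	SCOPE_CACHE,
	SCOPE_NUMA,
	SCOPE_SYSTEM,
	SCOPE_NR_TYPES,	/* sentinel, never a valid request */
};

/*
 * Return the scope whose pod table should be used. nr_pods[scope] stays
 * zero until that scope's topology has been initialized.
 */
static enum scope_model resolve_scope_model(enum scope_model requested,
					    enum scope_model dfl,
					    const int nr_pods[SCOPE_NR_TYPES])
{
	enum scope_model scope = requested == SCOPE_DFL ? dfl : requested;

	/* Check the sentinel before indexing so the array is never overrun. */
	if (scope < SCOPE_NR_TYPES && nr_pods[scope])
		return scope;

	/* Early boot or invalid request: only the system scope is usable. */
	assert(nr_pods[SCOPE_SYSTEM]);
	return SCOPE_SYSTEM;
}

/* Usage: before and after topology initialization. */
static void resolve_scope_demo(void)
{
	const int early[SCOPE_NR_TYPES] = { 0, 0, 0, 1 };
	const int late[SCOPE_NR_TYPES] = { 0, 2, 1, 1 };

	assert(resolve_scope_model(SCOPE_DFL, SCOPE_CACHE, early) == SCOPE_SYSTEM);
	assert(resolve_scope_model(SCOPE_DFL, SCOPE_CACHE, late) == SCOPE_CACHE);
}
```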
2015-04-02 14:14:39 +03:00
/**
* init_worker_pool - initialize a newly zalloc ' d worker_pool
* @ pool : worker_pool to initialize
*
2015-05-23 08:08:14 +03:00
* Initialize a newly zalloc ' d @ pool . It also allocates @ pool - > attrs .
2015-04-02 14:14:39 +03:00
*
* Return : 0 on success , - errno on failure . Even on failure , all fields
* inside @ pool proper are initialized and put_unbound_pool ( ) can be called
* on @ pool safely to release it .
*/
static int init_worker_pool ( struct worker_pool * pool )
2013-03-12 22:30:05 +04:00
{
2020-05-27 22:46:33 +03:00
raw_spin_lock_init ( & pool - > lock ) ;
2015-04-02 14:14:39 +03:00
pool - > id = - 1 ;
pool - > cpu = - 1 ;
pool - > node = NUMA_NO_NODE ;
pool - > flags | = POOL_DISASSOCIATED ;
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements workqueue lockup
detector. It periodically monitors all worker_pools and,
if any pool fails to make forward progress for longer than the threshold
duration, triggers a warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 19:28:04 +03:00
pool - > watchdog_ts = jiffies ;
2015-04-02 14:14:39 +03:00
INIT_LIST_HEAD ( & pool - > worklist ) ;
INIT_LIST_HEAD ( & pool - > idle_list ) ;
hash_init ( pool - > busy_hash ) ;
2013-03-12 22:30:05 +04:00
2017-10-17 01:58:25 +03:00
timer_setup ( & pool - > idle_timer , idle_worker_timeout , TIMER_DEFERRABLE ) ;
2023-01-12 19:14:29 +03:00
INIT_WORK ( & pool - > idle_cull_work , idle_cull_fn ) ;
2013-03-12 22:30:05 +04:00
2017-10-17 01:58:25 +03:00
timer_setup ( & pool - > mayday_timer , pool_mayday_timeout , 0 ) ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
INIT_LIST_HEAD ( & pool - > workers ) ;
2023-01-12 19:14:31 +03:00
INIT_LIST_HEAD ( & pool - > dying_workers ) ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
ida_init ( & pool - > worker_ida ) ;
INIT_HLIST_NODE ( & pool - > hash_node ) ;
pool - > refcnt = 1 ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
/* shouldn't fail above this point */
2019-06-26 17:52:38 +03:00
pool - > attrs = alloc_workqueue_attrs ( ) ;
2015-04-02 14:14:39 +03:00
if ( ! pool - > attrs )
return - ENOMEM ;
2023-08-08 04:57:24 +03:00
wqattrs_clear_for_pool ( pool - > attrs ) ;
2015-04-02 14:14:39 +03:00
return 0 ;
2013-03-12 22:30:05 +04:00
}
2019-02-15 02:00:54 +03:00
# ifdef CONFIG_LOCKDEP
static void wq_init_lockdep ( struct workqueue_struct * wq )
{
char * lock_name ;
lockdep_register_key ( & wq - > key ) ;
lock_name = kasprintf ( GFP_KERNEL , " %s%s " , " (wq_completion) " , wq - > name ) ;
if ( ! lock_name )
lock_name = wq - > name ;
2019-03-07 03:27:31 +03:00
wq - > lock_name = lock_name ;
2019-02-15 02:00:54 +03:00
lockdep_init_map ( & wq - > lockdep_map , lock_name , & wq - > key , 0 ) ;
}
static void wq_unregister_lockdep ( struct workqueue_struct * wq )
{
lockdep_unregister_key ( & wq - > key ) ;
}
static void wq_free_lockdep ( struct workqueue_struct * wq )
{
if ( wq - > lock_name ! = wq - > name )
kfree ( wq - > lock_name ) ;
}
# else
static void wq_init_lockdep ( struct workqueue_struct * wq )
{
}
static void wq_unregister_lockdep ( struct workqueue_struct * wq )
{
}
static void wq_free_lockdep ( struct workqueue_struct * wq )
{
}
# endif
2024-01-29 21:11:24 +03:00
static void free_node_nr_active ( struct wq_node_nr_active * * nna_ar )
{
int node ;
for_each_node ( node ) {
kfree ( nna_ar [ node ] ) ;
nna_ar [ node ] = NULL ;
}
kfree ( nna_ar [ nr_node_ids ] ) ;
nna_ar [ nr_node_ids ] = NULL ;
}
static void init_node_nr_active ( struct wq_node_nr_active * nna )
{
workqueue: Avoid premature init of wq->node_nr_active[].max
System workqueues are allocated early during boot from
workqueue_init_early(). While allocating unbound workqueues,
wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
accesses NUMA topology to initialize wq->node_nr_active[].max.
However, topology information may not be set up at this point.
wq_update_node_max_active() is explicitly invoked from
workqueue_init_topology() later when topology information is known to be
available.
This doesn't seem to crash anything but it's doing useless work with dubious
data. Let's skip the premature and duplicate node_max_active updates by
initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
wq_update_node_max_active() noop until workqueue_init_topology().
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/workqueue.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9221a4c57ae1..a65081ec6780 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -386,6 +386,8 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
[WQ_AFFN_SYSTEM] = "system",
};
+static bool wq_topo_initialized = false;
+
/*
* Per-cpu work items which run for longer than the following threshold are
* automatically considered CPU intensive and excluded from concurrency
@@ -1510,6 +1512,9 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)
lockdep_assert_held(&wq->mutex);
+ if (!wq_topo_initialized)
+ return;
+
if (!cpumask_test_cpu(off_cpu, effective))
off_cpu = -1;
@@ -4356,6 +4361,7 @@ static void free_node_nr_active(struct wq_node_nr_active **nna_ar)
static void init_node_nr_active(struct wq_node_nr_active *nna)
{
+ nna->max = WQ_DFL_MIN_ACTIVE;
atomic_set(&nna->nr, 0);
raw_spin_lock_init(&nna->lock);
INIT_LIST_HEAD(&nna->pending_pwqs);
@@ -7400,6 +7406,8 @@ void __init workqueue_init_topology(void)
init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
+ wq_topo_initialized = true;
+
mutex_lock(&wq_pool_mutex);
/*
2024-01-31 08:06:43 +03:00
nna - > max = WQ_DFL_MIN_ACTIVE ;
2024-01-29 21:11:24 +03:00
atomic_set ( & nna - > nr , 0 ) ;
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
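To make the blow-up concrete, here is a minimal user-space sketch (not kernel code). The helper name effective_max_active() and the machine sizes used below are hypothetical; the only thing taken from the text above is that a per-pwq limit is multiplied by the number of pwqs enforcing it independently.

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only: when max_active is enforced
 * independently by each pwq, the worst-case number of concurrently active
 * work items is the per-pwq limit times the number of pwqs.
 */
static unsigned int effective_max_active(unsigned int max_active,
					 unsigned int nr_pwqs)
{
	assert(nr_pwqs > 0);
	return max_active * nr_pwqs;
}
```

For an assumed machine with 2 NUMA nodes and 64 CPUs and a configured max_active of 16, per-node pwqs allow up to effective_max_active(16, 2) == 32 active work items, while per-cpu pwqs allow up to effective_max_active(16, 64) == 1024.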
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time, i.e. the acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice as many
online effective CPUs will get twice the share of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
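The proportional split and the min_active floor described above can be sketched in user-space C as follows. This is an illustration under stated assumptions, not the kernel implementation: node_max_active(), div_round_up() and clamp_int() are hypothetical names, and rounding the share up is an assumption. Only the proportional split, the value 8 for WQ_DFL_MIN_ACTIVE, and "min_active is the smaller of max_active and WQ_DFL_MIN_ACTIVE" come from the text.

```c
#include <assert.h>

#define DFL_MIN_ACTIVE 8	/* value of WQ_DFL_MIN_ACTIVE per the text above */

/* Integer division rounding up (assumed rounding mode for this sketch). */
static int div_round_up(int n, int d)
{
	return (n + d - 1) / d;
}

/* Clamp val into the inclusive range [lo, hi]. */
static int clamp_int(int val, int lo, int hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/*
 * Hypothetical helper: per-node share of max_active, proportional to the
 * node's online effective CPUs, but never below min_active and never above
 * the configured max_active.
 */
static int node_max_active(int max_active, int node_cpus, int total_cpus)
{
	int min_active = max_active < DFL_MIN_ACTIVE ? max_active : DFL_MIN_ACTIVE;

	assert(total_cpus > 0 && node_cpus >= 0 && node_cpus <= total_cpus);
	return clamp_int(div_round_up(max_active * node_cpus, total_cpus),
			 min_active, max_active);
}
```

With max_active 48 and nodes of 16 and 8 effective CPUs, the shares are 32 and 16. With max_active 16 and a node holding 2 of 32 CPUs, the raw share of 1 is raised to the floor of 8, which is how the sum over nodes can exceed the configured limit.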
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
raw_spin_lock_init ( & nna - > lock ) ;
INIT_LIST_HEAD ( & nna - > pending_pwqs ) ;
2024-01-29 21:11:24 +03:00
}
/*
* Each node ' s nr_active counter will be accessed mostly from its own node and
* should be allocated in the node .
*/
static int alloc_node_nr_active ( struct wq_node_nr_active * * nna_ar )
{
struct wq_node_nr_active * nna ;
int node ;
for_each_node ( node ) {
nna = kzalloc_node ( sizeof ( * nna ) , GFP_KERNEL , node ) ;
if ( ! nna )
goto err_free ;
init_node_nr_active ( nna ) ;
nna_ar [ node ] = nna ;
}
/* [nr_node_ids] is used as the fallback */
nna = kzalloc_node ( sizeof ( * nna ) , GFP_KERNEL , NUMA_NO_NODE ) ;
if ( ! nna )
goto err_free ;
init_node_nr_active ( nna ) ;
nna_ar [ nr_node_ids ] = nna ;
return 0 ;
err_free :
free_node_nr_active ( nna_ar ) ;
return - ENOMEM ;
}
2015-04-02 14:14:39 +03:00
static void rcu_free_wq ( struct rcu_head * rcu )
2013-03-12 22:30:05 +04:00
{
2015-04-02 14:14:39 +03:00
struct workqueue_struct * wq =
container_of ( rcu , struct workqueue_struct , rcu ) ;
2013-03-12 22:30:05 +04:00
2024-01-29 21:11:24 +03:00
if ( wq - > flags & WQ_UNBOUND )
free_node_nr_active ( wq - > node_nr_active ) ;
2019-02-15 02:00:54 +03:00
wq_free_lockdep ( wq ) ;
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
free_percpu ( wq - > cpu_pwq ) ;
free_workqueue_attrs ( wq - > unbound_attrs ) ;
2015-04-02 14:14:39 +03:00
kfree ( wq ) ;
2013-03-12 22:30:05 +04:00
}
2015-04-02 14:14:39 +03:00
static void rcu_free_pool ( struct rcu_head * rcu )
2013-03-12 22:30:05 +04:00
{
2015-04-02 14:14:39 +03:00
struct worker_pool * pool = container_of ( rcu , struct worker_pool , rcu ) ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
ida_destroy ( & pool - > worker_ida ) ;
free_workqueue_attrs ( pool - > attrs ) ;
kfree ( pool ) ;
2013-03-12 22:30:05 +04:00
}
2015-04-02 14:14:39 +03:00
/**
* put_unbound_pool - put a worker_pool
* @ pool : worker_pool to put
*
2019-03-13 19:55:47 +03:00
* Put @ pool . If its refcnt reaches zero , it gets destroyed in an RCU
2015-04-02 14:14:39 +03:00
* safe manner . get_unbound_pool ( ) calls this function on its failure path
* and this function should be able to release pools which went through ,
* successfully or not , init_worker_pool ( ) .
*
* Should be called with wq_pool_mutex held .
*/
static void put_unbound_pool ( struct worker_pool * pool )
2013-03-12 22:30:05 +04:00
{
2015-04-02 14:14:39 +03:00
DECLARE_COMPLETION_ONSTACK ( detach_completion ) ;
struct worker * worker ;
2023-08-04 06:22:15 +03:00
LIST_HEAD ( cull_list ) ;
2023-01-12 19:14:31 +03:00
2015-04-02 14:14:39 +03:00
lockdep_assert_held ( & wq_pool_mutex ) ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
if ( - - pool - > refcnt )
return ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
/* sanity checks */
if ( WARN_ON ( ! ( pool - > cpu < 0 ) ) | |
WARN_ON ( ! list_empty ( & pool - > worklist ) ) )
return ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
/* release id and unhash */
if ( pool - > id > = 0 )
idr_remove ( & worker_pool_idr , pool - > id ) ;
hash_del ( & pool - > hash_node ) ;
2013-04-01 22:23:38 +04:00
2015-04-02 14:14:39 +03:00
/*
2017-10-09 18:04:13 +03:00
* Become the manager and destroy all workers . This prevents
* @ pool ' s workers from blocking on attach_mutex . We ' re the last
* manager and @ pool gets freed with the flag set .
2023-01-12 19:14:30 +03:00
*
* Having a concurrent manager is quite unlikely to happen as we can
* only get here with
* pwq - > refcnt = = pool - > refcnt = = 0
* which implies no work queued to the pool , which implies no worker can
* become the manager . However a worker could have taken the role of
* manager before the refcnts dropped to 0 , since maybe_create_worker ( )
* drops pool - > lock
2015-04-02 14:14:39 +03:00
*/
2023-01-12 19:14:30 +03:00
while ( true ) {
rcuwait_wait_event ( & manager_wait ,
! ( pool - > flags & POOL_MANAGER_ACTIVE ) ,
TASK_UNINTERRUPTIBLE ) ;
2023-01-12 19:14:31 +03:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2023-01-12 19:14:30 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
if ( ! ( pool - > flags & POOL_MANAGER_ACTIVE ) ) {
pool - > flags | = POOL_MANAGER_ACTIVE ;
break ;
}
raw_spin_unlock_irq ( & pool - > lock ) ;
2023-01-12 19:14:31 +03:00
mutex_unlock ( & wq_pool_attach_mutex ) ;
2023-01-12 19:14:30 +03:00
}
2017-10-09 18:04:13 +03:00
2015-04-02 14:14:39 +03:00
while ( ( worker = first_idle_worker ( pool ) ) )
2023-01-12 19:14:31 +03:00
set_worker_dying ( worker , & cull_list ) ;
2015-04-02 14:14:39 +03:00
WARN_ON ( pool - > nr_workers | | pool - > nr_idle ) ;
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2013-04-01 22:23:38 +04:00
2023-01-12 19:14:31 +03:00
wake_dying_workers ( & cull_list ) ;
if ( ! list_empty ( & pool - > workers ) | | ! list_empty ( & pool - > dying_workers ) )
2015-04-02 14:14:39 +03:00
pool - > detach_completion = & detach_completion ;
2018-05-18 18:47:13 +03:00
mutex_unlock ( & wq_pool_attach_mutex ) ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
if ( pool - > detach_completion )
wait_for_completion ( pool - > detach_completion ) ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
/* shut down the timers */
del_timer_sync ( & pool - > idle_timer ) ;
2023-01-12 19:14:29 +03:00
cancel_work_sync ( & pool - > idle_cull_work ) ;
2015-04-02 14:14:39 +03:00
del_timer_sync ( & pool - > mayday_timer ) ;
2013-03-12 22:30:05 +04:00
2019-03-13 19:55:47 +03:00
/* RCU protected to allow dereferences from get_work_pool() */
2018-11-07 06:18:45 +03:00
call_rcu ( & pool - > rcu , rcu_free_pool ) ;
2013-03-12 22:30:05 +04:00
}
/**
2015-04-02 14:14:39 +03:00
* get_unbound_pool - get a worker_pool with the specified attributes
* @ attrs : the attributes of the worker_pool to get
2013-03-12 22:30:05 +04:00
*
2015-04-02 14:14:39 +03:00
* Obtain a worker_pool which has the same attributes as @ attrs , bump the
* reference count and return it . If there already is a matching
* worker_pool , it will be used ; otherwise , this function attempts to
* create a new one .
2013-03-12 22:30:05 +04:00
*
2015-04-02 14:14:39 +03:00
* Should be called with wq_pool_mutex held .
2013-03-12 22:30:05 +04:00
*
2015-04-02 14:14:39 +03:00
* Return : On success , a worker_pool with the same attributes as @ attrs .
* On failure , % NULL .
2013-03-12 22:30:05 +04:00
*/
2015-04-02 14:14:39 +03:00
static struct worker_pool * get_unbound_pool ( const struct workqueue_attrs * attrs )
2013-03-12 22:30:05 +04:00
{
2023-08-08 04:57:24 +03:00
struct wq_pod_type * pt = & wq_pod_types [ WQ_AFFN_NUMA ] ;
2015-04-02 14:14:39 +03:00
u32 hash = wqattrs_hash ( attrs ) ;
struct worker_pool * pool ;
2023-08-08 04:57:24 +03:00
int pod , node = NUMA_NO_NODE ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
lockdep_assert_held ( & wq_pool_mutex ) ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
/* do we already have a matching pool? */
hash_for_each_possible ( unbound_pool_hash , pool , hash_node , hash ) {
if ( wqattrs_equal ( pool - > attrs , attrs ) ) {
pool - > refcnt + + ;
return pool ;
}
}
2013-03-12 22:30:05 +04:00
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only on CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, a worker_pool needs to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
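The sharing rule in this example can be sketched in user-space C. This is a simplified stand-in, not the kernel's wqattrs_equal(): struct attrs_sketch, cpu_bit() and attrs_match() are hypothetical names, CPUs are modeled as bits of an unsigned long, and other attributes that the real comparison may consider are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Simplified stand-in for struct workqueue_attrs: one bit per CPU. */
struct attrs_sketch {
	unsigned long cpumask;		/* CPUs allowed by the workqueue */
	unsigned long pod_cpumask;	/* the pod's subset of cpumask */
};

/* Return a mask with only the bit for @cpu set. */
static unsigned long cpu_bit(unsigned int cpu)
{
	assert(cpu < sizeof(unsigned long) * CHAR_BIT);
	return 1UL << cpu;
}

/*
 * Two attrs can share a worker_pool in this sketch only if both masks
 * match; an identical pod subset alone is not sufficient.
 */
static bool attrs_match(const struct attrs_sketch *a,
			const struct attrs_sketch *b)
{
	return a->cpumask == b->cpumask && a->pod_cpumask == b->pod_cpumask;
}
```

For the two workqueues in the example, the pods containing CPU 0 have the same pod_cpumask but cpumasks of {0, 2} and {0, 2, 3}, so attrs_match() returns false and the pools are not shared.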
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-08 04:57:25 +03:00
/* If __pod_cpumask is contained inside a NUMA pod, that's our node */
2023-08-08 04:57:24 +03:00
for ( pod = 0 ; pod < pt - > nr_pods ; pod + + ) {
2023-08-08 04:57:25 +03:00
if ( cpumask_subset ( attrs - > __pod_cpumask , pt - > pod_cpus [ pod ] ) ) {
2023-08-08 04:57:24 +03:00
node = pt - > pod_node [ pod ] ;
break ;
2015-10-09 06:53:12 +03:00
}
}
2015-04-02 14:14:39 +03:00
/* nope, create a new one */
2023-08-08 04:57:24 +03:00
pool = kzalloc_node ( sizeof ( * pool ) , GFP_KERNEL , node ) ;
2015-04-02 14:14:39 +03:00
if ( ! pool | | init_worker_pool ( pool ) < 0 )
goto fail ;
2023-08-08 04:57:24 +03:00
pool - > node = node ;
2023-08-08 04:57:24 +03:00
copy_workqueue_attrs ( pool - > attrs , attrs ) ;
wqattrs_clear_for_pool ( pool - > attrs ) ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
if ( worker_pool_assign_id ( pool ) < 0 )
goto fail ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
/* create and start the initial worker */
2016-09-16 22:49:32 +03:00
if ( wq_online & & ! create_worker ( pool ) )
2015-04-02 14:14:39 +03:00
goto fail ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
/* install */
hash_add ( unbound_pool_hash , & pool - > hash_node , hash ) ;
2013-03-12 22:30:05 +04:00
2015-04-02 14:14:39 +03:00
return pool ;
fail :
if ( pool )
put_unbound_pool ( pool ) ;
return NULL ;
2013-03-12 22:30:05 +04:00
}
2015-04-02 14:14:39 +03:00
static void rcu_free_pwq ( struct rcu_head * rcu )
2013-03-12 22:30:00 +04:00
{
2015-04-02 14:14:39 +03:00
kmem_cache_free ( pwq_cache ,
container_of ( rcu , struct pool_workqueue , rcu ) ) ;
2013-03-12 22:30:00 +04:00
}
2015-04-02 14:14:39 +03:00
/*
2023-08-08 04:57:23 +03:00
* Scheduled on pwq_release_worker by put_pwq ( ) when an unbound pwq hits zero
* refcnt and needs to be destroyed .
2013-03-12 22:30:00 +04:00
*/
2023-08-08 04:57:23 +03:00
static void pwq_release_workfn ( struct kthread_work * work )
2013-03-12 22:30:00 +04:00
{
2015-04-02 14:14:39 +03:00
struct pool_workqueue * pwq = container_of ( work , struct pool_workqueue ,
2023-08-08 04:57:23 +03:00
release_work ) ;
2015-04-02 14:14:39 +03:00
struct workqueue_struct * wq = pwq - > wq ;
struct worker_pool * pool = pwq - > pool ;
2021-07-14 12:19:33 +03:00
bool is_last = false ;
2013-03-12 22:30:00 +04:00
2021-07-14 12:19:33 +03:00
/*
2023-08-08 04:57:23 +03:00
* When @ pwq is not linked , it doesn ' t hold any reference to the
2021-07-14 12:19:33 +03:00
* @ wq , and @ wq is invalid to access .
*/
if ( ! list_empty ( & pwq - > pwqs_node ) ) {
mutex_lock ( & wq - > mutex ) ;
list_del_rcu ( & pwq - > pwqs_node ) ;
is_last = list_empty ( & wq - > pwqs ) ;
2024-02-08 22:12:20 +03:00
/*
* For ordered workqueue with a plugged dfl_pwq , restart it now .
*/
if ( ! is_last & & ( wq - > flags & __WQ_ORDERED ) )
unplug_oldest_pwq ( wq ) ;
2021-07-14 12:19:33 +03:00
mutex_unlock ( & wq - > mutex ) ;
}
2015-04-02 14:14:39 +03:00
2023-08-08 04:57:23 +03:00
if ( wq - > flags & WQ_UNBOUND ) {
mutex_lock ( & wq_pool_mutex ) ;
put_unbound_pool ( pool ) ;
mutex_unlock ( & wq_pool_mutex ) ;
}
2015-04-02 14:14:39 +03:00
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- One max_active enforcement decouples from pool boundaires, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice more
online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and a work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
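The per-node split and the min_active floor described in the list above can be sketched roughly as follows. This is a minimal illustration and not the kernel code: the helper names `default_min_active()` and `node_max_active()` and the constant `DFL_MIN_ACTIVE` are hypothetical, and rounding the proportional share up is an assumption.

```c
#include <assert.h>

/* Hypothetical stand-in for WQ_DFL_MIN_ACTIVE, which the text says is 8. */
#define DFL_MIN_ACTIVE 8

/* min_active is the smaller of max_active and the default floor. */
static int default_min_active(int max_active)
{
	return max_active < DFL_MIN_ACTIVE ? max_active : DFL_MIN_ACTIVE;
}

/*
 * Hypothetical helper: a node's share of max_active, proportional to the
 * node's share of the workqueue's online effective CPUs, but never below
 * min_active and never above max_active. Rounding up is an assumption.
 */
static int node_max_active(int max_active, int node_cpus, int total_cpus)
{
	int min_active = default_min_active(max_active);
	int share;

	assert(total_cpus > 0 && node_cpus >= 0 && node_cpus <= total_cpus);

	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;
	if (share < min_active)
		share = min_active;
	if (share > max_active)
		share = max_active;
	return share;
}
```

With max_active of 16 spread over four equally sized nodes, each node's proportional share is 4, but the floor raises it to 8, so up to 32 work items can be active system-wide. That is the "higher effective max_active than configured" case the text warns about.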
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
if ( ! list_empty ( & pwq - > pending_node ) ) {
struct wq_node_nr_active * nna =
wq_node_nr_active ( pwq - > wq , pwq - > pool - > node ) ;
raw_spin_lock_irq ( & nna - > lock ) ;
list_del_init ( & pwq - > pending_node ) ;
raw_spin_unlock_irq ( & nna - > lock ) ;
}
2018-11-07 06:18:45 +03:00
call_rcu ( & pwq - > rcu , rcu_free_pwq ) ;
2013-03-12 22:30:00 +04:00
2013-08-01 05:56:36 +04:00
/*
2015-04-02 14:14:39 +03:00
* If we ' re the last pwq going away , @ wq is already dead and no one
* is gonna access it anymore . Schedule RCU free .
2013-08-01 05:56:36 +04:00
*/
2019-02-15 02:00:54 +03:00
if ( is_last ) {
wq_unregister_lockdep ( wq ) ;
2018-11-07 06:18:45 +03:00
call_rcu ( & wq - > rcu , rcu_free_wq ) ;
2019-02-15 02:00:54 +03:00
}
2013-03-12 22:30:03 +04:00
}
2021-07-31 03:01:29 +03:00
/* initialize newly allocated @pwq which is associated with @wq and @pool */
2015-04-02 14:14:39 +03:00
static void init_pwq ( struct pool_workqueue * pwq , struct workqueue_struct * wq ,
struct worker_pool * pool )
2013-03-12 22:30:03 +04:00
{
2024-02-21 08:36:14 +03:00
BUG_ON ( ( unsigned long ) pwq & ~ WORK_STRUCT_PWQ_MASK ) ;
2013-03-12 22:30:03 +04:00
2015-04-02 14:14:39 +03:00
memset ( pwq , 0 , sizeof ( * pwq ) ) ;
pwq - > pool = pool ;
pwq - > wq = wq ;
pwq - > flush_color = - 1 ;
pwq - > refcnt = 1 ;
2021-08-17 04:32:34 +03:00
INIT_LIST_HEAD ( & pwq - > inactive_works ) ;
2024-01-29 21:11:25 +03:00
INIT_LIST_HEAD ( & pwq - > pending_node ) ;
2015-04-02 14:14:39 +03:00
INIT_LIST_HEAD ( & pwq - > pwqs_node ) ;
INIT_LIST_HEAD ( & pwq - > mayday_node ) ;
2023-08-08 04:57:23 +03:00
kthread_init_work ( & pwq - > release_work , pwq_release_workfn ) ;
2013-03-12 22:30:03 +04:00
}
2015-04-02 14:14:39 +03:00
/* sync @pwq with the current state of its associated wq and link it */
static void link_pwq ( struct pool_workqueue * pwq )
2013-03-12 22:30:03 +04:00
{
2015-04-02 14:14:39 +03:00
struct workqueue_struct * wq = pwq - > wq ;
2013-03-12 22:30:03 +04:00
2015-04-02 14:14:39 +03:00
lockdep_assert_held ( & wq - > mutex ) ;
2013-04-01 22:23:32 +04:00
2015-04-02 14:14:39 +03:00
/* may be called multiple times, ignore if already linked */
if ( ! list_empty ( & pwq - > pwqs_node ) )
2013-03-12 22:30:03 +04:00
return ;
2015-04-02 14:14:39 +03:00
/* set the matching work_color */
pwq - > work_color = wq - > work_color ;
2013-03-12 22:30:03 +04:00
2015-04-02 14:14:39 +03:00
/* link in @pwq */
2024-02-08 19:10:11 +03:00
list_add_tail_rcu ( & pwq - > pwqs_node , & wq - > pwqs ) ;
2015-04-02 14:14:39 +03:00
}
2013-03-12 22:30:03 +04:00
2015-04-02 14:14:39 +03:00
/* obtain a pool matching @attrs and create a pwq associating the pool and @wq */
static struct pool_workqueue * alloc_unbound_pwq ( struct workqueue_struct * wq ,
const struct workqueue_attrs * attrs )
{
struct worker_pool * pool ;
struct pool_workqueue * pwq ;
workqueue: async worker destruction
worker destruction includes these parts of code:
adjust pool's stats
remove the worker from idle list
detach the worker from the pool
kthread_stop() to wait for the worker's task exit
free the worker struct
We can find out that there is no essential work to do after
kthread_stop(), which means destroy_worker() doesn't need to wait for
the worker's task exit, so we can remove kthread_stop() and free the
worker struct in the worker exiting path.
However, put_unbound_pool() still needs to sync all the workers'
destruction before destroying the pool; otherwise, the workers may
access the invalid pool when they are exiting.
So we also move the code of "detach the worker" to the exiting
path and let put_unbound_pool() sync with this code via
detach_completion.
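The ordering described above can be modelled in a few lines. This is a single-threaded sketch of the bookkeeping only, not the kernel implementation: `struct pool_model`, `struct completion_model`, `worker_detach()` and `pool_begin_teardown()` are hypothetical names, and a plain flag stands in for a real completion that a waiter would sleep on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a completion: a flag instead of a sleeping waiter. */
struct completion_model {
	bool done;
};

struct pool_model {
	int nr_attached;				/* workers still attached */
	struct completion_model *detach_completion;	/* set during teardown */
};

/* Runs in the worker's own exit path rather than in destroy_worker(). */
static void worker_detach(struct pool_model *pool)
{
	assert(pool->nr_attached > 0);
	pool->nr_attached--;

	/* The last worker out signals whoever is waiting to free the pool. */
	if (pool->nr_attached == 0 && pool->detach_completion)
		pool->detach_completion->done = true;
}

/*
 * Start tearing the pool down. Returns true if the caller has to wait for
 * @c to complete before the pool may be freed.
 */
static bool pool_begin_teardown(struct pool_model *pool,
				struct completion_model *c)
{
	c->done = false;
	if (pool->nr_attached > 0) {
		pool->detach_completion = c;
		return true;
	}
	return false;
}
```

The point of the model is that the pool is freed only after the last exiting worker has detached, so no worker can touch a freed pool even though nobody waits on the worker's task itself.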
The code of "detach the worker" is wrapped in a new function
"worker_detach_from_pool()" although worker_detach_from_pool() is only
called once (in worker_thread()) after this patch, but we need to wrap
it for these reasons:
1) The code of "detach the worker" is too long to inline in
worker_thread().
2) The name "worker_detach_from_pool()" is self-documenting, and we add
some comments above the function.
3) It will be shared by the rescuer in a later patch which allows the rescuer
and normal threads to use the same attach/detach framework.
The worker id is freed when detaching which happens before the worker
is fully dead, but this id of the dying worker may be re-used for a
new worker, so the dying worker's task name is changed to
"worker/dying" to avoid two or several workers having the same name.
Since "detach the worker" is moved out from destroy_worker(),
destroy_worker() doesn't require manager_mutex, so the
"lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is
removed, and destroy_worker() is not protected by manager_mutex in
put_unbound_pool().
tj: Minor description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 13:46:29 +04:00
2015-04-02 14:14:39 +03:00
lockdep_assert_held ( & wq_pool_mutex ) ;
2014-05-20 13:46:29 +04:00
2015-04-02 14:14:39 +03:00
pool = get_unbound_pool ( attrs ) ;
if ( ! pool )
return NULL ;
2014-05-20 13:46:29 +04:00
2015-04-02 14:14:39 +03:00
pwq = kmem_cache_alloc_node ( pwq_cache , GFP_KERNEL , pool - > node ) ;
if ( ! pwq ) {
put_unbound_pool ( pool ) ;
return NULL ;
}
2013-03-12 22:30:03 +04:00
2015-04-02 14:14:39 +03:00
init_pwq ( pwq , wq , pool ) ;
return pwq ;
}
2013-03-12 22:30:03 +04:00
/**
2023-08-08 04:57:23 +03:00
* wq_calc_pod_cpumask - calculate a wq_attrs ' cpumask for a pod
2015-04-30 12:16:12 +03:00
* @ attrs : the wq_attrs of the default pwq of the target workqueue
2023-08-08 04:57:24 +03:00
* @ cpu : the target CPU
2015-04-02 14:14:39 +03:00
* @ cpu_going_down : if > = 0 , the CPU to consider as offline
2013-03-12 22:30:03 +04:00
*
2023-08-08 04:57:23 +03:00
* Calculate the cpumask a workqueue with @ attrs should use on @ pod . If
* @ cpu_going_down is > = 0 , that cpu is considered offline during calculation .
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only on CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, each worker_pool needs to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
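The two-mask scheme and its effect on pool sharing can be illustrated with plain bitmasks. This is a sketch under assumptions, not the kernel's data structures: `mask_t`, `CPU_BIT()`, `struct pool_key`, `make_pool_key()` and `pool_key_equal()` are hypothetical, with one bit per CPU standing in for a struct cpumask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU; a simplified stand-in for struct cpumask. */
typedef uint64_t mask_t;

#define CPU_BIT(n) ((mask_t)1 << (n))

/* The two masks that together identify a worker_pool in this model. */
struct pool_key {
	mask_t cpumask;		/* online subset of the workqueue's allowed CPUs */
	mask_t pod_cpumask;	/* the pod's subset of ->cpumask */
};

static struct pool_key make_pool_key(mask_t wq_cpumask, mask_t pod_cpus,
				     mask_t online)
{
	struct pool_key key = {
		.cpumask = wq_cpumask & online,
		.pod_cpumask = pod_cpus & wq_cpumask & online,
	};

	return key;
}

/* Pools can be shared only when both masks match. */
static bool pool_key_equal(struct pool_key a, struct pool_key b)
{
	return a.cpumask == b.cpumask && a.pod_cpumask == b.pod_cpumask;
}
```

For a workqueue restricted to CPUs 0 and 2 and another restricted to CPUs 0, 2 and 3, the pod holding CPU 0 yields the same pod_cpumask for both, yet the keys differ because the cpumask fields differ. In this model that is why the two workqueues no longer share a worker_pool.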
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-08 04:57:25 +03:00
* The result is stored in @ attrs - > __pod_cpumask .
2013-04-01 22:23:32 +04:00
*
2023-08-08 04:57:23 +03:00
* If pod affinity is not enabled , @ attrs - > cpumask is always used . If enabled
* and @ pod has online CPUs requested by @ attrs , the returned cpumask is the
* intersection of the possible CPUs of @ pod and @ attrs - > cpumask .
2013-08-01 01:59:24 +04:00
*
2023-08-08 04:57:23 +03:00
* The caller is responsible for ensuring that the cpumask of @ pod stays stable .
2013-03-12 22:30:03 +04:00
*/
2023-08-08 04:57:25 +03:00
static void wq_calc_pod_cpumask ( struct workqueue_attrs * attrs , int cpu ,
int cpu_going_down )
2013-03-12 22:30:03 +04:00
{
2023-08-08 04:57:24 +03:00
const struct wq_pod_type * pt = wqattrs_pod_type ( attrs ) ;
int pod = pt - > cpu_pod [ cpu ] ;
2013-03-12 22:30:03 +04:00
2023-08-08 04:57:23 +03:00
/* does @pod have any online CPUs @attrs wants? */
2023-08-08 04:57:25 +03:00
cpumask_and ( attrs - > __pod_cpumask , pt - > pod_cpus [ pod ] , attrs - > cpumask ) ;
cpumask_and ( attrs - > __pod_cpumask , attrs - > __pod_cpumask , cpu_online_mask ) ;
2015-04-02 14:14:39 +03:00
if ( cpu_going_down > = 0 )
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only on CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, a worker_pool needs to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
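As a rough illustration of the split described above, the following userspace
sketch models the two masks as plain 64-bit bitmaps. The names mask_t,
attrs_sketch, NR_PODS_SKETCH and calc_pod_cpumask_sketch() are hypothetical
and invented for this example. This is not the kernel's cpumask API; it only
mirrors the intersect-then-fall-back shape of the surrounding code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, at most 64 CPUs. */
typedef uint64_t mask_t;

#define NR_PODS_SKETCH 2

struct attrs_sketch {
	mask_t cpumask;      /* CPUs the user allows for the workqueue */
	mask_t pod_cpumask;  /* the pod's subset of cpumask, computed below */
};

/*
 * Compute the pod subset of attrs->cpumask for @pod.
 * If no online CPU of the pod is left (ignoring @cpu_going_down when it
 * is >= 0), fall back to the whole allowed cpumask.
 */
static void calc_pod_cpumask_sketch(struct attrs_sketch *attrs,
				    const mask_t pod_cpus[NR_PODS_SKETCH],
				    int pod, mask_t online, int cpu_going_down)
{
	mask_t online_in_pod;

	assert(pod >= 0 && pod < NR_PODS_SKETCH);
	assert(cpu_going_down < 64);

	/* Does the pod have any online CPU that the workqueue wants? */
	online_in_pod = attrs->cpumask & pod_cpus[pod] & online;
	if (cpu_going_down >= 0)
		online_in_pod &= ~((mask_t)1 << cpu_going_down);

	if (online_in_pod == 0) {
		attrs->pod_cpumask = attrs->cpumask;
		return;
	}

	/* Yes: keep the possible CPUs in the pod that the workqueue wants. */
	attrs->pod_cpumask = attrs->cpumask & pod_cpus[pod];
}
```

With ->cpumask covering CPUs 0 and 2 (0x5) and two pods holding CPUs {0,1}
and {2,3}, the sketch yields 0x1 for the first pod and 0x4 for the second,
which matches the example in the text.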
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It could directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
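The sharing consequence described above can be sketched with a simplified
equality check. The names mask_t, attrs_sketch and attrs_equal_sketch() are
hypothetical; the real wqattrs_equal() may compare additional fields, so this
only shows why equal pod subsets are no longer enough for two workqueues to
share a worker_pool.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU. */
typedef uint64_t mask_t;

struct attrs_sketch {
	mask_t cpumask;      /* allowed CPUs of the associated workqueues */
	mask_t pod_cpumask;  /* the pod's subset of cpumask */
};

/* Two pools can be shared only if both masks match (assumed subset of fields). */
static bool attrs_equal_sketch(const struct attrs_sketch *a,
			       const struct attrs_sketch *b)
{
	return a->cpumask == b->cpumask &&
	       a->pod_cpumask == b->pod_cpumask;
}
```

For the two workqueues in the example, the first pod would have cpumask 0x5
and pod subset 0x1 for one workqueue, and cpumask 0xD and pod subset 0x1 for
the other, so the check reports them as different.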
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see a very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-08 04:57:25 +03:00
cpumask_clear_cpu ( cpu_going_down , attrs - > __pod_cpumask ) ;
2013-03-12 22:30:03 +04:00
2023-08-08 04:57:25 +03:00
if ( cpumask_empty ( attrs - > __pod_cpumask ) ) {
cpumask_copy ( attrs - > __pod_cpumask , attrs - > cpumask ) ;
2023-08-08 04:57:24 +03:00
return ;
}
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
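The fallback described above can be pictured with a small userspace sketch.
The names pwq_sketch, wq_sketch, NR_NODES_SKETCH and update_node_pwq_sketch()
are hypothetical, and a NULL argument stands in for a failed pool allocation;
the actual hotplug path is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_NODES_SKETCH 2

struct pwq_sketch {
	uint64_t cpus;  /* CPUs covered by this pwq, one bit per CPU */
};

struct wq_sketch {
	struct pwq_sketch dfl_pwq;                     /* covers the whole cpumask */
	struct pwq_sketch *node_pwq[NR_NODES_SKETCH];  /* per-node mapping */
};

/*
 * Install @fresh as the pwq for @node. When @fresh is NULL (modelling a
 * failed allocation), point the node at the default pwq instead, so the
 * update itself never has to fail.
 */
static void update_node_pwq_sketch(struct wq_sketch *wq, int node,
				   struct pwq_sketch *fresh)
{
	assert(node >= 0 && node < NR_NODES_SKETCH);
	wq->node_pwq[node] = fresh ? fresh : &wq->dfl_pwq;
}
```

Because the default pwq always exists, a node can be remapped even when no
new pool could be created, at the cost of losing node locality for that node.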
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 22:23:36 +04:00
2023-08-08 04:57:23 +03:00
/* yeap, return possible CPUs in @pod that @attrs wants */
2023-08-08 04:57:25 +03:00
cpumask_and ( attrs - > __pod_cpumask , attrs - > cpumask , pt - > pod_cpus [ pod ] ) ;
2017-07-28 00:27:14 +03:00
2023-08-08 04:57:25 +03:00
if ( cpumask_empty ( attrs - > __pod_cpumask ) )
2017-07-28 00:27:14 +03:00
pr_warn_once ( " WARNING: workqueue cpumask: online intersect > "
" possible intersect \n " ) ;
2013-04-01 22:23:36 +04:00
}
2024-01-29 21:11:24 +03:00
/* install @pwq into @wq and return the old pwq, @cpu < 0 for dfl_pwq */
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
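The per-CPU installation described in the list above can be sketched as a
pointer swap on a per-CPU slot. The names pwq_sketch, wq_sketch,
NR_CPUS_SKETCH and install_pwq_sketch() are hypothetical, and the sketch
deliberately omits the locking and RCU publication that the real code relies
on. The "cpu < 0 selects the default pwq" convention is taken from the
comment on install_unbound_pwq() in the surrounding code.

```c
#include <stddef.h>

#define NR_CPUS_SKETCH 4

struct pwq_sketch {
	int id;  /* only used to tell instances apart in this example */
};

struct wq_sketch {
	struct pwq_sketch *cpu_pwq[NR_CPUS_SKETCH];  /* one slot per CPU */
	struct pwq_sketch *dfl_pwq;                  /* default pwq */
};

/*
 * Install @pwq into the slot for @cpu (or the default slot when @cpu < 0)
 * and return whatever was installed there before, so that the caller can
 * release the old pwq.
 */
static struct pwq_sketch *install_pwq_sketch(struct wq_sketch *wq, int cpu,
					     struct pwq_sketch *pwq)
{
	struct pwq_sketch **slot = cpu < 0 ? &wq->dfl_pwq : &wq->cpu_pwq[cpu];
	struct pwq_sketch *old = *slot;

	*slot = pwq;
	return old;
}
```

Returning the old pointer instead of freeing it in place lets the caller
collect the replaced pwqs and drop their references after the update is done.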
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
static struct pool_workqueue * install_unbound_pwq ( struct workqueue_struct * wq ,
int cpu , struct pool_workqueue * pwq )
2013-04-01 22:23:35 +04:00
{
2024-01-29 21:11:24 +03:00
struct pool_workqueue __rcu * * slot = unbound_pwq_slot ( wq , cpu ) ;
2013-04-01 22:23:35 +04:00
struct pool_workqueue * old_pwq ;
2015-05-12 15:32:29 +03:00
lockdep_assert_held ( & wq_pool_mutex ) ;
2013-04-01 22:23:35 +04:00
lockdep_assert_held ( & wq - > mutex ) ;
/* link_pwq() can handle duplicate calls */
link_pwq ( pwq ) ;
2024-01-29 21:11:24 +03:00
old_pwq = rcu_access_pointer ( * slot ) ;
rcu_assign_pointer ( * slot , pwq ) ;
2013-04-01 22:23:35 +04:00
return old_pwq ;
}
2015-04-27 12:58:38 +03:00
/* context to store the prepared attrs & pwqs before applying */
struct apply_wqattrs_ctx {
struct workqueue_struct * wq ; /* target workqueue */
struct workqueue_attrs * attrs ; /* attrs to apply */
2015-04-30 12:16:12 +03:00
struct list_head list ; /* queued for batching commit */
2015-04-27 12:58:38 +03:00
struct pool_workqueue * dfl_pwq ;
struct pool_workqueue * pwq_tbl [ ] ;
} ;
/* free the resources after success or abort */
static void apply_wqattrs_cleanup ( struct apply_wqattrs_ctx * ctx )
{
if ( ctx ) {
2023-08-08 04:57:23 +03:00
int cpu ;
2015-04-27 12:58:38 +03:00
2023-08-08 04:57:23 +03:00
for_each_possible_cpu ( cpu )
put_pwq_unlocked ( ctx - > pwq_tbl [ cpu ] ) ;
2015-04-27 12:58:38 +03:00
put_pwq_unlocked ( ctx - > dfl_pwq ) ;
free_workqueue_attrs ( ctx - > attrs ) ;
kfree ( ctx ) ;
}
}
/* allocate the attrs and pwqs for later installation */
static struct apply_wqattrs_ctx *
apply_wqattrs_prepare ( struct workqueue_struct * wq ,
2023-01-12 19:14:27 +03:00
const struct workqueue_attrs * attrs ,
const cpumask_var_t unbound_cpumask )
2013-03-12 22:30:04 +04:00
{
2015-04-27 12:58:38 +03:00
struct apply_wqattrs_ctx * ctx ;
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only on CPUs 0
and 2, and the two CPUs are in different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, a worker_pool needs to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It could directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see a very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-08 04:57:25 +03:00
struct workqueue_attrs * new_attrs ;
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
int cpu ;
2013-03-12 22:30:04 +04:00
2015-04-27 12:58:38 +03:00
lockdep_assert_held ( & wq_pool_mutex ) ;
2013-03-12 22:30:04 +04:00
2023-08-08 04:57:24 +03:00
if ( WARN_ON ( attrs - > affn_scope < 0 | |
attrs - > affn_scope > = WQ_AFFN_NR_TYPES ) )
return ERR_PTR ( - EINVAL ) ;
2023-08-08 04:57:23 +03:00
ctx = kzalloc ( struct_size ( ctx , pwq_tbl , nr_cpu_ids ) , GFP_KERNEL ) ;
2013-03-12 22:30:04 +04:00
2019-06-26 17:52:38 +03:00
new_attrs = alloc_workqueue_attrs ( ) ;
2023-08-08 04:57:25 +03:00
if ( ! ctx | | ! new_attrs )
2015-04-27 12:58:38 +03:00
goto out_free ;
2013-04-01 22:23:31 +04:00
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 22:23:36 +04:00
/*
* If something goes wrong during CPU up / down , we ' ll fall back to
* the default pwq covering whole @ attrs - > cpumask . Always create
* it even if we don ' t use it immediately .
*/
2023-08-08 04:57:24 +03:00
copy_workqueue_attrs ( new_attrs , attrs ) ;
wqattrs_actualize_cpumask ( new_attrs , unbound_cpumask ) ;
2023-08-08 04:57:25 +03:00
cpumask_copy ( new_attrs - > __pod_cpumask , new_attrs - > cpumask ) ;
2015-04-27 12:58:38 +03:00
ctx - > dfl_pwq = alloc_unbound_pwq ( wq , new_attrs ) ;
if ( ! ctx - > dfl_pwq )
goto out_free ;
2013-04-01 22:23:36 +04:00
2023-08-08 04:57:23 +03:00
for_each_possible_cpu ( cpu ) {
2023-08-08 04:57:23 +03:00
if ( new_attrs - > ordered ) {
2015-04-27 12:58:38 +03:00
ctx - > dfl_pwq - > refcnt + + ;
2023-08-08 04:57:23 +03:00
ctx - > pwq_tbl [ cpu ] = ctx - > dfl_pwq ;
} else {
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only on CPUs 0
and 2, and the two CPUs are in different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, worker_pools need to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
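The sharing rule described above can be sketched in user-space C. This is only an illustration, not the kernel code: CPU sets are modelled as 64-bit bitmasks instead of struct cpumask, the pod layout (CPUs 0-1 in one pod, CPUs 2-3 in another) is an assumption chosen to match the example, and struct sketch_attrs, sketch_pod_attrs() and sketch_attrs_equal() are hypothetical names standing in for workqueue_attrs, wq_calc_pod_cpumask() and wqattrs_equal().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU; a simplified stand-in for struct cpumask. */
#define CPU(n) (UINT64_C(1) << (n))

/* Hypothetical, reduced version of struct workqueue_attrs. */
struct sketch_attrs {
	uint64_t cpumask;      /* every CPU the workqueue is allowed to use */
	uint64_t pod_cpumask;  /* the part of cpumask that lies in one pod */
};

/*
 * Build the attrs for one pod of a workqueue. If the pod has no CPU in
 * common with the workqueue, fall back to the whole cpumask (assumed
 * here to mirror the fallback behaviour of the real helper).
 */
static struct sketch_attrs sketch_pod_attrs(uint64_t wq_cpumask, uint64_t pod_cpus)
{
	struct sketch_attrs a = { .cpumask = wq_cpumask };

	a.pod_cpumask = wq_cpumask & pod_cpus;
	if (!a.pod_cpumask)
		a.pod_cpumask = wq_cpumask;

	/* The pod subset never contains a CPU outside the strict mask. */
	assert((a.pod_cpumask & ~a.cpumask) == 0);
	return a;
}

/* Two workqueues may share a pool only if both masks are identical. */
static bool sketch_attrs_equal(const struct sketch_attrs *a,
			       const struct sketch_attrs *b)
{
	return a->cpumask == b->cpumask && a->pod_cpumask == b->pod_cpumask;
}
```

With a first workqueue allowed on CPUs 0 and 2 and a second one allowed on CPUs 0, 2 and 3, both get a pod subset of just CPU 0 in the first pod, yet sketch_attrs_equal() reports them as different because their full cpumasks differ. That is the lost sharing the paragraph above describes.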
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see a very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-08 04:57:25 +03:00
wq_calc_pod_cpumask ( new_attrs , cpu , - 1 ) ;
ctx - > pwq_tbl [ cpu ] = alloc_unbound_pwq ( wq , new_attrs ) ;
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
if ( ! ctx - > pwq_tbl [ cpu ] )
goto out_free ;
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 22:23:36 +04:00
}
}
2015-04-30 12:16:12 +03:00
/* save the user configured attrs and sanitize it. */
copy_workqueue_attrs ( new_attrs , attrs ) ;
cpumask_and ( new_attrs - > cpumask , new_attrs - > cpumask , cpu_possible_mask ) ;
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, a worker_pool needs to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see a very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-08 04:57:25 +03:00
cpumask_copy ( new_attrs - > __pod_cpumask , new_attrs - > cpumask ) ;
2015-04-27 12:58:38 +03:00
ctx - > attrs = new_attrs ;
2015-04-30 12:16:12 +03:00
2024-02-08 22:12:20 +03:00
/*
* For initialized ordered workqueues , there should only be one pwq
* ( dfl_pwq ) . Set the plugged flag of ctx - > dfl_pwq to suspend execution
* of newly queued work items until execution of older work items in
* the old pwq ' s have completed .
*/
if ( ( wq - > flags & __WQ_ORDERED ) & & ! list_empty ( & wq - > pwqs ) )
ctx - > dfl_pwq - > plugged = true ;
2015-04-27 12:58:38 +03:00
ctx - > wq = wq ;
return ctx ;
out_free :
free_workqueue_attrs ( new_attrs ) ;
apply_wqattrs_cleanup ( ctx ) ;
2023-08-08 04:57:24 +03:00
return ERR_PTR ( - ENOMEM ) ;
2015-04-27 12:58:38 +03:00
}
/* set attrs and install prepared pwqs, @ctx points to old pwqs on return */
static void apply_wqattrs_commit ( struct apply_wqattrs_ctx * ctx )
{
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
int cpu ;
2013-03-12 22:30:04 +04:00
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 22:23:36 +04:00
/* all pwqs have been created successfully, let's install'em */
2015-04-27 12:58:38 +03:00
mutex_lock ( & ctx - > wq - > mutex ) ;
2013-04-01 22:23:32 +04:00
2015-04-27 12:58:38 +03:00
copy_workqueue_attrs ( ctx - > wq - > unbound_attrs , ctx - > attrs ) ;
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 22:23:36 +04:00
2024-01-29 21:11:24 +03:00
/* save the previous pwqs and install the new ones */
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
for_each_possible_cpu ( cpu )
ctx - > pwq_tbl [ cpu ] = install_unbound_pwq ( ctx - > wq , cpu ,
ctx - > pwq_tbl [ cpu ] ) ;
2024-01-29 21:11:24 +03:00
ctx - > dfl_pwq = install_unbound_pwq ( ctx - > wq , - 1 , ctx - > dfl_pwq ) ;
2013-04-01 22:23:35 +04:00
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented a more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with an unreasonable number of in-flight work items when many CPUs
issue IOs at the same time, i.e. the lowest acceptable max_active is higher
than the highest acceptable max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement is decoupled from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node, e.g. a node with twice as many
online effective CPUs will get twice as large a portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to a higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
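The per-node split of max_active described in the list above can be sketched as plain arithmetic. This is a minimal user-space sketch under stated assumptions, not the kernel implementation: sketch_node_max_active() and SKETCH_DFL_MIN_ACTIVE are hypothetical names, and rounding the proportional share up is an assumption, since the text only says that the share follows the proportion of online effective CPUs and is never lower than min_active.

```c
#include <assert.h>

/* The text gives WQ_DFL_MIN_ACTIVE as 8; mirrored here under a sketch name. */
#define SKETCH_DFL_MIN_ACTIVE 8

static int sketch_min(int a, int b)
{
	return a < b ? a : b;
}

static int sketch_clamp(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Share of max_active given to one node.
 *
 * node_cpus:  the workqueue's online effective CPUs on this node
 * total_cpus: the workqueue's online effective CPUs on all nodes
 */
static int sketch_node_max_active(int max_active, int node_cpus, int total_cpus)
{
	/* min_active is the smaller of max_active and the default floor. */
	int min_active = sketch_min(max_active, SKETCH_DFL_MIN_ACTIVE);
	int share;

	assert(max_active > 0);
	assert(total_cpus > 0 && node_cpus >= 0 && node_cpus <= total_cpus);

	/* Proportional share, rounded up (the rounding is an assumption). */
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;

	/* Never below min_active, never above the configured max_active. */
	return sketch_clamp(share, min_active, max_active);
}
```

With max_active of 48 and nodes holding 16 and 8 of the workqueue's 24 online effective CPUs, the shares are 32 and 16. With max_active of 16 spread over four nodes of 8 CPUs each, the proportional share of 4 is raised to the floor of 8 on every node, so the sum across nodes is 32. That is the "higher effective max_active than configured" case mentioned above.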
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
/* update node_nr_active->max */
wq_update_node_max_active ( ctx - > wq , - 1 ) ;
2024-02-08 19:10:13 +03:00
/* rescuer needs to respect wq cpumask changes */
if ( ctx - > wq - > rescuer )
set_cpus_allowed_ptr ( ctx - > wq - > rescuer - > task ,
unbound_effective_cpumask ( ctx - > wq ) ) ;
2015-04-27 12:58:38 +03:00
mutex_unlock ( & ctx - > wq - > mutex ) ;
}
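The per-node max_active distribution described in the v3 note above can be sketched in userspace as follows. This is a minimal model, not the kernel implementation: distribute_node_max(), clamp_int() and the plain int arrays are hypothetical stand-ins, and the proportional split with rounding up and clamping to [min_active, max_active] is an assumption about how the cached node_nr_active->max values are derived from per-workqueue effective online CPU counts.

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Clamp v into the closed range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Hypothetical model: split a workqueue-wide max_active across nodes in
 * proportion to the number of effective online CPUs the workqueue has on
 * each node. node_cpus[i] is that count for node i; the result for node i
 * is written to node_max[i].
 */
static void distribute_node_max(const int *node_cpus, int nr_nodes,
				int min_active, int max_active,
				int *node_max)
{
	int total = 0;
	int i;

	assert(min_active <= max_active);

	for (i = 0; i < nr_nodes; i++)
		total += node_cpus[i];

	for (i = 0; i < nr_nodes; i++) {
		/* Assumption: with no online CPUs, fall back to the floor. */
		if (!total) {
			node_max[i] = min_active;
			continue;
		}
		node_max[i] = clamp_int(DIV_ROUND_UP(max_active * node_cpus[i],
						     total),
					min_active, max_active);
	}
}
```

With two nodes holding 12 and 4 effective CPUs, min_active of 8 and max_active of 16, the first node gets its proportional share and the second is raised to the min_active floor, which is the situation the 8-item chain discussion above is concerned with.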
2013-03-12 22:30:04 +04:00
2015-05-19 13:03:47 +03:00
static int apply_workqueue_attrs_locked ( struct workqueue_struct * wq ,
const struct workqueue_attrs * attrs )
2015-04-27 12:58:38 +03:00
{
struct apply_wqattrs_ctx * ctx ;
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 22:23:36 +04:00
2015-04-27 12:58:38 +03:00
/* only unbound workqueues can change attributes */
if ( WARN_ON ( ! ( wq - > flags & WQ_UNBOUND ) ) )
return - EINVAL ;
2013-04-01 22:23:31 +04:00
2023-01-12 19:14:27 +03:00
ctx = apply_wqattrs_prepare ( wq , attrs , wq_unbound_cpumask ) ;
2023-08-08 04:57:24 +03:00
if ( IS_ERR ( ctx ) )
return PTR_ERR ( ctx ) ;
2015-04-27 12:58:38 +03:00
/* the ctx has been prepared successfully, let's commit it */
2016-01-07 15:38:59 +03:00
apply_wqattrs_commit ( ctx ) ;
2015-04-27 12:58:38 +03:00
apply_wqattrs_cleanup ( ctx ) ;
2016-01-07 15:38:59 +03:00
return 0 ;
2013-03-12 22:30:04 +04:00
}
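The prepare / commit / cleanup sequence used by apply_workqueue_attrs_locked() above follows a common pattern: do everything that can fail up front, report failure through an error-encoded pointer, and only then apply the change. A minimal userspace sketch of that pattern is shown below. All names here (err_ptr(), is_err(), struct update_ctx, apply_update() and friends) are hypothetical stand-ins written for illustration; they are not the kernel's ERR_PTR helpers or the workqueue code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for error-encoded pointers (illustrative only). */
#define MAX_ERRNO 4095

static void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct config {
	int value;
};

/* Everything needed to apply one update, built before anything changes. */
struct update_ctx {
	struct config *target;
	int new_value;
};

/* Phase 1: validate and allocate. May fail; touches nothing visible. */
static struct update_ctx *update_prepare(struct config *cfg, int new_value)
{
	struct update_ctx *ctx;

	if (new_value < 0)
		return err_ptr(-EINVAL);

	ctx = malloc(sizeof(*ctx));
	if (!ctx)
		return err_ptr(-ENOMEM);

	ctx->target = cfg;
	ctx->new_value = new_value;
	return ctx;
}

/* Phase 2: apply. Cannot fail because prepare did all the fallible work. */
static void update_commit(struct update_ctx *ctx)
{
	assert(ctx && ctx->target);
	ctx->target->value = ctx->new_value;
}

/* Phase 3: release the context regardless of what was committed. */
static void update_cleanup(struct update_ctx *ctx)
{
	free(ctx);
}

static int apply_update(struct config *cfg, int new_value)
{
	struct update_ctx *ctx = update_prepare(cfg, new_value);

	if (is_err(ctx))
		return (int)ptr_err(ctx);

	update_commit(ctx);
	update_cleanup(ctx);
	return 0;
}
```

The point of the split is that a failed prepare leaves the target untouched, so the caller never observes a half-applied update.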
2015-05-19 13:03:47 +03:00
/**
* apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
* @ wq : the target workqueue
* @ attrs : the workqueue_attrs to apply , allocated with alloc_workqueue_attrs ( )
*
2023-08-08 04:57:23 +03:00
* Apply @ attrs to an unbound workqueue @ wq . Unless disabled , this function maps
* a separate pwq to each CPU pod with possible CPUs in @ attrs - > cpumask so that
* work items are affine to the pod they were issued on . Older pwqs are released as
* in - flight work items finish . Note that a work item which repeatedly requeues
* itself back - to - back will stay on its current pwq .
2015-05-19 13:03:47 +03:00
*
* Performs GFP_KERNEL allocations .
*
2021-08-03 17:16:20 +03:00
* Assumes caller has CPU hotplug read exclusion , i . e . cpus_read_lock ( ) .
2019-09-06 04:40:23 +03:00
*
2015-05-19 13:03:47 +03:00
* Return : 0 on success and - errno on failure .
*/
2019-09-06 04:40:22 +03:00
int apply_workqueue_attrs ( struct workqueue_struct * wq ,
2015-05-19 13:03:47 +03:00
const struct workqueue_attrs * attrs )
{
int ret ;
2019-09-06 04:40:23 +03:00
lockdep_assert_cpus_held ( ) ;
mutex_lock ( & wq_pool_mutex ) ;
2015-05-19 13:03:47 +03:00
ret = apply_workqueue_attrs_locked ( wq , attrs ) ;
2019-09-06 04:40:23 +03:00
mutex_unlock ( & wq_pool_mutex ) ;
2015-05-19 13:03:47 +03:00
return ret ;
}
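The pod mapping described in the apply_workqueue_attrs() comment above, together with the fallback to a pwq covering the whole cpumask when a pod has no usable CPUs, can be modelled with plain bitmasks. The sketch below is an assumption-laden illustration: pod_pwq_cpumask() is a hypothetical helper, CPUs are represented as bits in an unsigned long rather than a struct cpumask, and off_cpu mirrors the "CPU going down, or -1" convention used in this file.

```c
#include <assert.h>

/*
 * Hypothetical model: return the CPU mask that the pwq serving one pod
 * should cover.
 *
 * wq_mask  - CPUs the workqueue is allowed to run on
 * pod_mask - CPUs belonging to the pod
 * off_cpu  - CPU that is going down, or -1 if none
 *
 * If the pod has no usable CPU left in wq_mask, fall back to the whole
 * wq_mask, which is what a default pwq would cover.
 */
static unsigned long pod_pwq_cpumask(unsigned long wq_mask,
				     unsigned long pod_mask, int off_cpu)
{
	unsigned long mask = wq_mask & pod_mask;

	assert(off_cpu < (int)(8 * sizeof(unsigned long)));

	/* Exclude the CPU that is about to go offline, if any. */
	if (off_cpu >= 0)
		mask &= ~(1UL << off_cpu);

	return mask ? mask : wq_mask;
}
```

For a workqueue allowed on CPUs 0-3 and a pod made of CPUs 0-1, the pod's pwq covers CPUs 0-1; once the last allowed CPU of a pod goes away, the result widens to the full workqueue mask rather than becoming empty.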
2013-04-01 22:23:36 +04:00
/**
2023-08-08 04:57:23 +03:00
* wq_update_pod - update pod affinity of a wq for CPU hot [ un ] plug
2013-04-01 22:23:36 +04:00
* @ wq : the target workqueue
2023-08-08 04:57:23 +03:00
* @ cpu : the CPU to update pool association for
* @ hotplug_cpu : the CPU coming up or going down
2013-04-01 22:23:36 +04:00
* @ online : whether @ hotplug_cpu is coming up or going down
*
* This function is to be called from % CPU_DOWN_PREPARE , % CPU_ONLINE and
2023-08-08 04:57:23 +03:00
* % CPU_DOWN_FAILED . @ cpu is being hot [ un ] plugged , update pod affinity of
2013-04-01 22:23:36 +04:00
* @ wq accordingly .
*
2023-08-08 04:57:23 +03:00
*
* If pod affinity can ' t be adjusted due to memory allocation failure , it falls
* back to @ wq - > dfl_pwq which may not be optimal but is always correct .
*
* Note that when the last allowed CPU of a pod goes offline for a workqueue
* with a cpumask spanning multiple pods , the workers which were already
* executing the work items for the workqueue will lose their CPU affinity and
* may execute on any CPU . This is similar to how per - cpu workqueues behave on
* CPU_DOWN . If a workqueue user wants strict affinity , it ' s the user ' s
* responsibility to flush the work item from CPU_DOWN_PREPARE .
2013-04-01 22:23:36 +04:00
*/
2023-08-08 04:57:23 +03:00
static void wq_update_pod ( struct workqueue_struct * wq , int cpu ,
int hotplug_cpu , bool online )
2013-04-01 22:23:36 +04:00
{
2023-08-08 04:57:23 +03:00
int off_cpu = online ? - 1 : hotplug_cpu ;
2013-04-01 22:23:36 +04:00
struct pool_workqueue * old_pwq = NULL , * pwq ;
struct workqueue_attrs * target_attrs ;
lockdep_assert_held ( & wq_pool_mutex ) ;
2023-08-08 04:57:24 +03:00
if ( ! ( wq - > flags & WQ_UNBOUND ) | | wq - > unbound_attrs - > ordered )
2013-04-01 22:23:36 +04:00
return ;
/*
* We don ' t wanna alloc / free wq_attrs for each wq for each CPU .
* Let ' s use a preallocated one . The following buf is protected by
* CPU hotplug exclusion .
*/
2023-08-08 04:57:23 +03:00
target_attrs = wq_update_pod_attrs_buf ;
2013-04-01 22:23:36 +04:00
copy_workqueue_attrs ( target_attrs , wq - > unbound_attrs ) ;
2023-08-08 04:57:24 +03:00
wqattrs_actualize_cpumask ( target_attrs , wq_unbound_cpumask ) ;
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 22:23:36 +04:00
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
/* nothing to do if the target cpumask matches the current pwq */
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only on CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, worker_pools need to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-08-08 04:57:25 +03:00
wq_calc_pod_cpumask(target_attrs, cpu, off_cpu);
2024-01-29 21:11:24 +03:00
if (wqattrs_equal(target_attrs, unbound_pwq(wq, cpu)->pool->attrs))
2023-08-08 04:57:23 +03:00
return;
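The early return above depends on comparing both the workqueue's cpumask and its per-pod subset. The following is a minimal userspace sketch of that idea, assuming every CPU mask fits in one 64-bit word. The names `attrs_sketch`, `calc_pod_cpumask_sketch` and `attrs_equal_sketch` are hypothetical stand-ins rather than the kernel's `workqueue_attrs`, `wq_calc_pod_cpumask()` or `wqattrs_equal()`, and the fallback to the whole cpumask when the pod subset comes out empty is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for workqueue_attrs: one bit per CPU. */
struct attrs_sketch {
	uint64_t cpumask;     /* CPUs the workqueue is allowed to use */
	uint64_t pod_cpumask; /* subset of cpumask that lies inside one pod */
};

/*
 * Compute the pod subset. @pod_cpus is the mask of all CPUs in the target
 * CPU's pod; @off_cpu is a CPU that is going down, or -1 for none.
 * Falling back to the whole cpumask when nothing is left is an assumption.
 */
static void calc_pod_cpumask_sketch(struct attrs_sketch *attrs,
				    uint64_t pod_cpus, int off_cpu)
{
	uint64_t mask = attrs->cpumask & pod_cpus;

	assert(off_cpu < 64);
	if (off_cpu >= 0)
		mask &= ~(UINT64_C(1) << off_cpu);
	attrs->pod_cpumask = mask ? mask : attrs->cpumask;
}

/* Two attrs match only if both the full mask and the pod subset match. */
static bool attrs_equal_sketch(const struct attrs_sketch *a,
			       const struct attrs_sketch *b)
{
	return a->cpumask == b->cpumask && a->pod_cpumask == b->pod_cpumask;
}
```

With the commit message's example (one workqueue allowed on CPUs 0 and 2, another on CPUs 0, 2 and 3, CPU 0 alone in its pod), both pod subsets contain only CPU 0, yet the comparison reports a mismatch because the full cpumasks differ. That is why the two workqueues no longer share a worker_pool.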
2013-04-01 22:23:36 +04:00
/* create a new pwq */
pwq = alloc_unbound_pwq(wq, target_attrs);
if (!pwq) {
2023-08-08 04:57:23 +03:00
pr_warn("workqueue: allocation failed while updating CPU pod affinity of \"%s\"\n",
2014-05-12 21:59:35 +04:00
wq->name);
2014-04-16 09:32:29 +04:00
goto use_dfl_pwq;
2013-04-01 22:23:36 +04:00
}
2015-05-12 15:32:30 +03:00
/* Install the new pwq. */
2013-04-01 22:23:36 +04:00
mutex_lock(&wq->mutex);
2023-08-08 04:57:23 +03:00
old_pwq = install_unbound_pwq(wq, cpu, pwq);
2013-04-01 22:23:36 +04:00
goto out_unlock;
use_dfl_pwq:
2015-05-12 15:32:30 +03:00
mutex_lock(&wq->mutex);
2024-01-29 21:11:24 +03:00
pwq = unbound_pwq(wq, -1);
raw_spin_lock_irq(&pwq->pool->lock);
get_pwq(pwq);
raw_spin_unlock_irq(&pwq->pool->lock);
old_pwq = install_unbound_pwq(wq, cpu, pwq);
2013-04-01 22:23:36 +04:00
out_unlock:
mutex_unlock(&wq->mutex);
put_pwq_unlocked(old_pwq);
}
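The function above either installs a freshly allocated pwq or, when allocation fails, reuses the workqueue's default pwq after taking an extra reference on it. A minimal userspace sketch of that install-or-fall-back pattern follows. The types `pwq_sketch` and `wq_sketch` and the helpers `install_sketch` and `update_cpu_sketch` are hypothetical, the reference count is a plain integer, and all locking is omitted, so this only illustrates the reference accounting and is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4

/* Hypothetical stand-in for pool_workqueue: only a reference count. */
struct pwq_sketch {
	int refcnt;
};

/* Hypothetical stand-in for workqueue_struct. */
struct wq_sketch {
	struct pwq_sketch *cpu_pwq[NR_CPUS_SKETCH]; /* per-CPU slot */
	struct pwq_sketch *dfl_pwq;                 /* always-present fallback */
};

/* Put @pwq into @cpu's slot and hand back whatever was there before. */
static struct pwq_sketch *install_sketch(struct wq_sketch *wq, int cpu,
					 struct pwq_sketch *pwq)
{
	struct pwq_sketch *old = wq->cpu_pwq[cpu];

	wq->cpu_pwq[cpu] = pwq;
	return old;
}

/*
 * Install @new_pwq for @cpu, or share the default pwq when allocation
 * failed (@new_pwq == NULL). Returns the displaced pwq, or NULL, after
 * dropping the slot's reference to it.
 */
static struct pwq_sketch *update_cpu_sketch(struct wq_sketch *wq, int cpu,
					    struct pwq_sketch *new_pwq)
{
	struct pwq_sketch *old;

	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	if (!new_pwq) {
		new_pwq = wq->dfl_pwq;
		new_pwq->refcnt++;	/* the slot holds its own reference */
	}
	old = install_sketch(wq, cpu, new_pwq);
	if (old)
		old->refcnt--;
	return old;
}
```

Taking the extra reference before installing means the default pwq stays alive for as long as any CPU slot points at it, which mirrors the `get_pwq()` call made before `install_unbound_pwq()` in the fallback path.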
2013-03-12 22:29:57 +04:00
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
2010-06-29 12:07:11 +04:00
{
2013-03-12 22:29:58 +04:00
bool highpri = wq->flags & WQ_HIGHPRI;
2013-09-05 20:30:04 +04:00
int cpu, ret;
2013-03-12 22:29:57 +04:00
2023-08-08 04:57:23 +03:00
wq - > cpu_pwq = alloc_percpu ( struct pool_workqueue * ) ;
if ( ! wq - > cpu_pwq )
goto enomem ;
2013-03-12 22:29:57 +04:00
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
if ( ! ( wq - > flags & WQ_UNBOUND ) ) {
2013-03-12 22:29:57 +04:00
for_each_possible_cpu ( cpu ) {
2024-02-05 00:28:06 +03:00
struct pool_workqueue * * pwq_p ;
struct worker_pool __percpu * pools ;
struct worker_pool * pool ;
if ( wq - > flags & WQ_BH )
pools = bh_worker_pools ;
else
pools = cpu_worker_pools ;
pool = & ( per_cpu_ptr ( pools , cpu ) [ highpri ] ) ;
pwq_p = per_cpu_ptr ( wq - > cpu_pwq , cpu ) ;
2023-08-08 04:57:23 +03:00
* pwq_p = kmem_cache_alloc_node ( pwq_cache , GFP_KERNEL ,
pool - > node ) ;
if ( ! * pwq_p )
goto enomem ;
2010-07-02 12:03:51 +04:00
2023-08-08 04:57:23 +03:00
init_pwq ( * pwq_p , wq , pool ) ;
2013-04-01 22:23:35 +04:00
mutex_lock ( & wq - > mutex ) ;
2023-08-08 04:57:23 +03:00
link_pwq ( * pwq_p ) ;
2013-04-01 22:23:35 +04:00
mutex_unlock ( & wq - > mutex ) ;
2013-03-12 22:29:57 +04:00
}
2013-03-12 22:30:04 +04:00
return 0 ;
2019-09-06 04:40:23 +03:00
}
2021-08-03 17:16:20 +03:00
cpus_read_lock ( ) ;
2019-09-06 04:40:23 +03:00
if ( wq - > flags & __WQ_ORDERED ) {
2024-01-29 21:11:24 +03:00
struct pool_workqueue * dfl_pwq ;
2013-09-05 20:30:04 +04:00
ret = apply_workqueue_attrs ( wq , ordered_wq_attrs [ highpri ] ) ;
/* there should only be a single pwq for ordering guarantee */
2024-01-29 21:11:24 +03:00
dfl_pwq = rcu_access_pointer ( wq - > dfl_pwq ) ;
WARN ( ! ret & & ( wq - > pwqs . next ! = & dfl_pwq - > pwqs_node | |
wq - > pwqs . prev ! = & dfl_pwq - > pwqs_node ) ,
2013-09-05 20:30:04 +04:00
" ordering guarantee broken for workqueue %s \n " , wq - > name ) ;
2013-03-12 22:29:57 +04:00
} else {
2019-09-06 04:40:23 +03:00
ret = apply_workqueue_attrs ( wq , unbound_std_wq_attrs [ highpri ] ) ;
2013-03-12 22:29:57 +04:00
}
2021-08-03 17:16:20 +03:00
cpus_read_unlock ( ) ;
2019-09-06 04:40:23 +03:00
2023-09-20 09:07:04 +03:00
/* for unbound pwq, flushing the pwq_release_worker ensures that
* pwq_release_workfn ( ) completes before kfree ( wq ) is called .
*/
if ( ret )
kthread_flush_worker ( pwq_release_worker ) ;
2019-09-06 04:40:23 +03:00
return ret ;
2023-08-08 04:57:23 +03:00
enomem :
if ( wq - > cpu_pwq ) {
2023-10-11 11:27:59 +03:00
for_each_possible_cpu ( cpu ) {
struct pool_workqueue * pwq = * per_cpu_ptr ( wq - > cpu_pwq , cpu ) ;
if ( pwq )
kmem_cache_free ( pwq_cache , pwq ) ;
}
2023-08-08 04:57:23 +03:00
free_percpu ( wq - > cpu_pwq ) ;
wq - > cpu_pwq = NULL ;
}
return - ENOMEM ;
2010-06-29 12:07:11 +04:00
}
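The enomem path above unwinds a partially populated per-CPU table. A minimal userspace sketch of the same allocate-or-unwind pattern follows. It is not kernel code: calloc() and malloc() stand in for alloc_percpu() and kmem_cache_alloc_node(), and struct sketch_pwq, sketch_alloc_pwqs(), sketch_free_pwqs() and the fail_at failure-injection parameter are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pool_workqueue. */
struct sketch_pwq {
	int cpu;
};

/*
 * Allocate one entry per CPU. Allocation for CPU @fail_at is forced to
 * fail (pass a negative value for no injected failure). On any failure,
 * everything allocated so far is released and NULL is returned.
 */
static struct sketch_pwq **sketch_alloc_pwqs(int nr_cpus, int fail_at)
{
	struct sketch_pwq **tbl;
	int cpu;

	assert(nr_cpus > 0);

	/* zero-filled table, so slots not yet populated read as NULL */
	tbl = calloc(nr_cpus, sizeof(*tbl));
	if (!tbl)
		goto enomem;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		tbl[cpu] = (cpu == fail_at) ? NULL : malloc(sizeof(**tbl));
		if (!tbl[cpu])
			goto enomem;
		tbl[cpu]->cpu = cpu;
	}
	return tbl;

enomem:
	if (tbl) {
		/* free(NULL) is a no-op, so untouched slots are safe to pass */
		for (cpu = 0; cpu < nr_cpus; cpu++)
			free(tbl[cpu]);
		free(tbl);
	}
	return NULL;
}

/* Release a fully populated table returned by sketch_alloc_pwqs(). */
static void sketch_free_pwqs(struct sketch_pwq **tbl, int nr_cpus)
{
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		free(tbl[cpu]);
	free(tbl);
}
```

The unwind loop walks every slot rather than only the ones filled before the failure. That works because the table starts out zeroed, which is the same property the NULL check in the kernel loop relies on.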
2010-07-02 12:03:51 +04:00
static int wq_clamp_max_active ( int max_active , unsigned int flags ,
const char * name )
2010-06-29 12:07:14 +04:00
{
2023-08-08 04:57:23 +03:00
if ( max_active < 1 | | max_active > WQ_MAX_ACTIVE )
2012-08-19 01:52:42 +04:00
pr_warn ( " workqueue: max_active %d requested for %s is out of range, clamping between %d and %d \n " ,
2023-08-08 04:57:23 +03:00
max_active , name , 1 , WQ_MAX_ACTIVE ) ;
2010-06-29 12:07:14 +04:00
2023-08-08 04:57:23 +03:00
return clamp_val ( max_active , 1 , WQ_MAX_ACTIVE ) ;
2010-06-29 12:07:14 +04:00
}
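A userspace sketch of the warn-then-clamp behavior of wq_clamp_max_active() follows. It is an illustration, not the kernel implementation: SKETCH_WQ_MAX_ACTIVE is an assumed value for the upper bound (the real WQ_MAX_ACTIVE is defined in include/linux/workqueue.h and may differ), fprintf() stands in for pr_warn(), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed upper bound for this sketch only; see include/linux/workqueue.h. */
#define SKETCH_WQ_MAX_ACTIVE 512

/* Warn about an out-of-range request, then clamp it to [1, SKETCH_WQ_MAX_ACTIVE]. */
static int sketch_clamp_max_active(int max_active, const char *name)
{
	assert(name != NULL);

	if (max_active < 1 || max_active > SKETCH_WQ_MAX_ACTIVE)
		fprintf(stderr,
			"workqueue: max_active %d requested for %s is out of range, clamping between %d and %d\n",
			max_active, name, 1, SKETCH_WQ_MAX_ACTIVE);

	if (max_active < 1)
		return 1;
	if (max_active > SKETCH_WQ_MAX_ACTIVE)
		return SKETCH_WQ_MAX_ACTIVE;
	return max_active;
}
```

In-range requests pass through unchanged and produce no warning. Out-of-range requests are still honored, but at the nearest bound.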
2018-01-08 16:38:32 +03:00
/*
* Workqueues which may be used during memory reclaim should have a rescuer
* to guarantee forward progress .
*/
static int init_rescuer ( struct workqueue_struct * wq )
{
struct worker * rescuer ;
2020-05-08 18:07:40 +03:00
int ret ;
2018-01-08 16:38:32 +03:00
if ( ! ( wq - > flags & WQ_MEM_RECLAIM ) )
return 0 ;
rescuer = alloc_worker ( NUMA_NO_NODE ) ;
2023-03-07 15:53:34 +03:00
if ( ! rescuer ) {
pr_err ( " workqueue: Failed to allocate a rescuer for wq \" %s \" \n " ,
wq - > name ) ;
2018-01-08 16:38:32 +03:00
return - ENOMEM ;
2023-03-07 15:53:34 +03:00
}
2018-01-08 16:38:32 +03:00
rescuer - > rescue_wq = wq ;
2023-08-08 15:03:29 +03:00
rescuer - > task = kthread_create ( rescuer_thread , rescuer , " kworker/R-%s " , wq - > name ) ;
2020-04-29 07:04:13 +03:00
if ( IS_ERR ( rescuer - > task ) ) {
2020-05-08 18:07:40 +03:00
ret = PTR_ERR ( rescuer - > task ) ;
2023-03-07 15:53:34 +03:00
pr_err ( " workqueue: Failed to create a rescuer kthread for wq \" %s \" : %pe " ,
wq - > name , ERR_PTR ( ret ) ) ;
2018-01-08 16:38:32 +03:00
kfree ( rescuer ) ;
2020-05-08 18:07:40 +03:00
return ret ;
2018-01-08 16:38:32 +03:00
}
wq - > rescuer = rescuer ;
2024-01-16 19:19:27 +03:00
if ( wq - > flags & WQ_UNBOUND )
2024-02-08 19:10:14 +03:00
kthread_bind_mask ( rescuer - > task , wq_unbound_cpumask ) ;
2024-01-16 19:19:27 +03:00
else
kthread_bind_mask ( rescuer - > task , cpu_possible_mask ) ;
2018-01-08 16:38:32 +03:00
wake_up_process ( rescuer - > task ) ;
return 0 ;
}
2024-01-29 21:11:24 +03:00
/**
* wq_adjust_max_active - update a wq ' s max_active to the current setting
* @ wq : target workqueue
*
* If @ wq isn ' t freezing , set @ wq - > max_active to the saved_max_active and
* activate inactive work items accordingly . If @ wq is freezing , clear
* @ wq - > max_active to zero .
*/
static void wq_adjust_max_active ( struct workqueue_struct * wq )
{
2024-01-29 21:11:24 +03:00
bool activated ;
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
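The multiplication described above can be made concrete with a small worked calculation. This is only an arithmetic sketch: the function name is hypothetical, and it models the worst case in which every pwq is saturated at the same time.

```c
#include <assert.h>

/*
 * Worst-case number of concurrently active work items for one workqueue
 * when max_active is enforced independently by each pwq.
 */
static long sketch_effective_max_active(int max_active, int nr_pwqs)
{
	assert(max_active >= 0 && nr_pwqs >= 0);
	return (long)max_active * nr_pwqs;
}
```

For example, with max_active == 16 on a machine with 2 NUMA nodes and 64 CPUs, per-node pwqs allow up to 16 * 2 = 32 active work items, while per-cpu pwqs allow up to 16 * 64 = 1024.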
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time, i.e. the acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice as many
online effective CPUs will get twice as large a portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
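The per-node distribution described in the design points above can be sketched as follows. This is a simplified userspace model under stated assumptions, not the kernel's implementation: the function names are hypothetical, rounding the proportional share up is an assumption, and SKETCH_DFL_MIN_ACTIVE takes the value 8 that the text gives for WQ_DFL_MIN_ACTIVE.

```c
#include <assert.h>

/* The text above states that WQ_DFL_MIN_ACTIVE is 8. */
#define SKETCH_DFL_MIN_ACTIVE 8

/* min_active is the smaller of max_active and the default minimum. */
static int sketch_min_active(int max_active)
{
	return max_active < SKETCH_DFL_MIN_ACTIVE ?
		max_active : SKETCH_DFL_MIN_ACTIVE;
}

/*
 * Share of @max_active given to a node holding @node_cpus of the
 * workqueue's @total_cpus online effective CPUs, never below min_active
 * and never above max_active.
 */
static int sketch_node_max_active(int max_active, int node_cpus,
				  int total_cpus)
{
	int min_active = sketch_min_active(max_active);
	int share;

	assert(max_active >= 1);
	assert(total_cpus > 0 && node_cpus >= 0 && node_cpus <= total_cpus);

	/* proportional share; rounding up is an assumption of this sketch */
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;

	if (share < min_active)
		share = min_active;
	if (share > max_active)
		share = max_active;
	return share;
}
```

With max_active == 32 and nodes of 16 and 8 CPUs, the shares come out as 22 and 11. With max_active == 16 spread over eight nodes of 8 CPUs each, the proportional share would be 2 per node, but the min_active floor raises each to 8, for a total of 64. This is the "higher than configured" effect the text mentions.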
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
int new_max , new_min ;
2024-01-29 21:11:24 +03:00
lockdep_assert_held ( & wq - > mutex ) ;
if ( ( wq - > flags & WQ_FREEZABLE ) & & workqueue_freezing ) {
2024-01-29 21:11:25 +03:00
new_max = 0 ;
new_min = 0 ;
} else {
new_max = wq - > saved_max_active ;
new_min = wq - > saved_min_active ;
2024-01-29 21:11:24 +03:00
}
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users that self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases that actually depend on max_active to
keep the level of concurrency from going bonkers, including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A single CPU may issue most of the in-flight IOs, so we
don't want to set max_active too low, but as soon as we increase max_active a
bit, we can end up with an unreasonable number of in-flight work items when
many CPUs issue IOs at the same time, i.e. the acceptable lowest max_active is
higher than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we can no longer avoid decoupling max_active enforcement from
pool boundaries. This patch implements a system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. a node with twice as many
online effective CPUs will get twice the portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to a higher effective max_active than configured and also to
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and a work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
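The per-node split described in the first two items above can be sketched roughly as follows. This is a simplified illustration, not the code in this patch: node_max_active() and clamp_int() are hypothetical names, rounding the proportional share up is an assumption, and so is falling back to min_active when the workqueue has no online effective CPUs.

```c
#include <assert.h>

/* Hypothetical helper for this sketch: clamp v into [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Hypothetical illustration of splitting a workqueue's max_active across
 * NUMA nodes. node_cpus is the number of the workqueue's online effective
 * CPUs on one node and total_cpus is the same count summed over all nodes.
 * The caller is expected to pass min_active <= max_active, e.g. the smaller
 * of max_active and 8.
 */
static int node_max_active(int max_active, int min_active,
			   int node_cpus, int total_cpus)
{
	int share;

	assert(min_active <= max_active);

	/* Assumption: with no online effective CPUs, use the floor. */
	if (total_cpus <= 0)
		return min_active;

	/* Proportional share of max_active, rounded up (an assumption). */
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;

	/* min_active guarantees a minimum level of concurrency per node. */
	return clamp_int(share, min_active, max_active);
}
```

For example, with max_active 16, min_active 8 and two nodes holding 4 and 12 of the 16 online effective CPUs, the sketch gives the nodes limits of 8 and 12. The sum is 20, which shows how the min_active floor can raise the effective system-wide limit above the configured value.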
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
if ( wq - > max_active = = new_max & & wq - > min_active = = new_min )
2024-01-29 21:11:24 +03:00
return ;
/*
2024-01-29 21:11:25 +03:00
* Update @ wq - > max / min_active and then kick inactive work items if more
2024-01-29 21:11:24 +03:00
* active work items are allowed . This doesn ' t break work item ordering
* because new work items are always queued behind existing inactive
* work items if there are any .
*/
2024-01-29 21:11:25 +03:00
WRITE_ONCE ( wq - > max_active , new_max ) ;
WRITE_ONCE ( wq - > min_active , new_min ) ;
if ( wq - > flags & WQ_UNBOUND )
wq_update_node_max_active ( wq , - 1 ) ;
if ( new_max = = 0 )
return ;
2024-01-29 21:11:24 +03:00
2024-01-29 21:11:24 +03:00
/*
* Round - robin through pwq ' s activating the first inactive work item
* until max_active is filled .
*/
do {
struct pool_workqueue * pwq ;
2024-01-29 21:11:24 +03:00
2024-01-29 21:11:24 +03:00
activated = false ;
for_each_pwq ( pwq , wq ) {
2024-02-21 08:36:14 +03:00
unsigned long irq_flags ;
2024-01-29 21:11:24 +03:00
2024-01-29 21:11:24 +03:00
/* can be called during early boot w/ irq disabled */
2024-02-21 08:36:14 +03:00
raw_spin_lock_irqsave ( & pwq - > pool - > lock , irq_flags ) ;
2024-01-29 21:11:25 +03:00
if ( pwq_activate_first_inactive ( pwq , true ) ) {
2024-01-29 21:11:24 +03:00
activated = true ;
kick_pool ( pwq - > pool ) ;
}
2024-02-21 08:36:14 +03:00
raw_spin_unlock_irqrestore ( & pwq - > pool - > lock , irq_flags ) ;
2024-01-29 21:11:24 +03:00
}
} while ( activated ) ;
2024-01-29 21:11:24 +03:00
}
2019-03-12 23:21:26 +03:00
__printf ( 1 , 4 )
2019-02-15 02:00:54 +03:00
struct workqueue_struct * alloc_workqueue ( const char * fmt ,
unsigned int flags ,
int max_active , . . . )
2005-04-17 02:20:36 +04:00
{
2013-04-01 22:23:34 +04:00
va_list args ;
2005-04-17 02:20:36 +04:00
struct workqueue_struct * wq ;
2024-01-29 21:11:24 +03:00
size_t wq_size ;
int name_len ;
2012-01-11 03:11:35 +04:00
2024-02-05 00:28:06 +03:00
if ( flags & WQ_BH ) {
if ( WARN_ON_ONCE ( flags & ~ __WQ_BH_ALLOWS ) )
return NULL ;
if ( WARN_ON_ONCE ( max_active ) )
return NULL ;
}
2013-04-08 15:15:40 +04:00
/* see the comment above the definition of WQ_POWER_EFFICIENT */
if ( ( flags & WQ_POWER_EFFICIENT ) & & wq_power_efficient )
flags | = WQ_UNBOUND ;
2013-04-01 22:23:34 +04:00
/* allocate wq and format name */
2024-01-29 21:11:24 +03:00
if ( flags & WQ_UNBOUND )
wq_size = struct_size ( wq , node_nr_active , nr_node_ids + 1 ) ;
else
wq_size = sizeof ( * wq ) ;
wq = kzalloc ( wq_size , GFP_KERNEL ) ;
2012-01-11 03:11:35 +04:00
if ( ! wq )
2013-03-12 22:30:04 +04:00
return NULL ;
2012-01-11 03:11:35 +04:00
2013-04-01 22:23:34 +04:00
if ( flags & WQ_UNBOUND ) {
2019-06-26 17:52:38 +03:00
wq - > unbound_attrs = alloc_workqueue_attrs ( ) ;
2013-04-01 22:23:34 +04:00
if ( ! wq - > unbound_attrs )
goto err_free_wq ;
}
2019-02-15 02:00:54 +03:00
va_start ( args , max_active ) ;
2024-01-29 21:11:24 +03:00
name_len = vsnprintf ( wq - > name , sizeof ( wq - > name ) , fmt , args ) ;
2012-01-11 03:11:35 +04:00
va_end ( args ) ;
2005-04-17 02:20:36 +04:00
2024-01-29 21:11:24 +03:00
if ( name_len > = WQ_NAME_LEN )
pr_warn_once ( " workqueue: name exceeds WQ_NAME_LEN. Truncating to: %s \n " ,
wq - > name ) ;
2024-01-15 20:08:22 +03:00
2024-02-05 00:28:06 +03:00
if ( flags & WQ_BH ) {
/*
* BH workqueues always share a single execution context per CPU
* and don ' t impose any max_active limit .
*/
max_active = INT_MAX ;
} else {
max_active = max_active ? : WQ_DFL_ACTIVE ;
max_active = wq_clamp_max_active ( max_active , flags , wq - > name ) ;
}
2007-05-09 13:34:09 +04:00
2012-01-11 03:11:35 +04:00
/* init wq */
2010-06-29 12:07:10 +04:00
wq - > flags = flags ;
2024-01-29 21:11:24 +03:00
wq - > max_active = max_active ;
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulates concurrency level which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. ie. The acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- One max_active enforcement decouples from pool boundaires, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice more
online effective CPUs will get twice higher portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_weight than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
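The per-node split described in the message above can be sketched as follows. This is a minimal illustration under stated assumptions, not the kernel's implementation: the `sketch_*` names and `SKETCH_DFL_MIN_ACTIVE` are hypothetical, and rounding the proportional share up is an assumption that the message itself does not specify.

```c
#include <assert.h>

/* Hypothetical stand-in for the WQ_DFL_MIN_ACTIVE value (8) quoted above. */
#define SKETCH_DFL_MIN_ACTIVE 8

/* Clamp v into the inclusive range [lo, hi]. */
static int sketch_clamp(int v, int lo, int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/* min_active is the smaller of max_active and the default floor. */
static int sketch_min_active(int max_active)
{
	return max_active < SKETCH_DFL_MIN_ACTIVE ?
		max_active : SKETCH_DFL_MIN_ACTIVE;
}

/*
 * Share of max_active for one node: proportional to the node's online
 * effective CPUs (rounded up, an assumption), never below min_active
 * and never above max_active.
 */
static int sketch_node_max_active(int max_active, int node_cpus,
				  int total_cpus)
{
	int min_active = sketch_min_active(max_active);
	int share;

	assert(max_active > 0 && node_cpus >= 0 && total_cpus > 0);
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;
	return sketch_clamp(share, min_active, max_active);
}
```

With max_active of 16 and two nodes holding 8 and 4 of the workqueue's 12 online effective CPUs, the sketch yields 11 and 8. The sum of 19 exceeds the configured 16, which is the "higher effective max_active than configured" case the message warns about.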
2024-01-29 21:11:25 +03:00
wq - > min_active = min ( max_active , WQ_DFL_MIN_ACTIVE ) ;
wq - > saved_max_active = wq - > max_active ;
wq - > saved_min_active = wq - > min_active ;
2013-03-26 03:57:17 +04:00
mutex_init ( & wq - > mutex ) ;
2013-02-14 07:29:12 +04:00
atomic_set ( & wq - > nr_pwqs_to_flush , 0 ) ;
2013-03-12 22:29:57 +04:00
INIT_LIST_HEAD ( & wq - > pwqs ) ;
2010-06-29 12:07:11 +04:00
INIT_LIST_HEAD ( & wq - > flusher_queue ) ;
INIT_LIST_HEAD ( & wq - > flusher_overflow ) ;
2013-03-12 22:29:59 +04:00
INIT_LIST_HEAD ( & wq - > maydays ) ;
2010-06-29 12:07:13 +04:00
2019-02-15 02:00:54 +03:00
wq_init_lockdep ( wq ) ;
2007-05-09 13:34:13 +04:00
INIT_LIST_HEAD ( & wq - > list ) ;
2007-05-09 13:34:09 +04:00
2024-01-29 21:11:24 +03:00
if ( flags & WQ_UNBOUND ) {
if ( alloc_node_nr_active ( wq - > node_nr_active ) < 0 )
goto err_unreg_lockdep ;
}
2013-03-12 22:29:57 +04:00
if ( alloc_and_link_pwqs ( wq ) < 0 )
2024-01-29 21:11:24 +03:00
goto err_free_node_nr_active ;
2010-06-29 12:07:11 +04:00
2018-01-08 16:38:37 +03:00
if ( wq_online & & init_rescuer ( wq ) < 0 )
2018-01-08 16:38:32 +03:00
goto err_destroy ;
2007-05-09 13:34:09 +04:00
2013-03-12 22:30:05 +04:00
if ( ( wq - > flags & WQ_SYSFS ) & & workqueue_sysfs_register ( wq ) )
goto err_destroy ;
2010-06-29 12:07:12 +04:00
/*
2013-03-26 03:57:17 +04:00
* wq_pool_mutex protects global freeze state and workqueues list .
* Grab it , adjust max_active and add the new @ wq to workqueues
* list .
2010-06-29 12:07:12 +04:00
*/
2013-03-26 03:57:17 +04:00
mutex_lock ( & wq_pool_mutex ) ;
2010-06-29 12:07:12 +04:00
2013-03-26 03:57:19 +04:00
mutex_lock ( & wq - > mutex ) ;
2024-01-29 21:11:24 +03:00
wq_adjust_max_active ( wq ) ;
2013-03-26 03:57:19 +04:00
mutex_unlock ( & wq - > mutex ) ;
2010-06-29 12:07:12 +04:00
2015-03-09 16:22:28 +03:00
list_add_tail_rcu ( & wq - > list , & workqueues ) ;
2010-06-29 12:07:12 +04:00
2013-03-26 03:57:17 +04:00
mutex_unlock ( & wq_pool_mutex ) ;
2010-06-29 12:07:11 +04:00
2007-05-09 13:34:09 +04:00
return wq ;
2013-03-12 22:30:04 +04:00
2024-01-29 21:11:24 +03:00
err_free_node_nr_active :
if ( wq - > flags & WQ_UNBOUND )
free_node_nr_active ( wq - > node_nr_active ) ;
2019-03-12 02:02:55 +03:00
err_unreg_lockdep :
2019-03-04 01:00:46 +03:00
wq_unregister_lockdep ( wq ) ;
wq_free_lockdep ( wq ) ;
2019-03-12 02:02:55 +03:00
err_free_wq :
2013-04-01 22:23:34 +04:00
free_workqueue_attrs ( wq - > unbound_attrs ) ;
2013-03-12 22:30:04 +04:00
kfree ( wq ) ;
return NULL ;
err_destroy :
destroy_workqueue ( wq ) ;
2010-06-29 12:07:10 +04:00
return NULL ;
2007-05-09 13:34:09 +04:00
}
2019-02-15 02:00:54 +03:00
EXPORT_SYMBOL_GPL ( alloc_workqueue ) ;
2005-04-17 02:20:36 +04:00
2019-09-23 21:08:58 +03:00
static bool pwq_busy ( struct pool_workqueue * pwq )
{
int i ;
for ( i = 0 ; i < WORK_NR_COLORS ; i + + )
if ( pwq - > nr_in_flight [ i ] )
return true ;
2024-01-29 21:11:24 +03:00
if ( ( pwq ! = rcu_access_pointer ( pwq - > wq - > dfl_pwq ) ) & & ( pwq - > refcnt > 1 ) )
2019-09-23 21:08:58 +03:00
return true ;
2024-01-29 21:11:24 +03:00
if ( ! pwq_is_empty ( pwq ) )
2019-09-23 21:08:58 +03:00
return true ;
return false ;
}
2007-05-09 13:34:09 +04:00
/**
* destroy_workqueue - safely terminate a workqueue
* @ wq : target workqueue
*
* Safely destroy a workqueue . All work currently pending will be done first .
*/
void destroy_workqueue ( struct workqueue_struct * wq )
{
2013-03-12 22:29:58 +04:00
struct pool_workqueue * pwq ;
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
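Because @max_active is enforced per pwq, the change in total concurrency that the message above mentions can be estimated with simple arithmetic. The sketch below is illustrative only: the `sketch_*` names are hypothetical, and it assumes exactly one pwq per NUMA node before the change and one per CPU after it, ignoring ordered workqueues.

```c
#include <assert.h>

/* Hypothetical labels for where the per-pwq limit is applied. */
enum sketch_pwq_scope {
	SKETCH_PWQ_PER_NODE,	/* unbound pwq shared within a NUMA node */
	SKETCH_PWQ_PER_CPU,	/* one pwq per CPU */
};

/*
 * Upper bound on concurrently active work items for one workqueue:
 * max_active multiplied by the number of pwqs enforcing it.
 */
static long sketch_total_limit(int max_active, enum sketch_pwq_scope scope,
			       int nr_nodes, int nr_cpus)
{
	assert(max_active > 0 && nr_nodes > 0 && nr_cpus >= nr_nodes);
	return (long)max_active *
		(scope == SKETCH_PWQ_PER_NODE ? nr_nodes : nr_cpus);
}
```

On a hypothetical machine with 2 nodes and 64 CPUs, max_active of 16 bounds the total at 32 with per-node pwqs and at 1024 with per-cpu pwqs.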
2023-08-08 04:57:23 +03:00
int cpu ;
2007-05-09 13:34:09 +04:00
2019-09-19 04:43:40 +03:00
/*
* Remove it from sysfs first so that sanity check failure doesn ' t
* lead to sysfs name conflicts .
*/
workqueue_sysfs_unregister ( wq ) ;
2022-12-13 07:39:36 +03:00
/* mark that workqueue destruction is in progress */
mutex_lock ( & wq - > mutex ) ;
wq - > flags | = __WQ_DESTROYING ;
mutex_unlock ( & wq - > mutex ) ;
2011-04-05 20:01:44 +04:00
/* drain it before proceeding with destruction */
drain_workqueue ( wq ) ;
2010-12-20 21:32:04 +03:00
2019-09-19 04:43:40 +03:00
/* kill rescuer, if sanity checks fail, leave it w/o rescuer */
if ( wq - > rescuer ) {
struct worker * rescuer = wq - > rescuer ;
/* this prevents new queueing */
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & wq_mayday_lock ) ;
2019-09-19 04:43:40 +03:00
wq - > rescuer = NULL ;
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & wq_mayday_lock ) ;
2019-09-19 04:43:40 +03:00
/* rescuer will empty maydays list before exiting */
kthread_stop ( rescuer - > task ) ;
2019-09-20 23:39:57 +03:00
kfree ( rescuer ) ;
2019-09-19 04:43:40 +03:00
}
2019-09-23 21:08:58 +03:00
/*
* Sanity checks - grab all the locks so that we wait for all
* in - flight operations which may do put_pwq ( ) .
*/
mutex_lock ( & wq_pool_mutex ) ;
2013-03-26 03:57:18 +04:00
mutex_lock ( & wq - > mutex ) ;
2013-03-12 22:29:58 +04:00
for_each_pwq ( pwq , wq ) {
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pwq - > pool - > lock ) ;
2019-09-23 21:08:58 +03:00
if ( WARN_ON ( pwq_busy ( pwq ) ) ) {
2019-11-28 03:47:49 +03:00
pr_warn ( " %s: %s has the following busy pwq \n " ,
__func__ , wq - > name ) ;
2019-09-23 21:08:58 +03:00
show_pwq ( pwq ) ;
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pwq - > pool - > lock ) ;
2013-03-26 03:57:18 +04:00
mutex_unlock ( & wq - > mutex ) ;
2019-09-23 21:08:58 +03:00
mutex_unlock ( & wq_pool_mutex ) ;
2021-10-20 06:09:00 +03:00
show_one_workqueue ( wq ) ;
2013-03-12 22:29:57 +04:00
return ;
2013-03-12 22:30:00 +04:00
}
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pwq - > pool - > lock ) ;
2013-03-12 22:29:57 +04:00
}
2013-03-26 03:57:18 +04:00
mutex_unlock ( & wq - > mutex ) ;
2013-03-12 22:29:57 +04:00
2010-06-29 12:07:12 +04:00
/*
* wq list is used to freeze wq , remove from list after
* flushing is complete in case freeze races us .
*/
2015-03-09 16:22:28 +03:00
list_del_rcu ( & wq - > list ) ;
2013-03-26 03:57:17 +04:00
mutex_unlock ( & wq_pool_mutex ) ;
2007-05-09 13:34:09 +04:00
2023-08-08 04:57:23 +03:00
/*
* We ' re the sole accessor of @ wq . Directly access cpu_pwq and dfl_pwq
* to put the base refs . @ wq will be auto - destroyed from the last
* pwq_put . RCU read lock prevents @ wq from going away from under us .
*/
rcu_read_lock ( ) ;
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
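The node-to-cpumask mapping described in the message above can be sketched with plain bitmasks. This is a simplified illustration, not the kernel code: `sketch_node_pwq_cpumask` is a hypothetical name, CPUs are modelled as bits of a 64-bit word, and the distinction between online and possible CPUs is deliberately ignored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * CPUs that a node's pwq should cover: the intersection of the
 * workqueue's allowed CPUs with the node's CPUs, falling back to the
 * whole allowed mask when the two do not intersect.
 */
static uint64_t sketch_node_pwq_cpumask(uint64_t attrs_mask,
					uint64_t node_mask)
{
	uint64_t both = attrs_mask & node_mask;

	return both ? both : attrs_mask;
}
```

A workqueue allowed on CPUs 0-3 (mask 0x0F) gets mask 0x03 for a node holding CPUs 0-1, and falls back to the full 0x0F for a node holding only CPUs 4-7.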
2013-04-01 22:23:36 +04:00
2023-08-08 04:57:23 +03:00
for_each_possible_cpu ( cpu ) {
2024-01-29 21:11:24 +03:00
put_pwq_unlocked ( unbound_pwq ( wq , cpu ) ) ;
RCU_INIT_POINTER ( * unbound_pwq_slot ( wq , cpu ) , NULL ) ;
2013-03-12 22:30:03 +04:00
}
2023-08-08 04:57:23 +03:00
2024-01-29 21:11:24 +03:00
put_pwq_unlocked ( unbound_pwq ( wq , - 1 ) ) ;
RCU_INIT_POINTER ( * unbound_pwq_slot ( wq , - 1 ) , NULL ) ;
2023-08-08 04:57:23 +03:00
rcu_read_unlock ( ) ;
2007-05-09 13:34:09 +04:00
}
EXPORT_SYMBOL_GPL ( destroy_workqueue ) ;
2010-06-29 12:07:14 +04:00
/**
* workqueue_set_max_active - adjust max_active of a workqueue
* @ wq : target workqueue
* @ max_active : new max_active value .
*
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with an unreasonable number of in-flight work items when many CPUs
issue IOs at the same time, i.e. the acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement is decoupled from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. For example, a node with twice
as many online effective CPUs gets twice the share of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
* Set max_active of @ wq to @ max_active . See the alloc_workqueue ( ) function
* comment .
2010-06-29 12:07:14 +04:00
*
* CONTEXT :
* Don ' t call from IRQ context .
*/
void workqueue_set_max_active ( struct workqueue_struct * wq , int max_active )
{
2024-02-05 00:28:06 +03:00
/* max_active doesn't mean anything for BH workqueues */
if ( WARN_ON ( wq - > flags & WQ_BH ) )
return ;
2013-03-12 22:30:04 +04:00
/* disallow meddling with max_active for ordered workqueues */
2024-02-06 03:19:10 +03:00
if ( WARN_ON ( wq - > flags & __WQ_ORDERED ) )
2013-03-12 22:30:04 +04:00
return ;
2010-07-02 12:03:51 +04:00
max_active = wq_clamp_max_active ( max_active , wq - > flags , wq - > name ) ;
2010-06-29 12:07:14 +04:00
2013-03-26 03:57:19 +04:00
mutex_lock ( & wq - > mutex ) ;
2010-06-29 12:07:14 +04:00
wq - > saved_max_active = max_active ;
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating in 8639ecebc9b1 ("workqueue: Implement non-strict
affinity scope for unbound workqueues") implemented a more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast
majority; however, there are enough use cases which actually depend on
max_active to prevent the level of concurrency from going bonkers, including
several IO handling workqueues that can issue a work item for each in-flight
IO. With targeted benchmarks, the misbehavior can easily be exposed as
reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't
want to set max_active too low, but as soon as we increase max_active a bit,
we can end up with an unreasonable number of in-flight work items when many
CPUs issue IOs at the same time, i.e. the acceptable lowest max_active is
higher than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement decouples from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we no longer can avoid decoupling max_active enforcement from
pool boundaries. This patch implements system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice as many
online effective CPUs will get twice as large a portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
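
The per-node split and the min_active floor described in the list above can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the kernel's code: `node_max_active()` and `clamp_int()` are hypothetical names, and rounding the proportional share up is an assumption made for this sketch.

```c
#include <assert.h>

#define WQ_DFL_MIN_ACTIVE 8	/* value quoted in the commit message */

/* Hypothetical helper: clamp @v into [@lo, @hi]. */
static int clamp_int(int v, int lo, int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Hypothetical helper: the share of @max_active that one node gets.
 * @node_cpus:  the workqueue's effective online CPUs on this node
 * @total_cpus: the workqueue's effective online CPUs on all nodes
 * Rounding the share up is an assumption made for this sketch.
 */
static int node_max_active(int max_active, int node_cpus, int total_cpus)
{
	/* min_active is the smaller of max_active and WQ_DFL_MIN_ACTIVE */
	int min_active = max_active < WQ_DFL_MIN_ACTIVE ?
			 max_active : WQ_DFL_MIN_ACTIVE;
	int share;

	assert(total_cpus > 0 && node_cpus >= 0 && node_cpus <= total_cpus);

	/* proportional share, then the per-node floor and ceiling */
	share = (max_active * node_cpus + total_cpus - 1) / total_cpus;
	return clamp_int(share, min_active, max_active);
}
```

With max_active of 32 and three nodes holding 16, 6 and 2 of 24 effective CPUs, this sketch hands out 22, 8 and 8. The sum of 38 exceeds the configured 32, which is the "higher effective max_active than configured" case mentioned above.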
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
if ( wq - > flags & WQ_UNBOUND )
wq - > saved_min_active = min ( wq - > saved_min_active , max_active ) ;
2024-01-29 21:11:24 +03:00
wq_adjust_max_active ( wq ) ;
2009-11-18 01:06:20 +03:00
2013-03-26 03:57:19 +04:00
mutex_unlock ( & wq - > mutex ) ;
2006-01-08 12:00:43 +03:00
}
2010-06-29 12:07:14 +04:00
EXPORT_SYMBOL_GPL ( workqueue_set_max_active ) ;
2006-01-08 12:00:43 +03:00
2024-02-09 03:11:56 +03:00
/**
* workqueue_set_min_active - adjust min_active of an unbound workqueue
* @ wq : target unbound workqueue
* @ min_active : new min_active value
*
* Set min_active of an unbound workqueue . Unlike other types of workqueues , an
* unbound workqueue is not guaranteed to be able to process max_active
* interdependent work items . Instead , an unbound workqueue is guaranteed to be
* able to process min_active number of interdependent work items which is
* % WQ_DFL_MIN_ACTIVE by default .
*
* Use this function to adjust the min_active value between 0 and the current
* max_active .
*/
void workqueue_set_min_active ( struct workqueue_struct * wq , int min_active )
{
/* min_active is only meaningful for non-ordered unbound workqueues */
if ( WARN_ON ( ( wq - > flags & ( WQ_BH | WQ_UNBOUND | __WQ_ORDERED ) ) ! =
WQ_UNBOUND ) )
return ;
mutex_lock ( & wq - > mutex ) ;
wq - > saved_min_active = clamp ( min_active , 0 , wq - > saved_max_active ) ;
wq_adjust_max_active ( wq ) ;
mutex_unlock ( & wq - > mutex ) ;
}
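
How the two setters keep min_active within [0, max_active] can be modeled in a few lines of userspace C. This is a sketch only: `struct wq_limits`, `limits_set_max()` and `limits_set_min()` are hypothetical stand-ins, and locking, the flag checks and wq_adjust_max_active() are left out.

```c
#include <assert.h>

/* Hypothetical stand-in holding only the two fields used below. */
struct wq_limits {
	int saved_max_active;
	int saved_min_active;
};

/* Hypothetical helper: clamp @v into [@lo, @hi]. */
static int clamp_int(int v, int lo, int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/* Models the unbound case of workqueue_set_max_active(). */
static void limits_set_max(struct wq_limits *l, int max_active)
{
	/* the real function clamps max_active first; the sketch requires it */
	assert(max_active >= 1);

	l->saved_max_active = max_active;
	/* lowering max_active drags min_active down with it */
	if (l->saved_min_active > max_active)
		l->saved_min_active = max_active;
}

/* Models workqueue_set_min_active(). */
static void limits_set_min(struct wq_limits *l, int min_active)
{
	l->saved_min_active = clamp_int(min_active, 0, l->saved_max_active);
}
```

Note that in this model, as in the code above, raising max_active later does not restore a min_active that was lowered earlier; the caller has to set it again.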
2018-02-11 12:38:28 +03:00
/**
* current_work - retrieve % current task ' s work struct
*
* Determine if % current task is a workqueue worker and what it ' s working on .
* Useful to find out the context that the % current task is running in .
*
* Return : work struct if % current task is a workqueue worker , % NULL otherwise .
*/
struct work_struct * current_work ( void )
{
struct worker * worker = current_wq_worker ( ) ;
return worker ? worker - > current_work : NULL ;
}
EXPORT_SYMBOL ( current_work ) ;
2013-03-13 04:41:37 +04:00
/**
* current_is_workqueue_rescuer - is % current workqueue rescuer ?
*
* Determine whether % current is a workqueue rescuer . Can be used from
* work functions to determine whether it ' s being run off the rescuer task .
2013-08-01 01:59:24 +04:00
*
* Return : % true if % current is a workqueue rescuer . % false otherwise .
2013-03-13 04:41:37 +04:00
*/
bool current_is_workqueue_rescuer ( void )
{
struct worker * worker = current_wq_worker ( ) ;
2013-03-19 23:28:03 +04:00
return worker & & worker - > rescue_wq ;
2013-03-13 04:41:37 +04:00
}
2010-02-12 11:39:21 +03:00
/**
2010-06-29 12:07:14 +04:00
* workqueue_congested - test whether a workqueue is congested
* @ cpu : CPU in question
* @ wq : target workqueue
2010-02-12 11:39:21 +03:00
*
2010-06-29 12:07:14 +04:00
* Test whether @ wq ' s cpu workqueue for @ cpu is congested . There is
* no synchronization around this function and the test result is
* unreliable and only useful as advisory hints or for debugging .
2010-02-12 11:39:21 +03:00
*
2013-05-10 22:10:17 +04:00
* If @ cpu is WORK_CPU_UNBOUND , the test is performed on the local CPU .
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
*
* With the exception of ordered workqueues , all workqueues have per - cpu
* pool_workqueues , each with its own congested state . A workqueue being
* congested on one CPU doesn ' t mean that the workqueue is congested on any
* other CPUs .
2013-05-10 22:10:17 +04:00
*
2013-08-01 01:59:24 +04:00
* Return :
2010-06-29 12:07:14 +04:00
* % true if congested , % false otherwise .
2010-02-12 11:39:21 +03:00
*/
2013-03-12 22:29:59 +04:00
bool workqueue_congested ( int cpu , struct workqueue_struct * wq )
2005-04-17 02:20:36 +04:00
{
2013-03-12 22:30:00 +04:00
struct pool_workqueue * pwq ;
2013-03-12 22:30:00 +04:00
bool ret ;
2019-03-13 19:55:47 +03:00
rcu_read_lock ( ) ;
preempt_disable ( ) ;
2013-03-12 22:30:00 +04:00
2013-05-10 22:10:17 +04:00
if ( cpu = = WORK_CPU_UNBOUND )
cpu = smp_processor_id ( ) ;
2023-08-08 04:57:23 +03:00
pwq = * per_cpu_ptr ( wq - > cpu_pwq , cpu ) ;
2021-08-17 04:32:34 +03:00
ret = ! list_empty ( & pwq - > inactive_works ) ;
2023-08-08 04:57:23 +03:00
2019-03-13 19:55:47 +03:00
preempt_enable ( ) ;
rcu_read_unlock ( ) ;
2013-03-12 22:30:00 +04:00
return ret ;
2005-04-17 02:20:36 +04:00
}
2010-06-29 12:07:14 +04:00
EXPORT_SYMBOL_GPL ( workqueue_congested ) ;
2005-04-17 02:20:36 +04:00
2010-06-29 12:07:14 +04:00
/**
* work_busy - test whether a work is currently pending or running
* @ work : the work to be tested
*
* Test whether @ work is currently pending or running . There is no
* synchronization around this function and the test result is
* unreliable and only useful as advisory hints or for debugging .
*
2013-08-01 01:59:24 +04:00
* Return :
2010-06-29 12:07:14 +04:00
* OR ' d bitmask of WORK_BUSY_ * bits .
*/
unsigned int work_busy ( struct work_struct * work )
2005-04-17 02:20:36 +04:00
{
2013-03-12 22:30:00 +04:00
struct worker_pool * pool ;
2024-02-21 08:36:14 +03:00
unsigned long irq_flags ;
2010-06-29 12:07:14 +04:00
unsigned int ret = 0 ;
2005-04-17 02:20:36 +04:00
2010-06-29 12:07:14 +04:00
if ( work_pending ( work ) )
ret | = WORK_BUSY_PENDING ;
2005-04-17 02:20:36 +04:00
2019-03-13 19:55:47 +03:00
rcu_read_lock ( ) ;
2013-03-12 22:30:00 +04:00
pool = get_work_pool ( work ) ;
2013-02-07 06:04:53 +04:00
if ( pool ) {
2024-02-21 08:36:14 +03:00
raw_spin_lock_irqsave ( & pool - > lock , irq_flags ) ;
2013-02-07 06:04:53 +04:00
if ( find_worker_executing_work ( pool , work ) )
ret | = WORK_BUSY_RUNNING ;
2024-02-21 08:36:14 +03:00
raw_spin_unlock_irqrestore ( & pool - > lock , irq_flags ) ;
2013-02-07 06:04:53 +04:00
}
2019-03-13 19:55:47 +03:00
rcu_read_unlock ( ) ;
2005-04-17 02:20:36 +04:00
2010-06-29 12:07:14 +04:00
return ret ;
2005-04-17 02:20:36 +04:00
}
2010-06-29 12:07:14 +04:00
EXPORT_SYMBOL_GPL ( work_busy ) ;
2005-04-17 02:20:36 +04:00
2013-05-01 02:27:22 +04:00
/**
* set_worker_desc - set description for the current work item
* @ fmt : printf - style format string
* @ . . . : arguments for the format string
*
* This function can be called by a running work function to describe what
* the work item is about . If the worker task gets dumped , this
* information will be printed out together to help debugging . The
* description can be at most WORKER_DESC_LEN including the trailing ' \0 ' .
*/
void set_worker_desc ( const char * fmt , . . . )
{
struct worker * worker = current_wq_worker ( ) ;
va_list args ;
if ( worker ) {
va_start ( args , fmt ) ;
vsnprintf ( worker - > desc , sizeof ( worker - > desc ) , fmt , args ) ;
va_end ( args ) ;
}
}
2018-05-17 20:14:57 +03:00
EXPORT_SYMBOL_GPL ( set_worker_desc ) ;
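
The truncation behaviour of the description buffer can be tried out in userspace. This is a sketch, assuming a hypothetical `struct fake_worker` and a placeholder size `DESC_LEN` that is not necessarily the kernel's WORKER_DESC_LEN; it only shows that vsnprintf() with sizeof() of the buffer always leaves room for the trailing '\0'.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define DESC_LEN 24	/* placeholder size chosen for this sketch */

/* Hypothetical stand-in for the worker's description field. */
struct fake_worker {
	char desc[DESC_LEN];
};

/* Formats into the fixed buffer; overlong output is cut at DESC_LEN - 1. */
static void fake_set_desc(struct fake_worker *w, const char *fmt, ...)
{
	va_list args;

	assert(w && fmt);

	va_start(args, fmt);
	vsnprintf(w->desc, sizeof(w->desc), fmt, args);
	va_end(args);
}
```

A description longer than the buffer is silently cut, so callers should put the most identifying part first.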
2013-05-01 02:27:22 +04:00
/**
* print_worker_info - print out worker information and description
* @ log_lvl : the log level to use when printing
* @ task : target task
*
* If @ task is a worker and currently executing a work item , print out the
* name of the workqueue being serviced and worker description set with
* set_worker_desc ( ) by the currently executing work item .
*
* This function can be safely called on any task as long as the
* task_struct itself is accessible . While safe , this function isn ' t
* synchronized and may print out mixed - up or garbage output of limited length .
*/
void print_worker_info ( const char * log_lvl , struct task_struct * task )
{
work_func_t * fn = NULL ;
char name [ WQ_NAME_LEN ] = { } ;
char desc [ WORKER_DESC_LEN ] = { } ;
struct pool_workqueue * pwq = NULL ;
struct workqueue_struct * wq = NULL ;
struct worker * worker ;
if ( ! ( task - > flags & PF_WQ_WORKER ) )
return ;
/*
* This function is called without any synchronization and @ task
* could be in any state . Be careful with dereferences .
*/
2016-10-11 23:55:17 +03:00
worker = kthread_probe_data ( task ) ;
2013-05-01 02:27:22 +04:00
/*
2018-05-18 18:47:13 +03:00
* Carefully copy the associated workqueue ' s workfn , name and desc .
* Keep the original last ' \0 ' in case the original is garbage .
2013-05-01 02:27:22 +04:00
*/
2020-06-17 10:37:53 +03:00
copy_from_kernel_nofault ( & fn , & worker - > current_func , sizeof ( fn ) ) ;
copy_from_kernel_nofault ( & pwq , & worker - > current_pwq , sizeof ( pwq ) ) ;
copy_from_kernel_nofault ( & wq , & pwq - > wq , sizeof ( wq ) ) ;
copy_from_kernel_nofault ( name , wq - > name , sizeof ( name ) - 1 ) ;
copy_from_kernel_nofault ( desc , worker - > desc , sizeof ( desc ) - 1 ) ;
2013-05-01 02:27:22 +04:00
if ( fn | | name [ 0 ] | | desc [ 0 ] ) {
2019-03-25 22:32:28 +03:00
printk ( " %sWorkqueue: %s %ps " , log_lvl , name , fn ) ;
2018-05-18 18:47:13 +03:00
if ( strcmp ( name , desc ) )
2013-05-01 02:27:22 +04:00
pr_cont ( " (%s) " , desc ) ;
pr_cont ( " \n " ) ;
}
}
workqueue: dump workqueues on sysrq-t
Workqueues are used extensively throughout the kernel but sometimes
it's difficult to debug stalls involving work items because visibility
into its inner workings is fairly limited. Although sysrq-t task dump
annotates each active worker task with the information on the work
item being executed, it is challenging to find out which work items
are pending or delayed on which queues and how pools are being
managed.
This patch implements show_workqueue_state() which dumps all busy
workqueues and pools and is called from the sysrq-t handler. At the
end of sysrq-t dump, something like the following is printed.
Showing busy workqueues and worker pools:
...
workqueue filler_wq: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
in-flight: 491:filler_workfn, 507:filler_workfn
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
in-flight: 501:filler_workfn
pending: filler_workfn
...
workqueue test_wq: flags=0x8
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
delayed: test_workfn1 BAR(492), test_workfn2
...
pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62
The above shows that test_wq is executing test_workfn() on pid 510
which is the rescuer and also that there are two tasks 69 and 500
waiting for the work item to finish in flush_work(). As test_wq has
max_active of 1, there are two work items for test_workfn1() and
test_workfn2() which are delayed till the current work item is
finished. In addition, pid 492 is flushing test_workfn1().
The work item for test_workfn() is being executed on pwq of pool 2
which is the normal priority per-cpu pool for CPU 1. The pool has
three workers, two of which are executing filler_workfn() for
filler_wq and the last one is assuming the manager role trying to
create more workers.
This extra workqueue state dump will hopefully help chasing down hangs
involving workqueues.
v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.
v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
printk()'s replaced with pr_info()'s, and cpumask printing now
uses cpulist_pr_cont().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
2015-03-09 16:22:28 +03:00
static void pr_cont_pool_info ( struct worker_pool * pool )
{
pr_cont ( " cpus=%*pbl " , nr_cpumask_bits , pool - > attrs - > cpumask ) ;
if ( pool - > node ! = NUMA_NO_NODE )
pr_cont ( " node=%d " , pool - > node ) ;
2024-02-05 00:28:06 +03:00
pr_cont ( " flags=0x%x " , pool - > flags ) ;
if ( pool - > flags & POOL_BH )
pr_cont ( " bh%s " ,
pool - > attrs - > nice = = HIGHPRI_NICE_LEVEL ? " -hi " : " " ) ;
else
pr_cont ( " nice=%d " , pool - > attrs - > nice ) ;
}
static void pr_cont_worker_id ( struct worker * worker )
{
struct worker_pool * pool = worker - > pool ;
if ( pool - > flags & POOL_BH )
pr_cont ( " bh%s " ,
pool - > attrs - > nice = = HIGHPRI_NICE_LEVEL ? " -hi " : " " ) ;
else
pr_cont ( " %d%s " , task_pid_nr ( worker - > task ) ,
worker - > rescue_wq ? " (RESCUER) " : " " ) ;
2015-03-09 16:22:28 +03:00
}
workqueue: Make show_pwq() use run-length encoding
The show_pwq() function dumps out a pool_workqueue structure's activity,
including the pending work-queue handlers:
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
pending: test_work_func, test_work_func, test_work_func1, test_work_func1, test_work_func1, test_work_func1, test_work_func1
When large systems are facing certain types of hang conditions, it is not
unusual for this "pending" list to contain runs of hundreds of identical
function names. This "wall of text" is difficult to read, and worse yet,
it can be interleaved with other output such as stack traces.
Therefore, make show_pwq() use run-length encoding so that the above
printout instead looks like this:
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
pending: 2*test_work_func, 5*test_work_func1
When no comma would be printed, including the WORK_STRUCT_LINKED case,
a new run is started unconditionally.
This output is more readable, places less stress on the hardware,
firmware, and software on the console-log path, and reduces interference
with other output.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-07 03:10:24 +03:00
struct pr_cont_work_struct {
	bool comma;
	work_func_t func;
	long ctr;
};
static void pr_cont_work_flush(bool comma, work_func_t func, struct pr_cont_work_struct *pcwsp)
{
	if (!pcwsp->ctr)
		goto out_record;
	if (func == pcwsp->func) {
		pcwsp->ctr++;
		return;
	}
	if (pcwsp->ctr == 1)
		pr_cont("%s %ps", pcwsp->comma ? "," : "", pcwsp->func);
	else
		pr_cont("%s %ld*%ps", pcwsp->comma ? "," : "", pcwsp->ctr, pcwsp->func);
	pcwsp->ctr = 0;
out_record:
	if ((long)func == -1L)
		return;
	pcwsp->comma = comma;
	pcwsp->func = func;
	pcwsp->ctr = 1;
}
static void pr_cont_work(bool comma, struct work_struct *work, struct pr_cont_work_struct *pcwsp)
workqueue: dump workqueues on sysrq-t
Workqueues are used extensively throughout the kernel but sometimes
it's difficult to debug stalls involving work items because visibility
into their inner workings is fairly limited. Although sysrq-t task dump
annotates each active worker task with the information on the work
item being executed, it is challenging to find out which work items
are pending or delayed on which queues and how pools are being
managed.
This patch implements show_workqueue_state() which dumps all busy
workqueues and pools and is called from the sysrq-t handler. At the
end of sysrq-t dump, something like the following is printed.
Showing busy workqueues and worker pools:
...
workqueue filler_wq: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
in-flight: 491:filler_workfn, 507:filler_workfn
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
in-flight: 501:filler_workfn
pending: filler_workfn
...
workqueue test_wq: flags=0x8
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
delayed: test_workfn1 BAR(492), test_workfn2
...
pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62
The above shows that test_wq is executing test_workfn() on pid 510
which is the rescuer and also that there are two tasks 69 and 500
waiting for the work item to finish in flush_work(). As test_wq has
max_active of 1, there are two work items for test_workfn1() and
test_workfn2() which are delayed till the current work item is
finished. In addition, pid 492 is flushing test_workfn1().
The work item for test_workfn() is being executed on pwq of pool 2
which is the normal priority per-cpu pool for CPU 1. The pool has
three workers, two of which are executing filler_workfn() for
filler_wq and the last one is assuming the manager role trying to
create more workers.
This extra workqueue state dump will hopefully help chase down hangs
involving workqueues.
v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.
v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
printk()'s replaced with pr_info()'s, and cpumask printing now
uses cpulist_pr_cont().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
2015-03-09 16:22:28 +03:00
{
	if (work->func == wq_barrier_func) {
		struct wq_barrier *barr;
		barr = container_of(work, struct wq_barrier, work);
workqueue: Make show_pwq() use run-length encoding
The show_pwq() function dumps out a pool_workqueue structure's activity,
including the pending work-queue handlers:
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
pending: test_work_func, test_work_func, test_work_func1, test_work_func1, test_work_func1, test_work_func1, test_work_func1
When large systems are facing certain types of hang conditions, it is not
unusual for this "pending" list to contain runs of hundreds of identical
function names. This "wall of text" is difficult to read, and worse yet,
it can be interleaved with other output such as stack traces.
Therefore, make show_pwq() use run-length encoding so that the above
printout instead looks like this:
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
pending: 2*test_work_func, 5*test_work_func1
When no comma would be printed, including the WORK_STRUCT_LINKED case,
a new run is started unconditionally.
This output is more readable, places less stress on the hardware,
firmware, and software on the console-log path, and reduces interference
with other output.
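The run-length encoding described above can be sketched in user space as
follows. This is only an illustration, not the kernel code: the names
rle_state, rle_emit(), rle_feed() and rle_encode() are hypothetical, runs
are compared by function-name string rather than by function pointer, and
a NULL name plays the role of the flush sentinel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space stand-in for the run tracker. */
struct rle_state {
	const char *name;	/* name of the current run */
	long ctr;		/* length of the current run, 0 when idle */
	int comma;		/* whether a comma precedes the run */
};

/* Append the current run (if any) to out and reset the counter. */
static void rle_emit(struct rle_state *st, char *out, size_t cap)
{
	size_t len = strlen(out);

	if (!st->ctr)
		return;
	if (st->ctr == 1)
		snprintf(out + len, cap - len, "%s %s",
			 st->comma ? "," : "", st->name);
	else
		snprintf(out + len, cap - len, "%s %ld*%s",
			 st->comma ? "," : "", st->ctr, st->name);
	st->ctr = 0;
}

/* Feed one name; a NULL name only flushes the pending run. */
static void rle_feed(struct rle_state *st, int comma, const char *name,
		     char *out, size_t cap)
{
	assert(out && cap > 0);
	/* Same name as the current run: just extend it. */
	if (st->ctr && name && strcmp(name, st->name) == 0) {
		st->ctr++;
		return;
	}
	rle_emit(st, out, cap);
	if (!name)
		return;
	/* Start a new run of length one. */
	st->name = name;
	st->comma = comma;
	st->ctr = 1;
}

/* Encode n names into out as a comma-separated, run-length encoded list. */
static void rle_encode(const char *const *names, size_t n,
		       char *out, size_t cap)
{
	struct rle_state st = { 0 };
	size_t i;

	assert(out && cap > 0);
	out[0] = '\0';
	for (i = 0; i < n; i++)
		rle_feed(&st, i > 0, names[i], out, cap);
	rle_feed(&st, 0, NULL, out, cap);	/* final flush */
}
```

Feeding the seven pending names from the example (two test_work_func
followed by five test_work_func1) through rle_encode() yields the same
two-entry list as the "pending:" line above. Note that a run is only
printed once a different name or the final flush arrives, which is why
callers must always end with a flush.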
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-07 03:10:24 +03:00
		pr_cont_work_flush(comma, (work_func_t)-1, pcwsp);
2015-03-09 16:22:28 +03:00
		pr_cont("%s BAR(%d)", comma ? "," : "",
			task_pid_nr(barr->task));
	} else {
2023-01-07 03:10:24 +03:00
		if (!comma)
			pr_cont_work_flush(comma, (work_func_t)-1, pcwsp);
		pr_cont_work_flush(comma, work->func, pcwsp);
2015-03-09 16:22:28 +03:00
	}
}
static void show_pwq(struct pool_workqueue *pwq)
{
2023-01-07 03:10:24 +03:00
	struct pr_cont_work_struct pcws = { .ctr = 0, };
2015-03-09 16:22:28 +03:00
	struct worker_pool *pool = pwq->pool;
	struct work_struct *work;
	struct worker *worker;
	bool has_in_flight = false, has_pending = false;
	int bkt;
	pr_info("  pwq %d:", pool->id);
	pr_cont_pool_info(pool);
2024-01-29 21:11:24 +03:00
	pr_cont(" active=%d refcnt=%d%s\n",
		pwq->nr_active, pwq->refcnt,
2015-03-09 16:22:28 +03:00
		!list_empty(&pwq->mayday_node) ? " MAYDAY" : "");
	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
		if (worker->current_pwq == pwq) {
			has_in_flight = true;
			break;
		}
	}
	if (has_in_flight) {
		bool comma = false;
		pr_info("    in-flight:");
		hash_for_each(pool->busy_hash, bkt, worker, hentry) {
			if (worker->current_pwq != pwq)
				continue;
2024-02-05 00:28:06 +03:00
			pr_cont(" %s", comma ? "," : "");
			pr_cont_worker_id(worker);
			pr_cont(":%ps", worker->current_func);
2015-03-09 16:22:28 +03:00
			list_for_each_entry(work, &worker->scheduled, entry)
2023-01-07 03:10:24 +03:00
				pr_cont_work(false, work, &pcws);
			pr_cont_work_flush(comma, (work_func_t)-1L, &pcws);
2015-03-09 16:22:28 +03:00
			comma = true;
		}
		pr_cont("\n");
	}
	list_for_each_entry(work, &pool->worklist, entry) {
		if (get_work_pwq(work) == pwq) {
			has_pending = true;
			break;
		}
	}
	if (has_pending) {
		bool comma = false;
		pr_info("    pending:");
		list_for_each_entry(work, &pool->worklist, entry) {
			if (get_work_pwq(work) != pwq)
				continue;
2023-01-07 03:10:24 +03:00
			pr_cont_work(comma, work, &pcws);
2015-03-09 16:22:28 +03:00
			comma = !(*work_data_bits(work) & WORK_STRUCT_LINKED);
		}
2023-01-07 03:10:24 +03:00
pr_cont_work_flush ( comma , ( work_func_t ) - 1L , & pcws ) ;
workqueue: dump workqueues on sysrq-t
Workqueues are used extensively throughout the kernel but sometimes
it's difficult to debug stalls involving work items because visibility
into their inner workings is fairly limited. Although the sysrq-t task dump
annotates each active worker task with the information on the work
item being executed, it is challenging to find out which work items
are pending or delayed on which queues and how pools are being
managed.
This patch implements show_workqueue_state() which dumps all busy
workqueues and pools and is called from the sysrq-t handler. At the
end of sysrq-t dump, something like the following is printed.
  Showing busy workqueues and worker pools:
  ...
  workqueue filler_wq: flags=0x0
    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
      in-flight: 491:filler_workfn, 507:filler_workfn
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
      in-flight: 501:filler_workfn
      pending: filler_workfn
  ...
  workqueue test_wq: flags=0x8
    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
      in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
      delayed: test_workfn1 BAR(492), test_workfn2
  ...
  pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
  pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
  pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
  pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62
The above shows that test_wq is executing test_workfn() on pid 510
which is the rescuer and also that there are two tasks 69 and 500
waiting for the work item to finish in flush_work(). As test_wq has
max_active of 1, there are two work items for test_workfn1() and
test_workfn2() which are delayed till the current work item is
finished. In addition, pid 492 is flushing test_workfn1().
The work item for test_workfn() is being executed on pwq of pool 2
which is the normal priority per-cpu pool for CPU 1. The pool has
three workers, two of which are executing filler_workfn() for
filler_wq and the last one is assuming the manager role trying to
create more workers.
This extra workqueue state dump will hopefully help chase down hangs
involving workqueues.
v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.
v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
printk()'s replaced with pr_info()'s, and cpumask printing now
uses cpulist_pr_cont().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
2015-03-09 16:22:28 +03:00
pr_cont ( " \n " ) ;
}
2021-08-17 04:32:34 +03:00
if ( ! list_empty ( & pwq - > inactive_works ) ) {
2015-03-09 16:22:28 +03:00
bool comma = false ;
2021-08-17 04:32:34 +03:00
pr_info ( " inactive: " ) ;
list_for_each_entry ( work , & pwq - > inactive_works , entry ) {
2023-01-07 03:10:24 +03:00
pr_cont_work ( comma , work , & pcws ) ;
2015-03-09 16:22:28 +03:00
comma = ! ( * work_data_bits ( work ) & WORK_STRUCT_LINKED ) ;
}
2023-01-07 03:10:24 +03:00
pr_cont_work_flush ( comma , ( work_func_t ) - 1L , & pcws ) ;
2015-03-09 16:22:28 +03:00
pr_cont ( " \n " ) ;
}
}
/**
2021-10-20 06:09:00 +03:00
* show_one_workqueue - dump state of specified workqueue
* @ wq : workqueue whose state will be printed
2015-03-09 16:22:28 +03:00
*/
2021-10-20 06:09:00 +03:00
void show_one_workqueue ( struct workqueue_struct * wq )
2015-03-09 16:22:28 +03:00
{
2021-10-20 06:09:00 +03:00
struct pool_workqueue * pwq ;
bool idle = true ;
2024-02-21 08:36:14 +03:00
unsigned long irq_flags ;
2015-03-09 16:22:28 +03:00
2021-10-20 06:09:00 +03:00
for_each_pwq ( pwq , wq ) {
2024-01-29 21:11:24 +03:00
if ( ! pwq_is_empty ( pwq ) ) {
2021-10-20 06:09:00 +03:00
idle = false ;
break ;
2015-03-09 16:22:28 +03:00
}
2021-10-20 06:09:00 +03:00
}
if ( idle ) /* Nothing to print for idle workqueue */
return ;
2015-03-09 16:22:28 +03:00
2021-10-20 06:09:00 +03:00
pr_info ( " workqueue %s: flags=0x%x \n " , wq - > name , wq - > flags ) ;
2015-03-09 16:22:28 +03:00
2021-10-20 06:09:00 +03:00
for_each_pwq ( pwq , wq ) {
2024-02-21 08:36:14 +03:00
raw_spin_lock_irqsave ( & pwq - > pool - > lock , irq_flags ) ;
2024-01-29 21:11:24 +03:00
if ( ! pwq_is_empty ( pwq ) ) {
2018-01-11 03:53:35 +03:00
/*
2021-10-20 06:09:00 +03:00
* Defer printing to avoid deadlocks in console
* drivers that queue work while holding locks
* also taken in their write paths .
2018-01-11 03:53:35 +03:00
*/
2021-10-20 06:09:00 +03:00
printk_deferred_enter ( ) ;
show_pwq ( pwq ) ;
printk_deferred_exit ( ) ;
workqueue: dump workqueues on sysrq-t
Workqueues are used extensively throughout the kernel but sometimes
it's difficult to debug stalls involving work items because visibility
into its inner workings is fairly limited. Although sysrq-t task dump
annotates each active worker task with the information on the work
item being executed, it is challenging to find out which work items
are pending or delayed on which queues and how pools are being
managed.
This patch implements show_workqueue_state() which dumps all busy
workqueues and pools and is called from the sysrq-t handler. At the
end of sysrq-t dump, something like the following is printed.
Showing busy workqueues and worker pools:
...
workqueue filler_wq: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
in-flight: 491:filler_workfn, 507:filler_workfn
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
in-flight: 501:filler_workfn
pending: filler_workfn
...
workqueue test_wq: flags=0x8
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
delayed: test_workfn1 BAR(492), test_workfn2
...
pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62
The above shows that test_wq is executing test_workfn() on pid 510
which is the rescuer and also that there are two tasks 69 and 500
waiting for the work item to finish in flush_work(). As test_wq has
max_active of 1, there are two work items for test_workfn1() and
test_workfn2() which are delayed till the current work item is
finished. In addition, pid 492 is flushing test_workfn1().
The work item for test_workfn() is being executed on pwq of pool 2
which is the normal priority per-cpu pool for CPU 1. The pool has
three workers, two of which are executing filler_workfn() for
filler_wq and the last one is assuming the manager role trying to
create more workers.
This extra workqueue state dump will hopefully help chase down hangs
involving workqueues.
v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.
v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
printk()'s replaced with pr_info()'s, and cpumask printing now
uses cpulist_pr_cont().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
2015-03-09 16:22:28 +03:00
}
2024-02-21 08:36:14 +03:00
raw_spin_unlock_irqrestore ( & pwq - > pool - > lock , irq_flags ) ;
2018-01-11 03:53:35 +03:00
/*
* We could be printing a lot from atomic context , e . g .
2021-10-20 06:09:00 +03:00
* sysrq - t - > show_all_workqueues ( ) . Avoid triggering
2018-01-11 03:53:35 +03:00
* hard lockup .
*/
touch_nmi_watchdog ( ) ;
2015-03-09 16:22:28 +03:00
}
2021-10-20 06:09:00 +03:00
}
/**
* show_one_worker_pool - dump state of specified worker pool
* @ pool : worker pool whose state will be printed
*/
static void show_one_worker_pool ( struct worker_pool * pool )
{
struct worker * worker ;
bool first = true ;
2024-02-21 08:36:14 +03:00
unsigned long irq_flags ;
2023-03-07 15:53:31 +03:00
unsigned long hung = 0 ;
2021-10-20 06:09:00 +03:00
2024-02-21 08:36:14 +03:00
raw_spin_lock_irqsave ( & pool - > lock , irq_flags ) ;
2021-10-20 06:09:00 +03:00
if ( pool - > nr_workers = = pool - > nr_idle )
goto next_pool ;
2023-03-07 15:53:31 +03:00
/* How long the first pending work is waiting for a worker. */
if ( ! list_empty ( & pool - > worklist ) )
hung = jiffies_to_msecs ( jiffies - pool - > watchdog_ts ) / 1000 ;
2021-10-20 06:09:00 +03:00
/*
* Defer printing to avoid deadlocks in console drivers that
* queue work while holding locks also taken in their write
* paths .
*/
printk_deferred_enter ( ) ;
pr_info ( " pool %d: " , pool - > id ) ;
pr_cont_pool_info ( pool ) ;
2023-03-07 15:53:31 +03:00
pr_cont ( " hung=%lus workers=%d " , hung , pool - > nr_workers ) ;
2021-10-20 06:09:00 +03:00
if ( pool - > manager )
pr_cont ( " manager: %d " ,
task_pid_nr ( pool - > manager - > task ) ) ;
list_for_each_entry ( worker , & pool - > idle_list , entry ) {
2024-02-05 00:28:06 +03:00
pr_cont ( " %s " , first ? " idle: " : " " ) ;
pr_cont_worker_id ( worker ) ;
2021-10-20 06:09:00 +03:00
first = false ;
}
pr_cont ( " \n " ) ;
printk_deferred_exit ( ) ;
next_pool :
2024-02-21 08:36:14 +03:00
raw_spin_unlock_irqrestore ( & pool - > lock , irq_flags ) ;
2021-10-20 06:09:00 +03:00
/*
* We could be printing a lot from atomic context , e . g .
* sysrq - t - > show_all_workqueues ( ) . Avoid triggering
* hard lockup .
*/
touch_nmi_watchdog ( ) ;
}
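The "hung" value printed by show_one_worker_pool() is a difference of two tick counters converted to whole seconds. The following is a minimal userspace sketch of that arithmetic, not the kernel code: SKETCH_HZ, the 32-bit counter width and the function name are assumptions made for illustration. It shows why the unsigned subtraction still gives the right interval when the counter has wrapped once between the two samples.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tick rate for this sketch; the kernel's HZ is a build-time constant. */
#define SKETCH_HZ 250u

/*
 * Whole seconds elapsed between two tick counter samples.
 * The subtraction is done in unsigned 32-bit arithmetic, so a single
 * wraparound of the counter between the samples does not matter.
 */
static unsigned long sketch_hung_secs(uint32_t now, uint32_t watchdog_ts)
{
	uint32_t delta_ticks = now - watchdog_ts;	/* modulo 2^32 */
	uint64_t msecs = (uint64_t)delta_ticks * 1000u / SKETCH_HZ;

	return (unsigned long)(msecs / 1000u);
}
```

With SKETCH_HZ at 250, a gap of 750 ticks is reported as 3 seconds, and the same holds when the earlier sample was taken shortly before the counter wrapped.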
/**
* show_all_workqueues - dump workqueue state
*
2023-03-20 06:29:05 +03:00
* Called from a sysrq handler and prints out all busy workqueues and pools .
2021-10-20 06:09:00 +03:00
*/
void show_all_workqueues ( void )
{
struct workqueue_struct * wq ;
struct worker_pool * pool ;
int pi ;
rcu_read_lock ( ) ;
pr_info ( " Showing busy workqueues and worker pools: \n " ) ;
list_for_each_entry_rcu ( wq , & workqueues , list )
show_one_workqueue ( wq ) ;
for_each_pool ( pool , pi )
show_one_worker_pool ( pool ) ;
2019-03-13 19:55:47 +03:00
rcu_read_unlock ( ) ;
2015-03-09 16:22:28 +03:00
}
2023-03-20 06:29:05 +03:00
/**
* show_freezable_workqueues - dump freezable workqueue state
*
* Called from try_to_freeze_tasks ( ) and prints out all freezable workqueues
* still busy .
*/
void show_freezable_workqueues ( void )
{
struct workqueue_struct * wq ;
rcu_read_lock ( ) ;
pr_info ( " Showing freezable workqueues that are still busy: \n " ) ;
list_for_each_entry_rcu ( wq , & workqueues , list ) {
if ( ! ( wq - > flags & WQ_FREEZABLE ) )
continue ;
show_one_workqueue ( wq ) ;
}
rcu_read_unlock ( ) ;
}
2018-05-18 18:47:13 +03:00
/* used to show worker information through /proc/PID/{comm,stat,status} */
void wq_worker_comm ( char * buf , size_t size , struct task_struct * task )
{
int off ;
/* always show the actual comm */
off = strscpy ( buf , task - > comm , size ) ;
if ( off < 0 )
return ;
2018-05-21 18:04:35 +03:00
/* stabilize PF_WQ_WORKER and worker pool association */
2018-05-18 18:47:13 +03:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2018-05-21 18:04:35 +03:00
if ( task - > flags & PF_WQ_WORKER ) {
struct worker * worker = kthread_data ( task ) ;
struct worker_pool * pool = worker - > pool ;
2018-05-18 18:47:13 +03:00
2018-05-21 18:04:35 +03:00
if ( pool ) {
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
2018-05-21 18:04:35 +03:00
/*
* - > desc tracks information ( wq name or
* set_worker_desc ( ) ) for the latest execution . If
* current , prepend ' + ' , otherwise ' - ' .
*/
if ( worker - > desc [ 0 ] ! = ' \0 ' ) {
if ( worker - > current_work )
scnprintf ( buf + off , size - off , " +%s " ,
worker - > desc ) ;
else
scnprintf ( buf + off , size - off , " -%s " ,
worker - > desc ) ;
}
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2018-05-18 18:47:13 +03:00
}
}
mutex_unlock ( & wq_pool_attach_mutex ) ;
}
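The naming rule in wq_worker_comm() can be tried out in userspace. The sketch below is an illustration under assumptions, not the kernel implementation: it takes the comm string, the description and a "currently executing" flag as plain parameters (all hypothetical names), uses snprintf() in place of strscpy()/scnprintf(), and leaves out the locking entirely.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<comm>+<desc>" if the worker is executing a work item and
 * "<comm>-<desc>" if the description is left over from an earlier one.
 * An empty description leaves the comm untouched.
 */
static void sketch_worker_comm(char *buf, size_t size, const char *comm,
			       const char *desc, int running)
{
	int off = snprintf(buf, size, "%s", comm);

	/* Give up if formatting failed or the comm alone filled the buffer. */
	if (off < 0 || (size_t)off >= size)
		return;

	if (desc[0] != '\0')
		snprintf(buf + off, size - (size_t)off, "%c%s",
			 running ? '+' : '-', desc);
}
```

Called with comm "kworker/1:2" and description "events", the buffer ends up as "kworker/1:2+events" for a busy worker and "kworker/1:2-events" for one that has finished that work item.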
2018-05-22 22:47:32 +03:00
# ifdef CONFIG_SMP
2010-06-29 12:07:12 +04:00
/*
* CPU hotplug .
*
2010-06-29 12:07:14 +04:00
* There are two challenges in supporting CPU hotplug . Firstly , there
2013-02-14 07:29:12 +04:00
* are a lot of assumptions on strong associations among work , pwq and
2013-01-24 23:01:34 +04:00
* pool which make migrating pending and scheduled works very
2010-06-29 12:07:14 +04:00
* difficult to implement without impacting hot paths . Secondly ,
2013-01-24 23:01:33 +04:00
* worker pools serve a mix of short , long and very long running works making
2010-06-29 12:07:14 +04:00
* blocked draining impractical .
*
2013-01-24 23:01:33 +04:00
* This is solved by allowing the pools to be disassociated from the CPU
2012-07-17 23:39:27 +04:00
* running as an unbound one and allowing it to be reattached later if the
* cpu comes back online .
2010-06-29 12:07:12 +04:00
*/
2005-04-17 02:20:36 +04:00
2017-12-01 17:20:36 +03:00
static void unbind_workers ( int cpu )
2007-05-09 13:34:09 +04:00
{
2012-07-14 09:16:44 +04:00
struct worker_pool * pool ;
2010-06-29 12:07:12 +04:00
struct worker * worker ;
2007-05-09 13:34:09 +04:00
2013-03-12 22:30:03 +04:00
for_each_cpu_worker_pool ( pool , cpu ) {
2018-05-18 18:47:13 +03:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
2007-05-09 13:34:09 +04:00
2013-01-24 23:01:33 +04:00
/*
2014-05-20 13:46:34 +04:00
* We ' ve blocked all attach / detach operations . Make all workers
2013-01-24 23:01:33 +04:00
* unbound and set DISASSOCIATED . Before this , all workers
2021-12-07 10:35:39 +03:00
* must be on the cpu . After this , they may become diasporas .
2021-12-07 10:35:40 +03:00
* And the preemption disabled section in their sched callbacks
* are guaranteed to see WORKER_UNBOUND since the code here
* is on the same cpu .
2013-01-24 23:01:33 +04:00
*/
2014-05-20 13:46:31 +04:00
for_each_pool_worker ( worker , pool )
2013-01-24 23:01:33 +04:00
worker - > flags | = WORKER_UNBOUND ;
2007-05-09 13:34:15 +04:00
2013-01-24 23:01:33 +04:00
pool - > flags | = POOL_DISASSOCIATED ;
2012-07-17 23:39:26 +04:00
2013-03-09 03:18:28 +04:00
/*
2021-12-07 10:35:41 +03:00
* The handling of nr_running in sched callbacks is disabled
* now . Zap nr_running . After this , nr_running stays zero and
* need_more_worker ( ) and keep_working ( ) are always true as
* long as the worklist is not empty . This pool now behaves as
* an unbound ( in terms of concurrency management ) pool which
2013-03-09 03:18:28 +04:00
* is served by workers tied to the pool .
*/
2021-12-23 15:31:40 +03:00
pool - > nr_running = 0 ;
2013-03-09 03:18:28 +04:00
/*
* With concurrency management just turned off , a busy
* worker blocking could lead to lengthy stalls . Kick off
* unbound chain execution of currently pending work items .
*/
2023-08-08 04:57:25 +03:00
kick_pool ( pool ) ;
2021-12-07 10:35:41 +03:00
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2021-12-07 10:35:41 +03:00
2023-01-12 19:14:28 +03:00
for_each_pool_worker ( worker , pool )
unbind_worker ( worker ) ;
2021-12-07 10:35:41 +03:00
mutex_unlock ( & wq_pool_attach_mutex ) ;
2013-03-09 03:18:28 +04:00
}
2007-05-09 13:34:09 +04:00
}
2013-03-20 00:45:21 +04:00
/**
* rebind_workers - rebind all workers of a pool to the associated CPU
* @ pool : pool of interest
*
workqueue: directly restore CPU affinity of workers from CPU_ONLINE
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.
As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr(). Due
to the various states a worker can be in, this led to three different
paths a worker may be rebound. worker->rebind_work is queued to busy
workers. Idle ones are signaled by unlinking worker->entry and call
idle_worker_rebind(). The manager isn't covered by either and
implements its own mechanism.
PF_THREAD_BOUND has been replaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself now can manipulate CPU affinity of workers. This patch
replaces the existing rebind mechanism with a direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.
There are a couple subtleties. All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up. This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.
Another subtlety stems from matching @pool->nr_running with the
number of running unbound workers. While DISASSOCIATED, all workers
are unbound and nr_running is zero. As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals. Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution. The only change
needed for the workers is clearing REBOUND along with PREP.
* This patch leaves for_each_busy_worker() without any user. Removed.
* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
and rebind logic in manager_workers() removed.
* worker_thread() now looks at WORKER_DIE instead of testing whether
@worker->entry is empty to determine whether it needs to do
something special as dying is the only special thing now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
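The REBOUND handoff described in the commit message above can be modelled in a few lines of userspace C. This is a sketch under assumptions rather than the kernel code: the flag values, the sk_ prefixed names and the simplified nr_running accounting are invented for illustration. The hotplug side swaps UNBOUND for REBOUND without touching nr_running, and the worker itself starts counting as running once it clears its last NOT_RUNNING flag.

```c
#include <assert.h>

/* Hypothetical flag values; only the NOT_RUNNING grouping matters here. */
enum {
	SK_WORKER_PREP    = 1 << 0,
	SK_WORKER_UNBOUND = 1 << 1,
	SK_WORKER_REBOUND = 1 << 2,
	SK_NOT_RUNNING    = SK_WORKER_PREP | SK_WORKER_UNBOUND |
			    SK_WORKER_REBOUND,
};

struct sk_pool {
	int nr_running;
};

struct sk_worker {
	unsigned int flags;
	struct sk_pool *pool;
};

/* Clear flags; count the worker as running once no NOT_RUNNING flag is left. */
static void sk_clr_flags(struct sk_worker *w, unsigned int flags)
{
	unsigned int old = w->flags;

	w->flags &= ~flags;
	if ((old & SK_NOT_RUNNING) && !(w->flags & SK_NOT_RUNNING))
		w->pool->nr_running++;
}

/* Hotplug side: replace UNBOUND with REBOUND, leaving nr_running alone. */
static void sk_rebind(struct sk_worker *w)
{
	assert(w->flags & SK_WORKER_UNBOUND);
	w->flags = (w->flags | SK_WORKER_REBOUND) & ~SK_WORKER_UNBOUND;
}

/* Worker side: REBOUND is dropped together with PREP before the next item. */
static void sk_prepare_for_work(struct sk_worker *w)
{
	sk_clr_flags(w, SK_WORKER_PREP | SK_WORKER_REBOUND);
}
```

Because the worker stays in a NOT_RUNNING state across the swap, the hotplug path never has to guess whether the task is actually running; the counter is adjusted only by the worker, at a point where it knows its own state.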
2013-03-20 00:45:21 +04:00
* @ pool - > cpu is coming online . Rebind all workers to the CPU .
2013-03-20 00:45:21 +04:00
*/
static void rebind_workers ( struct worker_pool * pool )
{
2013-03-20 00:45:21 +04:00
struct worker * worker ;
2013-03-20 00:45:21 +04:00
2018-05-18 18:47:13 +03:00
lockdep_assert_held ( & wq_pool_attach_mutex ) ;
2013-03-20 00:45:21 +04:00
2013-03-20 00:45:21 +04:00
/*
* Restore CPU affinity of all workers . As all idle workers should
* be on the run - queue of the associated CPU before any local
2015-05-23 08:08:14 +03:00
* wake - ups for concurrency management happen , restore CPU affinity
2013-03-20 00:45:21 +04:00
* of all workers first and then clear UNBOUND . As we ' re called
* from CPU_ONLINE , the following shouldn ' t fail .
*/
2023-01-13 20:40:40 +03:00
for_each_pool_worker ( worker , pool ) {
kthread_set_per_cpu ( worker - > task , pool - > cpu ) ;
WARN_ON_ONCE ( set_cpus_allowed_ptr ( worker - > task ,
workqueue: Add workqueue_attrs->__pod_cpumask
workqueue_attrs has two uses:
* to specify the required unbound workqueue properties by users
* to match worker_pool's properties to workqueues by core code
For example, if the user wants to restrict a workqueue to run only CPUs 0
and 2, and the two CPUs are on different affinity scopes, the workqueue's
attrs->cpumask would contain CPUs 0 and 2, and the workqueue would be
associated with two worker_pools, one with attrs->cpumask containing just
CPU 0 and the other CPU 2.
Workqueue wants to support non-strict affinity scopes where work items are
started in their matching affinity scopes but the scheduler is free to
migrate them outside the starting scopes, which can enable utilizing the
whole machine while maintaining most of the locality benefits from affinity
scopes.
To enable that, worker_pools need to distinguish the strict affinity that it
has to follow (because that's the restriction coming from the user) and the
soft affinity that it wants to apply when dispatching work items. Note that
two worker_pools with different soft dispatching requirements have to be
separate; otherwise, for example, we'd be ping-ponging worker threads across
NUMA boundaries constantly.
This patch adds workqueue_attrs->__pod_cpumask. The new field is double
underscored as it's only used internally to distinguish worker_pools. A
worker_pool's ->cpumask is now always the same as the online subset of
allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's
subset of that ->cpumask. Going back to the example above, both worker_pools
would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask
would contain 0 while the other's 2.
* pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask
that the pool's workers must stay within. This is currently always
->__pod_cpumask as all boundaries are still strict.
* As a workqueue_attrs can now track both the associated workqueues' cpumask
and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external
out-argument. Drop @cpumask and instead store the result in
->__pod_cpumask.
* The above also simplifies apply_wqattrs_prepare() as the same
workqueue_attrs can be used to create all pods associated with a
workqueue. tmp_attrs is dropped.
* wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq
update is needed instead of only comparing ->cpumask so that
->__pod_cpumask is compared too. It can directly compare ->__pod_cpumask
but the code is easier to understand and more robust this way.
The only user-visible behavior change is that two workqueues with different
cpumasks no longer can share worker_pools even when their pod subsets
coincide. Going back to the example, let's say there's another workqueue
with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped
to two worker_pools - one with CPU 0, the other with 2 and 3. The former has
the same cpumask as the first pod of the earlier example and would have
shared the same worker_pool but that's no longer the case after this patch.
The worker_pools would have the same ->__pod_cpumask but their ->cpumask's
wouldn't match.
While this is necessary to support non-strict affinity scopes, there can be
further optimizations to maintain sharing among strict affinity scopes.
However, non-strict affinity scopes are going to be preferable for most use
cases and we don't see very diverse mixture of unbound workqueue cpumasks
anyway, so the additional overhead doesn't seem to justify the extra
complexity.
v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask
to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by
using wqattrs_equal() for comparison instead.
- Per-cpu worker pools weren't initializing ->__pod_cpumask which caused
a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask.
Signed-off-by: Tejun Heo <tj@kernel.org>
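The relationship between ->cpumask and ->__pod_cpumask described above can be checked with plain bitmasks. The sketch below is illustrative only: it represents CPU sets as 64-bit masks, assumes a pod layout of {0,1} and {2,3}, and uses hypothetical sk_ names rather than the kernel's workqueue_attrs and wqattrs_equal().

```c
#include <assert.h>
#include <stdint.h>

/* Bit n set means CPU n is in the mask. */
struct sk_attrs {
	uint64_t cpumask;	/* CPUs allowed by the workqueue */
	uint64_t pod_cpumask;	/* the pod's subset of cpumask */
};

/* Attributes of the pool serving one pod of a workqueue. */
static struct sk_attrs sk_pod_attrs(uint64_t wq_cpumask, uint64_t pod_cpus)
{
	struct sk_attrs a = { wq_cpumask, wq_cpumask & pod_cpus };

	return a;
}

/* Two pools can be shared only if both masks match. */
static int sk_attrs_equal(const struct sk_attrs *a, const struct sk_attrs *b)
{
	return a->cpumask == b->cpumask && a->pod_cpumask == b->pod_cpumask;
}
```

For a workqueue limited to CPUs 0 and 2 (mask 0x5) and another limited to CPUs 0, 2 and 3 (mask 0xd), the pools for the pod containing CPU 0 both get a pod mask of 0x1, yet they compare as different because their full cpumasks differ. That is the loss of sharing the commit message mentions.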
2023-08-08 04:57:25 +03:00
pool_allowed_cpus ( pool ) ) < 0 ) ;
2023-01-13 20:40:40 +03:00
}
2013-03-20 00:45:21 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_lock_irq ( & pool - > lock ) ;
workqueue: fix rebind bound workers warning
------------[ cut here ]------------
WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
Modules linked in:
CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
Call Trace:
dump_stack+0x89/0xd4
__warn+0xfd/0x120
warn_slowpath_null+0x1d/0x20
rebind_workers+0x1c0/0x1d0
workqueue_cpu_up_callback+0xf5/0x1d0
notifier_call_chain+0x64/0x90
? trace_hardirqs_on_caller+0xf2/0x220
? notify_prepare+0x80/0x80
__raw_notifier_call_chain+0xe/0x10
__cpu_notify+0x35/0x50
notify_down_prepare+0x5e/0x80
? notify_prepare+0x80/0x80
cpuhp_invoke_callback+0x73/0x330
? __schedule+0x33e/0x8a0
cpuhp_down_callbacks+0x51/0xc0
cpuhp_thread_fun+0xc1/0xf0
smpboot_thread_fn+0x159/0x2a0
? smpboot_create_threads+0x80/0x80
kthread+0xef/0x110
? wait_for_completion+0xf0/0x120
? schedule_tail+0x35/0xf0
ret_from_fork+0x22/0x50
? __init_kthread_worker+0x70/0x70
---[ end trace eb12ae47d2382d8f ]---
notify_down_prepare: attempt to take down CPU 0 failed
This bug can be reproduced by below config w/ nohz_full= all cpus:
CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
CONFIG_DEBUG_HOTPLUG_CPU0=y
CONFIG_NO_HZ_FULL=y
As Thomas pointed out:
| If a down prepare callback fails, then DOWN_FAILED is invoked for all
| callbacks which have successfully executed DOWN_PREPARE.
|
| But, workqueue has actually two notifiers. One which handles
| UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
|
| Now look at the priorities of those callbacks:
|
| CPU_PRI_WORKQUEUE_UP = 5
| CPU_PRI_WORKQUEUE_DOWN = -5
|
| So the call order on DOWN_PREPARE is:
|
| CB 1
| CB ...
| CB workqueue_up() -> Ignores DOWN_PREPARE
| CB ...
| CB X ---> Fails
|
| So we call up to CB X with DOWN_FAILED
|
| CB 1
| CB ...
| CB workqueue_up() -> Handles DOWN_FAILED
| CB ...
| CB X-1
|
| So the problem is that the workqueue stuff handles DOWN_FAILED in the up
| callback, while it should do it in the down callback. Which is not a good idea
| either because it wants to be called early on rollback...
|
| Brilliant stuff, isn't it? The hotplug rework will solve this problem because
| the callbacks become symmetric, but for the existing mess, we need some
| workaround in the workqueue code.
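The call order described in the quote above can be sketched in userspace as follows. This is only an illustration, not kernel code: the struct, the field names, and the priority of 0 given to tick_nohz_cpu_down are assumptions made for the example (the text only gives the relative ordering), and the rollback rule is the one quoted above, where DOWN_FAILED is delivered to the callbacks that ran before the failing one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one notifier callback; not a kernel structure. */
struct notifier {
    const char *name;
    int priority;             /* higher priority runs first */
    int fails_down_prepare;   /* 1 if DOWN_PREPARE returns an error */
    int unbinds_on_prepare;   /* 1 if DOWN_PREPARE unbinds workers */
    int rebinds_on_failed;    /* 1 if DOWN_FAILED rebinds workers */
};

struct result {
    int unbound;              /* how many times workers were unbound */
    int rebound;              /* how many times workers were rebound */
};

/* Walk a chain sorted by descending priority, rolling back on failure. */
static struct result take_cpu_down(const struct notifier *chain, size_t n)
{
    struct result r = { 0, 0 };
    size_t i, failed = n;

    for (i = 0; i < n; i++) {
        /* the model requires the chain to be sorted */
        assert(i == 0 || chain[i - 1].priority >= chain[i].priority);
        if (chain[i].fails_down_prepare) {
            failed = i;
            break;
        }
        if (chain[i].unbinds_on_prepare)
            r.unbound++;
    }
    if (failed == n)
        return r;

    /* DOWN_FAILED goes to the callbacks that ran before the failing one */
    for (i = 0; i < failed; i++)
        if (chain[i].rebinds_on_failed)
            r.rebound++;
    return r;
}

/* The nohz_full scenario: the tick callback vetoes taking CPU 0 down. */
static struct result demo_nohz_full_cpu0(void)
{
    static const struct notifier chain[] = {
        { "workqueue_cpu_up",    5, 0, 0, 1 },
        { "tick_nohz_cpu_down",  0, 1, 0, 0 },  /* priority assumed */
        { "workqueue_cpu_down", -5, 0, 1, 0 },
    };

    return take_cpu_down(chain, sizeof(chain) / sizeof(chain[0]));
}
```

In this model demo_nohz_full_cpu0() reports one rebind and no unbind, which is the mismatch that rebind_workers() warns about.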
The boot CPU handles housekeeping duties (unbound timers, workqueues,
timekeeping, ...) on behalf of full dynticks CPUs. It must remain
online when nohz full is enabled. Every notifier_block has a
priority:
workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down
So the tick_nohz_cpu_down callback fails on DOWN_PREPARE of CPU 0, and
the notifier_blocks behind tick_nohz_cpu_down are not called at all,
which means the workers are never actually unbound. The hotplug
state machine then falls back to undo and brings CPU 0 online again.
Workers are rebound unconditionally even though they were never
unbound, which triggers the warning in the process.
This patch fixes it by catching !DISASSOCIATED to avoid rebinding
bound workers.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
2016-05-11 12:55:18 +03:00
2014-06-03 11:33:27 +04:00
pool - > flags & = ~ POOL_DISASSOCIATED ;
2013-03-20 00:45:21 +04:00
2014-05-20 13:46:31 +04:00
for_each_pool_worker ( worker , pool ) {
workqueue: directly restore CPU affinity of workers from CPU_ONLINE
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.
As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr(). Due
to the various states a worker can be in, this led to three different
paths by which a worker may be rebound. worker->rebind_work is queued to busy
workers. Idle ones are signaled by unlinking worker->entry and calling
idle_worker_rebind(). The manager isn't covered by either and
implements its own mechanism.
PF_THREAD_BOUND has been replaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself can now manipulate CPU affinity of workers. This patch
replaces the existing rebind mechanism with a direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.
There are a couple subtleties. All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up. This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.
Another subtlety stems from matching @pool->nr_running with the
number of running unbound workers. While DISASSOCIATED, all workers
are unbound and nr_running is zero. As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals. Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution. The only change
needed for the workers is clearing REBOUND along with PREP.
* This patch leaves for_each_busy_worker() without any user. Removed.
* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
and rebind logic in manager_workers() removed.
* worker_thread() now looks at WORKER_DIE instead of testing whether
@worker->entry is empty to determine whether it needs to do
something special as dying is the only special thing now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-20 00:45:21 +04:00
unsigned int worker_flags = worker - > flags ;
2013-03-20 00:45:21 +04:00
2013-03-20 00:45:21 +04:00
/*
* We want to clear UNBOUND but can ' t directly call
* worker_clr_flags ( ) or adjust nr_running . Atomically
* replace UNBOUND with another NOT_RUNNING flag REBOUND .
* @ worker will clear REBOUND using worker_clr_flags ( ) when
* it initiates the next execution cycle thus restoring
* concurrency management . Note that when or whether
* @ worker clears REBOUND doesn ' t affect correctness .
*
locking/atomics, workqueue: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.
However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.
It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts the workqueue code and comments to use
{READ,WRITE}_ONCE() consistently.
----
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-12-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24 00:07:22 +03:00
* WRITE_ONCE ( ) is necessary because @ worker - > flags may be
2013-03-20 00:45:21 +04:00
* tested without holding any lock in
2019-03-13 19:55:48 +03:00
* wq_worker_running ( ) . Without it , NOT_RUNNING test may
2013-03-20 00:45:21 +04:00
* fail incorrectly leading to premature concurrency
* management operations .
*/
WARN_ON_ONCE ( ! ( worker_flags & WORKER_UNBOUND ) ) ;
worker_flags | = WORKER_REBOUND ;
worker_flags & = ~ WORKER_UNBOUND ;
2017-10-24 00:07:22 +03:00
WRITE_ONCE ( worker - > flags , worker_flags ) ;
2013-03-20 00:45:21 +04:00
}
2013-03-20 00:45:21 +04:00
2020-05-27 22:46:33 +03:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2013-03-20 00:45:21 +04:00
}
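The flag swap performed by rebind_workers() can be sketched in userspace as follows. This is a simplified illustration under stated assumptions: the bit values are placeholders, WORKER_NOT_RUNNING is reduced to three flags, struct worker is cut down to a single field, and a volatile store stands in for WRITE_ONCE(). The helper names rebind_one() and counts_as_running() are hypothetical.

```c
#include <assert.h>

/* Placeholder bit values; only the relationships between flags matter. */
enum {
    WORKER_PREP        = 1 << 3,
    WORKER_UNBOUND     = 1 << 7,
    WORKER_REBOUND     = 1 << 8,
    /* simplified: a worker with any of these set is not counted as running */
    WORKER_NOT_RUNNING = WORKER_PREP | WORKER_UNBOUND | WORKER_REBOUND,
};

/* Cut-down stand-in for the kernel's struct worker. */
struct worker {
    unsigned int flags;
};

/* Replace UNBOUND with REBOUND and publish the result in one store. */
static void rebind_one(struct worker *worker)
{
    unsigned int worker_flags = worker->flags;

    assert(worker_flags & WORKER_UNBOUND);
    worker_flags |= WORKER_REBOUND;
    worker_flags &= ~(unsigned int)WORKER_UNBOUND;

    /*
     * Userspace stand-in for WRITE_ONCE(): a lockless reader sees either
     * the old or the new value, and both contain a NOT_RUNNING flag.
     */
    *(volatile unsigned int *)&worker->flags = worker_flags;
}

/* Lockless test in the style of the NOT_RUNNING check. */
static int counts_as_running(const struct worker *worker)
{
    unsigned int flags = *(const volatile unsigned int *)&worker->flags;

    return !(flags & WORKER_NOT_RUNNING);
}
```

Because the new value is built in a local variable and stored once, the sketch never exposes a state with both flags cleared, so a worker is not counted as running until it clears REBOUND itself.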
workqueue: restore CPU affinity of unbound workers on CPU_ONLINE
With the recent addition of the custom attributes support, unbound
pools may have allowed cpumask which isn't full. As long as some of
CPUs in the cpumask are online, its workers will maintain cpus_allowed
as set on worker creation; however, once no online CPU is left in
cpus_allowed, the scheduler will reset cpus_allowed of any workers
which get scheduled so that they can execute.
To remain compliant with the user-specified configuration, CPU affinity
needs to be restored when a CPU comes online for an unbound pool
which didn't have any online CPUs before.
This patch implements restore_unbound_workers_cpumask(), which is
called from CPU_ONLINE for all unbound pools, checks whether the
coming up CPU is the first allowed online one, and, if so, invokes
set_cpus_allowed_ptr() with the configured cpumask on all workers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-20 00:45:21 +04:00
/**
* restore_unbound_workers_cpumask - restore cpumask of unbound workers
* @ pool : unbound pool of interest
* @ cpu : the CPU which is coming up
*
* An unbound pool may end up with a cpumask which doesn ' t have any online
* CPUs . When a worker of such a pool gets scheduled , the scheduler resets
* its cpus_allowed . If @ cpu is in @ pool ' s cpumask which didn ' t have any
* online CPU before , cpus_allowed of all its workers should be restored .
*/
static void restore_unbound_workers_cpumask ( struct worker_pool * pool , int cpu )
{
static cpumask_t cpumask ;
struct worker * worker ;
2018-05-18 18:47:13 +03:00
lockdep_assert_held ( & wq_pool_attach_mutex ) ;
2013-03-20 00:45:21 +04:00
/* is @cpu allowed for @pool? */
if ( ! cpumask_test_cpu ( cpu , pool - > attrs - > cpumask ) )
return ;
cpumask_and ( & cpumask , pool - > attrs - > cpumask , cpu_online_mask ) ;
/* as we're called from CPU_ONLINE, the following shouldn't fail */
2014-05-20 13:46:31 +04:00
for_each_pool_worker ( worker , pool )
2016-06-16 15:38:42 +03:00
WARN_ON_ONCE ( set_cpus_allowed_ptr ( worker - > task , & cpumask ) < 0 ) ;
2013-03-20 00:45:21 +04:00
}
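The mask arithmetic in restore_unbound_workers_cpumask() can be sketched with plain 64-bit integers. This is a userspace approximation, not the kernel API: mask_t and restored_affinity() are hypothetical names, and one bit per CPU stands in for struct cpumask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: bit n set means CPU n. */
typedef uint64_t mask_t;

/*
 * Returns 1 and stores the affinity to apply to the pool's workers when
 * @cpu is allowed for the pool, or returns 0 and leaves *out untouched.
 */
static int restored_affinity(mask_t pool_mask, mask_t online_mask,
                             unsigned int cpu, mask_t *out)
{
    assert(cpu < 64);

    /* is @cpu allowed for the pool? */
    if (!(pool_mask & ((mask_t)1 << cpu)))
        return 0;

    /* restrict the configured mask to the CPUs that are online */
    *out = pool_mask & online_mask;
    return 1;
}
```

With a pool allowed on CPUs 2 and 3 and CPUs 0 to 2 online, bringing up CPU 2 yields an affinity of CPU 2 only, while bringing up CPU 1 leaves the workers alone.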
2016-07-13 20:16:29 +03:00
int workqueue_prepare_cpu ( unsigned int cpu )
{
struct worker_pool * pool ;
for_each_cpu_worker_pool ( pool , cpu ) {
if ( pool - > nr_workers )
continue ;
if ( ! create_worker ( pool ) )
return - ENOMEM ;
}
return 0 ;
}
int workqueue_online_cpu ( unsigned int cpu )
2007-05-09 13:34:09 +04:00
{
2012-07-14 09:16:44 +04:00
struct worker_pool * pool ;
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 22:23:36 +04:00
struct workqueue_struct * wq ;
2013-03-20 00:45:21 +04:00
int pi ;
2012-07-17 23:39:27 +04:00
2016-07-13 20:16:29 +03:00
mutex_lock ( & wq_pool_mutex ) ;
2013-03-20 00:45:21 +04:00
2016-07-13 20:16:29 +03:00
for_each_pool ( pool , pi ) {
2024-02-05 00:28:06 +03:00
/* BH pools aren't affected by hotplug */
if ( pool - > flags & POOL_BH )
continue ;
2013-01-24 23:01:33 +04:00
2024-02-05 00:28:06 +03:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2016-07-13 20:16:29 +03:00
if ( pool - > cpu = = cpu )
rebind_workers ( pool ) ;
else if ( pool - > cpu < 0 )
restore_unbound_workers_cpumask ( pool , cpu ) ;
2018-05-18 18:47:13 +03:00
mutex_unlock ( & wq_pool_attach_mutex ) ;
2016-07-13 20:16:29 +03:00
}
2015-04-02 14:14:39 +03:00
2023-08-08 04:57:23 +03:00
/* update pod affinity of unbound workqueues */
2023-08-08 04:57:23 +03:00
list_for_each_entry ( wq , & workqueues , list ) {
2023-08-08 04:57:24 +03:00
struct workqueue_attrs * attrs = wq - > unbound_attrs ;
if ( attrs ) {
const struct wq_pod_type * pt = wqattrs_pod_type ( attrs ) ;
int tcpu ;
2023-08-08 04:57:23 +03:00
2023-08-08 04:57:24 +03:00
for_each_cpu ( tcpu , pt - > pod_cpus [ pt - > cpu_pod [ cpu ] ] )
2023-08-08 04:57:23 +03:00
wq_update_pod ( wq , tcpu , cpu , true ) ;
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A CPU may issue most of the in-flight IOs, so we don't want
to set max_active too low but as soon as we increase max_active a bit, we
can end up with an unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. i.e. the acceptable lowest max_active is higher
than the acceptable highest max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement is decoupled from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we can no longer avoid decoupling max_active enforcement from
pool boundaries. This patch implements a system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. a node with twice as many
online effective CPUs will get twice the portion of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to a higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and a work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
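The per-node split described in the list above can be sketched in plain user-space C. This is only an illustration under stated assumptions: `split_max_active()`, `div_round_up()` and `clamp_int()` are hypothetical helpers, and rounding the proportional share up before clamping it to [min_active, max_active] is an assumption, not something the message spells out. Only the value 8 for WQ_DFL_MIN_ACTIVE comes from the text.

```c
#include <assert.h>

#define WQ_DFL_MIN_ACTIVE 8	/* value quoted in the commit message */

/* Integer division rounding up (assumed rounding mode for this sketch). */
static int div_round_up(int n, int d)
{
	return (n + d - 1) / d;
}

/* Limit v to the inclusive range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Hypothetical helper: node_cpus[i] is the number of online effective CPUs
 * of the workqueue on node i; node_max[i] receives that node's share of
 * max_active.
 */
static void split_max_active(int max_active, const int *node_cpus,
			     int nr_nodes, int *node_max)
{
	/* min_active is the smaller of max_active and WQ_DFL_MIN_ACTIVE */
	int min_active = max_active < WQ_DFL_MIN_ACTIVE ?
			 max_active : WQ_DFL_MIN_ACTIVE;
	int total = 0;
	int i;

	for (i = 0; i < nr_nodes; i++)
		total += node_cpus[i];
	assert(total > 0);	/* at least one online effective CPU */

	for (i = 0; i < nr_nodes; i++) {
		int share = div_round_up(max_active * node_cpus[i], total);

		/* each node gets at least min_active, at most max_active */
		node_max[i] = clamp_int(share, min_active, max_active);
	}
}
```

With max_active = 48 and nodes of 16 and 8 CPUs, the shares come out as 32 and 16. With max_active = 16 spread over four 4-CPU nodes, every node is raised to the floor of 8, so the sum (32) exceeds the configured limit, which is the "higher effective max_active than configured" case mentioned above.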
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
			mutex_lock(&wq->mutex);
			wq_update_node_max_active(wq, -1);
			mutex_unlock(&wq->mutex);
2023-08-08 04:57:23 +03:00
		}
	}
2015-04-02 14:14:39 +03:00
2016-07-13 20:16:29 +03:00
	mutex_unlock(&wq_pool_mutex);
	return 0;
2015-04-02 14:14:39 +03:00
}
2016-07-13 20:16:29 +03:00
int workqueue_offline_cpu(unsigned int cpu)
2015-04-02 14:14:39 +03:00
{
	struct workqueue_struct *wq;
2016-07-13 20:16:29 +03:00
	/* unbinding per-cpu workers should happen on the local CPU */
2017-12-01 17:20:36 +03:00
	if (WARN_ON(cpu != smp_processor_id()))
		return -1;
	unbind_workers(cpu);
2016-07-13 20:16:29 +03:00
2023-08-08 04:57:23 +03:00
	/* update pod affinity of unbound workqueues */
2016-07-13 20:16:29 +03:00
	mutex_lock(&wq_pool_mutex);
2023-08-08 04:57:23 +03:00
	list_for_each_entry(wq, &workqueues, list) {
2023-08-08 04:57:24 +03:00
		struct workqueue_attrs *attrs = wq->unbound_attrs;
		if (attrs) {
			const struct wq_pod_type *pt = wqattrs_pod_type(attrs);
			int tcpu;
2023-08-08 04:57:23 +03:00
2023-08-08 04:57:24 +03:00
			for_each_cpu(tcpu, pt->pod_cpus[pt->cpu_pod[cpu]])
2023-08-08 04:57:23 +03:00
				wq_update_pod(wq, tcpu, cpu, false);
2024-01-29 21:11:25 +03:00
			mutex_lock(&wq->mutex);
			wq_update_node_max_active(wq, cpu);
			mutex_unlock(&wq->mutex);
2023-08-08 04:57:23 +03:00
		}
	}
2016-07-13 20:16:29 +03:00
	mutex_unlock(&wq_pool_mutex);
	return 0;
2015-04-02 14:14:39 +03:00
}
struct work_for_cpu {
	struct work_struct work;
	long (*fn)(void *);
	void *arg;
	long ret;
};
static void work_for_cpu_fn(struct work_struct *work)
{
	struct work_for_cpu *wfc = container_of(work, struct work_for_cpu, work);
	wfc->ret = wfc->fn(wfc->arg);
}
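How work_for_cpu_fn() gets from the embedded work item back to its enclosing struct work_for_cpu can be shown with a small user-space sketch. The simplified `struct work_struct`, the local `container_of()` definition and the helpers `add_one()` and `run_inline()` are assumptions made for illustration; the callback is invoked directly here instead of being queued on a workqueue.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel macro: recover the enclosing struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Minimal stand-in for the kernel's work item (assumption for this sketch). */
struct work_struct {
	void (*func)(struct work_struct *work);
};

struct work_for_cpu {
	struct work_struct work;
	long (*fn)(void *);
	void *arg;
	long ret;
};

/* Same shape as the function above: step from the member to its container. */
static void work_for_cpu_fn(struct work_struct *work)
{
	struct work_for_cpu *wfc = container_of(work, struct work_for_cpu, work);

	wfc->ret = wfc->fn(wfc->arg);
}

/* Hypothetical payload function. */
static long add_one(void *arg)
{
	return *(long *)arg + 1;
}

/* Hypothetical driver: runs the callback inline rather than on a worker. */
static long run_inline(long (*fn)(void *), void *arg)
{
	struct work_for_cpu wfc = { .fn = fn, .arg = arg };

	wfc.work.func = work_for_cpu_fn;
	wfc.work.func(&wfc.work);	/* only &wfc.work is passed around */
	return wfc.ret;
}
```

Only a pointer to the embedded `work` member travels through the callback, so the function pointer, argument and return slot ride along without any extra allocation.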
/**
workqueue: Provide one lock class key per work_on_cpu() callsite
All callers of work_on_cpu() share the same lock class key for all the
functions queued. As a result the workqueue related locking scenario for
a function A may be spuriously accounted as an inversion against the
locking scenario of function B such as in the following model:
	long A(void *arg)
	{
		mutex_lock(&mutex);
		mutex_unlock(&mutex);
	}
	long B(void *arg)
	{
	}
	void launchA(void)
	{
		work_on_cpu(0, A, NULL);
	}
	void launchB(void)
	{
		mutex_lock(&mutex);
		work_on_cpu(1, B, NULL);
		mutex_unlock(&mutex);
	}
launchA and launchB running concurrently have no chance to deadlock.
However the above can be reported by lockdep as a possible locking
inversion because the works containing A() and B() are treated as
belonging to the same locking class.
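One way to give every caller its own lock class is a wrapper macro that declares a static key at each expansion site. The sketch below shows that idea in user-space C under stated assumptions: the stub `work_on_cpu_key()` just records the key and calls the function, `last_key`, `callsite_key` and the pointer-returning `launchA()`/`launchB()` are hypothetical, and the macro relies on the GNU C statement-expression extension (gcc, clang). It is not presented as the kernel's actual macro.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the lockdep key type; only its address matters here. */
struct lock_class_key {
	char unused;
};

/* Records the key passed by the most recent call (for illustration only). */
static struct lock_class_key *last_key;

/* Stub: runs fn immediately instead of queueing it on the given cpu. */
static long work_on_cpu_key(int cpu, long (*fn)(void *), void *arg,
			    struct lock_class_key *key)
{
	(void)cpu;
	last_key = key;
	return fn(arg);
}

/*
 * Each expansion opens a new block scope with its own static key, so two
 * different call sites never share a key.
 */
#define work_on_cpu(_cpu, _fn, _arg)					\
({									\
	static struct lock_class_key callsite_key;			\
	work_on_cpu_key(_cpu, _fn, _arg, &callsite_key);		\
})

static long A(void *arg) { (void)arg; return 1; }
static long B(void *arg) { (void)arg; return 2; }

/* Return the key used so the difference between call sites is observable. */
static struct lock_class_key *launchA(void)
{
	work_on_cpu(0, A, NULL);
	return last_key;
}

static struct lock_class_key *launchB(void)
{
	work_on_cpu(1, B, NULL);
	return last_key;
}
```

Because the key is static, repeated calls from the same site reuse one key, while launchA() and launchB() always see different ones, which is what lets a lock validator keep their dependency chains apart.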
The following shows an existing example of such a spurious lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted
------------------------------------------------------
kworker/0:1/9 is trying to acquire lock:
ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0
but task is already holding lock:
ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
__flush_work+0x83/0x4e0
work_on_cpu+0x97/0xc0
rcu_nocb_cpu_offload+0x62/0xb0
rcu_nocb_toggle+0xd0/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}:
__mutex_lock+0x81/0xc80
rcu_nocb_cpu_deoffload+0x38/0xb0
rcu_nocb_toggle+0x144/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
percpu_down_write+0x31/0x200
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work)
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock((work_completion)(&wfc.work));
                               lock(rcu_state.barrier_mutex);
                               lock((work_completion)(&wfc.work));
  lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by kworker/0:1/9:
#0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500
#1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
stack backtrace:
CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: events work_for_cpu_fn
Call Trace:
rcu-torture: rcu_torture_read_exit: Start of episode
<TASK>
dump_stack_lvl+0x4a/0x80
check_noncircular+0x132/0x150
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
? _cpu_down+0x57/0x2b0
percpu_down_write+0x31/0x200
? _cpu_down+0x57/0x2b0
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
? __pfx_worker_thread+0x10/0x10
kthread+0xe6/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Fix this by providing one lock class key per work_on_cpu() caller.
Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-24 18:07:02 +03:00
 * work_on_cpu_key - run a function in thread context on a particular cpu
2015-04-02 14:14:39 +03:00
 * @cpu: the cpu to run on
 * @fn: the function to run
 * @arg: the function arg
2023-09-24 18:07:02 +03:00
 * @key: The lock class key for lock debugging purposes
2015-04-02 14:14:39 +03:00
 *
 * It is up to the caller to ensure that the cpu doesn't go offline.
 * The caller must not hold any locks which would prevent @fn from completing.
 *
 * Return: The value @fn returns.
 */
2023-09-24 18:07:02 +03:00
long work_on_cpu_key(int cpu, long (*fn)(void *),
		     void *arg, struct lock_class_key *key)
2015-04-02 14:14:39 +03:00
{
	struct work_for_cpu wfc = { .fn = fn, .arg = arg };
2023-09-24 18:07:02 +03:00
	INIT_WORK_ONSTACK_KEY(&wfc.work, work_for_cpu_fn, key);
2015-04-02 14:14:39 +03:00
	schedule_work_on(cpu, &wfc.work);
	flush_work(&wfc.work);
	destroy_work_on_stack(&wfc.work);
	return wfc.ret;
}
2023-09-24 18:07:02 +03:00
EXPORT_SYMBOL_GPL(work_on_cpu_key);
2017-04-12 23:07:28 +03:00
/**
workqueue: Provide one lock class key per work_on_cpu() callsite
All callers of work_on_cpu() share the same lock class key for all the
functions queued. As a result the workqueue related locking scenario for
a function A may be spuriously accounted as an inversion against the
locking scenario of function B such as in the following model:
long A(void *arg)
{
mutex_lock(&mutex);
mutex_unlock(&mutex);
}
long B(void *arg)
{
}
void launchA(void)
{
work_on_cpu(0, A, NULL);
}
void launchB(void)
{
mutex_lock(&mutex);
work_on_cpu(1, B, NULL);
mutex_unlock(&mutex);
}
launchA and launchB running concurrently have no chance to deadlock.
However the above can be reported by lockdep as a possible locking
inversion because the works containing A() and B() are treated as
belonging to the same locking class.
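The false positive described above can be reproduced with a toy dependency graph. The sketch below is not lockdep; it is a minimal userspace model, and every name in it (dep_graph, dep_add, model_reports_inversion, the CLASS_* constants) is hypothetical. It records "acquired while held" edges for the launchA()/launchB() model and checks for a cycle, first with both work items sharing one class and then with one class per work item.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CLASSES 8

/* Toy dependency graph: edge[a][b] means "b was acquired while a was held". */
struct dep_graph {
	bool edge[MAX_CLASSES][MAX_CLASSES];
};

static void dep_add(struct dep_graph *g, int held, int acquired)
{
	assert(held >= 0 && held < MAX_CLASSES);
	assert(acquired >= 0 && acquired < MAX_CLASSES);
	g->edge[held][acquired] = true;
}

/* Depth-first search: can @target be reached starting from @from? */
static bool dep_reaches(const struct dep_graph *g, int from, int target,
			bool *seen)
{
	if (from == target)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_CLASSES; next++) {
		if (g->edge[from][next] && !seen[next] &&
		    dep_reaches(g, next, target, seen))
			return true;
	}
	return false;
}

/* A cycle exists if some edge a -> b has a path leading from b back to a. */
static bool dep_has_cycle(const struct dep_graph *g)
{
	for (int a = 0; a < MAX_CLASSES; a++) {
		for (int b = 0; b < MAX_CLASSES; b++) {
			bool seen[MAX_CLASSES] = { false };

			if (g->edge[a][b] && dep_reaches(g, b, a, seen))
				return true;
		}
	}
	return false;
}

enum { CLASS_MUTEX, CLASS_WORK_A, CLASS_WORK_B };

/*
 * Record the dependencies of the launchA()/launchB() model.
 * @work_a and @work_b are the classes assigned to the two work items.
 */
static bool model_reports_inversion(int work_a, int work_b)
{
	struct dep_graph g;

	memset(&g, 0, sizeof(g));
	/* A() takes the mutex while running as work item A. */
	dep_add(&g, work_a, CLASS_MUTEX);
	/* launchB() waits for work item B while holding the mutex. */
	dep_add(&g, CLASS_MUTEX, work_b);
	return dep_has_cycle(&g);
}
```

Calling model_reports_inversion(CLASS_WORK_A, CLASS_WORK_A) models the shared key and finds a cycle; passing CLASS_WORK_A and CLASS_WORK_B models one key per caller and finds none.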
The following shows an existing example of such a spurious lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted
------------------------------------------------------
kworker/0:1/9 is trying to acquire lock:
ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0
but task is already holding lock:
ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
__flush_work+0x83/0x4e0
work_on_cpu+0x97/0xc0
rcu_nocb_cpu_offload+0x62/0xb0
rcu_nocb_toggle+0xd0/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}:
__mutex_lock+0x81/0xc80
rcu_nocb_cpu_deoffload+0x38/0xb0
rcu_nocb_toggle+0x144/0x1d0
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
percpu_down_write+0x31/0x200
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
kthread+0xe6/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x1b/0x30
other info that might help us debug this:
Chain exists of:
cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&wfc.work));
lock(rcu_state.barrier_mutex);
lock((work_completion)(&wfc.work));
lock(cpu_hotplug_lock);
*** DEADLOCK ***
2 locks held by kworker/0:1/9:
#0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500
#1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500
stack backtrace:
CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: events work_for_cpu_fn
Call Trace:
rcu-torture: rcu_torture_read_exit: Start of episode
<TASK>
dump_stack_lvl+0x4a/0x80
check_noncircular+0x132/0x150
__lock_acquire+0x1538/0x2500
lock_acquire+0xbf/0x2a0
? _cpu_down+0x57/0x2b0
percpu_down_write+0x31/0x200
? _cpu_down+0x57/0x2b0
_cpu_down+0x57/0x2b0
__cpu_down_maps_locked+0x10/0x20
work_for_cpu_fn+0x15/0x20
process_scheduled_works+0x2a7/0x500
worker_thread+0x173/0x330
? __pfx_worker_thread+0x10/0x10
kthread+0xe6/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2f/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Fix this by providing one lock class key per work_on_cpu() caller.
Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-24 18:07:02 +03:00
* work_on_cpu_safe_key - run a function in thread context on a particular cpu
2017-04-12 23:07:28 +03:00
* @ cpu : the cpu to run on
* @ fn : the function to run
* @ arg : the function argument
2023-09-24 18:07:02 +03:00
* @ key : The lock class key for lock debugging purposes
2017-04-12 23:07:28 +03:00
*
* Disables CPU hotplug and calls work_on_cpu ( ) . The caller must not hold
* any locks which would prevent @ fn from completing .
*
* Return : The value @ fn returns .
*/
2023-09-24 18:07:02 +03:00
long work_on_cpu_safe_key ( int cpu , long ( * fn ) ( void * ) ,
void * arg , struct lock_class_key * key )
2017-04-12 23:07:28 +03:00
{
long ret = - ENODEV ;
2021-08-03 17:16:20 +03:00
cpus_read_lock ( ) ;
2017-04-12 23:07:28 +03:00
if ( cpu_online ( cpu ) )
2023-09-24 18:07:02 +03:00
ret = work_on_cpu_key ( cpu , fn , arg , key ) ;
2021-08-03 17:16:20 +03:00
cpus_read_unlock ( ) ;
2017-04-12 23:07:28 +03:00
return ret ;
}
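The @key parameter only helps if each call site supplies a different key. One common way to arrange that, sketched here in userspace, is a wrapper macro that declares a static key object inside a GNU statement expression, so every expansion gets its own address. This is an illustration of the pattern, not a quotation of the kernel header; the toy_* names and the toy_lock_class_key type are hypothetical stand-ins, and the block needs GCC or Clang because of the statement expression.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a lock class key; only its address matters here. */
struct toy_lock_class_key {
	char dummy;
};

/* Records the key seen by the most recent call, for illustration only. */
static const struct toy_lock_class_key *toy_last_key;

static long toy_work_on_cpu_key(int cpu, long (*fn)(void *), void *arg,
				struct toy_lock_class_key *key)
{
	assert(fn != NULL && key != NULL);
	(void)cpu;		/* a real implementation would queue on @cpu */
	toy_last_key = key;
	return fn(arg);
}

/*
 * Each expansion of this macro declares its own static key, so every
 * call site passes a distinct address to toy_work_on_cpu_key().
 */
#define toy_work_on_cpu(cpu, fn, arg)				\
({								\
	static struct toy_lock_class_key key_;			\
	toy_work_on_cpu_key(cpu, fn, arg, &key_);		\
})

static long toy_fn(void *arg)
{
	return *(long *)arg;
}

static const struct toy_lock_class_key *toy_call_site_one(long *v)
{
	(void)toy_work_on_cpu(0, toy_fn, v);
	return toy_last_key;
}

static const struct toy_lock_class_key *toy_call_site_two(long *v)
{
	(void)toy_work_on_cpu(1, toy_fn, v);
	return toy_last_key;
}
```

Repeated calls through the same call site reuse one key, while the two call sites never share a key, which is the property a per-caller lock class relies on.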
2023-09-24 18:07:02 +03:00
EXPORT_SYMBOL_GPL ( work_on_cpu_safe_key ) ;
2015-04-02 14:14:39 +03:00
# endif /* CONFIG_SMP */
# ifdef CONFIG_FREEZER
/**
* freeze_workqueues_begin - begin freezing workqueues
*
* Start freezing workqueues . After this function returns , all freezable
2021-08-17 04:32:34 +03:00
* workqueues will queue new works to their inactive_works list instead of
2015-04-02 14:14:39 +03:00
* pool - > worklist .
*
* CONTEXT :
* Grabs and releases wq_pool_mutex , wq - > mutex and pool - > lock ' s .
*/
void freeze_workqueues_begin ( void )
{
struct workqueue_struct * wq ;
mutex_lock ( & wq_pool_mutex ) ;
WARN_ON_ONCE ( workqueue_freezing ) ;
workqueue_freezing = true ;
list_for_each_entry ( wq , & workqueues , list ) {
mutex_lock ( & wq - > mutex ) ;
2024-01-29 21:11:24 +03:00
wq_adjust_max_active ( wq ) ;
2015-04-02 14:14:39 +03:00
mutex_unlock ( & wq - > mutex ) ;
}
mutex_unlock ( & wq_pool_mutex ) ;
}
/**
* freeze_workqueues_busy - are freezable workqueues still busy ?
*
* Check whether freezing is complete . This function must be called
* between freeze_workqueues_begin ( ) and thaw_workqueues ( ) .
*
* CONTEXT :
* Grabs and releases wq_pool_mutex .
*
* Return :
* % true if some freezable workqueues are still busy . % false if freezing
* is complete .
*/
bool freeze_workqueues_busy ( void )
{
bool busy = false ;
struct workqueue_struct * wq ;
struct pool_workqueue * pwq ;
mutex_lock ( & wq_pool_mutex ) ;
WARN_ON_ONCE ( ! workqueue_freezing ) ;
list_for_each_entry ( wq , & workqueues , list ) {
if ( ! ( wq - > flags & WQ_FREEZABLE ) )
continue ;
/*
* nr_active is monotonically decreasing . It ' s safe
* to peek without lock .
*/
2019-03-13 19:55:47 +03:00
rcu_read_lock ( ) ;
2015-04-02 14:14:39 +03:00
for_each_pwq ( pwq , wq ) {
WARN_ON_ONCE ( pwq - > nr_active < 0 ) ;
if ( pwq - > nr_active ) {
busy = true ;
2019-03-13 19:55:47 +03:00
rcu_read_unlock ( ) ;
2015-04-02 14:14:39 +03:00
goto out_unlock ;
}
}
2019-03-13 19:55:47 +03:00
rcu_read_unlock ( ) ;
2015-04-02 14:14:39 +03:00
}
out_unlock :
mutex_unlock ( & wq_pool_mutex ) ;
return busy ;
}
/**
* thaw_workqueues - thaw workqueues
*
* Thaw workqueues . Normal queueing is restored and all collected
* frozen works are transferred to their respective pool worklists .
*
* CONTEXT :
* Grabs and releases wq_pool_mutex , wq - > mutex and pool - > lock ' s .
*/
void thaw_workqueues ( void )
{
struct workqueue_struct * wq ;
mutex_lock ( & wq_pool_mutex ) ;
if ( ! workqueue_freezing )
goto out_unlock ;
workqueue_freezing = false ;
/* restore max_active and repopulate worklist */
list_for_each_entry ( wq , & workqueues , list ) {
mutex_lock ( & wq - > mutex ) ;
2024-01-29 21:11:24 +03:00
wq_adjust_max_active ( wq ) ;
2015-04-02 14:14:39 +03:00
mutex_unlock ( & wq - > mutex ) ;
}
out_unlock :
mutex_unlock ( & wq_pool_mutex ) ;
}
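The freeze, busy-check and thaw sequence above can be pictured with a small userspace model. It assumes that, while freezing, a freezable workqueue's effective max_active drops to zero so that newly queued items are parked instead of started; that assumption, the single-queue simplification, and all toy_* names are illustrative and not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_wq {
	bool freezable;
	int saved_max_active;	/* limit requested by the user */
	int max_active;		/* limit currently in effect */
	int nr_active;		/* work items running or runnable */
	int nr_inactive;	/* work items parked until there is room */
};

static bool toy_freezing;

/* Move parked items to the active set while the limit allows it. */
static void toy_activate_inactive(struct toy_wq *wq)
{
	while (wq->nr_inactive > 0 && wq->nr_active < wq->max_active) {
		wq->nr_inactive--;
		wq->nr_active++;
	}
}

/* While freezing, a freezable workqueue gets an effective limit of zero. */
static void toy_adjust_max_active(struct toy_wq *wq)
{
	wq->max_active = (wq->freezable && toy_freezing) ?
			 0 : wq->saved_max_active;
	toy_activate_inactive(wq);
}

static void toy_queue_work(struct toy_wq *wq)
{
	if (wq->nr_active < wq->max_active)
		wq->nr_active++;
	else
		wq->nr_inactive++;
}

static void toy_complete_work(struct toy_wq *wq)
{
	assert(wq->nr_active > 0);
	wq->nr_active--;
	toy_activate_inactive(wq);
}

static void toy_freeze_begin(struct toy_wq *wq)
{
	toy_freezing = true;
	toy_adjust_max_active(wq);
}

/* Freezing is complete once no active items remain. */
static bool toy_freeze_busy(const struct toy_wq *wq)
{
	return wq->freezable && wq->nr_active > 0;
}

static void toy_thaw(struct toy_wq *wq)
{
	toy_freezing = false;
	toy_adjust_max_active(wq);
}
```

In this model an item that was already active when freezing began keeps the queue busy until it completes, and items queued during the freeze only start running after toy_thaw().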
# endif /* CONFIG_FREEZER */
2023-01-12 19:14:27 +03:00
static int workqueue_apply_unbound_cpumask ( const cpumask_var_t unbound_cpumask )
2015-04-30 12:16:12 +03:00
{
LIST_HEAD ( ctxs ) ;
int ret = 0 ;
struct workqueue_struct * wq ;
struct apply_wqattrs_ctx * ctx , * n ;
lockdep_assert_held ( & wq_pool_mutex ) ;
list_for_each_entry ( wq , & workqueues , list ) {
2024-02-03 18:43:30 +03:00
if ( ! ( wq - > flags & WQ_UNBOUND ) | | ( wq - > flags & __WQ_DESTROYING ) )
2015-04-30 12:16:12 +03:00
continue ;
2023-01-12 19:14:27 +03:00
ctx = apply_wqattrs_prepare ( wq , wq - > unbound_attrs , unbound_cpumask ) ;
2023-08-08 04:57:24 +03:00
if ( IS_ERR ( ctx ) ) {
ret = PTR_ERR ( ctx ) ;
2015-04-30 12:16:12 +03:00
break ;
}
list_add_tail ( & ctx - > list , & ctxs ) ;
}
list_for_each_entry_safe ( ctx , n , & ctxs , list ) {
if ( ! ret )
apply_wqattrs_commit ( ctx ) ;
apply_wqattrs_cleanup ( ctx ) ;
}
2023-01-12 19:14:27 +03:00
if ( ! ret ) {
mutex_lock ( & wq_pool_attach_mutex ) ;
cpumask_copy ( wq_unbound_cpumask , unbound_cpumask ) ;
mutex_unlock ( & wq_pool_attach_mutex ) ;
}
2015-04-30 12:16:12 +03:00
return ret ;
}
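The function above follows an all-or-nothing shape: every change is prepared first, the prepared changes are committed only if no preparation failed, and cleanup runs in either case. A minimal userspace sketch of that shape follows; the toy_item and toy_ctx types, the fixed-size context array and the injected failure flag are hypothetical and exist only to make the control flow testable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_NR_ITEMS 3

struct toy_item {
	int value;		/* committed state */
	int fail_prepare;	/* nonzero makes the prepare step fail */
};

struct toy_ctx {
	struct toy_item *item;
	int new_value;
	int prepared;
};

static int toy_prepare(struct toy_ctx *ctx, struct toy_item *item,
		       int new_value)
{
	if (item->fail_prepare)
		return -ENOMEM;
	ctx->item = item;
	ctx->new_value = new_value;
	ctx->prepared = 1;
	return 0;
}

static void toy_commit(struct toy_ctx *ctx)
{
	ctx->item->value = ctx->new_value;
}

static void toy_cleanup(struct toy_ctx *ctx)
{
	ctx->item = NULL;
	ctx->prepared = 0;
}

/* Stage every change first; publish them only if all stages succeeded. */
static int toy_apply_all(struct toy_item *items, size_t n, int new_value)
{
	struct toy_ctx ctxs[TOY_NR_ITEMS];
	int ret = 0;

	assert(n <= TOY_NR_ITEMS);
	memset(ctxs, 0, sizeof(ctxs));

	for (size_t i = 0; i < n; i++) {
		ret = toy_prepare(&ctxs[i], &items[i], new_value);
		if (ret)
			break;
	}

	for (size_t i = 0; i < n; i++) {
		if (!ctxs[i].prepared)
			continue;
		if (!ret)
			toy_commit(&ctxs[i]);
		toy_cleanup(&ctxs[i]);
	}
	return ret;
}
```

If any item fails to prepare, none of the items change value, so callers never observe a half-applied update.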
2023-10-25 21:25:52 +03:00
/**
* workqueue_unbound_exclude_cpumask - Exclude given CPUs from unbound cpumask
* @ exclude_cpumask : the cpumask to be excluded from wq_unbound_cpumask
*
* This function can be called from cpuset code to provide a set of isolated
* CPUs that should be excluded from wq_unbound_cpumask . The caller must hold
* either cpus_read_lock or cpus_write_lock .
*/
int workqueue_unbound_exclude_cpumask ( cpumask_var_t exclude_cpumask )
{
cpumask_var_t cpumask ;
int ret = 0 ;
if ( ! zalloc_cpumask_var ( & cpumask , GFP_KERNEL ) )
return - ENOMEM ;
lockdep_assert_cpus_held ( ) ;
mutex_lock ( & wq_pool_mutex ) ;
/* Save the current isolated cpumask & export it via sysfs */
cpumask_copy ( wq_isolated_cpumask , exclude_cpumask ) ;
/*
* If the operation fails , it will fall back to
* wq_requested_unbound_cpumask which is initially set to
* ( HK_TYPE_WQ ∩ HK_TYPE_DOMAIN ) house keeping mask and rewritten
* by any subsequent write to workqueue / cpumask sysfs file .
*/
if ( ! cpumask_andnot ( cpumask , wq_requested_unbound_cpumask , exclude_cpumask ) )
cpumask_copy ( cpumask , wq_requested_unbound_cpumask ) ;
if ( ! cpumask_equal ( cpumask , wq_unbound_cpumask ) )
ret = workqueue_apply_unbound_cpumask ( cpumask ) ;
mutex_unlock ( & wq_pool_mutex ) ;
free_cpumask_var ( cpumask ) ;
return ret ;
}
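The mask computation in the function above reduces to "requested minus excluded, unless that leaves nothing, in which case fall back to requested". A minimal sketch using a 64-bit integer in place of a cpumask is shown below; the toy_cpumask_t type and the function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per CPU; a toy replacement for a cpumask. */
typedef uint64_t toy_cpumask_t;

/*
 * Compute the effective unbound mask: the requested CPUs minus the
 * excluded ones, or the requested mask unchanged when nothing would remain.
 */
static toy_cpumask_t toy_effective_unbound_mask(toy_cpumask_t requested,
						toy_cpumask_t exclude)
{
	toy_cpumask_t mask = requested & ~exclude;

	assert(requested != 0);
	return mask ? mask : requested;
}
```

With CPUs 0-3 requested, excluding CPUs 0-1 leaves CPUs 2-3, while excluding all four falls back to the full requested set rather than producing an empty mask.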
2023-08-08 04:57:24 +03:00
static int parse_affn_scope ( const char * val )
{
int i ;
for ( i = 0 ; i < ARRAY_SIZE ( wq_affn_names ) ; i + + ) {
if ( ! strncasecmp ( val , wq_affn_names [ i ] , strlen ( wq_affn_names [ i ] ) ) )
return i ;
}
return - EINVAL ;
}
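Because the comparison in parse_affn_scope() is limited to strlen() of each candidate name, a name matches whenever it is a case-insensitive prefix of the input, which is why a trailing newline from a sysfs or module-parameter write does not matter. The userspace sketch below shows that behaviour; the contents of the name table are an assumption made for illustration, since the real wq_affn_names array is defined outside this section, and the toy_* identifiers are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strncasecmp() (POSIX) */

/* Assumed scope names, for illustration; the index is the scope id. */
static const char *const toy_affn_names[] = {
	"default", "cpu", "smt", "cache", "numa", "system",
};

#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Return the index of the first name that is a case-insensitive prefix
 * of @val, or -EINVAL. Trailing characters such as '\n' are ignored
 * because only strlen(name) bytes are compared.
 */
static int toy_parse_affn_scope(const char *val)
{
	assert(val != NULL);
	for (size_t i = 0; i < TOY_ARRAY_SIZE(toy_affn_names); i++) {
		const char *name = toy_affn_names[i];

		if (!strncasecmp(val, name, strlen(name)))
			return (int)i;
	}
	return -EINVAL;
}
```

A side effect of prefix matching is that an input such as "cpus" is accepted as "cpu", whereas a truncated input such as "ca" is rejected.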
static int wq_affn_dfl_set ( const char * val , const struct kernel_param * kp )
{
2023-08-08 04:57:25 +03:00
struct workqueue_struct * wq ;
int affn , cpu ;
2023-08-08 04:57:24 +03:00
affn = parse_affn_scope ( val ) ;
if ( affn < 0 )
return affn ;
2023-08-08 04:57:25 +03:00
if ( affn = = WQ_AFFN_DFL )
return - EINVAL ;
cpus_read_lock ( ) ;
mutex_lock ( & wq_pool_mutex ) ;
2023-08-08 04:57:24 +03:00
wq_affn_dfl = affn ;
2023-08-08 04:57:25 +03:00
list_for_each_entry ( wq , & workqueues , list ) {
for_each_online_cpu ( cpu ) {
wq_update_pod ( wq , cpu , cpu , true ) ;
}
}
mutex_unlock ( & wq_pool_mutex ) ;
cpus_read_unlock ( ) ;
2023-08-08 04:57:24 +03:00
return 0 ;
}
static int wq_affn_dfl_get ( char * buffer , const struct kernel_param * kp )
{
return scnprintf ( buffer , PAGE_SIZE , " %s \n " , wq_affn_names [ wq_affn_dfl ] ) ;
}
static const struct kernel_param_ops wq_affn_dfl_ops = {
. set = wq_affn_dfl_set ,
. get = wq_affn_dfl_get ,
} ;
module_param_cb ( default_affinity_scope , & wq_affn_dfl_ops , NULL , 0644 ) ;
2015-04-02 14:14:39 +03:00
# ifdef CONFIG_SYSFS
/*
* Workqueues with the WQ_SYSFS flag set are visible to userland via
* / sys / bus / workqueue / devices / WQ_NAME . All visible workqueues have the
* following attributes .
*
2023-08-08 04:57:24 +03:00
* per_cpu RO bool : whether the workqueue is per - cpu or unbound
* max_active RW int : maximum number of in - flight work items
2015-04-02 14:14:39 +03:00
*
* Unbound workqueues have the following extra attributes .
*
2023-08-08 04:57:24 +03:00
* nice RW int : nice value of the workers
* cpumask RW mask : bitmask of allowed CPUs for the workers
* affinity_scope RW str : worker CPU affinity scope ( cache , numa , none )
workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools,
each serving one L3 cache domain.
While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.
While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.
This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside, i.e. work items start executing within their affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.
After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:
* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
->__pod_cpumask so that the workers are allowed to run on any CPU that
the associated workqueues allow.
* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
the field to a CPU within the pod.
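The second rule can be sketched in userspace as a pure function over a bitmask. This is only a model of the decision described in the list above: the toy_* names are hypothetical, and picking the lowest-numbered CPU in the pod is a simplification chosen here, not something the text specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU; a toy replacement for a cpumask. */
typedef uint64_t toy_cpumask_t;

static bool toy_cpu_in_mask(int cpu, toy_cpumask_t mask)
{
	return cpu >= 0 && cpu < 64 && ((mask >> cpu) & 1);
}

/* Lowest-numbered CPU in @mask; the choice of CPU is a simplification. */
static int toy_first_cpu(toy_cpumask_t mask)
{
	assert(mask != 0);
	for (int cpu = 0; cpu < 64; cpu++) {
		if ((mask >> cpu) & 1)
			return cpu;
	}
	return -1;	/* unreachable given the assertion above */
}

/*
 * Decide where an idle worker should be woken. With strict affinity the
 * hint is left alone; otherwise a hint outside the pod is replaced by a
 * CPU inside it, after which the scheduler remains free to migrate.
 */
static int toy_choose_wake_cpu(int wake_cpu, toy_cpumask_t pod_mask,
			       bool affn_strict)
{
	if (affn_strict || toy_cpu_in_mask(wake_cpu, pod_mask))
		return wake_cpu;
	return toy_first_cpu(pod_mask);
}
```

For a pod covering CPUs 4-7, a worker last seen on CPU 2 is steered to CPU 4 in this model, while a worker already on CPU 5 is left where it is.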
This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.
There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.
While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.
v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-08 04:57:25 +03:00
* affinity_strict RW bool : worker CPU affinity is strict
2015-04-02 14:14:39 +03:00
*/
struct wq_device {
struct workqueue_struct * wq ;
struct device dev ;
} ;
static struct workqueue_struct * dev_to_wq ( struct device * dev )
{
struct wq_device * wq_dev = container_of ( dev , struct wq_device , dev ) ;
return wq_dev - > wq ;
}
static ssize_t per_cpu_show ( struct device * dev , struct device_attribute * attr ,
char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
return scnprintf ( buf , PAGE_SIZE , " %d \n " , ( bool ) ! ( wq - > flags & WQ_UNBOUND ) ) ;
}
static DEVICE_ATTR_RO ( per_cpu ) ;
static ssize_t max_active_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
return scnprintf ( buf , PAGE_SIZE , " %d \n " , wq - > saved_max_active ) ;
}
static ssize_t max_active_store ( struct device * dev ,
struct device_attribute * attr , const char * buf ,
size_t count )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
int val ;
if ( sscanf ( buf , " %d " , & val ) ! = 1 | | val < = 0 )
return - EINVAL ;
workqueue_set_max_active ( wq , val ) ;
return count ;
}
static DEVICE_ATTR_RW ( max_active ) ;
static struct attribute * wq_sysfs_attrs [ ] = {
& dev_attr_per_cpu . attr ,
& dev_attr_max_active . attr ,
NULL ,
} ;
ATTRIBUTE_GROUPS ( wq_sysfs ) ;
2023-11-21 05:18:40 +03:00
static void apply_wqattrs_lock ( void )
{
/* CPUs should stay stable across pwq creations and installations */
cpus_read_lock ( ) ;
mutex_lock ( & wq_pool_mutex ) ;
}
static void apply_wqattrs_unlock ( void )
{
mutex_unlock ( & wq_pool_mutex ) ;
cpus_read_unlock ( ) ;
}
2015-04-02 14:14:39 +03:00
static ssize_t wq_nice_show ( struct device * dev , struct device_attribute * attr ,
char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
int written ;
mutex_lock ( & wq - > mutex ) ;
written = scnprintf ( buf , PAGE_SIZE , " %d \n " , wq - > unbound_attrs - > nice ) ;
mutex_unlock ( & wq - > mutex ) ;
return written ;
}
/* prepare workqueue_attrs for sysfs store operations */
static struct workqueue_attrs * wq_sysfs_prep_attrs ( struct workqueue_struct * wq )
{
struct workqueue_attrs * attrs ;
2015-05-20 09:41:18 +03:00
lockdep_assert_held ( & wq_pool_mutex ) ;
2019-06-26 17:52:38 +03:00
attrs = alloc_workqueue_attrs ( ) ;
2015-04-02 14:14:39 +03:00
if ( ! attrs )
return NULL ;
copy_workqueue_attrs ( attrs , wq - > unbound_attrs ) ;
return attrs ;
}
static ssize_t wq_nice_store ( struct device * dev , struct device_attribute * attr ,
const char * buf , size_t count )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
struct workqueue_attrs * attrs ;
2015-05-19 13:03:48 +03:00
int ret = - ENOMEM ;
apply_wqattrs_lock ( ) ;
2015-04-02 14:14:39 +03:00
attrs = wq_sysfs_prep_attrs ( wq ) ;
if ( ! attrs )
2015-05-19 13:03:48 +03:00
goto out_unlock ;
2015-04-02 14:14:39 +03:00
if ( sscanf ( buf , " %d " , & attrs - > nice ) = = 1 & &
attrs - > nice > = MIN_NICE & & attrs - > nice < = MAX_NICE )
2015-05-19 13:03:48 +03:00
ret = apply_workqueue_attrs_locked ( wq , attrs ) ;
2015-04-02 14:14:39 +03:00
else
ret = - EINVAL ;
2015-05-19 13:03:48 +03:00
out_unlock :
apply_wqattrs_unlock ( ) ;
2015-04-02 14:14:39 +03:00
free_workqueue_attrs ( attrs ) ;
return ret ? : count ;
}
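The validation and return convention of wq_nice_store() can be modelled in userspace roughly as below. This is a sketch under assumptions: nice_store() and apply_nice() are hypothetical names, apply_nice() stands in for apply_workqueue_attrs_locked() and always succeeds, and the locking and attrs allocation are left out. MIN_NICE and MAX_NICE are defined locally with the usual -20..19 range.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>	/* ssize_t */

#define MIN_NICE (-20)
#define MAX_NICE 19

/* Hypothetical stand-in for apply_workqueue_attrs_locked(); always succeeds. */
static int apply_nice(int nice)
{
	(void)nice;
	return 0;
}

/* Return the number of bytes consumed on success, a negative errno otherwise. */
static ssize_t nice_store(const char *buf, size_t count)
{
	int nice, ret;

	if (sscanf(buf, "%d", &nice) == 1 &&
	    nice >= MIN_NICE && nice <= MAX_NICE)
		ret = apply_nice(nice);
	else
		ret = -EINVAL;

	/* Same meaning as the GNU "ret ?: count" shorthand used in the file. */
	return ret ? ret : (ssize_t)count;
}
```

A sysfs store handler reports success by returning the byte count it was given, which is why a zero ret is replaced by count.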
static ssize_t wq_cpumask_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
int written ;
mutex_lock ( & wq - > mutex ) ;
written = scnprintf ( buf , PAGE_SIZE , " %*pb \n " ,
cpumask_pr_args ( wq - > unbound_attrs - > cpumask ) ) ;
mutex_unlock ( & wq - > mutex ) ;
return written ;
}
static ssize_t wq_cpumask_store ( struct device * dev ,
struct device_attribute * attr ,
const char * buf , size_t count )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
struct workqueue_attrs * attrs ;
2015-05-19 13:03:48 +03:00
int ret = - ENOMEM ;
apply_wqattrs_lock ( ) ;
2015-04-02 14:14:39 +03:00
attrs = wq_sysfs_prep_attrs ( wq ) ;
if ( ! attrs )
2015-05-19 13:03:48 +03:00
goto out_unlock ;
2015-04-02 14:14:39 +03:00
ret = cpumask_parse ( buf , attrs - > cpumask ) ;
if ( ! ret )
2015-05-19 13:03:48 +03:00
ret = apply_workqueue_attrs_locked ( wq , attrs ) ;
2015-04-02 14:14:39 +03:00
2015-05-19 13:03:48 +03:00
out_unlock :
apply_wqattrs_unlock ( ) ;
2015-04-02 14:14:39 +03:00
free_workqueue_attrs ( attrs ) ;
return ret ? : count ;
}
2023-08-08 04:57:24 +03:00
static ssize_t wq_affn_scope_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
int written ;
mutex_lock ( & wq - > mutex ) ;
2023-08-08 04:57:25 +03:00
if ( wq - > unbound_attrs - > affn_scope = = WQ_AFFN_DFL )
written = scnprintf ( buf , PAGE_SIZE , " %s (%s) \n " ,
wq_affn_names [ WQ_AFFN_DFL ] ,
wq_affn_names [ wq_affn_dfl ] ) ;
else
written = scnprintf ( buf , PAGE_SIZE , " %s \n " ,
wq_affn_names [ wq - > unbound_attrs - > affn_scope ] ) ;
2023-08-08 04:57:24 +03:00
mutex_unlock ( & wq - > mutex ) ;
return written ;
}
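The two output shapes produced by wq_affn_scope_show() can be illustrated with a small userspace sketch. The enum order and the name table below are assumptions made for the example (the real definitions live elsewhere in the file), and format_affn_scope() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed scope list for illustration only. */
enum wq_affn_scope {
	WQ_AFFN_DFL, WQ_AFFN_CPU, WQ_AFFN_SMT,
	WQ_AFFN_CACHE, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM,
	WQ_AFFN_NR_TYPES,
};

static const char *const wq_affn_names[WQ_AFFN_NR_TYPES] = {
	[WQ_AFFN_DFL]    = "default",
	[WQ_AFFN_CPU]    = "cpu",
	[WQ_AFFN_SMT]    = "smt",
	[WQ_AFFN_CACHE]  = "cache",
	[WQ_AFFN_NUMA]   = "numa",
	[WQ_AFFN_SYSTEM] = "system",
};

/*
 * When the workqueue follows the system default, print "default" plus
 * the scope the default currently resolves to; otherwise print the
 * explicitly selected scope on its own.
 */
static int format_affn_scope(char *buf, size_t size,
			     enum wq_affn_scope scope, enum wq_affn_scope dfl)
{
	if (scope == WQ_AFFN_DFL)
		return snprintf(buf, size, "%s (%s)\n",
				wq_affn_names[WQ_AFFN_DFL], wq_affn_names[dfl]);
	return snprintf(buf, size, "%s\n", wq_affn_names[scope]);
}
```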
static ssize_t wq_affn_scope_store ( struct device * dev ,
struct device_attribute * attr ,
const char * buf , size_t count )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
struct workqueue_attrs * attrs ;
int affn , ret = - ENOMEM ;
affn = parse_affn_scope ( buf ) ;
if ( affn < 0 )
return affn ;
apply_wqattrs_lock ( ) ;
attrs = wq_sysfs_prep_attrs ( wq ) ;
if ( attrs ) {
attrs - > affn_scope = affn ;
ret = apply_workqueue_attrs_locked ( wq , attrs ) ;
}
apply_wqattrs_unlock ( ) ;
free_workqueue_attrs ( attrs ) ;
return ret ? : count ;
}
workqueue: Implement non-strict affinity scope for unbound workqueues
An unbound workqueue can be served by multiple worker_pools to improve
locality. The segmentation is achieved by grouping CPUs into pods. By
default, the cache boundaries according to cpus_share_cache() define how the
CPUs are grouped. Let's say a workqueue is allowed to run on all CPUs and the
system has two L3 caches. The workqueue would be mapped to two worker_pools,
each serving one L3 cache domain.
While this improves locality, because the pod boundaries are strict, it
limits the total bandwidth a given issuer can consume. For example, let's
say there is a thread pinned to a CPU issuing enough work items to saturate
the whole machine. With the machine segmented into two pods, no matter how
many work items it issues, it can only use half of the CPUs on the system.
While this limitation has existed for a very long time, it wasn't very
pronounced because the affinity grouping used to be always by NUMA nodes.
With cache boundaries as the default and support for even finer grained
scopes (smt and cpu), it is now a much more pressing problem.
This patch implements non-strict affinity scope where the pod boundaries
aren't enforced strictly. Going back to the previous example, the workqueue
would still be mapped to two worker_pools; however, the affinity enforcement
would be soft. The workers in both pools would have their cpus_allowed set
to the whole machine thus allowing the scheduler to migrate them anywhere on
the machine. However, whenever an idle worker is woken up, the workqueue
code asks the scheduler to bring back the task within the pod if the worker
is outside, i.e. work items start executing within their affinity scope but can
be migrated outside as the scheduler sees fit. This removes the hard cap on
utilization while maintaining the benefits of affinity scopes.
After the earlier ->__pod_cpumask changes, the implementation is pretty
simple. When non-strict, which is the new default:
* pool_allowed_cpus() returns @pool->attrs->cpumask instead of
->__pod_cpumask so that the workers are allowed to run on any CPU that
the associated workqueues allow.
* If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets
the field to a CPU within the pod.
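The two bullets above can be sketched with plain bitmasks. This is a toy model, not the kernel code: cpumask_t is a 64-bit integer standing in for struct cpumask, pick_wake_cpu() is a hypothetical name for the ->wake_cpu adjustment made in kick_pool(), and choosing the lowest CPU in the pod is a simplification of however the kernel selects one.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per CPU; a toy replacement for struct cpumask (up to 64 CPUs). */
typedef uint64_t cpumask_t;

static int cpu_in_mask(int cpu, cpumask_t mask)
{
	return cpu >= 0 && cpu < 64 && ((mask >> cpu) & 1);
}

/* Lowest CPU set in the mask, or -1 if the mask is empty. */
static int first_cpu(cpumask_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if ((mask >> cpu) & 1)
			return cpu;
	return -1;
}

/* Strict: workers are confined to the pod. Non-strict: the whole wq mask. */
static cpumask_t pool_allowed_cpus(cpumask_t wq_mask, cpumask_t pod_mask,
				   int strict)
{
	return strict ? pod_mask : wq_mask;
}

/*
 * Non-strict only: if the idle worker last ran outside the pod, steer its
 * next wakeup back into the pod. The scheduler may still migrate it later.
 */
static int pick_wake_cpu(int wake_cpu, cpumask_t pod_mask, int strict)
{
	int cpu;

	if (strict || cpu_in_mask(wake_cpu, pod_mask))
		return wake_cpu;

	cpu = first_cpu(pod_mask);
	return cpu >= 0 ? cpu : wake_cpu;
}
```

With an 8-CPU machine split into pods 0x0F and 0xF0, a non-strict worker of the second pod that last ran on CPU 2 would be woken on CPU 4 in this model, while its allowed mask remains the full 0xFF.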
This would be the first use of task_struct->wake_cpu outside scheduler
proper, so it isn't clear whether this would be acceptable. However, other
methods of migrating tasks are significantly more expensive and are likely
prohibitively so if we want to do this on every work item. This needs
discussion with scheduler folks.
There is also a race window where setting ->wake_cpu wouldn't be effective
as the target task is still on CPU. However, the window is pretty small and
this being a best-effort optimization, it doesn't seem to warrant more
complexity at the moment.
While the non-strict cache affinity scopes seem to be the best option, the
performance picture interacts with the affinity scope and is a bit
complicated to fully discuss in this patch, so the behavior is made easily
selectable through wqattrs and sysfs and the next patch will add
documentation to discuss performance implications.
v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-08 04:57:25 +03:00
static ssize_t wq_affinity_strict_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
return scnprintf ( buf , PAGE_SIZE , " %d \n " ,
wq - > unbound_attrs - > affn_strict ) ;
}
static ssize_t wq_affinity_strict_store ( struct device * dev ,
struct device_attribute * attr ,
const char * buf , size_t count )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
struct workqueue_attrs * attrs ;
int v , ret = - ENOMEM ;
if ( sscanf ( buf , " %d " , & v ) ! = 1 )
return - EINVAL ;
apply_wqattrs_lock ( ) ;
attrs = wq_sysfs_prep_attrs ( wq ) ;
if ( attrs ) {
attrs - > affn_strict = ( bool ) v ;
ret = apply_workqueue_attrs_locked ( wq , attrs ) ;
}
apply_wqattrs_unlock ( ) ;
free_workqueue_attrs ( attrs ) ;
return ret ? : count ;
}
2015-04-02 14:14:39 +03:00
static struct device_attribute wq_sysfs_unbound_attrs [ ] = {
__ATTR ( nice , 0644 , wq_nice_show , wq_nice_store ) ,
__ATTR ( cpumask , 0644 , wq_cpumask_show , wq_cpumask_store ) ,
2023-08-08 04:57:24 +03:00
__ATTR ( affinity_scope , 0644 , wq_affn_scope_show , wq_affn_scope_store ) ,
2023-08-08 04:57:25 +03:00
__ATTR ( affinity_strict , 0644 , wq_affinity_strict_show , wq_affinity_strict_store ) ,
2015-04-02 14:14:39 +03:00
__ATTR_NULL ,
} ;
2009-01-17 02:31:15 +03:00
2024-02-06 21:05:06 +03:00
static const struct bus_type wq_subsys = {
2015-04-02 14:14:39 +03:00
. name = " workqueue " ,
. dev_groups = wq_sysfs_groups ,
2008-11-05 05:39:10 +03:00
} ;
2023-11-21 05:18:40 +03:00
/**
* workqueue_set_unbound_cpumask - Set the low - level unbound cpumask
* @ cpumask : the cpumask to set
*
* The low - level workqueues cpumask is a global cpumask that limits
* the affinity of all unbound workqueues . This function checks @ cpumask ,
* applies it to all unbound workqueues and updates all of their pwqs .
*
* Return : 0 - Success
* - EINVAL - Invalid @ cpumask
* - ENOMEM - Failed to allocate memory for attrs or pwqs .
*/
static int workqueue_set_unbound_cpumask ( cpumask_var_t cpumask )
{
int ret = - EINVAL ;
/*
* Not excluding isolated cpus on purpose .
* If the user wishes to include them , we allow that .
*/
cpumask_and ( cpumask , cpumask , cpu_possible_mask ) ;
if ( ! cpumask_empty ( cpumask ) ) {
apply_wqattrs_lock ( ) ;
cpumask_copy ( wq_requested_unbound_cpumask , cpumask ) ;
if ( cpumask_equal ( cpumask , wq_unbound_cpumask ) ) {
ret = 0 ;
goto out_unlock ;
}
ret = workqueue_apply_unbound_cpumask ( cpumask ) ;
out_unlock :
apply_wqattrs_unlock ( ) ;
}
return ret ;
}
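The control flow of workqueue_set_unbound_cpumask() can be modelled in userspace as below. This is a sketch under assumptions: struct wq_masks and its field names are hypothetical stand-ins for the global masks, the apply step is reduced to a plain assignment plus a counter, and the locking is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint64_t cpumask_t;	/* one bit per CPU, toy model */

struct wq_masks {
	cpumask_t possible;	/* stands in for cpu_possible_mask */
	cpumask_t requested;	/* stands in for wq_requested_unbound_cpumask */
	cpumask_t effective;	/* stands in for wq_unbound_cpumask */
	int applied;		/* counts how often the apply step ran */
};

static int set_unbound_cpumask(struct wq_masks *m, cpumask_t mask)
{
	/* Drop CPUs that can never exist; an empty result is rejected. */
	mask &= m->possible;
	if (!mask)
		return -EINVAL;

	/* The request is recorded even when nothing needs to change. */
	m->requested = mask;
	if (mask == m->effective)
		return 0;

	/* Simplified stand-in for workqueue_apply_unbound_cpumask(). */
	m->effective = mask;
	m->applied++;
	return 0;
}
```

Note that an invalid request leaves the recorded request untouched, and a request equal to the current effective mask succeeds without re-applying anything.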
2023-10-25 21:25:52 +03:00
static ssize_t __wq_cpumask_show ( struct device * dev ,
struct device_attribute * attr , char * buf , cpumask_var_t mask )
2015-04-27 12:58:39 +03:00
{
int written ;
2015-04-30 12:16:12 +03:00
mutex_lock ( & wq_pool_mutex ) ;
2023-10-25 21:25:52 +03:00
written = scnprintf ( buf , PAGE_SIZE , " %*pb \n " , cpumask_pr_args ( mask ) ) ;
2015-04-30 12:16:12 +03:00
mutex_unlock ( & wq_pool_mutex ) ;
2015-04-27 12:58:39 +03:00
return written ;
}
2024-03-08 08:39:32 +03:00
static ssize_t cpumask_requested_show ( struct device * dev ,
2023-10-25 21:25:52 +03:00
struct device_attribute * attr , char * buf )
{
2024-03-08 08:39:32 +03:00
return __wq_cpumask_show ( dev , attr , buf , wq_requested_unbound_cpumask ) ;
2023-10-25 21:25:52 +03:00
}
2024-03-08 08:39:32 +03:00
static DEVICE_ATTR_RO ( cpumask_requested ) ;
2023-10-25 21:25:52 +03:00
2024-03-08 08:39:32 +03:00
static ssize_t cpumask_isolated_show ( struct device * dev ,
2023-10-25 21:25:52 +03:00
struct device_attribute * attr , char * buf )
{
2024-03-08 08:39:32 +03:00
return __wq_cpumask_show ( dev , attr , buf , wq_isolated_cpumask ) ;
2023-10-25 21:25:52 +03:00
}
2024-03-08 08:39:32 +03:00
static DEVICE_ATTR_RO ( cpumask_isolated ) ;
2023-10-25 21:25:52 +03:00
2024-03-08 08:39:32 +03:00
static ssize_t cpumask_show ( struct device * dev ,
2023-10-25 21:25:52 +03:00
struct device_attribute * attr , char * buf )
{
2024-03-08 08:39:32 +03:00
return __wq_cpumask_show ( dev , attr , buf , wq_unbound_cpumask ) ;
2023-10-25 21:25:52 +03:00
}
2024-03-08 08:39:32 +03:00
static ssize_t cpumask_store ( struct device * dev ,
2015-04-30 12:16:12 +03:00
struct device_attribute * attr , const char * buf , size_t count )
{
cpumask_var_t cpumask ;
int ret ;
if ( ! zalloc_cpumask_var ( & cpumask , GFP_KERNEL ) )
return - ENOMEM ;
ret = cpumask_parse ( buf , cpumask ) ;
if ( ! ret )
ret = workqueue_set_unbound_cpumask ( cpumask ) ;
free_cpumask_var ( cpumask ) ;
return ret ? ret : count ;
}
2024-03-08 08:39:32 +03:00
static DEVICE_ATTR_RW ( cpumask ) ;
2015-04-30 12:16:12 +03:00
2024-03-08 08:39:32 +03:00
static struct attribute * wq_sysfs_cpumask_attrs [ ] = {
& dev_attr_cpumask . attr ,
& dev_attr_cpumask_requested . attr ,
& dev_attr_cpumask_isolated . attr ,
NULL ,
2023-10-25 21:25:52 +03:00
} ;
2024-03-08 08:39:32 +03:00
ATTRIBUTE_GROUPS ( wq_sysfs_cpumask ) ;
2015-04-27 12:58:39 +03:00
2015-04-02 14:14:39 +03:00
static int __init wq_sysfs_init ( void )
2008-11-05 05:39:10 +03:00
{
2024-03-08 08:39:32 +03:00
return subsys_virtual_register ( & wq_subsys , wq_sysfs_cpumask_groups ) ;
2008-11-05 05:39:10 +03:00
}
2015-04-02 14:14:39 +03:00
core_initcall ( wq_sysfs_init ) ;
2008-11-05 05:39:10 +03:00
2015-04-02 14:14:39 +03:00
static void wq_device_release ( struct device * dev )
2008-11-05 05:39:10 +03:00
{
2015-04-02 14:14:39 +03:00
struct wq_device * wq_dev = container_of ( dev , struct wq_device , dev ) ;
2009-04-09 19:50:37 +04:00
2015-04-02 14:14:39 +03:00
kfree ( wq_dev ) ;
2008-11-05 05:39:10 +03:00
}
2010-06-29 12:07:12 +04:00
/**
2015-04-02 14:14:39 +03:00
* workqueue_sysfs_register - make a workqueue visible in sysfs
* @ wq : the workqueue to register
2010-06-29 12:07:12 +04:00
*
2015-04-02 14:14:39 +03:00
* Expose @ wq in sysfs under / sys / bus / workqueue / devices .
* alloc_workqueue * ( ) automatically calls this function if WQ_SYSFS is set
* which is the preferred method .
2010-06-29 12:07:12 +04:00
*
2015-04-02 14:14:39 +03:00
* A workqueue user should call this function directly iff it wants to apply
* workqueue_attrs before making the workqueue visible in sysfs ; otherwise ,
* apply_workqueue_attrs ( ) may race against userland updating the
* attributes .
*
* Return : 0 on success , - errno on failure .
2010-06-29 12:07:12 +04:00
*/
2015-04-02 14:14:39 +03:00
int workqueue_sysfs_register ( struct workqueue_struct * wq )
2010-06-29 12:07:12 +04:00
{
2015-04-02 14:14:39 +03:00
struct wq_device * wq_dev ;
int ret ;
2010-06-29 12:07:12 +04:00
2015-04-02 14:14:39 +03:00
/*
2024-02-08 22:12:20 +03:00
* Adjusting max_active breaks ordering guarantee . Disallow exposing
* ordered workqueues .
2015-04-02 14:14:39 +03:00
*/
2024-02-06 03:19:10 +03:00
if ( WARN_ON ( wq - > flags & __WQ_ORDERED ) )
2015-04-02 14:14:39 +03:00
return - EINVAL ;
2010-06-29 12:07:12 +04:00
2015-04-02 14:14:39 +03:00
wq - > wq_dev = wq_dev = kzalloc ( sizeof ( * wq_dev ) , GFP_KERNEL ) ;
if ( ! wq_dev )
return - ENOMEM ;
2013-03-14 06:47:40 +04:00
2015-04-02 14:14:39 +03:00
wq_dev - > wq = wq ;
wq_dev - > dev . bus = & wq_subsys ;
wq_dev - > dev . release = wq_device_release ;
2016-02-17 23:04:41 +03:00
dev_set_name ( & wq_dev - > dev , " %s " , wq - > name ) ;
2010-06-29 12:07:12 +04:00
2015-04-02 14:14:39 +03:00
/*
* unbound_attrs are created separately . Suppress uevent until
* everything is ready .
*/
dev_set_uevent_suppress ( & wq_dev - > dev , true ) ;
2010-06-29 12:07:12 +04:00
2015-04-02 14:14:39 +03:00
ret = device_register ( & wq_dev - > dev ) ;
if ( ret ) {
2018-03-06 13:05:43 +03:00
put_device ( & wq_dev - > dev ) ;
2015-04-02 14:14:39 +03:00
wq - > wq_dev = NULL ;
return ret ;
}
2010-06-29 12:07:12 +04:00
2015-04-02 14:14:39 +03:00
if ( wq - > flags & WQ_UNBOUND ) {
struct device_attribute * attr ;
2010-06-29 12:07:12 +04:00
2015-04-02 14:14:39 +03:00
for ( attr = wq_sysfs_unbound_attrs ; attr - > attr . name ; attr + + ) {
ret = device_create_file ( & wq_dev - > dev , attr ) ;
if ( ret ) {
device_unregister ( & wq_dev - > dev ) ;
wq - > wq_dev = NULL ;
return ret ;
2010-06-29 12:07:12 +04:00
}
}
}
2015-04-02 14:14:39 +03:00
dev_set_uevent_suppress ( & wq_dev - > dev , false ) ;
kobject_uevent ( & wq_dev - > dev . kobj , KOBJ_ADD ) ;
return 0 ;
2010-06-29 12:07:12 +04:00
}
/**
2015-04-02 14:14:39 +03:00
* workqueue_sysfs_unregister - undo workqueue_sysfs_register ( )
* @ wq : the workqueue to unregister
2010-06-29 12:07:12 +04:00
*
2015-04-02 14:14:39 +03:00
* If @ wq is registered to sysfs by workqueue_sysfs_register ( ) , unregister .
2010-06-29 12:07:12 +04:00
*/
2015-04-02 14:14:39 +03:00
static void workqueue_sysfs_unregister ( struct workqueue_struct * wq )
2010-06-29 12:07:12 +04:00
{
2015-04-02 14:14:39 +03:00
struct wq_device * wq_dev = wq - > wq_dev ;
2010-06-29 12:07:12 +04:00
2015-04-02 14:14:39 +03:00
if ( ! wq - > wq_dev )
return ;
2010-06-29 12:07:12 +04:00
2015-04-02 14:14:39 +03:00
wq - > wq_dev = NULL ;
device_unregister ( & wq_dev - > dev ) ;
2010-06-29 12:07:12 +04:00
}
2015-04-02 14:14:39 +03:00
# else /* CONFIG_SYSFS */
static void workqueue_sysfs_unregister ( struct workqueue_struct * wq ) { }
# endif /* CONFIG_SYSFS */
2010-06-29 12:07:12 +04:00
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
a missing WQ_MEM_RECLAIM flag or a concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements a workqueue lockup
detector. It periodically monitors all worker_pools and, if any pool
fails to make forward progress for longer than the threshold duration,
triggers a warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through the kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 19:28:04 +03:00
/*
* Workqueue watchdog .
*
* Stalls may be caused by various bugs - a missing WQ_MEM_RECLAIM , an illegal
* flush dependency , or a concurrency managed work item which stays RUNNING
* indefinitely . Workqueue stalls can be very difficult to debug as the
* usual warning mechanisms don ' t trigger and internal workqueue state is
* largely opaque .
*
* Workqueue watchdog monitors all worker pools periodically and dumps
* state if some pools fail to make forward progress for a while , where
* forward progress is defined as the first item on - > worklist changing .
*
* This mechanism is controlled through the kernel parameter
* " workqueue.watchdog_thresh " which can be updated at runtime through the
* corresponding sysfs parameter file .
*/
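The stall test described in the comment above can be sketched in userspace as follows. This is an illustration under assumptions: pool_stalled() and its parameters are hypothetical, time is a free-running unsigned tick counter like jiffies, and the comparison is written in the style of the kernel's time_after() so that it stays correct when the counter wraps.

```c
#include <assert.h>
#include <limits.h>

/* Wraparound-safe "a is later than b", in the style of time_after(). */
static int time_after_ul(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Hypothetical check: a pool counts as stalled when it has pending work
 * and its last forward progress is more than @thresh ticks in the past.
 * A zero threshold disables the watchdog entirely.
 */
static int pool_stalled(unsigned long now, unsigned long last_progress,
			unsigned long thresh, int has_pending)
{
	if (!thresh || !has_pending)
		return 0;
	return time_after_ul(now, last_progress + thresh);
}
```

The signed-difference trick is what keeps the check meaningful across a counter wrap: a last-progress stamp just below ULONG_MAX and a small current value are still treated as close together.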
# ifdef CONFIG_WQ_WATCHDOG
static unsigned long wq_watchdog_thresh = 30 ;
2017-10-05 02:27:00 +03:00
static struct timer_list wq_watchdog_timer ;
2015-12-08 19:28:04 +03:00
static unsigned long wq_watchdog_touched = INITIAL_JIFFIES ;
static DEFINE_PER_CPU ( unsigned long , wq_watchdog_touched_cpu ) = INITIAL_JIFFIES ;
2023-03-07 15:53:35 +03:00
/*
* Show workers that might prevent the processing of pending work items .
* The only candidates are CPU - bound workers in the running state .
* Pending work items should be handled by another idle worker
* in all other situations .
*/
static void show_cpu_pool_hog ( struct worker_pool * pool )
{
struct worker * worker ;
2024-02-21 08:36:14 +03:00
unsigned long irq_flags ;
2023-03-07 15:53:35 +03:00
int bkt ;
2024-02-21 08:36:14 +03:00
raw_spin_lock_irqsave ( & pool - > lock , irq_flags ) ;
2023-03-07 15:53:35 +03:00
hash_for_each ( pool - > busy_hash , bkt , worker , hentry ) {
if ( task_is_running ( worker - > task ) ) {
/*
* Defer printing to avoid deadlocks in console
* drivers that queue work while holding locks
* also taken in their write paths .
*/
printk_deferred_enter ( ) ;
pr_info ( " pool %d: \n " , pool - > id ) ;
sched_show_task ( worker - > task ) ;
printk_deferred_exit ( ) ;
}
}
2024-02-21 08:36:14 +03:00
raw_spin_unlock_irqrestore ( & pool - > lock , irq_flags ) ;
2023-03-07 15:53:35 +03:00
}
static void show_cpu_pools_hogs ( void )
{
struct worker_pool * pool ;
int pi ;
pr_info ( " Showing backtraces of running workers in stalled CPU-bound worker pools: \n " ) ;
rcu_read_lock ( ) ;
for_each_pool ( pool , pi ) {
if ( pool - > cpu_stall )
show_cpu_pool_hog ( pool ) ;
}
rcu_read_unlock ( ) ;
}
2015-12-08 19:28:04 +03:00
static void wq_watchdog_reset_touched ( void )
{
int cpu ;
wq_watchdog_touched = jiffies ;
for_each_possible_cpu ( cpu )
per_cpu ( wq_watchdog_touched_cpu , cpu ) = jiffies ;
}
2017-10-05 02:27:00 +03:00
static void wq_watchdog_timer_fn ( struct timer_list * unused )
2015-12-08 19:28:04 +03:00
{
unsigned long thresh = READ_ONCE ( wq_watchdog_thresh ) * HZ ;
bool lockup_detected = false ;
2023-03-07 15:53:35 +03:00
bool cpu_pool_stall = false ;
2021-05-20 13:14:22 +03:00
unsigned long now = jiffies ;
2015-12-08 19:28:04 +03:00
struct worker_pool * pool ;
int pi ;
if ( ! thresh )
return ;
rcu_read_lock ( ) ;
for_each_pool ( pool , pi ) {
unsigned long pool_ts , touched , ts ;
2023-03-07 15:53:35 +03:00
pool - > cpu_stall = false ;
2015-12-08 19:28:04 +03:00
if ( list_empty ( & pool - > worklist ) )
continue ;
2021-05-20 13:14:22 +03:00
/*
* If a virtual machine is stopped by the host it can look to
* the watchdog like a stall .
*/
kvm_check_and_clear_guest_paused ( ) ;
2015-12-08 19:28:04 +03:00
/* get the latest of pool and touched timestamps */
2021-03-24 14:40:29 +03:00
if ( pool - > cpu > = 0 )
touched = READ_ONCE ( per_cpu ( wq_watchdog_touched_cpu , pool - > cpu ) ) ;
else
touched = READ_ONCE ( wq_watchdog_touched ) ;
2015-12-08 19:28:04 +03:00
pool_ts = READ_ONCE ( pool - > watchdog_ts ) ;
if ( time_after ( pool_ts , touched ) )
ts = pool_ts ;
else
ts = touched ;
/* did we stall? */
2021-05-20 13:14:22 +03:00
if ( time_after ( now , ts + thresh ) ) {
2015-12-08 19:28:04 +03:00
lockup_detected = true ;
2024-02-05 00:28:06 +03:00
if ( pool - > cpu > = 0 & & ! ( pool - > flags & POOL_BH ) ) {
2023-03-07 15:53:35 +03:00
pool - > cpu_stall = true ;
cpu_pool_stall = true ;
}
2015-12-08 19:28:04 +03:00
pr_emerg ( " BUG: workqueue lockup - pool " ) ;
pr_cont_pool_info ( pool ) ;
pr_cont ( " stuck for %us! \n " ,
2021-05-20 13:14:22 +03:00
jiffies_to_msecs ( now - pool_ts ) / 1000 ) ;
2015-12-08 19:28:04 +03:00
}
2023-03-07 15:53:35 +03:00
2015-12-08 19:28:04 +03:00
}
rcu_read_unlock ( ) ;
if ( lockup_detected )
2021-10-20 06:09:00 +03:00
show_all_workqueues ( ) ;
2015-12-08 19:28:04 +03:00
2023-03-07 15:53:35 +03:00
if ( cpu_pool_stall )
show_cpu_pools_hogs ( ) ;
2015-12-08 19:28:04 +03:00
wq_watchdog_reset_touched ( ) ;
mod_timer ( & wq_watchdog_timer , jiffies + thresh ) ;
}
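The stall test in the timer function above can be sketched in userspace as follows. This is a minimal illustration, not the kernel code: the names tick_after, latest_ts and pool_stalled are hypothetical, and a 32-bit tick counter stands in for jiffies. It shows why the comparison is done on a signed difference, so that a counter wraparound does not produce a false result.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wraparound-safe "a is after b" on a free-running 32-bit tick counter.
 * Same idea as the kernel's time_after(): compare the signed difference.
 * Assumes the usual two's-complement conversion for the cast.
 */
static bool tick_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Pick the later of the pool timestamp and the "touched" timestamp. */
static uint32_t latest_ts(uint32_t pool_ts, uint32_t touched)
{
	return tick_after(pool_ts, touched) ? pool_ts : touched;
}

/*
 * A pool counts as stalled when "now" is past the latest timestamp plus
 * the threshold. A zero threshold disables detection, as in the source.
 */
static bool pool_stalled(uint32_t now, uint32_t pool_ts, uint32_t touched,
			 uint32_t thresh)
{
	if (!thresh)
		return false;
	return tick_after(now, latest_ts(pool_ts, touched) + thresh);
}
```

With a threshold of 30 ticks, a pool last active at tick 20 is reported as stalled at tick 100, while one touched at tick 90 is not.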
2018-08-21 18:25:07 +03:00
notrace void wq_watchdog_touch ( int cpu )
2015-12-08 19:28:04 +03:00
{
if ( cpu > = 0 )
per_cpu ( wq_watchdog_touched_cpu , cpu ) = jiffies ;
2021-03-24 14:40:29 +03:00
wq_watchdog_touched = jiffies ;
2015-12-08 19:28:04 +03:00
}
static void wq_watchdog_set_thresh ( unsigned long thresh )
{
wq_watchdog_thresh = 0 ;
del_timer_sync ( & wq_watchdog_timer ) ;
if ( thresh ) {
wq_watchdog_thresh = thresh ;
wq_watchdog_reset_touched ( ) ;
mod_timer ( & wq_watchdog_timer , jiffies + thresh * HZ ) ;
}
}
static int wq_watchdog_param_set_thresh ( const char * val ,
const struct kernel_param * kp )
{
unsigned long thresh ;
int ret ;
ret = kstrtoul ( val , 0 , & thresh ) ;
if ( ret )
return ret ;
if ( system_wq )
wq_watchdog_set_thresh ( thresh ) ;
else
wq_watchdog_thresh = thresh ;
return 0 ;
}
static const struct kernel_param_ops wq_watchdog_thresh_ops = {
. set = wq_watchdog_param_set_thresh ,
. get = param_get_ulong ,
} ;
module_param_cb ( watchdog_thresh , & wq_watchdog_thresh_ops , & wq_watchdog_thresh ,
0644 ) ;
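The threshold setter above parses its argument with kstrtoul() using base 0, so decimal, octal and hexadecimal input are all accepted. The sketch below approximates that strict parsing in userspace with strtoul(). The function name parse_thresh is hypothetical, and the behaviour is only an approximation: for example, this sketch rejects a leading '+', which kstrtoul() accepts.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Strictly parse an unsigned long in base 0 (auto-detect 0x / 0 prefixes).
 * Returns 0 on success, -EINVAL on malformed input, -ERANGE on overflow.
 * One trailing newline is tolerated, as with values written via sysfs.
 */
static int parse_thresh(const char *val, unsigned long *out)
{
	char *end;
	unsigned long v;

	/* strtoul() accepts leading whitespace and a sign; reject both. */
	if (!isdigit((unsigned char)val[0]))
		return -EINVAL;

	errno = 0;
	v = strtoul(val, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == val)
		return -EINVAL;

	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;	/* trailing garbage such as "30s" */

	*out = v;
	return 0;
}
```

On success the caller would store the value and re-arm the timer, as wq_watchdog_set_thresh() does; on failure the previous threshold stays in effect.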
static void wq_watchdog_init ( void )
{
2017-10-05 02:27:00 +03:00
timer_setup ( & wq_watchdog_timer , wq_watchdog_timer_fn , TIMER_DEFERRABLE ) ;
2015-12-08 19:28:04 +03:00
wq_watchdog_set_thresh ( wq_watchdog_thresh ) ;
}
# else /* CONFIG_WQ_WATCHDOG */
static inline void wq_watchdog_init ( void ) { }
# endif /* CONFIG_WQ_WATCHDOG */
2024-02-14 21:33:55 +03:00
static void bh_pool_kick_normal ( struct irq_work * irq_work )
{
raise_softirq_irqoff ( TASKLET_SOFTIRQ ) ;
}
static void bh_pool_kick_highpri ( struct irq_work * irq_work )
{
raise_softirq_irqoff ( HI_SOFTIRQ ) ;
}
2023-11-22 00:39:36 +03:00
static void __init restrict_unbound_cpumask ( const char * name , const struct cpumask * mask )
{
if ( ! cpumask_intersects ( wq_unbound_cpumask , mask ) ) {
pr_warn ( " workqueue: Restricting unbound_cpumask (%*pb) with %s (%*pb) leaves no CPU, ignoring \n " ,
cpumask_pr_args ( wq_unbound_cpumask ) , name , cpumask_pr_args ( mask ) ) ;
return ;
}
cpumask_and ( wq_unbound_cpumask , wq_unbound_cpumask , mask ) ;
}
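The guard in restrict_unbound_cpumask() applies a restriction only if at least one CPU would remain afterwards. A minimal userspace sketch of the same rule, assuming a 64-bit word in place of a struct cpumask and using the hypothetical name restrict_mask:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Intersect *unbound with mask, unless that would leave no CPU set.
 * Returns true if the restriction was applied, false if it was ignored.
 */
static bool restrict_mask(uint64_t *unbound, uint64_t mask)
{
	if (!(*unbound & mask))
		return false;	/* empty intersection: keep the old mask */

	*unbound &= mask;
	return true;
}
```

Because each call narrows the result of the previous one, the order of the restrictions matters only when one of them would empty the mask. In that case it is skipped and the later ones still apply.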
2024-02-05 00:28:06 +03:00
static void __init init_cpu_worker_pool ( struct worker_pool * pool , int cpu , int nice )
{
BUG_ON ( init_worker_pool ( pool ) ) ;
pool - > cpu = cpu ;
cpumask_copy ( pool - > attrs - > cpumask , cpumask_of ( cpu ) ) ;
cpumask_copy ( pool - > attrs - > __pod_cpumask , cpumask_of ( cpu ) ) ;
pool - > attrs - > nice = nice ;
pool - > attrs - > affn_strict = true ;
pool - > node = cpu_to_node ( cpu ) ;
/* alloc pool ID */
mutex_lock ( & wq_pool_mutex ) ;
BUG_ON ( worker_pool_assign_id ( pool ) ) ;
mutex_unlock ( & wq_pool_mutex ) ;
}
2016-09-16 22:49:32 +03:00
/**
* workqueue_init_early - early init for workqueue subsystem
*
2023-08-08 04:57:24 +03:00
* This is the first step of three - staged workqueue subsystem initialization and
* invoked as soon as the bare basics - memory allocation , cpumasks and idr are
* up . It sets up all the data structures and system workqueues and allows early
* boot code to create workqueues and queue / cancel work items . Actual work item
* execution starts only after kthreads can be created and scheduled right
* before early initcalls .
2016-09-16 22:49:32 +03:00
*/
2020-02-23 10:28:52 +03:00
void __init workqueue_init_early ( void )
2005-04-17 02:20:36 +04:00
{
2023-08-08 04:57:24 +03:00
struct wq_pod_type * pt = & wq_pod_types [ WQ_AFFN_SYSTEM ] ;
2013-03-12 22:30:00 +04:00
int std_nice [ NR_STD_WORKER_POOLS ] = { 0 , HIGHPRI_NICE_LEVEL } ;
2024-02-14 21:33:55 +03:00
void ( * irq_work_fns [ 2 ] ) ( struct irq_work * ) = { bh_pool_kick_normal ,
bh_pool_kick_highpri } ;
2013-03-12 22:30:00 +04:00
int i , cpu ;
2010-06-29 12:07:11 +04:00
2020-06-01 11:44:40 +03:00
BUILD_BUG_ON ( __alignof__ ( struct pool_workqueue ) < __alignof__ ( long long ) ) ;
2013-03-12 22:29:57 +04:00
2015-04-27 12:58:39 +03:00
BUG_ON ( ! alloc_cpumask_var ( & wq_unbound_cpumask , GFP_KERNEL ) ) ;
2023-10-25 21:25:52 +03:00
BUG_ON ( ! alloc_cpumask_var ( & wq_requested_unbound_cpumask , GFP_KERNEL ) ) ;
BUG_ON ( ! zalloc_cpumask_var ( & wq_isolated_cpumask , GFP_KERNEL ) ) ;
2015-04-27 12:58:39 +03:00
2023-11-22 00:39:36 +03:00
cpumask_copy ( wq_unbound_cpumask , cpu_possible_mask ) ;
restrict_unbound_cpumask ( " HK_TYPE_WQ " , housekeeping_cpumask ( HK_TYPE_WQ ) ) ;
restrict_unbound_cpumask ( " HK_TYPE_DOMAIN " , housekeeping_cpumask ( HK_TYPE_DOMAIN ) ) ;
2023-06-29 06:50:50 +03:00
if ( ! cpumask_empty ( & wq_cmdline_cpumask ) )
2023-11-22 00:39:36 +03:00
restrict_unbound_cpumask ( " workqueue.unbound_cpus " , & wq_cmdline_cpumask ) ;
2023-06-29 06:50:50 +03:00
2023-10-25 21:25:52 +03:00
cpumask_copy ( wq_requested_unbound_cpumask , wq_unbound_cpumask ) ;
2023-06-29 06:50:50 +03:00
2013-03-12 22:29:57 +04:00
pwq_cache = KMEM_CACHE ( pool_workqueue , SLAB_PANIC ) ;
2023-08-08 04:57:24 +03:00
wq_update_pod_attrs_buf = alloc_workqueue_attrs ( ) ;
BUG_ON ( ! wq_update_pod_attrs_buf ) ;
2024-01-19 18:54:39 +03:00
/*
* If nohz_full is enabled , set power efficient workqueue as unbound .
* This allows workqueue items to be moved to HK CPUs .
*/
if ( housekeeping_enabled ( HK_TYPE_TICK ) )
wq_power_efficient = true ;
2023-08-08 04:57:24 +03:00
/* initialize WQ_AFFN_SYSTEM pods */
pt - > pod_cpus = kcalloc ( 1 , sizeof ( pt - > pod_cpus [ 0 ] ) , GFP_KERNEL ) ;
pt - > pod_node = kcalloc ( 1 , sizeof ( pt - > pod_node [ 0 ] ) , GFP_KERNEL ) ;
pt - > cpu_pod = kcalloc ( nr_cpu_ids , sizeof ( pt - > cpu_pod [ 0 ] ) , GFP_KERNEL ) ;
BUG_ON ( ! pt - > pod_cpus | | ! pt - > pod_node | | ! pt - > cpu_pod ) ;
BUG_ON ( ! zalloc_cpumask_var_node ( & pt - > pod_cpus [ 0 ] , GFP_KERNEL , NUMA_NO_NODE ) ) ;
pt - > nr_pods = 1 ;
cpumask_copy ( pt - > pod_cpus [ 0 ] , cpu_possible_mask ) ;
pt - > pod_node [ 0 ] = NUMA_NO_NODE ;
pt - > cpu_pod [ 0 ] = 0 ;
2024-02-05 00:28:06 +03:00
/* initialize BH and CPU pools */
2013-03-12 22:30:03 +04:00
for_each_possible_cpu ( cpu ) {
2012-07-14 09:16:44 +04:00
struct worker_pool * pool ;
2010-06-29 12:07:12 +04:00
2024-02-05 00:28:06 +03:00
i = 0 ;
for_each_bh_worker_pool ( pool , cpu ) {
2024-02-14 21:33:55 +03:00
init_cpu_worker_pool ( pool , cpu , std_nice [ i ] ) ;
2024-02-05 00:28:06 +03:00
pool - > flags | = POOL_BH ;
2024-02-14 21:33:55 +03:00
init_irq_work ( bh_pool_irq_work ( pool ) , irq_work_fns [ i ] ) ;
i + + ;
2024-02-05 00:28:06 +03:00
}
2013-03-12 22:30:00 +04:00
i = 0 ;
2024-02-05 00:28:06 +03:00
for_each_cpu_worker_pool ( pool , cpu )
init_cpu_worker_pool ( pool , cpu , std_nice [ i + + ] ) ;
2010-06-29 12:07:12 +04:00
}
2013-09-05 20:30:04 +04:00
/* create default unbound and ordered wq attrs */
2013-03-12 22:30:03 +04:00
for ( i = 0 ; i < NR_STD_WORKER_POOLS ; i + + ) {
struct workqueue_attrs * attrs ;
2019-06-26 17:52:38 +03:00
BUG_ON ( ! ( attrs = alloc_workqueue_attrs ( ) ) ) ;
2013-03-12 22:30:03 +04:00
attrs - > nice = std_nice [ i ] ;
unbound_std_wq_attrs [ i ] = attrs ;
2013-09-05 20:30:04 +04:00
/*
* An ordered wq should have only one pwq as ordering is
* guaranteed by max_active which is enforced by pwqs .
*/
2019-06-26 17:52:38 +03:00
BUG_ON ( ! ( attrs = alloc_workqueue_attrs ( ) ) ) ;
2013-09-05 20:30:04 +04:00
attrs - > nice = std_nice [ i ] ;
2023-08-08 04:57:23 +03:00
attrs - > ordered = true ;
2013-09-05 20:30:04 +04:00
ordered_wq_attrs [ i ] = attrs ;
2013-03-12 22:30:03 +04:00
}
2010-06-29 12:07:14 +04:00
system_wq = alloc_workqueue ( " events " , 0 , 0 ) ;
2012-08-15 18:25:39 +04:00
system_highpri_wq = alloc_workqueue ( " events_highpri " , WQ_HIGHPRI , 0 ) ;
2010-06-29 12:07:14 +04:00
system_long_wq = alloc_workqueue ( " events_long " , 0 , 0 ) ;
2010-07-02 12:03:51 +04:00
system_unbound_wq = alloc_workqueue ( " events_unbound " , WQ_UNBOUND ,
workqueue: Make unbound workqueues to use per-cpu pool_workqueues
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a small bit faster, flush_workqueue()
would become a bit more expensive, and the total concurrency limit would
likely become higher. All @max_active==1 use cases are currently being
audited for conversion into alloc_ordered_workqueue() and they shouldn't be
affected once the audit and conversion is complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
2023-08-08 04:57:23 +03:00
WQ_MAX_ACTIVE ) ;
2011-02-21 11:52:50 +03:00
system_freezable_wq = alloc_workqueue ( " events_freezable " ,
WQ_FREEZABLE , 0 ) ;
2013-04-24 15:42:54 +04:00
system_power_efficient_wq = alloc_workqueue ( " events_power_efficient " ,
WQ_POWER_EFFICIENT , 0 ) ;
2024-01-25 22:05:32 +03:00
system_freezable_power_efficient_wq = alloc_workqueue ( " events_freezable_pwr_efficient " ,
2013-04-24 15:42:54 +04:00
WQ_FREEZABLE | WQ_POWER_EFFICIENT ,
0 ) ;
2024-02-05 00:28:06 +03:00
system_bh_wq = alloc_workqueue ( " events_bh " , WQ_BH , 0 ) ;
system_bh_highpri_wq = alloc_workqueue ( " events_bh_highpri " ,
WQ_BH | WQ_HIGHPRI , 0 ) ;
2012-08-15 18:25:39 +04:00
BUG_ON ( ! system_wq | | ! system_highpri_wq | | ! system_long_wq | |
2013-04-24 15:42:54 +04:00
! system_unbound_wq | | ! system_freezable_wq | |
! system_power_efficient_wq | |
2024-02-05 00:28:06 +03:00
! system_freezable_power_efficient_wq | |
! system_bh_wq | | ! system_bh_highpri_wq ) ;
2016-09-16 22:49:32 +03:00
}
2023-07-18 01:50:02 +03:00
static void __init wq_cpu_intensive_thresh_init ( void )
{
unsigned long thresh ;
unsigned long bogo ;
2023-09-11 11:27:22 +03:00
pwq_release_worker = kthread_create_worker ( 0 , " pool_workqueue_release " ) ;
BUG_ON ( IS_ERR ( pwq_release_worker ) ) ;
2023-07-18 01:50:02 +03:00
/* if the user set it to a specific value, keep it */
if ( wq_cpu_intensive_thresh_us ! = ULONG_MAX )
return ;
/*
* The default of 10 ms is derived from the fact that most modern ( as of
* 2023 ) processors can do a lot in 10 ms and that it ' s just below what
* most consider human - perceivable . However , the kernel also runs on a
* lot slower CPUs including microcontrollers where the threshold is way
* too low .
*
* Let ' s scale up the threshold up to 1 second if BogoMIPS is below 4000.
* This is by no means accurate but it doesn ' t have to be . The mechanism
* is still useful even when the threshold is fully scaled up . Also , as
* the reports would usually be applicable to everyone , some machines
* operating on longer thresholds won ' t significantly diminish their
* usefulness .
*/
thresh = 10 * USEC_PER_MSEC ;
/* see init/calibrate.c for lpj -> BogoMIPS calculation */
bogo = max_t ( unsigned long , loops_per_jiffy / 500000 * HZ , 1 ) ;
if ( bogo < 4000 )
thresh = min_t ( unsigned long , thresh * 4000 / bogo , USEC_PER_SEC ) ;
pr_debug ( " wq_cpu_intensive_thresh: lpj=%lu BogoMIPS=%lu thresh_us=%lu \n " ,
loops_per_jiffy , bogo , thresh ) ;
wq_cpu_intensive_thresh_us = thresh ;
}
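The scaling arithmetic in wq_cpu_intensive_thresh_init() can be tried out in userspace. The following is a minimal sketch, not kernel code: scaled_thresh_us() is a hypothetical helper, HZ is passed in as the hz parameter instead of being a build-time constant, and max_t()/min_t() are replaced with plain comparisons.

```c
#include <assert.h>

#define USEC_PER_MSEC 1000UL
#define USEC_PER_SEC  1000000UL

/*
 * Hypothetical userspace mirror of the threshold scaling above.
 * Returns the CPU-intensive threshold in microseconds for a given
 * loops_per_jiffy and HZ (both supplied by the caller here).
 */
static unsigned long scaled_thresh_us(unsigned long loops_per_jiffy,
                                      unsigned long hz)
{
    unsigned long thresh = 10 * USEC_PER_MSEC;  /* 10 ms default */
    unsigned long bogo;

    assert(hz > 0);  /* assumption of this sketch, not checked in the kernel */

    /* same lpj -> BogoMIPS estimate as the code above, clamped to >= 1 */
    bogo = loops_per_jiffy / 500000 * hz;
    if (bogo < 1)
        bogo = 1;

    /* slower than 4000 BogoMIPS: scale up, capped at 1 second */
    if (bogo < 4000) {
        thresh = thresh * 4000 / bogo;
        if (thresh > USEC_PER_SEC)
            thresh = USEC_PER_SEC;
    }
    return thresh;
}
```

For example, loops_per_jiffy = 4000000 with HZ = 250 gives a BogoMIPS estimate of 2000, so the 10 ms default doubles to 20000 us. Any machine estimated at 40 BogoMIPS or below ends up at the 1 second cap.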
2016-09-16 22:49:32 +03:00
/**
* workqueue_init - bring workqueue subsystem fully online
*
2023-08-08 04:57:24 +03:00
* This is the second step of three - staged workqueue subsystem initialization
* and invoked as soon as kthreads can be created and scheduled . Workqueues have
* been created and work items queued on them , but there are no kworkers
* executing the work items yet . Populate the worker pools with the initial
* workers and enable future kworker creations .
2016-09-16 22:49:32 +03:00
*/
2020-02-23 10:28:52 +03:00
void __init workqueue_init ( void )
2016-09-16 22:49:32 +03:00
{
workqueue: move wq_numa_init() to workqueue_init()
While splitting up workqueue initialization into two parts,
ac8f73400782 ("workqueue: make workqueue available early during boot")
put wq_numa_init() into workqueue_init_early(). Unfortunately, on
some archs including power and arm64, cpu to node mapping isn't yet
established by the time the early init is called leading to incorrect
NUMA initialization and subsequently the following oops due to zero
cpumask on node-specific unbound pools.
Unable to handle kernel paging request for data at address 0x00000038
Faulting instruction address: 0xc0000000000fc0cc
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
task: c0000007f5400000 task.stack: c000001ffc084000
NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0
REGS: c000001ffc087780 TRAP: 0300 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005)
MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 48000424 XER: 00000000
CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0
GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600
GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f
GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000
GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001
GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e
GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001
GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000
NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0
LR [c0000000000ed928] activate_task+0x78/0xe0
Call Trace:
[c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable)
[c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0
[c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0
[c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0
[c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250
[c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150
[c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360
[c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
[c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
Instruction dump:
62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000
60420000 72490021 ebfe0150 2f890001 <ebbf0038> 419e0de0 7fbee840 419e0e58
---[ end trace 0000000000000000 ]---
Fix it by moving wq_numa_init() to workqueue_init(). As this means
that the early initialization may not have full NUMA info for per-cpu
pools and ignores NUMA affinity for unbound pools, fix them up from
workqueue_init() after wq_numa_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Link: http://lkml.kernel.org/r/87twck5wqo.fsf@concordia.ellerman.id.au
Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-19 19:01:27 +03:00
struct workqueue_struct * wq ;
2016-09-16 22:49:32 +03:00
struct worker_pool * pool ;
int cpu , bkt ;
2023-07-18 01:50:02 +03:00
wq_cpu_intensive_thresh_init ( ) ;
2016-10-19 19:01:27 +03:00
mutex_lock ( & wq_pool_mutex ) ;
2023-08-08 04:57:24 +03:00
/*
* Per - cpu pools created earlier could be missing node hint . Fix them
* up . Also , create a rescuer for workqueues that requested it .
*/
2016-10-19 19:01:27 +03:00
for_each_possible_cpu ( cpu ) {
2024-02-05 00:28:06 +03:00
for_each_bh_worker_pool ( pool , cpu )
pool - > node = cpu_to_node ( cpu ) ;
for_each_cpu_worker_pool ( pool , cpu )
2016-10-19 19:01:27 +03:00
pool - > node = cpu_to_node ( cpu ) ;
}
2018-01-08 16:38:37 +03:00
list_for_each_entry ( wq , & workqueues , list ) {
WARN ( init_rescuer ( wq ) ,
" workqueue: failed to create early rescuer for %s " ,
wq - > name ) ;
}
2016-10-19 19:01:27 +03:00
mutex_unlock ( & wq_pool_mutex ) ;
2024-02-05 00:28:06 +03:00
/*
* Create the initial workers . A BH pool has one pseudo worker that
* represents the shared BH execution context and thus doesn ' t get
* affected by hotplug events . Create the BH pseudo workers for all
* possible CPUs here .
*/
for_each_possible_cpu ( cpu )
for_each_bh_worker_pool ( pool , cpu )
BUG_ON ( ! create_worker ( pool ) ) ;
2016-09-16 22:49:32 +03:00
for_each_online_cpu ( cpu ) {
for_each_cpu_worker_pool ( pool , cpu ) {
pool - > flags & = ~ POOL_DISASSOCIATED ;
BUG_ON ( ! create_worker ( pool ) ) ;
}
}
hash_for_each ( unbound_pool_hash , bkt , pool , hash_node )
BUG_ON ( ! create_worker ( pool ) ) ;
wq_online = true ;
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements a workqueue lockup
detector. It periodically monitors all worker_pools and, if any pool
failed to make forward progress for longer than the threshold duration,
triggers a warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through the kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 19:28:04 +03:00
wq_watchdog_init ( ) ;
2005-04-17 02:20:36 +04:00
}
2022-06-01 10:32:47 +03:00
2023-08-08 04:57:24 +03:00
/*
* Initialize @ pt by first initializing @ pt - > cpu_pod [ ] with pod IDs according to
* @ cpus_share_pod ( ) . Each subset of CPUs that share a pod is assigned a unique
* and consecutive pod ID . The rest of @ pt is initialized accordingly .
*/
static void __init init_pod_type ( struct wq_pod_type * pt ,
bool ( * cpus_share_pod ) ( int , int ) )
{
int cur , pre , cpu , pod ;
pt - > nr_pods = 0 ;
/* init @pt->cpu_pod[] according to @cpus_share_pod() */
pt - > cpu_pod = kcalloc ( nr_cpu_ids , sizeof ( pt - > cpu_pod [ 0 ] ) , GFP_KERNEL ) ;
BUG_ON ( ! pt - > cpu_pod ) ;
for_each_possible_cpu ( cur ) {
for_each_possible_cpu ( pre ) {
if ( pre > = cur ) {
pt - > cpu_pod [ cur ] = pt - > nr_pods + + ;
break ;
}
if ( cpus_share_pod ( cur , pre ) ) {
pt - > cpu_pod [ cur ] = pt - > cpu_pod [ pre ] ;
break ;
}
}
}
/* init the rest to match @pt->cpu_pod[] */
pt - > pod_cpus = kcalloc ( pt - > nr_pods , sizeof ( pt - > pod_cpus [ 0 ] ) , GFP_KERNEL ) ;
pt - > pod_node = kcalloc ( pt - > nr_pods , sizeof ( pt - > pod_node [ 0 ] ) , GFP_KERNEL ) ;
BUG_ON ( ! pt - > pod_cpus | | ! pt - > pod_node ) ;
for ( pod = 0 ; pod < pt - > nr_pods ; pod + + )
BUG_ON ( ! zalloc_cpumask_var ( & pt - > pod_cpus [ pod ] , GFP_KERNEL ) ) ;
for_each_possible_cpu ( cpu ) {
cpumask_set_cpu ( cpu , pt - > pod_cpus [ pt - > cpu_pod [ cpu ] ] ) ;
pt - > pod_node [ pt - > cpu_pod [ cpu ] ] = cpu_to_node ( cpu ) ;
}
}
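The pod ID assignment loop in init_pod_type() can be illustrated outside the kernel. This is a minimal sketch under stated assumptions: assign_pods() and demo_share_smt() are hypothetical names, CPUs are numbered 0..nr_cpus-1 with no holes (so a plain for loop stands in for for_each_possible_cpu()), and the demo topology simply treats CPUs 2k and 2k+1 as siblings.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace mirror of the cpu_pod[] assignment above.
 * Each CPU either joins the pod of the first lower-numbered CPU it
 * shares with, or opens a new pod. Returns the number of pods.
 */
static int assign_pods(int nr_cpus, bool (*cpus_share_pod)(int, int),
                       int *cpu_pod)
{
    int nr_pods = 0;
    int cur, pre;

    assert(cpu_pod);

    for (cur = 0; cur < nr_cpus; cur++) {
        for (pre = 0; pre < nr_cpus; pre++) {
            if (pre >= cur) {
                /* no earlier CPU shares a pod with @cur: new pod ID */
                cpu_pod[cur] = nr_pods++;
                break;
            }
            if (cpus_share_pod(cur, pre)) {
                /* reuse the pod ID of the earlier sibling */
                cpu_pod[cur] = cpu_pod[pre];
                break;
            }
        }
    }
    return nr_pods;
}

/* Assumed demo topology: CPUs 2k and 2k+1 are SMT siblings. */
static bool demo_share_smt(int cpu0, int cpu1)
{
    return cpu0 / 2 == cpu1 / 2;
}
```

Called with six CPUs and demo_share_smt(), the sketch yields three pods with cpu_pod[] = { 0, 0, 1, 1, 2, 2 }. Because a new ID is only handed out when no earlier CPU matches, pod IDs come out unique and consecutive, as the comment above init_pod_type() states.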
2023-08-08 04:57:24 +03:00
static bool __init cpus_dont_share ( int cpu0 , int cpu1 )
{
return false ;
}
static bool __init cpus_share_smt ( int cpu0 , int cpu1 )
{
# ifdef CONFIG_SCHED_SMT
return cpumask_test_cpu ( cpu0 , cpu_smt_mask ( cpu1 ) ) ;
# else
return false ;
# endif
}
2023-08-08 04:57:24 +03:00
static bool __init cpus_share_numa ( int cpu0 , int cpu1 )
{
return cpu_to_node ( cpu0 ) = = cpu_to_node ( cpu1 ) ;
}
2023-08-08 04:57:24 +03:00
/**
* workqueue_init_topology - initialize CPU pods for unbound workqueues
*
2024-02-05 03:31:52 +03:00
* This is the third step of three - staged workqueue subsystem initialization and
2023-08-08 04:57:24 +03:00
* invoked after SMP and topology information are fully initialized . It
* initializes the unbound CPU pods accordingly .
*/
void __init workqueue_init_topology ( void )
2023-08-08 04:57:24 +03:00
{
2023-08-08 04:57:24 +03:00
struct workqueue_struct * wq ;
2023-08-08 04:57:24 +03:00
int cpu ;
2023-08-08 04:57:24 +03:00
2023-08-08 04:57:24 +03:00
init_pod_type ( & wq_pod_types [ WQ_AFFN_CPU ] , cpus_dont_share ) ;
init_pod_type ( & wq_pod_types [ WQ_AFFN_SMT ] , cpus_share_smt ) ;
init_pod_type ( & wq_pod_types [ WQ_AFFN_CACHE ] , cpus_share_cache ) ;
2023-08-08 04:57:24 +03:00
init_pod_type ( & wq_pod_types [ WQ_AFFN_NUMA ] , cpus_share_numa ) ;
2023-08-08 04:57:24 +03:00
workqueue: Avoid premature init of wq->node_nr_active[].max
System workqueues are allocated early during boot from
workqueue_init_early(). While allocating unbound workqueues,
wq_update_node_max_active() is invoked from apply_workqueue_attrs() and
accesses NUMA topology to initialize wq->node_nr_active[].max.
However, topology information may not be set up at this point.
wq_update_node_max_active() is explicitly invoked from
workqueue_init_topology() later when topology information is known to be
available.
This doesn't seem to crash anything but it's doing useless work with dubious
data. Let's skip the premature and duplicate node_max_active updates by
initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making
wq_update_node_max_active() noop until workqueue_init_topology().
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/workqueue.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9221a4c57ae1..a65081ec6780 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -386,6 +386,8 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
[WQ_AFFN_SYSTEM] = "system",
};
+static bool wq_topo_initialized = false;
+
/*
* Per-cpu work items which run for longer than the following threshold are
* automatically considered CPU intensive and excluded from concurrency
@@ -1510,6 +1512,9 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)
lockdep_assert_held(&wq->mutex);
+ if (!wq_topo_initialized)
+ return;
+
if (!cpumask_test_cpu(off_cpu, effective))
off_cpu = -1;
@@ -4356,6 +4361,7 @@ static void free_node_nr_active(struct wq_node_nr_active **nna_ar)
static void init_node_nr_active(struct wq_node_nr_active *nna)
{
+ nna->max = WQ_DFL_MIN_ACTIVE;
atomic_set(&nna->nr, 0);
raw_spin_lock_init(&nna->lock);
INIT_LIST_HEAD(&nna->pending_pwqs);
@@ -7400,6 +7406,8 @@ void __init workqueue_init_topology(void)
init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
+ wq_topo_initialized = true;
+
mutex_lock(&wq_pool_mutex);
/*
2024-01-31 08:06:43 +03:00
wq_topo_initialized = true ;
2023-08-08 04:57:24 +03:00
mutex_lock ( & wq_pool_mutex ) ;
2023-08-08 04:57:24 +03:00
2023-08-08 04:57:24 +03:00
/*
* Workqueues allocated earlier would have all CPUs sharing the default
* worker pool . Explicitly call wq_update_pod ( ) on all workqueue and CPU
* combinations to apply per - pod sharing .
*/
list_for_each_entry ( wq , & workqueues , list ) {
workqueue: Implement system-wide nr_active enforcement for unbound workqueues
A pool_workqueue (pwq) represents the connection between a workqueue and a
worker_pool. One of the roles that a pwq plays is enforcement of the
max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU
for per-cpu workqueues and per each NUMA node for unbound workqueues, which
was a natural result of per-cpu workqueues being served by per-cpu pools and
unbound by per-NUMA pools.
In terms of max_active enforcement, this was, while not perfect, workable.
For per-cpu workqueues, it was fine. For unbound, it wasn't great in that
NUMA machines would get max_active that's multiplied by the number of nodes
but didn't cause huge problems because NUMA machines are relatively rare and
the node count is usually pretty low.
However, cache layouts are more complex now and sharing a worker pool across
a whole node didn't really work well for unbound workqueues. Thus, a series
of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues
to use per-cpu pool_workqueues") implemented more flexible affinity
mechanism for unbound workqueues which enables using e.g. last-level-cache
aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound
workqueues to use per-cpu pool_workqueues") made unbound workqueues use
per-cpu pwqs like per-cpu workqueues.
While the change was necessary to enable more flexible affinity scopes, this
came with the side effect of blowing up the effective max_active for unbound
workqueues. Before, the effective max_active for unbound workqueues was
multiplied by the number of nodes. After, by the number of CPUs.
636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu
pool_workqueues") claims that this should generally be okay. It is okay for
users which self-regulate their concurrency level, which are the vast majority;
however, there are enough use cases which actually depend on max_active to
prevent the level of concurrency from going bonkers including several IO
handling workqueues that can issue a work item for each in-flight IO. With
targeted benchmarks, the misbehavior can easily be exposed as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.
Unfortunately, there is no way to express what these use cases need using
per-cpu max_active. A single CPU may issue most of the in-flight IOs, so we don't
want to set max_active too low, but as soon as we increase max_active a bit, we
can end up with an unreasonable number of in-flight work items when many CPUs
issue IOs at the same time. i.e. the lowest acceptable max_active is higher
than the highest acceptable max_active.
Ideally, max_active for an unbound workqueue should be system-wide so that
the users can regulate the total level of concurrency regardless of node and
cache layout. The reasons workqueue hasn't implemented that yet are:
- Once max_active enforcement is decoupled from pool boundaries, chaining
execution after a work item finishes requires inter-pool operations which
would require lock dancing, which is nasty.
- Sharing a single nr_active count across the whole system can be pretty
expensive on NUMA machines.
- Per-pwq enforcement had been more or less okay while we were using
per-node pools.
It looks like we can no longer avoid decoupling max_active enforcement from
pool boundaries. This patch implements a system-wide nr_active mechanism with
the following design characteristics:
- To avoid sharing a single counter across multiple nodes, the configured
max_active is split across nodes according to the proportion of each
workqueue's online effective CPUs per node. e.g. A node with twice as many
online effective CPUs will get twice the share of max_active.
- Workqueue used to be able to process a chain of interdependent work items
which is as long as max_active. We can't do this anymore as max_active is
distributed across the nodes. Instead, a new parameter min_active is
introduced which determines the minimum level of concurrency within a node
regardless of how max_active distribution comes out to be.
It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8.
This can lead to higher effective max_active than configured and also
deadlocks if a workqueue was depending on being able to handle chains of
interdependent work items that are longer than 8.
I believe these should be fine given that the number of CPUs in each NUMA
node is usually higher than 8 and work item chain longer than 8 is pretty
unlikely. However, if these assumptions turn out to be wrong, we'll need
to add an interface to adjust min_active.
- Each unbound wq has an array of struct wq_node_nr_active which tracks
per-node nr_active. When its pwq wants to run a work item, it has to
obtain the matching node's nr_active. If over the node's max_active, the
pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish,
the completion path round-robins the pending pwqs activating the first
inactive work item of each, which involves some pool lock dancing and
kicking other pools. It's not the simplest code but doesn't look too bad.
v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active().
- wq_adjust_max_active() is now protected by wq->mutex instead of
wq_pool_mutex.
v3: - wq_node_max_active() used to calculate per-node max_active on the fly
based on system-wide CPU online states. Lai pointed out that this can
lead to skewed distributions for workqueues with restricted cpumasks.
Update the max_active distribution to use per-workqueue effective
online CPU counts instead of system-wide and cache the calculation
results in node_nr_active->max.
v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com>
Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3
Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29 21:11:25 +03:00
for_each_online_cpu ( cpu )
2023-08-08 04:57:24 +03:00
wq_update_pod ( wq , cpu , cpu , true ) ;
2024-01-29 21:11:25 +03:00
if ( wq -> flags & WQ_UNBOUND ) {
mutex_lock ( & wq -> mutex ) ;
wq_update_node_max_active ( wq , -1 ) ;
mutex_unlock ( & wq -> mutex ) ;
2023-08-08 04:57:24 +03:00
}
}
mutex_unlock ( & wq_pool_mutex ) ;
2023-08-08 04:57:24 +03:00
}
2023-06-30 15:28:53 +03:00
void __warn_flushing_systemwide_wq ( void )
{
pr_warn ( "WARNING: Flushing system-wide workqueues will be prohibited in near future.\n" ) ;
dump_stack ( ) ;
}
2022-06-01 10:32:47 +03:00
EXPORT_SYMBOL ( __warn_flushing_systemwide_wq ) ;
2023-06-29 06:50:50 +03:00
static int __init workqueue_unbound_cpus_setup ( char * str )
{
if ( cpulist_parse ( str , & wq_cmdline_cpumask ) < 0 ) {
cpumask_clear ( & wq_cmdline_cpumask ) ;
pr_warn ( "workqueue.unbound_cpus: incorrect CPU range, using default\n" ) ;
}
return 1 ;
}
__setup ( "workqueue.unbound_cpus=" , workqueue_unbound_cpus_setup ) ;