// SPDX-License-Identifier: GPL-2.0-only
/*
 * kernel/workqueue.c - generic async execution with shared worker pool
 *
 * Copyright (C) 2002		Ingo Molnar
 *
 *   Derived from the taskqueue/keventd code by:
 *     David Woodhouse <dwmw2@infradead.org>
 *     Andrew Morton
 *     Kai Petzke <wpp@marie.physik.tu-berlin.de>
 *     Theodore Ts'o <tytso@mit.edu>
 *
 * Made to use alloc_percpu by Christoph Lameter.
 *
 * Copyright (C) 2010		SUSE Linux Products GmbH
 * Copyright (C) 2010		Tejun Heo <tj@kernel.org>
 *
 * This is the generic async execution mechanism.  Work items are
 * executed in process context.  The worker pool is shared and
 * automatically managed.  There are two worker pools for each CPU (one for
 * normal work items and the other for high priority ones) and some extra
 * pools for workqueues which are not bound to any specific CPU - the
 * number of these backing pools is dynamic.
 *
 * Please read Documentation/core-api/workqueue.rst for details.
 */

#include <linux/export.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/init.h>
#include <linux/signal.h>
#include <linux/completion.h>
#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/kthread.h>
#include <linux/hardirq.h>
#include <linux/mempolicy.h>
#include <linux/freezer.h>
#include <linux/debug_locks.h>
#include <linux/lockdep.h>
#include <linux/idr.h>
#include <linux/jhash.h>
#include <linux/hashtable.h>
#include <linux/rculist.h>
#include <linux/nodemask.h>
#include <linux/moduleparam.h>
#include <linux/uaccess.h>
#include <linux/sched/isolation.h>
#include <linux/nmi.h>
#include <linux/kvm_para.h>

#include "workqueue_internal.h"

enum {
	/*
	 * worker_pool flags
	 *
	 * A bound pool is either associated or disassociated with its CPU.
	 * While associated (!DISASSOCIATED), all workers are bound to the
	 * CPU and none has %WORKER_UNBOUND set and concurrency management
	 * is in effect.
	 *
	 * While DISASSOCIATED, the cpu may be offline and all workers have
	 * %WORKER_UNBOUND set and concurrency management disabled, and may
	 * be executing on any CPU.  The pool behaves as an unbound one.
	 *
	 * Note that DISASSOCIATED should be flipped only while holding
	 * wq_pool_attach_mutex to avoid changing binding state while
	 * worker_attach_to_pool() is in progress.
	 */
	POOL_MANAGER_ACTIVE	= 1 << 0,	/* being managed */
	POOL_DISASSOCIATED	= 1 << 2,	/* cpu can't serve workers */

	/* worker flags */
	WORKER_DIE		= 1 << 1,	/* die die die */
	WORKER_IDLE		= 1 << 2,	/* is idle */
	WORKER_PREP		= 1 << 3,	/* preparing to run works */
	WORKER_CPU_INTENSIVE	= 1 << 6,	/* cpu intensive */
	WORKER_UNBOUND		= 1 << 7,	/* worker is unbound */
	WORKER_REBOUND		= 1 << 8,	/* worker was rebound */

	WORKER_NOT_RUNNING	= WORKER_PREP | WORKER_CPU_INTENSIVE |
				  WORKER_UNBOUND | WORKER_REBOUND,

	NR_STD_WORKER_POOLS	= 2,		/* # standard pools per cpu */

	UNBOUND_POOL_HASH_ORDER	= 6,		/* hashed by pool->attrs */
	BUSY_WORKER_HASH_ORDER	= 6,		/* 64 pointers */

	MAX_IDLE_WORKERS_RATIO	= 4,		/* 1/4 of busy can be idle */
	IDLE_WORKER_TIMEOUT	= 300 * HZ,	/* keep idle ones for 5 mins */

	MAYDAY_INITIAL_TIMEOUT	= HZ / 100 >= 2 ? HZ / 100 : 2,
						/* call for help after 10ms
						   (min two ticks) */
	MAYDAY_INTERVAL		= HZ / 10,	/* and then every 100ms */
	CREATE_COOLDOWN		= HZ,		/* time to breathe after fail */

	/*
	 * Rescue workers are used only on emergencies and shared by
	 * all cpus.  Give MIN_NICE.
	 */
	RESCUER_NICE_LEVEL	= MIN_NICE,
	HIGHPRI_NICE_LEVEL	= MIN_NICE,

	WQ_NAME_LEN		= 24,
};
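
/*
 * Illustration only, not part of the original file: a minimal userspace
 * sketch of two constants from the enum above. The sketch_* names and the
 * hz parameter are hypothetical; in the kernel, HZ is a compile-time
 * constant and these values are plain enumerators.
 *
```c
#include <stdbool.h>

// Worker flag bits, copied from the enum above under hypothetical names.
enum {
	SKETCH_WORKER_PREP		= 1 << 3,
	SKETCH_WORKER_CPU_INTENSIVE	= 1 << 6,
	SKETCH_WORKER_UNBOUND		= 1 << 7,
	SKETCH_WORKER_REBOUND		= 1 << 8,

	SKETCH_WORKER_NOT_RUNNING	= SKETCH_WORKER_PREP |
					  SKETCH_WORKER_CPU_INTENSIVE |
					  SKETCH_WORKER_UNBOUND |
					  SKETCH_WORKER_REBOUND,
};

// True if any of the flags grouped under WORKER_NOT_RUNNING is set.
static bool sketch_worker_not_running(unsigned int flags)
{
	return (flags & SKETCH_WORKER_NOT_RUNNING) != 0;
}

// MAYDAY_INITIAL_TIMEOUT in ticks for a tick rate of hz:
// 10ms worth of ticks, but never fewer than two ticks.
static unsigned long sketch_mayday_initial_timeout(unsigned long hz)
{
	return hz / 100 >= 2 ? hz / 100 : 2;
}
```
 *
 * For example, a 1000 Hz tick rate gives ten ticks (10ms), while any rate
 * below 200 Hz falls back to the two-tick minimum.
 */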

/*
 * Structure fields follow one of the following exclusion rules.
 *
 * I: Modifiable by initialization/destruction paths and read-only for
 *    everyone else.
 *
 * P: Preemption protected.  Disabling preemption is enough and should
 *    only be modified and accessed from the local cpu.
 *
 * L: pool->lock protected.  Access with pool->lock held.
 *
 * X: During normal operation, modification requires pool->lock and should
 *    be done only from local cpu.  Either disabling preemption on local
 *    cpu or grabbing pool->lock is enough for read access.  If
 *    POOL_DISASSOCIATED is set, it's identical to L.
 *
 * A: wq_pool_attach_mutex protected.
 *
 * PL: wq_pool_mutex protected.
 *
 * PR: wq_pool_mutex protected for writes.  RCU protected for reads.
 *
 * PW: wq_pool_mutex and wq->mutex protected for writes.  Either for reads.
 *
 * PWR: wq_pool_mutex and wq->mutex protected for writes.  Either or
 *      RCU for reads.
 *
 * WQ: wq->mutex protected.
 *
 * WR: wq->mutex protected for writes.  RCU protected for reads.
 *
 * MD: wq_mayday_lock protected.
 */

/* struct worker is defined in workqueue_internal.h */

struct worker_pool {
	raw_spinlock_t		lock;		/* the pool lock */
	int			cpu;		/* I: the associated cpu */
	int			node;		/* I: the associated node ID */
	int			id;		/* I: pool ID */
	unsigned int		flags;		/* X: flags */

	unsigned long		watchdog_ts;	/* L: watchdog timestamp */

	/*
	 * The counter is incremented in a process context on the associated CPU
	 * w/ preemption disabled, and decremented or reset in the same context
	 * but w/ pool->lock held. The readers grab pool->lock and are
	 * guaranteed to see if the counter reached zero.
	 */
	int			nr_running;

	struct list_head	worklist;	/* L: list of pending works */
	int			nr_workers;	/* L: total number of workers */
	int			nr_idle;	/* L: currently idle workers */

	struct list_head	idle_list;	/* L: list of idle workers */
	struct timer_list	idle_timer;	/* L: worker idle timeout */
	struct timer_list	mayday_timer;	/* L: SOS timer for workers */

	/* a worker is either on busy_hash or idle_list, or the manager */
	DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
						/* L: hash of busy workers */

	struct worker		*manager;	/* L: purely informational */
	struct list_head	workers;	/* A: attached workers */
	struct completion	*detach_completion; /* all workers detached */

	struct ida		worker_ida;	/* worker IDs for task name */

	struct workqueue_attrs	*attrs;		/* I: worker attributes */
	struct hlist_node	hash_node;	/* PL: unbound_pool_hash node */
	int			refcnt;		/* PL: refcnt for unbound pools */

	/*
	 * Destruction of pool is RCU protected to allow dereferences
	 * from get_work_pool().
	 */
	struct rcu_head		rcu;
};

/*
 * The per-pool workqueue.  While queued, the lower WORK_STRUCT_FLAG_BITS
 * of work_struct->data are used for flags and the remaining high bits
 * point to the pwq; thus, pwqs need to be aligned at two's power of the
 * number of flag bits.
 */
struct pool_workqueue {
	struct worker_pool	*pool;		/* I: the associated pool */
	struct workqueue_struct *wq;		/* I: the owning workqueue */
	int			work_color;	/* L: current color */
	int			flush_color;	/* L: flushing color */
	int			refcnt;		/* L: reference count */
	int			nr_in_flight[WORK_NR_COLORS];
						/* L: nr of in_flight works */

	/*
	 * nr_active management and WORK_STRUCT_INACTIVE:
	 *
	 * When pwq->nr_active >= max_active, a new work item is queued to
	 * pwq->inactive_works instead of pool->worklist and marked with
	 * WORK_STRUCT_INACTIVE.
	 *
	 * All work items marked with WORK_STRUCT_INACTIVE do not participate
	 * in pwq->nr_active and all work items in pwq->inactive_works are
	 * marked with WORK_STRUCT_INACTIVE.  But not all WORK_STRUCT_INACTIVE
	 * work items are in pwq->inactive_works.  Some of them are ready to
	 * run in pool->worklist or worker->scheduled.  Those work items are
	 * only struct wq_barrier which is used for flush_work() and should
	 * not participate in pwq->nr_active.  A non-barrier work item is
	 * marked with WORK_STRUCT_INACTIVE iff it is in pwq->inactive_works.
	 */
	int			nr_active;	/* L: nr of active works */
	int			max_active;	/* L: max active works */
	struct list_head	inactive_works;	/* L: inactive works */
	struct list_head	pwqs_node;	/* WR: node on wq->pwqs */
	struct list_head	mayday_node;	/* MD: node on wq->maydays */

	/*
	 * Release of unbound pwq is punted to system_wq.  See put_pwq()
	 * and pwq_unbound_release_workfn() for details.  pool_workqueue
	 * itself is also RCU protected so that the first pwq can be
	 * determined without grabbing wq->mutex.
	 */
	struct work_struct	unbound_release_work;
	struct rcu_head		rcu;
} __aligned(1 << WORK_STRUCT_FLAG_BITS);
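
/*
 * Illustration only, not part of the original file: a minimal userspace
 * sketch of why the alignment above lets a pwq pointer and flag bits share
 * one word, as the comment on struct pool_workqueue describes. Every
 * sketch_* and SKETCH_* name is hypothetical, SKETCH_FLAG_BITS = 8 is an
 * assumed value rather than the kernel's WORK_STRUCT_FLAG_BITS, and the
 * real work_struct->data encoding may carry more than is shown here. The
 * aligned attribute assumes GCC or Clang.
 *
```c
#include <stdint.h>

// Assumed number of low bits reserved for flags.
#define SKETCH_FLAG_BITS	8
#define SKETCH_FLAG_MASK	(((uintptr_t)1 << SKETCH_FLAG_BITS) - 1)

// Aligning the struct to 1 << SKETCH_FLAG_BITS bytes guarantees that the
// low SKETCH_FLAG_BITS bits of its address are zero.
struct sketch_pwq {
	int nr_active;
} __attribute__((aligned(1 << SKETCH_FLAG_BITS)));

static struct sketch_pwq sketch_example_pwq;

// Store the flags in the low bits and the pointer in the high bits.
static uintptr_t sketch_pack(struct sketch_pwq *pwq, uintptr_t flags)
{
	return (uintptr_t)pwq | (flags & SKETCH_FLAG_MASK);
}

// Recover the pointer by masking the flag bits off.
static struct sketch_pwq *sketch_unpack_pwq(uintptr_t data)
{
	return (struct sketch_pwq *)(data & ~SKETCH_FLAG_MASK);
}

// Recover the flags by keeping only the low bits.
static uintptr_t sketch_unpack_flags(uintptr_t data)
{
	return data & SKETCH_FLAG_MASK;
}
```
 *
 * Packing and then unpacking returns both the original pointer and the
 * original flags, which would not hold if the struct were less strictly
 * aligned than the flag field is wide.
 */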

/*
 * Structure used to wait for workqueue flush.
 */
struct wq_flusher {
	struct list_head	list;		/* WQ: list of flushers */
	int			flush_color;	/* WQ: flush color waiting for */
	struct completion	done;		/* flush completion */
};

struct wq_device;

/*
 * The externally visible workqueue.  It relays the issued work items to
 * the appropriate worker_pool through its pool_workqueues.
 */
struct workqueue_struct {
	struct list_head	pwqs;		/* WR: all pwqs of this wq */
	struct list_head	list;		/* PR: list of all workqueues */

	struct mutex		mutex;		/* protects this wq */
	int			work_color;	/* WQ: current work color */
	int			flush_color;	/* WQ: current flush color */
	atomic_t		nr_pwqs_to_flush; /* flush in progress */
	struct wq_flusher	*first_flusher;	/* WQ: first flusher */
	struct list_head	flusher_queue;	/* WQ: flush waiters */
	struct list_head	flusher_overflow; /* WQ: flush overflow list */

	struct list_head	maydays;	/* MD: pwqs requesting rescue */
	struct worker		*rescuer;	/* MD: rescue worker */

	int			nr_drainers;	/* WQ: drain in progress */
	int			saved_max_active; /* WQ: saved pwq max_active */

	struct workqueue_attrs	*unbound_attrs;	/* PW: only for unbound wqs */
	struct pool_workqueue	*dfl_pwq;	/* PW: only for unbound wqs */

#ifdef CONFIG_SYSFS
	struct wq_device	*wq_dev;	/* I: for sysfs interface */
#endif
#ifdef CONFIG_LOCKDEP
	char			*lock_name;
	struct lock_class_key	key;
	struct lockdep_map	lockdep_map;
#endif
	char			name[WQ_NAME_LEN]; /* I: workqueue name */

	/*
	 * Destruction of workqueue_struct is RCU protected to allow walking
	 * the workqueues list without grabbing wq_pool_mutex.
	 * This is used to dump all workqueues from sysrq.
	 */
	struct rcu_head		rcu;

	/* hot fields used during command issue, aligned to cacheline */
	unsigned int		flags ____cacheline_aligned; /* WQ: WQ_* flags */
	struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
	struct pool_workqueue __rcu *numa_pwq_tbl[]; /* PWR: unbound pwqs indexed by node */
};

static struct kmem_cache *pwq_cache;

static cpumask_var_t *wq_numa_possible_cpumask;
					/* possible CPUs of each node */

static bool wq_disable_numa;
module_param_named(disable_numa, wq_disable_numa, bool, 0444);

/* see the comment above the definition of WQ_POWER_EFFICIENT */
static bool wq_power_efficient = IS_ENABLED(CONFIG_WQ_POWER_EFFICIENT_DEFAULT);
module_param_named(power_efficient, wq_power_efficient, bool, 0444);

static bool wq_online;			/* can kworkers be created yet? */

static bool wq_numa_enabled;		/* unbound NUMA affinity enabled */

/* buf for wq_update_unbound_numa_attrs(), protected by CPU hotplug exclusion */
static struct workqueue_attrs * wq_update_unbound_numa_attrs_buf ;
2013-03-25 16:57:17 -07:00
static DEFINE_MUTEX ( wq_pool_mutex ) ; /* protects pools and workqueues list */
2018-05-18 08:47:13 -07:00
static DEFINE_MUTEX ( wq_pool_attach_mutex ) ; /* protects worker attach/detach */
2020-05-27 21:46:33 +02:00
static DEFINE_RAW_SPINLOCK ( wq_mayday_lock ) ; /* protects wq->maydays list */
2020-05-27 21:46:32 +02:00
/* wait for manager to go away */
static struct rcuwait manager_wait = __RCUWAIT_INITIALIZER ( manager_wait ) ;
2013-03-13 19:47:40 -07:00
2015-03-09 09:22:28 -04:00
static LIST_HEAD ( workqueues ) ; /* PR: list of all workqueues */
2013-03-25 16:57:17 -07:00
static bool workqueue_freezing ; /* PL: have wqs started freezing? */
2013-03-13 19:47:40 -07:00
2016-02-09 17:59:38 -05:00
/* PL: allowable cpus for unbound wqs and work items */
static cpumask_var_t wq_unbound_cpumask ;
/* CPU where unbound work was last round robin scheduled from this CPU */
static DEFINE_PER_CPU ( int , wq_rr_cpu_last ) ;
2015-04-27 17:58:39 +08:00
2016-02-09 17:59:38 -05:00
/*
* Local execution of unbound work items is no longer guaranteed . The
* following always forces round - robin CPU selection on unbound work items
* to uncover usages which depend on it .
*/
# ifdef CONFIG_DEBUG_WQ_FORCE_RR_CPU
static bool wq_debug_force_rr_cpu = true ;
# else
static bool wq_debug_force_rr_cpu = false ;
# endif
module_param_named ( debug_force_rr_cpu , wq_debug_force_rr_cpu , bool , 0644 ) ;
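
Round-robin CPU selection of the kind the comment above refers to can be sketched as below. This is a user-space illustration, not the kernel's selection code: next_rr_cpu() is a hypothetical helper, the allowed set is a 64-bit mask, and @last plays the role of the per-CPU wq_rr_cpu_last value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the next CPU after @last whose bit is set in @allowed, wrapping
 * around at @nr_cpus (at most 64 in this sketch). Returns -1 if @allowed
 * has no CPU below @nr_cpus.
 */
static int next_rr_cpu(uint64_t allowed, int last, int nr_cpus)
{
	for (int i = 1; i <= nr_cpus; i++) {
		int cpu = (last + i) % nr_cpus;

		if (allowed & (UINT64_C(1) << cpu))
			return cpu;
	}
	return -1;
}
```

A caller would store the returned CPU back into its "last" slot so that successive unbound work items rotate through the allowed CPUs.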
2013-03-13 19:47:40 -07:00
/* the per-cpu worker pools */
2016-03-15 14:52:49 -07:00
static DEFINE_PER_CPU_SHARED_ALIGNED ( struct worker_pool [ NR_STD_WORKER_POOLS ] , cpu_worker_pools ) ;
2013-03-13 19:47:40 -07:00
2013-03-25 16:57:17 -07:00
static DEFINE_IDR ( worker_pool_idr ) ; /* PR: idr of all pools */
2013-03-13 19:47:40 -07:00
2013-03-25 16:57:17 -07:00
/* PL: hash of all unbound pools keyed by pool->attrs */
2013-03-12 11:30:03 -07:00
static DEFINE_HASHTABLE ( unbound_pool_hash , UNBOUND_POOL_HASH_ORDER ) ;
2013-03-13 16:51:36 -07:00
/* I: attributes used when instantiating standard unbound pools on demand */
2013-03-12 11:30:03 -07:00
static struct workqueue_attrs * unbound_std_wq_attrs [ NR_STD_WORKER_POOLS ] ;
2013-09-05 12:30:04 -04:00
/* I: attributes used when instantiating ordered pools on demand */
static struct workqueue_attrs * ordered_wq_attrs [ NR_STD_WORKER_POOLS ] ;
2010-06-29 10:07:14 +02:00
struct workqueue_struct * system_wq __read_mostly ;
2013-05-06 17:44:55 -04:00
EXPORT_SYMBOL ( system_wq ) ;
2012-08-19 00:52:42 +03:00
struct workqueue_struct * system_highpri_wq __read_mostly ;
2012-08-15 23:25:39 +09:00
EXPORT_SYMBOL_GPL ( system_highpri_wq ) ;
2012-08-19 00:52:42 +03:00
struct workqueue_struct * system_long_wq __read_mostly ;
2010-06-29 10:07:14 +02:00
EXPORT_SYMBOL_GPL ( system_long_wq ) ;
2012-08-19 00:52:42 +03:00
struct workqueue_struct * system_unbound_wq __read_mostly ;
2010-07-02 10:03:51 +02:00
EXPORT_SYMBOL_GPL ( system_unbound_wq ) ;
2012-08-19 00:52:42 +03:00
struct workqueue_struct * system_freezable_wq __read_mostly ;
2011-02-21 09:52:50 +01:00
EXPORT_SYMBOL_GPL ( system_freezable_wq ) ;
2013-04-24 17:12:54 +05:30
struct workqueue_struct * system_power_efficient_wq __read_mostly ;
EXPORT_SYMBOL_GPL ( system_power_efficient_wq ) ;
struct workqueue_struct * system_freezable_power_efficient_wq __read_mostly ;
EXPORT_SYMBOL_GPL ( system_freezable_power_efficient_wq ) ;
2010-06-29 10:07:14 +02:00
2013-03-13 19:47:40 -07:00
static int worker_thread ( void * __worker ) ;
2015-04-02 19:14:39 +08:00
static void workqueue_sysfs_unregister ( struct workqueue_struct * wq ) ;
2019-09-23 11:08:58 -07:00
static void show_pwq ( struct pool_workqueue * pwq ) ;
2021-10-20 14:09:00 +11:00
static void show_one_worker_pool ( struct worker_pool * pool ) ;
2013-03-13 19:47:40 -07:00
2010-10-05 10:41:14 +02:00
# define CREATE_TRACE_POINTS
# include <trace/events/workqueue.h>
2013-03-25 16:57:17 -07:00
#define assert_rcu_or_pool_mutex()					\
2019-03-13 17:55:47 +01:00
	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\
2015-06-18 15:50:02 -07:00
			 !lockdep_is_held(&wq_pool_mutex),		\
2019-03-13 17:55:47 +01:00
			 "RCU or wq_pool_mutex should be held")
2013-03-13 19:47:40 -07:00
2015-05-12 20:32:29 +08:00
#define assert_rcu_or_wq_mutex_or_pool_mutex(wq)			\
2019-03-13 17:55:47 +01:00
	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\
2015-06-18 15:50:02 -07:00
			 !lockdep_is_held(&wq->mutex) &&		\
			 !lockdep_is_held(&wq_pool_mutex),		\
2019-03-13 17:55:47 +01:00
			 "RCU, wq->mutex or wq_pool_mutex should be held")
2015-05-12 20:32:29 +08:00
2013-03-12 11:30:03 -07:00
#define for_each_cpu_worker_pool(pool, cpu)				\
	for ((pool) = &per_cpu(cpu_worker_pools, cpu)[0];		\
	     (pool) < &per_cpu(cpu_worker_pools, cpu)[NR_STD_WORKER_POOLS]; \
2013-03-12 11:30:03 -07:00
	     (pool)++)
2012-07-13 22:16:44 -07:00
2013-03-12 11:29:58 -07:00
/**
* for_each_pool - iterate through all worker_pools in the system
* @ pool : iteration cursor
2013-03-13 16:51:36 -07:00
* @ pi : integer used for iteration
2013-03-12 11:30:00 -07:00
*
2019-03-13 17:55:47 +01:00
* This must be called either with wq_pool_mutex held or RCU read
2013-03-25 16:57:17 -07:00
* locked . If the pool needs to be used beyond the locking in effect , the
* caller is responsible for guaranteeing that the pool stays online .
2013-03-12 11:30:00 -07:00
*
* The if / else clause exists only for the lockdep assertion and can be
* ignored .
2013-03-12 11:29:58 -07:00
*/
2013-03-13 16:51:36 -07:00
# define for_each_pool(pool, pi) \
idr_for_each_entry ( & worker_pool_idr , pool , pi ) \
2013-03-25 16:57:17 -07:00
if ( ( { assert_rcu_or_pool_mutex ( ) ; false ; } ) ) { } \
2013-03-12 11:30:00 -07:00
else
2013-03-12 11:29:58 -07:00
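
The if/else trick mentioned in the comment above can be shown in isolation. The sketch below is a user-space illustration only: for_each_item(), assert_lock_held() and the counters are hypothetical, and the statement expression `({ ... })` is a GNU C extension, so this assumes gcc or clang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int lock_held;	/* stand-in for a lockdep-tracked lock */
static int nr_checks;	/* counts how often the assertion ran */

static void assert_lock_held(void)
{
	nr_checks++;
	assert(lock_held);
}

/*
 * Iterate over an int array. The statement expression runs the assertion
 * and evaluates to false, so the empty "if" branch is never taken and the
 * caller's loop body binds to the trailing "else".
 */
#define for_each_item(item, arr, n, i)					\
	for ((i) = 0; (i) < (n) && ((item) = &(arr)[(i)], true); (i)++)	\
		if (({ assert_lock_held(); false; })) { }		\
		else

static int sum_items(int *arr, size_t n)
{
	size_t i;
	int *item;
	int sum = 0;

	for_each_item(item, arr, n, i)
		sum += *item;
	return sum;
}
```

The point of the construct is that the check is folded into the iterator itself, so callers cannot use the loop without triggering the assertion, and the macro still accepts a single statement or a braced block as its body.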
2013-03-19 13:45:21 -07:00
/**
* for_each_pool_worker - iterate through all workers of a worker_pool
* @ worker : iteration cursor
* @ pool : worker_pool to iterate workers of
*
2018-05-18 08:47:13 -07:00
* This must be called with wq_pool_attach_mutex .
2013-03-19 13:45:21 -07:00
*
* The if / else clause exists only for the lockdep assertion and can be
* ignored .
*/
2014-05-20 17:46:31 +08:00
#define for_each_pool_worker(worker, pool)				\
	list_for_each_entry((worker), &(pool)->workers, node)		\
2018-05-18 08:47:13 -07:00
		if (({ lockdep_assert_held(&wq_pool_attach_mutex); false; })) { } \
2013-03-19 13:45:21 -07:00
else
2013-03-12 11:29:58 -07:00
/**
* for_each_pwq - iterate through all pool_workqueues of the specified workqueue
* @ pwq : iteration cursor
* @ wq : the target workqueue
2013-03-12 11:30:00 -07:00
*
2019-03-13 17:55:47 +01:00
* This must be called either with wq - > mutex held or RCU read locked .
2013-03-13 19:47:40 -07:00
* If the pwq needs to be used beyond the locking in effect , the caller is
* responsible for guaranteeing that the pwq stays online .
2013-03-12 11:30:00 -07:00
*
* The if / else clause exists only for the lockdep assertion and can be
* ignored .
2013-03-12 11:29:58 -07:00
*/
#define for_each_pwq(pwq, wq)						\
2019-11-15 19:01:25 +01:00
	list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node,		\
2019-08-15 10:18:42 -04:00
				 lockdep_is_held(&(wq->mutex)))
2010-07-02 10:03:51 +02:00
2009-11-16 01:09:48 +09:00
# ifdef CONFIG_DEBUG_OBJECTS_WORK
2020-08-14 17:40:27 -07:00
static const struct debug_obj_descr work_debug_descr ;
2009-11-16 01:09:48 +09:00
2011-03-07 09:58:33 +01:00
static void *work_debug_hint(void *addr)
{
	return ((struct work_struct *)addr)->func;
}
2016-05-19 17:09:41 -07:00
static bool work_is_static_object ( void * addr )
{
struct work_struct * work = addr ;
return test_bit ( WORK_STRUCT_STATIC_BIT , work_data_bits ( work ) ) ;
}
2009-11-16 01:09:48 +09:00
/*
* fixup_init is called when :
* - an active object is initialized
*/
2016-05-19 17:09:26 -07:00
static bool work_fixup_init ( void * addr , enum debug_obj_state state )
2009-11-16 01:09:48 +09:00
{
struct work_struct * work = addr ;
switch ( state ) {
case ODEBUG_STATE_ACTIVE :
cancel_work_sync ( work ) ;
debug_object_init ( work , & work_debug_descr ) ;
2016-05-19 17:09:26 -07:00
return true ;
2009-11-16 01:09:48 +09:00
default :
2016-05-19 17:09:26 -07:00
return false ;
2009-11-16 01:09:48 +09:00
}
}
/*
* fixup_free is called when :
* - an active object is freed
*/
2016-05-19 17:09:26 -07:00
static bool work_fixup_free ( void * addr , enum debug_obj_state state )
2009-11-16 01:09:48 +09:00
{
struct work_struct * work = addr ;
switch ( state ) {
case ODEBUG_STATE_ACTIVE :
cancel_work_sync ( work ) ;
debug_object_free ( work , & work_debug_descr ) ;
2016-05-19 17:09:26 -07:00
return true ;
2009-11-16 01:09:48 +09:00
default :
2016-05-19 17:09:26 -07:00
return false ;
2009-11-16 01:09:48 +09:00
}
}
2020-08-14 17:40:27 -07:00
static const struct debug_obj_descr work_debug_descr = {
2009-11-16 01:09:48 +09:00
. name = " work_struct " ,
2011-03-07 09:58:33 +01:00
. debug_hint = work_debug_hint ,
2016-05-19 17:09:41 -07:00
. is_static_object = work_is_static_object ,
2009-11-16 01:09:48 +09:00
. fixup_init = work_fixup_init ,
. fixup_free = work_fixup_free ,
} ;
static inline void debug_work_activate ( struct work_struct * work )
{
debug_object_activate ( work , & work_debug_descr ) ;
}
static inline void debug_work_deactivate ( struct work_struct * work )
{
debug_object_deactivate ( work , & work_debug_descr ) ;
}
void __init_work ( struct work_struct * work , int onstack )
{
if ( onstack )
debug_object_init_on_stack ( work , & work_debug_descr ) ;
else
debug_object_init ( work , & work_debug_descr ) ;
}
EXPORT_SYMBOL_GPL ( __init_work ) ;
void destroy_work_on_stack ( struct work_struct * work )
{
debug_object_free ( work , & work_debug_descr ) ;
}
EXPORT_SYMBOL_GPL ( destroy_work_on_stack ) ;
2014-03-23 14:20:44 +00:00
void destroy_delayed_work_on_stack(struct delayed_work *work)
{
	destroy_timer_on_stack(&work->timer);
	debug_object_free(&work->work, &work_debug_descr);
}
EXPORT_SYMBOL_GPL(destroy_delayed_work_on_stack);
2009-11-16 01:09:48 +09:00
# else
static inline void debug_work_activate ( struct work_struct * work ) { }
static inline void debug_work_deactivate ( struct work_struct * work ) { }
# endif
2013-09-10 09:52:35 +08:00
/**
2021-07-31 08:01:29 +08:00
* worker_pool_assign_id - allocate ID and assign it to @ pool
2013-09-10 09:52:35 +08:00
* @ pool : the pool pointer of interest
*
* Returns 0 if ID in [ 0 , WORK_OFFQ_POOL_NONE ) is allocated and assigned
* successfully , - errno on failure .
*/
2013-01-24 11:01:33 -08:00
static int worker_pool_assign_id ( struct worker_pool * pool )
{
int ret ;
2013-03-25 16:57:17 -07:00
lockdep_assert_held ( & wq_pool_mutex ) ;
2013-03-13 19:47:40 -07:00
2013-09-10 09:52:35 +08:00
ret = idr_alloc ( & worker_pool_idr , pool , 0 , WORK_OFFQ_POOL_NONE ,
GFP_KERNEL ) ;
Linux 3.9-rc5
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJRWLTrAAoJEHm+PkMAQRiGe8oH/iMy48mecVWvxVZn74Tx3Cef
xmW/PnAIj28EhSPqK49N/Ow6AfQToFKf7AP0ge20KAf5teTq95AY+tH74DAANt8F
BjKXXTZiR5xwBvRkq7CR5wDcCvEcBAAz8fgTEd6SEDB2d2VXFf5eKdKUqt1avTCh
Z6Hup5kuwX+ddtwY2DCBXtp2n6fL0Rm5yLzY1A3OOBye1E7VyLTF7M5BR603Q44P
4kRLxn8+R7jy3hTuZIhAeoS8TKUoBwVk7DmKxEzrhTHZVOmvwE9lEHybRnIyOpd/
k1JnbRbiPsLsCVFOn10SQkGDAIk00lro3tuWP2C1ljERiD/OOh5Ui9nXYAhMkbI=
=q15K
-----END PGP SIGNATURE-----
Merge tag 'v3.9-rc5' into wq/for-3.10
Writeback conversion to workqueue will be based on top of wq/for-3.10
branch to take advantage of custom attrs and NUMA support for unbound
workqueues. Mainline currently contains two commits which result in
non-trivial merge conflicts with wq/for-3.10 and because
block/for-3.10/core is based on v3.9-rc3 which contains one of the
conflicting commits, we need a pre-merge-window merge anyway. Let's
pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer
from workqueue merge conflicts.
The two conflicts and their resolutions:
* e68035fb65 ("workqueue: convert to idr_alloc()") in mainline changes
worker_pool_assign_id() to use idr_alloc() instead of the old idr
interface. worker_pool_assign_id() goes through multiple locking
changes in wq/for-3.10 causing the following conflict.
static int worker_pool_assign_id(struct worker_pool *pool)
{
int ret;
<<<<<<< HEAD
lockdep_assert_held(&wq_pool_mutex);
do {
if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
return -ENOMEM;
ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
} while (ret == -EAGAIN);
=======
mutex_lock(&worker_pool_idr_mutex);
ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
if (ret >= 0)
pool->id = ret;
mutex_unlock(&worker_pool_idr_mutex);
>>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
return ret < 0 ? ret : 0;
}
We want locking from the former and idr_alloc() usage from the
latter, which can be combined to the following.
static int worker_pool_assign_id(struct worker_pool *pool)
{
int ret;
lockdep_assert_held(&wq_pool_mutex);
ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
if (ret >= 0) {
pool->id = ret;
return 0;
}
return ret;
}
* eb2834285c ("workqueue: fix possible pool stall bug in
wq_unbind_fn()") updated wq_unbind_fn() such that it has single
larger for_each_std_worker_pool() loop instead of two separate loops
with a schedule() call in between. wq/for-3.10 renamed
pool->assoc_mutex to pool->manager_mutex causing the following
conflict (earlier function body and comments omitted for brevity).
static void wq_unbind_fn(struct work_struct *work)
{
...
spin_unlock_irq(&pool->lock);
<<<<<<< HEAD
mutex_unlock(&pool->manager_mutex);
}
=======
mutex_unlock(&pool->assoc_mutex);
>>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
schedule();
<<<<<<< HEAD
for_each_cpu_worker_pool(pool, cpu)
=======
>>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
atomic_set(&pool->nr_running, 0);
spin_lock_irq(&pool->lock);
wake_up_worker(pool);
spin_unlock_irq(&pool->lock);
}
}
The resolution is mostly trivial. We want the control flow of the
latter with the rename of the former.
static void wq_unbind_fn(struct work_struct *work)
{
...
spin_unlock_irq(&pool->lock);
mutex_unlock(&pool->manager_mutex);
schedule();
atomic_set(&pool->nr_running, 0);
spin_lock_irq(&pool->lock);
wake_up_worker(pool);
spin_unlock_irq(&pool->lock);
}
}
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-01 17:08:13 -07:00
	if (ret >= 0) {
2013-03-13 14:59:38 -07:00
		pool->id = ret;
2013-04-01 17:08:13 -07:00
return 0 ;
}
2013-03-12 11:30:00 -07:00
return ret ;
2013-01-24 11:01:33 -08:00
}
2013-04-01 11:23:35 -07:00
/**
* unbound_pwq_by_node - return the unbound pool_workqueue for the given node
* @ wq : the target workqueue
* @ node : the node ID
*
2019-03-13 17:55:47 +01:00
* This must be called with any of wq_pool_mutex , wq - > mutex or RCU
2015-05-12 20:32:29 +08:00
* read locked .
2013-04-01 11:23:35 -07:00
* If the pwq needs to be used beyond the locking in effect , the caller is
* responsible for guaranteeing that the pwq stays online .
2013-07-31 14:59:24 -07:00
*
* Return : The unbound pool_workqueue for @ node .
2013-04-01 11:23:35 -07:00
*/
static struct pool_workqueue * unbound_pwq_by_node ( struct workqueue_struct * wq ,
int node )
{
2015-05-12 20:32:29 +08:00
assert_rcu_or_wq_mutex_or_pool_mutex ( wq ) ;
2016-02-03 13:54:25 -05:00
/*
* XXX : @ node can be NUMA_NO_NODE if CPU goes offline while a
* delayed item is pending . The plan is to keep CPU - > NODE
* mapping valid and stable across CPU on / offlines . Once that
* happens , this workaround can be removed .
*/
	if (unlikely(node == NUMA_NO_NODE))
		return wq->dfl_pwq;
2013-04-01 11:23:35 -07:00
	return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
}
2010-06-29 10:07:11 +02:00
static unsigned int work_color_to_flags(int color)
{
	return color << WORK_STRUCT_COLOR_SHIFT;
}
2021-08-17 09:32:35 +08:00
static int get_work_color(unsigned long work_data)
2010-06-29 10:07:11 +02:00
{
2021-08-17 09:32:35 +08:00
	return (work_data >> WORK_STRUCT_COLOR_SHIFT) &
2010-06-29 10:07:11 +02:00
		((1 << WORK_STRUCT_COLOR_BITS) - 1);
}
static int work_next_color ( int color )
{
return ( color + 1 ) % WORK_NR_COLORS ;
}
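
The three color helpers above pack a small counter into a bit field of the work data word. A stand-alone sketch of the same arithmetic follows; the SKETCH_* constants are made-up values chosen for illustration and are not the kernel's WORK_STRUCT_* definitions.

```c
#include <assert.h>

/* Hypothetical layout: a 4-bit color field starting at bit 4. */
enum {
	SKETCH_COLOR_SHIFT = 4,
	SKETCH_COLOR_BITS  = 4,
	SKETCH_NR_COLORS   = 1 << SKETCH_COLOR_BITS,
};

/* Move a color value into its field position. */
static unsigned long color_to_flags(int color)
{
	return (unsigned long)color << SKETCH_COLOR_SHIFT;
}

/* Extract the color field, ignoring all other bits of the word. */
static int color_from_data(unsigned long data)
{
	return (data >> SKETCH_COLOR_SHIFT) & ((1UL << SKETCH_COLOR_BITS) - 1);
}

/* Advance to the next color, wrapping around. */
static int next_color(int color)
{
	return (color + 1) % SKETCH_NR_COLORS;
}
```

Other flag bits in the same word do not disturb the round trip, since extraction masks everything outside the field.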
2005-04-16 15:20:36 -07:00
2007-05-23 13:57:57 -07:00
/*
2013-02-13 19:29:12 -08:00
 * While queued, %WORK_STRUCT_PWQ is set and non flag bits of a work's data
 * contain the pointer to the queued pwq.  Once execution starts, the flag
2013-01-24 11:01:33 -08:00
 * is cleared and the high bits contain OFFQ flags and pool ID.
2010-06-29 10:07:13 +02:00
 *
2013-02-13 19:29:12 -08:00
 * set_work_pwq(), set_work_pool_and_clear_pending(), mark_work_canceling()
 * and clear_work_data() can be used to set the pwq, pool or clear
2012-08-03 10:30:46 -07:00
 * work->data.  These functions should only be called while the work is
 * owned - ie. while the PENDING bit is set.
2010-06-29 10:07:13 +02:00
 *
2013-02-13 19:29:12 -08:00
 * get_work_pool() and get_work_pwq() can be used to obtain the pool or pwq
2013-01-24 11:01:33 -08:00
 * corresponding to a work.  Pool is available once the work has been
2013-02-13 19:29:12 -08:00
 * queued anywhere after initialization until it is sync canceled.  pwq is
2013-01-24 11:01:33 -08:00
 * available only while the work item is queued.
2010-06-29 10:07:13 +02:00
 *
2012-08-03 10:30:46 -07:00
 * %WORK_OFFQ_CANCELING is used to mark a work item which is being
 * canceled.  While being canceled, a work item may have its PENDING set
 * but stay off timer and worklist for arbitrarily long and nobody should
 * try to steal the PENDING bit.
2007-05-23 13:57:57 -07:00
*/
2010-06-29 10:07:13 +02:00
static inline void set_work_data ( struct work_struct * work , unsigned long data ,
unsigned long flags )
2006-11-22 14:54:49 +00:00
{
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( ! work_pending ( work ) ) ;
2010-06-29 10:07:13 +02:00
	atomic_long_set(&work->data, data | flags | work_static(work));
}
2006-11-22 14:54:49 +00:00
2013-02-13 19:29:12 -08:00
static void set_work_pwq ( struct work_struct * work , struct pool_workqueue * pwq ,
2010-06-29 10:07:13 +02:00
unsigned long extra_flags )
{
2013-02-13 19:29:12 -08:00
set_work_data ( work , ( unsigned long ) pwq ,
WORK_STRUCT_PENDING | WORK_STRUCT_PWQ | extra_flags ) ;
2006-11-22 14:54:49 +00:00
}
2013-02-06 18:04:53 -08:00
static void set_work_pool_and_keep_pending(struct work_struct *work,
					   int pool_id)
{
	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT,
		      WORK_STRUCT_PENDING);
}
2013-01-24 11:01:33 -08:00
static void set_work_pool_and_clear_pending ( struct work_struct * work ,
int pool_id )
2010-06-29 10:07:13 +02:00
{
2012-08-13 17:08:19 -07:00
/*
* The following wmb is paired with the implied mb in
* test_and_set_bit ( PENDING ) and ensures all updates to @ work made
* here are visible to and precede any updates by the next PENDING
* owner .
*/
smp_wmb ( ) ;
2013-01-24 11:01:33 -08:00
	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
workqueue: fix ghost PENDING flag while doing MQ IO
The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:
[ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[ 601.347574] Tainted: G O 4.4.5-1-storage+ #6
[ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000
[ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[ 601.350965] Call Trace:
[ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80
[ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50
The underlying device is a null_blk, with default parameters:
queue_mode = MQ
submit_queues = 1
Verification that nullb0 has something inflight:
root@pserver8:~# cat /sys/block/nullb0/inflight
0 1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
ffff8838038e2400
...
During debug it became clear that stalled request is always inserted in
the rq_list from the following path:
save_stack_trace_tsk + 34
blk_mq_insert_requests + 231
blk_mq_flush_plug_list + 281
blk_flush_plug_list + 199
wait_on_page_bit + 192
__filemap_fdatawait_range + 228
filemap_fdatawait_range + 20
filemap_write_and_wait_range + 63
blkdev_fsync + 27
vfs_fsync_range + 73
blkdev_write_iter + 202
__vfs_write + 170
vfs_write + 169
kernel_write + 56
So blk_flush_plug_list() was called with from_schedule == true.
If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().
That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.
Further debugging shows the following traces from different CPUs:
CPU#0                                  CPU#1
----------------------------------     -------------------------------

request A inserted
STORE hctx->ctx_map[0] bit marked
kblockd_schedule...() returns 1
<schedule to kblockd workqueue>
                                       request B inserted
                                       STORE hctx->ctx_map[1] bit marked
                                       kblockd_schedule...() returns 0
*** WORK PENDING bit is cleared ***
flush_busy_ctxs() is executed, but
bit 1, set by CPU#1, is not observed
This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.
The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-04-26 13:15:35 +02:00
/*
* The following mb guarantees that previous clear of a PENDING bit
* will not be reordered with any speculative LOADS or STORES from
* work - > current_func , which is executed afterwards . This possible
2019-02-19 23:53:27 +08:00
* reordering can lead to a missed execution on attempt to queue
2016-04-26 13:15:35 +02:00
* the same @work.  E.g. consider this case:
*
*   CPU#0                         CPU#1
*   ----------------------------  --------------------------------
*
* 1  STORE event_indicated
* 2  queue_work_on() {
* 3    test_and_set_bit(PENDING)
* 4  }                             set_..._and_clear_pending() {
* 5                                  set_work_data() # clear bit
* 6                                  smp_mb()
* 7                               work->current_func() {
* 8                                  LOAD event_indicated
*                                 }
*
* Without an explicit full barrier speculative LOAD on line 8 can
* be executed before CPU#0 does STORE on line 1.  If that happens,
* CPU#0 observes the PENDING bit is still set and new execution of
* a @work is not queued in a hope, that CPU#1 will eventually
* finish the queued @work.  Meanwhile CPU#1 does not see
* event_indicated is set, because speculative LOAD was executed
* before actual STORE.
*/
smp_mb ( ) ;
2010-06-29 10:07:13 +02:00
}
2006-01-08 01:05:12 -08:00
2010-06-29 10:07:13 +02:00
static void clear_work_data ( struct work_struct * work )
2005-04-16 15:20:36 -07:00
{
2013-01-24 11:01:33 -08:00
smp_wmb ( ) ; /* see set_work_pool_and_clear_pending() */
set_work_data ( work , WORK_STRUCT_NO_POOL , 0 ) ;
2005-04-16 15:20:36 -07:00
}
2013-02-13 19:29:12 -08:00
static struct pool_workqueue * get_work_pwq ( struct work_struct * work )
2007-05-09 02:34:12 -07:00
{
2010-07-22 14:14:25 +02:00
unsigned long data = atomic_long_read ( & work - > data ) ;
2010-06-29 10:07:13 +02:00
2013-02-13 19:29:12 -08:00
if ( data & WORK_STRUCT_PWQ )
2010-07-22 14:14:25 +02:00
return ( void * ) ( data & WORK_STRUCT_WQ_DATA_MASK ) ;
else
return NULL ;
2010-04-23 17:40:40 +02:00
}
2013-01-24 11:01:33 -08:00
/**
* get_work_pool - return the worker_pool a given work was associated with
* @ work : the work item of interest
*
2013-03-25 16:57:17 -07:00
* Pools are created and destroyed under wq_pool_mutex , and allow read
2019-03-13 17:55:47 +01:00
* access under RCU read lock . As such , this function should be
* called under wq_pool_mutex or inside of a rcu_read_lock ( ) region .
2013-03-12 11:30:00 -07:00
*
* All fields of the returned pool are accessible as long as the above
* mentioned locking is in effect . If the returned pool needs to be used
* beyond the critical section , the caller is responsible for ensuring the
* returned pool is and stays online .
2013-07-31 14:59:24 -07:00
*
* Return : The worker_pool @ work was last associated with . % NULL if none .
2013-01-24 11:01:33 -08:00
*/
static struct worker_pool * get_work_pool ( struct work_struct * work )
2006-11-22 14:54:49 +00:00
{
2010-07-22 14:14:25 +02:00
unsigned long data = atomic_long_read ( & work - > data ) ;
2013-01-24 11:01:33 -08:00
int pool_id ;
2010-06-29 10:07:13 +02:00
2013-03-25 16:57:17 -07:00
assert_rcu_or_pool_mutex ( ) ;
2013-03-12 11:30:00 -07:00
2013-02-13 19:29:12 -08:00
if ( data & WORK_STRUCT_PWQ )
return ( ( struct pool_workqueue * )
2013-01-24 11:01:33 -08:00
( data & WORK_STRUCT_WQ_DATA_MASK ) ) - > pool ;
2010-06-29 10:07:13 +02:00
2013-01-24 11:01:33 -08:00
pool_id = data > > WORK_OFFQ_POOL_SHIFT ;
if ( pool_id = = WORK_OFFQ_POOL_NONE )
2010-06-29 10:07:13 +02:00
return NULL ;
2013-03-12 11:30:00 -07:00
return idr_find ( & worker_pool_idr , pool_id ) ;
2013-01-24 11:01:33 -08:00
}
/**
* get_work_pool_id - return the worker pool ID a given work is associated with
* @ work : the work item of interest
*
2013-07-31 14:59:24 -07:00
* Return : The worker_pool ID @ work was last associated with .
2013-01-24 11:01:33 -08:00
* % WORK_OFFQ_POOL_NONE if none .
*/
static int get_work_pool_id ( struct work_struct * work )
{
2013-02-07 13:14:20 -08:00
unsigned long data = atomic_long_read ( & work - > data ) ;
2013-02-13 19:29:12 -08:00
if ( data & WORK_STRUCT_PWQ )
return ( ( struct pool_workqueue * )
2013-02-07 13:14:20 -08:00
( data & WORK_STRUCT_WQ_DATA_MASK ) ) - > pool - > id ;
2013-01-24 11:01:33 -08:00
2013-02-07 13:14:20 -08:00
return data > > WORK_OFFQ_POOL_SHIFT ;
2013-01-24 11:01:33 -08:00
}
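The accessors above decode one word that holds either a pool_workqueue pointer plus flag bits, or an off-queue pool ID shifted above the flag bits. The following user-space sketch shows that packing idea only. The bit positions, the `DEMO_*` constants and the `demo_*` helpers are assumptions made for illustration; the kernel's actual layout is defined outside this excerpt.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the low 8 bits are flags, the rest is payload. */
#define DEMO_FLAG_BITS  8
#define DEMO_PENDING    ((uintptr_t)1 << 0)
#define DEMO_PWQ        ((uintptr_t)1 << 2)
#define DEMO_DATA_MASK  (~(((uintptr_t)1 << DEMO_FLAG_BITS) - 1))
#define DEMO_POOL_SHIFT DEMO_FLAG_BITS
#define DEMO_POOL_NONE  (UINTPTR_MAX >> DEMO_POOL_SHIFT)

/* Aligned so that the low DEMO_FLAG_BITS bits of its address are zero. */
struct demo_pwq {
	_Alignas(1 << DEMO_FLAG_BITS) int pool_id;
};

/* Queued state: pointer in the high bits, PWQ and PENDING flags set. */
static uintptr_t demo_pack_pwq(struct demo_pwq *pwq)
{
	return (uintptr_t)pwq | DEMO_PWQ | DEMO_PENDING;
}

/* Off-queue state: only the pool ID, shifted above the flag bits. */
static uintptr_t demo_pack_pool_id(uintptr_t pool_id)
{
	return pool_id << DEMO_POOL_SHIFT;
}

/* Mirrors the shape of get_work_pwq(): pointer if the PWQ flag is set. */
static struct demo_pwq *demo_get_pwq(uintptr_t data)
{
	if (data & DEMO_PWQ)
		return (struct demo_pwq *)(data & DEMO_DATA_MASK);
	return NULL;
}

/* Mirrors the shape of get_work_pool_id(): follow the pointer or shift. */
static uintptr_t demo_get_pool_id(uintptr_t data)
{
	if (data & DEMO_PWQ)
		return (uintptr_t)demo_get_pwq(data)->pool_id;
	return data >> DEMO_POOL_SHIFT;
}
```

The alignment requirement is what makes the trick work: because the pointer's low bits are always zero, they can carry flags without losing any address information.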
2012-08-03 10:30:46 -07:00
static void mark_work_canceling ( struct work_struct * work )
{
2013-01-24 11:01:33 -08:00
unsigned long pool_id = get_work_pool_id ( work ) ;
2012-08-03 10:30:46 -07:00
2013-01-24 11:01:33 -08:00
pool_id < < = WORK_OFFQ_POOL_SHIFT ;
set_work_data ( work , pool_id | WORK_OFFQ_CANCELING , WORK_STRUCT_PENDING ) ;
2012-08-03 10:30:46 -07:00
}
static bool work_is_canceling ( struct work_struct * work )
{
unsigned long data = atomic_long_read ( & work - > data ) ;
2013-02-13 19:29:12 -08:00
return ! ( data & WORK_STRUCT_PWQ ) & & ( data & WORK_OFFQ_CANCELING ) ;
2012-08-03 10:30:46 -07:00
}
2010-06-29 10:07:14 +02:00
/*
2012-07-13 22:16:45 -07:00
* Policy functions . These define the policies on how the global worker
* pools are managed . Unless noted otherwise , these functions assume that
2013-01-24 11:01:33 -08:00
* they ' re being called with pool - > lock held .
2010-06-29 10:07:14 +02:00
*/
2012-07-12 14:46:37 -07:00
static bool __need_more_worker ( struct worker_pool * pool )
2007-05-09 02:34:17 -07:00
{
2021-12-23 20:31:40 +08:00
return ! pool - > nr_running ;
2007-05-09 02:34:17 -07:00
}
[PATCH] WorkStruct: Use direct assignment rather than cmpxchg()
Use direct assignment rather than cmpxchg() as the latter is unavailable
and unimplementable on some platforms and is actually unnecessary.
The use of cmpxchg() was to guard against two possibilities, neither of
which can actually occur:
(1) The pending flag may have been unset or may be cleared. However, given
where it's called, the pending flag is _always_ set. I don't think it
can be unset whilst we're in set_wq_data().
Once the work is enqueued to be actually run, the only way off the queue
is for it to be actually run.
If it's a delayed work item, then the bit can't be cleared by the timer
because we haven't started the timer yet. Also, the pending bit can't be
cleared by cancelling the delayed work _until_ the work item has had its
timer started.
(2) The workqueue pointer might change. This can only happen in two cases:
(a) The work item has just been queued to actually run, and so we're
protected by the appropriate workqueue spinlock.
(b) A delayed work item is being queued, and so the timer hasn't been
started yet, and so no one else knows about the work item or can
access it (the pending bit protects us).
Besides, set_wq_data() _sets_ the workqueue pointer unconditionally, so
it can be assigned instead.
So, replacing the set_wq_data() with a straight assignment would be okay
in most cases.
The problem is where we end up tangling with test_and_set_bit() emulated
using spinlocks, and even then it's not a problem _provided_
test_and_set_bit() doesn't attempt to modify the word if the bit was
set.
If that's a problem, then a bitops-proofed assignment will be required -
equivalent to atomic_set() vs other atomic_xxx() ops.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 11:33:26 +00:00
/*
2010-06-29 10:07:14 +02:00
* Need to wake up a worker ? Called from anything but currently
* running workers .
2012-07-12 14:46:37 -07:00
*
* Note that , because unbound workers never contribute to nr_running , this
2013-01-24 11:01:34 -08:00
* function will always return % true for unbound pools as long as the
2012-07-12 14:46:37 -07:00
* worklist isn ' t empty .
2006-12-07 11:33:26 +00:00
*/
2012-07-12 14:46:37 -07:00
static bool need_more_worker ( struct worker_pool * pool )
2006-11-22 14:54:49 +00:00
{
2012-07-12 14:46:37 -07:00
return ! list_empty ( & pool - > worklist ) & & __need_more_worker ( pool ) ;
2010-06-29 10:07:14 +02:00
}
2006-12-07 11:33:26 +00:00
2010-06-29 10:07:14 +02:00
/* Can I start working? Called from busy but !running workers. */
2012-07-12 14:46:37 -07:00
static bool may_start_working ( struct worker_pool * pool )
2010-06-29 10:07:14 +02:00
{
2012-07-12 14:46:37 -07:00
return pool - > nr_idle ;
2010-06-29 10:07:14 +02:00
}
/* Do I need to keep working? Called from currently running workers. */
2012-07-12 14:46:37 -07:00
static bool keep_working ( struct worker_pool * pool )
2010-06-29 10:07:14 +02:00
{
2021-12-23 20:31:40 +08:00
return ! list_empty ( & pool - > worklist ) & & ( pool - > nr_running < = 1 ) ;
2010-06-29 10:07:14 +02:00
}
/* Do we need a new worker? Called from manager. */
2012-07-12 14:46:37 -07:00
static bool need_to_create_worker ( struct worker_pool * pool )
2010-06-29 10:07:14 +02:00
{
2012-07-12 14:46:37 -07:00
return need_more_worker ( pool ) & & ! may_start_working ( pool ) ;
2010-06-29 10:07:14 +02:00
}
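The policy predicates above can be modelled in user space to see how they combine. This is a simplified sketch: `policy_pool` is a hypothetical stand-in that replaces the worklist with a plain length counter, and no locking is modelled.

```c
#include <stdbool.h>

/* Hypothetical, lock-free stand-in for the fields the predicates read. */
struct policy_pool {
	int worklist_len; /* number of queued work items */
	int nr_running;   /* workers currently running on the CPU */
	int nr_idle;      /* workers parked on the idle list */
};

/* Work is queued and nobody is running it: someone should be woken. */
static bool policy_need_more_worker(const struct policy_pool *p)
{
	return p->worklist_len > 0 && p->nr_running == 0;
}

/* At least one idle worker is available. */
static bool policy_may_start_working(const struct policy_pool *p)
{
	return p->nr_idle > 0;
}

/* The running worker keeps going while work is queued and it is alone. */
static bool policy_keep_working(const struct policy_pool *p)
{
	return p->worklist_len > 0 && p->nr_running <= 1;
}

/* A new worker is needed only if a wakeup is wanted and none is idle. */
static bool policy_need_to_create_worker(const struct policy_pool *p)
{
	return policy_need_more_worker(p) && !policy_may_start_working(p);
}
```

For example, with two queued items, no running worker and no idle worker, the model asks for a new worker; as soon as one idle worker exists, a wakeup is enough and no creation is requested.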
2006-11-22 14:54:49 +00:00
2010-06-29 10:07:14 +02:00
/* Do we have too many workers and should some go away? */
2012-07-12 14:46:37 -07:00
static bool too_many_workers ( struct worker_pool * pool )
2010-06-29 10:07:14 +02:00
{
2017-10-09 08:04:13 -07:00
bool managing = pool - > flags & POOL_MANAGER_ACTIVE ;
2012-07-12 14:46:37 -07:00
int nr_idle = pool - > nr_idle + managing ; /* manager is considered idle */
int nr_busy = pool - > nr_workers - nr_idle ;
2010-06-29 10:07:14 +02:00
return nr_idle > 2 & & ( nr_idle - 2 ) * MAX_IDLE_WORKERS_RATIO > = nr_busy ;
2006-11-22 14:54:49 +00:00
}
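The arithmetic in too_many_workers() is easier to follow with concrete numbers. The sketch below repeats the formula in user space. `idle_pool` and `idle_too_many_workers()` are hypothetical names, and the ratio value of 4 is an assumption, since MAX_IDLE_WORKERS_RATIO is defined outside this excerpt.

```c
#include <stdbool.h>

/* Assumed value; the real constant is defined outside this excerpt. */
#define DEMO_MAX_IDLE_WORKERS_RATIO 4

/* Hypothetical stand-in for the fields too_many_workers() reads. */
struct idle_pool {
	int nr_workers;      /* total workers in the pool */
	int nr_idle;         /* workers on the idle list */
	bool manager_active; /* stand-in for POOL_MANAGER_ACTIVE */
};

static bool idle_too_many_workers(const struct idle_pool *pool)
{
	/* The manager is counted as idle, as in the original. */
	int nr_idle = pool->nr_idle + pool->manager_active;
	int nr_busy = pool->nr_workers - nr_idle;

	/* Two idle workers are always tolerated; the rest are weighed. */
	return nr_idle > 2 &&
	       (nr_idle - 2) * DEMO_MAX_IDLE_WORKERS_RATIO >= nr_busy;
}
```

Under the assumed ratio, a pool with 10 workers of which 4 are idle has 6 busy workers and (4 - 2) * 4 = 8 >= 6, so it is over-provisioned; with 20 workers and the same 4 idle ones, 8 >= 16 fails and all workers are kept.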
2010-04-23 17:40:40 +02:00
/*
2010-06-29 10:07:14 +02:00
* Wake up functions .
*/
2021-12-23 20:31:38 +08:00
/* Return the first idle worker. Called with pool->lock held. */
2014-05-22 16:44:07 +08:00
static struct worker * first_idle_worker ( struct worker_pool * pool )
2010-06-29 10:07:13 +02:00
{
2012-07-12 14:46:37 -07:00
if ( unlikely ( list_empty ( & pool - > idle_list ) ) )
2010-06-29 10:07:13 +02:00
return NULL ;
2012-07-12 14:46:37 -07:00
return list_first_entry ( & pool - > idle_list , struct worker , entry ) ;
2010-06-29 10:07:13 +02:00
}
/**
* wake_up_worker - wake up an idle worker
2012-07-12 14:46:37 -07:00
* @ pool : worker pool to wake worker from
2010-06-29 10:07:13 +02:00
*
2012-07-12 14:46:37 -07:00
* Wake up the first idle worker of @ pool .
2010-06-29 10:07:13 +02:00
*
* CONTEXT :
2020-05-27 21:46:33 +02:00
* raw_spin_lock_irq ( pool - > lock ) .
2010-06-29 10:07:13 +02:00
*/
2012-07-12 14:46:37 -07:00
static void wake_up_worker ( struct worker_pool * pool )
2010-06-29 10:07:13 +02:00
{
2014-05-22 16:44:07 +08:00
struct worker * worker = first_idle_worker ( pool ) ;
2010-06-29 10:07:13 +02:00
if ( likely ( worker ) )
wake_up_process ( worker - > task ) ;
}
2010-06-29 10:07:13 +02:00
/**
2019-03-13 17:55:48 +01:00
* wq_worker_running - a worker is running again
2010-06-29 10:07:14 +02:00
* @ task : task waking up
*
2019-03-13 17:55:48 +01:00
* This function is called when a worker returns from schedule ( )
2010-06-29 10:07:14 +02:00
*/
2019-03-13 17:55:48 +01:00
void wq_worker_running ( struct task_struct * task )
2010-06-29 10:07:14 +02:00
{
struct worker * worker = kthread_data ( task ) ;
2019-03-13 17:55:48 +01:00
if ( ! worker - > sleeping )
return ;
workqueue: Fix unbind_workers() VS wq_worker_running() race
At CPU-hotplug time, unbind_workers() may preempt a worker while it is
waking up. In that case the following scenario can happen:
unbind_workers()                     wq_worker_running()
--------------                       -------------------
                                     if (!(worker->flags & WORKER_NOT_RUNNING))
                                         //PREEMPTED by unbind_workers
worker->flags |= WORKER_UNBOUND;
[...]
atomic_set(&pool->nr_running, 0);
//resume to worker
                                     atomic_inc(&worker->pool->nr_running);
After unbind_workers() resets pool->nr_running, the value is expected to
remain 0 until the pool gets rebound, in case cpu_up() is called on
the target CPU in the future. But here the race leaves pool->nr_running
with a value of 1, triggering the following warning when the worker goes
idle:
WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
Modules linked in:
CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: 0x0 (rcu_par_gp)
RIP: 0010:worker_enter_idle+0x95/0xc0
Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0
RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
FS: 0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
worker_thread+0x89/0x3d0
? process_one_work+0x400/0x400
kthread+0x162/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Also, due to this incorrect "nr_running == 1", further queued work may
end up not being served, because no worker is awakened at work insert time.
This raises rcutorture writer stalls, for example.
Fix this by disabling preemption in the right place in
wq_worker_running().
It's worth noting that if the worker migrates and runs concurrently with
unbind_workers(), it is guaranteed to see the WORKER_UNBOUND flag update
due to set_cpus_allowed_ptr() acquiring/releasing rq->lock.
Fixes: 6d25be5782e4 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-12-01 16:19:44 +01:00
/*
* If preempted by unbind_workers ( ) between the WORKER_NOT_RUNNING check
* and the nr_running increment below , we may ruin the nr_running reset
* and leave with an unexpected pool - > nr_running = = 1 on the newly unbound
* pool . Protect against such race .
*/
preempt_disable ( ) ;
2019-03-13 17:55:48 +01:00
if ( ! ( worker - > flags & WORKER_NOT_RUNNING ) )
2021-12-23 20:31:40 +08:00
worker - > pool - > nr_running + + ;
2021-12-01 16:19:44 +01:00
preempt_enable ( ) ;
2019-03-13 17:55:48 +01:00
worker - > sleeping = 0 ;
2010-06-29 10:07:14 +02:00
}
/**
* wq_worker_sleeping - a worker is going to sleep
* @ task : task going to sleep
*
2019-03-13 17:55:48 +01:00
* This function is called from schedule ( ) when a busy worker is
2021-12-07 15:35:37 +08:00
* going to sleep .
2010-06-29 10:07:14 +02:00
*/
2019-03-13 17:55:48 +01:00
void wq_worker_sleeping ( struct task_struct * task )
2010-06-29 10:07:14 +02:00
{
2021-12-23 20:31:39 +08:00
struct worker * worker = kthread_data ( task ) ;
2013-01-17 17:16:24 -08:00
struct worker_pool * pool ;
2010-06-29 10:07:14 +02:00
2013-01-17 17:16:24 -08:00
/*
* Rescuers , which may not have all the fields set up like normal
* workers , also reach here , let ' s not access anything before
* checking NOT_RUNNING .
*/
workqueue: It is likely that WORKER_NOT_RUNNING is true
Running the annotate branch profiler on three boxes, including my
main box that runs firefox, evolution, xchat, and is part of the distcc farm,
showed this with the likelys in the workqueue code:
correct incorrect  %  Function             File         Line
------- ---------  -  --------             ----         ----
     96    996253 99  wq_worker_sleeping   workqueue.c   703
     96    996247 99  wq_worker_waking_up  workqueue.c   677
The likely()s in this case were assuming that WORKER_NOT_RUNNING will
most likely be false. But this is not the case. The reason is
(and shown by adding trace_printks and testing it) that most of the time
WORKER_PREP is set.
In worker_thread() we have:
worker_clr_flags(worker, WORKER_PREP);
[ do work stuff ]
worker_set_flags(worker, WORKER_PREP, false);
(that 'false' means not to wake up an idle worker)
The wq_worker_sleeping() is called from schedule when a worker thread
is putting itself to sleep. Which happens most of the time outside
of that [ do work stuff ].
The wq_worker_waking_up is called by the wakeup worker code, which
is also called outside that [ do work stuff ].
Thus, the likely and unlikely used by those two functions are actually
backwards.
Remove the annotation and let gcc figure it out.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-03 23:12:33 -05:00
if ( worker - > flags & WORKER_NOT_RUNNING )
2019-03-13 17:55:48 +01:00
return ;
2010-06-29 10:07:14 +02:00
2013-01-17 17:16:24 -08:00
pool = worker - > pool ;
2020-03-28 00:29:59 +01:00
/* Return if preempted before wq_worker_running() was reached */
if ( worker - > sleeping )
2019-03-13 17:55:48 +01:00
return ;
worker - > sleeping = 1 ;
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
2010-06-29 10:07:14 +02:00
2021-12-01 16:19:45 +01:00
/*
* Recheck in case unbind_workers ( ) preempted us . We don ' t
* want to decrement nr_running after the worker is unbound
* and nr_running has been reset .
*/
if ( worker - > flags & WORKER_NOT_RUNNING ) {
raw_spin_unlock_irq ( & pool - > lock ) ;
return ;
}
2021-12-23 20:31:40 +08:00
pool - > nr_running - - ;
if ( need_more_worker ( pool ) )
2021-12-23 20:31:39 +08:00
wake_up_worker ( pool ) ;
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-06-29 10:07:14 +02:00
}
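The nr_running bookkeeping done by wq_worker_sleeping() and wq_worker_running() can be modelled as a small state machine. The sketch below is a single-threaded user-space model with hypothetical names (`acct_pool`, `acct_worker`); it leaves out pool->lock and preemption control and counts wakeups instead of waking a real task.

```c
#include <stdbool.h>

/* Hypothetical model of the pool fields touched by the two hooks. */
struct acct_pool {
	int nr_running;   /* workers currently counted as running */
	int worklist_len; /* queued work items */
	int wakeups;      /* how often an idle worker would have been woken */
};

struct acct_worker {
	struct acct_pool *pool;
	bool not_running; /* stand-in for the WORKER_NOT_RUNNING flags */
	bool sleeping;    /* set between the sleeping and running hooks */
};

/* Called when a busy worker is about to block. */
static void acct_worker_sleeping(struct acct_worker *w)
{
	if (w->not_running)
		return; /* such workers never contributed to nr_running */
	if (w->sleeping)
		return; /* already accounted; do not decrement twice */
	w->sleeping = true;

	w->pool->nr_running--;
	/* Same condition as need_more_worker(): work queued, nobody running. */
	if (w->pool->worklist_len > 0 && w->pool->nr_running == 0)
		w->pool->wakeups++;
}

/* Called when the worker is running again. */
static void acct_worker_running(struct acct_worker *w)
{
	if (!w->sleeping)
		return; /* nothing to undo */
	if (!w->not_running)
		w->pool->nr_running++;
	w->sleeping = false;
}
```

The `sleeping` flag is what keeps the pair idempotent: calling either hook twice in a row changes nr_running only once.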
psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.
Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again. This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)
Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep. To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.
What if the worker is also executing other items before or after?
Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.
If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation. If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 14:20:42 -08:00
/**
* wq_worker_last_func - retrieve worker ' s last work function
2019-03-19 10:45:09 -07:00
* @ task : Task to retrieve last work function of .
2019-02-01 14:20:42 -08:00
*
* Determine the last function a worker executed . This is called from
* the scheduler to get a worker ' s last known identity .
*
* CONTEXT :
2020-05-27 21:46:33 +02:00
* raw_spin_lock_irq ( rq - > lock )
2019-02-01 14:20:42 -08:00
 *
 * This function is called during schedule() when a kworker is going
 * to sleep. It's used by psi to identify aggregation workers during
 * dequeuing, to allow periodic aggregation to shut-off when that
 * worker is the last task in the system or cgroup to go to sleep.
 *
 * As this function doesn't involve any workqueue-related locking, it
 * only returns stable values when called from inside the scheduler's
 * queuing and dequeuing paths, when @task, which must be a kworker,
 * is guaranteed to not be processing any works.
 *
 * Return:
 * The last work function @task executed as a worker, NULL if it
 * hasn't executed any work yet.
 */
work_func_t wq_worker_last_func(struct task_struct *task)
{
	struct worker *worker = kthread_data(task);

	return worker->last_func;
}

/**
 * worker_set_flags - set worker flags and adjust nr_running accordingly
 * @worker: self
 * @flags: flags to set
 *
 * Set @flags in @worker->flags and adjust nr_running accordingly.
 *
 * CONTEXT:
 * raw_spin_lock_irq(pool->lock)
 */
static inline void worker_set_flags(struct worker *worker, unsigned int flags)
{
	struct worker_pool *pool = worker->pool;

	WARN_ON_ONCE(worker->task != current);

	/* If transitioning into NOT_RUNNING, adjust nr_running. */
	if ((flags & WORKER_NOT_RUNNING) &&
	    !(worker->flags & WORKER_NOT_RUNNING)) {
		pool->nr_running--;
	}

	worker->flags |= flags;
}

/**
 * worker_clr_flags - clear worker flags and adjust nr_running accordingly
 * @worker: self
 * @flags: flags to clear
 *
 * Clear @flags in @worker->flags and adjust nr_running accordingly.
 *
 * CONTEXT:
 * raw_spin_lock_irq(pool->lock)
 */
static inline void worker_clr_flags(struct worker *worker, unsigned int flags)
{
	struct worker_pool *pool = worker->pool;
	unsigned int oflags = worker->flags;

	WARN_ON_ONCE(worker->task != current);

	worker->flags &= ~flags;

	/*
	 * If transitioning out of NOT_RUNNING, increment nr_running. Note
	 * that the nested NOT_RUNNING check is not a noop. NOT_RUNNING is a
	 * mask of multiple flags, not a single flag.
	 */
	if ((flags & WORKER_NOT_RUNNING) && (oflags & WORKER_NOT_RUNNING))
		if (!(worker->flags & WORKER_NOT_RUNNING))
			pool->nr_running++;
}
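
/*
 * Illustrative sketch, not part of the original source: a userspace model
 * of the nr_running bookkeeping done by worker_set_flags() and
 * worker_clr_flags(). The model_* names, the two example flag bits and the
 * starting counter value are assumptions made up for this illustration;
 * only the transition logic is taken from the two functions above.

```c
#include <assert.h>

// Hypothetical flag bits for the model; NOT_RUNNING is a mask of both.
enum {
	MODEL_PREP		= 1 << 0,
	MODEL_CPU_INTENSIVE	= 1 << 1,
	MODEL_NOT_RUNNING	= MODEL_PREP | MODEL_CPU_INTENSIVE,
};

struct model_worker {
	unsigned int flags;
	int *nr_running;	// counter shared with the (modelled) pool
};

// Mirrors worker_set_flags(): count down only on the first NOT_RUNNING bit.
static void model_set_flags(struct model_worker *w, unsigned int flags)
{
	if ((flags & MODEL_NOT_RUNNING) && !(w->flags & MODEL_NOT_RUNNING))
		(*w->nr_running)--;
	w->flags |= flags;
}

// Mirrors worker_clr_flags(): count up only when the last NOT_RUNNING bit goes.
static void model_clr_flags(struct model_worker *w, unsigned int flags)
{
	unsigned int oflags = w->flags;

	w->flags &= ~flags;
	if ((flags & MODEL_NOT_RUNNING) && (oflags & MODEL_NOT_RUNNING))
		if (!(w->flags & MODEL_NOT_RUNNING))
			(*w->nr_running)++;
}

// Walks one worker through nested NOT_RUNNING flags; returns the final count.
static int model_nested_demo(void)
{
	int nr_running = 1;
	struct model_worker w = { .flags = 0, .nr_running = &nr_running };

	model_set_flags(&w, MODEL_PREP);		// 1 -> 0
	model_set_flags(&w, MODEL_CPU_INTENSIVE);	// stays 0
	model_clr_flags(&w, MODEL_PREP);		// stays 0, one bit remains
	assert(nr_running == 0);
	model_clr_flags(&w, MODEL_CPU_INTENSIVE);	// 0 -> 1
	return nr_running;
}
```

 * In this model the counter changes only on the first set and the last
 * clear of a NOT_RUNNING bit, which is why the inner check in
 * worker_clr_flags() is needed.
 */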

/**
 * find_worker_executing_work - find worker which is executing a work
 * @pool: pool of interest
 * @work: work to find worker for
 *
 * Find a worker which is executing @work on @pool by searching
 * @pool->busy_hash which is keyed by the address of @work. For a worker
 * to match, its current execution should match the address of @work and
 * its work function. This is to avoid unwanted dependency between
 * unrelated work executions through a work item being recycled while still
 * being executed.
 *
 * This is a bit tricky. A work item may be freed once its execution
 * starts and nothing prevents the freed area from being recycled for
 * another work item. If the same work item address ends up being reused
 * before the original execution finishes, workqueue will identify the
 * recycled work item as currently executing and make it wait until the
 * current execution finishes, introducing an unwanted dependency.
 *
 * This function checks the work item address and work function to avoid
 * false positives. Note that this isn't complete as one may construct a
 * work function which can introduce dependency onto itself through a
 * recycled work item. Well, if somebody wants to shoot oneself in the
 * foot that badly, there's only so much we can do, and if such deadlock
 * actually occurs, it should be easy to locate the culprit work function.
 *
 * CONTEXT:
 * raw_spin_lock_irq(pool->lock).
 *
 * Return:
 * Pointer to worker which is executing @work if found, %NULL
 * otherwise.
 */
static struct worker *find_worker_executing_work(struct worker_pool *pool,
						 struct work_struct *work)
{
	struct worker *worker;

	hash_for_each_possible(pool->busy_hash, worker, hentry,
			       (unsigned long)work)
		if (worker->current_work == work &&
		    worker->current_func == work->func)
			return worker;

	return NULL;
}
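
/*
 * Illustrative sketch, not part of the original source: why the lookup
 * above compares both the work item address and the work function. The
 * model_* types and functions are hypothetical, and a linear scan stands
 * in for the busy_hash lookup; only the two-part comparison mirrors
 * find_worker_executing_work().

```c
#include <assert.h>
#include <stddef.h>

typedef void (*model_func_t)(void *);

struct model_work {
	model_func_t func;
};

struct model_worker {
	struct model_work *current_work;
	model_func_t current_func;	// snapshot taken when execution started
};

static void model_fn_a(void *arg) { (void)arg; }
static void model_fn_b(void *arg) { (void)arg; }

// Linear stand-in for the hash lookup: match address AND function.
static struct model_worker *model_find(struct model_worker *workers, size_t n,
				       struct model_work *work)
{
	for (size_t i = 0; i < n; i++)
		if (workers[i].current_work == work &&
		    workers[i].current_func == work->func)
			return &workers[i];
	return NULL;
}

// Returns 1 if a recycled item (same address, new function) is not matched.
static int model_recycle_demo(void)
{
	struct model_work item = { .func = model_fn_a };
	struct model_worker busy = {
		.current_work = &item,
		.current_func = model_fn_a,
	};

	if (model_find(&busy, 1, &item) != &busy)
		return 0;		// the original item must match

	item.func = model_fn_b;		// same address reused for another work
	return model_find(&busy, 1, &item) == NULL;
}
```

 * As the comment on find_worker_executing_work() notes, the check is not
 * complete: in this model, a recycled item that reuses model_fn_a would
 * still match.
 */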

/**
 * move_linked_works - move linked works to a list
 * @work: start of series of works to be scheduled
 * @head: target list to append @work to
 * @nextp: out parameter for nested worklist walking
 *
 * Schedule linked works starting from @work to @head. Work series to
 * be scheduled starts at @work and includes any consecutive work with
 * WORK_STRUCT_LINKED set in its predecessor.
 *
 * If @nextp is not NULL, it's updated to point to the next work of
 * the last scheduled work. This allows move_linked_works() to be
 * nested inside outer list_for_each_entry_safe().
 *
 * CONTEXT:
 * raw_spin_lock_irq(pool->lock).
 */
static void move_linked_works(struct work_struct *work, struct list_head *head,
			      struct work_struct **nextp)
{
	struct work_struct *n;

	/*
	 * Linked worklist will always end before the end of the list,
	 * use NULL for list head.
	 */
	list_for_each_entry_safe_from(work, n, NULL, entry) {
		list_move_tail(&work->entry, head);
		if (!(*work_data_bits(work) & WORK_STRUCT_LINKED))
			break;
	}

	/*
	 * If we're already inside safe list traversal and have moved
	 * multiple works to the scheduled queue, the next position
	 * needs to be updated.
	 */
	if (nextp)
		*nextp = n;
}

/**
 * get_pwq - get an extra reference on the specified pool_workqueue
 * @pwq: pool_workqueue to get
 *
 * Obtain an extra reference on @pwq. The caller should guarantee that
 * @pwq has positive refcnt and be holding the matching pool->lock.
 */
static void get_pwq(struct pool_workqueue *pwq)
{
	lockdep_assert_held(&pwq->pool->lock);
	WARN_ON_ONCE(pwq->refcnt <= 0);
	pwq->refcnt++;
}

/**
 * put_pwq - put a pool_workqueue reference
 * @pwq: pool_workqueue to put
 *
 * Drop a reference of @pwq. If its refcnt reaches zero, schedule its
 * destruction. The caller should be holding the matching pool->lock.
 */
static void put_pwq(struct pool_workqueue *pwq)
{
	lockdep_assert_held(&pwq->pool->lock);
	if (likely(--pwq->refcnt))
		return;
	if (WARN_ON_ONCE(!(pwq->wq->flags & WQ_UNBOUND)))
		return;
	/*
	 * @pwq can't be released under pool->lock, bounce to
	 * pwq_unbound_release_workfn(). This never recurses on the same
	 * pool->lock as this path is taken only for unbound workqueues and
	 * the release work item is scheduled on a per-cpu workqueue. To
	 * avoid lockdep warning, unbound pool->locks are given lockdep
	 * subclass of 1 in get_unbound_pool().
	 */
	schedule_work(&pwq->unbound_release_work);
}

/**
 * put_pwq_unlocked - put_pwq() with surrounding pool lock/unlock
 * @pwq: pool_workqueue to put (can be %NULL)
 *
 * put_pwq() with locking. This function also allows %NULL @pwq.
 */
static void put_pwq_unlocked(struct pool_workqueue *pwq)
{
	if (pwq) {
		/*
		 * As both pwqs and pools are RCU protected, the
		 * following lock operations are safe.
		 */
		raw_spin_lock_irq(&pwq->pool->lock);
		put_pwq(pwq);
		raw_spin_unlock_irq(&pwq->pool->lock);
	}
}

static void pwq_activate_inactive_work(struct work_struct *work)
{
	struct pool_workqueue *pwq = get_work_pwq(work);

	trace_workqueue_activate_work(work);
	if (list_empty(&pwq->pool->worklist))
		pwq->pool->watchdog_ts = jiffies;
	move_linked_works(work, &pwq->pool->worklist, NULL);
	__clear_bit(WORK_STRUCT_INACTIVE_BIT, work_data_bits(work));
	pwq->nr_active++;
}

static void pwq_activate_first_inactive(struct pool_workqueue *pwq)
{
	struct work_struct *work = list_first_entry(&pwq->inactive_works,
						    struct work_struct, entry);

	pwq_activate_inactive_work(work);
}

/**
 * pwq_dec_nr_in_flight - decrement pwq's nr_in_flight
 * @pwq: pwq of interest
 * @work_data: work_data of work which left the queue
 *
 * A work either has completed or is removed from pending queue,
 * decrement nr_in_flight of its pwq and handle workqueue flushing.
 *
 * CONTEXT:
 * raw_spin_lock_irq(pool->lock).
 */
static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, unsigned long work_data)
{
	int color = get_work_color(work_data);

	if (!(work_data & WORK_STRUCT_INACTIVE)) {
		pwq->nr_active--;
		if (!list_empty(&pwq->inactive_works)) {
			/* one down, submit an inactive one */
			if (pwq->nr_active < pwq->max_active)
				pwq_activate_first_inactive(pwq);
		}
	}

	pwq->nr_in_flight[color]--;

	/* is flush in progress and are we at the flushing tip? */
	if (likely(pwq->flush_color != color))
		goto out_put;

	/* are there still in-flight works? */
	if (pwq->nr_in_flight[color])
		goto out_put;

	/* this pwq is done, clear flush_color */
	pwq->flush_color = -1;

	/*
	 * If this was the last pwq, wake up the first flusher. It
	 * will handle the rest.
	 */
	if (atomic_dec_and_test(&pwq->wq->nr_pwqs_to_flush))
		complete(&pwq->wq->first_flusher->done);
out_put:
	put_pwq(pwq);
}
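
/*
 * Illustrative sketch, not part of the original source: a single-pwq
 * userspace model of the accounting in pwq_dec_nr_in_flight(). All
 * model_* names are hypothetical. The model assumes one pwq, replaces the
 * inactive_works list with a counter, and replaces the nr_pwqs_to_flush
 * and first_flusher completion with a flag; reference counting is left
 * out entirely.

```c
#include <assert.h>

#define MODEL_COLORS 2

struct model_pwq {
	int nr_active;
	int max_active;
	int nr_inactive;		// stand-in for the inactive_works list
	int nr_in_flight[MODEL_COLORS];
	int flush_color;		// -1 when no flush is pending
	int flush_done;			// stand-in for waking the flusher
};

// Models one work item of the given color leaving the queue.
static void model_dec_nr_in_flight(struct model_pwq *pwq, int color,
				   int was_inactive)
{
	if (!was_inactive) {
		pwq->nr_active--;
		// one down, activate an inactive one if there is room
		if (pwq->nr_inactive > 0 && pwq->nr_active < pwq->max_active) {
			pwq->nr_inactive--;
			pwq->nr_active++;
		}
	}

	pwq->nr_in_flight[color]--;

	// only the last in-flight item of the flush color ends the flush
	if (pwq->flush_color != color || pwq->nr_in_flight[color])
		return;

	pwq->flush_color = -1;
	pwq->flush_done = 1;
}

// One active and one inactive item of color 0 while color 0 is flushed.
static int model_flush_demo(void)
{
	struct model_pwq pwq = {
		.nr_active = 1, .max_active = 1, .nr_inactive = 1,
		.nr_in_flight = { 2, 0 }, .flush_color = 0, .flush_done = 0,
	};

	model_dec_nr_in_flight(&pwq, 0, 0);	// first item completes
	assert(pwq.nr_active == 1 && pwq.nr_inactive == 0 && !pwq.flush_done);

	model_dec_nr_in_flight(&pwq, 0, 0);	// second item completes
	return pwq.flush_done && pwq.nr_active == 0 && pwq.flush_color == -1;
}
```

 * In this model the first completion frees a max_active slot and
 * activates the waiting item, and only the second completion drains the
 * flush color and signals the flusher.
 */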

/**
 * try_to_grab_pending - steal work item from worklist and disable irq
 * @work: work item to steal
 * @is_dwork: @work is a delayed_work
 * @flags: place to store irq state
 *
 * Try to grab PENDING bit of @work. This function can handle @work in any
 * stable state - idle, on timer or on worklist.
 *
 * Return:
 *
 *  ========	================================================================
 *  1		if @work was pending and we successfully stole PENDING
 *  0		if @work was idle and we claimed PENDING
 *  -EAGAIN	if PENDING couldn't be grabbed at the moment, safe to busy-retry
 *  -ENOENT	if someone else is canceling @work, this state may persist
 *		for arbitrarily long
 *  ========	================================================================
 *
 * Note:
 * On >= 0 return, the caller owns @work's PENDING bit. To avoid getting
 * interrupted while holding PENDING and @work off queue, irq must be
 * disabled on entry. This, combined with delayed_work->timer being
 * irqsafe, ensures that we return -EAGAIN for finite short period of time.
 *
 * On successful return, >= 0, irq is disabled and the caller is
 * responsible for releasing it using local_irq_restore(*@flags).
 *
 * This function is safe to call from any context including IRQ handler.
 */
static int try_to_grab_pending(struct work_struct *work, bool is_dwork,
			       unsigned long *flags)
{
	struct worker_pool *pool;
	struct pool_workqueue *pwq;

	local_irq_save(*flags);

	/* try to steal the timer if it exists */
	if (is_dwork) {
		struct delayed_work *dwork = to_delayed_work(work);

		/*
		 * dwork->timer is irqsafe. If del_timer() fails, it's
		 * guaranteed that the timer is not queued anywhere and not
		 * running on the local CPU.
		 */
		if (likely(del_timer(&dwork->timer)))
			return 1;
	}

	/* try to claim PENDING the normal way */
	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
		return 0;

	rcu_read_lock();
	/*
	 * The queueing is in progress, or it is already queued. Try to
	 * steal it from ->worklist without clearing WORK_STRUCT_PENDING.
	 */
	pool = get_work_pool(work);
	if (!pool)
		goto fail;

	raw_spin_lock(&pool->lock);
	/*
	 * work->data is guaranteed to point to pwq only while the work
	 * item is queued on pwq->wq, and both updating work->data to point
	 * to pwq on queueing and to pool on dequeueing are done under
	 * pwq->pool->lock. This in turn guarantees that, if work->data
	 * points to pwq which is associated with a locked pool, the work
workqueue: simplify is-work-item-queued-here test
Currently, determining whether a work item is queued on a locked pool
involves somewhat convoluted memory barrier dancing. It goes like the
following.
* When a work item is queued on a pool, work->data is updated before
work->entry is linked to the pending list with a wmb() inbetween.
* When trying to determine whether a work item is currently queued on
a pool pointed to by work->data, it locks the pool and looks at
work->entry. If work->entry is linked, we then do rmb() and then
check whether work->data points to the current pool.
This works because, work->data can only point to a pool if it
currently is or were on the pool and,
* If it currently is on the pool, the tests would obviously succeed.
* It it left the pool, its work->entry was cleared under pool->lock,
so if we're seeing non-empty work->entry, it has to be from the work
item being linked on another pool. Because work->data is updated
before work->entry is linked with wmb() inbetween, work->data update
from another pool is guaranteed to be visible if we do rmb() after
seeing non-empty work->entry. So, we either see empty work->entry
or we see updated work->data pointin to another pool.
While this works, it's convoluted, to put it mildly. With recent
updates, it's now guaranteed that work->data points to cwq only while
the work item is queued and that updating work->data to point to cwq
or back to pool is done under pool->lock, so we can simply test
whether work->data points to cwq which is associated with the
currently locked pool instead of the convoluted memory barrier
dancing.
This patch replaces the memory barrier based "are you still here,
really?" test with much simpler "does work->data point to me?" test -
if work->data points to a cwq which is associated with the currently
locked pool, the work item is guaranteed to be queued on the pool as
work->data can start and stop pointing to such cwq only under
pool->lock and the start and stop coincide with queue and dequeue.
tj: Rewrote the comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
* item is currently queued on that pool .
*/
2013-02-13 19:29:12 -08:00
pwq = get_work_pwq ( work ) ;
if ( pwq & & pwq - > pool = = pool ) {
2013-02-06 18:04:53 -08:00
debug_work_deactivate ( work ) ;
/*
2021-08-17 09:32:37 +08:00
* A cancelable inactive work item must be in the
* pwq - > inactive_works since a queued barrier can ' t be
* canceled ( see the comments in insert_wq_barrier ( ) ) .
*
2021-08-17 09:32:34 +08:00
* An inactive work item cannot be grabbed directly because
2021-08-17 09:32:38 +08:00
* it might have linked barrier work items which , if left
2021-08-17 09:32:34 +08:00
* on the inactive_works list , will confuse pwq - > nr_active
2013-02-06 18:04:53 -08:00
* management later on and cause stall . Make sure the work
* item is activated before grabbing .
*/
2021-08-17 09:32:34 +08:00
if ( * work_data_bits ( work ) & WORK_STRUCT_INACTIVE )
pwq_activate_inactive_work ( work ) ;
2013-02-06 18:04:53 -08:00
list_del_init ( & work - > entry ) ;
2021-08-17 09:32:35 +08:00
pwq_dec_nr_in_flight ( pwq , * work_data_bits ( work ) ) ;
2013-02-06 18:04:53 -08:00
2013-02-13 19:29:12 -08:00
/* work->data points to pwq iff queued, point to pool */
2013-02-06 18:04:53 -08:00
set_work_pool_and_keep_pending ( work , pool - > id ) ;
2020-05-27 21:46:33 +02:00
raw_spin_unlock ( & pool - > lock ) ;
2019-03-13 17:55:47 +01:00
rcu_read_unlock ( ) ;
2013-02-06 18:04:53 -08:00
return 1 ;
2012-08-03 10:30:46 -07:00
}
2020-05-27 21:46:33 +02:00
raw_spin_unlock ( & pool - > lock ) ;
2012-08-03 10:30:46 -07:00
fail :
2019-03-13 17:55:47 +01:00
rcu_read_unlock ( ) ;
2012-08-03 10:30:46 -07:00
local_irq_restore ( * flags ) ;
if ( work_is_canceling ( work ) )
return - ENOENT ;
cpu_relax ( ) ;
2012-08-03 10:30:46 -07:00
return - EAGAIN ;
2012-08-03 10:30:46 -07:00
}
2010-06-29 10:07:10 +02:00
/**
2013-01-24 11:01:34 -08:00
* insert_work - insert a work into a pool
2013-02-13 19:29:12 -08:00
* @ pwq : pwq @ work belongs to
2010-06-29 10:07:10 +02:00
* @ work : work to insert
* @ head : insertion point
* @ extra_flags : extra WORK_STRUCT_ * flags to set
*
2013-02-13 19:29:12 -08:00
* Insert @ work which belongs to @ pwq after @ head . @ extra_flags is or ' d to
2013-01-24 11:01:34 -08:00
* work_struct flags .
2010-06-29 10:07:10 +02:00
*
* CONTEXT :
2020-05-27 21:46:33 +02:00
* raw_spin_lock_irq ( pool - > lock ) .
2010-06-29 10:07:10 +02:00
*/
2013-02-13 19:29:12 -08:00
static void insert_work ( struct pool_workqueue * pwq , struct work_struct * work ,
struct list_head * head , unsigned int extra_flags )
implement flush_work()
A basic problem with flush_scheduled_work() is that it blocks behind _all_
presently-queued works, rather than just the work which the caller wants to
flush. If the caller holds some lock, and if one of the queued works happens
to want that lock as well then accidental deadlocks can occur.
One example of this is the phy layer: it wants to flush work while holding
rtnl_lock(). But if a linkwatch event happens to be queued, the phy code will
deadlock because the linkwatch callback function takes rtnl_lock.
So we implement a new function which will flush a *single* work - just the one
which the caller wants to free up. Thus we avoid the accidental deadlocks
which can arise from unrelated subsystems' callbacks taking shared locks.
flush_work() non-blockingly dequeues the work_struct which we want to kill,
then it waits for its handler to complete on all CPUs.
Add ->current_work to the "struct cpu_workqueue_struct", it points to
currently running "struct work_struct". When flush_work(work) detects
->current_work == work, it inserts a barrier at the _head_ of ->worklist
(and thus right _after_ that work) and waits for completion. This means
that the next work fired on that CPU will be this barrier, or another
barrier queued by concurrent flush_work(), so the caller of flush_work()
will be woken before any "regular" work has a chance to run.
When wait_on_work() unlocks workqueue_mutex (or whatever we choose to protect
against CPU hotplug), CPU may go away. But in that case take_over_work() will
move a barrier we queued to another CPU, it will be fired sometime, and
wait_on_work() will be woken.
Actually, we are doing cleanup_workqueue_thread()->kthread_stop() before
take_over_work(), so cwq->thread should complete its ->worklist (and thus
the barrier), because currently we don't check kthread_should_stop() in
run_workqueue(). But even if we did, everything should be ok.
[akpm@osdl.org: cleanup]
[akpm@osdl.org: add flush_work_keventd() wrapper]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 02:33:52 -07:00
{
2013-02-13 19:29:12 -08:00
struct worker_pool * pool = pwq - > pool ;
2010-06-29 10:07:14 +02:00
2020-12-14 19:09:09 -08:00
/* record the work call stack in order to print it in KASAN reports */
workqueue, kasan: avoid alloc_pages() when recording stack
Shuah Khan reported:
| When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
| kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
| it tries to allocate memory attempting to acquire spinlock in page
| allocation code while holding workqueue pool raw_spinlock.
|
| There are several instances of this problem when block layer tries
| to __queue_work(). Call trace from one of these instances is below:
|
| kblockd_mod_delayed_work_on()
| mod_delayed_work_on()
| __queue_delayed_work()
| __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
| insert_work()
| kasan_record_aux_stack()
| kasan_save_stack()
| stack_depot_save()
| alloc_pages()
| __alloc_pages()
| get_page_from_freelist()
| rm_queue()
| rm_queue_pcplist()
| local_lock_irqsave(&pagesets.lock, flags);
| [ BUG: Invalid wait context triggered ]
The default kasan_record_aux_stack() calls stack_depot_save() with
GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...).
In general, however, it is not even possible to use either GFP_ATOMIC
or GFP_NOWAIT in certain non-preemptive contexts, including
raw_spin_locks (see gfp.h and commit ab00db216c9c7).
Fix it by instructing stackdepot to not expand stack storage via
alloc_pages() in case it runs out by using
kasan_record_aux_stack_noalloc().
While there is an increased risk of failing to insert the stack trace,
this is typically unlikely, especially if the same insertion had already
succeeded previously (stack depot hit).
For frequent calls from the same location, it therefore becomes
extremely unlikely that kasan_record_aux_stack_noalloc() fails.
Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
Link: https://lkml.kernel.org/r/20210913112609.2651084-7-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-05 13:35:50 -07:00
kasan_record_aux_stack_noalloc ( work ) ;
2020-12-14 19:09:09 -08:00
2010-06-29 10:07:10 +02:00
/* we own @work, set data and link */
2013-02-13 19:29:12 -08:00
set_work_pwq ( work , pwq , extra_flags ) ;
2008-07-25 01:47:47 -07:00
list_add_tail ( & work - > entry , head ) ;
2013-03-12 11:30:04 -07:00
get_pwq ( pwq ) ;
2010-06-29 10:07:14 +02:00
2012-07-12 14:46:37 -07:00
if ( __need_more_worker ( pool ) )
wake_up_worker ( pool ) ;
2007-05-09 02:33:52 -07:00
}
2010-12-20 19:32:04 +01:00
/*
* Test whether @ work is being queued from another work executing on the
2013-02-13 19:29:10 -08:00
* same workqueue .
2010-12-20 19:32:04 +01:00
*/
static bool is_chained_work ( struct workqueue_struct * wq )
{
2013-02-13 19:29:10 -08:00
struct worker * worker ;
worker = current_wq_worker ( ) ;
/*
2019-03-01 13:57:25 -08:00
* Return % true iff I ' m a worker executing a work item on @ wq . If
2013-02-13 19:29:10 -08:00
* I ' m @ worker , it ' s safe to dereference it without locking .
*/
2013-02-13 19:29:12 -08:00
return worker & & worker - > current_pwq - > wq = = wq ;
2010-12-20 19:32:04 +01:00
}
2016-02-09 17:59:38 -05:00
/*
* When queueing an unbound work item to a wq , prefer local CPU if allowed
* by wq_unbound_cpumask . Otherwise , round robin among the allowed ones to
* avoid perturbing sensitive tasks .
*/
static int wq_select_unbound_cpu ( int cpu )
{
2016-02-09 17:59:38 -05:00
static bool printed_dbg_warning ;
2016-02-09 17:59:38 -05:00
int new_cpu ;
2016-02-09 17:59:38 -05:00
if ( likely ( ! wq_debug_force_rr_cpu ) ) {
if ( cpumask_test_cpu ( cpu , wq_unbound_cpumask ) )
return cpu ;
} else if ( ! printed_dbg_warning ) {
pr_warn ( " workqueue: round-robin CPU selection forced, expect performance impact \n " ) ;
printed_dbg_warning = true ;
}
2016-02-09 17:59:38 -05:00
if ( cpumask_empty ( wq_unbound_cpumask ) )
return cpu ;
new_cpu = __this_cpu_read ( wq_rr_cpu_last ) ;
new_cpu = cpumask_next_and ( new_cpu , wq_unbound_cpumask , cpu_online_mask ) ;
if ( unlikely ( new_cpu > = nr_cpu_ids ) ) {
new_cpu = cpumask_first_and ( wq_unbound_cpumask , cpu_online_mask ) ;
if ( unlikely ( new_cpu > = nr_cpu_ids ) )
return cpu ;
}
__this_cpu_write ( wq_rr_cpu_last , new_cpu ) ;
return new_cpu ;
}
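The selection policy above can be modelled in userspace as follows. This is a minimal sketch, not the kernel implementation: `NR_CPUS`, `next_cpu_in_mask()`, `select_unbound_cpu()` and the `uint32_t` bitmasks are hypothetical stand-ins for `nr_cpu_ids`, `cpumask_next_and()`, `wq_select_unbound_cpu()` and the kernel cpumasks, and the `wq_debug_force_rr_cpu` path is left out.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8	/* hypothetical CPU count for this sketch */

/* Lowest set bit in @mask strictly above @prev, or NR_CPUS if none. */
static int next_cpu_in_mask(int prev, uint32_t mask)
{
	for (int cpu = prev + 1; cpu < NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return NR_CPUS;
}

/*
 * Keep the local CPU if it is allowed. Otherwise advance a round-robin
 * cursor through the allowed-and-online CPUs, wrapping to the first
 * one. Fall back to @cpu when no CPU is eligible.
 */
static int select_unbound_cpu(int cpu, uint32_t allowed, uint32_t online,
			      int *rr_last)
{
	uint32_t eligible = allowed & online;
	int new_cpu;

	assert(cpu >= 0 && cpu < NR_CPUS);

	if (allowed & (1u << cpu))
		return cpu;
	if (allowed == 0)
		return cpu;

	new_cpu = next_cpu_in_mask(*rr_last, eligible);
	if (new_cpu >= NR_CPUS) {
		/* ran off the end of the mask, wrap around */
		new_cpu = next_cpu_in_mask(-1, eligible);
		if (new_cpu >= NR_CPUS)
			return cpu;
	}
	*rr_last = new_cpu;
	return new_cpu;
}
```

With CPUs 2 and 3 allowed and a caller on CPU 0, successive calls in this model return 2, 3, 2, and so on. A caller already on CPU 2 gets CPU 2 back and the cursor is left untouched.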
2013-03-12 11:29:59 -07:00
static void __queue_work ( int cpu , struct workqueue_struct * wq ,
2005-04-16 15:20:36 -07:00
struct work_struct * work )
{
2013-02-13 19:29:12 -08:00
struct pool_workqueue * pwq ;
2013-03-12 11:30:04 -07:00
struct worker_pool * last_pool ;
2010-06-29 10:07:12 +02:00
struct list_head * worklist ;
2010-08-25 10:33:56 +02:00
unsigned int work_flags ;
2012-08-15 23:25:37 +09:00
unsigned int req_cpu = cpu ;
2012-08-03 10:30:45 -07:00
/*
* While a work item is PENDING & & off queue , a task trying to
* steal the PENDING will busy - loop waiting for it to either get
* queued or lose PENDING . Grabbing PENDING and queueing should
* happen with IRQ disabled .
*/
2017-11-06 16:01:19 +01:00
lockdep_assert_irqs_disabled ( ) ;
2005-04-16 15:20:36 -07:00
2010-06-29 10:07:12 +02:00
2013-09-09 13:13:58 +08:00
/* if draining, only works from the same workqueue are allowed */
2013-03-12 11:30:04 -07:00
if ( unlikely ( wq - > flags & __WQ_DRAINING ) & &
2010-12-20 19:32:04 +01:00
WARN_ON_ONCE ( ! is_chained_work ( wq ) ) )
2010-08-24 14:22:47 +02:00
return ;
2019-03-13 17:55:47 +01:00
rcu_read_lock ( ) ;
2013-03-12 11:30:04 -07:00
retry :
2013-03-12 11:30:04 -07:00
/* pwq which will be used unless @work is executing elsewhere */
2020-01-24 20:14:45 -05:00
if ( wq - > flags & WQ_UNBOUND ) {
if ( req_cpu = = WORK_CPU_UNBOUND )
cpu = wq_select_unbound_cpu ( raw_smp_processor_id ( ) ) ;
2013-04-01 11:23:35 -07:00
pwq = unbound_pwq_by_node ( wq , cpu_to_node ( cpu ) ) ;
2020-01-24 20:14:45 -05:00
} else {
if ( req_cpu = = WORK_CPU_UNBOUND )
cpu = raw_smp_processor_id ( ) ;
pwq = per_cpu_ptr ( wq - > cpu_pwqs , cpu ) ;
}
workqueue: make all workqueues non-reentrant
By default, each per-cpu part of a bound workqueue operates separately
and a work item may be executing concurrently on different CPUs. The
behavior avoids some cross-cpu traffic but leads to subtle weirdities
and not-so-subtle contortions in the API.
* There's no sane usefulness in allowing a single work item to be
executed concurrently on multiple CPUs. People just get the
behavior unintentionally and get surprised after learning about it.
Most either explicitly synchronize or use non-reentrant/ordered
workqueue but this is error-prone.
* flush_work() can't wait for multiple instances of the same work item
on different CPUs. If a work item is executing on cpu0 and then
queued on cpu1, flush_work() can only wait for the one on cpu1.
Unfortunately, work items can easily cross CPU boundaries
unintentionally when the queueing thread gets migrated. This means
that if multiple queuers compete, flush_work() can't even guarantee
that the instance queued right before it is finished before
returning.
* flush_work_sync() was added to work around some of the deficiencies
of flush_work(). In addition to the usual flushing, it ensures that
all currently executing instances are finished before returning.
This operation is expensive as it has to walk all CPUs and at the
same time fails to address competing queuer case.
Incorrectly using flush_work() when flush_work_sync() is necessary
is an easy error to make and can lead to bugs which are difficult to
reproduce.
* Similar problems exist for flush_delayed_work[_sync]().
Other than the cross-cpu access concern, there's no benefit in
allowing parallel execution and it's plain silly to have this level of
contortion for workqueue which is widely used from core code to
extremely obscure drivers.
This patch makes all workqueues non-reentrant. If a work item is
executing on a different CPU when queueing is requested, it is always
queued to that CPU. This guarantees that any given work item can be
executing on one CPU at maximum and if a work item is queued and
executing, both are on the same CPU.
The only behavior change which may affect workqueue users negatively
is that non-reentrancy overrides the affinity specified by
queue_work_on(). On a reentrant workqueue, the affinity specified by
queue_work_on() is always followed. Now, if the work item is
executing on one of the CPUs, the work item will be queued there
regardless of the requested affinity. I've reviewed all workqueue
users which request explicit affinity, and, fortunately, none seems to
be crazy enough to exploit parallel execution of the same work item.
This adds an additional busy_hash lookup if the work item was
previously queued on a different CPU. This shouldn't be noticeable
under any sane workload. Work item queueing isn't a very
high-frequency operation and they don't jump across CPUs all the time.
In a micro benchmark to exaggerate this difference - measuring the
time it takes for two work items to repeatedly jump between two CPUs a
number (10M) of times with busy_hash table densely populated, the
difference was around 3%.
While the overhead is measureable, it is only visible in pathological
cases and the difference isn't huge. This change brings much needed
sanity to workqueue and makes its behavior consistent with timer. I
think this is the right tradeoff to make.
This enables significant simplification of workqueue API.
Simplification patches will follow.
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-20 14:51:23 -07:00
2013-03-12 11:30:04 -07:00
/*
* If @ work was previously on a different pool , it might still be
* running there , in which case the work needs to be queued on that
* pool to guarantee non - reentrancy .
*/
last_pool = get_work_pool ( work ) ;
if ( last_pool & & last_pool ! = pwq - > pool ) {
struct worker * worker ;
2010-06-29 10:07:13 +02:00
2020-05-27 21:46:33 +02:00
raw_spin_lock ( & last_pool - > lock ) ;
2010-06-29 10:07:13 +02:00
2013-03-12 11:30:04 -07:00
worker = find_worker_executing_work ( last_pool , work ) ;
2010-06-29 10:07:13 +02:00
2013-03-12 11:30:04 -07:00
if ( worker & & worker - > current_pwq - > wq = = wq ) {
pwq = worker - > current_pwq ;
2012-08-03 10:30:45 -07:00
} else {
2013-03-12 11:30:04 -07:00
/* meh... not running there, queue here */
2020-05-27 21:46:33 +02:00
raw_spin_unlock ( & last_pool - > lock ) ;
raw_spin_lock ( & pwq - > pool - > lock ) ;
2012-08-03 10:30:45 -07:00
}
2010-07-02 10:03:51 +02:00
} else {
2020-05-27 21:46:33 +02:00
raw_spin_lock ( & pwq - > pool - > lock ) ;
2010-06-29 10:07:13 +02:00
}
2013-03-12 11:30:04 -07:00
/*
* pwq is determined and locked . For unbound pools , we could have
* raced with pwq release and it could already be dead . If its
* refcnt is zero , repeat pwq selection . Note that pwqs never die
2013-04-01 11:23:35 -07:00
* without another pwq replacing it in the numa_pwq_tbl or while
* work items are executing on it , so the retrying is guaranteed to
2013-03-12 11:30:04 -07:00
* make forward - progress .
*/
if ( unlikely ( ! pwq - > refcnt ) ) {
if ( wq - > flags & WQ_UNBOUND ) {
2020-05-27 21:46:33 +02:00
raw_spin_unlock ( & pwq - > pool - > lock ) ;
2013-03-12 11:30:04 -07:00
cpu_relax ( ) ;
goto retry ;
}
/* oops */
WARN_ONCE ( true , " workqueue: per-cpu pwq for %s on cpu%d has 0 refcnt " ,
wq - > name , cpu ) ;
}
2013-02-13 19:29:12 -08:00
/* pwq determined, queue */
trace_workqueue_queue_work ( req_cpu , pwq , work ) ;
2010-06-29 10:07:13 +02:00
2019-03-13 17:55:47 +01:00
if ( WARN_ON ( ! list_empty ( & work - > entry ) ) )
goto out ;
2010-06-29 10:07:12 +02:00
2013-02-13 19:29:12 -08:00
pwq - > nr_in_flight [ pwq - > work_color ] + + ;
work_flags = work_color_to_flags ( pwq - > work_color ) ;
2010-06-29 10:07:12 +02:00
2013-02-13 19:29:12 -08:00
if ( likely ( pwq - > nr_active < pwq - > max_active ) ) {
2010-10-05 10:49:55 +02:00
trace_workqueue_activate_work ( work ) ;
2013-02-13 19:29:12 -08:00
pwq - > nr_active + + ;
worklist = & pwq - > pool - > worklist ;
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements workqueue lockup
detector. It periodically monitors all worker_pools and,
if any pool failed to make forward progress longer than the threshold
duration, triggers warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 11:28:04 -05:00
if ( list_empty ( worklist ) )
pwq - > pool - > watchdog_ts = jiffies ;
2010-08-25 10:33:56 +02:00
} else {
2021-08-17 09:32:34 +08:00
work_flags | = WORK_STRUCT_INACTIVE ;
worklist = & pwq - > inactive_works ;
2010-08-25 10:33:56 +02:00
}
2010-06-29 10:07:12 +02:00
2021-02-18 11:16:49 +08:00
debug_work_activate ( work ) ;
2013-02-13 19:29:12 -08:00
insert_work ( pwq , work , worklist , work_flags ) ;
2010-06-29 10:07:12 +02:00
2019-03-13 17:55:47 +01:00
out :
2020-05-27 21:46:33 +02:00
raw_spin_unlock ( & pwq - > pool - > lock ) ;
2019-03-13 17:55:47 +01:00
rcu_read_unlock ( ) ;
2005-04-16 15:20:36 -07:00
}
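The max_active accounting at the end of __queue_work() can be illustrated with a small userspace model. This is a sketch under stated assumptions: `struct model_pwq`, `model_queue()` and `model_complete()` are hypothetical names, the lists are reduced to counters, and the promotion of an inactive item on completion is assumed to mirror what pwq_dec_nr_in_flight() does in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct pool_workqueue. */
struct model_pwq {
	int nr_active;		/* items on the pool worklist or running */
	int max_active;		/* concurrency limit for this pwq */
	int nr_inactive;	/* items parked on the inactive list */
};

/* Returns true if the item was activated, false if it was parked. */
static bool model_queue(struct model_pwq *pwq)
{
	assert(pwq->max_active > 0);

	if (pwq->nr_active < pwq->max_active) {
		pwq->nr_active++;
		return true;
	}
	pwq->nr_inactive++;
	return false;
}

/* Retire one active item and promote one inactive item, if any. */
static void model_complete(struct model_pwq *pwq)
{
	assert(pwq->nr_active > 0);

	pwq->nr_active--;
	if (pwq->nr_inactive > 0) {
		pwq->nr_inactive--;
		pwq->nr_active++;
	}
}
```

With `max_active` set to 2, the third item queued in this model is parked rather than activated, and it only becomes active once one of the first two completes.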
2006-07-30 03:03:42 -07:00
/**
2008-07-23 21:28:39 -07:00
* queue_work_on - queue work on specific cpu
* @ cpu : CPU number to execute work on
2006-07-30 03:03:42 -07:00
* @ wq : workqueue to use
* @ work : work to queue
*
2008-07-23 21:28:39 -07:00
* We queue the work to a specific CPU , the caller must ensure it
2021-11-30 17:00:30 -08:00
* can ' t go away . If the caller fails to ensure that the specified
* CPU cannot go away , the work will execute on a randomly chosen CPU .
2013-07-31 14:59:24 -07:00
*
* Return : % false if @ work was already on a queue , % true otherwise .
2005-04-16 15:20:36 -07:00
*/
2012-08-03 10:30:44 -07:00
bool queue_work_on ( int cpu , struct workqueue_struct * wq ,
struct work_struct * work )
2005-04-16 15:20:36 -07:00
{
2012-08-03 10:30:44 -07:00
bool ret = false ;
2012-08-03 10:30:45 -07:00
unsigned long flags ;
2008-07-25 01:47:53 -07:00
2012-08-03 10:30:45 -07:00
local_irq_save ( flags ) ;
2008-07-23 21:28:39 -07:00
2010-06-29 10:07:10 +02:00
if ( ! test_and_set_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( work ) ) ) {
2010-06-29 10:07:10 +02:00
__queue_work ( cpu , wq , work ) ;
2012-08-03 10:30:44 -07:00
ret = true ;
2008-07-23 21:28:39 -07:00
}
2008-07-25 01:47:53 -07:00
2012-08-03 10:30:45 -07:00
local_irq_restore ( flags ) ;
2005-04-16 15:20:36 -07:00
return ret ;
}
2013-05-06 17:44:55 -04:00
EXPORT_SYMBOL ( queue_work_on ) ;
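The return value of queue_work_on() follows from the atomic test-and-set of the PENDING bit: only the caller that flips the bit from clear to set goes on to queue the item. A minimal userspace sketch using C11 atomics is shown here. `struct model_work`, `MODEL_PENDING` and the `model_*` helpers are hypothetical names, and clearing the bit when execution starts is a simplification of what the worker does in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_PENDING 1UL	/* hypothetical stand-in for the PENDING bit */

struct model_work {
	atomic_ulong data;	/* flag word, like work->data */
	int times_queued;	/* how often the item was actually queued */
};

static void model_init_work(struct model_work *work)
{
	assert(work);
	atomic_init(&work->data, 0);
	work->times_queued = 0;
}

/* Returns false if the item was already pending, true otherwise. */
static bool model_queue_work(struct model_work *work)
{
	unsigned long old = atomic_fetch_or(&work->data, MODEL_PENDING);

	if (old & MODEL_PENDING)
		return false;	/* someone else owns the pending item */

	work->times_queued++;
	return true;
}

/* The worker clears PENDING once it starts running the item. */
static void model_start_execution(struct model_work *work)
{
	atomic_fetch_and(&work->data, ~MODEL_PENDING);
}
```

A second call before the item starts executing returns false and leaves the item queued once. After execution has started, the item can be queued again.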
2005-04-16 15:20:36 -07:00
2019-01-22 10:39:26 -08:00
/**
* workqueue_select_cpu_near - Select a CPU based on NUMA node
* @ node : NUMA node ID that we want to select a CPU from
*
* This function will attempt to find a " random " cpu available on a given
* node . If there are no CPUs available on the given node it will return
* WORK_CPU_UNBOUND indicating that we should just schedule to any
* available CPU if we need to schedule this work .
*/
static int workqueue_select_cpu_near ( int node )
{
int cpu ;
/* No point in doing this if NUMA isn't enabled for workqueues */
if ( ! wq_numa_enabled )
return WORK_CPU_UNBOUND ;
/* Delay binding to CPU if node is not valid or online */
if ( node < 0 | | node > = MAX_NUMNODES | | ! node_online ( node ) )
return WORK_CPU_UNBOUND ;
/* Use local node/cpu if we are already there */
cpu = raw_smp_processor_id ( ) ;
if ( node = = cpu_to_node ( cpu ) )
return cpu ;
/* Use "random" otherwise known as "first" online CPU of node */
cpu = cpumask_any_and ( cpumask_of_node ( node ) , cpu_online_mask ) ;
/* If CPU is valid return that, otherwise just defer */
return cpu < nr_cpu_ids ? cpu : WORK_CPU_UNBOUND ;
}
/**
* queue_work_node - queue work on a " random " cpu for a given NUMA node
* @ node : NUMA node that we are targeting the work for
* @ wq : workqueue to use
* @ work : work to queue
*
* We queue the work to a " random " CPU within a given NUMA node . The basic
* idea here is to provide a way to somehow associate work with a given
* NUMA node .
*
* This function will only make a best effort attempt at getting this onto
* the right NUMA node . If no node is requested or the requested node is
* offline then we just fall back to standard queue_work behavior .
*
* Currently the " random " CPU ends up being the first available CPU in the
* intersection of cpu_online_mask and the cpumask of the node , unless we
* are running on the node . In that case we just use the current CPU .
*
* Return : % false if @ work was already on a queue , % true otherwise .
*/
bool queue_work_node ( int node , struct workqueue_struct * wq ,
struct work_struct * work )
{
unsigned long flags ;
bool ret = false ;
/*
* This current implementation is specific to unbound workqueues .
* Specifically we only return the first available CPU for a given
* node instead of cycling through individual CPUs within the node .
*
* If this is used with a per - cpu workqueue then the logic in
* workqueue_select_cpu_near would need to be updated to allow for
* some round robin type logic .
*/
WARN_ON_ONCE ( ! ( wq - > flags & WQ_UNBOUND ) ) ;
local_irq_save ( flags ) ;
if ( ! test_and_set_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( work ) ) ) {
int cpu = workqueue_select_cpu_near ( node ) ;
__queue_work ( cpu , wq , work ) ;
ret = true ;
}
local_irq_restore ( flags ) ;
return ret ;
}
EXPORT_SYMBOL_GPL ( queue_work_node ) ;
2017-10-04 16:27:07 -07:00
void delayed_work_timer_fn ( struct timer_list * t )
2005-04-16 15:20:36 -07:00
{
2017-10-04 16:27:07 -07:00
struct delayed_work * dwork = from_timer ( dwork , t , timer ) ;
2005-04-16 15:20:36 -07:00
2012-08-21 13:18:24 -07:00
/* should have been called from irqsafe timer with irq already off */
2013-02-06 18:04:53 -08:00
__queue_work ( dwork - > cpu , dwork - > wq , & dwork - > work ) ;
2005-04-16 15:20:36 -07:00
}
2013-01-24 16:36:31 +04:00
EXPORT_SYMBOL ( delayed_work_timer_fn ) ;
2005-04-16 15:20:36 -07:00
2012-08-03 10:30:46 -07:00
static void __queue_delayed_work ( int cpu , struct workqueue_struct * wq ,
struct delayed_work * dwork , unsigned long delay )
2005-04-16 15:20:36 -07:00
{
2012-08-03 10:30:46 -07:00
struct timer_list * timer = & dwork - > timer ;
struct work_struct * work = & dwork - > work ;
2017-03-06 15:33:42 -05:00
WARN_ON_ONCE ( ! wq ) ;
2022-09-08 14:54:56 -07:00
WARN_ON_ONCE ( timer - > function ! = delayed_work_timer_fn ) ;
2012-12-04 07:40:39 -08:00
WARN_ON_ONCE ( timer_pending ( timer ) ) ;
WARN_ON_ONCE ( ! list_empty ( & work - > entry ) ) ;
2012-08-03 10:30:46 -07:00
2012-12-01 16:23:42 -08:00
/*
* If @ delay is 0 , queue @ dwork - > work immediately . This is for
* both optimization and correctness . The earliest @ timer can
* expire is on the closest next tick and delayed_work users depend
* on there being no such delay when @ delay is 0.
*/
if ( ! delay ) {
__queue_work ( cpu , wq , & dwork - > work ) ;
return ;
}
2013-02-06 18:04:53 -08:00
dwork - > wq = wq ;
2012-08-08 09:38:42 -07:00
dwork - > cpu = cpu ;
2012-08-03 10:30:46 -07:00
timer - > expires = jiffies + delay ;
2016-02-09 16:11:26 -05:00
if ( unlikely ( cpu ! = WORK_CPU_UNBOUND ) )
add_timer_on ( timer , cpu ) ;
else
add_timer ( timer ) ;
2005-04-16 15:20:36 -07:00
}
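The zero-delay shortcut in __queue_delayed_work() can be sketched in userspace as below. This is an illustrative model only: `struct model_dwork`, `model_queue_delayed()` and `model_tick()` are hypothetical names, time is a plain counter passed in by the caller, and the timer is reduced to an expiry value checked on each tick.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct delayed_work. */
struct model_dwork {
	bool queued;		/* work item is on a worklist */
	bool timer_armed;	/* timer is pending */
	unsigned long expires;	/* tick at which the timer fires */
};

static void model_queue_delayed(struct model_dwork *dwork,
				unsigned long now, unsigned long delay)
{
	assert(!dwork->queued && !dwork->timer_armed);

	/* A zero delay bypasses the timer and queues right away. */
	if (!delay) {
		dwork->queued = true;
		return;
	}

	dwork->expires = now + delay;
	dwork->timer_armed = true;
}

/* Called once per tick: fire the timer when it has expired. */
static void model_tick(struct model_dwork *dwork, unsigned long now)
{
	/* signed difference keeps the comparison wraparound-safe */
	if (dwork->timer_armed && (long)(now - dwork->expires) >= 0) {
		dwork->timer_armed = false;
		dwork->queued = true;
	}
}
```

In this model a delay of 0 at tick 100 queues the item immediately, while a delay of 5 leaves it unqueued until the tick counter reaches 105.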
2006-07-30 03:03:42 -07:00
/**
* queue_delayed_work_on - queue work on specific CPU after delay
* @ cpu : CPU number to execute work on
* @ wq : workqueue to use
2006-12-22 01:06:52 -08:00
* @ dwork : work to queue
2006-07-30 03:03:42 -07:00
* @ delay : number of jiffies to wait before queueing
*
2013-07-31 14:59:24 -07:00
* Return : % false if @ dwork was already on a queue , % true otherwise . If
2012-08-03 10:30:46 -07:00
* @ delay is zero and @ dwork is idle , it will be scheduled for immediate
* execution .
2006-07-30 03:03:42 -07:00
*/
2012-08-03 10:30:44 -07:00
bool queue_delayed_work_on ( int cpu , struct workqueue_struct * wq ,
struct delayed_work * dwork , unsigned long delay )
2006-06-28 13:50:33 -07:00
{
2006-11-22 14:54:01 +00:00
struct work_struct * work = & dwork - > work ;
2012-08-03 10:30:44 -07:00
bool ret = false ;
2012-08-03 10:30:45 -07:00
unsigned long flags ;
2006-06-28 13:50:33 -07:00
2012-08-03 10:30:45 -07:00
/* read the comment in __queue_work() */
local_irq_save ( flags ) ;
2006-06-28 13:50:33 -07:00
2010-06-29 10:07:10 +02:00
if ( ! test_and_set_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( work ) ) ) {
2012-08-03 10:30:46 -07:00
__queue_delayed_work ( cpu , wq , dwork , delay ) ;
2012-08-03 10:30:44 -07:00
ret = true ;
2006-06-28 13:50:33 -07:00
}
2008-05-01 04:35:14 -07:00
2012-08-03 10:30:45 -07:00
local_irq_restore ( flags ) ;
2006-06-28 13:50:33 -07:00
return ret ;
}
2013-05-06 17:44:55 -04:00
EXPORT_SYMBOL ( queue_delayed_work_on ) ;
2010-07-02 10:03:51 +02:00
2012-08-03 10:30:47 -07:00
/**
* mod_delayed_work_on - modify delay of or queue a delayed work on specific CPU
* @ cpu : CPU number to execute work on
* @ wq : workqueue to use
* @ dwork : work to queue
* @ delay : number of jiffies to wait before queueing
*
* If @ dwork is idle , equivalent to queue_delayed_work_on ( ) ; otherwise ,
* modify @ dwork ' s timer so that it expires after @ delay . If @ delay is
* zero , @ dwork is guaranteed to be scheduled immediately regardless of its
* current state .
*
2013-07-31 14:59:24 -07:00
* Return : % false if @ dwork was idle and queued , % true if @ dwork was
2012-08-03 10:30:47 -07:00
* pending and its timer was modified .
*
2012-08-21 13:18:24 -07:00
* This function is safe to call from any context including IRQ handler .
2012-08-03 10:30:47 -07:00
* See try_to_grab_pending ( ) for details .
*/
bool mod_delayed_work_on ( int cpu , struct workqueue_struct * wq ,
struct delayed_work * dwork , unsigned long delay )
{
unsigned long flags ;
int ret ;
2010-07-02 10:03:51 +02:00
2012-08-03 10:30:47 -07:00
do {
ret = try_to_grab_pending ( & dwork - > work , true , & flags ) ;
} while ( unlikely ( ret = = - EAGAIN ) ) ;
2007-05-09 02:34:16 -07:00
2012-08-03 10:30:47 -07:00
if ( likely ( ret > = 0 ) ) {
__queue_delayed_work ( cpu , wq , dwork , delay ) ;
local_irq_restore ( flags ) ;
2006-06-28 13:50:33 -07:00
}
2012-08-03 10:30:47 -07:00
/* -ENOENT from try_to_grab_pending() becomes %true */
2006-06-28 13:50:33 -07:00
return ret ;
}
2012-08-03 10:30:47 -07:00
EXPORT_SYMBOL_GPL ( mod_delayed_work_on ) ;
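The control flow of mod_delayed_work_on(), including the implicit int-to-bool conversion on return, can be sketched as follows. This is a userspace model under stated assumptions: `struct model_grab` and `model_try_grab()` are hypothetical and simply replay a scripted sequence of results, where 0 is taken to mean "was idle", 1 "was pending", -EAGAIN "retry" and -ENOENT "being canceled by someone else".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical scripted replacement for try_to_grab_pending(). */
struct model_grab {
	const int *results;	/* results to hand out, in order */
	int len;
	int pos;
};

static int model_try_grab(struct model_grab *g)
{
	assert(g->pos < g->len);
	return g->results[g->pos++];
}

static bool model_mod_delayed(struct model_grab *g, int *requeued)
{
	int ret;

	/* Busy-retry while the grab is transiently impossible. */
	do {
		ret = model_try_grab(g);
	} while (ret == -EAGAIN);

	/* Only requeue if PENDING was actually grabbed. */
	if (ret >= 0)
		(*requeued)++;

	/* int to bool: every non-zero value, -ENOENT included, is true */
	return ret;
}
```

Because the function returns bool, a result of 0 maps to false and both 1 and -ENOENT map to true. In the -ENOENT case nothing is requeued, since another task owns the cancellation.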
2018-03-14 12:45:13 -07:00
static void rcu_work_rcufn ( struct rcu_head * rcu )
{
struct rcu_work * rwork = container_of ( rcu , struct rcu_work , rcu ) ;
/* read the comment in __queue_work() */
local_irq_disable ( ) ;
__queue_work ( WORK_CPU_UNBOUND , rwork - > wq , & rwork - > work ) ;
local_irq_enable ( ) ;
}
/**
* queue_rcu_work - queue work after a RCU grace period
* @ wq : workqueue to use
* @ rwork : work to queue
*
* Return : % false if @ rwork was already pending , % true otherwise . Note
* that a full RCU grace period is guaranteed only after a % true return .
2019-03-01 13:57:25 -08:00
* While @ rwork is guaranteed to be executed after a % false return , the
2018-03-14 12:45:13 -07:00
* execution may happen before a full RCU grace period has passed .
*/
bool queue_rcu_work ( struct workqueue_struct * wq , struct rcu_work * rwork )
{
struct work_struct * work = & rwork - > work ;
if ( ! test_and_set_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( work ) ) ) {
rwork - > wq = wq ;
workqueue: Make queue_rcu_work() use call_rcu_hurry()
Earlier commits in this series allow battery-powered systems to build
their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option.
This Kconfig option causes call_rcu() to delay its callbacks in order
to batch them. This means that a given RCU grace period covers more
callbacks, thus reducing the number of grace periods, in turn reducing
the amount of energy consumed, which increases battery lifetime which
can be a very good thing. This is not a subtle effect: In some important
use cases, the battery lifetime is increased by more than 10%.
This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.
Delaying callbacks is normally not a problem because most callbacks do
nothing but free memory. If the system is short on memory, a shrinker
will kick all currently queued lazy callbacks out of their laziness,
thus freeing their memory in short order. Similarly, the rcu_barrier()
function, which blocks until all currently queued callbacks are invoked,
will also kick lazy callbacks, thus enabling rcu_barrier() to complete
in a timely manner.
However, there are some cases where laziness is not a good option.
For example, synchronize_rcu() invokes call_rcu(), and blocks until
the newly queued callback is invoked. It would not be good for
synchronize_rcu() to block for ten seconds, even on an idle system.
Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of
call_rcu(). The arrival of a non-lazy call_rcu_hurry() callback on a
given CPU kicks any lazy callbacks that might be already queued on that
CPU. After all, if there is going to be a grace period, all callbacks
might as well get full benefit from it.
Yes, this could be done the other way around by creating a
call_rcu_lazy(), but earlier experience with this approach and
feedback at the 2022 Linux Plumbers Conference shifted the approach
to call_rcu() being lazy with call_rcu_hurry() for the few places
where laziness is inappropriate.
And another call_rcu() instance that cannot be lazy is the one
in queue_rcu_work(), given that callers to queue_rcu_work() are
not necessarily OK with long delays.
Therefore, make queue_rcu_work() use call_rcu_hurry() in order to revert
to the old behavior.
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-16 16:23:03 +00:00
call_rcu_hurry ( & rwork - > rcu , rcu_work_rcufn ) ;
2018-03-14 12:45:13 -07:00
return true ;
}
return false ;
}
EXPORT_SYMBOL ( queue_rcu_work ) ;
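Both queue_delayed_work_on() and queue_rcu_work() gate queueing on an atomic test-and-set of the PENDING bit, so only the caller that flips the bit from clear to set actually queues the item. The following is a minimal userspace sketch of that pattern using C11 atomics. Every model_* name is hypothetical, and the struct is a stand-in rather than the kernel's work_struct.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PENDING_BIT 0UL

/* Hypothetical stand-in for a work item; not the kernel's struct. */
struct model_work {
	atomic_ulong data;	/* bit 0 plays the role of the PENDING bit */
	int queued_count;	/* demo counter: times the item was queued */
};

static void model_work_init(struct model_work *w)
{
	assert(w != NULL);
	atomic_init(&w->data, 0UL);
	w->queued_count = 0;
}

/* Atomically set the bit and report whether it was already set. */
static bool model_test_and_set_bit(unsigned long bit, atomic_ulong *addr)
{
	unsigned long mask = 1UL << bit;

	return atomic_fetch_or(addr, mask) & mask;
}

/* Returns true only for the caller that claimed the PENDING bit. */
static bool model_queue(struct model_work *w)
{
	if (!model_test_and_set_bit(MODEL_PENDING_BIT, &w->data)) {
		w->queued_count++;
		return true;
	}
	return false;
}

/* Models the point where PENDING is cleared before the item runs. */
static void model_start_execution(struct model_work *w)
{
	atomic_fetch_and(&w->data, ~(1UL << MODEL_PENDING_BIT));
}
```

A second model_queue() call before model_start_execution() returns false and leaves queued_count unchanged, mirroring the "already pending" return documented above.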
2010-06-29 10:07:12 +02:00
/**
* worker_enter_idle - enter idle state
* @ worker : worker which is entering idle state
*
* @ worker is entering idle state . Update stats and idle timer if
* necessary .
*
* LOCKING :
2020-05-27 21:46:33 +02:00
* raw_spin_lock_irq ( pool - > lock ) .
2010-06-29 10:07:12 +02:00
*/
static void worker_enter_idle ( struct worker * worker )
2005-04-16 15:20:36 -07:00
{
2012-07-12 14:46:37 -07:00
struct worker_pool * pool = worker - > pool ;
2010-06-29 10:07:12 +02:00
2013-03-12 11:29:57 -07:00
if ( WARN_ON_ONCE ( worker - > flags & WORKER_IDLE ) | |
WARN_ON_ONCE ( ! list_empty ( & worker - > entry ) & &
( worker - > hentry . next | | worker - > hentry . pprev ) ) )
return ;
2010-06-29 10:07:12 +02:00
2014-07-22 13:03:02 +08:00
/* can't use worker_set_flags(), also called from create_worker() */
2010-07-02 10:03:50 +02:00
worker - > flags | = WORKER_IDLE ;
2012-07-12 14:46:37 -07:00
pool - > nr_idle + + ;
2010-06-29 10:07:14 +02:00
worker - > last_active = jiffies ;
2010-06-29 10:07:12 +02:00
/* idle_list is LIFO */
2012-07-12 14:46:37 -07:00
list_add ( & worker - > entry , & pool - > idle_list ) ;
2010-06-29 10:07:12 +02:00
2012-07-17 12:39:27 -07:00
if ( too_many_workers ( pool ) & & ! timer_pending ( & pool - > idle_timer ) )
mod_timer ( & pool - > idle_timer , jiffies + IDLE_WORKER_TIMEOUT ) ;
2010-07-02 10:03:50 +02:00
2021-12-07 15:35:41 +08:00
/* Sanity check nr_running. */
2021-12-23 20:31:40 +08:00
WARN_ON_ONCE ( pool - > nr_workers = = pool - > nr_idle & & pool - > nr_running ) ;
2010-06-29 10:07:12 +02:00
}
/**
* worker_leave_idle - leave idle state
* @ worker : worker which is leaving idle state
*
* @ worker is leaving idle state . Update stats .
*
* LOCKING :
2020-05-27 21:46:33 +02:00
* raw_spin_lock_irq ( pool - > lock ) .
2010-06-29 10:07:12 +02:00
*/
static void worker_leave_idle ( struct worker * worker )
{
2012-07-12 14:46:37 -07:00
struct worker_pool * pool = worker - > pool ;
2010-06-29 10:07:12 +02:00
2013-03-12 11:29:57 -07:00
if ( WARN_ON_ONCE ( ! ( worker - > flags & WORKER_IDLE ) ) )
return ;
2010-06-29 10:07:13 +02:00
worker_clr_flags ( worker , WORKER_IDLE ) ;
2012-07-12 14:46:37 -07:00
pool - > nr_idle - - ;
2010-06-29 10:07:12 +02:00
list_del_init ( & worker - > entry ) ;
}
2014-07-15 17:24:15 +08:00
static struct worker * alloc_worker ( int node )
2010-06-29 10:07:11 +02:00
{
struct worker * worker ;
2014-07-15 17:24:15 +08:00
worker = kzalloc_node ( sizeof ( * worker ) , GFP_KERNEL , node ) ;
2010-06-29 10:07:12 +02:00
if ( worker ) {
INIT_LIST_HEAD ( & worker - > entry ) ;
2010-06-29 10:07:12 +02:00
INIT_LIST_HEAD ( & worker - > scheduled ) ;
2014-05-20 17:46:31 +08:00
INIT_LIST_HEAD ( & worker - > node ) ;
2010-06-29 10:07:14 +02:00
/* on creation a worker is in !idle && prep state */
worker - > flags = WORKER_PREP ;
2010-06-29 10:07:12 +02:00
}
2010-06-29 10:07:11 +02:00
return worker ;
}
2014-05-20 17:46:35 +08:00
/**
* worker_attach_to_pool ( ) - attach a worker to a pool
* @ worker : worker to be attached
* @ pool : the target pool
*
* Attach @ worker to @ pool . Once attached , the % WORKER_UNBOUND flag and
* cpu - binding of @ worker are kept coordinated with the pool across
* cpu - [ un ] hotplugs .
*/
static void worker_attach_to_pool ( struct worker * worker ,
struct worker_pool * pool )
{
2018-05-18 08:47:13 -07:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2014-05-20 17:46:35 +08:00
/*
2018-05-18 08:47:13 -07:00
* The wq_pool_attach_mutex ensures % POOL_DISASSOCIATED remains
* stable across this function . See the comments above the flag
* definition for details .
2014-05-20 17:46:35 +08:00
*/
if ( pool - > flags & POOL_DISASSOCIATED )
worker - > flags | = WORKER_UNBOUND ;
2021-01-12 11:26:49 +01:00
else
kthread_set_per_cpu ( worker - > task , pool - > cpu ) ;
2014-05-20 17:46:35 +08:00
2021-01-15 19:08:36 +01:00
if ( worker - > rescue_wq )
set_cpus_allowed_ptr ( worker - > task , pool - > attrs - > cpumask ) ;
2014-05-20 17:46:35 +08:00
list_add_tail ( & worker - > node , & pool - > workers ) ;
2018-05-18 08:47:13 -07:00
worker - > pool = pool ;
2014-05-20 17:46:35 +08:00
2018-05-18 08:47:13 -07:00
mutex_unlock ( & wq_pool_attach_mutex ) ;
2014-05-20 17:46:35 +08:00
}
workqueue: async worker destruction
worker destruction includes these parts of code:
adjust pool's stats
remove the worker from idle list
detach the worker from the pool
kthread_stop() to wait for the worker's task exit
free the worker struct
We can find out that there is no essential work to do after
kthread_stop(), which means destroy_worker() doesn't need to wait for
the worker's task exit, so we can remove kthread_stop() and free the
worker struct in the worker exiting path.
However, put_unbound_pool() still needs to sync all the workers'
destruction before destroying the pool; otherwise, the workers may
access the invalid pool when they are exiting.
So we also move the code of "detach the worker" to the exiting
path and let put_unbound_pool() sync with this code via
detach_completion.
The code of "detach the worker" is wrapped in a new function
"worker_detach_from_pool()" although worker_detach_from_pool() is only
called once (in worker_thread()) after this patch, but we need to wrap
it for these reasons:
1) The code of "detach the worker" is not short enough to unfold them
in worker_thread().
2) the name of "worker_detach_from_pool()" is self-comment, and we add
some comments above the function.
3) it will be shared by rescuer in later patch which allows rescuer
and normal thread use the same attach/detach frameworks.
The worker id is freed when detaching which happens before the worker
is fully dead, but this id of the dying worker may be re-used for a
new worker, so the dying worker's task name is changed to
"worker/dying" to avoid two or several workers having the same name.
Since "detach the worker" is moved out from destroy_worker(),
destroy_worker() doesn't require manager_mutex, so the
"lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is
removed, and destroy_worker() is not protected by manager_mutex in
put_unbound_pool().
tj: Minor description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 17:46:29 +08:00
/**
* worker_detach_from_pool ( ) - detach a worker from its pool
* @ worker : worker which is attached to its pool
*
2014-05-20 17:46:35 +08:00
* Undo the attaching which had been done in worker_attach_to_pool ( ) . The
* calling worker shouldn ' t access the pool after being detached unless it
* holds another reference to the pool .
2014-05-20 17:46:29 +08:00
*/
2018-05-18 08:47:13 -07:00
static void worker_detach_from_pool ( struct worker * worker )
2014-05-20 17:46:29 +08:00
{
2018-05-18 08:47:13 -07:00
struct worker_pool * pool = worker - > pool ;
2014-05-20 17:46:29 +08:00
struct completion * detach_completion = NULL ;
2018-05-18 08:47:13 -07:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2018-05-18 08:47:13 -07:00
2021-01-12 11:26:49 +01:00
kthread_set_per_cpu ( worker - > task , - 1 ) ;
2014-05-20 17:46:31 +08:00
list_del ( & worker - > node ) ;
2018-05-18 08:47:13 -07:00
worker - > pool = NULL ;
2014-05-20 17:46:31 +08:00
if ( list_empty ( & pool - > workers ) )
2014-05-20 17:46:29 +08:00
detach_completion = pool - > detach_completion ;
2018-05-18 08:47:13 -07:00
mutex_unlock ( & wq_pool_attach_mutex ) ;
2014-05-20 17:46:29 +08:00
2014-06-03 15:32:52 +08:00
/* clear leftover flags without pool->lock after it is detached */
worker - > flags & = ~ ( WORKER_UNBOUND | WORKER_REBOUND ) ;
2014-05-20 17:46:29 +08:00
if ( detach_completion )
complete ( detach_completion ) ;
}
2010-06-29 10:07:11 +02:00
/**
* create_worker - create a new workqueue worker
2012-07-12 14:46:37 -07:00
* @ pool : pool the new worker will belong to
2010-06-29 10:07:11 +02:00
*
2014-07-22 13:03:02 +08:00
* Create and start a new worker which is attached to @ pool .
2010-06-29 10:07:11 +02:00
*
* CONTEXT :
* Might sleep . Does GFP_KERNEL allocations .
*
2013-07-31 14:59:24 -07:00
* Return :
2010-06-29 10:07:11 +02:00
* Pointer to the newly created worker .
*/
2012-07-17 12:39:27 -07:00
static struct worker * create_worker ( struct worker_pool * pool )
2010-06-29 10:07:11 +02:00
{
2021-08-04 11:50:36 +08:00
struct worker * worker ;
int id ;
2013-04-01 11:23:32 -07:00
char id_buf [ 16 ] ;
2010-06-29 10:07:11 +02:00
2014-05-20 17:46:32 +08:00
/* ID is needed to determine kthread name */
2021-08-04 11:50:36 +08:00
id = ida_alloc ( & pool - > worker_ida , GFP_KERNEL ) ;
2013-03-19 13:45:21 -07:00
if ( id < 0 )
2021-08-04 11:50:36 +08:00
return NULL ;
2010-06-29 10:07:11 +02:00
2014-07-15 17:24:15 +08:00
worker = alloc_worker ( pool - > node ) ;
2010-06-29 10:07:11 +02:00
if ( ! worker )
goto fail ;
worker - > id = id ;
2013-03-12 11:30:03 -07:00
if ( pool - > cpu > = 0 )
2013-04-01 11:23:32 -07:00
snprintf ( id_buf , sizeof ( id_buf ) , " %d:%d%s " , pool - > cpu , id ,
pool - > attrs - > nice < 0 ? " H " : " " ) ;
2010-07-02 10:03:51 +02:00
else
2013-04-01 11:23:32 -07:00
snprintf ( id_buf , sizeof ( id_buf ) , " u%d:%d " , pool - > id , id ) ;
2013-04-01 11:23:34 -07:00
worker - > task = kthread_create_on_node ( worker_thread , worker , pool - > node ,
2013-04-01 11:23:32 -07:00
" kworker/%s " , id_buf ) ;
2010-06-29 10:07:11 +02:00
if ( IS_ERR ( worker - > task ) )
goto fail ;
2013-11-14 12:56:18 +01:00
set_user_nice ( worker - > task , pool - > attrs - > nice ) ;
2015-05-15 17:43:34 +02:00
kthread_bind_mask ( worker - > task , pool - > attrs - > cpumask ) ;
2013-11-14 12:56:18 +01:00
2014-05-20 17:46:31 +08:00
/* successful, attach the worker to the pool */
2014-05-20 17:46:35 +08:00
worker_attach_to_pool ( worker , pool ) ;
2013-03-19 13:45:21 -07:00
2014-07-22 13:03:02 +08:00
/* start the newly created worker */
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
2014-07-22 13:03:02 +08:00
worker - > pool - > nr_workers + + ;
worker_enter_idle ( worker ) ;
wake_up_process ( worker - > task ) ;
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2014-07-22 13:03:02 +08:00
2010-06-29 10:07:11 +02:00
return worker ;
2013-03-19 13:45:21 -07:00
2010-06-29 10:07:11 +02:00
fail :
2021-08-04 11:50:36 +08:00
ida_free ( & pool - > worker_ida , id ) ;
2010-06-29 10:07:11 +02:00
kfree ( worker ) ;
return NULL ;
}
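The two snprintf() formats in create_worker() determine the kthread names seen in process listings. The sketch below reproduces that formatting in userspace; model_worker_name() and its parameter list are hypothetical, with the pool fields passed in as plain integers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the name formats used by create_worker().
 * @cpu >= 0 models a per-CPU pool, @cpu < 0 an unbound pool identified by
 * @pool_id. A negative @nice marks a high priority pool with an "H" suffix.
 * Returns the length of the name stored in @buf.
 */
static size_t model_worker_name(char *buf, size_t len, int cpu, int pool_id,
				int id, int nice)
{
	char id_buf[16];

	assert(buf != NULL && len > 0);

	if (cpu >= 0)
		snprintf(id_buf, sizeof(id_buf), "%d:%d%s", cpu, id,
			 nice < 0 ? "H" : "");
	else
		snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool_id, id);

	snprintf(buf, len, "kworker/%s", id_buf);
	return strlen(buf);
}
```

With these formats, worker 1 of the high priority pool on CPU 3 is named "kworker/3:1H", and worker 2 of unbound pool 8 is named "kworker/u8:2".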
/**
* destroy_worker - destroy a workqueue worker
* @ worker : worker to be destroyed
*
2014-05-20 17:46:28 +08:00
* Destroy @ worker and adjust @ pool stats accordingly . The worker should
* be idle .
2010-06-29 10:07:12 +02:00
*
* CONTEXT :
2020-05-27 21:46:33 +02:00
* raw_spin_lock_irq ( pool - > lock ) .
2010-06-29 10:07:11 +02:00
*/
static void destroy_worker ( struct worker * worker )
{
2012-07-12 14:46:37 -07:00
struct worker_pool * pool = worker - > pool ;
2010-06-29 10:07:11 +02:00
2013-03-13 19:47:39 -07:00
lockdep_assert_held ( & pool - > lock ) ;
2010-06-29 10:07:11 +02:00
/* sanity check frenzy */
2013-03-12 11:29:57 -07:00
if ( WARN_ON ( worker - > current_work ) | |
2014-05-20 17:46:28 +08:00
WARN_ON ( ! list_empty ( & worker - > scheduled ) ) | |
WARN_ON ( ! ( worker - > flags & WORKER_IDLE ) ) )
2013-03-12 11:29:57 -07:00
return ;
2010-06-29 10:07:11 +02:00
2014-05-20 17:46:28 +08:00
pool - > nr_workers - - ;
pool - > nr_idle - - ;
2014-02-15 22:02:28 +08:00
2010-06-29 10:07:12 +02:00
list_del_init ( & worker - > entry ) ;
2010-07-02 10:03:50 +02:00
worker - > flags | = WORKER_DIE ;
2014-05-20 17:46:29 +08:00
wake_up_process ( worker - > task ) ;
2010-06-29 10:07:11 +02:00
}
2017-10-16 15:58:25 -07:00
static void idle_worker_timeout ( struct timer_list * t )
2010-06-29 10:07:14 +02:00
{
2017-10-16 15:58:25 -07:00
struct worker_pool * pool = from_timer ( pool , t , idle_timer ) ;
2010-06-29 10:07:14 +02:00
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
2010-06-29 10:07:14 +02:00
2014-05-20 17:46:30 +08:00
while ( too_many_workers ( pool ) ) {
2010-06-29 10:07:14 +02:00
struct worker * worker ;
unsigned long expires ;
/* idle_list is kept in LIFO order, check the last one */
2012-07-12 14:46:37 -07:00
worker = list_entry ( pool - > idle_list . prev , struct worker , entry ) ;
2010-06-29 10:07:14 +02:00
expires = worker - > last_active + IDLE_WORKER_TIMEOUT ;
2014-05-20 17:46:30 +08:00
if ( time_before ( jiffies , expires ) ) {
2012-07-12 14:46:37 -07:00
mod_timer ( & pool - > idle_timer , expires ) ;
2014-05-20 17:46:30 +08:00
break ;
2006-12-06 20:37:26 -08:00
}
2014-05-20 17:46:30 +08:00
destroy_worker ( worker ) ;
2010-06-29 10:07:14 +02:00
}
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-06-29 10:07:14 +02:00
}
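worker_enter_idle() pushes workers onto the head of idle_list, so idle_worker_timeout() can find the longest-idle worker at the tail and stop at the first one that has not yet expired. The following is a userspace sketch of that scheme. All model_* names are hypothetical, the too_many_workers() policy is replaced by a simple assumed threshold @keep, and model_time_before() is a wraparound-safe comparison in the spirit of the kernel's time_before() (it relies on the usual two's-complement conversion to long).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Wraparound-safe "a is before b" on free-running tick counters. */
static bool model_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

struct model_worker {
	unsigned long last_active;
	struct model_worker *prev, *next;
};

struct model_pool {
	struct model_worker head;	/* circular list sentinel */
	int nr_idle;
};

static void model_pool_init(struct model_pool *p)
{
	p->head.prev = p->head.next = &p->head;
	p->nr_idle = 0;
}

/* Push at the head: the most recently idled worker is found first. */
static void model_enter_idle(struct model_pool *p, struct model_worker *w,
			     unsigned long now)
{
	w->last_active = now;
	w->next = p->head.next;
	w->prev = &p->head;
	p->head.next->prev = w;
	p->head.next = w;
	p->nr_idle++;
}

/*
 * Reap from the tail (longest idle) while more than @keep workers are idle.
 * Stops at the first worker that has not expired and stores its expiry in
 * *next_expiry, which is where a timer would be re-armed. Returns the
 * number of workers removed.
 */
static int model_reap(struct model_pool *p, unsigned long now,
		      unsigned long timeout, int keep,
		      unsigned long *next_expiry)
{
	int reaped = 0;

	assert(keep >= 0 && next_expiry != NULL);

	while (p->nr_idle > keep) {
		struct model_worker *w = p->head.prev;
		unsigned long expires = w->last_active + timeout;

		if (model_time_before(now, expires)) {
			*next_expiry = expires;
			break;
		}
		w->prev->next = w->next;
		w->next->prev = w->prev;
		p->nr_idle--;
		reaped++;
	}
	return reaped;
}
```

Because insertion is at the head and reaping is from the tail, the loop never has to scan the list: once the tail worker is too young to reap, every other worker is younger still.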
2006-12-06 20:37:26 -08:00
2013-03-12 11:29:59 -07:00
static void send_mayday ( struct work_struct * work )
2010-06-29 10:07:14 +02:00
{
2013-02-13 19:29:12 -08:00
struct pool_workqueue * pwq = get_work_pwq ( work ) ;
struct workqueue_struct * wq = pwq - > wq ;
2013-03-12 11:29:59 -07:00
2013-03-13 19:47:40 -07:00
lockdep_assert_held ( & wq_mayday_lock ) ;
2010-06-29 10:07:14 +02:00
2013-03-12 11:30:03 -07:00
if ( ! wq - > rescuer )
2013-03-12 11:29:59 -07:00
return ;
2010-06-29 10:07:14 +02:00
/* mayday mayday mayday */
2013-03-12 11:29:59 -07:00
if ( list_empty ( & pwq - > mayday_node ) ) {
2014-04-18 11:04:16 -04:00
/*
* If @ pwq is for an unbound wq , its base ref may be put at
* any time due to an attribute change . Pin @ pwq until the
* rescuer is done with it .
*/
get_pwq ( pwq ) ;
2013-03-12 11:29:59 -07:00
list_add_tail ( & pwq - > mayday_node , & wq - > maydays ) ;
2010-06-29 10:07:14 +02:00
wake_up_process ( wq - > rescuer - > task ) ;
2013-03-12 11:29:59 -07:00
}
2010-06-29 10:07:14 +02:00
}
2017-10-16 15:58:25 -07:00
static void pool_mayday_timeout ( struct timer_list * t )
2010-06-29 10:07:14 +02:00
{
2017-10-16 15:58:25 -07:00
struct worker_pool * pool = from_timer ( pool , t , mayday_timer ) ;
2010-06-29 10:07:14 +02:00
struct work_struct * work ;
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
raw_spin_lock ( & wq_mayday_lock ) ; /* for wq->maydays */
2010-06-29 10:07:14 +02:00
2012-07-12 14:46:37 -07:00
if ( need_to_create_worker ( pool ) ) {
2010-06-29 10:07:14 +02:00
/*
* We ' ve been trying to create a new worker but
* haven ' t been successful . We might be hitting an
* allocation deadlock . Send distress signals to
* rescuers .
*/
2012-07-12 14:46:37 -07:00
list_for_each_entry ( work , & pool - > worklist , entry )
2010-06-29 10:07:14 +02:00
send_mayday ( work ) ;
2005-04-16 15:20:36 -07:00
}
2010-06-29 10:07:14 +02:00
2020-05-27 21:46:33 +02:00
raw_spin_unlock ( & wq_mayday_lock ) ;
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-06-29 10:07:14 +02:00
2012-07-12 14:46:37 -07:00
mod_timer ( & pool - > mayday_timer , jiffies + MAYDAY_INTERVAL ) ;
2005-04-16 15:20:36 -07:00
}
2010-06-29 10:07:14 +02:00
/**
* maybe_create_worker - create a new worker if necessary
2012-07-12 14:46:37 -07:00
* @ pool : pool to create a new worker for
2010-06-29 10:07:14 +02:00
*
2012-07-12 14:46:37 -07:00
* Create a new worker for @ pool if necessary . @ pool is guaranteed to
2010-06-29 10:07:14 +02:00
* have at least one idle worker on return from this function . If
* creating a new worker takes longer than MAYDAY_INTERVAL , mayday is
2012-07-12 14:46:37 -07:00
* sent to all rescuers with works scheduled on @ pool to resolve
2010-06-29 10:07:14 +02:00
* possible allocation deadlock .
*
2013-03-13 16:51:36 -07:00
* On return , need_to_create_worker ( ) is guaranteed to be % false and
* may_start_working ( ) % true .
2010-06-29 10:07:14 +02:00
*
* LOCKING :
2020-05-27 21:46:33 +02:00
* raw_spin_lock_irq ( pool - > lock ) which may be released and regrabbed
2010-06-29 10:07:14 +02:00
* multiple times . Does GFP_KERNEL allocations . Called only from
* manager .
*/
workqueue: fix subtle pool management issue which can stall whole worker_pool
A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role to create more workers and
summon the rescuers if creating workers doesn't succeed in timely
manner before proceeding to execute work items.
This manager role is implemented in manage_workers(), which indicates
whether the worker may proceed to work item execution with its return
value. This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, others should
proceed to work item execution.
Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function. need_to_create_worker() tests the following
conditions.
pending work items && !nr_running && !nr_idle
The first and third conditions are protected by pool->lock and thus
won't change while holding pool->lock; however, nr_running can change
asynchronously as other workers block and resume. While it's likely
to be zero, as someone woke this worker up in the first place, some
other workers could have become runnable in between, making it non-zero.
If this happens, manage_workers() could return false even with zero
nr_idle, making the worker, the last idle one, proceed to execute work
items. If then all workers of the pool end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one to create more
workers or summon the rescuers.
This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.
We could leave the early exit condition alone and just ignore the return
value, but the only reason it was put there is that manage_workers()
used to perform both creation and destruction of workers, and thus the
function could be invoked while the pool was trying to reduce the number
of workers. Now that manage_workers() is called only when more workers
are needed, the early exit condition is triggered only by rare race
conditions, which renders it pointless.
Tested with a simulated workload and modified workqueue code which
trigger the pool deadlock reliably without this patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Eric Sandeen <sandeen@sandeen.net>
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner <david@fromorbit.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: stable@vger.kernel.org
2015-01-16 14:21:16 -05:00
static void maybe_create_worker ( struct worker_pool * pool )
2013-01-24 11:01:33 -08:00
__releases ( & pool - > lock )
__acquires ( & pool - > lock )
2005-04-16 15:20:36 -07:00
{
2010-06-29 10:07:14 +02:00
restart :
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-07-14 11:31:20 +02:00
2010-06-29 10:07:14 +02:00
/* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */
2012-07-12 14:46:37 -07:00
mod_timer ( & pool - > mayday_timer , jiffies + MAYDAY_INITIAL_TIMEOUT ) ;
2010-06-29 10:07:14 +02:00
while ( true ) {
2014-07-22 13:03:02 +08:00
if ( create_worker ( pool ) | | ! need_to_create_worker ( pool ) )
2010-06-29 10:07:14 +02:00
break ;
2005-04-16 15:20:36 -07:00
2014-06-03 15:32:17 +08:00
schedule_timeout_interruptible ( CREATE_COOLDOWN ) ;
2010-07-14 11:31:20 +02:00
2012-07-12 14:46:37 -07:00
if ( ! need_to_create_worker ( pool ) )
2010-06-29 10:07:14 +02:00
break ;
}
2012-07-12 14:46:37 -07:00
del_timer_sync ( & pool - > mayday_timer ) ;
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
2014-07-22 13:03:02 +08:00
/*
* This is necessary even after a new worker was just successfully
* created as @ pool - > lock was dropped and the new worker might have
* already become busy .
*/
2012-07-12 14:46:37 -07:00
if ( need_to_create_worker ( pool ) )
2010-06-29 10:07:14 +02:00
goto restart ;
}
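The commit message above quotes the condition tested by need_to_create_worker() and explains why an early exit based on it was unsafe. The following is a minimal userspace sketch of that reasoning. Every name in it (`struct cond_model`, `model_need_to_create_worker`, `model_old_early_exit`) is hypothetical, and the integer fields are simplified stand-ins for the real pool state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pool fields that the condition reads. */
struct cond_model {
	int nr_pending;	/* work items waiting on the worklist */
	int nr_running;	/* may change without pool->lock being held */
	int nr_idle;	/* stable while pool->lock is held */
};

/* Models: pending work items && !nr_running && !nr_idle */
static bool model_need_to_create_worker(const struct cond_model *p)
{
	return p->nr_pending > 0 && p->nr_running == 0 && p->nr_idle == 0;
}

/*
 * Models the early exit that the fix removed: management was skipped
 * whenever the condition happened to read false at that instant.
 */
static bool model_old_early_exit(const struct cond_model *p)
{
	return !model_need_to_create_worker(p);
}
```

In this model, a pool with pending work, no idle workers, and a momentarily non-zero nr_running makes the old early exit fire even though nobody is left to create workers. That is the situation the commit message describes as leading to a stalled pool.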
2010-06-29 10:07:11 +02:00
/**
2010-06-29 10:07:14 +02:00
* manage_workers - manage worker pool
* @ worker : self
2010-06-29 10:07:11 +02:00
*
2013-01-24 11:01:34 -08:00
* Assume the manager role and manage the worker pool @ worker belongs
2010-06-29 10:07:14 +02:00
* to . At any given time , there can be only zero or one manager per
2013-01-24 11:01:34 -08:00
* pool . The exclusion is handled automatically by this function .
2010-06-29 10:07:14 +02:00
*
* The caller can safely start processing works on false return . On
* true return , it ' s guaranteed that need_to_create_worker ( ) is false
* and may_start_working ( ) is true .
2010-06-29 10:07:11 +02:00
*
* CONTEXT :
2020-05-27 21:46:33 +02:00
* raw_spin_lock_irq ( pool - > lock ) which may be released and regrabbed
2010-06-29 10:07:14 +02:00
* multiple times . Does GFP_KERNEL allocations .
*
2013-07-31 14:59:24 -07:00
* Return :
2015-01-16 14:21:16 -05:00
* % false if the pool doesn ' t need management and the caller can safely
* start processing works , % true if management function was performed and
* the conditions that the caller verified before calling the function may
* no longer be true .
2010-06-29 10:07:11 +02:00
*/
2010-06-29 10:07:14 +02:00
static bool manage_workers ( struct worker * worker )
2010-06-29 10:07:11 +02:00
{
2012-07-12 14:46:37 -07:00
struct worker_pool * pool = worker - > pool ;
2010-06-29 10:07:11 +02:00
2017-10-09 08:04:13 -07:00
if ( pool - > flags & POOL_MANAGER_ACTIVE )
2015-01-16 14:21:16 -05:00
return false ;
2017-10-09 08:04:13 -07:00
pool - > flags | = POOL_MANAGER_ACTIVE ;
2015-03-09 09:22:28 -04:00
pool - > manager = worker ;
2010-06-29 10:07:12 +02:00
2015-01-16 14:21:16 -05:00
maybe_create_worker ( pool ) ;
2010-06-29 10:07:14 +02:00
2015-03-09 09:22:28 -04:00
pool - > manager = NULL ;
2017-10-09 08:04:13 -07:00
pool - > flags & = ~ POOL_MANAGER_ACTIVE ;
2020-05-27 21:46:32 +02:00
rcuwait_wake_up ( & manager_wait ) ;
2015-01-16 14:21:16 -05:00
return true ;
2010-06-29 10:07:11 +02:00
}
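The manager exclusion in manage_workers() can be sketched as a single-threaded userspace model. This is only an illustration under stated assumptions: `struct mgr_model`, `MODEL_POOL_MANAGER_ACTIVE`, and `model_manage_workers` are hypothetical names, the counter increment stands in for maybe_create_worker(), and the real function depends on pool->lock for the flag test and update to be safe.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_POOL_MANAGER_ACTIVE (1u << 0)	/* hypothetical flag value */

/* Hypothetical stand-in for the parts of worker_pool used here. */
struct mgr_model {
	unsigned int flags;
	int workers_created;
};

/*
 * Returns false only when another manager is already active; otherwise
 * takes the manager role, performs the (modelled) creation step, drops
 * the role again, and returns true.
 */
static bool model_manage_workers(struct mgr_model *pool)
{
	if (pool->flags & MODEL_POOL_MANAGER_ACTIVE)
		return false;

	pool->flags |= MODEL_POOL_MANAGER_ACTIVE;
	pool->workers_created++;	/* stand-in for maybe_create_worker() */
	pool->flags &= ~MODEL_POOL_MANAGER_ACTIVE;
	return true;
}
```

The point of the design is that a false return carries exactly one meaning, namely that someone else is managing the pool, so a caller that sees false may proceed to work items without leaving the pool unmanaged.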
2010-06-29 10:07:10 +02:00
/**
* process_one_work - process single work
2010-06-29 10:07:11 +02:00
* @ worker : self
2010-06-29 10:07:10 +02:00
* @ work : work to process
*
* Process @ work . This function contains all the logic necessary to
* process a single work including synchronization against and
* interaction with other workers on the same cpu , queueing and
* flushing . As long as context requirement is met , any worker can
* call this function to process a work .
*
* CONTEXT :
2020-05-27 21:46:33 +02:00
* raw_spin_lock_irq ( pool - > lock ) which is released and regrabbed .
2010-06-29 10:07:10 +02:00
*/
2010-06-29 10:07:11 +02:00
static void process_one_work ( struct worker * worker , struct work_struct * work )
2013-01-24 11:01:33 -08:00
__releases ( & pool - > lock )
__acquires ( & pool - > lock )
2010-06-29 10:07:10 +02:00
{
2013-02-13 19:29:12 -08:00
struct pool_workqueue * pwq = get_work_pwq ( work ) ;
2012-07-12 14:46:37 -07:00
struct worker_pool * pool = worker - > pool ;
2013-02-13 19:29:12 -08:00
bool cpu_intensive = pwq - > wq - > flags & WQ_CPU_INTENSIVE ;
2021-08-17 09:32:35 +08:00
unsigned long work_data ;
2010-06-29 10:07:13 +02:00
struct worker * collision ;
2010-06-29 10:07:10 +02:00
# ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
* inside the function that is called from it ; we need to
* take this into account for lockdep too . To avoid bogus " held
* lock freed " warnings as well as problems when looking into
* work - > lockdep_map , make a copy and use that here .
*/
lockdep: fix oops in processing workqueue
Under memory load, on x86_64, with lockdep enabled, the workqueue's
process_one_work() has been seen to oops in __lock_acquire(), barfing
on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[].
Because it's permissible to free a work_struct from its callout function,
the map used is an onstack copy of the map given in the work_struct: and
that copy is made without any locking.
Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than
"rep movsq" for that structure copy: which might race with a workqueue
user's wait_on_work() doing lock_map_acquire() on the source of the
copy, putting a pointer into the class_cache[], but only in time for
the top half of that pointer to be copied to the destination map.
Boom when process_one_work() subsequently does lock_map_acquire()
on its onstack copy of the lockdep_map.
Fix this, and a similar instance in call_timer_fn(), with a
lockdep_copy_map() function which additionally NULLs the class_cache[].
Note: this oops was actually seen on 3.4-next, where flush_work() newly
does the racing lock_map_acquire(); but Tejun points out that 3.4 and
earlier are already vulnerable to the same through wait_on_work().
* Patch originally from Peter. Hugh modified it a bit and wrote the
description.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Hugh Dickins <hughd@google.com>
LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-05-15 08:06:19 -07:00
struct lockdep_map lockdep_map ;
lockdep_copy_map ( & lockdep_map , & work - > lockdep_map ) ;
2010-06-29 10:07:10 +02:00
# endif
2014-06-03 15:33:28 +08:00
/* ensure we're on the correct CPU */
2014-06-03 15:33:28 +08:00
WARN_ON_ONCE ( ! ( pool - > flags & POOL_DISASSOCIATED ) & &
2013-01-24 11:01:33 -08:00
raw_smp_processor_id ( ) ! = pool - > cpu ) ;
2012-07-17 12:39:27 -07:00
2010-06-29 10:07:13 +02:00
/*
* A single work shouldn ' t be executed concurrently by
* multiple workers on a single cpu . Check whether anyone is
* already processing the work . If so , defer the work to the
* currently executing one .
*/
2013-01-24 11:01:33 -08:00
collision = find_worker_executing_work ( pool , work ) ;
2010-06-29 10:07:13 +02:00
if ( unlikely ( collision ) ) {
move_linked_works ( work , & collision - > scheduled , NULL ) ;
return ;
}
2012-08-03 10:30:45 -07:00
/* claim and dequeue */
2010-06-29 10:07:10 +02:00
debug_work_deactivate ( work ) ;
2013-01-24 11:01:33 -08:00
hash_add ( pool - > busy_hash , & worker - > hentry , ( unsigned long ) work ) ;
2010-06-29 10:07:11 +02:00
worker - > current_work = work ;
2012-12-18 10:35:02 -08:00
worker - > current_func = work - > func ;
2013-02-13 19:29:12 -08:00
worker - > current_pwq = pwq ;
2021-08-17 09:32:35 +08:00
work_data = * work_data_bits ( work ) ;
2021-08-17 09:32:38 +08:00
worker - > current_color = get_work_color ( work_data ) ;
2010-06-29 10:07:13 +02:00
2018-05-18 08:47:13 -07:00
/*
* Record wq name for cmdline and debug reporting , may get
* overridden through set_worker_desc ( ) .
*/
strscpy ( worker - > desc , pwq - > wq - > name , WORKER_DESC_LEN ) ;
2010-06-29 10:07:10 +02:00
list_del_init ( & work - > entry ) ;
2010-06-29 10:07:15 +02:00
/*
2014-07-22 13:02:00 +08:00
* CPU intensive works don ' t participate in concurrency management .
* They ' re the scheduler ' s responsibility . This takes @ worker out
* of concurrency management and the next code block will chain
* execution of the pending work items .
2010-06-29 10:07:15 +02:00
*/
if ( unlikely ( cpu_intensive ) )
2014-07-22 13:02:00 +08:00
worker_set_flags ( worker , WORKER_CPU_INTENSIVE ) ;
2010-06-29 10:07:15 +02:00
2012-07-12 14:46:37 -07:00
/*
2014-07-22 13:01:59 +08:00
* Wake up another worker if necessary . The condition is always
* false for normal per - cpu workers since nr_running would always
* be > = 1 at this point . This is used to chain execution of the
* pending work items for WORKER_NOT_RUNNING workers such as the
2014-07-22 13:02:00 +08:00
* UNBOUND and CPU_INTENSIVE ones .
2012-07-12 14:46:37 -07:00
*/
2014-07-22 13:01:59 +08:00
if ( need_more_worker ( pool ) )
2012-07-12 14:46:37 -07:00
wake_up_worker ( pool ) ;
2012-07-12 14:46:37 -07:00
2012-08-03 10:30:45 -07:00
/*
2013-01-24 11:01:33 -08:00
* Record the last pool and clear PENDING which should be the last
2013-01-24 11:01:33 -08:00
* update to @ work . Also , do this inside @ pool - > lock so that
2012-08-13 17:08:19 -07:00
* PENDING and queued state changes happen together while IRQ is
* disabled .
2012-08-03 10:30:45 -07:00
*/
2013-01-24 11:01:33 -08:00
set_work_pool_and_clear_pending ( work , pool - > id ) ;
2010-06-29 10:07:10 +02:00
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-06-29 10:07:10 +02:00
2017-08-23 12:52:32 +02:00
lock_map_acquire ( & pwq - > wq - > lockdep_map ) ;
2010-06-29 10:07:10 +02:00
lock_map_acquire ( & lockdep_map ) ;
2017-08-23 13:23:30 +02:00
/*
2017-08-29 10:59:39 +02:00
* Strictly speaking we should mark the invariant state without holding
* any locks , that is , before these two lock_map_acquire ( ) ' s .
2017-08-23 13:23:30 +02:00
*
* However , that would result in :
*
* A ( W1 )
* WFC ( C )
* A ( W1 )
* C ( C )
*
* Which would create W1 - > C - > W1 dependencies , even though there is no
* actual deadlock possible . There are two solutions , using a
* read - recursive acquire on the work ( queue ) ' locks ' , but this will then
2017-08-29 10:59:39 +02:00
* hit the lockdep limitation on recursive locks , or simply discard
2017-08-23 13:23:30 +02:00
* these locks .
*
* AFAICT there is no possible deadlock scenario between the
* flush_work ( ) and complete ( ) primitives ( except for single - threaded
* workqueues ) , so hiding them isn ' t a problem .
*/
2017-08-29 10:59:39 +02:00
lockdep_invariant_state ( true ) ;
2010-08-21 13:07:26 -07:00
trace_workqueue_execute_start ( work ) ;
2012-12-18 10:35:02 -08:00
worker - > current_func ( work ) ;
2010-08-21 13:07:26 -07:00
/*
* While we must be careful to not use " work " after this , the trace
* point will only record its address .
*/
2020-01-13 17:52:39 -05:00
trace_workqueue_execute_end ( work , worker - > current_func ) ;
2010-06-29 10:07:10 +02:00
lock_map_release ( & lockdep_map ) ;
2013-02-13 19:29:12 -08:00
lock_map_release ( & pwq - > wq - > lockdep_map ) ;
2010-06-29 10:07:10 +02:00
if ( unlikely ( in_atomic ( ) | | lockdep_depth ( current ) > 0 ) ) {
2012-08-19 00:52:42 +03:00
pr_err ( " BUG: workqueue leaked lock or atomic: %s/0x%08x/%d \n "
2019-03-25 21:32:28 +02:00
" last function: %ps \n " ,
2012-12-18 10:35:02 -08:00
current - > comm , preempt_count ( ) , task_pid_nr ( current ) ,
worker - > current_func ) ;
2010-06-29 10:07:10 +02:00
debug_show_held_locks ( current ) ;
dump_stack ( ) ;
}
2013-08-28 17:33:37 -04:00
/*
2019-10-15 21:18:21 +02:00
* The following prevents a kworker from hogging CPU on ! PREEMPTION
2013-08-28 17:33:37 -04:00
* kernels , where a requeueing work item waiting for something to
* happen could deadlock with stop_machine as such work item could
* indefinitely requeue itself while all other CPUs are trapped in
2014-10-05 13:24:21 -04:00
* stop_machine . At the same time , report a quiescent RCU state so
* the same condition doesn ' t freeze RCU .
2013-08-28 17:33:37 -04:00
*/
2017-10-24 08:25:02 -07:00
cond_resched ( ) ;
2013-08-28 17:33:37 -04:00
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
2010-06-29 10:07:10 +02:00
2010-06-29 10:07:15 +02:00
/* clear cpu intensive status */
if ( unlikely ( cpu_intensive ) )
worker_clr_flags ( worker , WORKER_CPU_INTENSIVE ) ;
psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.
Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again. This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)
Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep. To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.
What if the worker is also executing other items before or after?
Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.
If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation. If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 14:20:42 -08:00
/* tag the worker for identification in schedule() */
worker - > last_func = worker - > current_func ;
2010-06-29 10:07:10 +02:00
/* we're done with it, release */
2012-12-17 10:01:23 -05:00
hash_del ( & worker - > hentry ) ;
2010-06-29 10:07:11 +02:00
worker - > current_work = NULL ;
2012-12-18 10:35:02 -08:00
worker - > current_func = NULL ;
2013-02-13 19:29:12 -08:00
worker - > current_pwq = NULL ;
2021-08-17 09:32:38 +08:00
worker - > current_color = INT_MAX ;
2021-08-17 09:32:35 +08:00
pwq_dec_nr_in_flight ( pwq , work_data ) ;
2010-06-29 10:07:10 +02:00
}
2010-06-29 10:07:12 +02:00
/**
* process_scheduled_works - process scheduled works
* @ worker : self
*
* Process all scheduled works . Please note that the scheduled list
* may change while processing a work , so this function repeatedly
* fetches a work from the top and executes it .
*
* CONTEXT :
2020-05-27 21:46:33 +02:00
* raw_spin_lock_irq ( pool - > lock ) which may be released and regrabbed
2010-06-29 10:07:12 +02:00
* multiple times .
*/
static void process_scheduled_works ( struct worker * worker )
2005-04-16 15:20:36 -07:00
{
2010-06-29 10:07:12 +02:00
while ( ! list_empty ( & worker - > scheduled ) ) {
struct work_struct * work = list_first_entry ( & worker - > scheduled ,
2005-04-16 15:20:36 -07:00
struct work_struct , entry ) ;
2010-06-29 10:07:11 +02:00
process_one_work ( worker , work ) ;
2005-04-16 15:20:36 -07:00
}
}
2018-05-21 08:04:35 -07:00
static void set_pf_worker ( bool val )
{
mutex_lock ( & wq_pool_attach_mutex ) ;
if ( val )
current - > flags | = PF_WQ_WORKER ;
else
current - > flags & = ~ PF_WQ_WORKER ;
mutex_unlock ( & wq_pool_attach_mutex ) ;
}
2010-06-29 10:07:10 +02:00
/**
* worker_thread - the worker thread function
2010-06-29 10:07:11 +02:00
* @ __worker : self
2010-06-29 10:07:10 +02:00
*
2013-03-13 16:51:36 -07:00
* The worker thread function . All workers belong to a worker_pool -
* either a per - cpu one or dynamic unbound one . These workers process all
* work items regardless of their specific target workqueue . The only
* exception is work items which belong to workqueues with a rescuer which
* will be explained in rescuer_thread ( ) .
2013-07-31 14:59:24 -07:00
*
* Return : 0
2010-06-29 10:07:10 +02:00
*/
2010-06-29 10:07:11 +02:00
static int worker_thread ( void * __worker )
2005-04-16 15:20:36 -07:00
{
2010-06-29 10:07:11 +02:00
struct worker * worker = __worker ;
2012-07-12 14:46:37 -07:00
struct worker_pool * pool = worker - > pool ;
2005-04-16 15:20:36 -07:00
2010-06-29 10:07:14 +02:00
/* tell the scheduler that this is a workqueue worker */
2018-05-21 08:04:35 -07:00
set_pf_worker ( true ) ;
2010-06-29 10:07:12 +02:00
woke_up :
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
2005-04-16 15:20:36 -07:00
workqueue: directly restore CPU affinity of workers from CPU_ONLINE
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.
As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr(). Due
to the various states a worker can be in, this led to three different
paths a worker may be rebound. worker->rebind_work is queued to busy
workers. Idle ones are signaled by unlinking worker->entry and call
idle_worker_rebind(). The manager isn't covered by either and
implements its own mechanism.
PF_THREAD_BOUND has been replaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself now can manipulate CPU affinity of workers. This patch
replaces the existing rebind mechanism with a direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.
There are a couple subtleties. All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up. This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.
Another subtlety stems from matching @pool->nr_running with the
number of running unbound workers. While DISASSOCIATED, all workers
are unbound and nr_running is zero. As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals. Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution. The only change
needed for the workers is clearing REBOUND along with PREP.
* This patch leaves for_each_busy_worker() without any user. Removed.
* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
and rebind logic in manage_workers() removed.
* worker_thread() now looks at WORKER_DIE instead of testing whether
@worker->entry is empty to determine whether it needs to do
something special as dying is the only special thing now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
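The nr_running bookkeeping described in this commit message can be modelled in a few lines. This is a simplified userspace sketch under stated assumptions, not the kernel implementation: the `F_*` flag values and the `model_*` names are hypothetical, and locking is omitted. It shows why swapping UNBOUND for REBOUND leaves the counter untouched, and why the worker clearing PREP together with REBOUND is what brings it back into the count.

```c
#include <assert.h>

/* Hypothetical flag values; only the grouping into NOT_RUNNING matters. */
enum {
	F_PREP        = 1 << 0,
	F_UNBOUND     = 1 << 1,
	F_REBOUND     = 1 << 2,
	F_NOT_RUNNING = F_PREP | F_UNBOUND | F_REBOUND,
};

struct model_pool {
	int nr_running;
};

struct model_worker {
	unsigned int flags;
	struct model_pool *pool;
};

/* Setting the first NOT_RUNNING flag takes the worker out of the count. */
static void model_set_flags(struct model_worker *w, unsigned int flags)
{
	if ((flags & F_NOT_RUNNING) && !(w->flags & F_NOT_RUNNING))
		w->pool->nr_running--;
	w->flags |= flags;
}

/* Clearing the last NOT_RUNNING flag puts the worker back in the count. */
static void model_clr_flags(struct model_worker *w, unsigned int flags)
{
	unsigned int old = w->flags;

	w->flags &= ~flags;
	if ((old & F_NOT_RUNNING) && !(w->flags & F_NOT_RUNNING))
		w->pool->nr_running++;
}

/*
 * Rebind step: replace UNBOUND with REBOUND. The worker holds a
 * NOT_RUNNING flag the whole time, so nr_running is not adjusted here.
 */
static void model_rebind(struct model_worker *w)
{
	assert(w->flags & F_UNBOUND);
	w->flags |= F_REBOUND;
	w->flags &= ~F_UNBOUND;
}
```

In this model the party doing the rebind never has to know whether the worker is currently running; the worker itself adjusts the counter the next time it leaves the PREP stage.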
2013-03-19 13:45:21 -07:00
/* am I supposed to die? */
if ( unlikely ( worker - > flags & WORKER_DIE ) ) {
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2013-03-19 13:45:21 -07:00
WARN_ON_ONCE ( ! list_empty ( & worker - > entry ) ) ;
2018-05-21 08:04:35 -07:00
set_pf_worker ( false ) ;
workqueue: async worker destruction
worker destruction includes these parts of code:
adjust pool's stats
remove the worker from idle list
detach the worker from the pool
kthread_stop() to wait for the worker's task exit
free the worker struct
We can find out that there is no essential work to do after
kthread_stop(), which means destroy_worker() doesn't need to wait for
the worker's task exit, so we can remove kthread_stop() and free the
worker struct in the worker exiting path.
However, put_unbound_pool() still needs to sync all the workers'
destruction before destroying the pool; otherwise, the workers may
access the invalid pool when they are exiting.
So we also move the code of "detach the worker" to the exiting
path and let put_unbound_pool() sync with this code via
detach_completion.
The code of "detach the worker" is wrapped in a new function
"worker_detach_from_pool()" although worker_detach_from_pool() is only
called once (in worker_thread()) after this patch, but we need to wrap
it for these reasons:
1) The code of "detach the worker" is not short enough to unfold
in worker_thread().
2) the name of "worker_detach_from_pool()" is self-documenting, and we add
some comments above the function.
3) it will be shared by the rescuer in a later patch which allows the rescuer
and normal threads to use the same attach/detach framework.
The worker id is freed when detaching which happens before the worker
is fully dead, but this id of the dying worker may be re-used for a
new worker, so the dying worker's task name is changed to
"kworker/dying" to avoid two or more workers having the same name.
Since "detach the worker" is moved out from destroy_worker(),
destroy_worker() doesn't require manager_mutex, so the
"lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is
removed, and destroy_worker() is not protected by manager_mutex in
put_unbound_pool().
tj: Minor description updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 17:46:29 +08:00
set_task_comm ( worker - > task , " kworker/dying " ) ;
2021-08-04 11:50:36 +08:00
ida_free ( & pool - > worker_ida , worker - > id ) ;
2018-05-18 08:47:13 -07:00
worker_detach_from_pool ( worker ) ;
2014-05-20 17:46:29 +08:00
kfree ( worker ) ;
2013-03-19 13:45:21 -07:00
return 0 ;
2010-06-29 10:07:12 +02:00
}
2010-06-29 10:07:12 +02:00
2010-06-29 10:07:12 +02:00
worker_leave_idle ( worker ) ;
2010-06-29 10:07:12 +02:00
recheck :
2010-06-29 10:07:14 +02:00
/* no more worker necessary? */
2012-07-12 14:46:37 -07:00
if ( ! need_more_worker ( pool ) )
2010-06-29 10:07:14 +02:00
goto sleep ;
/* do we need to manage? */
2012-07-12 14:46:37 -07:00
if ( unlikely ( ! may_start_working ( pool ) ) & & manage_workers ( worker ) )
2010-06-29 10:07:14 +02:00
goto recheck ;
2010-06-29 10:07:12 +02:00
/*
* - > scheduled list can only be filled while a worker is
* preparing to process a work or actually processing it .
* Make sure nobody diddled with it while I was sleeping .
*/
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( ! list_empty ( & worker - > scheduled ) ) ;
2010-06-29 10:07:12 +02:00
2010-06-29 10:07:14 +02:00
/*
2013-03-19 13:45:21 -07:00
* Finish PREP stage . We ' re guaranteed to have at least one idle
* worker or that someone else has already assumed the manager
* role . This is where @ worker starts participating in concurrency
* management if applicable and concurrency management is restored
* after being rebound . See rebind_workers ( ) for details .
2010-06-29 10:07:14 +02:00
*/
2013-03-19 13:45:21 -07:00
worker_clr_flags ( worker , WORKER_PREP | WORKER_REBOUND ) ;
2010-06-29 10:07:14 +02:00
do {
2010-06-29 10:07:12 +02:00
struct work_struct * work =
2012-07-12 14:46:37 -07:00
list_first_entry ( & pool - > worklist ,
2010-06-29 10:07:12 +02:00
struct work_struct , entry ) ;
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements workqueue lockup
detector. It periodically monitors all worker_pools and,
if any pool fails to make forward progress for longer than the threshold
duration, triggers a warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through the kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
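The stall check described in this commit message reduces to a timestamp comparison: the pool records the time whenever it picks up work, and a periodic checker flags the pool if work is pending and the timestamp is older than the threshold. Below is a minimal sketch of that idea, assuming a plain tick counter in place of jiffies; `wd_pool`, `wd_touch` and `wd_stalled` are hypothetical names, and the real detector tracks more state than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a pool as seen by the watchdog. */
struct wd_pool {
	unsigned long watchdog_ts;	/* tick at which work was last picked up */
	int pending;			/* nonzero while work is queued */
};

/* Called whenever the pool starts on a work item. */
static void wd_touch(struct wd_pool *pool, unsigned long now)
{
	assert(pool != NULL);
	pool->watchdog_ts = now;
}

/*
 * True if work is pending and no progress was recorded for longer than
 * thresh ticks. The subtraction is done in unsigned arithmetic and then
 * interpreted as signed, so the test stays correct when the tick counter
 * wraps around, as long as the gap is under half the counter range.
 */
static bool wd_stalled(const struct wd_pool *pool, unsigned long now,
		       unsigned long thresh)
{
	assert(pool != NULL);
	if (!pool->pending)
		return false;
	return (long)(now - (pool->watchdog_ts + thresh)) > 0;
}
```

An idle pool is never reported in this model, which is why the timestamp only needs refreshing at the point where a worker takes the next item off the list.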
2015-12-08 11:28:04 -05:00
pool - > watchdog_ts = jiffies ;
2010-06-29 10:07:12 +02:00
if ( likely ( ! ( * work_data_bits ( work ) & WORK_STRUCT_LINKED ) ) ) {
/* optimization path, not strictly necessary */
process_one_work ( worker , work ) ;
if ( unlikely ( ! list_empty ( & worker - > scheduled ) ) )
2010-06-29 10:07:12 +02:00
process_scheduled_works ( worker ) ;
2010-06-29 10:07:12 +02:00
} else {
move_linked_works ( work , & worker - > scheduled , NULL ) ;
process_scheduled_works ( worker ) ;
2010-06-29 10:07:12 +02:00
}
2012-07-12 14:46:37 -07:00
} while ( keep_working ( pool ) ) ;
2010-06-29 10:07:14 +02:00
2014-07-22 13:02:00 +08:00
worker_set_flags ( worker , WORKER_PREP ) ;
2010-07-02 10:03:51 +02:00
sleep :
2010-06-29 10:07:12 +02:00
/*
2013-01-24 11:01:33 -08:00
* pool - > lock is held and there ' s no work to process and no need to
* manage , sleep . Workers are woken up only while holding
* pool - > lock or from local cpu , so setting the current state
* before releasing pool - > lock is enough to prevent losing any
* event .
2010-06-29 10:07:12 +02:00
*/
worker_enter_idle ( worker ) ;
2017-08-23 13:58:44 +02:00
__set_current_state ( TASK_IDLE ) ;
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2010-06-29 10:07:12 +02:00
schedule ( ) ;
goto woke_up ;
2005-04-16 15:20:36 -07:00
}
2010-06-29 10:07:14 +02:00
/**
* rescuer_thread - the rescuer thread function
2013-01-17 17:16:24 -08:00
* @ __rescuer : self
2010-06-29 10:07:14 +02:00
*
* Workqueue rescuer thread function . There ' s one rescuer for each
2013-03-12 11:30:03 -07:00
* workqueue which has WQ_MEM_RECLAIM set .
2010-06-29 10:07:14 +02:00
*
2013-01-24 11:01:34 -08:00
* Regular work processing on a pool may block trying to create a new
2010-06-29 10:07:14 +02:00
* worker which uses GFP_KERNEL allocation which has slight chance of
* developing into deadlock if some works currently on the same queue
* need to be processed to satisfy the GFP_KERNEL allocation . This is
* the problem rescuer solves .
*
2013-01-24 11:01:34 -08:00
* When such condition is possible , the pool summons rescuers of all
* workqueues which have works queued on the pool and let them process
2010-06-29 10:07:14 +02:00
* those works so that forward progress can be guaranteed .
*
* This should happen rarely .
2013-07-31 14:59:24 -07:00
*
* Return : 0
2010-06-29 10:07:14 +02:00
*/
2013-01-17 17:16:24 -08:00
static int rescuer_thread ( void * __rescuer )
2010-06-29 10:07:14 +02:00
{
2013-01-17 17:16:24 -08:00
struct worker * rescuer = __rescuer ;
struct workqueue_struct * wq = rescuer - > rescue_wq ;
2010-06-29 10:07:14 +02:00
struct list_head * scheduled = & rescuer - > scheduled ;
2014-04-18 11:04:16 -04:00
bool should_stop ;
2010-06-29 10:07:14 +02:00
set_user_nice ( current , RESCUER_NICE_LEVEL ) ;
2013-01-17 17:16:24 -08:00
/*
* Mark rescuer as worker too . As WORKER_PREP is never cleared , it
* doesn ' t participate in concurrency management .
*/
2018-05-21 08:04:35 -07:00
set_pf_worker ( true ) ;
2010-06-29 10:07:14 +02:00
repeat :
2017-08-23 13:58:44 +02:00
set_current_state ( TASK_IDLE ) ;
2010-06-29 10:07:14 +02:00
2014-04-18 11:04:16 -04:00
/*
* By the time the rescuer is requested to stop , the workqueue
* shouldn ' t have any work pending , but @ wq - > maydays may still have
* pwq ( s ) queued . This can happen by non - rescuer workers consuming
* all the work items before the rescuer got to them . Go through
* @ wq - > maydays processing before acting on should_stop so that the
* list is always empty on exit .
*/
should_stop = kthread_should_stop ( ) ;
2010-06-29 10:07:14 +02:00
2013-03-12 11:29:59 -07:00
/* see whether any pwq is asking for help */
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & wq_mayday_lock ) ;
2013-03-12 11:29:59 -07:00
while ( ! list_empty ( & wq - > maydays ) ) {
struct pool_workqueue * pwq = list_first_entry ( & wq - > maydays ,
struct pool_workqueue , mayday_node ) ;
2013-02-13 19:29:12 -08:00
struct worker_pool * pool = pwq - > pool ;
2010-06-29 10:07:14 +02:00
struct work_struct * work , * n ;
2015-12-08 11:28:04 -05:00
bool first = true ;
2010-06-29 10:07:14 +02:00
__set_current_state ( TASK_RUNNING ) ;
2013-03-12 11:29:59 -07:00
list_del_init ( & pwq - > mayday_node ) ;
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & wq_mayday_lock ) ;
2010-06-29 10:07:14 +02:00
2014-05-20 17:46:36 +08:00
worker_attach_to_pool ( rescuer , pool ) ;
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
2010-06-29 10:07:14 +02:00
/*
* Slurp in all works issued via this workqueue and
* process ' em .
*/
2014-12-04 10:14:13 -05:00
WARN_ON_ONCE ( ! list_empty ( scheduled ) ) ;
2015-12-08 11:28:04 -05:00
list_for_each_entry_safe ( work , n , & pool - > worklist , entry ) {
if ( get_work_pwq ( work ) = = pwq ) {
if ( first )
pool - > watchdog_ts = jiffies ;
2010-06-29 10:07:14 +02:00
move_linked_works ( work , scheduled , & n ) ;
2015-12-08 11:28:04 -05:00
}
first = false ;
}
2010-06-29 10:07:14 +02:00
2014-12-08 12:39:16 -05:00
if ( ! list_empty ( scheduled ) ) {
process_scheduled_works ( rescuer ) ;
/*
* The above execution of rescued work items could
* have created more to rescue through
2021-08-17 09:32:34 +08:00
* pwq_activate_first_inactive ( ) or chained
2014-12-08 12:39:16 -05:00
* queueing . Let ' s put @ pwq back on mayday list so
* that such back - to - back work items , which may be
* being used to relieve memory pressure , don ' t
* incur MAYDAY_INTERVAL delay in between .
*/
2020-05-29 06:58:59 +00:00
if ( pwq - > nr_active & & need_to_create_worker ( pool ) ) {
2020-05-27 21:46:33 +02:00
raw_spin_lock ( & wq_mayday_lock ) ;
2019-09-25 06:59:15 -07:00
/*
* Queue iff we aren ' t racing destruction
* and somebody else hasn ' t queued it already .
*/
if ( wq - > rescuer & & list_empty ( & pwq - > mayday_node ) ) {
get_pwq ( pwq ) ;
list_add_tail ( & pwq - > mayday_node , & wq - > maydays ) ;
}
2020-05-27 21:46:33 +02:00
raw_spin_unlock ( & wq_mayday_lock ) ;
2014-12-08 12:39:16 -05:00
}
}
2011-02-14 14:04:46 +01:00
2014-04-18 11:04:16 -04:00
/*
* Put the reference grabbed by send_mayday ( ) . @ pool won ' t
2014-07-22 13:03:47 +08:00
* go away while we ' re still attached to it .
2014-04-18 11:04:16 -04:00
*/
put_pwq ( pwq ) ;
2011-02-14 14:04:46 +01:00
/*
2014-07-16 14:56:36 +08:00
* Leave this pool . If need_more_worker ( ) is % true , notify a
2011-02-14 14:04:46 +01:00
* regular worker ; otherwise , we end up with 0 concurrency
* and stalling the execution .
*/
2014-07-16 14:56:36 +08:00
if ( need_more_worker ( pool ) )
2012-07-12 14:46:37 -07:00
wake_up_worker ( pool ) ;
2011-02-14 14:04:46 +01:00
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2014-07-22 13:03:47 +08:00
2018-05-18 08:47:13 -07:00
worker_detach_from_pool ( rescuer ) ;
2014-07-22 13:03:47 +08:00
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & wq_mayday_lock ) ;
2010-06-29 10:07:14 +02:00
}
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & wq_mayday_lock ) ;
2013-03-12 11:29:59 -07:00
2014-04-18 11:04:16 -04:00
if ( should_stop ) {
__set_current_state ( TASK_RUNNING ) ;
2018-05-21 08:04:35 -07:00
set_pf_worker ( false ) ;
2014-04-18 11:04:16 -04:00
return 0 ;
}
2013-01-17 17:16:24 -08:00
/* rescuers should never participate in concurrency management */
WARN_ON_ONCE ( ! ( rescuer - > flags & WORKER_NOT_RUNNING ) ) ;
2010-06-29 10:07:14 +02:00
schedule ( ) ;
goto repeat ;
2005-04-16 15:20:36 -07:00
}
2015-12-07 10:58:57 -05:00
/**
* check_flush_dependency - check for flush dependency sanity
* @ target_wq : workqueue being flushed
* @ target_work : work item being flushed ( NULL for workqueue flushes )
*
* % current is trying to flush the whole @ target_wq or @ target_work on it .
* If @ target_wq doesn ' t have % WQ_MEM_RECLAIM , verify that % current is not
* reclaiming memory or running on a workqueue which doesn ' t have
* % WQ_MEM_RECLAIM as that can break forward - progress guarantee leading to
* a deadlock .
*/
static void check_flush_dependency ( struct workqueue_struct * target_wq ,
struct work_struct * target_work )
{
work_func_t target_func = target_work ? target_work - > func : NULL ;
struct worker * worker ;
if ( target_wq - > flags & WQ_MEM_RECLAIM )
return ;
worker = current_wq_worker ( ) ;
WARN_ONCE ( current - > flags & PF_MEMALLOC ,
2019-03-25 21:32:28 +02:00
" workqueue: PF_MEMALLOC task %d(%s) is flushing !WQ_MEM_RECLAIM %s:%ps " ,
2015-12-07 10:58:57 -05:00
current - > pid , current - > comm , target_wq - > name , target_func ) ;
2016-01-29 05:59:46 -05:00
WARN_ONCE ( worker & & ( ( worker - > current_pwq - > wq - > flags &
( WQ_MEM_RECLAIM | __WQ_LEGACY ) ) = = WQ_MEM_RECLAIM ) ,
2019-03-25 21:32:28 +02:00
" workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s:%ps " ,
2015-12-07 10:58:57 -05:00
worker - > current_pwq - > wq - > name , worker - > current_func ,
target_wq - > name , target_func ) ;
}
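The worker-side warning in check_flush_dependency ( ) depends on a masked comparison: the flushing workqueue must have the reclaim flag set and the legacy flag clear, while the target lacks the reclaim flag. The following is a minimal userspace sketch of that predicate only. The flag values and the helper name demo_flush_is_suspicious are hypothetical and chosen for illustration; the PF_MEMALLOC check is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel defines its own in workqueue.h. */
#define DEMO_WQ_MEM_RECLAIM (1u << 0)
#define DEMO_WQ_LEGACY      (1u << 1)

/*
 * Returns true when a worker running on a workqueue with @flusher_flags
 * flushing a workqueue with @target_flags would trip the second warning.
 */
bool demo_flush_is_suspicious(unsigned int flusher_flags,
			      unsigned int target_flags)
{
	/* A reclaim-capable target keeps the forward-progress guarantee. */
	if (target_flags & DEMO_WQ_MEM_RECLAIM)
		return false;

	/* Reclaim flag set and legacy flag clear, tested in one comparison. */
	return (flusher_flags & (DEMO_WQ_MEM_RECLAIM | DEMO_WQ_LEGACY)) ==
	       DEMO_WQ_MEM_RECLAIM;
}
```

Masking with both bits and comparing against one of them tests "reclaim set and legacy clear" in a single expression, which is why legacy workqueues are exempt from the warning.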
2007-05-09 02:33:51 -07:00
struct wq_barrier {
struct work_struct work ;
struct completion done ;
2015-03-09 09:22:28 -04:00
struct task_struct * task ; /* purely informational */
2007-05-09 02:33:51 -07:00
} ;
static void wq_barrier_func ( struct work_struct * work )
{
struct wq_barrier * barr = container_of ( work , struct wq_barrier , work ) ;
complete ( & barr - > done ) ;
}
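wq_barrier_func ( ) receives only a pointer to the embedded work member and recovers the enclosing barrier with container_of ( ). The sketch below reproduces that pattern in userspace. The container_of macro here is a simplified stand-in without the kernel's type checking, and the demo_work / demo_barrier types are hypothetical miniatures, not the kernel structures.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel macro; no type checking is done. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature of a work item: just a callback. */
struct demo_work {
	void (*func)(struct demo_work *work);
};

/* Hypothetical miniature of a barrier with the work item embedded. */
struct demo_barrier {
	int done;              /* stands in for struct completion */
	struct demo_work work; /* deliberately not the first member */
};

/* Recover the enclosing barrier from the member pointer and signal it. */
void demo_barrier_func(struct demo_work *work)
{
	struct demo_barrier *barr =
		container_of(work, struct demo_barrier, work);

	barr->done = 1;
}

/* Invoke the callback through the embedded member, as a worker would. */
int demo_run_barrier(void)
{
	struct demo_barrier barr = {
		.done = 0,
		.work = { .func = demo_barrier_func },
	};

	barr.work.func(&barr.work);
	return barr.done;
}
```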
2010-06-29 10:07:10 +02:00
/**
* insert_wq_barrier - insert a barrier work
2013-02-13 19:29:12 -08:00
* @ pwq : pwq to insert barrier into
2010-06-29 10:07:10 +02:00
* @ barr : wq_barrier to insert
2010-06-29 10:07:12 +02:00
* @ target : target work to attach @ barr to
* @ worker : worker currently executing @ target , NULL if @ target is not executing
2010-06-29 10:07:10 +02:00
*
2010-06-29 10:07:12 +02:00
* @ barr is linked to @ target such that @ barr is completed only after
* @ target finishes execution . Please note that the ordering
* guarantee is observed only with respect to @ target and on the local
* cpu .
*
* Currently , a queued barrier can ' t be canceled . This is because
* try_to_grab_pending ( ) can ' t determine whether the work to be
* grabbed is at the head of the queue and thus can ' t clear LINKED
* flag of the previous work while there must be a valid next work
* after a work with LINKED flag set .
*
* Note that when @ worker is non - NULL , @ target may be modified
2013-02-13 19:29:12 -08:00
* underneath us , so we can ' t reliably determine pwq from @ target .
2010-06-29 10:07:10 +02:00
*
* CONTEXT :
2020-05-27 21:46:33 +02:00
* raw_spin_lock_irq ( pool - > lock ) .
2010-06-29 10:07:10 +02:00
*/
2013-02-13 19:29:12 -08:00
static void insert_wq_barrier ( struct pool_workqueue * pwq ,
2010-06-29 10:07:12 +02:00
struct wq_barrier * barr ,
struct work_struct * target , struct worker * worker )
2007-05-09 02:33:51 -07:00
{
2021-08-17 09:32:38 +08:00
unsigned int work_flags = 0 ;
unsigned int work_color ;
2010-06-29 10:07:12 +02:00
struct list_head * head ;
2009-11-16 01:09:48 +09:00
/*
2013-01-24 11:01:33 -08:00
* debugobject calls are safe here even with pool - > lock locked
2009-11-16 01:09:48 +09:00
* as we know for sure that this will not trigger any of the
* checks and call back into the fixup functions where we
* might deadlock .
*/
2010-10-26 14:22:34 -07:00
INIT_WORK_ONSTACK ( & barr - > work , wq_barrier_func ) ;
2010-06-29 10:07:10 +02:00
__set_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( & barr - > work ) ) ;
locking/lockdep: Explicitly initialize wq_barrier::done::map
With the new lockdep crossrelease feature, which checks completions usage,
a false positive is reported in the workqueue code:
> Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released
> Task B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released
> Task C : acquired of lock#3 -> wait for completion of barr->done
> (Task C is in lru_add_drain_all_cpuslocked())
> Worker D : wait for wfc.work to be released -> will complete barr->done
Such a deadlock cannot happen because Task C's barr->done and Worker D's
barr->done cannot be the same instance.
The reason for this false positive is that we initialize all wq_barrier::done
at insert_wq_barrier() via init_completion(), which makes them belong to
the same lock class; therefore, impossible cycles are reported.
To fix this, explicitly initialize the lockdep map for wq_barrier::done
in insert_wq_barrier(), so that the lock class key of wq_barrier::done
is a subkey of the corresponding work_struct. As a result we won't build
a dependency between a wq_barrier and an unrelated work, and we can
differentiate wq barriers based on the related works, so the false positive
above is avoided.
Also define the empty lockdep_init_map_crosslock() for !CROSSRELEASE
to make the code simple and away from unnecessary #ifdefs.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 17:46:12 +08:00
2017-10-25 17:56:04 +09:00
init_completion_map ( & barr - > done , & target - > lockdep_map ) ;
2015-03-09 09:22:28 -04:00
barr - > task = current ;
2007-05-09 02:33:54 -07:00
2021-08-17 09:32:37 +08:00
/* The barrier work item does not participate in pwq->nr_active. */
work_flags | = WORK_STRUCT_INACTIVE ;
2010-06-29 10:07:12 +02:00
/*
* If @ target is currently being executed , schedule the
* barrier to the worker ; otherwise , put it after @ target .
*/
2021-08-17 09:32:38 +08:00
if ( worker ) {
2010-06-29 10:07:12 +02:00
head = worker - > scheduled . next ;
2021-08-17 09:32:38 +08:00
work_color = worker - > current_color ;
} else {
2010-06-29 10:07:12 +02:00
unsigned long * bits = work_data_bits ( target ) ;
head = target - > entry . next ;
/* there can already be other linked works, inherit and set */
2021-08-17 09:32:36 +08:00
work_flags | = * bits & WORK_STRUCT_LINKED ;
2021-08-17 09:32:38 +08:00
work_color = get_work_color ( * bits ) ;
2010-06-29 10:07:12 +02:00
__set_bit ( WORK_STRUCT_LINKED_BIT , bits ) ;
}
2021-08-17 09:32:38 +08:00
pwq - > nr_in_flight [ work_color ] + + ;
work_flags | = work_color_to_flags ( work_color ) ;
2009-11-16 01:09:48 +09:00
debug_work_activate ( & barr - > work ) ;
2021-08-17 09:32:36 +08:00
insert_work ( pwq , & barr - > work , head , work_flags ) ;
2007-05-09 02:33:51 -07:00
}
2010-06-29 10:07:11 +02:00
/**
2013-02-13 19:29:12 -08:00
* flush_workqueue_prep_pwqs - prepare pwqs for workqueue flushing
2010-06-29 10:07:11 +02:00
* @ wq : workqueue being flushed
* @ flush_color : new flush color , < 0 for no - op
* @ work_color : new work color , < 0 for no - op
*
2013-02-13 19:29:12 -08:00
* Prepare pwqs for workqueue flushing .
2010-06-29 10:07:11 +02:00
*
2013-02-13 19:29:12 -08:00
* If @ flush_color is non - negative , flush_color on all pwqs should be
* - 1. If no pwq has in - flight commands at the specified color , all
* pwq - > flush_color ' s stay at - 1 and % false is returned . If any pwq
* has in flight commands , its pwq - > flush_color is set to
* @ flush_color , @ wq - > nr_pwqs_to_flush is updated accordingly , pwq
2010-06-29 10:07:11 +02:00
* wakeup logic is armed and % true is returned .
*
* The caller should have initialized @ wq - > first_flusher prior to
* calling this function with non - negative @ flush_color . If
* @ flush_color is negative , no flush color update is done and % false
* is returned .
*
2013-02-13 19:29:12 -08:00
* If @ work_color is non - negative , all pwqs should have the same
2010-06-29 10:07:11 +02:00
* work_color which is previous to @ work_color and all will be
* advanced to @ work_color .
*
* CONTEXT :
2013-03-25 16:57:17 -07:00
* mutex_lock ( wq - > mutex ) .
2010-06-29 10:07:11 +02:00
*
2013-07-31 14:59:24 -07:00
* Return :
2010-06-29 10:07:11 +02:00
* % true if @ flush_color > = 0 and there ' s something to flush . % false
* otherwise .
*/
2013-02-13 19:29:12 -08:00
static bool flush_workqueue_prep_pwqs ( struct workqueue_struct * wq ,
2010-06-29 10:07:11 +02:00
int flush_color , int work_color )
2005-04-16 15:20:36 -07:00
{
2010-06-29 10:07:11 +02:00
bool wait = false ;
2013-03-12 11:29:58 -07:00
struct pool_workqueue * pwq ;
2005-04-16 15:20:36 -07:00
2010-06-29 10:07:11 +02:00
if ( flush_color > = 0 ) {
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( atomic_read ( & wq - > nr_pwqs_to_flush ) ) ;
2013-02-13 19:29:12 -08:00
atomic_set ( & wq - > nr_pwqs_to_flush , 1 ) ;
2005-04-16 15:20:36 -07:00
}
2009-04-02 16:58:24 -07:00
2013-03-12 11:29:58 -07:00
for_each_pwq ( pwq , wq ) {
2013-02-13 19:29:12 -08:00
struct worker_pool * pool = pwq - > pool ;
2007-05-09 02:33:51 -07:00
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
2007-05-09 02:33:54 -07:00
2010-06-29 10:07:11 +02:00
if ( flush_color > = 0 ) {
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( pwq - > flush_color ! = - 1 ) ;
2007-05-09 02:33:51 -07:00
2013-02-13 19:29:12 -08:00
if ( pwq - > nr_in_flight [ flush_color ] ) {
pwq - > flush_color = flush_color ;
atomic_inc ( & wq - > nr_pwqs_to_flush ) ;
2010-06-29 10:07:11 +02:00
wait = true ;
}
}
2005-04-16 15:20:36 -07:00
2010-06-29 10:07:11 +02:00
if ( work_color > = 0 ) {
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( work_color ! = work_next_color ( pwq - > work_color ) ) ;
2013-02-13 19:29:12 -08:00
pwq - > work_color = work_color ;
2010-06-29 10:07:11 +02:00
}
2005-04-16 15:20:36 -07:00
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2005-04-16 15:20:36 -07:00
}
2009-04-02 16:58:24 -07:00
2013-02-13 19:29:12 -08:00
if ( flush_color > = 0 & & atomic_dec_and_test ( & wq - > nr_pwqs_to_flush ) )
2010-06-29 10:07:11 +02:00
complete ( & wq - > first_flusher - > done ) ;
2007-05-23 13:57:57 -07:00
2010-06-29 10:07:11 +02:00
return wait ;
2005-04-16 15:20:36 -07:00
}
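flush_workqueue_prep_pwqs ( ) starts nr_pwqs_to_flush at 1 rather than 0, so the count cannot reach zero while the pwqs are still being scanned; the final atomic_dec_and_test ( ) drops that bias. The sketch below shows the same biased-counter pattern in userspace with C11 atomics. All names (demo_flush, demo_prep, demo_pwq_done) are hypothetical, a plain bool stands in for the completion, and locking is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_flush {
	atomic_int nr_to_flush; /* plays the role of wq->nr_pwqs_to_flush */
	bool completed;         /* stands in for the flusher's completion */
};

/* Drop one reference; whoever brings the count to zero completes. */
static void demo_put(struct demo_flush *f)
{
	if (atomic_fetch_sub(&f->nr_to_flush, 1) == 1)
		f->completed = true;
}

/*
 * Arm the flush. @in_flight[i] is the number of in-flight items of the
 * flush color on queue i. Returns true if anything must be waited for.
 */
bool demo_prep(struct demo_flush *f, const int *in_flight, int nr_pwqs)
{
	bool wait = false;

	f->completed = false;
	atomic_store(&f->nr_to_flush, 1); /* bias held during the scan */

	for (int i = 0; i < nr_pwqs; i++) {
		if (in_flight[i]) {
			atomic_fetch_add(&f->nr_to_flush, 1);
			wait = true;
		}
	}

	demo_put(f); /* drop the bias; completes at once if nothing counted */
	return wait;
}

/* Called when one queue retires its last item of the flush color. */
void demo_pwq_done(struct demo_flush *f)
{
	demo_put(f);
}
```

Without the initial bias, a queue that finished during the scan could drive the counter to zero and complete the flusher before the remaining queues had been counted.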
2006-07-30 03:03:42 -07:00
/**
2022-06-01 16:32:47 +09:00
* __flush_workqueue - ensure that any scheduled work has run to completion .
2006-07-30 03:03:42 -07:00
* @ wq : workqueue to flush
2005-04-16 15:20:36 -07:00
*
2013-03-13 16:51:36 -07:00
* This function sleeps until all work items which were queued on entry
* have finished execution , but it is not livelocked by new incoming ones .
2005-04-16 15:20:36 -07:00
*/
2022-06-01 16:32:47 +09:00
void __flush_workqueue ( struct workqueue_struct * wq )
2005-04-16 15:20:36 -07:00
{
2010-06-29 10:07:11 +02:00
struct wq_flusher this_flusher = {
. list = LIST_HEAD_INIT ( this_flusher . list ) ,
. flush_color = - 1 ,
2017-10-25 17:56:04 +09:00
. done = COMPLETION_INITIALIZER_ONSTACK_MAP ( this_flusher . done , wq - > lockdep_map ) ,
2010-06-29 10:07:11 +02:00
} ;
int next_color ;
2005-04-16 15:20:36 -07:00
2016-09-16 15:49:32 -04:00
if ( WARN_ON ( ! wq_online ) )
return ;
2018-08-22 11:49:04 +02:00
lock_map_acquire ( & wq - > lockdep_map ) ;
lock_map_release ( & wq - > lockdep_map ) ;
2013-03-25 16:57:17 -07:00
mutex_lock ( & wq - > mutex ) ;
2010-06-29 10:07:11 +02:00
/*
* Start - to - wait phase
*/
next_color = work_next_color ( wq - > work_color ) ;
if ( next_color ! = wq - > flush_color ) {
/*
* Color space is not full . The current work_color
* becomes our flush_color and work_color is advanced
* by one .
*/
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( ! list_empty ( & wq - > flusher_overflow ) ) ;
2010-06-29 10:07:11 +02:00
this_flusher . flush_color = wq - > work_color ;
wq - > work_color = next_color ;
if ( ! wq - > first_flusher ) {
/* no flush in progress, become the first flusher */
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( wq - > flush_color ! = this_flusher . flush_color ) ;
2010-06-29 10:07:11 +02:00
wq - > first_flusher = & this_flusher ;
2013-02-13 19:29:12 -08:00
if ( ! flush_workqueue_prep_pwqs ( wq , wq - > flush_color ,
2010-06-29 10:07:11 +02:00
wq - > work_color ) ) {
/* nothing to flush, done */
wq - > flush_color = next_color ;
wq - > first_flusher = NULL ;
goto out_unlock ;
}
} else {
/* wait in queue */
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( wq - > flush_color = = this_flusher . flush_color ) ;
2010-06-29 10:07:11 +02:00
list_add_tail ( & this_flusher . list , & wq - > flusher_queue ) ;
2013-02-13 19:29:12 -08:00
flush_workqueue_prep_pwqs ( wq , - 1 , wq - > work_color ) ;
2010-06-29 10:07:11 +02:00
}
} else {
/*
* Oops , color space is full , wait on overflow queue .
* The next flush completion will assign us
* flush_color and transfer to flusher_queue .
*/
list_add_tail ( & this_flusher . list , & wq - > flusher_overflow ) ;
}
2015-12-07 10:58:57 -05:00
check_flush_dependency ( wq , NULL ) ;
2013-03-25 16:57:17 -07:00
mutex_unlock ( & wq - > mutex ) ;
2010-06-29 10:07:11 +02:00
wait_for_completion ( & this_flusher . done ) ;
/*
* Wake - up - and - cascade phase
*
* First flushers are responsible for cascading flushes and
* handling overflow . Non - first flushers can simply return .
*/
2020-03-10 16:23:19 +00:00
if ( READ_ONCE ( wq - > first_flusher ) ! = & this_flusher )
2010-06-29 10:07:11 +02:00
return ;
2013-03-25 16:57:17 -07:00
mutex_lock ( & wq - > mutex ) ;
2010-06-29 10:07:11 +02:00
2010-07-02 10:03:51 +02:00
/* we might have raced, check again with mutex held */
if ( wq - > first_flusher ! = & this_flusher )
goto out_unlock ;
2020-03-10 16:23:19 +00:00
WRITE_ONCE ( wq - > first_flusher , NULL ) ;
2010-06-29 10:07:11 +02:00
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( ! list_empty ( & this_flusher . list ) ) ;
WARN_ON_ONCE ( wq - > flush_color ! = this_flusher . flush_color ) ;
2010-06-29 10:07:11 +02:00
while ( true ) {
struct wq_flusher * next , * tmp ;
/* complete all the flushers sharing the current flush color */
list_for_each_entry_safe ( next , tmp , & wq - > flusher_queue , list ) {
if ( next - > flush_color ! = wq - > flush_color )
break ;
list_del_init ( & next - > list ) ;
complete ( & next - > done ) ;
}
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( ! list_empty ( & wq - > flusher_overflow ) & &
wq - > flush_color ! = work_next_color ( wq - > work_color ) ) ;
2010-06-29 10:07:11 +02:00
/* this flush_color is finished, advance by one */
wq - > flush_color = work_next_color ( wq - > flush_color ) ;
/* one color has been freed, handle overflow queue */
if ( ! list_empty ( & wq - > flusher_overflow ) ) {
/*
* Assign the same color to all overflowed
* flushers , advance work_color and append to
* flusher_queue . This is the start - to - wait
* phase for these overflowed flushers .
*/
list_for_each_entry ( tmp , & wq - > flusher_overflow , list )
tmp - > flush_color = wq - > work_color ;
wq - > work_color = work_next_color ( wq - > work_color ) ;
list_splice_tail_init ( & wq - > flusher_overflow ,
& wq - > flusher_queue ) ;
2013-02-13 19:29:12 -08:00
flush_workqueue_prep_pwqs ( wq , - 1 , wq - > work_color ) ;
2010-06-29 10:07:11 +02:00
}
if ( list_empty ( & wq - > flusher_queue ) ) {
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( wq - > flush_color ! = wq - > work_color ) ;
2010-06-29 10:07:11 +02:00
break ;
}
/*
* Need to flush more colors . Make the next flusher
2013-02-13 19:29:12 -08:00
* the new first flusher and arm pwqs .
2010-06-29 10:07:11 +02:00
*/
2013-03-12 11:29:57 -07:00
WARN_ON_ONCE ( wq - > flush_color = = wq - > work_color ) ;
WARN_ON_ONCE ( wq - > flush_color ! = next - > flush_color ) ;
2010-06-29 10:07:11 +02:00
list_del_init ( & next - > list ) ;
wq - > first_flusher = next ;
2013-02-13 19:29:12 -08:00
if ( flush_workqueue_prep_pwqs ( wq , wq - > flush_color , - 1 ) )
2010-06-29 10:07:11 +02:00
break ;
/*
* Meh . . . this color is already done , clear first
* flusher and repeat cascading .
*/
wq - > first_flusher = NULL ;
}
out_unlock :
2013-03-25 16:57:17 -07:00
mutex_unlock ( & wq - > mutex ) ;
2005-04-16 15:20:36 -07:00
}
2022-06-01 16:32:47 +09:00
EXPORT_SYMBOL ( __flush_workqueue ) ;
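The start-to-wait phase of __flush_workqueue ( ) treats colors as a ring: a flusher takes the current work_color unless advancing it would collide with flush_color, in which case the color space is full and the flusher goes to the overflow list. The sketch below models just that bookkeeping. It assumes that work_next_color ( ) is a modular increment, and it uses a deliberately small, made-up ring size (DEMO_NR_COLORS) instead of the kernel's constant; the demo_* names are hypothetical.

```c
#include <assert.h>

#define DEMO_NR_COLORS 4 /* illustrative ring size, not the kernel value */

struct demo_wq {
	int work_color;  /* color given to newly queued work */
	int flush_color; /* oldest color still being flushed */
};

/* Assumed behaviour of work_next_color(): modular increment. */
static int demo_next_color(int color)
{
	return (color + 1) % DEMO_NR_COLORS;
}

/*
 * Start a flush. Returns the flush color assigned to the caller, or -1
 * when the color space is full and the caller would have to overflow.
 */
int demo_start_flush(struct demo_wq *wq)
{
	int next = demo_next_color(wq->work_color);
	int mine;

	if (next == wq->flush_color)
		return -1;

	mine = wq->work_color;
	wq->work_color = next;
	return mine;
}

/* One flush color has finished; free it for reuse. */
void demo_finish_color(struct demo_wq *wq)
{
	wq->flush_color = demo_next_color(wq->flush_color);
}
```

With a ring of N colors, at most N - 1 flush colors can be outstanding at once, because one slot is always reserved to distinguish a full ring from an empty one.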
2005-04-16 15:20:36 -07:00
2011-04-05 18:01:44 +02:00
/**
* drain_workqueue - drain a workqueue
* @ wq : workqueue to drain
*
* Wait until the workqueue becomes empty . While draining is in progress ,
* only chain queueing is allowed . IOW , only currently pending or running
* work items on @ wq can queue further work items on it . @ wq is flushed
2015-05-13 06:10:05 -04:00
* repeatedly until it becomes empty . The number of flushes is determined
2011-04-05 18:01:44 +02:00
* by the depth of chaining and should be relatively short . Whine if it
* takes too long .
*/
void drain_workqueue ( struct workqueue_struct * wq )
{
unsigned int flush_cnt = 0 ;
2013-03-12 11:29:58 -07:00
struct pool_workqueue * pwq ;
2011-04-05 18:01:44 +02:00
/*
* __queue_work ( ) needs to test whether there are drainers , is much
* hotter than drain_workqueue ( ) and already looks at @ wq - > flags .
2013-03-12 11:30:04 -07:00
* Use __WQ_DRAINING so that queue doesn ' t have to check nr_drainers .
2011-04-05 18:01:44 +02:00
*/
2013-03-25 16:57:18 -07:00
mutex_lock ( & wq - > mutex ) ;
2011-04-05 18:01:44 +02:00
if ( ! wq - > nr_drainers + + )
2013-03-12 11:30:04 -07:00
wq - > flags | = __WQ_DRAINING ;
2013-03-25 16:57:18 -07:00
mutex_unlock ( & wq - > mutex ) ;
2011-04-05 18:01:44 +02:00
reflush :
2022-06-01 16:32:47 +09:00
__flush_workqueue ( wq ) ;
2011-04-05 18:01:44 +02:00
2013-03-25 16:57:18 -07:00
mutex_lock ( & wq - > mutex ) ;
2013-03-12 11:30:00 -07:00
2013-03-12 11:29:58 -07:00
for_each_pwq ( pwq , wq ) {
2011-09-14 16:22:28 -07:00
bool drained ;
2011-04-05 18:01:44 +02:00
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pwq - > pool - > lock ) ;
2021-08-17 09:32:34 +08:00
drained = ! pwq - > nr_active & & list_empty ( & pwq - > inactive_works ) ;
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pwq - > pool - > lock ) ;
2011-09-14 16:22:28 -07:00
if ( drained )
2011-04-05 18:01:44 +02:00
continue ;
if ( + + flush_cnt = = 10 | |
( flush_cnt % 100 = = 0 & & flush_cnt < = 1000 ) )
2021-01-23 16:04:00 +08:00
pr_warn ( " workqueue %s: %s() isn't complete after %u tries \n " ,
wq - > name , __func__ , flush_cnt ) ;
2013-03-12 11:30:00 -07:00
2013-03-25 16:57:18 -07:00
mutex_unlock ( & wq - > mutex ) ;
2011-04-05 18:01:44 +02:00
goto reflush ;
}
if ( ! - - wq - > nr_drainers )
2013-03-12 11:30:04 -07:00
wq - > flags & = ~ __WQ_DRAINING ;
2013-03-25 16:57:18 -07:00
mutex_unlock ( & wq - > mutex ) ;
2011-04-05 18:01:44 +02:00
}
EXPORT_SYMBOL_GPL ( drain_workqueue ) ;
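drain_workqueue ( ) rate-limits its warning: it fires on the 10th reflush, then on every 100th reflush up to and including the 1000th, and stays silent afterwards. The predicate can be lifted into a small userspace sketch; the helper names demo_should_warn and demo_count_warnings are hypothetical, and @flush_cnt is the value after the pre-increment in the original condition.

```c
#include <assert.h>
#include <stdbool.h>

/* Same condition as in drain_workqueue(), applied to the incremented count. */
bool demo_should_warn(unsigned int flush_cnt)
{
	return flush_cnt == 10 ||
	       (flush_cnt % 100 == 0 && flush_cnt <= 1000);
}

/* Count how many warnings a drain needing @tries reflushes would print. */
unsigned int demo_count_warnings(unsigned int tries)
{
	unsigned int n = 0;

	for (unsigned int cnt = 1; cnt <= tries; cnt++)
		if (demo_should_warn(cnt))
			n++;
	return n;
}
```

Under this condition a drain that never finishes prints at most eleven warnings: one at 10 tries and ten more at 100, 200, ..., 1000.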
workqueue: skip lockdep wq dependency in cancel_work_sync()
In cancel_work_sync(), we can only have one of two cases, even
with an ordered workqueue:
* the work isn't running, just cancelled before it started
* the work is running, but then nothing else can be on the
workqueue before it
Thus, we need to skip the lockdep workqueue dependency handling,
otherwise we get false positive reports from lockdep saying that
we have a potential deadlock when the workqueue also has other
work items with locking, e.g.
work1_function() { mutex_lock(&mutex); ... }
work2_function() { /* nothing */ }
other_function() {
queue_work(ordered_wq, &work1);
queue_work(ordered_wq, &work2);
mutex_lock(&mutex);
cancel_work_sync(&work2);
}
As described above, this isn't a problem, but lockdep will
currently flag it as if cancel_work_sync() was flush_work(),
which *is* a problem.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-08-22 11:49:03 +02:00
static bool start_flush_work ( struct work_struct * work , struct wq_barrier * barr ,
bool from_cancel )
workqueues: implement flush_work()
Most users of flush_workqueue() can be changed to use cancel_work_sync(),
but sometimes we really need to wait for the completion and cancelling is not
an option. schedule_on_each_cpu() is a good example.
Add the new helper, flush_work(work), which waits for the completion of the
specific work_struct. More precisely, it "flushes" the result of the last
queue_work() which is visible to the caller.
For example, this code
queue_work(wq, work);
/* WINDOW */
queue_work(wq, work);
flush_work(work);
doesn't necessarily work "as expected". What can happen in the WINDOW above is
- wq starts the execution of work->func()
- the caller migrates to another CPU
now, after the 2nd queue_work() this work is active on the previous CPU, and
at the same time it is queued on another. In this case flush_work(work) may
return before the first work->func() completes.
It is trivial to add another helper
int flush_work_sync(struct work_struct *work)
{
return flush_work(work) || wait_on_work(work);
}
which works "more correctly", but it has to iterate over all CPUs and thus
it is much slower than flush_work().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Jarek Poplawski <jarkao2@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-25 01:47:49 -07:00
{
2010-06-29 10:07:12 +02:00
struct worker * worker = NULL ;
2013-01-24 11:01:33 -08:00
struct worker_pool * pool ;
2013-02-13 19:29:12 -08:00
struct pool_workqueue * pwq ;
2008-07-25 01:47:49 -07:00
might_sleep ( ) ;
2013-03-12 11:30:00 -07:00
2019-03-13 17:55:47 +01:00
rcu_read_lock ( ) ;
2013-01-24 11:01:33 -08:00
pool = get_work_pool ( work ) ;
2013-03-12 11:30:00 -07:00
if ( ! pool ) {
2019-03-13 17:55:47 +01:00
rcu_read_unlock ( ) ;
2010-09-16 10:42:16 +02:00
return false ;
2013-03-12 11:30:00 -07:00
}
2008-07-25 01:47:49 -07:00
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
workqueue: simplify is-work-item-queued-here test
Currently, determining whether a work item is queued on a locked pool
involves somewhat convoluted memory barrier dancing. It goes like the
following.
* When a work item is queued on a pool, work->data is updated before
work->entry is linked to the pending list with a wmb() in between.
* When trying to determine whether a work item is currently queued on
a pool pointed to by work->data, it locks the pool and looks at
work->entry. If work->entry is linked, we then do rmb() and then
check whether work->data points to the current pool.
This works because work->data can only point to a pool if it
currently is or was on the pool and,
* If it currently is on the pool, the tests would obviously succeed.
* If it left the pool, its work->entry was cleared under pool->lock,
so if we're seeing non-empty work->entry, it has to be from the work
item being linked on another pool. Because work->data is updated
before work->entry is linked with wmb() in between, work->data update
from another pool is guaranteed to be visible if we do rmb() after
seeing non-empty work->entry. So, we either see empty work->entry
or we see updated work->data pointing to another pool.
While this works, it's convoluted, to put it mildly. With recent
updates, it's now guaranteed that work->data points to cwq only while
the work item is queued and that updating work->data to point to cwq
or back to pool is done under pool->lock, so we can simply test
whether work->data points to cwq which is associated with the
currently locked pool instead of the convoluted memory barrier
dancing.
This patch replaces the memory barrier based "are you still here,
really?" test with a much simpler "does work->data point to me?" test -
if work->data points to a cwq which is associated with the currently
locked pool, the work item is guaranteed to be queued on the pool as
work->data can start and stop pointing to such cwq only under
pool->lock and the start and stop coincide with queue and dequeue.
tj: Rewrote the comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
/* see the comment in try_to_grab_pending() with the same code */
2013-02-13 19:29:12 -08:00
pwq = get_work_pwq ( work ) ;
if ( pwq ) {
if ( unlikely ( pwq - > pool ! = pool ) )
2010-06-29 10:07:10 +02:00
goto already_gone ;
2012-08-20 14:51:23 -07:00
} else {
2013-01-24 11:01:33 -08:00
worker = find_worker_executing_work ( pool , work ) ;
2010-06-29 10:07:12 +02:00
if ( ! worker )
2010-06-29 10:07:10 +02:00
goto already_gone ;
2013-02-13 19:29:12 -08:00
pwq = worker - > current_pwq ;
2012-08-20 14:51:23 -07:00
}
2008-07-25 01:47:49 -07:00
2015-12-07 10:58:57 -05:00
	check_flush_dependency(pwq->wq, work);
2013-02-13 19:29:12 -08:00
	insert_wq_barrier(pwq, barr, work, worker);
2020-05-27 21:46:33 +02:00
	raw_spin_unlock_irq(&pool->lock);
2010-06-29 10:07:13 +02:00
2011-01-09 23:32:15 +01:00
	/*
2017-08-23 12:52:32 +02:00
	 * Force a lock recursion deadlock when using flush_work() inside a
	 * single-threaded or rescuer equipped workqueue.
	 *
	 * For single threaded workqueues the deadlock happens when the work
	 * is after the work issuing the flush_work(). For rescuer equipped
	 * workqueues the deadlock happens when the rescuer stalls, blocking
	 * forward progress.
2011-01-09 23:32:15 +01:00
	 */
workqueue: skip lockdep wq dependency in cancel_work_sync()
In cancel_work_sync(), we can only have one of two cases, even
with an ordered workqueue:
* the work isn't running, just cancelled before it started
* the work is running, but then nothing else can be on the
workqueue before it
Thus, we need to skip the lockdep workqueue dependency handling,
otherwise we get false positive reports from lockdep saying that
we have a potential deadlock when the workqueue also has other
work items with locking, e.g.
work1_function() { mutex_lock(&mutex); ... }
work2_function() { /* nothing */ }
other_function() {
queue_work(ordered_wq, &work1);
queue_work(ordered_wq, &work2);
mutex_lock(&mutex);
cancel_work_sync(&work2);
}
As described above, this isn't a problem, but lockdep will
currently flag it as if cancel_work_sync() was flush_work(),
which *is* a problem.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-08-22 11:49:03 +02:00
	if (!from_cancel &&
	    (pwq->wq->saved_max_active == 1 || pwq->wq->rescuer)) {
2013-02-13 19:29:12 -08:00
		lock_map_acquire(&pwq->wq->lockdep_map);
2017-08-23 12:52:32 +02:00
		lock_map_release(&pwq->wq->lockdep_map);
	}
2019-03-13 17:55:47 +01:00
	rcu_read_unlock();
2010-09-16 10:36:00 +02:00
	return true;
2010-06-29 10:07:10 +02:00
already_gone:
2020-05-27 21:46:33 +02:00
	raw_spin_unlock_irq(&pool->lock);
2019-03-13 17:55:47 +01:00
	rcu_read_unlock();
2010-09-16 10:36:00 +02:00
	return false;
2008-07-25 01:47:49 -07:00
}
2010-09-16 10:42:16 +02:00
2018-08-22 11:49:03 +02:00
static bool __flush_work(struct work_struct *work, bool from_cancel)
{
	struct wq_barrier barr;
	if (WARN_ON(!wq_online))
		return false;
2019-01-23 09:44:12 +09:00
	if (WARN_ON(!work->func))
		return false;
workqueue: don't skip lockdep work dependency in cancel_work_sync()
Like Hillf Danton mentioned
syzbot should have been able to catch cancel_work_sync() in work context
by checking lockdep_map in __flush_work() for both flush and cancel.
in [1], being unable to report an obvious deadlock scenario shown below is
broken. From locking dependency perspective, sync version of cancel request
should behave as if flush request, for it waits for completion of work if
that work has already started execution.
----------
#include <linux/module.h>
#include <linux/sched.h>
static DEFINE_MUTEX(mutex);
static void work_fn(struct work_struct *work)
{
schedule_timeout_uninterruptible(HZ / 5);
mutex_lock(&mutex);
mutex_unlock(&mutex);
}
static DECLARE_WORK(work, work_fn);
static int __init test_init(void)
{
schedule_work(&work);
schedule_timeout_uninterruptible(HZ / 10);
mutex_lock(&mutex);
cancel_work_sync(&work);
mutex_unlock(&mutex);
return -EINVAL;
}
module_init(test_init);
MODULE_LICENSE("GPL");
----------
The check this patch restores was added by commit 0976dfc1d0cd80a4
("workqueue: Catch more locking problems with flush_work()").
Then, lockdep's crossrelease feature was added by commit b09be676e0ff25bd
("locking/lockdep: Implement the 'crossrelease' feature"). As a result,
this check was once removed by commit fd1a5b04dfb899f8 ("workqueue: Remove
now redundant lock acquisitions wrt. workqueue flushes").
But lockdep's crossrelease feature was removed by commit e966eaeeb623f099
("locking/lockdep: Remove the cross-release locking checks"). At this
point, this check should have been restored.
Then, commit d6e89786bed977f3 ("workqueue: skip lockdep wq dependency in
cancel_work_sync()") introduced a boolean flag in order to distinguish
flush_work() and cancel_work_sync(), for checking "struct workqueue_struct"
dependency when called from cancel_work_sync() was causing false positives.
Then, commit 87915adc3f0acdf0 ("workqueue: re-add lockdep dependencies for
flushing") tried to restore "struct work_struct" dependency check, but by
error checked this boolean flag. Like an example shown above indicates,
"struct work_struct" dependency needs to be checked for both flush_work()
and cancel_work_sync().
Link: https://lkml.kernel.org/r/20220504044800.4966-1-hdanton@sina.com [1]
Reported-by: Hillf Danton <hdanton@sina.com>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Fixes: 87915adc3f0acdf0 ("workqueue: re-add lockdep dependencies for flushing")
Cc: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-07-29 13:30:23 +09:00
	lock_map_acquire(&work->lockdep_map);
	lock_map_release(&work->lockdep_map);
2018-08-22 11:49:04 +02:00
2018-08-22 11:49:03 +02:00
	if (start_flush_work(work, &barr, from_cancel)) {
		wait_for_completion(&barr.done);
		destroy_work_on_stack(&barr.work);
		return true;
	} else {
		return false;
	}
}
2010-09-16 10:42:16 +02:00
/**
 * flush_work - wait for a work to finish executing the last queueing instance
 * @work: the work to flush
 *
2012-08-20 14:51:23 -07:00
 * Wait until @work has finished execution. @work is guaranteed to be idle
 * on return if it hasn't been requeued since flush started.
2010-09-16 10:42:16 +02:00
 *
2013-07-31 14:59:24 -07:00
 * Return:
2010-09-16 10:42:16 +02:00
 * %true if flush_work() waited for the work to finish execution,
 * %false if it was already idle.
 */
bool flush_work(struct work_struct *work)
{
2018-08-22 11:49:03 +02:00
	return __flush_work(work, false);
make cancel_rearming_delayed_work() reliable
Thanks to Jarek Poplawski for the ideas and for spotting the bug in the
initial draft patch.
cancel_rearming_delayed_work() currently has many limitations, because it
requires that dwork always re-arms itself via queue_delayed_work(). So it
hangs forever if dwork doesn't do this, or cancel_rearming_delayed_work/
cancel_delayed_work was already called. It uses flush_workqueue() in a
loop, so it can't be used if the workqueue was frozen, and it is potentially
live-lockable on a busy system if the delay is small.
With this patch cancel_rearming_delayed_work() doesn't make any assumptions
about dwork, it can re-arm itself via queue_delayed_work(), or
queue_work(), or do nothing.
As a "side effect", cancel_work_sync() was changed to handle re-arming works
as well.
Disadvantages:
- this patch adds wmb() to insert_work().
- slows down the fast path (when del_timer() succeeds on entry) of
cancel_rearming_delayed_work(), because wait_on_work() is called
unconditionally. In that case, compared to the old version, we are
doing "unneeded" lock/unlock for each online CPU.
On the other hand, this means we don't need to use cancel_work_sync()
after cancel_rearming_delayed_work().
- complicates the code (.text grows by 130 bytes).
[akpm@linux-foundation.org: fix speling]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: David Chinner <dgc@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Acked-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 02:34:46 -07:00
}
2012-08-20 14:51:23 -07:00
EXPORT_SYMBOL_GPL(flush_work);
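The barrier idea behind flush_work() can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions, not the kernel implementation: toy_item, toy_queue, toy_queue_work(), toy_start_flush() and toy_run_until_barrier() are hypothetical names introduced here, there is no locking or concurrency, and only the case where the target work is still queued is modelled. The barrier is placed directly behind the target, so the flusher could be woken as soon as the target has run, without waiting for unrelated later items.

```c
/* Userspace toy model (NOT kernel code); all names here are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ITEMS 8

struct toy_item {
	int id;			/* work id; for a barrier, the id of its target */
	bool is_barrier;	/* true for a barrier inserted by a flush */
};

struct toy_queue {
	struct toy_item items[MAX_ITEMS];	/* FIFO order: index 0 runs first */
	size_t len;
};

/* Append a regular work item; returns false if the queue is full. */
static bool toy_queue_work(struct toy_queue *q, int id)
{
	if (q->len == MAX_ITEMS)
		return false;
	q->items[q->len++] = (struct toy_item){ .id = id, .is_barrier = false };
	return true;
}

/*
 * Insert a barrier directly behind the queued work @target.
 * Returns false if @target is not queued (the "already idle" case)
 * or if there is no room for the barrier.
 */
static bool toy_start_flush(struct toy_queue *q, int target)
{
	size_t i, j;

	if (q->len == MAX_ITEMS)
		return false;
	for (i = 0; i < q->len; i++)
		if (!q->items[i].is_barrier && q->items[i].id == target)
			break;
	if (i == q->len)
		return false;

	/* shift everything behind the target back by one slot */
	for (j = q->len; j > i + 1; j--)
		q->items[j] = q->items[j - 1];
	q->items[i + 1] = (struct toy_item){ .id = target, .is_barrier = true };
	q->len++;
	return true;
}

/*
 * Walk the queue in execution order until the barrier for @target is
 * reached. Returns how many regular items ran before the barrier
 * completed, or -1 if no such barrier is queued.
 */
static int toy_run_until_barrier(const struct toy_queue *q, int target)
{
	int executed = 0;

	for (size_t i = 0; i < q->len; i++) {
		if (q->items[i].is_barrier && q->items[i].id == target)
			return executed;
		if (!q->items[i].is_barrier)
			executed++;
	}
	return -1;
}
```

In this model, queueing items 1, 2 and 3 and then flushing item 2 puts the barrier between 2 and 3, so the barrier completes after two items have run and item 3 does not delay the flusher.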
2007-05-09 02:34:46 -07:00
2015-03-05 08:04:13 -05:00
struct cwt_wait {
2017-06-20 12:06:13 +02:00
	wait_queue_entry_t wait;
2015-03-05 08:04:13 -05:00
	struct work_struct *work;
};
2017-06-20 12:06:13 +02:00
static int cwt_wakefn(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
2015-03-05 08:04:13 -05:00
{
	struct cwt_wait *cwait = container_of(wait, struct cwt_wait, wait);
	if (cwait->work != key)
		return 0;
	return autoremove_wake_function(wait, mode, sync, key);
}
2012-08-03 10:30:46 -07:00
static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
2007-07-15 23:41:44 -07:00
{
2015-03-05 08:04:13 -05:00
	static DECLARE_WAIT_QUEUE_HEAD(cancel_waitq);
2012-08-03 10:30:46 -07:00
	unsigned long flags;
2007-07-15 23:41:44 -07:00
	int ret;
	do {
2012-08-03 10:30:46 -07:00
		ret = try_to_grab_pending(work, is_dwork, &flags);
		/*
2015-03-05 08:04:13 -05:00
		 * If someone else is already canceling, wait for it to
		 * finish. flush_work() doesn't work for PREEMPT_NONE
		 * because we may get scheduled between @work's completion
		 * and the other canceling task resuming and clearing
		 * CANCELING - flush_work() will return false immediately
		 * as @work is no longer busy, try_to_grab_pending() will
		 * return -ENOENT as @work is still being canceled and the
		 * other canceling task won't be able to clear CANCELING as
		 * we're hogging the CPU.
		 *
		 * Let's wait for completion using a waitqueue. As this
		 * may lead to the thundering herd problem, use a custom
		 * wake function which matches @work along with exclusive
		 * wait and wakeup.
2012-08-03 10:30:46 -07:00
		 */
2015-03-05 08:04:13 -05:00
		if (unlikely(ret == -ENOENT)) {
			struct cwt_wait cwait;
			init_wait(&cwait.wait);
			cwait.wait.func = cwt_wakefn;
			cwait.work = work;
			prepare_to_wait_exclusive(&cancel_waitq, &cwait.wait,
						  TASK_UNINTERRUPTIBLE);
			if (work_is_canceling(work))
				schedule();
			finish_wait(&cancel_waitq, &cwait.wait);
		}
2007-07-15 23:41:44 -07:00
	} while (unlikely(ret < 0));
2012-08-03 10:30:46 -07:00
	/* tell other tasks trying to grab @work to back off */
	mark_work_canceling(work);
	local_irq_restore(flags);
2016-09-16 15:49:32 -04:00
	/*
	 * This allows canceling during early boot. We know that @work
	 * isn't executing.
	 */
	if (wq_online)
2018-08-22 11:49:03 +02:00
		__flush_work(work, true);
2016-09-16 15:49:32 -04:00
2010-06-29 10:07:13 +02:00
	clear_work_data(work);
2015-03-05 08:04:13 -05:00
	/*
	 * Paired with prepare_to_wait() above so that either
	 * waitqueue_active() is visible here or !work_is_canceling() is
	 * visible there.
	 */
	smp_mb();
	if (waitqueue_active(&cancel_waitq))
		__wake_up(&cancel_waitq, TASK_NORMAL, 1, work);
2007-07-15 23:41:44 -07:00
	return ret;
}
2007-05-09 02:34:46 -07:00
/**
2010-09-16 10:36:00 +02:00
 * cancel_work_sync - cancel a work and wait for it to finish
 * @work: the work to cancel
2007-05-09 02:34:46 -07:00
 *
2010-09-16 10:36:00 +02:00
 * Cancel @work and wait for its execution to finish. This function
 * can be used even if the work re-queues itself or migrates to
 * another workqueue. On return from this function, @work is
 * guaranteed to be not pending or executing on any CPU.
2007-07-15 23:41:44 -07:00
 *
2010-09-16 10:36:00 +02:00
 * cancel_work_sync(&delayed_work->work) must not be used for
 * delayed_work's. Use cancel_delayed_work_sync() instead.
2007-05-09 02:34:46 -07:00
 *
2010-09-16 10:36:00 +02:00
 * The caller must ensure that the workqueue on which @work was last
2007-05-09 02:34:46 -07:00
 * queued can't be destroyed before this function returns.
2010-09-16 10:36:00 +02:00
 *
2013-07-31 14:59:24 -07:00
 * Return:
2010-09-16 10:36:00 +02:00
 * %true if @work was pending, %false otherwise.
2007-05-09 02:34:46 -07:00
 */
2010-09-16 10:36:00 +02:00
bool cancel_work_sync(struct work_struct *work)
2007-05-09 02:34:46 -07:00
{
2012-08-03 10:30:46 -07:00
	return __cancel_work_timer(work, false);
implement flush_work()
A basic problem with flush_scheduled_work() is that it blocks behind _all_
presently-queued works, rather than just the work which the caller wants to
flush. If the caller holds some lock, and if one of the queued works happens
to want that lock as well then accidental deadlocks can occur.
One example of this is the phy layer: it wants to flush work while holding
rtnl_lock(). But if a linkwatch event happens to be queued, the phy code will
deadlock because the linkwatch callback function takes rtnl_lock.
So we implement a new function which will flush a *single* work - just the one
which the caller wants to free up. Thus we avoid the accidental deadlocks
which can arise from unrelated subsystems' callbacks taking shared locks.
flush_work() non-blockingly dequeues the work_struct which we want to kill,
then it waits for its handler to complete on all CPUs.
Add ->current_work to the "struct cpu_workqueue_struct", it points to
currently running "struct work_struct". When flush_work(work) detects
->current_work == work, it inserts a barrier at the _head_ of ->worklist
(and thus right _after_ that work) and waits for completion. This means
that the next work fired on that CPU will be this barrier, or another
barrier queued by concurrent flush_work(), so the caller of flush_work()
will be woken before any "regular" work has a chance to run.
When wait_on_work() unlocks workqueue_mutex (or whatever we choose to protect
against CPU hotplug), CPU may go away. But in that case take_over_work() will
move a barrier we queued to another CPU, it will be fired sometime, and
wait_on_work() will be woken.
Actually, we are doing cleanup_workqueue_thread()->kthread_stop() before
take_over_work(), so cwq->thread should complete its ->worklist (and thus
the barrier), because currently we don't check kthread_should_stop() in
run_workqueue(). But even if we did, everything should be ok.
[akpm@osdl.org: cleanup]
[akpm@osdl.org: add flush_work_keventd() wrapper]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 02:33:52 -07:00
}
2007-05-09 02:34:22 -07:00
EXPORT_SYMBOL_GPL(cancel_work_sync);
2007-05-09 02:33:52 -07:00
make cancel_rearming_delayed_work() reliable
Thanks to Jarek Poplawski for the ideas and for spotting the bug in the
initial draft patch.
cancel_rearming_delayed_work() currently has many limitations, because it
requires that dwork always re-arms itself via queue_delayed_work(). So it
hangs forever if dwork doesn't do this, or if cancel_rearming_delayed_work()/
cancel_delayed_work() was already called. It uses flush_workqueue() in a
loop, so it can't be used if the workqueue was frozen, and it is potentially
livelockable on a busy system if the delay is small.
With this patch cancel_rearming_delayed_work() doesn't make any assumptions
about dwork, it can re-arm itself via queue_delayed_work(), or
queue_work(), or do nothing.
As a "side effect", cancel_work_sync() was changed to handle re-arming works
as well.
Disadvantages:
- this patch adds wmb() to insert_work().
- slows down the fast path (when del_timer() succeeds on entry) of
cancel_rearming_delayed_work(), because wait_on_work() is called
unconditionally. In that case, compared to the old version, we are
doing "unneeded" lock/unlock for each online CPU.
On the other hand, this means we don't need to use cancel_work_sync()
after cancel_rearming_delayed_work().
- complicates the code (.text grows by 130 bytes).
[akpm@linux-foundation.org: fix speling]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: David Chinner <dgc@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Acked-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 02:34:46 -07:00
/**
2010-09-16 10:36:00 +02:00
* flush_delayed_work - wait for a dwork to finish executing the last queueing
* @ dwork : the delayed work to flush
2007-05-09 02:34:46 -07:00
*
2010-09-16 10:36:00 +02:00
* Delayed timer is cancelled and the pending work is queued for
* immediate execution . Like flush_work ( ) , this function only
* considers the last queueing instance of @ dwork .
2007-07-15 23:41:44 -07:00
*
2013-07-31 14:59:24 -07:00
* Return :
2010-09-16 10:36:00 +02:00
* % true if flush_work ( ) waited for the work to finish execution ,
* % false if it was already idle .
2007-05-09 02:34:46 -07:00
*/
2010-09-16 10:36:00 +02:00
bool flush_delayed_work ( struct delayed_work * dwork )
{
2012-08-03 10:30:45 -07:00
local_irq_disable ( ) ;
2010-09-16 10:36:00 +02:00
if ( del_timer_sync ( & dwork - > timer ) )
2013-02-06 18:04:53 -08:00
__queue_work ( dwork - > cpu , dwork - > wq , & dwork - > work ) ;
2012-08-03 10:30:45 -07:00
local_irq_enable ( ) ;
2010-09-16 10:36:00 +02:00
return flush_work ( & dwork - > work ) ;
}
EXPORT_SYMBOL ( flush_delayed_work ) ;
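The control flow of flush_delayed_work() above can be sketched in plain userspace C. This is a minimal model, not kernel code: `struct model_dwork` and every `model_*` function are hypothetical names invented for illustration, and the "timer" and "queue" are reduced to boolean flags. It only shows the order of operations: cancel the delay timer, queue the work immediately if the timer was still pending, then flush.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_dwork {
	bool timer_pending;	/* delay timer armed, work not yet queued */
	bool queued;		/* work sits on the queue, waiting to run */
	int runs;		/* how many times the callback has executed */
};

/* Models del_timer_sync(): true only if the timer was still pending. */
bool model_del_timer(struct model_dwork *dw)
{
	bool was_pending = dw->timer_pending;

	dw->timer_pending = false;
	return was_pending;
}

/* Models flush_work(): run queued work, report whether there was any. */
bool model_flush_work(struct model_dwork *dw)
{
	if (!dw->queued)
		return false;	/* already idle, nothing to wait for */
	dw->queued = false;
	dw->runs++;
	return true;
}

/* Models flush_delayed_work(): expire the delay early, then flush. */
bool model_flush_delayed_work(struct model_dwork *dw)
{
	assert(dw);
	if (model_del_timer(dw))
		dw->queued = true;	/* queue for immediate execution */
	return model_flush_work(dw);
}
```

Flushing a model item whose timer is armed runs it once and returns true; flushing it again returns false because it is already idle, which mirrors the documented return values.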
2018-03-14 12:45:13 -07:00
/**
* flush_rcu_work - wait for a rwork to finish executing the last queueing
* @ rwork : the rcu work to flush
*
* Return :
* % true if flush_rcu_work ( ) waited for the work to finish execution ,
* % false if it was already idle .
*/
bool flush_rcu_work ( struct rcu_work * rwork )
{
if ( test_bit ( WORK_STRUCT_PENDING_BIT , work_data_bits ( & rwork - > work ) ) ) {
rcu_barrier ( ) ;
flush_work ( & rwork - > work ) ;
return true ;
} else {
return flush_work ( & rwork - > work ) ;
}
}
EXPORT_SYMBOL ( flush_rcu_work ) ;
2016-08-24 15:51:50 -06:00
static bool __cancel_work ( struct work_struct * work , bool is_dwork )
{
unsigned long flags ;
int ret ;
do {
ret = try_to_grab_pending ( work , is_dwork , & flags ) ;
} while ( unlikely ( ret = = - EAGAIN ) ) ;
if ( unlikely ( ret < 0 ) )
return false ;
set_work_pool_and_clear_pending ( work , get_work_pool_id ( work ) ) ;
local_irq_restore ( flags ) ;
return ret ;
}
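The retry loop in __cancel_work() can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: `struct model_work`, `model_try_grab()` and `model_cancel()` are hypothetical, and the return convention (1 for "was pending", 0 for "was idle", -EAGAIN for "retry", any other negative value for "give up") is inferred from how __cancel_work() consumes the result of try_to_grab_pending(), not from that function's own source.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a work item; not the kernel's struct work_struct. */
struct model_work {
	int busy_spins;	/* attempts that will observe a transient state */
	bool pending;	/* work is queued and has not started running */
	bool gone;	/* models the "cannot be grabbed" error case */
};

/* Assumed convention: 1 = was pending, 0 = was idle, <0 = error. */
int model_try_grab(struct model_work *w)
{
	if (w->gone)
		return -ENOENT;
	if (w->busy_spins > 0) {
		w->busy_spins--;
		return -EAGAIN;	/* transient, caller should retry */
	}
	if (w->pending) {
		w->pending = false;
		return 1;
	}
	return 0;
}

/* Same shape as __cancel_work(): spin on -EAGAIN, fail on other errors. */
bool model_cancel(struct model_work *w)
{
	int ret;

	do {
		ret = model_try_grab(w);
	} while (ret == -EAGAIN);

	if (ret < 0)
		return false;
	return ret;
}
```

The loop only repeats on the transient error; a hard failure and "was not pending" both end up as false for the caller.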
2022-05-19 09:47:28 -04:00
/*
* See cancel_delayed_work ( )
*/
bool cancel_work ( struct work_struct * work )
{
return __cancel_work ( work , false ) ;
}
EXPORT_SYMBOL ( cancel_work ) ;
2010-09-16 10:48:29 +02:00
/**
2012-08-21 13:18:24 -07:00
* cancel_delayed_work - cancel a delayed work
* @ dwork : delayed_work to cancel
2010-09-16 10:48:29 +02:00
*
2013-07-31 14:59:24 -07:00
* Kill off a pending delayed_work .
*
* Return : % true if @ dwork was pending and canceled ; % false if it wasn ' t
* pending .
*
* Note :
* The work callback function may still be running on return , unless
* it returns % true and the work doesn ' t re - arm itself . Explicitly flush or
* use cancel_delayed_work_sync ( ) to wait on it .
2010-09-16 10:48:29 +02:00
*
2012-08-21 13:18:24 -07:00
* This function is safe to call from any context including IRQ handler .
2010-09-16 10:48:29 +02:00
*/
2012-08-21 13:18:24 -07:00
bool cancel_delayed_work ( struct delayed_work * dwork )
2010-09-16 10:48:29 +02:00
{
2016-08-24 15:51:50 -06:00
return __cancel_work ( & dwork - > work , true ) ;
2010-09-16 10:48:29 +02:00
}
2012-08-21 13:18:24 -07:00
EXPORT_SYMBOL ( cancel_delayed_work ) ;
2010-09-16 10:48:29 +02:00
2010-09-16 10:36:00 +02:00
/**
* cancel_delayed_work_sync - cancel a delayed work and wait for it to finish
* @ dwork : the delayed work to cancel
*
* This is cancel_work_sync ( ) for delayed works .
*
2013-07-31 14:59:24 -07:00
* Return :
2010-09-16 10:36:00 +02:00
* % true if @ dwork was pending , % false otherwise .
*/
bool cancel_delayed_work_sync ( struct delayed_work * dwork )
2007-05-09 02:34:46 -07:00
{
2012-08-03 10:30:46 -07:00
return __cancel_work_timer ( & dwork - > work , true ) ;
2007-05-09 02:34:46 -07:00
}
2007-07-15 23:41:44 -07:00
EXPORT_SYMBOL ( cancel_delayed_work_sync ) ;
2005-04-16 15:20:36 -07:00
2006-06-25 05:47:49 -07:00
/**
2010-10-19 11:14:49 +02:00
* schedule_on_each_cpu - execute a function synchronously on each online CPU
2006-06-25 05:47:49 -07:00
* @ func : the function to call
*
2010-10-19 11:14:49 +02:00
* schedule_on_each_cpu ( ) executes @ func on each online CPU using the
* system workqueue and blocks until all CPUs have completed .
2006-06-25 05:47:49 -07:00
* schedule_on_each_cpu ( ) is very slow .
2010-10-19 11:14:49 +02:00
*
2013-07-31 14:59:24 -07:00
* Return :
2010-10-19 11:14:49 +02:00
* 0 on success , - errno on failure .
2006-06-25 05:47:49 -07:00
*/
2006-11-22 14:55:48 +00:00
int schedule_on_each_cpu ( work_func_t func )
2006-01-08 01:00:43 -08:00
{
int cpu ;
2010-08-08 14:24:09 +02:00
struct work_struct __percpu * works ;
2006-01-08 01:00:43 -08:00
2006-06-25 05:47:49 -07:00
works = alloc_percpu ( struct work_struct ) ;
if ( ! works )
2006-01-08 01:00:43 -08:00
return - ENOMEM ;
2006-06-25 05:47:49 -07:00
2021-08-03 16:16:20 +02:00
cpus_read_lock ( ) ;
2009-11-17 14:06:20 -08:00
2006-01-08 01:00:43 -08:00
for_each_online_cpu ( cpu ) {
2006-12-18 20:05:09 +01:00
struct work_struct * work = per_cpu_ptr ( works , cpu ) ;
INIT_WORK ( work , func ) ;
2010-06-29 10:07:14 +02:00
schedule_work_on ( cpu , work ) ;
2009-10-14 06:22:47 +02:00
}
2009-11-17 14:06:20 -08:00
for_each_online_cpu ( cpu )
flush_work ( per_cpu_ptr ( works , cpu ) ) ;
2021-08-03 16:16:20 +02:00
cpus_read_unlock ( ) ;
2006-06-25 05:47:49 -07:00
free_percpu ( works ) ;
2006-01-08 01:00:43 -08:00
return 0 ;
}
2006-02-23 12:43:43 -06:00
/**
* execute_in_process_context - reliably execute the routine with user context
* @ fn : the function to execute
* @ ew : guaranteed storage for the execute work structure ( must
* be available when the work executes )
*
* Executes the function immediately if process context is available ,
* otherwise schedules the function for delayed execution .
*
2013-07-31 14:59:24 -07:00
* Return : 0 - function was executed
2006-02-23 12:43:43 -06:00
* 1 - function was scheduled for execution
*/
2006-11-22 14:55:48 +00:00
int execute_in_process_context ( work_func_t fn , struct execute_work * ew )
2006-02-23 12:43:43 -06:00
{
if ( ! in_interrupt ( ) ) {
2006-11-22 14:55:48 +00:00
fn ( & ew - > work ) ;
2006-02-23 12:43:43 -06:00
return 0 ;
}
2006-11-22 14:55:48 +00:00
INIT_WORK ( & ew - > work , fn ) ;
2006-02-23 12:43:43 -06:00
schedule_work ( & ew - > work ) ;
return 1 ;
}
EXPORT_SYMBOL_GPL ( execute_in_process_context ) ;
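The "run now or defer" decision in execute_in_process_context() can be sketched outside the kernel as follows. This is only a model under stated assumptions: `model_in_interrupt` stands in for in_interrupt(), a single-slot `model_deferred` stands in for the system workqueue, and all `model_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*model_fn_t)(void *arg);

/* Hypothetical stand-ins for in_interrupt() and the system workqueue. */
bool model_in_interrupt;
struct {
	model_fn_t fn;
	void *arg;
} model_deferred;

/* Returns 0 if fn ran immediately, 1 if it was deferred. */
int model_execute_in_process_context(model_fn_t fn, void *arg)
{
	if (!model_in_interrupt) {
		fn(arg);
		return 0;
	}
	model_deferred.fn = fn;
	model_deferred.arg = arg;
	return 1;
}

/* Models the worker picking up the deferred call later. */
void model_run_deferred(void)
{
	if (model_deferred.fn) {
		model_fn_t fn = model_deferred.fn;

		model_deferred.fn = NULL;
		fn(model_deferred.arg);
	}
}

/* Example callback: increments the int that arg points to. */
void model_bump(void *arg)
{
	(*(int *)arg)++;
}
```

As in the original, the storage for the deferred call has to outlive the caller; here it is a global, whereas the kernel function takes it from the caller as @ew.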
2015-04-02 19:14:39 +08:00
/**
* free_workqueue_attrs - free a workqueue_attrs
* @ attrs : workqueue_attrs to free
2013-03-12 11:30:05 -07:00
*
2015-04-02 19:14:39 +08:00
* Undo alloc_workqueue_attrs ( ) .
2013-03-12 11:30:05 -07:00
*/
2019-09-05 21:40:22 -04:00
void free_workqueue_attrs ( struct workqueue_attrs * attrs )
2013-03-12 11:30:05 -07:00
{
2015-04-02 19:14:39 +08:00
if ( attrs ) {
free_cpumask_var ( attrs - > cpumask ) ;
kfree ( attrs ) ;
}
2013-03-12 11:30:05 -07:00
}
2015-04-02 19:14:39 +08:00
/**
* alloc_workqueue_attrs - allocate a workqueue_attrs
*
* Allocate a new workqueue_attrs , initialize with default settings and
* return it .
*
* Return : The newly allocated workqueue_attrs on success . % NULL on failure .
*/
2019-09-05 21:40:22 -04:00
struct workqueue_attrs * alloc_workqueue_attrs ( void )
2013-03-12 11:30:05 -07:00
{
2015-04-02 19:14:39 +08:00
struct workqueue_attrs * attrs ;
2013-03-12 11:30:05 -07:00
2019-06-26 16:52:38 +02:00
attrs = kzalloc ( sizeof ( * attrs ) , GFP_KERNEL ) ;
2015-04-02 19:14:39 +08:00
if ( ! attrs )
goto fail ;
2019-06-26 16:52:38 +02:00
if ( ! alloc_cpumask_var ( & attrs - > cpumask , GFP_KERNEL ) )
2015-04-02 19:14:39 +08:00
goto fail ;
cpumask_copy ( attrs - > cpumask , cpu_possible_mask ) ;
return attrs ;
fail :
free_workqueue_attrs ( attrs ) ;
return NULL ;
2013-03-12 11:30:05 -07:00
}
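alloc_workqueue_attrs() and free_workqueue_attrs() together show a common cleanup idiom: a single `fail:` label that calls a free routine which tolerates both NULL and partially initialized objects. The sketch below reproduces that shape in userspace with calloc()/malloc(). The names `struct model_attrs`, `model_alloc_attrs()` and `model_free_attrs()` are hypothetical, and a single 64-bit word stands in for the cpumask.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct workqueue_attrs. */
struct model_attrs {
	int nice;
	uint64_t *cpumask;	/* separately allocated, like cpumask_var_t may be */
};

/* Safe on NULL and on an object whose cpumask was never allocated. */
void model_free_attrs(struct model_attrs *attrs)
{
	if (attrs) {
		free(attrs->cpumask);	/* free(NULL) is a no-op */
		free(attrs);
	}
}

struct model_attrs *model_alloc_attrs(void)
{
	struct model_attrs *attrs;

	attrs = calloc(1, sizeof(*attrs));	/* zeroed, like kzalloc() */
	if (!attrs)
		goto fail;
	attrs->cpumask = malloc(sizeof(*attrs->cpumask));
	if (!attrs->cpumask)
		goto fail;
	*attrs->cpumask = UINT64_MAX;	/* default: every CPU allowed */
	return attrs;
fail:
	model_free_attrs(attrs);
	return NULL;
}
```

Because the object is zero-initialized before the second allocation, the failure path never has to know how far construction got.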
2015-04-02 19:14:39 +08:00
static void copy_workqueue_attrs ( struct workqueue_attrs * to ,
const struct workqueue_attrs * from )
2013-03-12 11:30:05 -07:00
{
2015-04-02 19:14:39 +08:00
to - > nice = from - > nice ;
cpumask_copy ( to - > cpumask , from - > cpumask ) ;
/*
* Unlike hash and equality test , this function doesn ' t ignore
* - > no_numa as it is used for both pool and wq attrs . Instead ,
* get_unbound_pool ( ) explicitly clears - > no_numa after copying .
*/
to - > no_numa = from - > no_numa ;
2013-03-12 11:30:05 -07:00
}
2015-04-02 19:14:39 +08:00
/* hash value of the content of @attrs */
static u32 wqattrs_hash ( const struct workqueue_attrs * attrs )
2013-03-12 11:30:05 -07:00
{
2015-04-02 19:14:39 +08:00
u32 hash = 0 ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
hash = jhash_1word ( attrs - > nice , hash ) ;
hash = jhash ( cpumask_bits ( attrs - > cpumask ) ,
BITS_TO_LONGS ( nr_cpumask_bits ) * sizeof ( long ) , hash ) ;
return hash ;
2013-03-12 11:30:05 -07:00
}
2015-04-02 19:14:39 +08:00
/* content equality test */
static bool wqattrs_equal ( const struct workqueue_attrs * a ,
const struct workqueue_attrs * b )
2013-03-12 11:30:05 -07:00
{
2015-04-02 19:14:39 +08:00
if ( a - > nice ! = b - > nice )
return false ;
if ( ! cpumask_equal ( a - > cpumask , b - > cpumask ) )
return false ;
return true ;
2013-03-12 11:30:05 -07:00
}
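wqattrs_hash() and wqattrs_equal() must agree on which fields count: both look at nice and cpumask and, per the comment in copy_workqueue_attrs(), both ignore no_numa. A minimal userspace sketch of that contract follows. It is not the kernel's implementation: FNV-1a is swapped in for jhash purely to keep the example self-contained, a 64-bit word stands in for the cpumask, and `struct attrs_key` and the `attrs_key_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct workqueue_attrs. */
struct attrs_key {
	int nice;
	uint64_t cpumask;
	bool no_numa;	/* deliberately excluded from hash and equality */
};

/* FNV-1a over a byte range, continuing from a previous hash value. */
uint32_t attrs_key_fnv1a(const void *data, size_t len, uint32_t hash)
{
	const unsigned char *p = data;

	for (size_t i = 0; i < len; i++) {
		hash ^= p[i];
		hash *= 16777619u;
	}
	return hash;
}

/* Chain the hash over exactly the fields that equality compares. */
uint32_t attrs_key_hash(const struct attrs_key *a)
{
	uint32_t hash = 2166136261u;

	hash = attrs_key_fnv1a(&a->nice, sizeof(a->nice), hash);
	hash = attrs_key_fnv1a(&a->cpumask, sizeof(a->cpumask), hash);
	return hash;
}

bool attrs_key_equal(const struct attrs_key *a, const struct attrs_key *b)
{
	if (a->nice != b->nice)
		return false;
	if (a->cpumask != b->cpumask)
		return false;
	return true;
}
```

Two keys that differ only in no_numa compare equal and hash identically, which is what lets a hash-table lookup find an existing pool regardless of that flag.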
2015-04-02 19:14:39 +08:00
/**
* init_worker_pool - initialize a newly zalloc ' d worker_pool
* @ pool : worker_pool to initialize
*
2015-05-23 10:38:14 +05:30
* Initialize a newly zalloc ' d @ pool . It also allocates @ pool - > attrs .
2015-04-02 19:14:39 +08:00
*
* Return : 0 on success , - errno on failure . Even on failure , all fields
* inside @ pool proper are initialized and put_unbound_pool ( ) can be called
* on @ pool safely to release it .
*/
static int init_worker_pool ( struct worker_pool * pool )
2013-03-12 11:30:05 -07:00
{
2020-05-27 21:46:33 +02:00
raw_spin_lock_init ( & pool - > lock ) ;
2015-04-02 19:14:39 +08:00
pool - > id = - 1 ;
pool - > cpu = - 1 ;
pool - > node = NUMA_NO_NODE ;
pool - > flags | = POOL_DISASSOCIATED ;
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements workqueue lockup
detector. It periodically monitors all worker_pools and, if any pool
fails to make forward progress for longer than the threshold duration,
triggers a warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 11:28:04 -05:00
pool - > watchdog_ts = jiffies ;
2015-04-02 19:14:39 +08:00
INIT_LIST_HEAD ( & pool - > worklist ) ;
INIT_LIST_HEAD ( & pool - > idle_list ) ;
hash_init ( pool - > busy_hash ) ;
2013-03-12 11:30:05 -07:00
2017-10-16 15:58:25 -07:00
timer_setup ( & pool - > idle_timer , idle_worker_timeout , TIMER_DEFERRABLE ) ;
2013-03-12 11:30:05 -07:00
2017-10-16 15:58:25 -07:00
timer_setup ( & pool - > mayday_timer , pool_mayday_timeout , 0 ) ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
INIT_LIST_HEAD ( & pool - > workers ) ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
ida_init ( & pool - > worker_ida ) ;
INIT_HLIST_NODE ( & pool - > hash_node ) ;
pool - > refcnt = 1 ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
/* shouldn't fail above this point */
2019-06-26 16:52:38 +02:00
pool - > attrs = alloc_workqueue_attrs ( ) ;
2015-04-02 19:14:39 +08:00
if ( ! pool - > attrs )
return - ENOMEM ;
return 0 ;
2013-03-12 11:30:05 -07:00
}
2019-02-14 15:00:54 -08:00
# ifdef CONFIG_LOCKDEP
static void wq_init_lockdep ( struct workqueue_struct * wq )
{
char * lock_name ;
lockdep_register_key ( & wq - > key ) ;
lock_name = kasprintf ( GFP_KERNEL , " %s%s " , " (wq_completion) " , wq - > name ) ;
if ( ! lock_name )
lock_name = wq - > name ;
2019-03-06 19:27:31 -05:00
wq - > lock_name = lock_name ;
2019-02-14 15:00:54 -08:00
lockdep_init_map ( & wq - > lockdep_map , lock_name , & wq - > key , 0 ) ;
}
static void wq_unregister_lockdep ( struct workqueue_struct * wq )
{
lockdep_unregister_key ( & wq - > key ) ;
}
static void wq_free_lockdep ( struct workqueue_struct * wq )
{
if ( wq - > lock_name ! = wq - > name )
kfree ( wq - > lock_name ) ;
}
# else
static void wq_init_lockdep ( struct workqueue_struct * wq )
{
}
static void wq_unregister_lockdep ( struct workqueue_struct * wq )
{
}
static void wq_free_lockdep ( struct workqueue_struct * wq )
{
}
# endif
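The CONFIG_LOCKDEP variants above use an "owned or borrowed" pointer: wq->lock_name either points at a freshly allocated string or falls back to wq->name, and wq_free_lockdep() frees it only in the first case. The following userspace sketch shows the same idea with malloc() and snprintf(). `struct model_wq`, the `model_*` functions and the `force_fail` knob (which simulates an allocation failure) are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the relevant workqueue_struct fields. */
struct model_wq {
	char name[24];
	char *lock_name;	/* heap string, or an alias of name */
};

/* force_fail simulates the allocation failing (illustration only). */
void model_init_lock_name(struct model_wq *wq, int force_fail)
{
	static const char prefix[] = "(wq_completion)";
	/* sizeof(prefix) already counts the terminating NUL */
	size_t len = sizeof(prefix) + strlen(wq->name);
	char *buf = force_fail ? NULL : malloc(len);

	if (buf)
		snprintf(buf, len, "%s%s", prefix, wq->name);
	else
		buf = wq->name;	/* borrow the existing name instead */
	wq->lock_name = buf;
}

/* Free only what was allocated; the borrowed alias must be left alone. */
void model_free_lock_name(struct model_wq *wq)
{
	if (wq->lock_name != wq->name)
		free(wq->lock_name);
	wq->lock_name = NULL;
}
```

Comparing the two pointers is enough to tell the cases apart, so no extra "owned" flag is needed.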
2015-04-02 19:14:39 +08:00
static void rcu_free_wq ( struct rcu_head * rcu )
2013-03-12 11:30:05 -07:00
{
2015-04-02 19:14:39 +08:00
struct workqueue_struct * wq =
container_of ( rcu , struct workqueue_struct , rcu ) ;
2013-03-12 11:30:05 -07:00
2019-02-14 15:00:54 -08:00
wq_free_lockdep ( wq ) ;
2015-04-02 19:14:39 +08:00
if ( ! ( wq - > flags & WQ_UNBOUND ) )
free_percpu ( wq - > cpu_pwqs ) ;
2013-03-12 11:30:05 -07:00
else
2015-04-02 19:14:39 +08:00
free_workqueue_attrs ( wq - > unbound_attrs ) ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
kfree ( wq ) ;
2013-03-12 11:30:05 -07:00
}
2015-04-02 19:14:39 +08:00
static void rcu_free_pool ( struct rcu_head * rcu )
2013-03-12 11:30:05 -07:00
{
2015-04-02 19:14:39 +08:00
struct worker_pool * pool = container_of ( rcu , struct worker_pool , rcu ) ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
ida_destroy ( & pool - > worker_ida ) ;
free_workqueue_attrs ( pool - > attrs ) ;
kfree ( pool ) ;
2013-03-12 11:30:05 -07:00
}
2020-05-27 21:46:32 +02:00
/* This returns with the lock held on success (pool manager is inactive). */
static bool wq_manager_inactive ( struct worker_pool * pool )
{
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
2020-05-27 21:46:32 +02:00
if ( pool - > flags & POOL_MANAGER_ACTIVE ) {
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2020-05-27 21:46:32 +02:00
return false ;
}
return true ;
}
2015-04-02 19:14:39 +08:00
/**
* put_unbound_pool - put a worker_pool
* @ pool : worker_pool to put
*
2019-03-13 17:55:47 +01:00
* Put @ pool . If its refcnt reaches zero , it gets destroyed in RCU
2015-04-02 19:14:39 +08:00
* safe manner . get_unbound_pool ( ) calls this function on its failure path
* and this function should be able to release pools which went through ,
* successfully or not , init_worker_pool ( ) .
*
* Should be called with wq_pool_mutex held .
*/
static void put_unbound_pool ( struct worker_pool * pool )
2013-03-12 11:30:05 -07:00
{
2015-04-02 19:14:39 +08:00
DECLARE_COMPLETION_ONSTACK ( detach_completion ) ;
struct worker * worker ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
lockdep_assert_held ( & wq_pool_mutex ) ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
if ( - - pool - > refcnt )
return ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
/* sanity checks */
if ( WARN_ON ( ! ( pool - > cpu < 0 ) ) | |
WARN_ON ( ! list_empty ( & pool - > worklist ) ) )
return ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
/* release id and unhash */
if ( pool - > id > = 0 )
idr_remove ( & worker_pool_idr , pool - > id ) ;
hash_del ( & pool - > hash_node ) ;
2013-04-01 11:23:38 -07:00
2015-04-02 19:14:39 +08:00
/*
2017-10-09 08:04:13 -07:00
* Become the manager and destroy all workers . This prevents
* @ pool ' s workers from blocking on attach_mutex . We ' re the last
* manager and @ pool gets freed with the flag set .
2020-05-27 21:46:32 +02:00
* Because of how wq_manager_inactive ( ) works , we will hold the
* spinlock after a successful wait .
2015-04-02 19:14:39 +08:00
*/
2020-05-27 21:46:32 +02:00
rcuwait_wait_event ( & manager_wait , wq_manager_inactive ( pool ) ,
TASK_UNINTERRUPTIBLE ) ;
2017-10-09 08:04:13 -07:00
pool - > flags | = POOL_MANAGER_ACTIVE ;
2015-04-02 19:14:39 +08:00
while ( ( worker = first_idle_worker ( pool ) ) )
destroy_worker ( worker ) ;
WARN_ON ( pool - > nr_workers | | pool - > nr_idle ) ;
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2013-04-01 11:23:38 -07:00
2018-05-18 08:47:13 -07:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2015-04-02 19:14:39 +08:00
if ( ! list_empty ( & pool - > workers ) )
pool - > detach_completion = & detach_completion ;
2018-05-18 08:47:13 -07:00
mutex_unlock ( & wq_pool_attach_mutex ) ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
if ( pool - > detach_completion )
wait_for_completion ( pool - > detach_completion ) ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
/* shut down the timers */
del_timer_sync ( & pool - > idle_timer ) ;
del_timer_sync ( & pool - > mayday_timer ) ;
2013-03-12 11:30:05 -07:00
2019-03-13 17:55:47 +01:00
/* RCU protected to allow dereferences from get_work_pool() */
2018-11-06 19:18:45 -08:00
call_rcu ( & pool - > rcu , rcu_free_pool ) ;
2013-03-12 11:30:05 -07:00
}
/**
2015-04-02 19:14:39 +08:00
* get_unbound_pool - get a worker_pool with the specified attributes
* @ attrs : the attributes of the worker_pool to get
2013-03-12 11:30:05 -07:00
*
2015-04-02 19:14:39 +08:00
* Obtain a worker_pool which has the same attributes as @ attrs , bump the
* reference count and return it . If there already is a matching
* worker_pool , it will be used ; otherwise , this function attempts to
* create a new one .
2013-03-12 11:30:05 -07:00
*
2015-04-02 19:14:39 +08:00
* Should be called with wq_pool_mutex held .
2013-03-12 11:30:05 -07:00
*
2015-04-02 19:14:39 +08:00
* Return : On success , a worker_pool with the same attributes as @ attrs .
* On failure , % NULL .
2013-03-12 11:30:05 -07:00
*/
2015-04-02 19:14:39 +08:00
static struct worker_pool * get_unbound_pool ( const struct workqueue_attrs * attrs )
2013-03-12 11:30:05 -07:00
{
2015-04-02 19:14:39 +08:00
u32 hash = wqattrs_hash ( attrs ) ;
struct worker_pool * pool ;
int node ;
2015-10-09 11:53:12 +08:00
int target_node = NUMA_NO_NODE ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
lockdep_assert_held ( & wq_pool_mutex ) ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
/* do we already have a matching pool? */
hash_for_each_possible ( unbound_pool_hash , pool , hash_node , hash ) {
if ( wqattrs_equal ( pool - > attrs , attrs ) ) {
pool - > refcnt + + ;
return pool ;
}
}
2013-03-12 11:30:05 -07:00
2015-10-09 11:53:12 +08:00
/* if cpumask is contained inside a NUMA node, we belong to that node */
if ( wq_numa_enabled ) {
for_each_node ( node ) {
if ( cpumask_subset ( attrs - > cpumask ,
wq_numa_possible_cpumask [ node ] ) ) {
target_node = node ;
break ;
}
}
}
2015-04-02 19:14:39 +08:00
/* nope, create a new one */
2015-10-09 11:53:12 +08:00
pool = kzalloc_node ( sizeof ( * pool ) , GFP_KERNEL , target_node ) ;
2015-04-02 19:14:39 +08:00
if ( ! pool | | init_worker_pool ( pool ) < 0 )
goto fail ;
lockdep_set_subclass ( & pool - > lock , 1 ) ; /* see put_pwq() */
copy_workqueue_attrs ( pool - > attrs , attrs ) ;
2015-10-09 11:53:12 +08:00
pool - > node = target_node ;
2013-03-12 11:30:05 -07:00
/*
2015-04-02 19:14:39 +08:00
* no_numa isn ' t a worker_pool attribute , always clear it . See
* ' struct workqueue_attrs ' comments for detail .
2013-03-12 11:30:05 -07:00
*/
2015-04-02 19:14:39 +08:00
pool - > attrs - > no_numa = false ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
if ( worker_pool_assign_id ( pool ) < 0 )
goto fail ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
/* create and start the initial worker */
2016-09-16 15:49:32 -04:00
if ( wq_online & & ! create_worker ( pool ) )
2015-04-02 19:14:39 +08:00
goto fail ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
/* install */
hash_add ( unbound_pool_hash , & pool - > hash_node , hash ) ;
2013-03-12 11:30:05 -07:00
2015-04-02 19:14:39 +08:00
return pool ;
fail :
if ( pool )
put_unbound_pool ( pool ) ;
return NULL ;
2013-03-12 11:30:05 -07:00
}
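get_unbound_pool() and put_unbound_pool() form a reference-counted lookup-or-create cache keyed by attributes. A heavily simplified userspace sketch of that pairing is shown below. It assumes a singly linked list instead of the hash table, no locking, no workers and no RCU; `struct model_pool`, `model_pools`, `model_get_pool()` and `model_put_pool()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct worker_pool. */
struct model_pool {
	int nice;
	uint64_t cpumask;
	int refcnt;
	struct model_pool *next;
};

struct model_pool *model_pools;	/* head of the list of live pools */

/* Reuse a pool with matching attributes, or create one with refcnt 1. */
struct model_pool *model_get_pool(int nice, uint64_t cpumask)
{
	struct model_pool *pool;

	for (pool = model_pools; pool; pool = pool->next) {
		if (pool->nice == nice && pool->cpumask == cpumask) {
			pool->refcnt++;
			return pool;
		}
	}

	pool = calloc(1, sizeof(*pool));
	if (!pool)
		return NULL;
	pool->nice = nice;
	pool->cpumask = cpumask;
	pool->refcnt = 1;
	pool->next = model_pools;	/* install at the head of the list */
	model_pools = pool;
	return pool;
}

/* Drop one reference; unlink and free the pool when it reaches zero. */
void model_put_pool(struct model_pool *pool)
{
	struct model_pool **link;

	if (--pool->refcnt)
		return;

	for (link = &model_pools; *link; link = &(*link)->next) {
		if (*link == pool) {
			*link = pool->next;
			break;
		}
	}
	free(pool);
}
```

The kernel version additionally has to tear down workers and timers and defer the final free through call_rcu(); the sketch keeps only the get/put accounting.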
2015-04-02 19:14:39 +08:00
static void rcu_free_pwq ( struct rcu_head * rcu )
2013-03-12 11:30:00 -07:00
{
2015-04-02 19:14:39 +08:00
kmem_cache_free ( pwq_cache ,
container_of ( rcu , struct pool_workqueue , rcu ) ) ;
2013-03-12 11:30:00 -07:00
}

/*
 * Scheduled on system_wq by put_pwq() when an unbound pwq hits zero refcnt
 * and needs to be destroyed.
 */
static void pwq_unbound_release_workfn(struct work_struct *work)
{
	struct pool_workqueue *pwq = container_of(work, struct pool_workqueue,
						  unbound_release_work);
	struct workqueue_struct *wq = pwq->wq;
	struct worker_pool *pool = pwq->pool;
	bool is_last = false;

	/*
	 * when @pwq is not linked, it doesn't hold any reference to the
	 * @wq, and @wq is invalid to access.
	 */
	if (!list_empty(&pwq->pwqs_node)) {
		if (WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND)))
			return;

		mutex_lock(&wq->mutex);
		list_del_rcu(&pwq->pwqs_node);
		is_last = list_empty(&wq->pwqs);
		mutex_unlock(&wq->mutex);
	}

	mutex_lock(&wq_pool_mutex);
	put_unbound_pool(pool);
	mutex_unlock(&wq_pool_mutex);

	call_rcu(&pwq->rcu, rcu_free_pwq);

	/*
	 * If we're the last pwq going away, @wq is already dead and no one
	 * is gonna access it anymore. Schedule RCU free.
	 */
	if (is_last) {
		wq_unregister_lockdep(wq);
		call_rcu(&wq->rcu, rcu_free_wq);
	}
}

/**
 * pwq_adjust_max_active - update a pwq's max_active to the current setting
 * @pwq: target pool_workqueue
 *
 * If @pwq isn't freezing, set @pwq->max_active to the associated
 * workqueue's saved_max_active and activate inactive work items
 * accordingly. If @pwq is freezing, clear @pwq->max_active to zero.
 */
static void pwq_adjust_max_active(struct pool_workqueue *pwq)
{
	struct workqueue_struct *wq = pwq->wq;
	bool freezable = wq->flags & WQ_FREEZABLE;
	unsigned long flags;

	/* for @wq->saved_max_active */
	lockdep_assert_held(&wq->mutex);

	/* fast exit for non-freezable wqs */
	if (!freezable && pwq->max_active == wq->saved_max_active)
		return;

	/* this function can be called during early boot w/ irq disabled */
	raw_spin_lock_irqsave(&pwq->pool->lock, flags);

	/*
	 * During [un]freezing, the caller is responsible for ensuring that
	 * this function is called at least once after @workqueue_freezing
	 * is updated and visible.
	 */
	if (!freezable || !workqueue_freezing) {
		bool kick = false;

		pwq->max_active = wq->saved_max_active;

		while (!list_empty(&pwq->inactive_works) &&
		       pwq->nr_active < pwq->max_active) {
			pwq_activate_first_inactive(pwq);
			kick = true;
		}

		/*
		 * Need to kick a worker after thawed or an unbound wq's
		 * max_active is bumped. In realtime scenarios, always kicking a
		 * worker will cause interference on the isolated cpu cores, so
		 * let's kick iff work items were activated.
		 */
		if (kick)
			wake_up_worker(pwq->pool);
	} else {
		pwq->max_active = 0;
	}

	raw_spin_unlock_irqrestore(&pwq->pool->lock, flags);
}

/* initialize newly allocated @pwq which is associated with @wq and @pool */
static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
		     struct worker_pool *pool)
{
	BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);

	memset(pwq, 0, sizeof(*pwq));

	pwq->pool = pool;
	pwq->wq = wq;
	pwq->flush_color = -1;
	pwq->refcnt = 1;
	INIT_LIST_HEAD(&pwq->inactive_works);
	INIT_LIST_HEAD(&pwq->pwqs_node);
	INIT_LIST_HEAD(&pwq->mayday_node);
	INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
}

/* sync @pwq with the current state of its associated wq and link it */
static void link_pwq(struct pool_workqueue *pwq)
{
	struct workqueue_struct *wq = pwq->wq;

	lockdep_assert_held(&wq->mutex);

	/* may be called multiple times, ignore if already linked */
	if (!list_empty(&pwq->pwqs_node))
		return;

	/* set the matching work_color */
	pwq->work_color = wq->work_color;

	/* sync max_active to the current setting */
	pwq_adjust_max_active(pwq);

	/* link in @pwq */
	list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
}

/* obtain a pool matching @attrs and create a pwq associating the pool and @wq */
static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
					const struct workqueue_attrs *attrs)
{
	struct worker_pool *pool;
	struct pool_workqueue *pwq;
workqueue: async worker destruction

Worker destruction consists of these steps:
  - adjust the pool's stats
  - remove the worker from the idle list
  - detach the worker from the pool
  - kthread_stop() to wait for the worker's task to exit
  - free the worker struct

There is no essential work to do after kthread_stop(), which means
destroy_worker() doesn't need to wait for the worker's task to exit, so
we can remove kthread_stop() and free the worker struct in the worker
exiting path.

However, put_unbound_pool() still needs to sync with all the workers'
destruction before destroying the pool; otherwise, the workers may
access the invalid pool while they are exiting.

So we also move the "detach the worker" code to the exiting path and
let put_unbound_pool() sync with it via detach_completion.

The "detach the worker" code is wrapped in a new function,
worker_detach_from_pool(). Although worker_detach_from_pool() is only
called once (in worker_thread()) after this patch, we wrap it for these
reasons:

  1) The "detach the worker" code is too long to open-code in
     worker_thread().
  2) The name worker_detach_from_pool() is self-documenting, and we add
     some comments above the function.
  3) It will be shared by the rescuer in a later patch which allows the
     rescuer and normal threads to use the same attach/detach framework.

The worker id is freed when detaching, which happens before the worker
is fully dead, so the id of the dying worker may be reused for a new
worker. The dying worker's task name is therefore changed to
"worker/dying" to avoid two or more workers having the same name.

Since "detach the worker" is moved out of destroy_worker(),
destroy_worker() no longer requires manager_mutex, so the
"lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is
removed, and destroy_worker() is not protected by manager_mutex in
put_unbound_pool().

tj: Minor description updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 17:46:29 +08:00
	lockdep_assert_held(&wq_pool_mutex);

	pool = get_unbound_pool(attrs);
	if (!pool)
		return NULL;

	pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);
	if (!pwq) {
		put_unbound_pool(pool);
		return NULL;
	}

	init_pwq(pwq, wq, pool);
	return pwq;
}

/**
 * wq_calc_node_cpumask - calculate a wq_attrs' cpumask for the specified node
 * @attrs: the wq_attrs of the default pwq of the target workqueue
 * @node: the target NUMA node
 * @cpu_going_down: if >= 0, the CPU to consider as offline
 * @cpumask: outarg, the resulting cpumask
 *
 * Calculate the cpumask a workqueue with @attrs should use on @node. If
 * @cpu_going_down is >= 0, that cpu is considered offline during
 * calculation. The result is stored in @cpumask.
 *
 * If NUMA affinity is not enabled, @attrs->cpumask is always used. If
 * enabled and @node has online CPUs requested by @attrs, the returned
 * cpumask is the intersection of the possible CPUs of @node and
 * @attrs->cpumask.
 *
 * The caller is responsible for ensuring that the cpumask of @node stays
 * stable.
 *
 * Return: %true if the resulting @cpumask is different from @attrs->cpumask,
 * %false if equal.
 */
static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
				 int cpu_going_down, cpumask_t *cpumask)
{
	if (!wq_numa_enabled || attrs->no_numa)
		goto use_dfl;

	/* does @node have any online CPUs @attrs wants? */
	cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
	if (cpu_going_down >= 0)
		cpumask_clear_cpu(cpu_going_down, cpumask);

	if (cpumask_empty(cpumask))
		goto use_dfl;
workqueue: implement NUMA affinity for unbound workqueues

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.

As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools, which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask and is
used as fallback if pool creation fails during a CPU hotplug
operation.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
    by Lai.

v3: The previous version incorrectly made a workqueue spanning
    multiple nodes spread work items over all online CPUs when some of
    its nodes don't have any desired cpus. Reimplemented so that NUMA
    affinity is properly updated as CPUs go up and down. This problem
    was spotted by Lai Jiangshan.

v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
    however, wq may be freed at any time after dfl_pwq is put making
    the clearing use-after-free. Clear wq->dfl_pwq before putting it.

v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
    @pwq_tbl after success. Fixed.

    Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
    application of new attrs is excluded via CPU hotplug. Removed.

    Documentation on CPU affinity guarantee on CPU_DOWN added.

    All changes are suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:36 -07:00
	/* yeap, return possible CPUs in @node that @attrs wants */
	cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);

	if (cpumask_empty(cpumask)) {
		pr_warn_once("WARNING: workqueue cpumask: online intersect > "
				"possible intersect\n");
		return false;
	}

	return !cpumask_equal(cpumask, attrs->cpumask);

use_dfl:
	cpumask_copy(cpumask, attrs->cpumask);
	return false;
}

/* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */
static struct pool_workqueue *numa_pwq_tbl_install(struct workqueue_struct *wq,
						   int node,
						   struct pool_workqueue *pwq)
{
	struct pool_workqueue *old_pwq;

	lockdep_assert_held(&wq_pool_mutex);
	lockdep_assert_held(&wq->mutex);

	/* link_pwq() can handle duplicate calls */
	link_pwq(pwq);

	old_pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
	rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
	return old_pwq;
}

/* context to store the prepared attrs & pwqs before applying */
struct apply_wqattrs_ctx {
	struct workqueue_struct	*wq;		/* target workqueue */
	struct workqueue_attrs	*attrs;		/* attrs to apply */
	struct list_head	list;		/* queued for batching commit */
	struct pool_workqueue	*dfl_pwq;
	struct pool_workqueue	*pwq_tbl[];
};
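
Because pwq_tbl[] is a flexible array member, a context has to be allocated with room for one slot per node; the kernel code in this file does that with kzalloc(struct_size(ctx, pwq_tbl, nr_node_ids), GFP_KERNEL). The block below is a userspace sketch of the same sizing idea: struct demo_ctx, demo_ctx_alloc() and demo_ctx_set() are hypothetical, and the explicit overflow check is only a hand-rolled approximation of the protection struct_size() gives.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of struct apply_wqattrs_ctx. */
struct demo_ctx {
	void *dfl_pwq;
	void *pwq_tbl[];	/* one slot per node, sized at allocation time */
};

/*
 * Allocate a zeroed demo_ctx with room for nr_nodes table slots.
 * Returns NULL if the size computation would overflow or if the
 * allocation fails.
 */
struct demo_ctx *demo_ctx_alloc(size_t nr_nodes)
{
	size_t slot = sizeof(void *);

	if (nr_nodes > (SIZE_MAX - sizeof(struct demo_ctx)) / slot)
		return NULL;

	return calloc(1, sizeof(struct demo_ctx) + nr_nodes * slot);
}

/* Store pwq in the slot for node; node must be below the allocated count. */
void demo_ctx_set(struct demo_ctx *ctx, size_t nr_nodes, size_t node,
		  void *pwq)
{
	assert(ctx != NULL && node < nr_nodes);
	ctx->pwq_tbl[node] = pwq;
}
```

Zeroing the allocation matters for the cleanup path: slots that were never filled stay NULL, so a cleanup loop over every node can run on a partially prepared context.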

/* free the resources after success or abort */
static void apply_wqattrs_cleanup(struct apply_wqattrs_ctx *ctx)
{
	if (ctx) {
		int node;

		for_each_node(node)
			put_pwq_unlocked(ctx->pwq_tbl[node]);
		put_pwq_unlocked(ctx->dfl_pwq);

		free_workqueue_attrs(ctx->attrs);

		kfree(ctx);
	}
}

/* allocate the attrs and pwqs for later installation */
static struct apply_wqattrs_ctx *
apply_wqattrs_prepare(struct workqueue_struct *wq,
		      const struct workqueue_attrs *attrs)
{
	struct apply_wqattrs_ctx *ctx;
	struct workqueue_attrs *new_attrs, *tmp_attrs;
	int node;

	lockdep_assert_held(&wq_pool_mutex);
treewide: Use struct_size() for kmalloc()-family

One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:

// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
//                      sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@

- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-05-08 13:45:50 -07:00
	ctx = kzalloc(struct_size(ctx, pwq_tbl, nr_node_ids), GFP_KERNEL);

	new_attrs = alloc_workqueue_attrs();
	tmp_attrs = alloc_workqueue_attrs();
	if (!ctx || !new_attrs || !tmp_attrs)
		goto out_free;

	/*
	 * Calculate the attrs of the default pwq.
	 * If the user configured cpumask doesn't overlap with the
	 * wq_unbound_cpumask, we fall back to the wq_unbound_cpumask.
	 */
	copy_workqueue_attrs(new_attrs, attrs);
	cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
	if (unlikely(cpumask_empty(new_attrs->cpumask)))
		cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);

	/*
	 * We may create multiple pwqs with differing cpumasks. Make a
	 * copy of @new_attrs which will be modified and used to obtain
	 * pools.
	 */
	copy_workqueue_attrs(tmp_attrs, new_attrs);

	/*
	 * If something goes wrong during CPU up/down, we'll fall back to
	 * the default pwq covering whole @attrs->cpumask. Always create
	 * it even if we don't use it immediately.
	 */
	ctx->dfl_pwq = alloc_unbound_pwq(wq, new_attrs);
	if (!ctx->dfl_pwq)
		goto out_free;
for_each_node ( node ) {
2015-04-30 17:16:12 +08:00
if ( wq_calc_node_cpumask ( new_attrs , node , - 1 , tmp_attrs - > cpumask ) ) {
2015-04-27 17:58:38 +08:00
ctx - > pwq_tbl [ node ] = alloc_unbound_pwq ( wq , tmp_attrs ) ;
if ( ! ctx - > pwq_tbl [ node ] )
goto out_free ;
2013-04-01 11:23:36 -07:00
} else {
2015-04-27 17:58:38 +08:00
ctx - > dfl_pwq - > refcnt + + ;
ctx - > pwq_tbl [ node ] = ctx - > dfl_pwq ;
2013-04-01 11:23:36 -07:00
}
}
2015-04-30 17:16:12 +08:00
/* save the user-configured attrs and sanitize them */
copy_workqueue_attrs ( new_attrs , attrs ) ;
cpumask_and ( new_attrs - > cpumask , new_attrs - > cpumask , cpu_possible_mask ) ;
2015-04-27 17:58:38 +08:00
ctx - > attrs = new_attrs ;
2015-04-30 17:16:12 +08:00
2015-04-27 17:58:38 +08:00
ctx - > wq = wq ;
free_workqueue_attrs ( tmp_attrs ) ;
return ctx ;
out_free :
free_workqueue_attrs ( tmp_attrs ) ;
free_workqueue_attrs ( new_attrs ) ;
apply_wqattrs_cleanup ( ctx ) ;
return NULL ;
}
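The per-node decision made in the loop above (a dedicated pwq for a node, or the shared default pwq) can be modelled in user space. The following is a minimal sketch, not the kernel implementation: `cpumask_model_t` and `calc_node_cpumask_model()` are hypothetical names, a cpumask is approximated by a 64-bit word, and the rule it encodes is the one described in the commit message (a node gets its own pwq only if it has online CPUs the workqueue wants, otherwise the whole workqueue mask is used).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit N set means CPU N is in the mask. */
typedef uint64_t cpumask_model_t;

/*
 * Model of the per-node decision. Returns true and stores the node-local
 * mask in *out when the node should get its own pwq. Returns false and
 * stores the whole workqueue mask in *out when the default pwq should be
 * used instead. A negative cpu_going_down means no CPU is being unplugged.
 */
static bool calc_node_cpumask_model(cpumask_model_t wq_mask,
                                    cpumask_model_t node_possible,
                                    cpumask_model_t online,
                                    int cpu_going_down,
                                    cpumask_model_t *out)
{
    /* CPUs of this node that are online and wanted by the workqueue. */
    cpumask_model_t usable = wq_mask & node_possible & online;

    /* A CPU that is about to go down must not count as usable. */
    if (cpu_going_down >= 0)
        usable &= ~((cpumask_model_t)1 << cpu_going_down);

    if (usable == 0) {
        *out = wq_mask;     /* fall back to the default pwq's mask */
        return false;
    }

    /* Cover all possible CPUs of the node that the workqueue wants. */
    *out = wq_mask & node_possible;

    /* A mask identical to the default one needs no separate pwq. */
    return *out != wq_mask;
}
```

In this model, a workqueue allowed on CPUs 0-7 gets a dedicated mask of CPUs 0-3 for a node owning those CPUs, while a node whose only wanted CPU is going down falls back to the full mask.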
/* set attrs and install prepared pwqs, @ctx points to old pwqs on return */
static void apply_wqattrs_commit ( struct apply_wqattrs_ctx * ctx )
{
int node ;
2013-03-12 11:30:04 -07:00
2013-04-01 11:23:36 -07:00
/* all pwqs have been created successfully, let's install'em */
2015-04-27 17:58:38 +08:00
mutex_lock ( & ctx - > wq - > mutex ) ;
2013-04-01 11:23:32 -07:00
2015-04-27 17:58:38 +08:00
copy_workqueue_attrs ( ctx - > wq - > unbound_attrs , ctx - > attrs ) ;
2013-04-01 11:23:36 -07:00
/* save the previous pwq and install the new one */
2013-04-01 11:23:35 -07:00
for_each_node ( node )
2015-04-27 17:58:38 +08:00
ctx - > pwq_tbl [ node ] = numa_pwq_tbl_install ( ctx - > wq , node ,
ctx - > pwq_tbl [ node ] ) ;
2013-04-01 11:23:36 -07:00
/* @dfl_pwq might not have been used, ensure it's linked */
2015-04-27 17:58:38 +08:00
link_pwq ( ctx - > dfl_pwq ) ;
swap ( ctx - > wq - > dfl_pwq , ctx - > dfl_pwq ) ;
2013-04-01 11:23:35 -07:00
2015-04-27 17:58:38 +08:00
mutex_unlock ( & ctx - > wq - > mutex ) ;
}
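The commit step above installs the prepared entries and leaves the previous ones in the context, so a later cleanup pass can release them. A rough user-space sketch of that swap-based install follows. The types `model_wq` and `model_ctx`, the constant `MODEL_NR_NODES`, and the use of plain integer ids in place of pwq pointers are all assumptions made for illustration, and locking is omitted.

```c
#include <assert.h>

#define MODEL_NR_NODES 4   /* hypothetical node count for the model */

/* Live state of a workqueue: one entry per node plus a default entry. */
struct model_wq {
    int table[MODEL_NR_NODES];
    int dfl;
};

/* Prepared state; after commit it holds the previous live entries. */
struct model_ctx {
    int table[MODEL_NR_NODES];
    int dfl;
};

/* Install every prepared entry and hand the old one back through ctx. */
static void model_commit(struct model_wq *wq, struct model_ctx *ctx)
{
    for (int node = 0; node < MODEL_NR_NODES; node++) {
        int old = wq->table[node];

        wq->table[node] = ctx->table[node];
        ctx->table[node] = old;
    }

    /* The default entry is exchanged the same way. */
    int old_dfl = wq->dfl;

    wq->dfl = ctx->dfl;
    ctx->dfl = old_dfl;
}
```

Because the context ends up owning exactly the entries that were replaced, the same cleanup routine can serve both the failure path (releasing never-installed entries) and the success path (releasing the old ones).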
2013-03-12 11:30:04 -07:00
2015-05-19 18:03:47 +08:00
static void apply_wqattrs_lock ( void )
{
/* CPUs should stay stable across pwq creations and installations */
2021-08-03 16:16:20 +02:00
cpus_read_lock ( ) ;
2015-05-19 18:03:47 +08:00
mutex_lock ( & wq_pool_mutex ) ;
}
static void apply_wqattrs_unlock ( void )
{
mutex_unlock ( & wq_pool_mutex ) ;
2021-08-03 16:16:20 +02:00
cpus_read_unlock ( ) ;
2015-05-19 18:03:47 +08:00
}
static int apply_workqueue_attrs_locked ( struct workqueue_struct * wq ,
const struct workqueue_attrs * attrs )
2015-04-27 17:58:38 +08:00
{
struct apply_wqattrs_ctx * ctx ;
2013-04-01 11:23:36 -07:00
2015-04-27 17:58:38 +08:00
/* only unbound workqueues can change attributes */
if ( WARN_ON ( ! ( wq - > flags & WQ_UNBOUND ) ) )
return - EINVAL ;
2013-04-01 11:23:31 -07:00
2015-04-27 17:58:38 +08:00
/* creating multiple pwqs breaks ordering guarantee */
2017-07-23 08:36:15 -04:00
if ( ! list_empty ( & wq - > pwqs ) ) {
if ( WARN_ON ( wq - > flags & __WQ_ORDERED_EXPLICIT ) )
return - EINVAL ;
wq - > flags & = ~ __WQ_ORDERED ;
}
2015-04-27 17:58:38 +08:00
ctx = apply_wqattrs_prepare ( wq , attrs ) ;
2016-01-07 20:38:59 +08:00
if ( ! ctx )
return - ENOMEM ;
2015-04-27 17:58:38 +08:00
/* the ctx has been prepared successfully, let's commit it */
2016-01-07 20:38:59 +08:00
apply_wqattrs_commit ( ctx ) ;
2015-04-27 17:58:38 +08:00
apply_wqattrs_cleanup ( ctx ) ;
2016-01-07 20:38:59 +08:00
return 0 ;
2013-03-12 11:30:04 -07:00
}
2015-05-19 18:03:47 +08:00
/**
* apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
* @ wq : the target workqueue
* @ attrs : the workqueue_attrs to apply , allocated with alloc_workqueue_attrs ( )
*
* Apply @ attrs to an unbound workqueue @ wq . Unless disabled , on NUMA
* machines , this function maps a separate pwq to each NUMA node with
* possible CPUs in @ attrs - > cpumask so that work items are affine to the
* NUMA node they were issued on . Older pwqs are released as in - flight work
* items finish . Note that a work item which repeatedly requeues itself
* back - to - back will stay on its current pwq .
*
* Performs GFP_KERNEL allocations .
*
2021-08-03 16:16:20 +02:00
* Assumes caller has CPU hotplug read exclusion , i . e . cpus_read_lock ( ) .
2019-09-05 21:40:23 -04:00
*
2015-05-19 18:03:47 +08:00
* Return : 0 on success and - errno on failure .
*/
2019-09-05 21:40:22 -04:00
int apply_workqueue_attrs ( struct workqueue_struct * wq ,
2015-05-19 18:03:47 +08:00
const struct workqueue_attrs * attrs )
{
int ret ;
2019-09-05 21:40:23 -04:00
lockdep_assert_cpus_held ( ) ;
mutex_lock ( & wq_pool_mutex ) ;
2015-05-19 18:03:47 +08:00
ret = apply_workqueue_attrs_locked ( wq , attrs ) ;
2019-09-05 21:40:23 -04:00
mutex_unlock ( & wq_pool_mutex ) ;
2015-05-19 18:03:47 +08:00
return ret ;
}
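The kernel-doc above notes that older pwqs are released as in-flight work items finish, and the prepare path bumps the default pwq's reference count once for every node that reuses it. The following is a minimal sketch of that reference-counting idea, assuming a hypothetical `model_pwq` type with `model_get()` and `model_put()` helpers. It is single-threaded and ignores the locking and deferred release a real implementation would need.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a refcounted pwq. */
struct model_pwq {
    int refcnt;      /* number of holders: table slots, in-flight work */
    bool released;   /* set once the last reference is dropped */
};

/* Take an extra reference, e.g. when another node slot reuses the pwq. */
static void model_get(struct model_pwq *pwq)
{
    assert(pwq->refcnt > 0);   /* must already be held by someone */
    pwq->refcnt++;
}

/* Drop one reference; the object is released when the count hits zero. */
static void model_put(struct model_pwq *pwq)
{
    assert(pwq->refcnt > 0);
    if (--pwq->refcnt == 0)
        pwq->released = true;
}
```

With one base reference and two node slots sharing the object, three puts are needed before it is released, which mirrors why a replaced pwq can outlive the attribute change while holders remain.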
2013-04-01 11:23:36 -07:00
/**
* wq_update_unbound_numa - update NUMA affinity of a wq for CPU hot [ un ] plug
* @ wq : the target workqueue
* @ cpu : the CPU coming up or going down
* @ online : whether @ cpu is coming up or going down
*
* This function is to be called from % CPU_DOWN_PREPARE , % CPU_ONLINE and
* % CPU_DOWN_FAILED . @ cpu is being hot [ un ] plugged , update NUMA affinity of
* @ wq accordingly .
*
* If NUMA affinity can ' t be adjusted due to memory allocation failure , it
* falls back to @ wq - > dfl_pwq which may not be optimal but is always
* correct .
*
* Note that when the last allowed CPU of a NUMA node goes offline for a
* workqueue with a cpumask spanning multiple nodes , the workers which were
* already executing the work items for the workqueue will lose their CPU
* affinity and may execute on any CPU . This is similar to how per - cpu
* workqueues behave on CPU_DOWN . If a workqueue user wants strict
* affinity , it ' s the user ' s responsibility to flush the work item from
* CPU_DOWN_PREPARE .
*/
static void wq_update_unbound_numa ( struct workqueue_struct * wq , int cpu ,
bool online )
{
int node = cpu_to_node ( cpu ) ;
int cpu_off = online ? - 1 : cpu ;
struct pool_workqueue * old_pwq = NULL , * pwq ;
struct workqueue_attrs * target_attrs ;
cpumask_t * cpumask ;
lockdep_assert_held ( & wq_pool_mutex ) ;
2015-05-12 20:32:30 +08:00
if ( ! wq_numa_enabled | | ! ( wq - > flags & WQ_UNBOUND ) | |
wq - > unbound_attrs - > no_numa )
2013-04-01 11:23:36 -07:00
return ;
/*
* We don ' t wanna alloc / free wq_attrs for each wq for each CPU .
* Let ' s use a preallocated one . The following buf is protected by
* CPU hotplug exclusion .
*/
target_attrs = wq_update_unbound_numa_attrs_buf ;
cpumask = target_attrs - > cpumask ;
copy_workqueue_attrs ( target_attrs , wq - > unbound_attrs ) ;
pwq = unbound_pwq_by_node ( wq , node ) ;
/*
* Let ' s determine what needs to be done . If the target cpumask is
2015-04-30 17:16:12 +08:00
* different from the default pwq ' s , we need to compare it to @ pwq ' s
* and create a new one if they don ' t match . If the target cpumask
* equals the default pwq ' s , the default pwq should be used .
2013-04-01 11:23:36 -07:00
*/
2015-04-30 17:16:12 +08:00
if ( wq_calc_node_cpumask ( wq - > dfl_pwq - > pool - > attrs , node , cpu_off , cpumask ) ) {
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:36 -07:00
if ( cpumask_equal ( cpumask , pwq - > pool - > attrs - > cpumask ) )
2015-05-12 20:32:30 +08:00
return ;
2013-04-01 11:23:36 -07:00
} else {
2014-04-18 09:08:14 +09:00
goto use_dfl_pwq ;
2013-04-01 11:23:36 -07:00
}
/* create a new pwq */
pwq = alloc_unbound_pwq ( wq , target_attrs ) ;
if ( ! pwq ) {
2014-05-12 13:59:35 -04:00
pr_warn ( " workqueue: allocation failed while updating NUMA affinity of \" %s \" \n " ,
wq - > name ) ;
2014-04-16 14:32:29 +09:00
goto use_dfl_pwq ;
2013-04-01 11:23:36 -07:00
}
2015-05-12 20:32:30 +08:00
/* Install the new pwq. */
2013-04-01 11:23:36 -07:00
mutex_lock ( & wq - > mutex ) ;
old_pwq = numa_pwq_tbl_install ( wq , node , pwq ) ;
goto out_unlock ;
use_dfl_pwq :
2015-05-12 20:32:30 +08:00
mutex_lock ( & wq - > mutex ) ;
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & wq - > dfl_pwq - > pool - > lock ) ;
2013-04-01 11:23:36 -07:00
get_pwq ( wq - > dfl_pwq ) ;
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & wq - > dfl_pwq - > pool - > lock ) ;
2013-04-01 11:23:36 -07:00
old_pwq = numa_pwq_tbl_install ( wq , node , wq - > dfl_pwq ) ;
out_unlock :
mutex_unlock ( & wq - > mutex ) ;
put_pwq_unlocked ( old_pwq ) ;
}
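The fallback logic above can be sketched in plain userspace C. This is a minimal illustration, not the kernel implementation: the `demo_*` types and functions, the fixed node count, and the bare integer reference count are all assumptions made for the example, and locking is omitted entirely. It shows the two paths: install a freshly created per-node pwq, or, when none is available, take an extra reference on the default pwq and install that instead. In both cases the previous table entry is returned and released.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_NODES 4 /* hypothetical node count for the sketch */

struct demo_pwq {
	int refcnt; /* simplified stand-in for the pwq reference count */
};

struct demo_wq {
	struct demo_pwq *dfl_pwq;                 /* always present fallback */
	struct demo_pwq *node_tbl[DEMO_NR_NODES]; /* one entry per node */
};

/* Install @pwq for @node and hand back the entry it replaced. */
static struct demo_pwq *demo_tbl_install(struct demo_wq *wq, int node,
					 struct demo_pwq *pwq)
{
	struct demo_pwq *old;

	assert(node >= 0 && node < DEMO_NR_NODES);
	old = wq->node_tbl[node];
	wq->node_tbl[node] = pwq;
	return old;
}

/* Drop one reference; NULL is tolerated, as for an empty table slot. */
static void demo_put(struct demo_pwq *pwq)
{
	if (pwq)
		pwq->refcnt--;
}

/*
 * Update the entry for @node. A non-NULL @new_pwq models a successful
 * allocation; NULL models either "no per-node pwq wanted" or an
 * allocation failure, both of which fall back to the default pwq.
 */
static void demo_update_node(struct demo_wq *wq, int node,
			     struct demo_pwq *new_pwq)
{
	struct demo_pwq *old;

	if (new_pwq) {
		old = demo_tbl_install(wq, node, new_pwq);
	} else {
		wq->dfl_pwq->refcnt++; /* the table slot owns this reference */
		old = demo_tbl_install(wq, node, wq->dfl_pwq);
	}
	demo_put(old);
}
```

Because the fallback path only bumps a counter on an object that already exists, it cannot fail, which is the property the commit message relies on for CPU hotplug.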
2013-03-12 11:29:57 -07:00
static int alloc_and_link_pwqs ( struct workqueue_struct * wq )
2010-06-29 10:07:11 +02:00
{
2013-03-12 11:29:58 -07:00
bool highpri = wq - > flags & WQ_HIGHPRI ;
2013-09-05 12:30:04 -04:00
int cpu , ret ;
2013-03-12 11:29:57 -07:00
if ( ! ( wq - > flags & WQ_UNBOUND ) ) {
2013-03-12 11:29:59 -07:00
wq - > cpu_pwqs = alloc_percpu ( struct pool_workqueue ) ;
if ( ! wq - > cpu_pwqs )
2013-03-12 11:29:57 -07:00
return - ENOMEM ;
for_each_possible_cpu ( cpu ) {
2013-03-12 11:30:00 -07:00
struct pool_workqueue * pwq =
per_cpu_ptr ( wq - > cpu_pwqs , cpu ) ;
2013-03-12 11:30:03 -07:00
struct worker_pool * cpu_pools =
2013-03-12 11:30:03 -07:00
per_cpu ( cpu_worker_pools , cpu ) ;
2010-07-02 10:03:51 +02:00
2013-04-01 11:23:35 -07:00
init_pwq ( pwq , wq , & cpu_pools [ highpri ] ) ;
mutex_lock ( & wq - > mutex ) ;
2013-04-01 11:23:35 -07:00
link_pwq ( pwq ) ;
2013-04-01 11:23:35 -07:00
mutex_unlock ( & wq - > mutex ) ;
2013-03-12 11:29:57 -07:00
}
2013-03-12 11:30:04 -07:00
return 0 ;
2019-09-05 21:40:23 -04:00
}
2021-08-03 16:16:20 +02:00
cpus_read_lock ( ) ;
2019-09-05 21:40:23 -04:00
if ( wq - > flags & __WQ_ORDERED ) {
2013-09-05 12:30:04 -04:00
ret = apply_workqueue_attrs ( wq , ordered_wq_attrs [ highpri ] ) ;
/* there should only be single pwq for ordering guarantee */
WARN ( ! ret & & ( wq - > pwqs . next ! = & wq - > dfl_pwq - > pwqs_node | |
wq - > pwqs . prev ! = & wq - > dfl_pwq - > pwqs_node ) ,
" ordering guarantee broken for workqueue %s \n " , wq - > name ) ;
2013-03-12 11:29:57 -07:00
} else {
2019-09-05 21:40:23 -04:00
ret = apply_workqueue_attrs ( wq , unbound_std_wq_attrs [ highpri ] ) ;
2013-03-12 11:29:57 -07:00
}
2021-08-03 16:16:20 +02:00
cpus_read_unlock ( ) ;
2019-09-05 21:40:23 -04:00
return ret ;
2010-06-29 10:07:11 +02:00
}
2010-07-02 10:03:51 +02:00
static int wq_clamp_max_active ( int max_active , unsigned int flags ,
const char * name )
2010-06-29 10:07:14 +02:00
{
2010-07-02 10:03:51 +02:00
int lim = flags & WQ_UNBOUND ? WQ_UNBOUND_MAX_ACTIVE : WQ_MAX_ACTIVE ;
if ( max_active < 1 | | max_active > lim )
2012-08-19 00:52:42 +03:00
pr_warn ( " workqueue: max_active %d requested for %s is out of range, clamping between %d and %d \n " ,
max_active , name , 1 , lim ) ;
2010-06-29 10:07:14 +02:00
2010-07-02 10:03:51 +02:00
return clamp_val ( max_active , 1 , lim ) ;
2010-06-29 10:07:14 +02:00
}
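The clamping behavior can be tried outside the kernel with a small sketch. The limits and the flag bit below are hypothetical placeholders (the kernel defines its own `WQ_MAX_ACTIVE`, `WQ_UNBOUND_MAX_ACTIVE` and `WQ_UNBOUND`), and `fprintf` stands in for `pr_warn`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical values for illustration; not taken from kernel headers. */
#define DEMO_MAX_ACTIVE         512
#define DEMO_UNBOUND_MAX_ACTIVE 4096
#define DEMO_FLAG_UNBOUND       (1u << 1)

/* Clamp a requested max_active into [1, lim], warning when out of range. */
static int demo_clamp_max_active(int max_active, unsigned int flags,
				 const char *name)
{
	int lim = (flags & DEMO_FLAG_UNBOUND) ? DEMO_UNBOUND_MAX_ACTIVE
					      : DEMO_MAX_ACTIVE;

	assert(name != NULL);

	if (max_active < 1 || max_active > lim)
		fprintf(stderr,
			"max_active %d requested for %s is out of range, clamping between %d and %d\n",
			max_active, name, 1, lim);

	if (max_active < 1)
		return 1;
	if (max_active > lim)
		return lim;
	return max_active;
}
```

Note that the caller in this file substitutes a default for a `max_active` of zero before clamping, so the warning path is reached only for negative or oversized requests.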
2018-01-08 05:38:32 -08:00
/*
 * Workqueues which may be used during memory reclaim should have a rescuer
 * to guarantee forward progress.
 */
static int init_rescuer ( struct workqueue_struct * wq )
{
struct worker * rescuer ;
2020-05-08 18:07:40 +03:00
int ret ;
2018-01-08 05:38:32 -08:00
if ( ! ( wq - > flags & WQ_MEM_RECLAIM ) )
return 0 ;
rescuer = alloc_worker ( NUMA_NO_NODE ) ;
if ( ! rescuer )
return - ENOMEM ;
rescuer - > rescue_wq = wq ;
rescuer - > task = kthread_create ( rescuer_thread , rescuer , " %s " , wq - > name ) ;
2020-04-29 12:04:13 +08:00
if ( IS_ERR ( rescuer - > task ) ) {
2020-05-08 18:07:40 +03:00
ret = PTR_ERR ( rescuer - > task ) ;
2018-01-08 05:38:32 -08:00
kfree ( rescuer ) ;
2020-05-08 18:07:40 +03:00
return ret ;
2018-01-08 05:38:32 -08:00
}
wq - > rescuer = rescuer ;
kthread_bind_mask ( rescuer - > task , cpu_possible_mask ) ;
wake_up_process ( rescuer - > task ) ;
return 0 ;
}
2019-03-12 21:21:26 +01:00
__printf ( 1 , 4 )
2019-02-14 15:00:54 -08:00
struct workqueue_struct * alloc_workqueue ( const char * fmt ,
unsigned int flags ,
int max_active , . . . )
2005-04-16 15:20:36 -07:00
{
2013-04-01 11:23:35 -07:00
size_t tbl_size = 0 ;
2013-04-01 11:23:34 -07:00
va_list args ;
2005-04-16 15:20:36 -07:00
struct workqueue_struct * wq ;
2013-03-12 11:29:58 -07:00
struct pool_workqueue * pwq ;
2012-01-10 15:11:35 -08:00
2017-07-18 18:41:52 -04:00
/*
 * Unbound && max_active == 1 used to imply ordered, which is no
 * longer the case on NUMA machines due to per-node pools. While
 * alloc_ordered_workqueue() is the right way to create an ordered
 * workqueue, keep the previous behavior to avoid subtle breakages
 * on NUMA.
 */
if ( ( flags & WQ_UNBOUND ) & & max_active = = 1 )
flags | = __WQ_ORDERED ;
2013-04-08 16:45:40 +05:30
/* see the comment above the definition of WQ_POWER_EFFICIENT */
if ( ( flags & WQ_POWER_EFFICIENT ) & & wq_power_efficient )
flags | = WQ_UNBOUND ;
2013-04-01 11:23:34 -07:00
/* allocate wq and format name */
2013-04-01 11:23:35 -07:00
if ( flags & WQ_UNBOUND )
2014-07-22 13:05:40 +08:00
tbl_size = nr_node_ids * sizeof ( wq - > numa_pwq_tbl [ 0 ] ) ;
2013-04-01 11:23:35 -07:00
wq = kzalloc ( sizeof ( * wq ) + tbl_size , GFP_KERNEL ) ;
2012-01-10 15:11:35 -08:00
if ( ! wq )
2013-03-12 11:30:04 -07:00
return NULL ;
2012-01-10 15:11:35 -08:00
2013-04-01 11:23:34 -07:00
if ( flags & WQ_UNBOUND ) {
2019-06-26 16:52:38 +02:00
wq - > unbound_attrs = alloc_workqueue_attrs ( ) ;
2013-04-01 11:23:34 -07:00
if ( ! wq - > unbound_attrs )
goto err_free_wq ;
}
2019-02-14 15:00:54 -08:00
va_start ( args , max_active ) ;
2013-04-01 11:23:34 -07:00
vsnprintf ( wq - > name , sizeof ( wq - > name ) , fmt , args ) ;
2012-01-10 15:11:35 -08:00
va_end ( args ) ;
2005-04-16 15:20:36 -07:00
2010-06-29 10:07:14 +02:00
max_active = max_active ? : WQ_DFL_ACTIVE ;
2012-01-10 15:11:35 -08:00
max_active = wq_clamp_max_active ( max_active , flags , wq - > name ) ;
2007-05-09 02:34:09 -07:00
2012-01-10 15:11:35 -08:00
/* init wq */
2010-06-29 10:07:10 +02:00
wq - > flags = flags ;
2010-06-29 10:07:12 +02:00
wq - > saved_max_active = max_active ;
2013-03-25 16:57:17 -07:00
mutex_init ( & wq - > mutex ) ;
2013-02-13 19:29:12 -08:00
atomic_set ( & wq - > nr_pwqs_to_flush , 0 ) ;
2013-03-12 11:29:57 -07:00
INIT_LIST_HEAD ( & wq - > pwqs ) ;
2010-06-29 10:07:11 +02:00
INIT_LIST_HEAD ( & wq - > flusher_queue ) ;
INIT_LIST_HEAD ( & wq - > flusher_overflow ) ;
2013-03-12 11:29:59 -07:00
INIT_LIST_HEAD ( & wq - > maydays ) ;
2010-06-29 10:07:13 +02:00
2019-02-14 15:00:54 -08:00
wq_init_lockdep ( wq ) ;
2007-05-09 02:34:13 -07:00
INIT_LIST_HEAD ( & wq - > list ) ;
2007-05-09 02:34:09 -07:00
2013-03-12 11:29:57 -07:00
if ( alloc_and_link_pwqs ( wq ) < 0 )
2019-03-11 16:02:55 -07:00
goto err_unreg_lockdep ;
2010-06-29 10:07:11 +02:00
2018-01-08 05:38:37 -08:00
if ( wq_online & & init_rescuer ( wq ) < 0 )
2018-01-08 05:38:32 -08:00
goto err_destroy ;
2007-05-09 02:34:09 -07:00
2013-03-12 11:30:05 -07:00
if ( ( wq - > flags & WQ_SYSFS ) & & workqueue_sysfs_register ( wq ) )
goto err_destroy ;
2010-06-29 10:07:12 +02:00
/*
2013-03-25 16:57:17 -07:00
 * wq_pool_mutex protects global freeze state and workqueues list.
 * Grab it, adjust max_active and add the new @wq to workqueues
 * list.
2010-06-29 10:07:12 +02:00
*/
2013-03-25 16:57:17 -07:00
mutex_lock ( & wq_pool_mutex ) ;
2010-06-29 10:07:12 +02:00
2013-03-25 16:57:19 -07:00
mutex_lock ( & wq - > mutex ) ;
2013-03-13 16:51:35 -07:00
for_each_pwq ( pwq , wq )
pwq_adjust_max_active ( pwq ) ;
2013-03-25 16:57:19 -07:00
mutex_unlock ( & wq - > mutex ) ;
2010-06-29 10:07:12 +02:00
2015-03-09 09:22:28 -04:00
list_add_tail_rcu ( & wq - > list , & workqueues ) ;
2010-06-29 10:07:12 +02:00
2013-03-25 16:57:17 -07:00
mutex_unlock ( & wq_pool_mutex ) ;
2010-06-29 10:07:11 +02:00
2007-05-09 02:34:09 -07:00
return wq ;
2013-03-12 11:30:04 -07:00
2019-03-11 16:02:55 -07:00
err_unreg_lockdep :
2019-03-03 14:00:46 -08:00
wq_unregister_lockdep ( wq ) ;
wq_free_lockdep ( wq ) ;
2019-03-11 16:02:55 -07:00
err_free_wq :
2013-04-01 11:23:34 -07:00
free_workqueue_attrs ( wq - > unbound_attrs ) ;
2013-03-12 11:30:04 -07:00
kfree ( wq ) ;
return NULL ;
err_destroy :
destroy_workqueue ( wq ) ;
2010-06-29 10:07:10 +02:00
return NULL ;
2007-05-09 02:34:09 -07:00
}
2019-02-14 15:00:54 -08:00
EXPORT_SYMBOL_GPL ( alloc_workqueue ) ;
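The flag adjustments at the top of alloc_workqueue() are order sensitive, which a small sketch makes easier to see. The flag bits and the `demo_normalize_flags()` helper are hypothetical; only the two conditions and their order are taken from the function above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration only; not the kernel's values. */
#define DEMO_WQ_UNBOUND         (1u << 0)
#define DEMO_WQ_POWER_EFFICIENT (1u << 1)
#define DEMO_WQ_ORDERED         (1u << 2)

static unsigned int demo_normalize_flags(unsigned int flags, int max_active,
					 bool power_efficient_enabled)
{
	/* Unbound with max_active == 1 keeps its historical ordered behavior. */
	if ((flags & DEMO_WQ_UNBOUND) && max_active == 1)
		flags |= DEMO_WQ_ORDERED;

	/* Power-efficient workqueues become unbound only when the knob is on. */
	if ((flags & DEMO_WQ_POWER_EFFICIENT) && power_efficient_enabled)
		flags |= DEMO_WQ_UNBOUND;

	return flags;
}

/* Usage example: the ordered check runs before the unbound promotion. */
static void demo_flags_selfcheck(void)
{
	unsigned int f = demo_normalize_flags(DEMO_WQ_POWER_EFFICIENT, 1, true);

	assert(f & DEMO_WQ_UNBOUND);
	assert(!(f & DEMO_WQ_ORDERED));
}
```

In this sketch a power-efficient workqueue created with max_active of 1 is promoted to unbound but is not marked ordered, because the ordered check has already run by the time the unbound bit is set.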
2005-04-16 15:20:36 -07:00
2019-09-23 11:08:58 -07:00
static bool pwq_busy ( struct pool_workqueue * pwq )
{
int i ;
for ( i = 0 ; i < WORK_NR_COLORS ; i + + )
if ( pwq - > nr_in_flight [ i ] )
return true ;
if ( ( pwq ! = pwq - > wq - > dfl_pwq ) & & ( pwq - > refcnt > 1 ) )
return true ;
2021-08-17 09:32:34 +08:00
if ( pwq - > nr_active | | ! list_empty ( & pwq - > inactive_works ) )
2019-09-23 11:08:58 -07:00
return true ;
return false ;
}
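The three busy conditions can be exercised in isolation with a simplified model. The struct below is a hypothetical stand-in: the color count is arbitrary, `nr_inactive` replaces the inactive work list with a counter, and `is_default` replaces the comparison against `wq->dfl_pwq`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_COLORS 16 /* hypothetical; stands in for WORK_NR_COLORS */

struct demo_busy_pwq {
	int nr_in_flight[DEMO_NR_COLORS]; /* in-flight work items per color */
	int refcnt;                       /* simplified reference count */
	int nr_active;                    /* currently active work items */
	int nr_inactive;                  /* stand-in for the inactive list */
	bool is_default;                  /* true for the fallback pwq */
};

static bool demo_pwq_busy(const struct demo_busy_pwq *pwq)
{
	int i;

	assert(pwq != NULL);

	/* Any in-flight work item, of any color, means busy. */
	for (i = 0; i < DEMO_NR_COLORS; i++)
		if (pwq->nr_in_flight[i])
			return true;

	/* Extra references count as busy, except on the default pwq. */
	if (!pwq->is_default && pwq->refcnt > 1)
		return true;

	/* Active or still-queued inactive work items mean busy. */
	if (pwq->nr_active || pwq->nr_inactive)
		return true;

	return false;
}
```

The default pwq is exempt from the reference count test, presumably because per-node table entries may each hold a reference on it, as in the fallback path earlier in this file.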
2007-05-09 02:34:09 -07:00
/**
 * destroy_workqueue - safely terminate a workqueue
 * @wq: target workqueue
 *
 * Safely destroy a workqueue. All work currently pending will be done first.
 */
void destroy_workqueue ( struct workqueue_struct * wq )
{
2013-03-12 11:29:58 -07:00
struct pool_workqueue * pwq ;
2013-04-01 11:23:36 -07:00
int node ;
2007-05-09 02:34:09 -07:00
2019-09-18 18:43:40 -07:00
/*
 * Remove it from sysfs first so that sanity check failure doesn't
 * lead to sysfs name conflicts.
 */
workqueue_sysfs_unregister ( wq ) ;
2011-04-05 18:01:44 +02:00
/* drain it before proceeding with destruction */
drain_workqueue ( wq ) ;
2010-12-20 19:32:04 +01:00
2019-09-18 18:43:40 -07:00
/* kill rescuer, if sanity checks fail, leave it w/o rescuer */
if ( wq - > rescuer ) {
struct worker * rescuer = wq - > rescuer ;
/* this prevents new queueing */
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & wq_mayday_lock ) ;
2019-09-18 18:43:40 -07:00
wq - > rescuer = NULL ;
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & wq_mayday_lock ) ;
2019-09-18 18:43:40 -07:00
/* rescuer will empty maydays list before exiting */
kthread_stop ( rescuer - > task ) ;
2019-09-20 13:39:57 -07:00
kfree ( rescuer ) ;
2019-09-18 18:43:40 -07:00
}
2019-09-23 11:08:58 -07:00
/*
 * Sanity checks - grab all the locks so that we wait for all
 * in-flight operations which may do put_pwq().
 */
mutex_lock ( & wq_pool_mutex ) ;
2013-03-25 16:57:18 -07:00
mutex_lock ( & wq - > mutex ) ;
2013-03-12 11:29:58 -07:00
for_each_pwq ( pwq , wq ) {
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pwq - > pool - > lock ) ;
2019-09-23 11:08:58 -07:00
if ( WARN_ON ( pwq_busy ( pwq ) ) ) {
2019-11-28 08:47:49 +08:00
pr_warn ( " %s: %s has the following busy pwq \n " ,
__func__ , wq - > name ) ;
2019-09-23 11:08:58 -07:00
show_pwq ( pwq ) ;
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pwq - > pool - > lock ) ;
2013-03-25 16:57:18 -07:00
mutex_unlock ( & wq - > mutex ) ;
2019-09-23 11:08:58 -07:00
mutex_unlock ( & wq_pool_mutex ) ;
2021-10-20 14:09:00 +11:00
show_one_workqueue ( wq ) ;
2013-03-12 11:29:57 -07:00
return ;
2013-03-12 11:30:00 -07:00
}
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pwq - > pool - > lock ) ;
2013-03-12 11:29:57 -07:00
}
2013-03-25 16:57:18 -07:00
mutex_unlock ( & wq - > mutex ) ;
2013-03-12 11:29:57 -07:00
2010-06-29 10:07:12 +02:00
/*
* wq list is used to freeze wq , remove from list after
* flushing is complete in case freeze races us .
*/
2015-03-09 09:22:28 -04:00
list_del_rcu ( & wq - > list ) ;
2013-03-25 16:57:17 -07:00
mutex_unlock ( & wq_pool_mutex ) ;
2007-05-09 02:34:09 -07:00
2013-03-12 11:30:04 -07:00
if ( ! ( wq - > flags & WQ_UNBOUND ) ) {
2019-02-14 15:00:54 -08:00
wq_unregister_lockdep ( wq ) ;
2013-03-12 11:30:04 -07:00
/*
* The base ref is never dropped on per - cpu pwqs . Directly
2015-03-09 09:22:28 -04:00
* schedule RCU free .
2013-03-12 11:30:04 -07:00
*/
2018-11-06 19:18:45 -08:00
call_rcu ( & wq - > rcu , rcu_free_wq ) ;
2013-03-12 11:30:04 -07:00
} else {
/*
* We ' re the sole accessor of @ wq at this point . Directly
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node is executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this change the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:36 -07:00
		 * access numa_pwq_tbl[] and dfl_pwq to put the base refs.
		 * @wq will be freed when the last pwq is released.
2013-03-12 11:30:04 -07:00
		 */
2013-04-01 11:23:36 -07:00
		for_each_node(node) {
			pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
			RCU_INIT_POINTER(wq->numa_pwq_tbl[node], NULL);
			put_pwq_unlocked(pwq);
		}
		/*
		 * Put dfl_pwq. @wq may be freed any time after dfl_pwq is
		 * put. Don't access it afterwards.
		 */
		pwq = wq->dfl_pwq;
		wq->dfl_pwq = NULL;
2013-04-01 11:23:35 -07:00
		put_pwq_unlocked(pwq);
2013-03-12 11:30:03 -07:00
	}
2007-05-09 02:34:09 -07:00
}
EXPORT_SYMBOL_GPL(destroy_workqueue);
2010-06-29 10:07:14 +02:00
/**
 * workqueue_set_max_active - adjust max_active of a workqueue
 * @wq: target workqueue
 * @max_active: new max_active value.
 *
 * Set max_active of @wq to @max_active.
 *
 * CONTEXT:
 * Don't call from IRQ context.
 */
void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)
{
2013-03-12 11:29:58 -07:00
	struct pool_workqueue *pwq;
2010-06-29 10:07:14 +02:00
2013-03-12 11:30:04 -07:00
	/* disallow meddling with max_active for ordered workqueues */
2017-07-23 08:36:15 -04:00
	if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
2013-03-12 11:30:04 -07:00
		return;
2010-07-02 10:03:51 +02:00
	max_active = wq_clamp_max_active(max_active, wq->flags, wq->name);
2010-06-29 10:07:14 +02:00
2013-03-25 16:57:19 -07:00
	mutex_lock(&wq->mutex);
2010-06-29 10:07:14 +02:00
2017-07-23 08:36:15 -04:00
	wq->flags &= ~__WQ_ORDERED;
2010-06-29 10:07:14 +02:00
	wq->saved_max_active = max_active;
2013-03-13 16:51:35 -07:00
	for_each_pwq(pwq, wq)
		pwq_adjust_max_active(pwq);
2009-11-17 14:06:20 -08:00
2013-03-25 16:57:19 -07:00
	mutex_unlock(&wq->mutex);
2006-01-08 01:00:43 -08:00
}
2010-06-29 10:07:14 +02:00
EXPORT_SYMBOL_GPL(workqueue_set_max_active);
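/*
 * Illustrative userspace sketch, not part of the kernel source. The function
 * above passes the requested value through wq_clamp_max_active() before
 * storing it. The body of that helper is not shown here, so the sketch below
 * only assumes that it bounds the request to [1, limit], with a larger limit
 * for unbound workqueues. All SKETCH_* names and numeric limits are
 * hypothetical.
 */

```c
/* Hypothetical limits and flag bit, chosen for illustration only. */
#define SKETCH_PERCPU_LIMIT	512
#define SKETCH_UNBOUND_LIMIT	2048
#define SKETCH_FLAG_UNBOUND	(1u << 1)

/* Bound a requested max_active to [1, limit] for the given queue flags. */
static int sketch_clamp_max_active(int max_active, unsigned int flags)
{
	int lim = (flags & SKETCH_FLAG_UNBOUND) ?
		SKETCH_UNBOUND_LIMIT : SKETCH_PERCPU_LIMIT;

	if (max_active < 1)
		return 1;
	if (max_active > lim)
		return lim;
	return max_active;
}
```

/*
 * Under these assumptions a request of 0 becomes 1, and an oversized request
 * is cut down to the limit that applies to the queue type.
 */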
2006-01-08 01:00:43 -08:00
2018-02-11 10:38:28 +01:00
/**
 * current_work - retrieve %current task's work struct
 *
 * Determine if %current task is a workqueue worker and what it's working on.
 * Useful to find out the context that the %current task is running in.
 *
 * Return: work struct if %current task is a workqueue worker, %NULL otherwise.
 */
struct work_struct *current_work(void)
{
	struct worker *worker = current_wq_worker();
	return worker ? worker->current_work : NULL;
}
EXPORT_SYMBOL(current_work);
2013-03-12 17:41:37 -07:00
/**
 * current_is_workqueue_rescuer - is %current workqueue rescuer?
 *
 * Determine whether %current is a workqueue rescuer. Can be used from
 * work functions to determine whether it's being run off the rescuer task.
2013-07-31 14:59:24 -07:00
 *
 * Return: %true if %current is a workqueue rescuer. %false otherwise.
2013-03-12 17:41:37 -07:00
 */
bool current_is_workqueue_rescuer(void)
{
	struct worker *worker = current_wq_worker();
2013-03-20 03:28:03 +08:00
	return worker && worker->rescue_wq;
2013-03-12 17:41:37 -07:00
}
2010-02-12 17:39:21 +09:00
/**
2010-06-29 10:07:14 +02:00
 * workqueue_congested - test whether a workqueue is congested
 * @cpu: CPU in question
 * @wq: target workqueue
2010-02-12 17:39:21 +09:00
 *
2010-06-29 10:07:14 +02:00
 * Test whether @wq's cpu workqueue for @cpu is congested. There is
 * no synchronization around this function and the test result is
 * unreliable and only useful as advisory hints or for debugging.
2010-02-12 17:39:21 +09:00
 *
2013-05-10 11:10:17 -07:00
 * If @cpu is WORK_CPU_UNBOUND, the test is performed on the local CPU.
 * Note that both per-cpu and unbound workqueues may be associated with
 * multiple pool_workqueues which have separate congested states. A
 * workqueue being congested on one CPU doesn't mean the workqueue is also
 * congested on other CPUs / NUMA nodes.
 *
2013-07-31 14:59:24 -07:00
 * Return:
2010-06-29 10:07:14 +02:00
 * %true if congested, %false otherwise.
2010-02-12 17:39:21 +09:00
 */
2013-03-12 11:29:59 -07:00
bool workqueue_congested(int cpu, struct workqueue_struct *wq)
2005-04-16 15:20:36 -07:00
{
2013-03-12 11:30:00 -07:00
	struct pool_workqueue *pwq;
2013-03-12 11:30:00 -07:00
	bool ret;
2019-03-13 17:55:47 +01:00
	rcu_read_lock();
	preempt_disable();
2013-03-12 11:30:00 -07:00
2013-05-10 11:10:17 -07:00
	if (cpu == WORK_CPU_UNBOUND)
		cpu = smp_processor_id();
2013-03-12 11:30:00 -07:00
	if (!(wq->flags & WQ_UNBOUND))
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
	else
2013-04-01 11:23:35 -07:00
		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
2010-06-29 10:07:14 +02:00
2021-08-17 09:32:34 +08:00
	ret = !list_empty(&pwq->inactive_works);
2019-03-13 17:55:47 +01:00
	preempt_enable();
	rcu_read_unlock();
2013-03-12 11:30:00 -07:00
	return ret;
2005-04-16 15:20:36 -07:00
}
2010-06-29 10:07:14 +02:00
EXPORT_SYMBOL_GPL(workqueue_congested);
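/*
 * Illustrative userspace sketch, not part of the kernel source. The test
 * above boils down to "is the pwq's inactive_works list non-empty?". The
 * sketch models that with a minimal circular list. struct sketch_list and
 * struct sketch_pwq are hypothetical stand-ins for list_head and
 * pool_workqueue, and none of the kernel's RCU or preemption handling is
 * reproduced.
 */

```c
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct sketch_list {
	struct sketch_list *next, *prev;
};

static void sketch_list_init(struct sketch_list *head)
{
	head->next = head;
	head->prev = head;
}

static bool sketch_list_empty(const struct sketch_list *head)
{
	return head->next == head;
}

static void sketch_list_add_tail(struct sketch_list *item,
				 struct sketch_list *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Hypothetical stand-in for struct pool_workqueue: only the inactive list. */
struct sketch_pwq {
	struct sketch_list inactive_works;
};

/* Advisory test: congested means at least one work item is held inactive. */
static bool sketch_pwq_congested(const struct sketch_pwq *pwq)
{
	return !sketch_list_empty(&pwq->inactive_works);
}
```

/*
 * As in the kernel function, the answer is only a snapshot: another thread
 * could change the list right after the check, so callers should treat it
 * as a hint.
 */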
2005-04-16 15:20:36 -07:00
2010-06-29 10:07:14 +02:00
/**
 * work_busy - test whether a work is currently pending or running
 * @work: the work to be tested
 *
 * Test whether @work is currently pending or running. There is no
 * synchronization around this function and the test result is
 * unreliable and only useful as advisory hints or for debugging.
 *
2013-07-31 14:59:24 -07:00
 * Return:
2010-06-29 10:07:14 +02:00
 * OR'd bitmask of WORK_BUSY_* bits.
 */
unsigned int work_busy(struct work_struct *work)
2005-04-16 15:20:36 -07:00
{
2013-03-12 11:30:00 -07:00
	struct worker_pool *pool;
2010-06-29 10:07:14 +02:00
	unsigned long flags;
	unsigned int ret = 0;
2005-04-16 15:20:36 -07:00
2010-06-29 10:07:14 +02:00
	if (work_pending(work))
		ret |= WORK_BUSY_PENDING;
2005-04-16 15:20:36 -07:00
2019-03-13 17:55:47 +01:00
	rcu_read_lock();
2013-03-12 11:30:00 -07:00
	pool = get_work_pool(work);
2013-02-06 18:04:53 -08:00
	if (pool) {
2020-05-27 21:46:33 +02:00
		raw_spin_lock_irqsave(&pool->lock, flags);
2013-02-06 18:04:53 -08:00
		if (find_worker_executing_work(pool, work))
			ret |= WORK_BUSY_RUNNING;
2020-05-27 21:46:33 +02:00
		raw_spin_unlock_irqrestore(&pool->lock, flags);
2013-02-06 18:04:53 -08:00
	}
2019-03-13 17:55:47 +01:00
	rcu_read_unlock();
2005-04-16 15:20:36 -07:00
2010-06-29 10:07:14 +02:00
	return ret;
2005-04-16 15:20:36 -07:00
}
2010-06-29 10:07:14 +02:00
EXPORT_SYMBOL_GPL(work_busy);
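/*
 * Illustrative userspace sketch, not part of the kernel source. work_busy()
 * returns an OR'd bitmask, so a work item can be pending and running at the
 * same time. The sketch shows one way a caller might decode such a mask.
 * The SKETCH_BUSY_* values are assumptions standing in for the kernel's
 * WORK_BUSY_* bits, and sketch_busy_str() is a hypothetical helper.
 */

```c
/* Assumed bit layout, standing in for WORK_BUSY_PENDING / WORK_BUSY_RUNNING. */
enum {
	SKETCH_BUSY_PENDING = 1 << 0,
	SKETCH_BUSY_RUNNING = 1 << 1,
};

/* Map a busy bitmask to a short human-readable state name. */
static const char *sketch_busy_str(unsigned int busy)
{
	switch (busy & (SKETCH_BUSY_PENDING | SKETCH_BUSY_RUNNING)) {
	case 0:
		return "idle";
	case SKETCH_BUSY_PENDING:
		return "pending";
	case SKETCH_BUSY_RUNNING:
		return "running";
	default:
		return "pending+running";
	}
}
```

/*
 * The combined state can arise when a work item is queued again while a
 * worker is still executing its previous run.
 */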
2005-04-16 15:20:36 -07:00
2013-04-30 15:27:22 -07:00
/**
 * set_worker_desc - set description for the current work item
 * @fmt: printf-style format string
 * @...: arguments for the format string
 *
 * This function can be called by a running work function to describe what
 * the work item is about. If the worker task gets dumped, this
 * information will be printed out together to help debugging. The
 * description can be at most WORKER_DESC_LEN including the trailing '\0'.
 */
void set_worker_desc(const char *fmt, ...)
{
	struct worker *worker = current_wq_worker();
	va_list args;
	if (worker) {
		va_start(args, fmt);
		vsnprintf(worker->desc, sizeof(worker->desc), fmt, args);
		va_end(args);
	}
}
2018-05-17 19:14:57 +02:00
EXPORT_SYMBOL_GPL(set_worker_desc);
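/*
 * Illustrative userspace sketch, not part of the kernel source. It shows
 * the same pattern as set_worker_desc(): format into a fixed-size buffer
 * with vsnprintf(), which truncates and always NUL-terminates. The buffer
 * size SKETCH_DESC_LEN and struct sketch_worker are hypothetical and chosen
 * only for the example.
 */

```c
#include <stdarg.h>
#include <stdio.h>

#define SKETCH_DESC_LEN 24	/* assumed size, for illustration only */

/* Hypothetical stand-in for struct worker: only the description buffer. */
struct sketch_worker {
	char desc[SKETCH_DESC_LEN];
};

/* Format a description into the worker's fixed-size buffer. */
static void sketch_set_desc(struct sketch_worker *worker, const char *fmt, ...)
{
	va_list args;

	if (!worker)
		return;

	va_start(args, fmt);
	/* Writes at most sizeof(desc) - 1 characters plus the trailing '\0'. */
	vsnprintf(worker->desc, sizeof(worker->desc), fmt, args);
	va_end(args);
}
```

/*
 * A description longer than the buffer is silently cut to
 * SKETCH_DESC_LEN - 1 characters, which matches the "at most
 * WORKER_DESC_LEN including the trailing '\0'" wording above.
 */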
2013-04-30 15:27:22 -07:00
/**
 * print_worker_info - print out worker information and description
 * @log_lvl: the log level to use when printing
 * @task: target task
 *
 * If @task is a worker and currently executing a work item, print out the
 * name of the workqueue being serviced and worker description set with
 * set_worker_desc() by the currently executing work item.
 *
 * This function can be safely called on any task as long as the
 * task_struct itself is accessible. While safe, this function isn't
 * synchronized and may print out mixed-up or garbage output of limited length.
 */
void print_worker_info(const char *log_lvl, struct task_struct *task)
{
	work_func_t *fn = NULL;
	char name[WQ_NAME_LEN] = { };
	char desc[WORKER_DESC_LEN] = { };
	struct pool_workqueue *pwq = NULL;
	struct workqueue_struct *wq = NULL;
	struct worker *worker;
	if (!(task->flags & PF_WQ_WORKER))
		return;
	/*
	 * This function is called without any synchronization and @task
	 * could be in any state. Be careful with dereferences.
	 */
2016-10-11 13:55:17 -07:00
	worker = kthread_probe_data(task);
2013-04-30 15:27:22 -07:00
	/*
2018-05-18 08:47:13 -07:00
	 * Carefully copy the associated workqueue's workfn, name and desc.
	 * Keep the original last '\0' in case the original is garbage.
2013-04-30 15:27:22 -07:00
	 */
2020-06-17 09:37:53 +02:00
	copy_from_kernel_nofault(&fn, &worker->current_func, sizeof(fn));
	copy_from_kernel_nofault(&pwq, &worker->current_pwq, sizeof(pwq));
	copy_from_kernel_nofault(&wq, &pwq->wq, sizeof(wq));
	copy_from_kernel_nofault(name, wq->name, sizeof(name) - 1);
	copy_from_kernel_nofault(desc, worker->desc, sizeof(desc) - 1);
2013-04-30 15:27:22 -07:00
	if (fn || name[0] || desc[0]) {
2019-03-25 21:32:28 +02:00
		printk("%sWorkqueue: %s %ps", log_lvl, name, fn);
2018-05-18 08:47:13 -07:00
		if (strcmp(name, desc))
2013-04-30 15:27:22 -07:00
			pr_cont(" (%s)", desc);
		pr_cont("\n");
	}
}
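/*
 * Illustrative userspace sketch, not part of the kernel source. The copies
 * above read sizeof(buf) - 1 bytes into zero-initialised buffers, so the
 * last byte stays '\0' even when the source is unterminated garbage. The
 * sketch shows that idea with memcpy(); it does not model the fault
 * handling of copy_from_kernel_nofault(). SKETCH_NAME_LEN and
 * sketch_copy_name() are hypothetical.
 */

```c
#include <string.h>

#define SKETCH_NAME_LEN 8	/* assumed buffer size, for illustration only */

/*
 * Copy SKETCH_NAME_LEN - 1 bytes from src into a zeroed dst. The final byte
 * of dst is never overwritten, so dst is always a valid C string. src must
 * have at least SKETCH_NAME_LEN - 1 readable bytes.
 */
static void sketch_copy_name(char dst[SKETCH_NAME_LEN], const char *src)
{
	memset(dst, 0, SKETCH_NAME_LEN);
	memcpy(dst, src, SKETCH_NAME_LEN - 1);
}
```

/*
 * Even if src holds SKETCH_NAME_LEN bytes with no terminator, string
 * functions applied to dst afterwards stop within the buffer.
 */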
workqueue: dump workqueues on sysrq-t
Workqueues are used extensively throughout the kernel but sometimes
it's difficult to debug stalls involving work items because visibility
into its inner workings is fairly limited. Although sysrq-t task dump
annotates each active worker task with the information on the work
item being executed, it is challenging to find out which work items
are pending or delayed on which queues and how pools are being
managed.
This patch implements show_workqueue_state() which dumps all busy
workqueues and pools and is called from the sysrq-t handler. At the
end of sysrq-t dump, something like the following is printed.
Showing busy workqueues and worker pools:
...
workqueue filler_wq: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
in-flight: 491:filler_workfn, 507:filler_workfn
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
in-flight: 501:filler_workfn
pending: filler_workfn
...
workqueue test_wq: flags=0x8
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
delayed: test_workfn1 BAR(492), test_workfn2
...
pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62
The above shows that test_wq is executing test_workfn() on pid 510
which is the rescuer and also that there are two tasks 69 and 500
waiting for the work item to finish in flush_work(). As test_wq has
max_active of 1, there are two work items for test_workfn1() and
test_workfn2() which are delayed till the current work item is
finished. In addition, pid 492 is flushing test_workfn1().
The work item for test_workfn() is being executed on pwq of pool 2
which is the normal priority per-cpu pool for CPU 1. The pool has
three workers, two of which are executing filler_workfn() for
filler_wq and the last one is assuming the manager role trying to
create more workers.
This extra workqueue state dump will hopefully help chasing down hangs
involving workqueues.
v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.
v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
printk()'s replaced with pr_info()'s, and cpumask printing now
uses cpulist_pr_cont().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
2015-03-09 09:22:28 -04:00
static void pr_cont_pool_info(struct worker_pool *pool)
{
	pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
	if (pool->node != NUMA_NO_NODE)
		pr_cont(" node=%d", pool->node);
	pr_cont(" flags=0x%x nice=%d", pool->flags, pool->attrs->nice);
}
static void pr_cont_work(bool comma, struct work_struct *work)
{
	if (work->func == wq_barrier_func) {
		struct wq_barrier *barr;
		barr = container_of(work, struct wq_barrier, work);
		pr_cont("%s BAR(%d)", comma ? "," : "",
			task_pid_nr(barr->task));
	} else {
2019-03-25 21:32:28 +02:00
		pr_cont("%s %ps", comma ? "," : "", work->func);
2015-03-09 09:22:28 -04:00
	}
}
static void show_pwq(struct pool_workqueue *pwq)
{
	struct worker_pool *pool = pwq->pool;
	struct work_struct *work;
	struct worker *worker;
	bool has_in_flight = false, has_pending = false;
	int bkt;
	pr_info("  pwq %d:", pool->id);
	pr_cont_pool_info(pool);
2019-09-25 06:59:15 -07:00
	pr_cont(" active=%d/%d refcnt=%d%s\n",
		pwq->nr_active, pwq->max_active, pwq->refcnt,
2015-03-09 09:22:28 -04:00
		!list_empty(&pwq->mayday_node) ? " MAYDAY" : "");
	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
		if (worker->current_pwq == pwq) {
			has_in_flight = true;
			break;
		}
	}
	if (has_in_flight) {
		bool comma = false;
		pr_info("    in-flight:");
		hash_for_each(pool->busy_hash, bkt, worker, hentry) {
			if (worker->current_pwq != pwq)
				continue;
2019-03-25 21:32:28 +02:00
			pr_cont("%s %d%s:%ps", comma ? "," : "",
2015-03-09 09:22:28 -04:00
				task_pid_nr(worker->task),
2019-09-20 14:09:14 -07:00
				worker->rescue_wq ? "(RESCUER)" : "",
2015-03-09 09:22:28 -04:00
				worker->current_func);
			list_for_each_entry(work, &worker->scheduled, entry)
				pr_cont_work(false, work);
			comma = true;
		}
		pr_cont("\n");
	}
	list_for_each_entry(work, &pool->worklist, entry) {
		if (get_work_pwq(work) == pwq) {
			has_pending = true;
			break;
		}
	}
	if (has_pending) {
		bool comma = false;
		pr_info("    pending:");
		list_for_each_entry(work, &pool->worklist, entry) {
			if (get_work_pwq(work) != pwq)
				continue;
			pr_cont_work(comma, work);
			comma = !(*work_data_bits(work) & WORK_STRUCT_LINKED);
		}
		pr_cont("\n");
	}
2021-08-17 09:32:34 +08:00
	if (!list_empty(&pwq->inactive_works)) {
2015-03-09 09:22:28 -04:00
bool comma = false ;
2021-08-17 09:32:34 +08:00
pr_info ( " inactive: " ) ;
list_for_each_entry ( work , & pwq - > inactive_works , entry ) {
2015-03-09 09:22:28 -04:00
pr_cont_work ( comma , work ) ;
comma = ! ( * work_data_bits ( work ) & WORK_STRUCT_LINKED ) ;
}
pr_cont ( " \n " ) ;
}
}
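The separator logic used for the "pending:" and "inactive:" lists above can be sketched in userspace. This is a minimal illustration, not the kernel implementation: `struct fake_work`, its `linked` and `barrier_pid` fields, and `format_work_list()` are hypothetical stand-ins for `struct work_struct`, the `WORK_STRUCT_LINKED` bit, and `pr_cont_work()`. It assumes, as the sample dump in the commit message suggests, that a barrier entry is printed as `BAR(pid)` and that no comma is emitted after an item whose successor is linked to it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a queued work item. */
struct fake_work {
	const char *name;   /* function name, or NULL for a flush barrier */
	int barrier_pid;    /* pid of the flushing task (barriers only) */
	int linked;         /* non-zero if the next item is linked to this one */
};

/*
 * Format a work list the way the dump does: each item is preceded by a
 * space, and by a comma as well unless the previous item was linked to it.
 * Returns the number of characters stored, excluding the trailing NUL.
 */
static size_t format_work_list(char *buf, size_t size,
			       const struct fake_work *works, size_t n)
{
	size_t off = 0;
	int comma = 0;

	assert(size > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < n && off < size - 1; i++) {
		int len;

		if (works[i].name)
			len = snprintf(buf + off, size - off, "%s %s",
				       comma ? "," : "", works[i].name);
		else
			len = snprintf(buf + off, size - off, "%s BAR(%d)",
				       comma ? "," : "", works[i].barrier_pid);
		if (len < 0)
			break;

		/* snprintf() reports the untruncated length; clamp it. */
		if ((size_t)len < size - off)
			off += (size_t)len;
		else
			off = size - 1;

		/* Same rule as the dump: no comma after a linked item. */
		comma = !works[i].linked;
	}
	return off;
}
```

Feeding it the three entries `test_workfn1` (linked), a barrier for pid 492, and `test_workfn2` yields ` test_workfn1 BAR(492), test_workfn2`, which matches the "delayed:" line in the sample output.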
/**
2021-10-20 14:09:00 +11:00
* show_one_workqueue - dump state of specified workqueue
* @ wq : workqueue whose state will be printed
2015-03-09 09:22:28 -04:00
*/
2021-10-20 14:09:00 +11:00
void show_one_workqueue ( struct workqueue_struct * wq )
2015-03-09 09:22:28 -04:00
{
2021-10-20 14:09:00 +11:00
struct pool_workqueue * pwq ;
bool idle = true ;
2015-03-09 09:22:28 -04:00
unsigned long flags ;
2021-10-20 14:09:00 +11:00
for_each_pwq ( pwq , wq ) {
if ( pwq - > nr_active | | ! list_empty ( & pwq - > inactive_works ) ) {
idle = false ;
break ;
2015-03-09 09:22:28 -04:00
}
2021-10-20 14:09:00 +11:00
}
if ( idle ) /* Nothing to print for idle workqueue */
return ;
2015-03-09 09:22:28 -04:00
2021-10-20 14:09:00 +11:00
pr_info ( " workqueue %s: flags=0x%x \n " , wq - > name , wq - > flags ) ;
2015-03-09 09:22:28 -04:00
2021-10-20 14:09:00 +11:00
for_each_pwq ( pwq , wq ) {
raw_spin_lock_irqsave ( & pwq - > pool - > lock , flags ) ;
if ( pwq - > nr_active | | ! list_empty ( & pwq - > inactive_works ) ) {
2018-01-11 09:53:35 +09:00
/*
2021-10-20 14:09:00 +11:00
* Defer printing to avoid deadlocks in console
* drivers that queue work while holding locks
* also taken in their write paths .
2018-01-11 09:53:35 +09:00
*/
2021-10-20 14:09:00 +11:00
printk_deferred_enter ( ) ;
show_pwq ( pwq ) ;
printk_deferred_exit ( ) ;
2015-03-09 09:22:28 -04:00
}
2021-10-20 14:09:00 +11:00
raw_spin_unlock_irqrestore ( & pwq - > pool - > lock , flags ) ;
2018-01-11 09:53:35 +09:00
/*
* We could be printing a lot from atomic context , e . g .
2021-10-20 14:09:00 +11:00
* sysrq - t - > show_all_workqueues ( ) . Avoid triggering
2018-01-11 09:53:35 +09:00
* hard lockup .
*/
touch_nmi_watchdog ( ) ;
2015-03-09 09:22:28 -04:00
}
2021-10-20 14:09:00 +11:00
}
/**
* show_one_worker_pool - dump state of specified worker pool
* @ pool : worker pool whose state will be printed
*/
static void show_one_worker_pool ( struct worker_pool * pool )
{
struct worker * worker ;
bool first = true ;
unsigned long flags ;
raw_spin_lock_irqsave ( & pool - > lock , flags ) ;
if ( pool - > nr_workers = = pool - > nr_idle )
goto next_pool ;
/*
* Defer printing to avoid deadlocks in console drivers that
* queue work while holding locks also taken in their write
* paths .
*/
printk_deferred_enter ( ) ;
pr_info ( " pool %d: " , pool - > id ) ;
pr_cont_pool_info ( pool ) ;
pr_cont ( " hung=%us workers=%d " ,
jiffies_to_msecs ( jiffies - pool - > watchdog_ts ) / 1000 ,
pool - > nr_workers ) ;
if ( pool - > manager )
pr_cont ( " manager: %d " ,
task_pid_nr ( pool - > manager - > task ) ) ;
list_for_each_entry ( worker , & pool - > idle_list , entry ) {
pr_cont ( " %s%d " , first ? " idle: " : " " ,
task_pid_nr ( worker - > task ) ) ;
first = false ;
}
pr_cont ( " \n " ) ;
printk_deferred_exit ( ) ;
next_pool :
raw_spin_unlock_irqrestore ( & pool - > lock , flags ) ;
/*
* We could be printing a lot from atomic context , e . g .
* sysrq - t - > show_all_workqueues ( ) . Avoid triggering
* hard lockup .
*/
touch_nmi_watchdog ( ) ;
}
/**
* show_all_workqueues - dump workqueue state
*
* Called from a sysrq handler or try_to_freeze_tasks ( ) and prints out
* all busy workqueues and pools .
*/
void show_all_workqueues ( void )
{
struct workqueue_struct * wq ;
struct worker_pool * pool ;
int pi ;
rcu_read_lock ( ) ;
pr_info ( " Showing busy workqueues and worker pools: \n " ) ;
list_for_each_entry_rcu ( wq , & workqueues , list )
show_one_workqueue ( wq ) ;
for_each_pool ( pool , pi )
show_one_worker_pool ( pool ) ;
2019-03-13 17:55:47 +01:00
rcu_read_unlock ( ) ;
2015-03-09 09:22:28 -04:00
}
2018-05-18 08:47:13 -07:00
/* used to show worker information through /proc/PID/{comm,stat,status} */
void wq_worker_comm ( char * buf , size_t size , struct task_struct * task )
{
int off ;
/* always show the actual comm */
off = strscpy ( buf , task - > comm , size ) ;
if ( off < 0 )
return ;
2018-05-21 08:04:35 -07:00
/* stabilize PF_WQ_WORKER and worker pool association */
2018-05-18 08:47:13 -07:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2018-05-21 08:04:35 -07:00
if ( task - > flags & PF_WQ_WORKER ) {
struct worker * worker = kthread_data ( task ) ;
struct worker_pool * pool = worker - > pool ;
2018-05-18 08:47:13 -07:00
2018-05-21 08:04:35 -07:00
if ( pool ) {
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
2018-05-21 08:04:35 -07:00
/*
* - > desc tracks information ( wq name or
* set_worker_desc ( ) ) for the latest execution . If
* current , prepend ' + ' , otherwise ' - ' .
*/
if ( worker - > desc [ 0 ] ! = ' \0 ' ) {
if ( worker - > current_work )
scnprintf ( buf + off , size - off , " +%s " ,
worker - > desc ) ;
else
scnprintf ( buf + off , size - off , " -%s " ,
worker - > desc ) ;
}
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2018-05-18 08:47:13 -07:00
}
}
mutex_unlock ( & wq_pool_attach_mutex ) ;
}
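The naming rule in wq_worker_comm() can be tried out in userspace. The following is a minimal sketch, not the kernel code: `struct worker_model` and `format_worker_comm()` are hypothetical names, and `snprintf()` stands in for the kernel-only `strscpy()` and `scnprintf()`. Locking is left out entirely.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace stand-in for the fields wq_worker_comm() reads. */
struct worker_model {
	const char *comm;	/* task name, like task->comm */
	const char *desc;	/* wq name or set_worker_desc() text */
	int busy;		/* nonzero while a work item is executing */
};

/*
 * Write comm into buf, then append "+desc" if the worker is busy or
 * "-desc" if it is idle. Output is truncated to fit and is always
 * NUL-terminated when size > 0.
 */
static void format_worker_comm(char *buf, size_t size,
			       const struct worker_model *w)
{
	int off;

	if (size == 0)
		return;

	off = snprintf(buf, size, "%s", w->comm);
	if (off < 0 || (size_t)off >= size)
		return;		/* comm alone filled the buffer */

	if (w->desc[0] != '\0')
		snprintf(buf + off, size - off, "%c%s",
			 w->busy ? '+' : '-', w->desc);
}
```

With a comm of `kworker/1:2` and a desc of `events`, a busy worker is reported as `kworker/1:2+events` and an idle one as `kworker/1:2-events`.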
2018-05-22 21:47:32 +02:00
# ifdef CONFIG_SMP
2010-06-29 10:07:12 +02:00
/*
* CPU hotplug .
*
2010-06-29 10:07:14 +02:00
* There are two challenges in supporting CPU hotplug . Firstly , there
2013-02-13 19:29:12 -08:00
* are a lot of assumptions on strong associations among work , pwq and
2013-01-24 11:01:34 -08:00
* pool which make migrating pending and scheduled works very
2010-06-29 10:07:14 +02:00
* difficult to implement without impacting hot paths . Secondly ,
2013-01-24 11:01:33 -08:00
* worker pools serve a mix of short , long and very long running works , making
2010-06-29 10:07:14 +02:00
* blocked draining impractical .
*
2013-01-24 11:01:33 -08:00
* This is solved by allowing the pools to be disassociated from the CPU ,
2012-07-17 12:39:27 -07:00
* running as unbound ones , and allowing them to be reattached later if the
* cpu comes back online .
2010-06-29 10:07:12 +02:00
*/
2005-04-16 15:20:36 -07:00
2017-12-01 22:20:36 +08:00
static void unbind_workers ( int cpu )
2007-05-09 02:34:09 -07:00
{
2012-07-13 22:16:44 -07:00
struct worker_pool * pool ;
2010-06-29 10:07:12 +02:00
struct worker * worker ;
2007-05-09 02:34:09 -07:00
2013-03-12 11:30:03 -07:00
for_each_cpu_worker_pool ( pool , cpu ) {
2018-05-18 08:47:13 -07:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
2007-05-09 02:34:09 -07:00
2013-01-24 11:01:33 -08:00
/*
2014-05-20 17:46:34 +08:00
* We ' ve blocked all attach / detach operations . Make all workers
2013-01-24 11:01:33 -08:00
* unbound and set DISASSOCIATED . Before this , all workers
2021-12-07 15:35:39 +08:00
* must be on the cpu . After this , they may become diasporas .
2021-12-07 15:35:40 +08:00
* And the preemption - disabled sections in their sched callbacks
* are guaranteed to see WORKER_UNBOUND since the code here
* is on the same cpu .
2013-01-24 11:01:33 -08:00
*/
2014-05-20 17:46:31 +08:00
for_each_pool_worker ( worker , pool )
2013-01-24 11:01:33 -08:00
worker - > flags | = WORKER_UNBOUND ;
2007-05-09 02:34:15 -07:00
2013-01-24 11:01:33 -08:00
pool - > flags | = POOL_DISASSOCIATED ;
2012-07-17 12:39:26 -07:00
2013-03-08 15:18:28 -08:00
/*
2021-12-07 15:35:41 +08:00
* The handling of nr_running in sched callbacks is disabled
* now . Zap nr_running . After this , nr_running stays zero and
* need_more_worker ( ) and keep_working ( ) are always true as
* long as the worklist is not empty . This pool now behaves as
* an unbound ( in terms of concurrency management ) pool which
2013-03-08 15:18:28 -08:00
* is served by workers tied to the pool .
*/
2021-12-23 20:31:40 +08:00
pool - > nr_running = 0 ;
2013-03-08 15:18:28 -08:00
/*
* With concurrency management just turned off , a busy
* worker blocking could lead to lengthy stalls . Kick off
* unbound chain execution of currently pending work items .
*/
wake_up_worker ( pool ) ;
2021-12-07 15:35:41 +08:00
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2021-12-07 15:35:41 +08:00
for_each_pool_worker ( worker , pool ) {
kthread_set_per_cpu ( worker - > task , - 1 ) ;
2022-07-29 17:44:38 +08:00
if ( cpumask_intersects ( wq_unbound_cpumask , cpu_active_mask ) )
WARN_ON_ONCE ( set_cpus_allowed_ptr ( worker - > task , wq_unbound_cpumask ) < 0 ) ;
else
WARN_ON_ONCE ( set_cpus_allowed_ptr ( worker - > task , cpu_possible_mask ) < 0 ) ;
2021-12-07 15:35:41 +08:00
}
mutex_unlock ( & wq_pool_attach_mutex ) ;
2013-03-08 15:18:28 -08:00
}
2007-05-09 02:34:09 -07:00
}
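The effect of zapping nr_running in unbind_workers() can be illustrated with a small userspace model. This is a sketch under stated assumptions: `struct pool_model` and `zap_nr_running()` are hypothetical, and the two predicates only follow the description in the comment above (both are true whenever the worklist is non-empty and nr_running is zero); the kernel's actual helpers may differ between versions.

```c
#include <stdbool.h>

/* Hypothetical model of the two pool fields that matter here. */
struct pool_model {
	int nr_running;	/* workers currently counted as running */
	int nr_pending;	/* stands in for a non-empty pool->worklist */
};

/* Assumed rule: wake another worker if work is pending and none is running. */
static bool need_more_worker(const struct pool_model *p)
{
	return p->nr_pending > 0 && p->nr_running == 0;
}

/* Assumed rule: keep going while work is pending and at most one worker runs. */
static bool keep_working(const struct pool_model *p)
{
	return p->nr_pending > 0 && p->nr_running <= 1;
}

/* Models "pool->nr_running = 0" in unbind_workers(). */
static void zap_nr_running(struct pool_model *p)
{
	p->nr_running = 0;
}
```

Once the counter is zapped and no longer updated, both predicates depend only on whether work is pending, which is how an unbound pool behaves with respect to concurrency management.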
2013-03-19 13:45:21 -07:00
/**
* rebind_workers - rebind all workers of a pool to the associated CPU
* @ pool : pool of interest
*
workqueue: directly restore CPU affinity of workers from CPU_ONLINE
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.
As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr(). Due
to the various states a worker can be in, this led to three different
paths by which a worker may be rebound. worker->rebind_work is queued to busy
workers. Idle ones are signaled by unlinking worker->entry and calling
idle_worker_rebind(). The manager isn't covered by either and
implements its own mechanism.
PF_THREAD_BOUND has been replaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself now can manipulate CPU affinity of workers. This patch
replaces the existing rebind mechanism with a direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.
There are a couple of subtleties. All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up. This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.
Another subtlety stems from matching @pool->nr_running with the
number of running unbound workers. While DISASSOCIATED, all workers
are unbound and nr_running is zero. As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals. Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution. The only change
needed for the workers is clearing REBOUND along with PREP.
* This patch leaves for_each_busy_worker() without any user. Removed.
* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
and rebind logic in manager_workers() removed.
* worker_thread() now looks at WORKER_DIE instead of testing whether
@worker->entry is empty to determine whether it needs to do
something special as dying is the only special thing now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
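The UNBOUND to REBOUND hand-off described in this commit message can be sketched in plain C. The bit values below are hypothetical and chosen only for illustration; the real flag definitions live elsewhere in workqueue.c. The point is that a worker counts as running only when none of the NOT_RUNNING bits are set, so swapping one such bit for another never changes what the counter should be.

```c
/* Hypothetical bit values; only the relationships between them matter. */
enum {
	WORKER_PREP        = 1 << 0,
	WORKER_UNBOUND     = 1 << 1,
	WORKER_REBOUND     = 1 << 2,
	WORKER_NOT_RUNNING = WORKER_PREP | WORKER_UNBOUND | WORKER_REBOUND,
};

/* What the hotplug path does: replace UNBOUND with REBOUND. */
static unsigned int swap_unbound_for_rebound(unsigned int flags)
{
	flags |= WORKER_REBOUND;
	flags &= ~WORKER_UNBOUND;
	return flags;
}

/* What the worker does later: clear REBOUND along with PREP. */
static unsigned int start_next_round(unsigned int flags)
{
	return flags & ~(WORKER_PREP | WORKER_REBOUND);
}

/* A worker contributes to nr_running only with no NOT_RUNNING bit set. */
static int counts_as_running(unsigned int flags)
{
	return !(flags & WORKER_NOT_RUNNING);
}
```

Because REBOUND is itself a NOT_RUNNING flag, the rebinding code never has to guess whether a worker is running. The worker resumes being counted only when it clears the flag itself.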
2013-03-19 13:45:21 -07:00
* @ pool - > cpu is coming online . Rebind all workers to the CPU .
2013-03-19 13:45:21 -07:00
*/
static void rebind_workers ( struct worker_pool * pool )
{
2013-03-19 13:45:21 -07:00
struct worker * worker ;
2013-03-19 13:45:21 -07:00
2018-05-18 08:47:13 -07:00
lockdep_assert_held ( & wq_pool_attach_mutex ) ;
2013-03-19 13:45:21 -07:00
2013-03-19 13:45:21 -07:00
/*
* Restore CPU affinity of all workers . As all idle workers should
* be on the run - queue of the associated CPU before any local
2015-05-23 10:38:14 +05:30
* wake - ups for concurrency management happen , restore CPU affinity
2013-03-19 13:45:21 -07:00
* of all workers first and then clear UNBOUND . As we ' re called
* from CPU_ONLINE , the following shouldn ' t fail .
*/
2021-01-12 11:26:49 +01:00
for_each_pool_worker ( worker , pool ) {
kthread_set_per_cpu ( worker - > task , pool - > cpu ) ;
2013-03-19 13:45:21 -07:00
WARN_ON_ONCE ( set_cpus_allowed_ptr ( worker - > task ,
pool - > attrs - > cpumask ) < 0 ) ;
2021-01-12 11:26:49 +01:00
}
2013-03-19 13:45:21 -07:00
2020-05-27 21:46:33 +02:00
raw_spin_lock_irq ( & pool - > lock ) ;
workqueue: fix rebind bound workers warning
------------[ cut here ]------------
WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
Modules linked in:
CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
Call Trace:
dump_stack+0x89/0xd4
__warn+0xfd/0x120
warn_slowpath_null+0x1d/0x20
rebind_workers+0x1c0/0x1d0
workqueue_cpu_up_callback+0xf5/0x1d0
notifier_call_chain+0x64/0x90
? trace_hardirqs_on_caller+0xf2/0x220
? notify_prepare+0x80/0x80
__raw_notifier_call_chain+0xe/0x10
__cpu_notify+0x35/0x50
notify_down_prepare+0x5e/0x80
? notify_prepare+0x80/0x80
cpuhp_invoke_callback+0x73/0x330
? __schedule+0x33e/0x8a0
cpuhp_down_callbacks+0x51/0xc0
cpuhp_thread_fun+0xc1/0xf0
smpboot_thread_fn+0x159/0x2a0
? smpboot_create_threads+0x80/0x80
kthread+0xef/0x110
? wait_for_completion+0xf0/0x120
? schedule_tail+0x35/0xf0
ret_from_fork+0x22/0x50
? __init_kthread_worker+0x70/0x70
---[ end trace eb12ae47d2382d8f ]---
notify_down_prepare: attempt to take down CPU 0 failed
This bug can be reproduced with the config below and nohz_full= covering all CPUs:
CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
CONFIG_DEBUG_HOTPLUG_CPU0=y
CONFIG_NO_HZ_FULL=y
As Thomas pointed out:
| If a down prepare callback fails, then DOWN_FAILED is invoked for all
| callbacks which have successfully executed DOWN_PREPARE.
|
| But, workqueue has actually two notifiers. One which handles
| UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
|
| Now look at the priorities of those callbacks:
|
| CPU_PRI_WORKQUEUE_UP = 5
| CPU_PRI_WORKQUEUE_DOWN = -5
|
| So the call order on DOWN_PREPARE is:
|
| CB 1
| CB ...
| CB workqueue_up() -> Ignores DOWN_PREPARE
| CB ...
| CB X ---> Fails
|
| So we call up to CB X with DOWN_FAILED
|
| CB 1
| CB ...
| CB workqueue_up() -> Handles DOWN_FAILED
| CB ...
| CB X-1
|
| So the problem is that the workqueue stuff handles DOWN_FAILED in the up
| callback, while it should do it in the down callback. Which is not a good idea
| either because it wants to be called early on rollback...
|
| Brilliant stuff, isn't it? The hotplug rework will solve this problem because
| the callbacks become symmetric, but for the existing mess, we need some
| workaround in the workqueue code.
The boot CPU handles housekeeping duty (unbound timers, workqueues,
timekeeping, ...) on behalf of full dynticks CPUs. It must remain
online when nohz full is enabled. Every notifier_block has a
priority:
workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down
So the tick_nohz_cpu_down callback fails on down prepare of CPU 0, and
the notifier_blocks behind tick_nohz_cpu_down are not called at
all, which means the workers are never actually unbound. The hotplug
state machine then falls back to undo and onlines CPU 0 again. Workers
are rebound unconditionally even though they were never unbound, which
triggers the warning in the process.
This patch fixes it by catching !DISASSOCIATED to avoid rebinding bound
workers.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
2016-05-11 17:55:18 +08:00
2014-06-03 15:33:27 +08:00
pool - > flags & = ~ POOL_DISASSOCIATED ;
2013-03-19 13:45:21 -07:00
2014-05-20 17:46:31 +08:00
for_each_pool_worker ( worker , pool ) {
2013-03-19 13:45:21 -07:00
unsigned int worker_flags = worker - > flags ;
2013-03-19 13:45:21 -07:00
2013-03-19 13:45:21 -07:00
/*
* We want to clear UNBOUND but can ' t directly call
* worker_clr_flags ( ) or adjust nr_running . Atomically
* replace UNBOUND with another NOT_RUNNING flag REBOUND .
* @ worker will clear REBOUND using worker_clr_flags ( ) when
* it initiates the next execution cycle thus restoring
* concurrency management . Note that when or whether
* @ worker clears REBOUND doesn ' t affect correctness .
*
locking/atomics, workqueue: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.
However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.
It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts the workqueue code and comments to use
{READ,WRITE}_ONCE() consistently.
----
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-12-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
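The read/write split introduced by this conversion can be illustrated in userspace. The macros below are a simplified sketch, assuming a compiler that supports `__typeof__` (GCC or Clang); the kernel's real READ_ONCE()/WRITE_ONCE() definitions are more elaborate. `publish_rebound()` is a hypothetical helper and the bit values are made up for the example.

```c
/* Simplified stand-ins: force exactly one load or store via a volatile access. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical bit values for illustration only. */
#define DEMO_UNBOUND	(1u << 1)
#define DEMO_REBOUND	(1u << 2)

/*
 * Build the new flag word in a local variable, then publish it with a
 * single store so a lockless reader never sees the intermediate state
 * in which neither flag is set.
 */
static void publish_rebound(unsigned int *flags)
{
	unsigned int local = READ_ONCE(*flags);

	local |= DEMO_REBOUND;
	local &= ~DEMO_UNBOUND;
	WRITE_ONCE(*flags, local);
}
```

Unlike ACCESS_ONCE(), the two macros make the direction of each access explicit, which is what allows reads and writes to be instrumented separately.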
2017-10-23 14:07:22 -07:00
* WRITE_ONCE ( ) is necessary because @ worker - > flags may be
2013-03-19 13:45:21 -07:00
* tested without holding any lock in
2019-03-13 17:55:48 +01:00
* wq_worker_running ( ) . Without it , NOT_RUNNING test may
workqueue: directly restore CPU affinity of workers from CPU_ONLINE
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.
As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr(). Due
to the various states a worker can be in, this led to three different
paths a worker may be rebound. worker->rebind_work is queued to busy
workers. Idle ones are signaled by unlinking worker->entry and call
idle_worker_rebind(). The manager isn't covered by either and
implements its own mechanism.
PF_THREAD_BOUND has been relaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself now can manipulate CPU affinity of workers. This patch
replaces the existing rebind mechanism with direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.
There are a couple of subtleties. All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up. This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.
Another subtlety stems from matching @pool->nr_running with the
number of running unbound workers. While DISASSOCIATED, all workers
are unbound and nr_running is zero. As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals. Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution. The only change
needed for the workers is clearing REBOUND along with PREP.
* This patch leaves for_each_busy_worker() without any user. Removed.
* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
and rebind logic in manager_workers() removed.
* worker_thread() now looks at WORKER_DIE instead of testing whether
@worker->entry is empty to determine whether it needs to do
something special as dying is the only special thing now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
* fail incorrectly leading to premature concurrency
* management operations .
*/
WARN_ON_ONCE ( ! ( worker_flags & WORKER_UNBOUND ) ) ;
worker_flags | = WORKER_REBOUND ;
worker_flags & = ~ WORKER_UNBOUND ;
locking/atomics, workqueue: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.
However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.
It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts the workqueue code and comments to use
{READ,WRITE}_ONCE() consistently.
----
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-12-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-23 14:07:22 -07:00
WRITE_ONCE ( worker - > flags , worker_flags ) ;
2013-03-19 13:45:21 -07:00
}
2013-03-19 13:45:21 -07:00
2020-05-27 21:46:33 +02:00
raw_spin_unlock_irq ( & pool - > lock ) ;
2013-03-19 13:45:21 -07:00
}
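The flag handling above can be sketched in isolation: the new flags word is built in a local copy, with REBOUND set and UNBOUND cleared, so that it can be published with one store. This is a minimal userspace sketch; the `sketch_` names and the bit values are assumptions made for illustration and are not the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical bit values for illustration; the kernel defines its own. */
enum {
	SKETCH_WORKER_UNBOUND = 1u << 7,
	SKETCH_WORKER_REBOUND = 1u << 8,
};

/*
 * Build the new flags word from a local copy. The caller is expected to
 * publish the result with a single store (the code above uses
 * WRITE_ONCE() for that step).
 */
static uint32_t sketch_swap_unbound_for_rebound(uint32_t flags)
{
	flags |= SKETCH_WORKER_REBOUND;            /* still counts as not running */
	flags &= ~(uint32_t)SKETCH_WORKER_UNBOUND; /* no longer unbound */
	return flags;
}
```

Because both flags belong to the not-running group, a reader that sees either the old or the new word should reach the same conclusion about whether the worker counts as running.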
workqueue: restore CPU affinity of unbound workers on CPU_ONLINE
With the recent addition of the custom attributes support, unbound
pools may have an allowed cpumask which isn't full. As long as some of
the CPUs in the cpumask are online, its workers will maintain
cpus_allowed as set on worker creation; however, once no online CPU is
left in cpus_allowed, the scheduler will reset cpus_allowed of any
workers which get scheduled so that they can execute.
To remain compliant with the user-specified configuration, CPU affinity
needs to be restored when a CPU becomes online for an unbound pool
which didn't have any online CPUs before.
This patch implements restore_unbound_workers_cpumask(), which is
called from CPU_ONLINE for all unbound pools, checks whether the
coming up CPU is the first allowed online one, and, if so, invokes
set_cpus_allowed_ptr() with the configured cpumask on all workers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
/**
* restore_unbound_workers_cpumask - restore cpumask of unbound workers
* @ pool : unbound pool of interest
* @ cpu : the CPU which is coming up
*
* An unbound pool may end up with a cpumask which doesn ' t have any online
* CPUs . When a worker of such a pool gets scheduled , the scheduler resets
* its cpus_allowed . If @ cpu is in @ pool ' s cpumask which didn ' t have any
* online CPU before , cpus_allowed of all its workers should be restored .
*/
static void restore_unbound_workers_cpumask ( struct worker_pool * pool , int cpu )
{
static cpumask_t cpumask ;
struct worker * worker ;
2018-05-18 08:47:13 -07:00
lockdep_assert_held ( & wq_pool_attach_mutex ) ;
2013-03-19 13:45:21 -07:00
/* is @cpu allowed for @pool? */
if ( ! cpumask_test_cpu ( cpu , pool - > attrs - > cpumask ) )
return ;
cpumask_and ( & cpumask , pool - > attrs - > cpumask , cpu_online_mask ) ;
/* as we're called from CPU_ONLINE, the following shouldn't fail */
2014-05-20 17:46:31 +08:00
for_each_pool_worker ( worker , pool )
2016-06-16 14:38:42 +02:00
WARN_ON_ONCE ( set_cpus_allowed_ptr ( worker - > task , & cpumask ) < 0 ) ;
2013-03-19 13:45:21 -07:00
}
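The mask computation in restore_unbound_workers_cpumask() can be modelled with plain 64-bit bitmasks. The sketch below is a userspace approximation only: `sketch_restore_mask` and its parameters are hypothetical names, and it assumes at most 64 CPUs, which the kernel's cpumask type does not.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns false when @cpu is not allowed for the pool, mirroring the
 * early return above. Otherwise stores the intersection of the pool's
 * configured mask and the online mask, which is what would be applied
 * to every worker of the pool.
 */
static bool sketch_restore_mask(uint64_t pool_mask, uint64_t online_mask,
				unsigned int cpu, uint64_t *out)
{
	if (cpu >= 64 || !(pool_mask & (UINT64_C(1) << cpu)))
		return false;

	*out = pool_mask & online_mask;
	return true;
}
```

For a pool allowed on CPUs 2 and 3 with CPUs 0 and 2 online, bringing up CPU 2 yields a mask containing only CPU 2, while bringing up CPU 0 leaves the workers untouched.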
2016-07-13 17:16:29 +00:00
int workqueue_prepare_cpu ( unsigned int cpu )
{
struct worker_pool * pool ;
for_each_cpu_worker_pool ( pool , cpu ) {
if ( pool - > nr_workers )
continue ;
if ( ! create_worker ( pool ) )
return - ENOMEM ;
}
return 0 ;
}
int workqueue_online_cpu ( unsigned int cpu )
2007-05-09 02:34:09 -07:00
{
2012-07-13 22:16:44 -07:00
struct worker_pool * pool ;
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:36 -07:00
struct workqueue_struct * wq ;
2013-03-19 13:45:21 -07:00
int pi ;
2012-07-17 12:39:27 -07:00
2016-07-13 17:16:29 +00:00
mutex_lock ( & wq_pool_mutex ) ;
2013-03-19 13:45:21 -07:00
2016-07-13 17:16:29 +00:00
for_each_pool ( pool , pi ) {
2018-05-18 08:47:13 -07:00
mutex_lock ( & wq_pool_attach_mutex ) ;
2013-01-24 11:01:33 -08:00
2016-07-13 17:16:29 +00:00
if ( pool - > cpu = = cpu )
rebind_workers ( pool ) ;
else if ( pool - > cpu < 0 )
restore_unbound_workers_cpumask ( pool , cpu ) ;
2013-01-24 11:01:33 -08:00
2018-05-18 08:47:13 -07:00
mutex_unlock ( & wq_pool_attach_mutex ) ;
2016-07-13 17:16:29 +00:00
}
2015-04-02 19:14:39 +08:00
2016-07-13 17:16:29 +00:00
/* update NUMA affinity of unbound workqueues */
list_for_each_entry ( wq , & workqueues , list )
wq_update_unbound_numa ( wq , cpu , true ) ;
2015-04-02 19:14:39 +08:00
2016-07-13 17:16:29 +00:00
mutex_unlock ( & wq_pool_mutex ) ;
return 0 ;
2015-04-02 19:14:39 +08:00
}
2016-07-13 17:16:29 +00:00
int workqueue_offline_cpu ( unsigned int cpu )
2015-04-02 19:14:39 +08:00
{
struct workqueue_struct * wq ;
2016-07-13 17:16:29 +00:00
/* unbinding per-cpu workers should happen on the local CPU */
2017-12-01 22:20:36 +08:00
if ( WARN_ON ( cpu ! = smp_processor_id ( ) ) )
return - 1 ;
unbind_workers ( cpu ) ;
2016-07-13 17:16:29 +00:00
/* update NUMA affinity of unbound workqueues */
mutex_lock ( & wq_pool_mutex ) ;
list_for_each_entry ( wq , & workqueues , list )
wq_update_unbound_numa ( wq , cpu , false ) ;
mutex_unlock ( & wq_pool_mutex ) ;
return 0 ;
2015-04-02 19:14:39 +08:00
}
struct work_for_cpu {
struct work_struct work ;
long ( * fn ) ( void * ) ;
void * arg ;
long ret ;
} ;
static void work_for_cpu_fn ( struct work_struct * work )
{
struct work_for_cpu * wfc = container_of ( work , struct work_for_cpu , work ) ;
wfc - > ret = wfc - > fn ( wfc - > arg ) ;
}
/**
2016-03-10 12:07:38 +01:00
* work_on_cpu - run a function in thread context on a particular cpu
2015-04-02 19:14:39 +08:00
* @ cpu : the cpu to run on
* @ fn : the function to run
* @ arg : the function arg
*
* It is up to the caller to ensure that the cpu doesn ' t go offline .
* The caller must not hold any locks which would prevent @ fn from completing .
*
* Return : The value @ fn returns .
*/
long work_on_cpu ( int cpu , long ( * fn ) ( void * ) , void * arg )
{
struct work_for_cpu wfc = { . fn = fn , . arg = arg } ;
INIT_WORK_ONSTACK ( & wfc . work , work_for_cpu_fn ) ;
schedule_work_on ( cpu , & wfc . work ) ;
flush_work ( & wfc . work ) ;
destroy_work_on_stack ( & wfc . work ) ;
return wfc . ret ;
}
EXPORT_SYMBOL_GPL ( work_on_cpu ) ;
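The pattern work_on_cpu() relies on, an on-stack wrapper that embeds the work item and is recovered in the callback with container_of(), can be illustrated outside the kernel. In this sketch every `sketch_` name is hypothetical, and the callback is invoked directly in the caller instead of being queued on a CPU and flushed.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_work {
	void (*func)(struct sketch_work *work);
};

struct sketch_work_for_cpu {
	struct sketch_work work;	/* embedded work item */
	long (*fn)(void *);
	void *arg;
	long ret;
};

/* Recover the wrapper from the embedded member and store fn's result. */
static void sketch_work_for_cpu_fn(struct sketch_work *work)
{
	struct sketch_work_for_cpu *wfc =
		sketch_container_of(work, struct sketch_work_for_cpu, work);

	wfc->ret = wfc->fn(wfc->arg);
}

/* Assumption: runs synchronously here; the kernel queues and flushes. */
static long sketch_run_sync(long (*fn)(void *), void *arg)
{
	struct sketch_work_for_cpu wfc = { .fn = fn, .arg = arg };

	wfc.work.func = sketch_work_for_cpu_fn;
	wfc.work.func(&wfc.work);
	return wfc.ret;
}

/* Example payload: doubles the long that @arg points to. */
static long sketch_double(void *arg)
{
	return 2 * *(long *)arg;
}
```

The wrapper can live on the stack only because the caller waits for completion before returning, which is the role flush_work() plays in the function above.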
2017-04-12 22:07:28 +02:00
/**
* work_on_cpu_safe - run a function in thread context on a particular cpu
* @ cpu : the cpu to run on
* @ fn : the function to run
* @ arg : the function argument
*
* Disables CPU hotplug and calls work_on_cpu ( ) . The caller must not hold
* any locks which would prevent @ fn from completing .
*
* Return : The value @ fn returns .
*/
long work_on_cpu_safe ( int cpu , long ( * fn ) ( void * ) , void * arg )
{
long ret = - ENODEV ;
2021-08-03 16:16:20 +02:00
cpus_read_lock ( ) ;
2017-04-12 22:07:28 +02:00
if ( cpu_online ( cpu ) )
ret = work_on_cpu ( cpu , fn , arg ) ;
2021-08-03 16:16:20 +02:00
cpus_read_unlock ( ) ;
2017-04-12 22:07:28 +02:00
return ret ;
}
EXPORT_SYMBOL_GPL ( work_on_cpu_safe ) ;
2015-04-02 19:14:39 +08:00
# endif /* CONFIG_SMP */
# ifdef CONFIG_FREEZER
/**
* freeze_workqueues_begin - begin freezing workqueues
*
* Start freezing workqueues . After this function returns , all freezable
2021-08-17 09:32:34 +08:00
* workqueues will queue new works to their inactive_works list instead of
2015-04-02 19:14:39 +08:00
* pool - > worklist .
*
* CONTEXT :
* Grabs and releases wq_pool_mutex , wq - > mutex and pool - > lock ' s .
*/
void freeze_workqueues_begin ( void )
{
struct workqueue_struct * wq ;
struct pool_workqueue * pwq ;
mutex_lock ( & wq_pool_mutex ) ;
WARN_ON_ONCE ( workqueue_freezing ) ;
workqueue_freezing = true ;
list_for_each_entry ( wq , & workqueues , list ) {
mutex_lock ( & wq - > mutex ) ;
for_each_pwq ( pwq , wq )
pwq_adjust_max_active ( pwq ) ;
mutex_unlock ( & wq - > mutex ) ;
}
mutex_unlock ( & wq_pool_mutex ) ;
}
/**
* freeze_workqueues_busy - are freezable workqueues still busy ?
*
* Check whether freezing is complete . This function must be called
* between freeze_workqueues_begin ( ) and thaw_workqueues ( ) .
*
* CONTEXT :
* Grabs and releases wq_pool_mutex .
*
* Return :
* % true if some freezable workqueues are still busy . % false if freezing
* is complete .
*/
bool freeze_workqueues_busy ( void )
{
bool busy = false ;
struct workqueue_struct * wq ;
struct pool_workqueue * pwq ;
mutex_lock ( & wq_pool_mutex ) ;
WARN_ON_ONCE ( ! workqueue_freezing ) ;
list_for_each_entry ( wq , & workqueues , list ) {
if ( ! ( wq - > flags & WQ_FREEZABLE ) )
continue ;
/*
* nr_active is monotonically decreasing . It ' s safe
* to peek without lock .
*/
2019-03-13 17:55:47 +01:00
rcu_read_lock ( ) ;
2015-04-02 19:14:39 +08:00
for_each_pwq ( pwq , wq ) {
WARN_ON_ONCE ( pwq - > nr_active < 0 ) ;
if ( pwq - > nr_active ) {
busy = true ;
2019-03-13 17:55:47 +01:00
rcu_read_unlock ( ) ;
2015-04-02 19:14:39 +08:00
goto out_unlock ;
}
}
2019-03-13 17:55:47 +01:00
rcu_read_unlock ( ) ;
2015-04-02 19:14:39 +08:00
}
out_unlock :
mutex_unlock ( & wq_pool_mutex ) ;
return busy ;
}
/**
* thaw_workqueues - thaw workqueues
*
* Thaw workqueues . Normal queueing is restored and all collected
* frozen works are transferred to their respective pool worklists .
*
* CONTEXT :
* Grabs and releases wq_pool_mutex , wq - > mutex and pool - > lock ' s .
*/
void thaw_workqueues ( void )
{
struct workqueue_struct * wq ;
struct pool_workqueue * pwq ;
mutex_lock ( & wq_pool_mutex ) ;
if ( ! workqueue_freezing )
goto out_unlock ;
workqueue_freezing = false ;
/* restore max_active and repopulate worklist */
list_for_each_entry ( wq , & workqueues , list ) {
mutex_lock ( & wq - > mutex ) ;
for_each_pwq ( pwq , wq )
pwq_adjust_max_active ( pwq ) ;
mutex_unlock ( & wq - > mutex ) ;
}
out_unlock :
mutex_unlock ( & wq_pool_mutex ) ;
}
# endif /* CONFIG_FREEZER */
2015-04-30 17:16:12 +08:00
static int workqueue_apply_unbound_cpumask ( void )
{
LIST_HEAD ( ctxs ) ;
int ret = 0 ;
struct workqueue_struct * wq ;
struct apply_wqattrs_ctx * ctx , * n ;
lockdep_assert_held ( & wq_pool_mutex ) ;
list_for_each_entry ( wq , & workqueues , list ) {
if ( ! ( wq - > flags & WQ_UNBOUND ) )
continue ;
/* creating multiple pwqs breaks ordering guarantee */
if ( wq - > flags & __WQ_ORDERED )
continue ;
ctx = apply_wqattrs_prepare ( wq , wq - > unbound_attrs ) ;
if ( ! ctx ) {
ret = - ENOMEM ;
break ;
}
list_add_tail ( & ctx - > list , & ctxs ) ;
}
list_for_each_entry_safe ( ctx , n , & ctxs , list ) {
if ( ! ret )
apply_wqattrs_commit ( ctx ) ;
apply_wqattrs_cleanup ( ctx ) ;
}
return ret ;
}
/**
* workqueue_set_unbound_cpumask - Set the low - level unbound cpumask
* @ cpumask : the cpumask to set
*
* The low - level workqueues cpumask is a global cpumask that limits
* the affinity of all unbound workqueues . This function checks the @ cpumask
* and applies it to all unbound workqueues and updates all pwqs of them .
*
2021-07-31 08:01:29 +08:00
* Return : 0 - Success
2015-04-30 17:16:12 +08:00
* - EINVAL - Invalid @ cpumask
* - ENOMEM - Failed to allocate memory for attrs or pwqs .
*/
int workqueue_set_unbound_cpumask ( cpumask_var_t cpumask )
{
int ret = - EINVAL ;
cpumask_var_t saved_cpumask ;
2017-11-03 17:27:50 +02:00
/*
* Not excluding isolated cpus on purpose .
* If the user wishes to include them , we allow that .
*/
2015-04-30 17:16:12 +08:00
cpumask_and ( cpumask , cpumask , cpu_possible_mask ) ;
if ( ! cpumask_empty ( cpumask ) ) {
2015-05-19 18:03:47 +08:00
apply_wqattrs_lock ( ) ;
2021-10-17 20:04:02 +08:00
if ( cpumask_equal ( cpumask , wq_unbound_cpumask ) ) {
ret = 0 ;
goto out_unlock ;
}
if ( ! zalloc_cpumask_var ( & saved_cpumask , GFP_KERNEL ) ) {
ret = - ENOMEM ;
goto out_unlock ;
}
2015-04-30 17:16:12 +08:00
/* save the old wq_unbound_cpumask. */
cpumask_copy ( saved_cpumask , wq_unbound_cpumask ) ;
/* update wq_unbound_cpumask at first and apply it to wqs. */
cpumask_copy ( wq_unbound_cpumask , cpumask ) ;
ret = workqueue_apply_unbound_cpumask ( ) ;
/* restore the wq_unbound_cpumask when failed. */
if ( ret < 0 )
cpumask_copy ( wq_unbound_cpumask , saved_cpumask ) ;
2021-10-17 20:04:02 +08:00
free_cpumask_var ( saved_cpumask ) ;
out_unlock :
2015-05-19 18:03:47 +08:00
apply_wqattrs_unlock ( ) ;
2015-04-30 17:16:12 +08:00
}
return ret ;
}
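workqueue_set_unbound_cpumask() follows a save, update, apply, roll-back-on-failure sequence. The sketch below reproduces only that control flow with a 64-bit mask; `sketch_global_mask`, `sketch_apply` and its artificial failure condition are assumptions made so the rollback path can be exercised, and locking is omitted.

```c
#include <stdint.h>

static uint64_t sketch_global_mask = 0xF;

/* Hypothetical apply step: fails when bit 63 is set, to exercise rollback. */
static int sketch_apply(uint64_t mask)
{
	return (mask >> 63) ? -1 : 0;
}

static int sketch_set_mask(uint64_t new_mask)
{
	uint64_t saved;
	int ret;

	if (new_mask == 0)
		return -1;		/* empty mask is invalid */
	if (new_mask == sketch_global_mask)
		return 0;		/* nothing to do */

	saved = sketch_global_mask;	/* save the old value */
	sketch_global_mask = new_mask;	/* update first, then apply */
	ret = sketch_apply(sketch_global_mask);
	if (ret < 0)
		sketch_global_mask = saved;	/* restore on failure */
	return ret;
}
```

Updating the global value before applying it lets the apply step read a single source of truth, at the cost of needing the explicit restore when it fails.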
2015-04-02 19:14:39 +08:00
# ifdef CONFIG_SYSFS
/*
* Workqueues with WQ_SYSFS flag set are visible to userland via
* / sys / bus / workqueue / devices / WQ_NAME . All visible workqueues have the
* following attributes .
*
* per_cpu RO bool : whether the workqueue is per - cpu or unbound
* max_active RW int : maximum number of in - flight work items
*
* Unbound workqueues have the following extra attributes .
*
2017-11-02 23:05:12 -04:00
* pool_ids RO int : the associated pool IDs for each node
2015-04-02 19:14:39 +08:00
* nice RW int : nice value of the workers
* cpumask RW mask : bitmask of allowed CPUs for the workers
2017-11-02 23:05:12 -04:00
* numa RW bool : whether to enable NUMA affinity
2015-04-02 19:14:39 +08:00
*/
struct wq_device {
struct workqueue_struct * wq ;
struct device dev ;
} ;
static struct workqueue_struct * dev_to_wq ( struct device * dev )
{
struct wq_device * wq_dev = container_of ( dev , struct wq_device , dev ) ;
return wq_dev - > wq ;
}
static ssize_t per_cpu_show ( struct device * dev , struct device_attribute * attr ,
char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
return scnprintf ( buf , PAGE_SIZE , " %d \n " , ( bool ) ! ( wq - > flags & WQ_UNBOUND ) ) ;
}
static DEVICE_ATTR_RO ( per_cpu ) ;
static ssize_t max_active_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
return scnprintf ( buf , PAGE_SIZE , " %d \n " , wq - > saved_max_active ) ;
}
static ssize_t max_active_store ( struct device * dev ,
struct device_attribute * attr , const char * buf ,
size_t count )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
int val ;
if ( sscanf ( buf , " %d " , & val ) ! = 1 | | val < = 0 )
return - EINVAL ;
workqueue_set_max_active ( wq , val ) ;
return count ;
}
static DEVICE_ATTR_RW ( max_active ) ;
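The input check in max_active_store() accepts only a parsable, strictly positive integer. A userspace sketch of the same validation follows; `sketch_parse_max_active` is a hypothetical helper, and it returns -1 where the kernel code returns -EINVAL.

```c
#include <stdio.h>

/*
 * Returns 0 and stores the parsed value on success; returns -1 when the
 * buffer does not start with an integer or the value is not positive.
 */
static int sketch_parse_max_active(const char *buf, int *out)
{
	int val;

	if (sscanf(buf, "%d", &val) != 1 || val <= 0)
		return -1;

	*out = val;
	return 0;
}
```

A trailing newline, as written by `echo`, is ignored by the `%d` conversion, so "16\n" parses as 16.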
static struct attribute * wq_sysfs_attrs [ ] = {
& dev_attr_per_cpu . attr ,
& dev_attr_max_active . attr ,
NULL ,
} ;
ATTRIBUTE_GROUPS ( wq_sysfs ) ;
static ssize_t wq_pool_ids_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
const char * delim = " " ;
int node , written = 0 ;
2021-08-03 16:16:20 +02:00
cpus_read_lock ( ) ;
2019-03-13 17:55:47 +01:00
rcu_read_lock ( ) ;
2015-04-02 19:14:39 +08:00
for_each_node ( node ) {
written + = scnprintf ( buf + written , PAGE_SIZE - written ,
" %s%d:%d " , delim , node ,
unbound_pwq_by_node ( wq , node ) - > pool - > id ) ;
delim = " " ;
}
written + = scnprintf ( buf + written , PAGE_SIZE - written , " \n " ) ;
2019-03-13 17:55:47 +01:00
rcu_read_unlock ( ) ;
2021-08-03 16:16:20 +02:00
cpus_read_unlock ( ) ;
2015-04-02 19:14:39 +08:00
return written ;
}
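wq_pool_ids_show() builds its output by repeatedly appending to a bounded buffer and switching the delimiter after the first entry. The sketch below models that with a userspace helper that, like scnprintf(), returns the number of characters actually stored. All `sketch_` names are hypothetical, the node index is simply the array index, and the initial delimiter is assumed to be the empty string.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Like snprintf(), but returns the number of characters actually stored. */
static int sketch_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)
		return 0;
	return (size_t)n >= size ? (int)(size - 1) : n;
}

/* Emit "index:id" pairs separated by single spaces, then a newline. */
static int sketch_format_pairs(char *buf, size_t size, const int *ids, int count)
{
	const char *delim = "";
	int i, written = 0;

	for (i = 0; i < count; i++) {
		written += sketch_scnprintf(buf + written, size - written,
					    "%s%d:%d", delim, i, ids[i]);
		delim = " ";
	}
	written += sketch_scnprintf(buf + written, size - written, "\n");
	return written;
}
```

Because the helper never reports more than it stored, `written` cannot exceed `size - 1`, so `buf + written` and `size - written` stay valid even when the output is truncated.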
static ssize_t wq_nice_show ( struct device * dev , struct device_attribute * attr ,
char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
int written ;
mutex_lock ( & wq - > mutex ) ;
written = scnprintf ( buf , PAGE_SIZE , " %d \n " , wq - > unbound_attrs - > nice ) ;
mutex_unlock ( & wq - > mutex ) ;
return written ;
}
/* prepare workqueue_attrs for sysfs store operations */
static struct workqueue_attrs * wq_sysfs_prep_attrs ( struct workqueue_struct * wq )
{
struct workqueue_attrs * attrs ;
2015-05-20 14:41:18 +08:00
lockdep_assert_held ( & wq_pool_mutex ) ;
2019-06-26 16:52:38 +02:00
attrs = alloc_workqueue_attrs ( ) ;
2015-04-02 19:14:39 +08:00
if ( ! attrs )
return NULL ;
copy_workqueue_attrs ( attrs , wq - > unbound_attrs ) ;
return attrs ;
}
static ssize_t wq_nice_store ( struct device * dev , struct device_attribute * attr ,
const char * buf , size_t count )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
struct workqueue_attrs * attrs ;
2015-05-19 18:03:48 +08:00
int ret = - ENOMEM ;
apply_wqattrs_lock ( ) ;
2015-04-02 19:14:39 +08:00
attrs = wq_sysfs_prep_attrs ( wq ) ;
if ( ! attrs )
2015-05-19 18:03:48 +08:00
goto out_unlock ;
2015-04-02 19:14:39 +08:00
if ( sscanf ( buf , " %d " , & attrs - > nice ) = = 1 & &
attrs - > nice > = MIN_NICE & & attrs - > nice < = MAX_NICE )
2015-05-19 18:03:48 +08:00
ret = apply_workqueue_attrs_locked ( wq , attrs ) ;
2015-04-02 19:14:39 +08:00
else
ret = - EINVAL ;
2015-05-19 18:03:48 +08:00
out_unlock :
apply_wqattrs_unlock ( ) ;
2015-04-02 19:14:39 +08:00
free_workqueue_attrs ( attrs ) ;
return ret ? : count ;
}
static ssize_t wq_cpumask_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
int written ;
mutex_lock ( & wq - > mutex ) ;
written = scnprintf ( buf , PAGE_SIZE , " %*pb \n " ,
cpumask_pr_args ( wq - > unbound_attrs - > cpumask ) ) ;
mutex_unlock ( & wq - > mutex ) ;
return written ;
}
static ssize_t wq_cpumask_store ( struct device * dev ,
struct device_attribute * attr ,
const char * buf , size_t count )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
struct workqueue_attrs * attrs ;
2015-05-19 18:03:48 +08:00
int ret = - ENOMEM ;
apply_wqattrs_lock ( ) ;
2015-04-02 19:14:39 +08:00
attrs = wq_sysfs_prep_attrs ( wq ) ;
if ( ! attrs )
2015-05-19 18:03:48 +08:00
goto out_unlock ;
2015-04-02 19:14:39 +08:00
ret = cpumask_parse ( buf , attrs - > cpumask ) ;
if ( ! ret )
2015-05-19 18:03:48 +08:00
ret = apply_workqueue_attrs_locked ( wq , attrs ) ;
2015-04-02 19:14:39 +08:00
2015-05-19 18:03:48 +08:00
out_unlock :
apply_wqattrs_unlock ( ) ;
2015-04-02 19:14:39 +08:00
free_workqueue_attrs ( attrs ) ;
return ret ? : count ;
}
static ssize_t wq_numa_show ( struct device * dev , struct device_attribute * attr ,
char * buf )
{
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
int written ;
2013-03-19 13:45:21 -07:00
2015-04-02 19:14:39 +08:00
mutex_lock ( & wq - > mutex ) ;
written = scnprintf ( buf , PAGE_SIZE , " %d \n " ,
! wq - > unbound_attrs - > no_numa ) ;
mutex_unlock ( & wq - > mutex ) ;
2013-04-01 11:23:36 -07:00
2015-04-02 19:14:39 +08:00
return written ;
2012-07-17 12:39:26 -07:00
}
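The per-node mapping described in the NUMA affinity commit message above can be sketched in userspace as follows. This is only an illustration under stated assumptions: CPU sets are modelled as 64-bit masks rather than struct cpumask, and demo_node_pwq_mask() is a hypothetical name, not a function from this file.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's implementation: CPU sets are
 * modelled as 64-bit masks instead of struct cpumask.
 *
 * attrs_mask: CPUs the workqueue is allowed to run on
 * node_cpus:  CPUs that belong to the NUMA node
 * online:     CPUs that are currently online
 */
static uint64_t demo_node_pwq_mask(uint64_t attrs_mask, uint64_t node_cpus,
				   uint64_t online)
{
	/* The node has at least one online CPU that the workqueue allows. */
	if (attrs_mask & node_cpus & online)
		return attrs_mask & node_cpus;

	/* No usable CPU on this node: fall back to the default pwq's mask. */
	return attrs_mask;
}
```

For example, with attrs_mask 0x3c (CPUs 2-5), node 0 owning CPUs 0-3 and node 1 owning CPUs 4-7, all online, node 0 gets 0x0c and node 1 gets 0x30. If none of node 1's CPUs are online, node 1 falls back to the whole 0x3c.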
2015-04-02 19:14:39 +08:00
static ssize_t wq_numa_store ( struct device * dev , struct device_attribute * attr ,
const char * buf , size_t count )
2012-07-17 12:39:26 -07:00
{
2015-04-02 19:14:39 +08:00
struct workqueue_struct * wq = dev_to_wq ( dev ) ;
struct workqueue_attrs * attrs ;
2015-05-19 18:03:48 +08:00
int v , ret = - ENOMEM ;
apply_wqattrs_lock ( ) ;
2013-04-01 11:23:36 -07:00
2015-04-02 19:14:39 +08:00
attrs = wq_sysfs_prep_attrs ( wq ) ;
if ( ! attrs )
2015-05-19 18:03:48 +08:00
goto out_unlock ;
2013-04-01 11:23:36 -07:00
2015-04-02 19:14:39 +08:00
ret = - EINVAL ;
if ( sscanf ( buf , " %d " , & v ) = = 1 ) {
attrs - > no_numa = ! v ;
2015-05-19 18:03:48 +08:00
ret = apply_workqueue_attrs_locked ( wq , attrs ) ;
2012-07-17 12:39:26 -07:00
}
2015-04-02 19:14:39 +08:00
2015-05-19 18:03:48 +08:00
out_unlock :
apply_wqattrs_unlock ( ) ;
2015-04-02 19:14:39 +08:00
free_workqueue_attrs ( attrs ) ;
return ret ? : count ;
2012-07-17 12:39:26 -07:00
}
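The return convention used by wq_numa_store() above (a negative errno on failure, otherwise the number of bytes consumed) can be tried out in userspace with a sketch like the one below. struct demo_attrs and demo_numa_store() are hypothetical stand-ins; locking and the call into apply_workqueue_attrs_locked() are deliberately left out.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical stand-in for the attribute being toggled. */
struct demo_attrs {
	int no_numa;
};

/*
 * Parse a decimal integer from buf and store its logical negation,
 * mirroring the "numa" file semantics: writing 1 enables NUMA
 * affinity (no_numa = 0), writing 0 disables it (no_numa = 1).
 */
static ssize_t demo_numa_store(struct demo_attrs *attrs, const char *buf,
			       size_t count)
{
	int v;

	/* Reject input that does not start with an integer. */
	if (sscanf(buf, "%d", &v) != 1)
		return -EINVAL;

	attrs->no_numa = !v;

	/* Success: report that the whole write was consumed. */
	return (ssize_t)count;
}
```

Returning count on success tells the caller that the whole write was accepted, so userspace does not retry with the remainder of the buffer.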
2015-04-02 19:14:39 +08:00
static struct device_attribute wq_sysfs_unbound_attrs [ ] = {
__ATTR ( pool_ids , 0444 , wq_pool_ids_show , NULL ) ,
__ATTR ( nice , 0644 , wq_nice_show , wq_nice_store ) ,
__ATTR ( cpumask , 0644 , wq_cpumask_show , wq_cpumask_store ) ,
__ATTR ( numa , 0644 , wq_numa_show , wq_numa_store ) ,
__ATTR_NULL ,
} ;
2009-01-16 15:31:15 -08:00
2015-04-02 19:14:39 +08:00
static struct bus_type wq_subsys = {
. name = " workqueue " ,
. dev_groups = wq_sysfs_groups ,
2008-11-05 13:39:10 +11:00
} ;
2015-04-27 17:58:39 +08:00
static ssize_t wq_unbound_cpumask_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
int written ;
2015-04-30 17:16:12 +08:00
mutex_lock ( & wq_pool_mutex ) ;
2015-04-27 17:58:39 +08:00
written = scnprintf ( buf , PAGE_SIZE , " %*pb \n " ,
cpumask_pr_args ( wq_unbound_cpumask ) ) ;
2015-04-30 17:16:12 +08:00
mutex_unlock ( & wq_pool_mutex ) ;
2015-04-27 17:58:39 +08:00
return written ;
}
2015-04-30 17:16:12 +08:00
static ssize_t wq_unbound_cpumask_store ( struct device * dev ,
struct device_attribute * attr , const char * buf , size_t count )
{
cpumask_var_t cpumask ;
int ret ;
if ( ! zalloc_cpumask_var ( & cpumask , GFP_KERNEL ) )
return - ENOMEM ;
ret = cpumask_parse ( buf , cpumask ) ;
if ( ! ret )
ret = workqueue_set_unbound_cpumask ( cpumask ) ;
free_cpumask_var ( cpumask ) ;
return ret ? ret : count ;
}
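As a rough userspace illustration of the parse-then-apply flow in wq_unbound_cpumask_store() above, the sketch below parses a single hexadecimal word into a 64-bit mask. It is a simplification under stated assumptions: demo_parse_cpumask() is a hypothetical name, and unlike the kernel's cpumask_parse() it does not accept comma-separated 32-bit groups.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one hex word such as "f" or "ff\n" into a
 * 64-bit CPU mask. Returns 0 on success and -EINVAL on malformed input.
 */
static int demo_parse_cpumask(const char *buf, uint64_t *mask)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(buf, &end, 16);
	if (errno || end == buf)
		return -EINVAL;		/* overflow or no digits at all */

	if (*end == '\n')
		end++;			/* tolerate the newline echo(1) appends */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */

	*mask = v;
	return 0;
}
```

As in the store handler, the mask is written to the output only after the whole input has been validated, so a failed parse leaves the previous value untouched.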
2015-04-27 17:58:39 +08:00
static struct device_attribute wq_sysfs_cpumask_attr =
2015-04-30 17:16:12 +08:00
__ATTR ( cpumask , 0644 , wq_unbound_cpumask_show ,
wq_unbound_cpumask_store ) ;
2015-04-27 17:58:39 +08:00
2015-04-02 19:14:39 +08:00
static int __init wq_sysfs_init ( void )
2008-11-05 13:39:10 +11:00
{
2015-04-27 17:58:39 +08:00
int err ;
err = subsys_virtual_register ( & wq_subsys , NULL ) ;
if ( err )
return err ;
return device_create_file ( wq_subsys . dev_root , & wq_sysfs_cpumask_attr ) ;
2008-11-05 13:39:10 +11:00
}
2015-04-02 19:14:39 +08:00
core_initcall ( wq_sysfs_init ) ;
2008-11-05 13:39:10 +11:00
2015-04-02 19:14:39 +08:00
static void wq_device_release ( struct device * dev )
2008-11-05 13:39:10 +11:00
{
2015-04-02 19:14:39 +08:00
struct wq_device * wq_dev = container_of ( dev , struct wq_device , dev ) ;
2009-04-09 09:50:37 -06:00
2015-04-02 19:14:39 +08:00
kfree ( wq_dev ) ;
2008-11-05 13:39:10 +11:00
}
2010-06-29 10:07:12 +02:00
/**
2015-04-02 19:14:39 +08:00
* workqueue_sysfs_register - make a workqueue visible in sysfs
* @ wq : the workqueue to register
2010-06-29 10:07:12 +02:00
*
2015-04-02 19:14:39 +08:00
* Expose @ wq in sysfs under / sys / bus / workqueue / devices .
* alloc_workqueue * ( ) automatically calls this function if WQ_SYSFS is set
* which is the preferred method .
2010-06-29 10:07:12 +02:00
*
2015-04-02 19:14:39 +08:00
* Workqueue user should use this function directly iff it wants to apply
* workqueue_attrs before making the workqueue visible in sysfs ; otherwise ,
* apply_workqueue_attrs ( ) may race against userland updating the
* attributes .
*
* Return : 0 on success , - errno on failure .
2010-06-29 10:07:12 +02:00
*/
2015-04-02 19:14:39 +08:00
int workqueue_sysfs_register ( struct workqueue_struct * wq )
2010-06-29 10:07:12 +02:00
{
2015-04-02 19:14:39 +08:00
struct wq_device * wq_dev ;
int ret ;
2010-06-29 10:07:12 +02:00
2015-04-02 19:14:39 +08:00
/*
2015-05-23 10:38:14 +05:30
* Adjusting max_active or creating new pwqs by applying
2015-04-02 19:14:39 +08:00
* attributes breaks ordering guarantee . Disallow exposing ordered
* workqueues .
*/
2017-07-23 08:36:15 -04:00
if ( WARN_ON ( wq - > flags & __WQ_ORDERED_EXPLICIT ) )
2015-04-02 19:14:39 +08:00
return - EINVAL ;
2010-06-29 10:07:12 +02:00
2015-04-02 19:14:39 +08:00
wq - > wq_dev = wq_dev = kzalloc ( sizeof ( * wq_dev ) , GFP_KERNEL ) ;
if ( ! wq_dev )
return - ENOMEM ;
2013-03-13 19:47:40 -07:00
2015-04-02 19:14:39 +08:00
wq_dev - > wq = wq ;
wq_dev - > dev . bus = & wq_subsys ;
wq_dev - > dev . release = wq_device_release ;
2016-02-17 21:04:41 +01:00
dev_set_name ( & wq_dev - > dev , " %s " , wq - > name ) ;
2010-06-29 10:07:12 +02:00
2015-04-02 19:14:39 +08:00
/*
* unbound_attrs are created separately . Suppress uevent until
* everything is ready .
*/
dev_set_uevent_suppress ( & wq_dev - > dev , true ) ;
2010-06-29 10:07:12 +02:00
2015-04-02 19:14:39 +08:00
ret = device_register ( & wq_dev - > dev ) ;
if ( ret ) {
2018-03-06 15:35:43 +05:30
put_device ( & wq_dev - > dev ) ;
2015-04-02 19:14:39 +08:00
wq - > wq_dev = NULL ;
return ret ;
}
2010-06-29 10:07:12 +02:00
2015-04-02 19:14:39 +08:00
if ( wq - > flags & WQ_UNBOUND ) {
struct device_attribute * attr ;
2010-06-29 10:07:12 +02:00
2015-04-02 19:14:39 +08:00
for ( attr = wq_sysfs_unbound_attrs ; attr - > attr . name ; attr + + ) {
ret = device_create_file ( & wq_dev - > dev , attr ) ;
if ( ret ) {
device_unregister ( & wq_dev - > dev ) ;
wq - > wq_dev = NULL ;
return ret ;
2010-06-29 10:07:12 +02:00
}
}
}
2015-04-02 19:14:39 +08:00
dev_set_uevent_suppress ( & wq_dev - > dev , false ) ;
kobject_uevent ( & wq_dev - > dev . kobj , KOBJ_ADD ) ;
return 0 ;
2010-06-29 10:07:12 +02:00
}
/**
2015-04-02 19:14:39 +08:00
* workqueue_sysfs_unregister - undo workqueue_sysfs_register ( )
* @ wq : the workqueue to unregister
2010-06-29 10:07:12 +02:00
*
2015-04-02 19:14:39 +08:00
* If @ wq is registered to sysfs by workqueue_sysfs_register ( ) , unregister .
2010-06-29 10:07:12 +02:00
*/
2015-04-02 19:14:39 +08:00
static void workqueue_sysfs_unregister ( struct workqueue_struct * wq )
2010-06-29 10:07:12 +02:00
{
2015-04-02 19:14:39 +08:00
struct wq_device * wq_dev = wq - > wq_dev ;
2010-06-29 10:07:12 +02:00
2015-04-02 19:14:39 +08:00
if ( ! wq - > wq_dev )
return ;
2010-06-29 10:07:12 +02:00
2015-04-02 19:14:39 +08:00
wq - > wq_dev = NULL ;
device_unregister ( & wq_dev - > dev ) ;
2010-06-29 10:07:12 +02:00
}
2015-04-02 19:14:39 +08:00
# else /* CONFIG_SYSFS */
static void workqueue_sysfs_unregister ( struct workqueue_struct * wq ) { }
# endif /* CONFIG_SYSFS */
2010-06-29 10:07:12 +02:00
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as a
missing WQ_MEM_RECLAIM flag or a concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements a workqueue lockup
detector. It monitors all worker_pools periodically and,
if any pool fails to make forward progress for longer than the threshold
duration, triggers a warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through the kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 11:28:04 -05:00
/*
* Workqueue watchdog .
*
* Stall may be caused by various bugs - missing WQ_MEM_RECLAIM , illegal
* flush dependency , a concurrency managed work item which stays RUNNING
* indefinitely . Workqueue stalls can be very difficult to debug as the
* usual warning mechanisms don ' t trigger and internal workqueue state is
* largely opaque .
*
* Workqueue watchdog monitors all worker pools periodically and dumps
* state if some pools failed to make forward progress for a while where
* forward progress is defined as the first item on - > worklist changing .
*
* This mechanism is controlled through the kernel parameter
* " workqueue.watchdog_thresh " which can be updated at runtime through the
* corresponding sysfs parameter file .
*/
# ifdef CONFIG_WQ_WATCHDOG
static unsigned long wq_watchdog_thresh = 30 ;
2017-10-04 16:27:00 -07:00
static struct timer_list wq_watchdog_timer ;
2015-12-08 11:28:04 -05:00
static unsigned long wq_watchdog_touched = INITIAL_JIFFIES ;
static DEFINE_PER_CPU ( unsigned long , wq_watchdog_touched_cpu ) = INITIAL_JIFFIES ;
static void wq_watchdog_reset_touched ( void )
{
int cpu ;
wq_watchdog_touched = jiffies ;
for_each_possible_cpu ( cpu )
per_cpu ( wq_watchdog_touched_cpu , cpu ) = jiffies ;
}
2017-10-04 16:27:00 -07:00
static void wq_watchdog_timer_fn ( struct timer_list * unused )
2015-12-08 11:28:04 -05:00
{
unsigned long thresh = READ_ONCE ( wq_watchdog_thresh ) * HZ ;
bool lockup_detected = false ;
2021-05-20 19:14:22 +09:00
unsigned long now = jiffies ;
2015-12-08 11:28:04 -05:00
struct worker_pool * pool ;
int pi ;
if ( ! thresh )
return ;
rcu_read_lock ( ) ;
for_each_pool ( pool , pi ) {
unsigned long pool_ts , touched , ts ;
if ( list_empty ( & pool - > worklist ) )
continue ;
2021-05-20 19:14:22 +09:00
/*
* If a virtual machine is stopped by the host it can look to
* the watchdog like a stall .
*/
kvm_check_and_clear_guest_paused ( ) ;
2015-12-08 11:28:04 -05:00
/* get the latest of pool and touched timestamps */
2021-03-24 19:40:29 +08:00
if ( pool - > cpu > = 0 )
touched = READ_ONCE ( per_cpu ( wq_watchdog_touched_cpu , pool - > cpu ) ) ;
else
touched = READ_ONCE ( wq_watchdog_touched ) ;
2015-12-08 11:28:04 -05:00
pool_ts = READ_ONCE ( pool - > watchdog_ts ) ;
if ( time_after ( pool_ts , touched ) )
ts = pool_ts ;
else
ts = touched ;
/* did we stall? */
2021-05-20 19:14:22 +09:00
if ( time_after ( now , ts + thresh ) ) {
2015-12-08 11:28:04 -05:00
lockup_detected = true ;
pr_emerg ( " BUG: workqueue lockup - pool " ) ;
pr_cont_pool_info ( pool ) ;
pr_cont ( " stuck for %us! \n " ,
2021-05-20 19:14:22 +09:00
jiffies_to_msecs ( now - pool_ts ) / 1000 ) ;
2015-12-08 11:28:04 -05:00
}
}
rcu_read_unlock ( ) ;
if ( lockup_detected )
2021-10-20 14:09:00 +11:00
show_all_workqueues ( ) ;
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements workqueue lockup
detector. It periodically monitors all worker_pools periodically and,
if any pool failed to make forward progress longer than the threshold
duration, triggers warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through the kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 11:28:04 -05:00
	wq_watchdog_reset_touched();
	mod_timer(&wq_watchdog_timer, jiffies + thresh);
}
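As a rough illustration of the decision that leads to the warning, the following userspace sketch shows how a stall can be derived from tick counters. The helper names `tick_after` and `pool_stalled` are hypothetical and do not appear in workqueue.c. The sketch assumes the detector compares the current tick against the later of a pool's last-progress timestamp and the last touch, plus the threshold, using wrap-safe arithmetic in the style of the kernel's time_after().

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for time_after(): true when tick a is later than
 * tick b, even if the unsigned counter has wrapped in between.
 */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Hypothetical helper: a pool counts as stalled once `now` has moved past
 * the later of its last-progress timestamp and the last touch, plus the
 * threshold (all values in ticks).
 */
static bool pool_stalled(unsigned long now, unsigned long pool_ts,
			 unsigned long touched, unsigned long thresh)
{
	unsigned long ts = tick_after(touched, pool_ts) ? touched : pool_ts;

	return tick_after(now, ts + thresh);
}
```

Taking the later of the two timestamps means that a touch postpones the report, which is the effect a function such as wq_watchdog_touch() is meant to have.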
2018-08-21 17:25:07 +02:00
notrace void wq_watchdog_touch(int cpu)
2015-12-08 11:28:04 -05:00
{
	if (cpu >= 0)
		per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
2021-03-24 19:40:29 +08:00
	wq_watchdog_touched = jiffies;
2015-12-08 11:28:04 -05:00
}

static void wq_watchdog_set_thresh(unsigned long thresh)
{
	wq_watchdog_thresh = 0;
	del_timer_sync(&wq_watchdog_timer);

	if (thresh) {
		wq_watchdog_thresh = thresh;
		wq_watchdog_reset_touched();
		mod_timer(&wq_watchdog_timer, jiffies + thresh * HZ);
	}
}

static int wq_watchdog_param_set_thresh(const char *val,
					const struct kernel_param *kp)
{
	unsigned long thresh;
	int ret;

	ret = kstrtoul(val, 0, &thresh);
	if (ret)
		return ret;

	if (system_wq)
		wq_watchdog_set_thresh(thresh);
	else
		wq_watchdog_thresh = thresh;

	return 0;
}
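The setter above parses its input with kstrtoul() in base 0. A userspace approximation, assuming strtoul() semantics and a hypothetical helper name `parse_thresh`, might look like the following. It sketches the parsing step only and does not arm any timer.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse an unsigned threshold the way a strict
 * string-to-number parser would. Returns 0 on success or a negative
 * errno value, and accepts at most one trailing newline.
 */
static int parse_thresh(const char *val, unsigned long *out)
{
	char *end;
	unsigned long v;

	/* strtoul() would skip whitespace and accept '-'; reject both. */
	if (isspace((unsigned char)*val) || *val == '-')
		return -EINVAL;

	errno = 0;
	v = strtoul(val, &end, 0);	/* base 0: "30", "0x1e" and "036" all parse */
	if (errno == ERANGE)
		return -ERANGE;
	if (end == val)
		return -EINVAL;		/* no digits at all */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */

	*out = v;
	return 0;
}
```

Tolerating a trailing newline matters in practice because a value written with `echo` to a parameter file arrives with one appended.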

static const struct kernel_param_ops wq_watchdog_thresh_ops = {
	.set	= wq_watchdog_param_set_thresh,
	.get	= param_get_ulong,
};

module_param_cb(watchdog_thresh, &wq_watchdog_thresh_ops, &wq_watchdog_thresh,
		0644);

static void wq_watchdog_init(void)
{
2017-10-04 16:27:00 -07:00
	timer_setup(&wq_watchdog_timer, wq_watchdog_timer_fn, TIMER_DEFERRABLE);
2015-12-08 11:28:04 -05:00
	wq_watchdog_set_thresh(wq_watchdog_thresh);
}

#else	/* CONFIG_WQ_WATCHDOG */

static inline void wq_watchdog_init(void) { }

#endif	/* CONFIG_WQ_WATCHDOG */
2013-04-01 11:23:32 -07:00
static void __init wq_numa_init(void)
{
	cpumask_var_t *tbl;
	int node, cpu;

	if (num_possible_nodes() <= 1)
		return;
2013-04-01 11:23:38 -07:00
	if (wq_disable_numa) {
		pr_info("workqueue: NUMA affinity support disabled\n");
		return;
	}
2021-07-22 11:03:52 +08:00
	for_each_possible_cpu(cpu) {
		if (WARN_ON(cpu_to_node(cpu) == NUMA_NO_NODE)) {
			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
			return;
		}
	}
2019-06-26 16:52:38 +02:00
	wq_update_unbound_numa_attrs_buf = alloc_workqueue_attrs();
workqueue: implement NUMA affinity for unbound workqueues
Currently, an unbound workqueue has single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as a safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01 11:23:36 -07:00
	BUG_ON(!wq_update_unbound_numa_attrs_buf);
2013-04-01 11:23:32 -07:00
	/*
	 * We want masks of possible CPUs of each node which isn't readily
	 * available.  Build one from cpu_to_node() which should have been
	 * fully initialized by now.
	 */
treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:
kzalloc(a * b, gfp)
with:
kcalloc(a, b, gfp)
as well as handling cases of:
kzalloc(a * b * c, gfp)
with:
kzalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kzalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kzalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc
+ kcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc(sizeof(THING) * C2, ...)
|
kzalloc(sizeof(TYPE) * C2, ...)
|
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 14:03:40 -07:00
	tbl = kcalloc(nr_node_ids, sizeof(tbl[0]), GFP_KERNEL);
2013-04-01 11:23:32 -07:00
	BUG_ON(!tbl);

	for_each_node(node)
workqueue: zero cpumask of wq_numa_possible_cpumask on init
When hot-adding and onlining a CPU, a kernel panic occurs, showing the following
call trace.
BUG: unable to handle kernel paging request at 0000000000001d08
IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
PGD 0
Oops: 0000 [#1] SMP
...
Call Trace:
[<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
[<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
[<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
[<ffffffff811926f1>] new_slab+0x91/0x300
[<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
[<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
[<ffffffff810a3c78>] ? load_balance+0x218/0x890
[<ffffffff8101a679>] ? sched_clock+0x9/0x10
[<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
[<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
[<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
[<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
[<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
[<ffffffff8105d0ec>] do_fork+0xbc/0x360
[<ffffffff8105d3b6>] kernel_thread+0x26/0x30
[<ffffffff81086652>] kthreadd+0x2c2/0x300
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
[<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
In my investigation, I found that the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask are allocated by
alloc_cpumask_var_node() and are then used without being initialized,
so they hold garbage values.
When hot-adding and onlining a CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(), and alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
as follows:
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
	for_each_node(node) {
		if (cpumask_subset(pool->attrs->cpumask,
				   wq_numa_possible_cpumask[node])) {
			pool->node = node;
			break;
		}
	}
}
But wq_numa_possible_cpumask[node] does not hold a correct cpumask, so the
wrong node is selected. As a result, a kernel panic occurs.
With this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node() so that they are zero-initialized, and the panic
disappears.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
2014-07-07 09:56:48 -04:00
		BUG_ON(!zalloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
2013-05-15 14:24:24 -07:00
				node_online(node) ? node : NUMA_NO_NODE));
2013-04-01 11:23:32 -07:00
	for_each_possible_cpu(cpu) {
		node = cpu_to_node(cpu);
		cpumask_set_cpu(cpu, tbl[node]);
	}

	wq_numa_possible_cpumask = tbl;
	wq_numa_enabled = true;
}
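The table construction in wq_numa_init() can be pictured with a small userspace sketch. It assumes at most 64 CPUs so that a `uint64_t` can stand in for a cpumask, and `build_node_masks` is a hypothetical name. Here calloc() plays the role of the zeroing allocators, so every mask starts out empty before CPUs are added to it.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: build one CPU bitmask per node from a cpu -> node
 * map. Returns a zero-initialized table of nr_nodes masks that the caller
 * must free(), or NULL on invalid input or allocation failure.
 */
static uint64_t *build_node_masks(const int *cpu_to_node, int nr_cpus,
				  int nr_nodes)
{
	uint64_t *tbl;
	int cpu;

	if (nr_cpus < 0 || nr_cpus > 64 || nr_nodes <= 0)
		return NULL;

	/* Zeroed storage, so no mask ever holds stale bits. */
	tbl = calloc((size_t)nr_nodes, sizeof(tbl[0]));
	if (!tbl)
		return NULL;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		int node = cpu_to_node[cpu];

		/* A CPU without a valid node makes the table unusable. */
		if (node < 0 || node >= nr_nodes) {
			free(tbl);
			return NULL;
		}
		tbl[node] |= UINT64_C(1) << cpu;
	}
	return tbl;
}
```

Starting from zeroed masks is the point of the zalloc_cpumask_var_node() fix described earlier: a mask with leftover bits could match the wrong node in a subset test.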
2016-09-16 15:49:32 -04:00
/**
 * workqueue_init_early - early init for workqueue subsystem
 *
 * This is the first half of two-staged workqueue subsystem initialization
 * and invoked as soon as the bare basics - memory allocation, cpumasks and
 * idr are up.  It sets up all the data structures and system workqueues
 * and allows early boot code to create workqueues and queue/cancel work
 * items.  Actual work item execution starts only after kthreads can be
 * created and scheduled right before early initcalls.
 */
2020-02-23 15:28:52 +08:00
void __init workqueue_init_early(void)
2005-04-16 15:20:36 -07:00
{
2013-03-12 11:30:00 -07:00
	int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
	int i, cpu;
2010-06-29 10:07:11 +02:00
2020-06-01 08:44:40 +00:00
	BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
2013-03-12 11:29:57 -07:00
2015-04-27 17:58:39 +08:00
	BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
2022-02-07 16:59:06 +01:00
	cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_WQ));
	cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_DOMAIN));
2015-04-27 17:58:39 +08:00
2013-03-12 11:29:57 -07:00
	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
2013-01-24 11:01:34 -08:00
	/* initialize CPU pools */
2013-03-12 11:30:03 -07:00
	for_each_possible_cpu(cpu) {
2012-07-13 22:16:44 -07:00
		struct worker_pool *pool;
2010-06-29 10:07:12 +02:00
2013-03-12 11:30:00 -07:00
		i = 0;
2013-03-12 11:30:03 -07:00
		for_each_cpu_worker_pool(pool, cpu) {
2013-03-12 11:30:00 -07:00
			BUG_ON(init_worker_pool(pool));
2013-01-24 11:01:33 -08:00
			pool->cpu = cpu;
2013-03-12 11:30:03 -07:00
			cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
2013-03-12 11:30:00 -07:00
			pool->attrs->nice = std_nice[i++];
2013-04-01 11:23:34 -07:00
			pool->node = cpu_to_node(cpu);
2013-03-12 11:30:00 -07:00
2013-01-24 11:01:33 -08:00
			/* alloc pool ID */
2013-03-25 16:57:17 -07:00
			mutex_lock(&wq_pool_mutex);
2013-01-24 11:01:33 -08:00
			BUG_ON(worker_pool_assign_id(pool));
2013-03-25 16:57:17 -07:00
			mutex_unlock(&wq_pool_mutex);
2012-07-13 22:16:44 -07:00
		}
2010-06-29 10:07:12 +02:00
	}
2013-09-05 12:30:04 -04:00
	/* create default unbound and ordered wq attrs */
2013-03-12 11:30:03 -07:00
	for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
		struct workqueue_attrs *attrs;
2019-06-26 16:52:38 +02:00
		BUG_ON(!(attrs = alloc_workqueue_attrs()));
2013-03-12 11:30:03 -07:00
		attrs->nice = std_nice[i];
		unbound_std_wq_attrs[i] = attrs;
2013-09-05 12:30:04 -04:00
		/*
		 * An ordered wq should have only one pwq as ordering is
		 * guaranteed by max_active which is enforced by pwqs.
		 * Turn off NUMA so that dfl_pwq is used for all nodes.
		 */
2019-06-26 16:52:38 +02:00
		BUG_ON(!(attrs = alloc_workqueue_attrs()));
2013-09-05 12:30:04 -04:00
		attrs->nice = std_nice[i];
		attrs->no_numa = true;
		ordered_wq_attrs[i] = attrs;
2013-03-12 11:30:03 -07:00
	}
2010-06-29 10:07:14 +02:00
	system_wq = alloc_workqueue("events", 0, 0);
2012-08-15 23:25:39 +09:00
	system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
2010-06-29 10:07:14 +02:00
	system_long_wq = alloc_workqueue("events_long", 0, 0);
2010-07-02 10:03:51 +02:00
	system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
					    WQ_UNBOUND_MAX_ACTIVE);
2011-02-21 09:52:50 +01:00
	system_freezable_wq = alloc_workqueue("events_freezable",
					      WQ_FREEZABLE, 0);
2013-04-24 17:12:54 +05:30
	system_power_efficient_wq = alloc_workqueue("events_power_efficient",
					      WQ_POWER_EFFICIENT, 0);
	system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_power_efficient",
					      WQ_FREEZABLE | WQ_POWER_EFFICIENT,
					      0);
2012-08-15 23:25:39 +09:00
	BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
2013-04-24 17:12:54 +05:30
	       !system_unbound_wq || !system_freezable_wq ||
	       !system_power_efficient_wq ||
	       !system_freezable_power_efficient_wq);
2016-09-16 15:49:32 -04:00
}
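To make the nested loop over CPUs and standard pools concrete, here is a userspace sketch of the per-CPU pool setup. The type `demo_pool`, the function `init_cpu_pools` and the constant `DEMO_HIGHPRI_NICE` (set to -20 here as an assumption) are illustrative only. The sketch hands out sequential IDs itself, whereas the kernel obtains them through worker_pool_assign_id() under wq_pool_mutex.

```c
#define DEMO_NR_STD_POOLS	2
#define DEMO_HIGHPRI_NICE	(-20)	/* assumed value for the high priority pool */

/* Hypothetical, heavily reduced stand-in for struct worker_pool. */
struct demo_pool {
	int cpu;
	int nice;
	int id;
};

/*
 * Give every CPU one normal and one high priority pool and number the
 * pools in creation order. Returns the number of pools initialized.
 */
static int init_cpu_pools(struct demo_pool pools[][DEMO_NR_STD_POOLS],
			  int nr_cpus)
{
	static const int std_nice[DEMO_NR_STD_POOLS] = { 0, DEMO_HIGHPRI_NICE };
	int cpu, i, next_id = 0;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		for (i = 0; i < DEMO_NR_STD_POOLS; i++) {
			pools[cpu][i].cpu = cpu;
			pools[cpu][i].nice = std_nice[i];
			pools[cpu][i].id = next_id++;
		}
	}
	return next_id;
}
```

With four CPUs this yields eight pools, where even IDs belong to the normal pools and odd IDs to the high priority ones.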

/**
 * workqueue_init - bring workqueue subsystem fully online
 *
 * This is the latter half of two-staged workqueue subsystem initialization
 * and invoked as soon as kthreads can be created and scheduled.
 * Workqueues have been created and work items queued on them, but there
 * are no kworkers executing the work items yet.  Populate the worker pools
 * with the initial workers and enable future kworker creations.
 */
2020-02-23 15:28:52 +08:00
void __init workqueue_init(void)
2016-09-16 15:49:32 -04:00
{
workqueue: move wq_numa_init() to workqueue_init()
While splitting up workqueue initialization into two parts,
ac8f73400782 ("workqueue: make workqueue available early during boot")
put wq_numa_init() into workqueue_init_early(). Unfortunately, on
some archs including power and arm64, cpu to node mapping isn't yet
established by the time the early init is called leading to incorrect
NUMA initialization and subsequently the following oops due to zero
cpumask on node-specific unbound pools.
Unable to handle kernel paging request for data at address 0x00000038
Faulting instruction address: 0xc0000000000fc0cc
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
task: c0000007f5400000 task.stack: c000001ffc084000
NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0
REGS: c000001ffc087780 TRAP: 0300 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005)
MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 48000424 XER: 00000000
CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0
GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600
GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f
GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000
GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001
GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e
GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001
GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000
NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0
LR [c0000000000ed928] activate_task+0x78/0xe0
Call Trace:
[c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable)
[c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0
[c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0
[c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0
[c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250
[c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150
[c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360
[c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
[c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
Instruction dump:
62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000
60420000 72490021 ebfe0150 2f890001 <ebbf0038> 419e0de0 7fbee840 419e0e58
---[ end trace 0000000000000000 ]---
Fix it by moving wq_numa_init() to workqueue_init(). As this means
that the early initialization may not have full NUMA info for per-cpu
pools and ignores NUMA affinity for unbound pools, fix them up from
workqueue_init() after wq_numa_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Link: http://lkml.kernel.org/r/87twck5wqo.fsf@concordia.ellerman.id.au
Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-19 12:01:27 -04:00
	struct workqueue_struct *wq;
2016-09-16 15:49:32 -04:00
	struct worker_pool *pool;
	int cpu, bkt;
2016-10-19 12:01:27 -04:00
/*
* It ' d be simpler to initialize NUMA in workqueue_init_early ( ) but
* CPU to node mapping may not be available that early on some
* archs such as power and arm64 . As per - cpu pools created
* previously could be missing node hint and unbound pools NUMA
* affinity , fix them up .
2018-01-08 05:38:37 -08:00
*
* Also , while iterating workqueues , create rescuers if requested .
workqueue: move wq_numa_init() to workqueue_init()
While splitting up workqueue initialization into two parts,
ac8f73400782 ("workqueue: make workqueue available early during boot")
put wq_numa_init() into workqueue_init_early(). Unfortunately, on
some archs including power and arm64, cpu to node mapping isn't yet
established by the time the early init is called leading to incorrect
NUMA initialization and subsequently the following oops due to zero
cpumask on node-specific unbound pools.
Unable to handle kernel paging request for data at address 0x00000038
Faulting instruction address: 0xc0000000000fc0cc
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
task: c0000007f5400000 task.stack: c000001ffc084000
NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0
REGS: c000001ffc087780 TRAP: 0300 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005)
MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 48000424 XER: 00000000
CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0
GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600
GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f
GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000
GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001
GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e
GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001
GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000
NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0
LR [c0000000000ed928] activate_task+0x78/0xe0
Call Trace:
[c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable)
[c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0
[c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0
[c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0
[c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250
[c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150
[c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360
[c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
[c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
Instruction dump:
62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000
60420000 72490021 ebfe0150 2f890001 <ebbf0038> 419e0de0 7fbee840 419e0e58
---[ end trace 0000000000000000 ]---
Fix it by moving wq_numa_init() to workqueue_init(). As this means
that the early initialization may not have full NUMA info for per-cpu
pools and ignores NUMA affinity for unbound pools, fix them up from
workqueue_init() after wq_numa_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Link: http://lkml.kernel.org/r/87twck5wqo.fsf@concordia.ellerman.id.au
Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-19 12:01:27 -04:00
*/
wq_numa_init ( ) ;
mutex_lock ( & wq_pool_mutex ) ;
for_each_possible_cpu ( cpu ) {
for_each_cpu_worker_pool ( pool , cpu ) {
pool - > node = cpu_to_node ( cpu ) ;
}
}
2018-01-08 05:38:37 -08:00
list_for_each_entry ( wq , & workqueues , list ) {
2016-10-19 12:01:27 -04:00
wq_update_unbound_numa ( wq , smp_processor_id ( ) , true ) ;
2018-01-08 05:38:37 -08:00
WARN ( init_rescuer ( wq ) ,
" workqueue: failed to create early rescuer for %s " ,
wq - > name ) ;
}
2016-10-19 12:01:27 -04:00
mutex_unlock ( & wq_pool_mutex ) ;
2016-09-16 15:49:32 -04:00
/* create the initial workers */
for_each_online_cpu ( cpu ) {
for_each_cpu_worker_pool ( pool , cpu ) {
pool - > flags & = ~ POOL_DISASSOCIATED ;
BUG_ON ( ! create_worker ( pool ) ) ;
}
}
hash_for_each ( unbound_pool_hash , bkt , pool , hash_node )
BUG_ON ( ! create_worker ( pool ) ) ;
wq_online = true ;
workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements a workqueue lockup
detector. It periodically monitors all worker_pools and, if any pool
has failed to make forward progress for longer than the threshold
duration, triggers a warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through the kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 11:28:04 -05:00
wq_watchdog_init ( ) ;
2005-04-16 15:20:36 -07:00
}
2022-06-01 16:32:47 +09:00
/*
* Despite the naming , this is a no - op function which is here only to avoid a
* link error . Since the compile - time warning may fail to catch every caller ,
* a run - time warning also has to be emitted from __flush_workqueue ( ) .
*/
void __warn_flushing_systemwide_wq ( void ) { }
EXPORT_SYMBOL ( __warn_flushing_systemwide_wq ) ;