IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
In the common case, there will be only one expedited grace period in
the system at a given time, in which case it is not helpful to use
funnel locking. This commit therefore adds a fastpath that bypasses
funnel locking when the root ->exp_funnel_mutex is not held.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The grace-period kthread sleeps waiting to do a force-quiescent-state
scan, and when awakened sets rsp->gp_state to RCU_GP_DONE_FQS.
However, this is confusing because the kthread has not done the
force-quiescent-state, but is instead just starting to do it. This commit
therefore renames RCU_GP_DONE_FQS to RCU_GP_DOING_FQS in order to make
things a bit easier on reviewers.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The condition for the wait_event_interruptible_timeout() that waits
to do the next force-quiescent-state scan is a bit ornate:
((gf = READ_ONCE(rsp->gp_flags)) &
RCU_GP_FLAG_FQS) ||
(!READ_ONCE(rnp->qsmask) &&
!rcu_preempt_blocked_readers_cgp(rnp))
This commit therefore pulls this condition out into a helper function
and comments its component conditions.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although synchronize_sched_expedited() historically has no RCU CPU stall
warnings, the availability of the rcupdate.rcu_expedited boot parameter
invalidates the old assumption that synchronize_sched()'s stall warnings
would suffice. This commit therefore adds RCU CPU stall warnings to
synchronize_sched_expedited().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The strictly rcu_node based funnel-locking scheme works well in many
cases, but systems with CONFIG_RCU_FANOUT_LEAF=64 won't necessarily get
all that much concurrency. This commit therefore extends the funnel
locking into the per-CPU rcu_data structure, providing concurrency equal
to the number of CPUs.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
One of the requirements on RCU grace periods is that if there is a
causal chain of operations that starts after one grace period and
ends before another grace period, then the two grace periods must
be serialized. There has been (and might still be) code that relies
on this, for example, certain types of reference-counting code that
does a call_rcu() within an RCU callback function.
This requirement is why there is an smp_mb() at the end of both
synchronize_sched_expedited() and synchronize_rcu_expedited().
However, this is the only smp_mb() in these functions, so it would
be nicer to consolidate it into rcu_exp_gp_seq_end(). This commit
does just that.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_seq operations were open-coded in _rcu_barrier(), so this commit
replaces the open-coding with the shiny new rcu_seq operations.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit gets rid of synchronize_rcu_expedited()'s mutex_trylock()
polling loop in favor of the funnel-locking scheme that was abstracted
from synchronize_sched_expedited().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The type of "s" has been "long" rather than the correct "unsigned long"
for quite some time. This commit fixes this type error.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit abstracts funnel locking from synchronize_sched_expedited()
so that it may be used by synchronize_rcu_expedited().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although synchronize_rcu_expedited() uses a sequence-counter scheme, it
is based on a single increment per grace period, which means that tasks
piggybacking off of concurrent grace periods may be forced to wait longer
than necessary. This commit therefore applies the new sequence-count
functions developed for synchronize_sched_expedited() to speed things
up a bit and to consolidate the sequence-counter implementation.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit creates rcu_exp_gp_seq_start() and rcu_exp_gp_seq_end() to
bracket an expedited grace period, rcu_exp_gp_seq_snap() to snapshot the
sequence counter, and rcu_exp_gp_seq_done() to check to see if a full
expedited grace period has elapsed since the snapshot. These will be
applied to synchronize_rcu_expedited(). These are defined in terms of
underlying rcu_seq_start(), rcu_seq_end(), rcu_seq_snap(), rcu_seq_done(),
which will be applied to _rcu_barrier().
One reason that this commit doesn't use the seqcount primitives themselves
is that the smp_wmb() in those primitive is insufficient due to the fact
that expedited grace periods do reads as well as writes. In addition,
the read-side seqcount primitives detect a potentially partial change,
where the expedited primitives instead need a guaranteed full change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Sequentially stopping the CPUs slows down expedited grace periods by
at least a factor of two, based on rcutorture's grace-period-per-second
rate. This is a conservative measure because rcutorture uses unusually
long RCU read-side critical sections and because rcutorture periodically
quiesces the system in order to test RCU's ability to ramp down to and
up from the idle state. This commit therefore replaces the stop_one_cpu()
with stop_one_cpu_nowait(), using an atomic-counter scheme to determine
when all CPUs have passed through the stopped state.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit gets rid of synchronize_sched_expedited()'s mutex_trylock()
polling loop in favor of a funnel-locking scheme based on the rcu_node
tree. The work-done check is done at each level of the tree, allowing
high-contention situations to be resolved quickly with reasonable levels
of mutex contention.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that synchronize_sched_expedited() have a mutex, it can use simpler
work-already-done detection scheme. This commit simplifies this scheme
by using something similar to the sequence-locking counter scheme.
A counter is incremented before and after each grace period, so that
the counter is odd in the midst of the grace period and even otherwise.
So if the counter has advanced to the second even number that is
greater than or equal to the snapshot, the required grace period has
already happened.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The synchronize_sched_expedited() currently invokes try_stop_cpus(),
which schedules the stopper kthreads on each online non-idle CPU,
and waits until all those kthreads are running before letting any
of them stop. This is disastrous for real-time workloads, which
get hit with a preemption that is as long as the longest scheduling
latency on any CPU, including any non-realtime housekeeping CPUs.
This commit therefore switches to using stop_one_cpu() on each CPU
in turn. This avoids inflicting the worst-case scheduling latency
on the worst-case CPU onto all other CPUs, and also simplifies the
code a little bit.
Follow-up commits will simplify the counter-snapshotting algorithm
and convert a number of the counters that are now protected by the
new ->expedited_mutex to non-atomic.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[ paulmck: Kept stop_one_cpu(), dropped disabling of "guardrails". ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The CONFIG_RCU_CPU_STALL_INFO has been default-y for a couple of
releases with no complaints, so it is time to eliminate this Kconfig
option entirely, so that the long-form RCU CPU stall warnings cannot
be disabled. This commit does just that.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The fact that tasks could be migrated from leaf to root rcu_node
structures meant that synchronize_rcu_expedited() had to disable
CPU hotplug. However, tasks now stay put, so this commit removes the
CPU-hotplug disabling from synchronize_rcu_expedited().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently if the rcu_fanout_leaf boot parameter is out of bounds (that
is, less than RCU_FANOUT_LEAF or greater than the number of bits in an
unsigned long), a warning is issued and execution continues with the
out-of-bounds value. This can result in all manner of failures, so this
patch resets rcu_fanout_leaf to RCU_FANOUT_LEAF when an out-of-bounds
condition is detected.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because gcc does not realize a loop would not be entered ever
(i.e. in case of rcu_num_lvls == 1):
for (i = 1; i < rcu_num_lvls; i++)
rsp->level[i] = rsp->level[i - 1] + levelcnt[i - 1];
some compiler (pre- 5.x?) versions give a bogus warning:
kernel/rcu/tree.c: In function ‘rcu_init_one.isra.55’:
kernel/rcu/tree.c:4108:13: warning: array subscript is above array bounds [-Warray-bounds]
rsp->level[i] = rsp->level[i - 1] + rsp->levelcnt[i - 1];
^
Fix that warning by adding an extra item to rcu_state::level[]
array. Once the bogus warning is fixed in gcc and kernel drops
support of older versions, the dummy item may be removed from
the array.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The resend mechanism happily calls the interrupt handler of interrupts
which are marked IRQ_NESTED_THREAD from softirq context. This can
result in crashes because the interrupt handler is not the proper way
to invoke the device handlers. They must be invoked via
handle_nested_irq.
Prevent the resend even if the interrupt has no valid parent irq
set. Its better to have a lost interrupt than a crashing machine.
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Although RCU-tasks isn't really designed to support rcu_dereference()
and list manipulation, that is how rcutorture tests it. Which means
that lockdep-RCU complains about the rcu_dereference_check() invocations
because RCU-tasks doesn't have read-side markers. This commit therefore
creates a torturing_tasks() to silence the lockdep-RCU complaints from
rcu_dereference_check() when RCU-tasks is being tortured.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_torture_cbflood() function correctly checks for flavors of
RCU that lack analogs to call_rcu() and rcu_barrier(), but in that
case it fails to terminate correctly. In fact, it terminates so
incorrectly that segfaults can result. This commit therefore causes
rcu_torture_cbflood() to do the proper wait-for-stop procedure.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Specifying a negative rcutorture.shuffle_interval value will cause a
negative value to be used as a sleep time. This commit therefore
refuses to start shuffling unless the rcutorture.shuffle_interval
value is greater than zero.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, a negative value for rcutorture.nfakewriters= can cause
rcutorture to pass a negative size to the memory allocator, which
is not really a particularly good thing to do. This commit therefore
adds bounds checking to this parameter, so that values that are less
than or equal to zero disable fake writing.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
A negative value for rcutorture.n_barrier_cbs can pass a negative value
to the memory allocator, so this commit instead causes rcu_barrier()
testing to be disabled in this case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This update makes arithmetic to calculate number of RCU nodes
more straight and easy to read.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although a number of RCU levels may be less than the current
maximum of four, some static data associated with each level
are allocated for all four levels. As result, the extra data
never get accessed and just wast memory. This update limits
count of allocated items to the number of used RCU levels.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Members rcu_state::levelcnt[] and rcu_state::levelspread[]
are only used at init. There is no reason to keep them
afterwards.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Number of items in rcu_capacity[] array is defined by macro
MAX_RCU_LVLS. However, that array is never accessed beyond
RCU_NUM_LVLS index. Therefore, we can limit the array to
RCU_NUM_LVLS items and eliminate MAX_RCU_LVLS. As result,
in most cases the memory is conserved.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Variable rcu_num_lvls is limited by RCU_NUM_LVLS macro.
In turn, rcu_state::levelcnt[] array is never accessed
beyond rcu_num_lvls. Thus, rcu_state::levelcnt[] is safe
to limit to RCU_NUM_LVLS items.
Since rcu_num_lvls could be changed during boot (as result
of rcutree.rcu_fanout_leaf kernel parameter update) one might
assume a new value could overflow the value of RCU_NUM_LVLS.
However, that is not the case, since leaf-level fanout is only
permitted to increase, resulting in rcu_num_lvls possibly to
decrease.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Current code suggests that introducing the extra level to
rcu_capacity[] array makes some of the arithmetic easier.
Well, in fact it appears rather confusing and unnecessary.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This update simplifies rcu_init_geometry() code flow
and makes calculation of the total number of rcu_node
structures more easy to read.
The update relies on the fact num_rcu_lvl[] is never
accessed beyond rcu_num_lvls index by the rest of the
code. Therefore, there is no need initialize the whole
num_rcu_lvl[].
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Local variable 'n' mimics 'nr_cpu_ids' while the both are
used within one function. There is no reason for 'n' to
exist whatsoever.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently a condition when RCU tree is unable to accommodate
the configured number of CPUs is not permitted and causes
a fall back to compile-time values. However, the code has no
means to exceed the RCU tree capacity neither at compile-time
nor in run-time. Therefore, if the condition is met in run-
time then it indicates a serios problem elsewhere and should
be handled with a panic.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Type-checking coccinelle spatches are being used to locate type mismatches
between function signatures and return values in this case this produced:
./kernel/rcu/srcu.c:271 WARNING: return of wrong type
int != unsigned long,
srcu_readers_active() returns an int that is the sum of per_cpu unsigned
long but the only user is cleanup_srcu_struct() which is using it as a
boolean (condition) to see if there is any readers rather than actually
using the approximate number of readers. The theoretically possible
unsigned long overflow case does not need to be handled explicitly - if
we had 4G++ readers then something else went wrong a long time ago.
proposal: change the return type to boolean. The function name is left
unchanged as it fits the naming expectation for a boolean.
patch was compile tested for x86_64_defconfig (implies CONFIG_SRCU=y)
patch is against 4.1-rc5 (localversion-next is -next-20150525)
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
DEBUG_LOCK_ALLOC=y is not a production setting, but it is
not very unusual either. Many developers routinely
use kernels built with it enabled.
Apart from being selected by hand, it is also auto-selected by
PROVE_LOCKING "Lock debugging: prove locking correctness" and
LOCK_STAT "Lock usage statistics" config options.
LOCK STAT is necessary for "perf lock" to work.
I wouldn't spend too much time optimizing it, but this particular
function has a very large cost in code size: when it is deinlined,
code size decreases by 830,000 bytes:
text data bss dec hex filename
85674192 22294776 20627456 128596424 7aa39c8 vmlinux.before
84837612 22294424 20627456 127759492 79d7484 vmlinux
(with this config: http://busybox.net/~vda/kernel_config)
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Tejun Heo <tj@kernel.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: linux-kernel@vger.kernel.org
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
For clarity, if CONFIG_SECCOMP isn't defined, seccomp_mode() is returning
"disabled". This makes that more clear, along with another 0-use, and
results in no operational change.
Signed-off-by: Kees Cook <keescook@chromium.org>
This patch is the first step in enabling checkpoint/restore of processes
with seccomp enabled.
One of the things CRIU does while dumping tasks is inject code into them
via ptrace to collect information that is only available to the process
itself. However, if we are in a seccomp mode where these processes are
prohibited from making these syscalls, then what CRIU does kills the task.
This patch adds a new ptrace option, PTRACE_O_SUSPEND_SECCOMP, that enables
a task from the init user namespace which has CAP_SYS_ADMIN and no seccomp
filters to disable (and re-enable) seccomp filters for another task so that
they can be successfully dumped (and restored). We restrict the set of
processes that can disable seccomp through ptrace because although today
ptrace can be used to bypass seccomp, there is some discussion of closing
this loophole in the future and we would like this patch to not depend on
that behavior and be future proofed for when it is removed.
Note that seccomp can be suspended before any filters are actually
installed; this behavior is useful on criu restore, so that we can suspend
seccomp, restore the filters, unmap our restore code from the restored
process' address space, and then resume the task by detaching and have the
filters resumed as well.
v2 changes:
* require that the tracer have no seccomp filters installed
* drop TIF_NOTSC manipulation from the patch
* change from ptrace command to a ptrace option and use this ptrace option
as the flag to check. This means that as soon as the tracer
detaches/dies, seccomp is re-enabled and as a corrollary that one can not
disable seccomp across PTRACE_ATTACHs.
v3 changes:
* get rid of various #ifdefs everywhere
* report more sensible errors when PTRACE_O_SUSPEND_SECCOMP is incorrectly
used
v4 changes:
* get rid of may_suspend_seccomp() in favor of a capable() check in ptrace
directly
v5 changes:
* check that seccomp is not enabled (or suspended) on the tracer
Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
CC: Will Drewry <wad@chromium.org>
CC: Roland McGrath <roland@hack.frob.com>
CC: Pavel Emelyanov <xemul@parallels.com>
CC: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
[kees: access seccomp.mode through seccomp_mode() instead]
Signed-off-by: Kees Cook <keescook@chromium.org>
Recently lockless_dereference() was added which can be used in place of
hard-coding smp_read_barrier_depends(). The following PATCH makes the change.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
tracer (traces unlikely and likely branches) when enabled with certain
debug options.
What happened was that various debug options like lockdep and DEBUG_PREEMPT
can cause parts of the branch tracer to recurse outside its recursion
protection. In fact, part of its recursion protection used these features
that caused the lockup. This cleans up the code a little and makes the
recursion protection a bit more robust.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVovjwAAoJEEjnJuOKh9ldYacH/3VVIcBhRbuUmANZJSYLXSUh
nEx6FOAfmla/7hH2dVSm8JmnbUzWyunttjbZ6/O8PFE1gumuvVrfWrIV8iwUe4J6
6Z2KdbKd+3FpaSKEnX61UQ1cfR+1eFLOLH9CH5O4twVPyLzvI+NpaJQJaNoX0ywq
vqsUMx63gKdGwhC6BhLi0t/xsTuuIgGjDAjuaF2yNZCuBw9UtziedxK5pveH2OFX
G8dAVfP18aqsXaRMj1LUrm6wRUP0BD7B2v99jdUu2UHYTtmBzy8Vm6RfhJ4Gk8d7
WnIkaBBU5iu75E9ec35MtA52zQ+8b8O/fpIQZgJgFRr7uaf+hvbXjF0UScWE6vU=
=ujzT
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.2-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fengguang Wu discovered a crash that happened to be because of the
branch tracer (traces unlikely and likely branches) when enabled with
certain debug options.
What happened was that various debug options like lockdep and
DEBUG_PREEMPT can cause parts of the branch tracer to recurse outside
its recursion protection. In fact, part of its recursion protection
used these features that caused the lockup. This cleans up the code a
little and makes the recursion protection a bit more robust"
* tag 'trace-v4.2-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have branch tracer use recursive field of task struct
Boris reported that the sparse_irq protection around __cpu_up() in the
generic code causes a regression on Xen. Xen allocates interrupts and
some more in the xen_cpu_up() function, so it deadlocks on the
sparse_irq_lock.
There is no simple fix for this and we really should have the
protection for all architectures, but for now the only solution is to
move it to x86 where actual wreckage due to the lack of protection has
been observed.
Reported-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Fixes: a89941816726 'hotplug: Prevent alloc/free of irq descriptors during cpu up/down'
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: xiao jin <jin.xiao@intel.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>
Adds a new single-purpose PIDs subsystem to limit the number of
tasks that can be forked inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that applies to a cgroup rather than a
process tree.
However, it should be noted that organisational operations (adding and
removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
the number of tasks in the hierarchy cannot exceed the limit through
forking. This is due to the fact that, in the unified hierarchy, attach
cannot fail (and it is not possible for a task to overcome its PIDs
cgroup policy limit by attaching to a child cgroup -- even if migrating
mid-fork it must be able to fork in the parent first).
PIDs are fundamentally a global resource, and it is possible to reach
PID exhaustion inside a cgroup without hitting any reasonable kmemcg
policy. Once you've hit PID exhaustion, you're only in a marginally
better state than OOM. This subsystem allows PID exhaustion inside a
cgroup to be prevented.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a new cgroup subsystem callback can_fork that conditionally
states whether or not the fork is accepted or rejected by a cgroup
policy. In addition, add a cancel_fork callback so that if an error
occurs later in the forking process, any state modified by can_fork can
be reverted.
Allow for a private opaque pointer to be passed from cgroup_can_fork to
cgroup_post_fork, allowing for the fork state to be stored by each
subsystem separately.
Also add a tagging system for cgroup_subsys.h to allow for CGROUP_<TAG>
enumerations to be be defined and used. In addition, explicitly add a
CGROUP_CANFORK_COUNT macro to make arrays easier to define.
This is in preparation for implementing the pids cgroup subsystem.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In case of func->state or func->old_addr not having expected values,
we'd rather bail out immediately from klp_disable_func().
This can't really happen with the current codebase, but fix this
anyway in the sake of robustness.
[jkosina@suse.com: reworded the changelog a bit]
Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
The synchronous synchronize_rcu() in wakeup_source_remove() makes
user process which writes to /sys/kernel/wake_unlock blocked sometimes.
For example, when android eventhub tries to release a wakelock, this
blocking process can occur, and eventhub can't get input events
for a while.
Using a work item instead of direct function call at pm_wake_unlock()
can prevent this unnecessary delay from happening.
Signed-off-by: SungEun Kim <cleaneye.kim@lge.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
tick_broadcast_oneshot_control got moved from tick-broadcast to
tick-common, but the export stayed in the old place. Fix it up.
Fixes: f32dd1170511 'tick/broadcast: Make idle check independent from mode and config'
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Conflicts:
net/bridge/br_mdb.c
Minor conflict in br_mdb.c, in 'net' we added a memset of the
on-stack 'ip' variable whereas in 'net-next' we assign a new
member 'vid'.
Signed-off-by: David S. Miller <davem@davemloft.net>
ARG1 = BPF_R1 as it stands, evaluates to regs[BPF_REG_1] = regs[BPF_REG_1]
and thus has no effect. Add a comment instead, explaining what happens and
why it's okay to just remove it. Since from user space side, a tail call is
invoked as a pseudo helper function via bpf_tail_call_proto, the verifier
checks the arguments just like with any other helper function and makes
sure that the first argument (regs[BPF_REG_1])'s type is ARG_PTR_TO_CTX.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>