IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Print the mask, max_cycles, and max_idle_ns values for
clocksources being registered.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-12-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A long running project has been to clean up remaining uses
of clocksource_register(), replacing it with the simpler
clocksource_register_khz/hz() functions.
However, there are a few cases where we need to self-define
our mult/shift values, so switch the function to a more
obviously internal __clocksource_register() name, and
consolidate much of the internal logic so we don't have
duplication.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-10-git-send-email-john.stultz@linaro.org
[ Minor cleanups. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The clocksource watchdog reporting has been less helpful
then desired, as it just printed the delta between
the two clocksources. This prevents any useful analysis
of why the skew occurred.
Thus this patch tries to improve the output when we
mark a clocksource as unstable, printing out the cycle
last and now values for both the current clocksource
and the watchdog clocksource. This will allow us to see
if the result was due to a false positive caused by
a problematic watchdog.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-9-git-send-email-john.stultz@linaro.org
[ Minor cleanups of kernel messages. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It was suggested that the underflow/overflow protection
should probably throw some sort of warning out, rather
than just silently fixing the issue.
So this patch adds some warnings here. The flag variables
used are not protected by locks, but since we can't print
from the reading functions, just being able to say we
saw an issue in the update interval is useful enough,
and can be slightly racy without real consequence.
The big complication is that we're only under a read
seqlock, so the data could shift under us during
our calculation to see if there was a problem. This
patch avoids this issue by nesting another seqlock
which allows us to snapshot the just required values
atomically. So we shouldn't see false positives.
I also added some basic rate-limiting here, since
on one build machine w/ skewed TSCs it was fairly
noisy at bootup.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-8-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the case where there is a broken clocksource
where there are multiple actual clocks that
aren't perfectly aligned, we may see small "negative"
deltas when we subtract 'now' from 'cycle_last'.
The values are actually negative with respect to the
clocksource mask value, not necessarily negative
if cast to a s64, but we can check by checking the
delta to see if it is a small (relative to the mask)
negative value (again negative relative to the mask).
If so, we assume we jumped backwards somehow and
instead use zero for our delta.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-7-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When calculating the current delta since the last tick, we
currently have no hard protections to prevent a multiplication
overflow from occuring.
This patch introduces infrastructure to allow a cap that
limits the clocksource read delta value to the 'max_cycles' value,
which is where an overflow would occur.
Since this is in the hotpath, it adds the extra checking under
CONFIG_DEBUG_TIMEKEEPING=y.
There was some concern that capping time like this could cause
problems as we may stop expiring timers, which could go circular
if the timer that triggers time accumulation were mis-scheduled
too far in the future, which would cause time to stop.
However, since the mult overflow would result in a smaller time
value, we would effectively have the same problem there.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-6-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recently there's been requests for better sanity
checking in the time code, so that it's more clear
when something is going wrong, since timekeeping issues
could manifest in a large number of strange ways in
various subsystems.
Thus, this patch adds some extra infrastructure to
add a check to update_wall_time() to print two new
warnings:
1) if we see the call delayed beyond the 'max_cycles'
overflow point,
2) or if we see the call delayed beyond the clocksource's
'max_idle_ns' value, which is currently 50% of the
overflow point.
This extra infrastructure is conditional on
a new CONFIG_DEBUG_TIMEKEEPING option, also
added in this patch - default off.
Tested this a bit by halting qemu for specified
lengths of time to trigger the warnings.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-5-git-send-email-john.stultz@linaro.org
[ Improved the changelog and the messages a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need to pass the total request length in the kiocb, as
we already get passed in through the iov_iter argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Current approach in handling shadow memory for modules is broken.
Shadow memory could be freed only after memory shadow corresponds it is no
longer used. vfree() called from interrupt context could use memory its
freeing to store 'struct llist_node' in it:
void vfree(const void *addr)
{
...
if (unlikely(in_interrupt())) {
struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
if (llist_add((struct llist_node *)addr, &p->list))
schedule_work(&p->wq);
Later this list node used in free_work() which actually frees memory.
Currently module_memfree() called in interrupt context will free shadow
before freeing module's memory which could provoke kernel crash.
So shadow memory should be freed after module's memory. However, such
deallocation order could race with kasan_module_alloc() in module_alloc().
Free shadow right before releasing vm area. At this point vfree()'d
memory is not used anymore and yet not available for other allocations.
New VM_KASAN flag used to indicate that vm area has dynamically allocated
shadow memory so kasan frees shadow only if it was previously allocated.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At grace-period initialization time, RCU checks that all quiescent
states were really reported for the previous grace period. Now that
grace-period cleanup has been split out of grace-period initialization,
this commit also performs those checks at grace-period cleanup time.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit informs RCU of an outgoing CPU just before that CPU invokes
arch_cpu_idle_dead() during its last pass through the idle loop (via a
new CPU_DYING_IDLE notifier value). This change means that RCU need not
deal with outgoing CPUs passing through the scheduler after informing
RCU that they are no longer online. Note that removing the CPU from
the rcu_node ->qsmaskinit bit masks is done at CPU_DYING_IDLE time,
and orphaning callbacks is still done at CPU_DEAD time, the reason being
that at CPU_DEAD time we have another CPU that can adopt them.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit uses a per-CPU variable to make the CPU-offline code path
through the idle loop more precise, so that the outgoing CPU is
guaranteed to make it into the idle loop before it is powered off.
This commit is in preparation for putting the RCU offline-handling
code on this code path, which will eliminate the magic one-jiffy
wait that RCU uses as the maximum time for an outgoing CPU to get
all the way through the scheduler.
The magic one-jiffy wait for incoming CPUs remains a separate issue.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because that RCU grace-period initialization need no longer exclude
CPU-hotplug operations, this commit eliminates the ->onoff_mutex and
its uses.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Races between CPU hotplug and grace periods can be difficult to resolve,
so the ->onoff_mutex is used to exclude the two events. Unfortunately,
this means that it is impossible for an outgoing CPU to perform the
last bits of its offlining from its last pass through the idle loop,
because sleeplocks cannot be acquired in that context.
This commit avoids these problems by buffering online and offline events
in a new ->qsmaskinitnext field in the leaf rcu_node structures. When a
grace period starts, the events accumulated in this mask are applied to
the ->qsmaskinit field, and, if needed, up the rcu_node tree. The special
case of all CPUs corresponding to a given leaf rcu_node structure being
offline while there are still elements in that structure's ->blkd_tasks
list is handled using a new ->wait_blkd_tasks field. In this case,
propagating the offline bits up the tree is deferred until the beginning
of the grace period after all of the tasks have exited their RCU read-side
critical sections and removed themselves from the list, at which point
the ->wait_blkd_tasks flag is cleared. If one of that leaf rcu_node
structure's CPUs comes back online before the list empties, then the
->wait_blkd_tasks flag is simply cleared.
This of course means that RCU's notion of which CPUs are offline can be
out of date. This is OK because RCU need only wait on CPUs that were
online at the time that the grace period started. In addition, RCU's
force-quiescent-state actions will handle the case where a CPU goes
offline after the grace period starts.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_report_unblock_qs_rnp() function is invoked when the
last task blocking the current grace period exits its outermost
RCU read-side critical section. Previously, this was called only
from rcu_read_unlock_special(), and was therefore defined only when
CONFIG_RCU_PREEMPT=y. However, this function will be invoked even when
CONFIG_RCU_PREEMPT=n once CPU-hotplug operations are processed only at
the beginnings of RCU grace periods. The reason for this change is that
the last task on a given leaf rcu_node structure's ->blkd_tasks list
might well exit its RCU read-side critical section between the time that
recent CPU-hotplug operations were applied and when the new grace period
was initialized. This situation could result in RCU waiting forever on
that leaf rcu_node structure, because if all that structure's CPUs were
already offline, there would be no quiescent-state events to drive that
structure's part of the grace period.
This commit therefore moves rcu_report_unblock_qs_rnp() to common code
that is built unconditionally so that the quiescent-state-forcing code
can clean up after this situation, avoiding the grace-period stall.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the rcu_node tree ->expmask bitmasks are initially set to
reflect the online CPUs. This is pointless, because only the CPUs
preempted within RCU read-side critical sections by the preceding
synchronize_sched_expedited() need to be tracked. This commit therefore
instead sets up these bitmasks based on the state of the ->blkd_tasks
lists.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
I noticed that a helper function with argument type ARG_ANYTHING does
not need to have an initialized value (register).
This can worst case lead to unintented stack memory leakage in future
helper functions if they are not carefully designed, or unintended
application behaviour in case the application developer was not careful
enough to match a correct helper function signature in the API.
The underlying issue is that ARG_ANYTHING should actually be split
into two different semantics:
1) ARG_DONTCARE for function arguments that the helper function
does not care about (in other words: the default for unused
function arguments), and
2) ARG_ANYTHING that is an argument actually being used by a
helper function and *guaranteed* to be an initialized register.
The current risk is low: ARG_ANYTHING is only used for the 'flags'
argument (r4) in bpf_map_update_elem() that internally does strict
checking.
Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to facilitate clocksource validation, add a
'max_cycles' field to the clocksource structure which
will hold the maximum cycle value that can safely be
multiplied without potentially causing an overflow.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The clocksource logic has a number of places where we try to
include a safety margin. Most of these are 12% safety margins,
but they are inconsistently applied and sometimes are applied
on top of each other.
Additionally, in the previous patch, we corrected an issue
where we unintentionally in effect created a 50% safety margin,
which these 12.5% margins where then added to.
So to simplify the logic here, this patch removes the various
12.5% margins, and consolidates adding the margin in one place:
clocks_calc_max_nsecs().
Additionally, Linus prefers a 50% safety margin, as it allows
bad clock values to be more easily caught. This should really
have no net effect, due to the corrected issue earlier which
caused greater then 50% margins to be used w/o issue.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Stephen Boyd <sboyd@codeaurora.org> (for the sched_clock.c bit)
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The previous clocks_calc_max_nsecs() code had some unecessarily
complex bit logic to find the max interval that could cause
multiplication overflows. Since this is not in the hot
path, just do the divide to make it easier to read.
The previous implementation also had a subtle issue
that it avoided overflows with signed 64-bit values, where
as the intervals are always unsigned. This resulted in
overly conservative intervals, which other safety margins
were then added to, reducing the intended interval length.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1426133800-29329-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Offline CPUs cannot safely invoke trace events, but such CPUs do execute
within rcu_cpu_notify(). Therefore, this commit removes the trace events
from rcu_cpu_notify(). These trace events are for utilization, against
which rcu_cpu_notify() execution time should be negligible.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Grace-period initialization normally proceeds quite quickly, so
that it is very difficult to reproduce races against grace-period
initialization. This commit therefore allows grace-period
initialization to be artificially slowed down, increasing
race-reproduction probability. A pair of new Kconfig parameters are
provided, CONFIG_RCU_TORTURE_TEST_SLOW_INIT to enable the slowdowns, and
CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY to specify the number of jiffies
of slowdown to apply. A boot-time parameter named rcutree.gp_init_delay
allows boot-time delay to be specified. By default, no delay will be
applied even if CONFIG_RCU_TORTURE_TEST_SLOW_INIT is set.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If all CPUs have passed through quiescent states, then stalls might be
due to starvation of the grace-period kthread or to failure to propagate
the quiescent states up the rcu_node combining tree. The current stall
warning messages do not differentiate, so this commit adds a printout
of the root rcu_node structure's ->qsmask field.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, both rcu_cleanup_dead_cpu() and rcu_send_cbs_to_orphanage()
initialize the outgoing CPU's callback list. However, only
rcu_cleanup_dead_cpu() invokes rcu_send_cbs_to_orphanage(), and
it does so unconditionally, which means that only one of these
initializations is required. This commit therefore consolidates the
callback-list initialization with the rest of the callback handling in
rcu_send_cbs_to_orphanage().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU ignores offlined CPUs, so they cannot safely run RCU read-side code.
(They -can- use SRCU, but not RCU.) This means that any use of RCU
during or after the call to arch_cpu_idle_dead(). Unfortunately,
commit 2ed53c0d6cc99 added a complete() call, which will contain RCU
read-side critical sections if there is a task waiting to be awakened.
Which, as it turns out, there almost never is. In my qemu/KVM testing,
the to-be-awakened task is not yet asleep more than 99.5% of the time.
In current mainline, failure is even harder to reproduce, requiring a
virtualized environment that delays the outgoing CPU by at least three
jiffies between the time it exits its stop_machine() task at CPU_DYING
time and the time it calls arch_cpu_idle_dead() from the idle loop.
However, this problem really can occur, especially in virtualized
environments, and therefore really does need to be fixed
This suggests moving back to the polling loop, but using a much shorter
wait, with gentle exponential backoff instead of the old 100-millisecond
wait. Most of the time, the loop will exit without waiting at all,
and almost all of the remaining uses will wait only five microseconds.
If the outgoing CPU is preempted, a loop will wait one jiffy, then
increase the wait by a factor of 11/10ths, rounding up. As before, there
is a five-second timeout.
This commit therefore provides common-code infrastructure to do the
dying-to-surviving CPU handoff in a safe manner. This code also
provides an indication at CPU-online of whether the CPU to be onlined
previously timed out on offline. The new cpu_check_up_prepare() function
returns -EBUSY if this CPU previously took more than five seconds to
go offline, or -EAGAIN if it has not yet managed to go offline. The
rationale for -EAGAIN is that it might still be preempted, so an additional
wait might well find it correctly offlined. Architecture-specific code
can decide how to handle these conditions. Systems in which CPUs take
themselves completely offline might respond to an -EBUSY return as if
it was a zero (success) return. Systems in which the surviving CPU must
take some action might take it at this time, or might simply mark the
other CPU as unusable.
Note that architectures that take the easy way out and simply pass the
-EBUSY and -EAGAIN upwards will change the sysfs API.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <linux-api@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
[ paulmck: Fixed state machine for architectures that don't check earlier
CPU-hotplug results as suggested by James Hogan. ]
This patch adds rq->clock update skip for SCHED_DEADLINE task yield,
to tell update_rq_clock() that we've just updated the clock, so that
we don't do a microscopic update in schedule() and double the
fastpath cost.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1425961200-3809-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
drivers/net/ethernet/cadence/macb.c
Overlapping changes in macb driver, mostly fixes and cleanups
in 'net' overlapping with the integration of at91_ether into
macb in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
contains fixes to ftrace when /proc/sys/kernel/ftrace_enabled and
function tracing are started. Doing the following causes some issues:
# echo 0 > /proc/sys/kernel/ftrace_enabled
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /proc/sys/kernel/ftrace_enabled
# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
As well as with function tracing too. Pratyush Anand first reported
this issue to me and supplied a patch. When I tested this on my x86
test box, it caused thousands of backtraces and warnings to appear in
dmesg, which also caused a denial of service (a warning for every
function that was listed). I applied Pratyush's patch but it did not
fix the issue for me. I looked into it and found a slight problem
with trampoline accounting. I fixed it and sent Pratyush a patch, but
he said that it did not fix the issue for him.
I later learned tha Pratyush was using an ARM64 server, and when I tested
on my ARM board, I was able to reproduce the same issue as Pratyush.
After applying his patch, it fixed the problem. The above test uncovered
two different bugs, one in x86 and one in ARM and ARM64. As this looked
like it would affect PowerPC, I tested it on my PPC64 box. It too broke,
but neither the patch that fixed ARM or x86 fixed this box (the changes
were all in generic code!). The above test, uncovered two more bugs that
affected PowerPC. Again, the changes were only done to generic code.
It's the way the arch code expected things to be done that was different
between the archs. Some where more sensitive than others.
The rest of this series fixes the PPC bugs as well.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU/cQSAAoJEEjnJuOKh9lde9sH/1MAPq+6jr7YaEFru0GKajE9
rVHjw8rde/I4tN2UxIVk+Qm6pXRZYpv3OKxHT48EHzkvgm++voioykpJP4IEVrP5
mEDuIcYe28csE2nV5u5Q9kwnZoC86TQW5nVV6zB1Gx/3IEzA8Z046jAov40Jya0y
zqHc/U43JeeVIDIOkwjzbH6OaFEDP13FkF3TO502WJhJLqMo+kPOalIgv0eauKzy
lVCQBSC4WS3rVsgW4W3dSrEBaUxbJxgunjxOuV2DwHj5eghHq0M2MKeIUxBz0PuN
wnhTrpf5cAfshTvYHxKlE0uItdyYfVb7UChAD5zTbBL4kMUFhpb183zVKH8K8kU=
=8R8y
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v4.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull seq-buf/ftrace fixes from Steven Rostedt:
"This includes fixes for seq_buf_bprintf() truncation issue. It also
contains fixes to ftrace when /proc/sys/kernel/ftrace_enabled and
function tracing are started. Doing the following causes some issues:
# echo 0 > /proc/sys/kernel/ftrace_enabled
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /proc/sys/kernel/ftrace_enabled
# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
As well as with function tracing too. Pratyush Anand first reported
this issue to me and supplied a patch. When I tested this on my x86
test box, it caused thousands of backtraces and warnings to appear in
dmesg, which also caused a denial of service (a warning for every
function that was listed). I applied Pratyush's patch but it did not
fix the issue for me. I looked into it and found a slight problem
with trampoline accounting. I fixed it and sent Pratyush a patch, but
he said that it did not fix the issue for him.
I later learned tha Pratyush was using an ARM64 server, and when I
tested on my ARM board, I was able to reproduce the same issue as
Pratyush. After applying his patch, it fixed the problem. The above
test uncovered two different bugs, one in x86 and one in ARM and
ARM64. As this looked like it would affect PowerPC, I tested it on my
PPC64 box. It too broke, but neither the patch that fixed ARM or x86
fixed this box (the changes were all in generic code!). The above
test, uncovered two more bugs that affected PowerPC. Again, the
changes were only done to generic code. It's the way the arch code
expected things to be done that was different between the archs. Some
where more sensitive than others.
The rest of this series fixes the PPC bugs as well"
* tag 'trace-fixes-v4.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix ftrace enable ordering of sysctl ftrace_enabled
ftrace: Fix en(dis)able graph caller when en(dis)abling record via sysctl
ftrace: Clear REGS_EN and TRAMP_EN flags on disabling record via sysctl
seq_buf: Fix seq_buf_bprintf() truncation
seq_buf: Fix seq_buf_vprintf() truncation
Pull cgroup fixes from Tejun Heo:
"The cgroup iteration update two years ago and the recent cpuset
restructuring introduced regressions in subset of cpuset
configurations. Three patches to fix them.
All are marked for -stable"
* 'for-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: Fix cpuset sched_relax_domain_level
cpuset: fix a warning when clearing configured masks in old hierarchy
cpuset: initialize effective masks when clone_children is enabled
Pull workqueue fix from Tejun Heo:
"One fix patch for a subtle livelock condition which can happen on
PREEMPT_NONE kernels involving two racing cancel_work calls. Whoever
comes in the second has to wait for the previous one to finish. This
was implemented by making the later one block for the same condition
that the former would be (work item completion) and then loop and
retest; unfortunately, depending on the wake up order, the later one
could lock out the former one to finish by busy looping on the cpu.
This is fixed by implementing explicit wait mechanism. Work item
might not belong anywhere at this point and there's remote possibility
of thundering herd problem. I originally tried to use bit_waitqueue
but it didn't work for static work items on modules. It's currently
using single wait queue with filtering wake up function and exclusive
wakeup. If this ever becomes a problem, which is not very likely, we
can try to figure out a way to piggy back on bit_waitqueue"
* 'for-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE
Some archs (specifically PowerPC), are sensitive with the ordering of
the enabling of the calls to function tracing and setting of the
function to use to be traced.
That is, update_ftrace_function() sets what function the ftrace_caller
trampoline should call. Some archs require this to be set before
calling ftrace_run_update_code().
Another bug was discovered, that ftrace_startup_sysctl() called
ftrace_run_update_code() directly. If the function the ftrace_caller
trampoline changes, then it will not be updated. Instead a call
to ftrace_startup_enable() should be called because it tests to see
if the callback changed since the code was disabled, and will
tell the arch to update appropriately. Most archs do not need this
notification, but PowerPC does.
The problem could be seen by the following commands:
# echo 0 > /proc/sys/kernel/ftrace_enabled
# echo function > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /proc/sys/kernel/ftrace_enabled
# cat /sys/kernel/debug/tracing/trace
The trace will show that function tracing was not active.
Cc: stable@vger.kernel.org # 2.6.27+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When ftrace is enabled globally through the proc interface, we must check if
ftrace_graph_active is set. If it is set, then we should also pass the
FTRACE_START_FUNC_RET command to ftrace_run_update_code(). Similarly, when
ftrace is disabled globally through the proc interface, we must check if
ftrace_graph_active is set. If it is set, then we should also pass the
FTRACE_STOP_FUNC_RET command to ftrace_run_update_code().
Consider the following situation.
# echo 0 > /proc/sys/kernel/ftrace_enabled
After this ftrace_enabled = 0.
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
Since ftrace_enabled = 0, ftrace_enable_ftrace_graph_caller() is never
called.
# echo 1 > /proc/sys/kernel/ftrace_enabled
Now ftrace_enabled will be set to true, but still
ftrace_enable_ftrace_graph_caller() will not be called, which is not
desired.
Further if we execute the following after this:
# echo nop > /sys/kernel/debug/tracing/current_tracer
Now since ftrace_enabled is set it will call
ftrace_disable_ftrace_graph_caller(), which causes a kernel warning on
the ARM platform.
On the ARM platform, when ftrace_enable_ftrace_graph_caller() is called,
it checks whether the old instruction is a nop or not. If it's not a nop,
then it returns an error. If it is a nop then it replaces instruction at
that address with a branch to ftrace_graph_caller.
ftrace_disable_ftrace_graph_caller() behaves just the opposite. Therefore,
if generic ftrace code ever calls either ftrace_enable_ftrace_graph_caller()
or ftrace_disable_ftrace_graph_caller() consecutively two times in a row,
then it will return an error, which will cause the generic ftrace code to
raise a warning.
Note, x86 does not have an issue with this because the architecture
specific code for ftrace_enable_ftrace_graph_caller() and
ftrace_disable_ftrace_graph_caller() does not check the previous state,
and calling either of these functions twice in a row has no ill effect.
Link: http://lkml.kernel.org/r/e4fbe64cdac0dd0e86a3bf914b0f83c0b419f146.1425666454.git.panand@redhat.com
Cc: stable@vger.kernel.org # 2.6.31+
Signed-off-by: Pratyush Anand <panand@redhat.com>
[
removed extra if (ftrace_start_up) and defined ftrace_graph_active as 0
if CONFIG_FUNCTION_GRAPH_TRACER is not set.
]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When /proc/sys/kernel/ftrace_enabled is set to zero, all function
tracing is disabled. But the records that represent the functions
still hold information about the ftrace_ops that are hooked to them.
ftrace_ops may request "REGS" (have a full set of pt_regs passed to
the callback), or "TRAMP" (the ops has its own trampoline to use).
When the record is updated to represent the state of the ops hooked
to it, it sets "REGS_EN" and/or "TRAMP_EN" to state that the callback
points to the correct trampoline (REGS has its own trampoline).
When ftrace_enabled is set to zero, all ftrace locations are a nop,
so they do not point to any trampoline. But the _EN flags are still
set. This can cause the accounting to go wrong when ftrace_enabled
is cleared and an ops that has a trampoline is registered or unregistered.
For example, the following will cause ftrace to crash:
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 0 > /proc/sys/kernel/ftrace_enabled
# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /proc/sys/kernel/ftrace_enabled
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
As function_graph uses a trampoline, when ftrace_enabled is set to zero
the updates to the record are not done. When enabling function_graph
again, the record will still have the TRAMP_EN flag set, and it will
look for an op that has a trampoline other than the function_graph
ops, and fail to find one.
Cc: stable@vger.kernel.org # 3.17+
Reported-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Export context_tracking_user_enter/exit so it can be used by KVM.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will deacon <will.deacon@arm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Only run vtime_user_enter, vtime_user_exit, and the user enter & exit
trace points when we are entering or exiting user state, respectively.
The KVM code in guest_enter and guest_exit already take care of calling
vtime_guest_enter and vtime_guest_exit, respectively.
The RCU code only distinguishes between "idle" and "not idle or kernel".
There should be no need to add an additional (unused) state there.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will deacon <will.deacon@arm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Generalize the context tracking APIs to support various nature of
contexts. This is performed by splitting out the mechanism from
context_tracking_user_enter and context_tracking_user_exit into
context_tracking_enter and context_tracking_exit.
The nature of the context we track is now detailed in a ctx_state
parameter pushed to these APIs, allowing the same functions to not just
track kernel <> user space switching, but also kernel <> guest transitions.
But leave the old functions in order to avoid breaking ARM, which calls
these functions from assembler code, and cannot easily use C enum
parameters.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will deacon <will.deacon@arm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Current context tracking symbols are designed to express living state.
As such they are prefixed with "IN_": IN_USER, IN_KERNEL.
Now we are going to use these symbols to also express state transitions
such as context_tracking_enter(IN_USER) or context_tracking_exit(IN_USER).
But while the "IN_" prefix works well to express entering a context, it's
confusing to depict a context exit: context_tracking_exit(IN_USER)
could mean two things:
1) We are exiting the current context to enter user context.
2) We are exiting the user context
We want 2) but the reviewer may be confused and understand 1)
So lets disambiguate these symbols and rename them to CONTEXT_USER and
CONTEXT_KERNEL.
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will deacon <will.deacon@arm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Workqueues are used extensively throughout the kernel but sometimes
it's difficult to debug stalls involving work items because visibility
into its inner workings is fairly limited. Although sysrq-t task dump
annotates each active worker task with the information on the work
item being executed, it is challenging to find out which work items
are pending or delayed on which queues and how pools are being
managed.
This patch implements show_workqueue_state() which dumps all busy
workqueues and pools and is called from the sysrq-t handler. At the
end of sysrq-t dump, something like the following is printed.
Showing busy workqueues and worker pools:
...
workqueue filler_wq: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
in-flight: 491:filler_workfn, 507:filler_workfn
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
in-flight: 501:filler_workfn
pending: filler_workfn
...
workqueue test_wq: flags=0x8
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
in-flight: 510(RESCUER):test_workfn BAR(69) BAR(500)
delayed: test_workfn1 BAR(492), test_workfn2
...
pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=2 manager: 137
pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 manager: 469
pool 3: cpus=1 node=0 flags=0x0 nice=-20 workers=2 idle: 16
pool 8: cpus=0-3 flags=0x4 nice=0 workers=2 manager: 62
The above shows that test_wq is executing test_workfn() on pid 510
which is the rescuer and also that there are two tasks 69 and 500
waiting for the work item to finish in flush_work(). As test_wq has
max_active of 1, there are two work items for test_workfn1() and
test_workfn2() which are delayed till the current work item is
finished. In addition, pid 492 is flushing test_workfn1().
The work item for test_workfn() is being executed on pwq of pool 2
which is the normal priority per-cpu pool for CPU 1. The pool has
three workers, two of which are executing filler_workfn() for
filler_wq and the last one is assuming the manager role trying to
create more workers.
This extra workqueue state dump will hopefully help chasing down hangs
involving workqueues.
v3: cpulist_pr_cont() replaced with "%*pbl" printf formatting.
v2: As suggested by Andrew, minor formatting change in pr_cont_work(),
printk()'s replaced with pr_info()'s, and cpumask printing now
uses cpulist_pr_cont().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@redhat.com>
Add wq_barrier->task and worker_pool->manager to keep track of the
flushing task and pool manager respectively. These are purely
informational and will be used to implement sysrq dump of workqueues.
Signed-off-by: Tejun Heo <tj@kernel.org>
The workqueues list is protected by wq_pool_mutex and a workqueue and
its subordinate data structures are freed directly on destruction. We
want to add the ability dump workqueues from a sysrq callback which
requires walking all workqueues without grabbing wq_pool_mutex. This
patch makes freeing of workqueues RCU protected and makes the
workqueues list walkable while holding RCU read lock.
Note that pool_workqueues and pools are already sched-RCU protected.
For consistency, workqueues are also protected with sched-RCU.
While at it, reverse the workqueues list so that a workqueue which is
created earlier comes before. The order of the list isn't significant
functionally but this makes the planned sysrq dump list system
workqueues first.
Signed-off-by: Tejun Heo <tj@kernel.org>
Here are some tty and serial driver fixes for 4.0-rc3.
Along with the atime fix that you know about, here are some other serial
driver bugfixes as well. Most notable is a wait_until_sent bugfix that
was traced back to being around since before 2.6.12 that Johan has fixed
up.
All have been in linux-next successfully.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlT8RCYACgkQMUfUDdst+yk62QCgycxS4giC2hyRver3dyvaNR6g
zYYAn2w0uRndW+AqP4Tls54isRz6owpF
=gA2k
-----END PGP SIGNATURE-----
Merge tag 'tty-4.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are some tty and serial driver fixes for 4.0-rc3.
Along with the atime fix that you know about, here are some other
serial driver bugfixes as well. Most notable is a wait_until_sent
bugfix that was traced back to being around since before 2.6.12 that
Johan has fixed up.
All have been in linux-next successfully"
* tag 'tty-4.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
TTY: fix tty_wait_until_sent maximum timeout
TTY: fix tty_wait_until_sent on 64-bit machines
USB: serial: fix infinite wait_until_sent timeout
TTY: bfin_jtag_comm: remove incorrect wait_until_sent operation
net: irda: fix wait_until_sent poll timeout
serial: uapi: Declare all userspace-visible io types
serial: core: Fix iotype userspace breakage
serial: sprd: Fix missing spin_unlock in sprd_handle_irq()
console: Fix console name size mismatch
tty: fix up atime/mtime mess, take four
serial: 8250_dw: Fix get_mctrl behaviour
serial:8250:8250_pci: delete unneeded quirk entries
serial:8250:8250_pci: fix redundant entry report for WCH_CH352_2S
Change email address for 8250_pci
serial: 8250: Revert "tty: serial: 8250_core: read only RX if there is something in the FIFO"
Revert "tty/serial: of_serial: add DT alias ID handling"
Ming reported soft lockups occurring when running xfstest due to
the following tip:locking/core commit:
b3fd4f03ca0b ("locking/rwsem: Avoid deceiving lock spinners")
When doing optimistic spinning in rwsem, threads should stop
spinning when the lock owner is not running. While a thread is
spinning on owner, if the owner reschedules, owner->on_cpu
returns false and we stop spinning.
However, this commit essentially caused the check to get
ignored because when we break out of the spin loop due to
!on_cpu, we continue spinning if sem->owner != NULL.
This patch fixes this by making sure we stop spinning if the
owner is not running. Furthermore, just like with mutexes,
refactor the code such that we don't have separate checks for
owner_running(). This makes it more straightforward in terms of
why we exit the spin on owner loop and we would also avoid
needing to "guess" why we broke out of the loop to make this
more readable.
Reported-and-tested-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1425714331.2475.388.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before register_console() calls the setup() method of the matched
console, the registering console index is already equal to the index
from the console command line; ie. newcon->index == c->index.
This change is also required to support extensible console matching;
(the command line index may have no relation to the console index
assigned by the console-defined match() function).
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fengguang reported, that on openrisc and avr32 architectures, we
get the following linker errors on *_defconfig builds that have
no bpf syscall support:
net/built-in.o:(.rodata+0x1cd0): undefined reference to `bpf_map_lookup_elem_proto'
net/built-in.o:(.rodata+0x1cd4): undefined reference to `bpf_map_update_elem_proto'
net/built-in.o:(.rodata+0x1cd8): undefined reference to `bpf_map_delete_elem_proto'
Fix it up by providing built-in weak definitions of the symbols,
so they can be overridden when the syscall is enabled. I think
the issue might be that gcc is not able to optimize all that away.
This patch fixes the linker errors for me, tested with Fengguang's
make.cross [1] script.
[1] https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: d4052c4aea0c ("ebpf: remove CONFIG_BPF_SYSCALL ifdefs in socket filter code")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>