IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream.
Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
the vunmap() code-path. While this change was necessary to maintain
correctness on x86-32-pae kernels, it also adds additional cycles for
architectures that don't need it.
Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.
To avoid the unnecessary work on x86-64 and to gain the performance
back, split up vmalloc_sync_all() into two functions:
* vmalloc_sync_mappings(), and
* vmalloc_sync_unmappings()
Most call-sites to vmalloc_sync_all() only care about new mappings being
synchronized. The only exception is the new call-site added in the
above mentioned commit.
Shile Zhang directed us to a report of an 80% regression in reaim
throughput.
Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Borislav Petkov <bp@suse.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fda31c50292a5062332fa0343c084bd9f46604d9 ]
When queueing a signal, we increment both the users count of pending
signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount
of the user struct itself (because we keep a reference to the user in
the signal structure in order to correctly account for it when freeing).
That turns out to be fairly expensive, because both of them are atomic
updates, and particularly under extreme signal handling pressure on big
machines, you can get a lot of cache contention on the user struct.
That can then cause horrid cacheline ping-pong when you do these
multiple accesses.
So change the reference counting to only pin the user for the _first_
pending signal, and to unpin it when the last pending signal is
dequeued. That means that when a user sees a lot of concurrent signal
queuing - which is the only situation when this matters - the only
atomic access needed is generally the 'sigpending' count update.
This was noticed because of a particularly odd timing artifact on a
dual-socket 96C/192T Cascade Lake platform: when you get into bad
contention, on that machine for some reason seems to be much worse when
the contention happens in the upper 32-byte half of the cacheline.
As a result, the kernel test robot will-it-scale 'signal1' benchmark had
an odd performance regression simply due to random alignment of the
'struct user_struct' (and pointed to a completely unrelated and
apparently nonsensical commit for the regression).
Avoiding the double increments (and decrements on the dequeueing side,
of course) makes for much less contention and hugely improved
performance on that will-it-scale microbenchmark.
Quoting Feng Tang:
"It makes a big difference, that the performance score is tripled! bump
from original 17000 to 54000. Also the gap between 5.0-rc6 and
5.0-rc6+Jiri's patch is reduced to around 2%"
[ The "2% gap" is the odd cacheline placement difference on that
platform: under the extreme contention case, the effect of which half
of the cacheline was hot was 5%, so with the reduced contention the
odd timing artifact is reduced too ]
It does help in the non-contended case too, but is not nearly as
noticeable.
Reported-and-tested-by: Feng Tang <feng.tang@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Philip Li <philip.li@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit aa202f1f56960c60e7befaa0f49c72b8fa11b0a8 upstream.
wq_select_unbound_cpu() is designed for unbound workqueues only, but
it's wrongly called when using a bound workqueue too.
Fixing this ensures work queued to a bound workqueue with
cpu=WORK_CPU_UNBOUND always runs on the local CPU.
Before, that would happen only if wq_unbound_cpumask happened to include
it (likely almost always the case), or was empty, or we got lucky with
forced round-robin placement. So restricting
/sys/devices/virtual/workqueue/cpumask to a small subset of a machine's
CPUs would cause some bound work items to run unexpectedly there.
Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Hillf Danton <hdanton@sina.com>
[dj: massage changelog]
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9c974c77246460fa6a92c18554c3311c8c83c160 upstream.
PF_EXITING is set earlier than actual removal from css_set when a task
is exitting. This can confuse cgroup.procs readers who see no PF_EXITING
tasks, however, rmdir is checking against css_set membership so it can
transitionally fail with EBUSY.
Fix this by listing tasks that weren't unlinked from css_set active
lists.
It may happen that other users of the task iterator (without
CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This
is equal to the state before commit c03cd7738a83 ("cgroup: Include dying
leaders with live threads in PROCS iterations") but it may be reviewed
later.
Reported-by: Suren Baghdasaryan <surenb@google.com>
Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2d4ecb030dcc90fb725ecbfc82ce5d6c37906e0e upstream.
If seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output:
1) dd bs=1 skip output of each 2nd elements
$ dd if=/sys/fs/cgroup/cgroup.procs bs=8 count=1
2
3
4
5
1+0 records in
1+0 records out
8 bytes copied, 0,000267297 s, 29,9 kB/s
[test@localhost ~]$ dd if=/sys/fs/cgroup/cgroup.procs bs=1 count=8
2
4 <<< NB! 3 was skipped
6 <<< ... and 5 too
8 <<< ... and 7
8+0 records in
8+0 records out
8 bytes copied, 5,2123e-05 s, 153 kB/s
This happen because __cgroup_procs_start() makes an extra
extra cgroup_procs_next() call
2) read after lseek beyond end of file generates whole last line.
3) read after lseek into middle of last line generates
expected rest of last line and unexpected whole line once again.
Additionally patch removes an extra position index changes in
__cgroup_procs_start()
Cc: stable@vger.kernel.orghttps://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e876ecc67db80dfdb8e237f71e5b43bb88ae549c ]
We are testing network memory accounting in our setup and noticed
inconsistent network memory usage and often unrelated cgroups network
usage correlates with testing workload. On further inspection, it
seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
irq context specially for cgroup v1.
mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
and kind of assumes that this can only happen from sk_clone_lock()
and the source sock object has already associated cgroup. However in
cgroup v1, where network memory accounting is opt-in, the source sock
can be unassociated with any cgroup and the new cloned sock can get
associated with unrelated interrupted cgroup.
Cgroup v2 can also suffer if the source sock object was created by
process in the root cgroup or if sk_alloc() is called in irq context.
The fix is to just do nothing in interrupt.
WARNING: Please note that about half of the TCP sockets are allocated
from the IRQ context, so, memory used by such sockets will not be
accouted by the memcg.
The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1
Hardware name: ...
Call Trace:
<IRQ>
dump_stack+0x57/0x75
mem_cgroup_sk_alloc+0xe9/0xf0
sk_clone_lock+0x2a7/0x420
inet_csk_clone_lock+0x1b/0x110
tcp_create_openreq_child+0x23/0x3b0
tcp_v6_syn_recv_sock+0x88/0x730
tcp_check_req+0x429/0x560
tcp_v6_rcv+0x72d/0xa40
ip6_protocol_deliver_rcu+0xc9/0x400
ip6_input+0x44/0xd0
? ip6_protocol_deliver_rcu+0x400/0x400
ip6_rcv_finish+0x71/0x80
ipv6_rcv+0x5b/0xe0
? ip6_sublist_rcv+0x2e0/0x2e0
process_backlog+0x108/0x1e0
net_rx_action+0x26b/0x460
__do_softirq+0x104/0x2a6
do_softirq_own_stack+0x2a/0x40
</IRQ>
do_softirq.part.19+0x40/0x50
__local_bh_enable_ip+0x51/0x60
ip6_finish_output2+0x23d/0x520
? ip6table_mangle_hook+0x55/0x160
__ip6_finish_output+0xa1/0x100
ip6_finish_output+0x30/0xd0
ip6_output+0x73/0x120
? __ip6_finish_output+0x100/0x100
ip6_xmit+0x2e3/0x600
? ipv6_anycast_cleanup+0x50/0x50
? inet6_csk_route_socket+0x136/0x1e0
? skb_free_head+0x1e/0x30
inet6_csk_xmit+0x95/0xf0
__tcp_transmit_skb+0x5b4/0xb20
__tcp_send_ack.part.60+0xa3/0x110
tcp_send_ack+0x1d/0x20
tcp_rcv_state_process+0xe64/0xe80
? tcp_v6_connect+0x5d1/0x5f0
tcp_v6_do_rcv+0x1b1/0x3f0
? tcp_v6_do_rcv+0x1b1/0x3f0
__release_sock+0x7f/0xd0
release_sock+0x30/0xa0
__inet_stream_connect+0x1c3/0x3b0
? prepare_to_wait+0xb0/0xb0
inet_stream_connect+0x3b/0x60
__sys_connect+0x101/0x120
? __sys_getsockopt+0x11b/0x140
__x64_sys_connect+0x1a/0x20
do_syscall_64+0x51/0x200
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking")
Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e4add247789e4ba5e08ad8256183ce2e211877d4 ]
optimize_kprobe() and unoptimize_kprobe() cancels if a given kprobe
is on the optimizing_list or unoptimizing_list already. However, since
the following commit:
f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code")
modified the update timing of the KPROBE_FLAG_OPTIMIZED, it doesn't
work as expected anymore.
The optimized_kprobe could be in the following states:
- [optimizing]: Before inserting jump instruction
op.kp->flags has KPROBE_FLAG_OPTIMIZED and
op->list is not empty.
- [optimized]: jump inserted
op.kp->flags has KPROBE_FLAG_OPTIMIZED and
op->list is empty.
- [unoptimizing]: Before removing jump instruction (including unused
optprobe)
op.kp->flags has KPROBE_FLAG_OPTIMIZED and
op->list is not empty.
- [unoptimized]: jump removed
op.kp->flags doesn't have KPROBE_FLAG_OPTIMIZED and
op->list is empty.
Current code mis-expects [unoptimizing] state doesn't have
KPROBE_FLAG_OPTIMIZED, and that can cause incorrect results.
To fix this, introduce optprobe_queued_unopt() to distinguish [optimizing]
and [unoptimizing] states and fixes the logic in optimize_kprobe() and
unoptimize_kprobe().
[ mingo: Cleaned up the changelog and the code a bit. ]
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Fixes: f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code")
Link: https://lkml.kernel.org/r/157840814418.7181.13478003006386303481.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit f66c0447cca1281116224d474cdb37d6a18e4b5b upstream.
Set the unoptimized flag after confirming the code is completely
unoptimized. Without this fix, when a kprobe hits the intermediate
modified instruction (the first byte is replaced by an INT3, but
later bytes can still be a jump address operand) while unoptimizing,
it can return to the middle byte of the modified code, which causes
an invalid instruction exception in the kernel.
Usually, this is a rare case, but if we put a probe on the function
call while text patching, it always causes a kernel panic as below:
# echo p text_poke+5 > kprobe_events
# echo 1 > events/kprobes/enable
# echo 0 > events/kprobes/enable
invalid opcode: 0000 [#1] PREEMPT SMP PTI
RIP: 0010:text_poke+0x9/0x50
Call Trace:
arch_unoptimize_kprobe+0x22/0x28
arch_unoptimize_kprobes+0x39/0x87
kprobe_optimizer+0x6e/0x290
process_one_work+0x2a0/0x610
worker_thread+0x28/0x3d0
? process_one_work+0x610/0x610
kthread+0x10d/0x130
? kthread_park+0x80/0x80
ret_from_fork+0x3a/0x50
text_poke() is used for patching the code in optprobes.
This can happen even if we blacklist text_poke() and other functions,
because there is a small time window during which we show the intermediate
code to other CPUs.
[ mingo: Edited the changelog. ]
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
Link: https://lkml.kernel.org/r/157483422375.25881.13508326028469515760.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617 upstream.
This re-applies the commit reverted here:
commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")
I.e. now that cfs_rq can be safely removed/added in the list, we can re-apply:
commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sargun@sargun.me
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Link: https://lkml.kernel.org/r/1549469662-13614-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Vishnu Rangayyan <vishnu.rangayyan@apple.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c upstream.
Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child
ordering of the list when it will be added back. In order to remove an
empty and fully decayed cfs_rq, we must remove its children too, so they
will be added back in the right order next time.
With a normal decay of PELT, a parent will be empty and fully decayed
if all children are empty and fully decayed too. In such a case, we just
have to ensure that the whole branch will be added when a new task is
enqueued. This is default behavior since :
commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list")
In case of throttling, the PELT of throttled cfs_rq will not be updated
whereas the parent will. This breaks the assumption made above unless we
remove the children of a cfs_rq that is throttled. Then, they will be
added back when unthrottled and a sched_entity will be enqueued.
As throttled cfs_rq are now removed from the list, we can remove the
associated test in update_blocked_averages().
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: sargun@sargun.me
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Link: https://lkml.kernel.org/r/1549469662-13614-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Vishnu Rangayyan <vishnu.rangayyan@apple.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 78041c0c9e935d9ce4086feeff6c569ed88ddfd4 upstream.
The tracing seftests checks various aspects of the tracing infrastructure,
and one is filtering. If trace_printk() is active during a self test, it can
cause the filtering to fail, which will disable that part of the trace.
To keep the selftests from failing because of trace_printk() calls,
trace_printk() checks the variable tracing_selftest_running, and if set, it
does not write to the tracing buffer.
As some tracers were registered earlier in boot, the selftest they triggered
would fail because not all the infrastructure was set up for the full
selftest. Thus, some of the tests were post poned to when their
infrastructure was ready (namely file system code). The postpone code did
not set the tracing_seftest_running variable, and could fail if a
trace_printk() was added and executed during their run.
Cc: stable@vger.kernel.org
Fixes: 9afecfbb95198 ("tracing: Postpone tracer start-up tests till the system is more robust")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2ad3e17ebf94b7b7f3f64c050ff168f9915345eb upstream.
Commit 219ca39427bf ("audit: use union for audit_field values since
they are mutually exclusive") combined a number of separate fields in
the audit_field struct into a single union. Generally this worked
just fine because they are generally mutually exclusive.
Unfortunately in audit_data_to_entry() the overlap can be a problem
when a specific error case is triggered that causes the error path
code to attempt to cleanup an audit_field struct and the cleanup
involves attempting to free a stored LSM string (the lsm_str field).
Currently the code always has a non-NULL value in the
audit_field.lsm_str field as the top of the for-loop transfers a
value into audit_field.val (both .lsm_str and .val are part of the
same union); if audit_data_to_entry() fails and the audit_field
struct is specified to contain a LSM string, but the
audit_field.lsm_str has not yet been properly set, the error handling
code will attempt to free the bogus audit_field.lsm_str value that
was set with audit_field.val at the top of the for-loop.
This patch corrects this by ensuring that the audit_field.val is only
set when needed (it is cleared when the audit_field struct is
allocated with kcalloc()). It also corrects a few other issues to
ensure that in case of error the proper error code is returned.
Cc: stable@vger.kernel.org
Fixes: 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive")
Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e20d3a055a457a10a4c748ce5b7c2ed3173a1324 upstream.
This if guards whether user-space wants a copy of the offload-jited
bytecode and whether this bytecode exists. By erroneously doing a bitwise
AND instead of a logical AND on user- and kernel-space buffer-size can lead
to no data being copied to user-space especially when user-space size is a
power of two and bigger then the kernel-space buffer.
Fixes: fcfb126defda ("bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info")
Signed-off-by: Johannes Krude <johannes@krude.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20200212193227.GA3769@phlox.h.transitiv.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cba6437a1854fde5934098ec3bd0ee83af3129f5 upstream.
Qian Cai reported that the WARN_ON() in the x86/msi affinity setting code,
which catches cases where the affinity setting is not done on the CPU which
is the current target of the interrupt, triggers during CPU hotplug stress
testing.
It turns out that the warning which was added with the commit addressing
the MSI affinity race unearthed yet another long standing bug.
If user space writes a bogus affinity mask, i.e. it contains no online CPUs,
then it calls irq_select_affinity_usr(). This was introduced for ALPHA in
eee45269b0f5 ("[PATCH] Alpha: convert to generic irq framework (generic part)")
and subsequently made available for all architectures in
18404756765c ("genirq: Expose default irq affinity mask (take 3)")
which introduced the circumvention of the affinity setting restrictions for
interrupt which cannot be moved in process context.
The whole exercise is bogus in various aspects:
1) If the interrupt is already started up then there is absolutely
no point to honour a bogus interrupt affinity setting from user
space. The interrupt is already assigned to an online CPU and it
does not make any sense to reassign it to some other randomly
chosen online CPU.
2) If the interupt is not yet started up then there is no point
either. A subsequent startup of the interrupt will invoke
irq_setup_affinity() anyway which will chose a valid target CPU.
So the only correct solution is to just return -EINVAL in case user space
wrote an affinity mask which does not contain any online CPUs, except for
ALPHA which has it's own magic sauce for this.
Fixes: 18404756765c ("genirq: Expose default irq affinity mask (take 3)")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qian Cai <cai@lca.pw>
Link: https://lkml.kernel.org/r/878sl8xdbm.fsf@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6722b23e7a2ace078344064a9735fb73e554e9ef ]
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Without patch:
# dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
# Available triggers:
# traceon traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
6+1 records in
6+1 records out
206 bytes copied, 0.00027916 s, 738 kB/s
Notice the printing of "# Available triggers:..." after the line.
With the patch:
# dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
2+1 records in
2+1 records out
88 bytes copied, 0.000526867 s, 167 kB/s
It only prints the end of the file, and does not restart.
Link: http://lkml.kernel.org/r/3c35ee24-dd3a-8119-9c19-552ed253388a@virtuozzo.comhttps://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e4075e8bdffd93a9b6d6e1d52fabedceeca5a91b ]
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Without patch:
# dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
id
no pid
2+1 records in
2+1 records out
10 bytes copied, 0.000213285 s, 46.9 kB/s
Notice the "id" followed by "no pid".
With the patch:
# dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
id
0+1 records in
0+1 records out
3 bytes copied, 0.000202112 s, 14.8 kB/s
Notice that it only prints "id" and not the "no pid" afterward.
Link: http://lkml.kernel.org/r/4f87c6ad-f114-30bb-8506-c32274ce2992@virtuozzo.comhttps://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 90435a7891a2259b0f74c5a1bc5600d0d64cba8f ]
If seq_file .next fuction does not change position index,
read after some lseek can generate an unexpected output.
See also: https://bugzilla.kernel.org/show_bug.cgi?id=206283
v1 -> v2: removed missed increment in end of function
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/eca84fdd-c374-a154-d874-6c7b55fc3bc4@virtuozzo.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 708e0ada1916be765b7faa58854062f2bc620bbf ]
In setup_load_info(), info->name (which contains the name of the module,
mostly used for early logging purposes before the module gets set up)
gets unconditionally assigned if .modinfo is missing despite the fact
that there is an if (!info->name) check near the end of the function.
Avoid assigning a placeholder string to info->name if .modinfo doesn't
exist, so that we can fall back to info->mod->name later on.
Fixes: 5fdc7db6448a ("module: setup load info before module_sig_check()")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 11e31f608b499f044f24b20be73f1dcab3e43f8a ]
Robert reported that during boot the watchdog timestamp is set to 0 for one
second which is the indicator for a watchdog reset.
The reason for this is that the timestamp is in seconds and the time is
taken from sched clock and divided by ~1e9. sched clock starts at 0 which
means that for the first second during boot the watchdog timestamp is 0,
i.e. reset.
Use ULONG_MAX as the reset indicator value so the watchdog works correctly
right from the start. ULONG_MAX would only conflict with a real timestamp
if the system reaches an uptime of 136 years on 32bit and almost eternity
on 64bit.
Reported-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87o8v3uuzl.fsf@nanos.tec.linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dfb6cd1e654315168e36d947471bd2a0ccd834ae ]
Looking through old emails in my INBOX, I came across a patch from Luis
Henriques that attempted to fix a race of two stat tracers registering the
same stat trace (extremely unlikely, as this is done in the kernel, and
probably doesn't even exist). The submitted patch wasn't quite right as it
needed to deal with clean up a bit better (if two stat tracers were the
same, it would have the same files).
But to make the code cleaner, all we needed to do is to keep the
all_stat_sessions_mutex held for most of the registering function.
Link: http://lkml.kernel.org/r/1410299375-20068-1-git-send-email-luis.henriques@canonical.com
Fixes: 002bb86d8d42f ("tracing/ftrace: separate events tracing and stats tracing engine")
Reported-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit afccc00f75bbbee4e4ae833a96c2d29a7259c693 ]
tracing_stat_init() was always returning '0', even on the error paths. It
now returns -ENODEV if tracing_init_dentry() fails or -ENOMEM if it fails
to created the 'trace_stat' debugfs directory.
Link: http://lkml.kernel.org/r/1410299381-20108-1-git-send-email-luis.henriques@canonical.com
Fixes: ed6f1c996bfe4 ("tracing: Check return value of tracing_init_dentry()")
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 45178ac0cea853fe0e405bf11e101bdebea57b15 ]
Paul reported a very sporadic, rcutorture induced, workqueue failure.
When the planets align, the workqueue rescuer's self-migrate fails and
then triggers a WARN for running a work on the wrong CPU.
Tejun then figured that set_cpus_allowed_ptr()'s stop_one_cpu() call
could be ignored! When stopper->enabled is false, stop_machine will
insta complete the work, without actually doing the work. Worse, it
will not WARN about this (we really should fix this).
It turns out there is a small window where a freshly online'ed CPU is
marked 'online' but doesn't yet have the stopper task running:
BP AP
bringup_cpu()
__cpu_up(cpu, idle) --> start_secondary()
...
cpu_startup_entry()
bringup_wait_for_ap()
wait_for_ap_thread() <-- cpuhp_online_idle()
while (1)
do_idle()
... available to run kthreads ...
stop_machine_unpark()
stopper->enable = true;
Close this by moving the stop_machine_unpark() into
cpuhp_online_idle(), such that the stopper thread is ready before we
start the idle loop and schedule.
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Debugged-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The 4.19 backport dc34710a7aba ("padata: Remove broken queue flushing")
removed padata_alloc_pd()'s assignment to pd->pinst, resulting in:
Unable to handle kernel NULL pointer dereference ...
...
pc : padata_reorder+0x144/0x2e0
...
Call trace:
padata_reorder+0x144/0x2e0
padata_do_serial+0xc8/0x128
pcrypt_aead_enc+0x60/0x70 [pcrypt]
padata_parallel_worker+0xd8/0x138
process_one_work+0x1bc/0x4b8
worker_thread+0x164/0x580
kthread+0x134/0x138
ret_from_fork+0x10/0x18
This happened because the backport was based on an enhancement that
moved this assignment but isn't in 4.19:
bfde23ce200e ("padata: unbind parallel jobs from specific CPUs")
Simply restore the assignment to fix the crash.
Fixes: dc34710a7aba ("padata: Remove broken queue flushing")
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 003461559ef7a9bd0239bae35a22ad8924d6e9ad upstream.
Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
a perf ring buffer may lead to an integer underflow in locked memory
accounting. This may lead to the undesired behaviors, such as failures in
BPF map creation.
Address this by adjusting the accounting logic to take into account the
possibility that the amount of already locked memory may exceed the
current limit.
Fixes: c4b75479741c ("perf/core: Make the mlock accounting simple again")
Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit febac332a819f0e764aa4da62757ba21d18c182b upstream.
Kernel crashes inside QEMU/KVM are observed:
kernel BUG at kernel/time/timer.c:1154!
BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().
At the same time another cpu got:
general protection fault: 0000 [#1] SMP PTI of poinson pointer 0xdead000000000200 in:
__hlist_del at include/linux/list.h:681
(inlined by) detach_timer at kernel/time/timer.c:818
(inlined by) expire_timers at kernel/time/timer.c:1355
(inlined by) __run_timers at kernel/time/timer.c:1686
(inlined by) run_timer_softirq at kernel/time/timer.c:1699
Unfortunately kernel logs are badly scrambled, stacktraces are lost.
Printing the timer->function before the BUG_ON() pointed to
clocksource_watchdog().
The execution of clocksource_watchdog() can race with a sequence of
clocksource_stop_watchdog() .. clocksource_start_watchdog():
expire_timers()
detach_timer(timer, true);
timer->entry.pprev = NULL;
raw_spin_unlock_irq(&base->lock);
call_timer_fn
clocksource_watchdog()
clocksource_watchdog_kthread() or
clocksource_unbind()
spin_lock_irqsave(&watchdog_lock, flags);
clocksource_stop_watchdog();
del_timer(&watchdog_timer);
watchdog_running = 0;
spin_unlock_irqrestore(&watchdog_lock, flags);
spin_lock_irqsave(&watchdog_lock, flags);
clocksource_start_watchdog();
add_timer_on(&watchdog_timer, ...);
watchdog_running = 1;
spin_unlock_irqrestore(&watchdog_lock, flags);
spin_lock(&watchdog_lock);
add_timer_on(&watchdog_timer, ...);
BUG_ON(timer_pending(timer) || !timer->function);
timer_pending() -> true
BUG()
I.e. inside clocksource_watchdog() watchdog_timer could be already armed.
Check timer_pending() before calling add_timer_on(). This is sufficient as
all operations are synchronized by watchdog_lock.
Fixes: 75c5158f70c0 ("timekeeping: Update clocksource with stop_machine")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6f1a4891a5928a5969c87fa5a584844c983ec823 upstream.
Evan tracked down a subtle race between the update of the MSI message and
the device raising an interrupt internally on PCI devices which do not
support MSI masking. The update of the MSI message is non-atomic and
consists of either 2 or 3 sequential 32bit wide writes to the PCI config
space.
- Write address low 32bits
- Write address high 32bits (If supported by device)
- Write data
When an interrupt is migrated then both address and data might change, so
the kernel attempts to mask the MSI interrupt first. But for MSI masking is
optional, so there exist devices which do not provide it. That means that
if the device raises an interrupt internally between the writes then a MSI
message is sent built from half updated state.
On x86 this can lead to spurious interrupts on the wrong interrupt
vector when the affinity setting changes both address and data. As a
consequence the device interrupt can be lost causing the device to
become stuck or malfunctioning.
Evan tried to handle that by disabling MSI accross an MSI message
update. That's not feasible because disabling MSI has issues on its own:
If MSI is disabled the PCI device is routing an interrupt to the legacy
INTx mechanism. The INTx delivery can be disabled, but the disablement is
not working on all devices.
Some devices lose interrupts when both MSI and INTx delivery are disabled.
Another way to solve this would be to enforce the allocation of the same
vector on all CPUs in the system for this kind of screwed devices. That
could be done, but it would bring back the vector space exhaustion problems
which got solved a few years ago.
Fortunately the high address (if supported by the device) is only relevant
when X2APIC is enabled which implies interrupt remapping. In the interrupt
remapping case the affinity setting is happening at the interrupt remapping
unit and the PCI MSI message is programmed only once when the PCI device is
initialized.
That makes it possible to solve it with a two step update:
1) Target the MSI msg to the new vector on the current target CPU
2) Target the MSI msg to the new vector on the new target CPU
In both cases writing the MSI message is only changing a single 32bit word
which prevents the issue of inconsistency.
After writing the final destination it is necessary to check whether the
device issued an interrupt while the intermediate state #1 (new vector,
current CPU) was in effect.
This is possible because the affinity change is always happening on the
current target CPU. The code runs with interrupts disabled, so the
interrupt can be detected by checking the IRR of the local APIC. If the
vector is pending in the IRR then the interrupt is retriggered on the new
target CPU by sending an IPI for the associated vector on the target CPU.
This can cause spurious interrupts on both the local and the new target
CPU.
1) If the new vector is not in use on the local CPU and the device
affected by the affinity change raised an interrupt during the
transitional state (step #1 above) then interrupt entry code will
ignore that spurious interrupt. The vector is marked so that the
'No irq handler for vector' warning is supressed once.
2) If the new vector is in use already on the local CPU then the IRR check
might see an pending interrupt from the device which is using this
vector. The IPI to the new target CPU will then invoke the handler of
the device, which got the affinity change, even if that device did not
issue an interrupt
3) If the new vector is in use already on the local CPU and the device
affected by the affinity change raised an interrupt during the
transitional state (step #1 above) then the handler of the device which
uses that vector on the local CPU will be invoked.
expose issues in device driver interrupt handlers which are not prepared to
handle a spurious interrupt correctly. This not a regression, it's just
exposing something which was already broken as spurious interrupts can
happen for a lot of reasons and all driver handlers need to be able to deal
with them.
Reported-by: Evan Green <evgreen@chromium.org>
Debugged-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 54a16ff6f2e50775145b210bcd94d62c3c2af117 ]
As function_graph tracer can run when RCU is not "watching", it can not be
protected by synchronize_rcu() it requires running a task on each CPU before
it can be freed. Calling schedule_on_each_cpu(ftrace_sync) needs to be used.
Link: https://lore.kernel.org/r/20200205131110.GT2935@paulmck-ThinkPad-P72
Cc: stable@vger.kernel.org
Fixes: b9b0c831bed26 ("ftrace: Convert graph filter to use hash tables")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 16052dd5bdfa16dbe18d8c1d4cde2ddab9d23177 ]
Because the function graph tracer can execute in sections where RCU is not
"watching", the rcu_dereference_sched() for the has needs to be open coded.
This is fine because the RCU "flavor" of the ftrace hash is protected by
its own RCU handling (it does its own little synchronization on every CPU
and does not rely on RCU sched).
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 07928d9bfc81640bab36f5190e8725894d93b659 ]
The function padata_flush_queues is fundamentally broken because
it cannot force padata users to complete the request that is
underway. IOW padata has to passively wait for the completion
of any outstanding work.
As it stands flushing is used in two places. Its use in padata_stop
is simply unnecessary because nothing depends on the queues to
be flushed afterwards.
The other use in padata_replace is more substantial as we depend
on it to free the old pd structure. This patch instead uses the
pd->refcnt to dynamically free the pd structure once all requests
are complete.
Fixes: 2b73b07ab8a4 ("padata: Flush the padata queues actively")
Cc: <stable@vger.kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 6b6d188aae79a630957aefd88ff5c42af6553ee3 upstream.
The alarmtimer_rtc_add_device() function creates a wakeup source and then
tries to grab a module reference. If that fails the function returns early
with an error code, but fails to remove the wakeup source.
Cleanup this exit path so there is no dangling wakeup source, which is
named 'alarmtime' left allocated which will conflict with another RTC
device that may be registered later.
Fixes: 51218298a25e ("alarmtimer: Ensure RTC module is not unloaded")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200109155910.907-2-swboyd@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6935c3983b246d5fbfebd3b891c825e65c118f2d upstream.
The rcu_gp_fqs_check_wake() function uses rcu_preempt_blocked_readers_cgp()
to read ->gp_tasks while other cpus might overwrite this field.
We need READ_ONCE()/WRITE_ONCE() pairs to avoid compiler
tricks and KCSAN splats like the following :
BUG: KCSAN: data-race in rcu_gp_fqs_check_wake / rcu_preempt_deferred_qs_irqrestore
write to 0xffffffff85a7f190 of 8 bytes by task 7317 on cpu 0:
rcu_preempt_deferred_qs_irqrestore+0x43d/0x580 kernel/rcu/tree_plugin.h:507
rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
__rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
rcu_read_unlock include/linux/rcupdate.h:645 [inline]
__ip_queue_xmit+0x3b0/0xa40 net/ipv4/ip_output.c:533
ip_queue_xmit+0x45/0x60 include/net/ip.h:236
__tcp_transmit_skb+0xdeb/0x1cd0 net/ipv4/tcp_output.c:1158
__tcp_send_ack+0x246/0x300 net/ipv4/tcp_output.c:3685
tcp_send_ack+0x34/0x40 net/ipv4/tcp_output.c:3691
tcp_cleanup_rbuf+0x130/0x360 net/ipv4/tcp.c:1575
tcp_recvmsg+0x633/0x1a30 net/ipv4/tcp.c:2179
inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
sock_recvmsg_nosec net/socket.c:871 [inline]
sock_recvmsg net/socket.c:889 [inline]
sock_recvmsg+0x92/0xb0 net/socket.c:885
sock_read_iter+0x15f/0x1e0 net/socket.c:967
call_read_iter include/linux/fs.h:1864 [inline]
new_sync_read+0x389/0x4f0 fs/read_write.c:414
read to 0xffffffff85a7f190 of 8 bytes by task 10 on cpu 1:
rcu_gp_fqs_check_wake kernel/rcu/tree.c:1556 [inline]
rcu_gp_fqs_check_wake+0x93/0xd0 kernel/rcu/tree.c:1546
rcu_gp_fqs_loop+0x36c/0x580 kernel/rcu/tree.c:1611
rcu_gp_kthread+0x143/0x220 kernel/rcu/tree.c:1768
kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 10 Comm: rcu_preempt Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
[ paulmck: Added another READ_ONCE() for RCU CPU stall warnings. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 64ae572bc7d0060429e40e1c8d803ce5eb31a0d6 upstream.
Reading the sched_cmdline_ref and sched_tgid_ref initial state within
tracing_start_sched_switch without holding the sched_register_mutex is
racy against concurrent updates, which can lead to tracepoint probes
being registered more than once (and thus trigger warnings within
tracepoint.c).
[ May be the fix for this bug ]
Link: https://lore.kernel.org/r/000000000000ab6f84056c786b93@google.com
Link: http://lkml.kernel.org/r/20190817141208.15226-1-mathieu.desnoyers@efficios.com
Cc: stable@vger.kernel.org
CC: Steven Rostedt (VMware) <rostedt@goodmis.org>
CC: Joel Fernandes (Google) <joel@joelfernandes.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul E. McKenney <paulmck@linux.ibm.com>
Reported-by: syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com
Fixes: d914ba37d7145 ("tracing: Add support for recording tgid of tasks")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit def97da136515cb289a14729292c193e0a93bc64 ]
Commit f92b070f2dc8 ("printk: Do not miss new messages when replaying
the log") introduced a new variable @exclusive_console_stop_seq to
store when an exclusive console should stop printing. It should be
set to the @console_seq value at registration. However, @console_seq
is previously set to @syslog_seq so that the exclusive console knows
where to begin. This results in the exclusive console immediately
reactivating all the other consoles and thus repeating the messages
for those consoles.
Set @console_seq after @exclusive_console_stop_seq has stored the
current @console_seq value.
Fixes: f92b070f2dc8 ("printk: Do not miss new messages when replaying the log")
Link: http://lkml.kernel.org/r/20191219115322.31160-1-john.ogness@linutronix.de
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f6d061d617124abbd55396a3bc37b9bf7d33233c ]
In module_add_modinfo_attrs() if sysfs_create_file() fails
on the first iteration of the loop (so i = 0), we forget to
free the modinfo_attrs.
Fixes: bc6f2a757d52 ("kernel/module: Fix mem leak in module_add_modinfo_attrs")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 3bc0bb36fa30e95ca829e9cf480e1ef7f7638333 upstream.
The test_cgcore_no_internal_process_constraint_on_threads selftest when
running with subsystem controlling noise triggers two warnings:
> [ 597.443115] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3131 cgroup_apply_control_enable+0xe0/0x3f0
> [ 597.443413] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3177 cgroup_apply_control_disable+0xa6/0x160
Both stem from a call to cgroup_type_write. The first warning was also
triggered by syzkaller.
When we're switching cgroup to threaded mode shortly after a subsystem
was disabled on it, we can see the respective subsystem css dying there.
The warning in cgroup_apply_control_enable is harmless in this case
since we're not adding new subsys anyway.
The warning in cgroup_apply_control_disable indicates an attempt to kill
css of recently disabled subsystem repeatedly.
The commit prevents these situations by making cgroup_type_write wait
for all dying csses to go away before re-applying subtree controls.
When at it, the locations of WARN_ON_ONCE calls are moved so that
warning is triggered only when we are about to misuse the dying css.
Reported-by: syzbot+5493b2a54d31d6aea629@syzkaller.appspotmail.com
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f6783319737f28e4436a69611853a5a098cbe974 upstream.
Sargun reported a crash:
"I picked up c40f7d74c741a907cfaeb73a7697081881c497d0 sched/fair: Fix
infinite loop in update_blocked_averages() by reverting a9e7f6544b9c
and put it on top of 4.19.13. In addition to this, I uninlined
list_add_leaf_cfs_rq for debugging.
This revealed a new bug that we didn't get to because we kept getting
crashes from the previous issue. When we are running with cgroups that
are rapidly changing, with CFS bandwidth control, and in addition
using the cpusets cgroup, we see this crash. Specifically, it seems to
occur with cgroups that are throttled and we change the allowed
cpuset."
The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that
it will walk down to root the 1st time a cfs_rq is used and we will finish
to add either a cfs_rq without parent or a cfs_rq with a parent that is
already on the list. But this is not always true in presence of throttling.
Because a cfs_rq can be throttled even if it has never been used but other CPUs
of the cgroup have already used all the bandwdith, we are not sure to go down to
the root and add all cfs_rq in the list.
Ensure that all cfs_rq will be added in the list even if they are throttled.
[ mingo: Fix !CGROUPS build. ]
Reported-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: tj@kernel.org
Fixes: 9c2791f936ef ("Fix hierarchical order in rq->leaf_cfs_rq_list")
Link: https://lkml.kernel.org/r/1548825767-10799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Janne Huttunen <janne.huttunen@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5d299eabea5a251fbf66e8277704b874bbba92dc upstream.
The magic in list_add_leaf_cfs_rq() requires that at the end of
enqueue_task_fair():
rq->tmp_alone_branch == &rq->lead_cfs_rq_list
If this is violated, list integrity is compromised for list entries
and the tmp_alone_branch pointer might dangle.
Also, reflow list_add_leaf_cfs_rq() while there. This looses one
indentation level and generates a form that's convenient for the next
patch.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Janne Huttunen <janne.huttunen@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit feee6b2989165631b17ac6d4ccdbf6759254e85a upstream.
-- snip --
- Missing arm64 hot(un)plug support
- Missing some vmem_altmap_offset() cleanups
- Missing sub-section hotadd support
- Missing unification of mm/hmm.c and kernel/memremap.c
-- snip --
We currently try to shrink a single zone when removing memory. We use
the zone of the first page of the memory we are removing. If that
memmap was never initialized (e.g., memory was never onlined), we will
read garbage and can trigger kernel BUGs (due to a stale pointer):
BUG: unable to handle page fault for address: 000000000000353d
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
RIP: 0010:clear_zone_contiguous+0x5/0x10
Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__remove_pages+0x4b/0x640
arch_remove_memory+0x63/0x8d
try_remove_memory+0xdb/0x130
__remove_memory+0xa/0x11
acpi_memory_device_remove+0x70/0x100
acpi_bus_trim+0x55/0x90
acpi_device_hotplug+0x227/0x3a0
acpi_hotplug_work_fn+0x1a/0x30
process_one_work+0x221/0x550
worker_thread+0x50/0x3b0
kthread+0x105/0x140
ret_from_fork+0x3a/0x50
Modules linked in:
CR2: 000000000000353d
Instead, shrink the zones when offlining memory or when onlining failed.
Introduce and use remove_pfn_range_from_zone(() for that. We now
properly shrink the zones, even if we have DIMMs whereby
- Some memory blocks fall into no zone (never onlined)
- Some memory blocks fall into multiple zones (offlined+re-onlined)
- Multiple memory blocks that fall into different zones
Drop the zone parameter (with a potential dubious value) from
__remove_pages() and __remove_section().
Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 77e080e7680e1e615587352f70c87b9e98126d03 upstream.
-- snip --
- Missing mm/hmm.c and kernel/memremap.c unification.
-- hmm code does not need fixes (no altmap)
- Missing 7cc7867fb061 ("mm/devm_memremap_pages: enable sub-section remap")
-- snip --
Patch series "mm/memory_hotplug: Shrink zones before removing memory",
v6.
This series fixes the access of uninitialized memmaps when shrinking
zones/nodes and when removing memory. Also, it contains all fixes for
crashes that can be triggered when removing certain namespace using
memunmap_pages() - ZONE_DEVICE, reported by Aneesh.
We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
more involved (we don't have SECTION_IS_ONLINE as an indicator), and
shrinking is only of limited use (set_zone_contiguous() cannot detect
the ZONE_DEVICE as contiguous).
We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount
of code to a minimum. Shrinking is especially necessary to keep
zone->contiguous set where possible, especially, on memory unplug of
DIMMs at zone boundaries.
--------------------------------------------------------------------------
Zones are now properly shrunk when offlining memory blocks or when
onlining failed. This allows to properly shrink zones on memory unplug
even if the separate memory blocks of a DIMM were onlined to different
zones or re-onlined to a different zone after offlining.
Example:
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 0
present 0
managed 0
:/# echo "online_movable" > /sys/devices/system/memory/memory41/state
:/# echo "online_movable" > /sys/devices/system/memory/memory43/state
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 98304
present 65536
managed 65536
:/# echo 0 > /sys/devices/system/memory/memory43/online
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 32768
present 32768
managed 32768
:/# echo 0 > /sys/devices/system/memory/memory41/online
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 0
present 0
managed 0
This patch (of 10):
With an altmap, the memmap falling into the reserved altmap space are not
initialized and, therefore, contain a garbage NID and a garbage zone.
Make sure to read the NID/zone from a memmap that was initialized.
This fixes a kernel crash that is observed when destroying a namespace:
kernel BUG at include/linux/mm.h:1107!
cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
pc: c0000000004b9728: memunmap_pages+0x238/0x340
lr: c0000000004b9724: memunmap_pages+0x234/0x340
...
pid = 3669, comm = ndctl
kernel BUG at include/linux/mm.h:1107!
devm_action_release+0x30/0x50
release_nodes+0x268/0x2d0
device_release_driver_internal+0x174/0x240
unbind_store+0x13c/0x190
drv_attr_store+0x44/0x60
sysfs_kf_write+0x70/0xa0
kernfs_fop_write+0x1ac/0x290
__vfs_write+0x3c/0x70
vfs_write+0xe4/0x200
ksys_write+0x7c/0x140
system_call+0x5c/0x68
The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f4833 ("mm,
devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I
think we will never have driver reserved memory with
MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).
[david@redhat.com: minimze code changes, rephrase description]
Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com
Fixes: 2c2a5af6fed2 ("mm, memory_hotplug: add nid parameter to arch_remove_memory")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Damian Tometzki <damian.tometzki@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Rich Felker <dalias@libc.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2c2a5af6fed20cf74401c9d64319c76c5ff81309 upstream.
-- snip --
Missing unification of mm/hmm.c and kernel/memremap.c
-- snip --
Patch series "Do not touch pages in hot-remove path", v2.
This patchset aims for two things:
1) A better definition about offline and hot-remove stage
2) Solving bugs where we can access non-initialized pages
during hot-remove operations [2] [3].
This is achieved by moving all page/zone handling to the offline
stage, so we do not need to access pages when hot-removing memory.
[1] https://patchwork.kernel.org/cover/10691415/
[2] https://patchwork.kernel.org/patch/10547445/
[3] https://www.spinics.net/lists/linux-mm/msg161316.html
This patch (of 5):
This is a preparation for the following-up patches. The idea of passing
the nid is that it will allow us to get rid of the zone parameter
afterwards.
Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8bcebc77e85f3d7536f96845a0fe94b1dddb6af0 upstream.
While working on a tool to convert SQL syntex into the histogram language of
the kernel, I discovered the following bug:
# echo 'first u64 start_time u64 end_time pid_t pid u64 delta' >> synthetic_events
# echo 'hist:keys=pid:start=common_timestamp' > events/sched/sched_waking/trigger
# echo 'hist:keys=next_pid:delta=common_timestamp-$start,start2=$start:onmatch(sched.sched_waking).trace(first,$start2,common_timestamp,next_pid,$delta)' > events/sched/sched_switch/trigger
Would not display any histograms in the sched_switch histogram side.
But if I were to swap the location of
"delta=common_timestamp-$start" with "start2=$start"
Such that the last line had:
# echo 'hist:keys=next_pid:start2=$start,delta=common_timestamp-$start:onmatch(sched.sched_waking).trace(first,$start2,common_timestamp,next_pid,$delta)' > events/sched/sched_switch/trigger
The histogram works as expected.
What I found out is that the expressions clear out the value once it is
resolved. As the variables are resolved in the order listed, when
processing:
delta=common_timestamp-$start
The $start is cleared. When it gets to "start2=$start", it errors out with
"unresolved symbol" (which is silent as this happens at the location of the
trace), and the histogram is dropped.
When processing the histogram for variable references, instead of adding a
new reference for a variable used twice, use the same reference. That way,
not only is it more efficient, but the order will no longer matter in
processing of the variables.
From Tom Zanussi:
"Just to clarify some more about what the problem was is that without
your patch, we would have two separate references to the same variable,
and during resolve_var_refs(), they'd both want to be resolved
separately, so in this case, since the first reference to start wasn't
part of an expression, it wouldn't get the read-once flag set, so would
be read normally, and then the second reference would do the read-once
read and also be read but using read-once. So everything worked and
you didn't see a problem:
from: start2=$start,delta=common_timestamp-$start
In the second case, when you switched them around, the first reference
would be resolved by doing the read-once, and following that the second
reference would try to resolve and see that the variable had already
been read, so failed as unset, which caused it to short-circuit out and
not do the trigger action to generate the synthetic event:
to: delta=common_timestamp-$start,start2=$start
With your patch, we only have the single resolution which happens
correctly the one time it's resolved, so this can't happen."
Link: https://lore.kernel.org/r/20200116154216.58ca08eb@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 067fe038e70f6 ("tracing: Add variable reference handling to hist triggers")
Reviewed-by: Tom Zanuss <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit de40f033d4e84e843d6a12266e3869015ea9097c upstream.
Have create_var_ref() manage the hist trigger's var_ref list, rather
than having similar code doing it in multiple places. This cleans up
the code and makes sure var_refs are always accounted properly.
Also, document the var_ref-related functions to make what their
purpose clearer.
Link: http://lkml.kernel.org/r/05ddae93ff514e66fc03897d6665231892939913.1545161087.git.tom.zanussi@linux.intel.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 656fe2ba85e81d00e4447bf77b8da2be3c47acb2 upstream.
Since every var ref for a trigger has an entry in the var_ref[] array,
use that to destroy the var_refs, instead of piecemeal via the field
expressions.
This allows us to avoid having to keep and treat differently separate
lists for the action-related references, which future patches will
remove.
Link: http://lkml.kernel.org/r/fad1a164f0e257c158e70d6eadbf6c586e04b2a2.1545161087.git.tom.zanussi@linux.intel.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit aeed8aa3874dc15b9d82a6fe796fd7cfbb684448 upstream.
With CONFIG_PROVE_RCU_LIST, I had many suspicious RCU warnings
when I ran ftracetest trigger testcases.
-----
# dmesg -c > /dev/null
# ./ftracetest test.d/trigger
...
# dmesg | grep "RCU-list traversed" | cut -f 2 -d ] | cut -f 2 -d " "
kernel/trace/trace_events_hist.c:6070
kernel/trace/trace_events_hist.c:1760
kernel/trace/trace_events_hist.c:5911
kernel/trace/trace_events_trigger.c:504
kernel/trace/trace_events_hist.c:1810
kernel/trace/trace_events_hist.c:3158
kernel/trace/trace_events_hist.c:3105
kernel/trace/trace_events_hist.c:5518
kernel/trace/trace_events_hist.c:5998
kernel/trace/trace_events_hist.c:6019
kernel/trace/trace_events_hist.c:6044
kernel/trace/trace_events_trigger.c:1500
kernel/trace/trace_events_trigger.c:1540
kernel/trace/trace_events_trigger.c:539
kernel/trace/trace_events_trigger.c:584
-----
I investigated those warnings and found that the RCU-list
traversals in event trigger and hist didn't need to use
RCU version because those were called only under event_mutex.
I also checked other RCU-list traversals related to event
trigger list, and found that most of them were called from
event_hist_trigger_func() or hist_unregister_trigger() or
register/unregister functions except for a few cases.
Replace these unneeded RCU-list traversals with normal list
traversal macro and lockdep_assert_held() to check the
event_mutex is held.
Link: http://lkml.kernel.org/r/157680910305.11685.15110237954275915782.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: 30350d65ac567 ("tracing: Add variable support to hist triggers")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d0fbb51dfaa612f960519b798387be436e8f83c5 ]
We need to drop the bpf_devs_lock on error before returning.
Fixes: 9fd7c5559165 ("bpf: offload: aggregate offloads per-device")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Link: https://lore.kernel.org/bpf/20191104091536.GB31509@mwanda
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 711419e504ebd68c8f03656616829c8ad7829389 ]
Recently device pass-through stops working for Linux VM running on Hyper-V.
git-bisect shows the regression is caused by the recent commit
467a3bb97432 ("PCI: hv: Allocate a named fwnode ..."), but the root cause
is that the commit d59f6617eef0 forgets to set the domain->fwnode for
IRQCHIP_FWNODE_NAMED*, and as a result:
1. The domain->fwnode remains to be NULL.
2. irq_find_matching_fwspec() returns NULL since "h->fwnode == fwnode" is
false, and pci_set_bus_msi_domain() sets the Hyper-V PCI root bus's
msi_domain to NULL.
3. When the device is added onto the root bus, the device's dev->msi_domain
is set to NULL in pci_set_msi_domain().
4. When a device driver tries to enable MSI-X, pci_msi_setup_msi_irqs()
calls arch_setup_msi_irqs(), which uses the native MSI chip (i.e.
arch/x86/kernel/apic/msi.c: pci_msi_controller) to set up the irqs, but
actually pci_msi_setup_msi_irqs() is supposed to call
msi_domain_alloc_irqs() with the hbus->irq_domain, which is created in
hv_pcie_init_irq_domain() and is associated with the Hyper-V chip
hv_msi_irq_chip. Consequently, the irq line is not properly set up, and
the device driver can not receive any interrupt.
Fixes: d59f6617eef0 ("genirq: Allow fwnode to carry name information only")
Fixes: 467a3bb97432 ("PCI: hv: Allocate a named fwnode instead of an address-based one")
Reported-by: Lili Deng <v-lide@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/PU1P153MB01694D9AF625AC335C600C5FBFBE0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM
Signed-off-by: Sasha Levin <sashal@kernel.org>