36557 Commits

Author SHA1 Message Date
Hangbin Liu
e624d4ed4a xdp: Extend xdp_redirect_map with broadcast support
This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
extend xdp_redirect_map for broadcast support.

With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
excluded when do broadcasting.

When getting the devices in dev hash map via dev_map_hash_get_next_key(),
there is a possibility that we fall back to the first key when a device
was removed. This will duplicate packets on some interfaces. So just walk
the whole buckets to avoid this issue. For dev array map, we also walk the
whole map to find valid interfaces.

Function bpf_clear_redirect_map() was removed in
commit ee75aef23afe ("bpf, xdp: Restructure redirect actions").
Add it back as we need to use ri->map again.

With test topology:
  +-------------------+             +-------------------+
  | Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
  +-------------------+             |                   |
                                    |   Host B          |
  +-------------------+             |                   |
  | Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
  +-------------------+             |                   |
                                    |          +------+ |
                                    | veth0 -- | Peer | |
                                    | veth1 -- |      | |
                                    | veth2 -- |  NS  | |
                                    |          +------+ |
                                    +-------------------+

On Host A:
 # pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64

On Host B(Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):
Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
All the veth peers in the NS have a XDP_DROP program loaded. The
forward_map max_entries in xdp_redirect_map_multi is modify to 4.

Testing the performance impact on the regular xdp_redirect path with and
without patch (to check impact of additional check for broadcast mode):

5.12 rc4         | redirect_map        i40e->i40e      |    2.0M |  9.7M
5.12 rc4         | redirect_map        i40e->veth      |    1.7M | 11.8M
5.12 rc4 + patch | redirect_map        i40e->i40e      |    2.0M |  9.6M
5.12 rc4 + patch | redirect_map        i40e->veth      |    1.7M | 11.7M

Testing the performance when cloning packets with the redirect_map_multi
test, using a redirect map size of 4, filled with 1-3 devices:

5.12 rc4 + patch | redirect_map multi  i40e->veth (x1) |    1.7M | 11.4M
5.12 rc4 + patch | redirect_map multi  i40e->veth (x2) |    1.1M |  4.3M
5.12 rc4 + patch | redirect_map multi  i40e->veth (x3) |    0.8M |  2.6M

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210519090747.1655268-3-liuhangbin@gmail.com
2021-05-26 09:46:16 +02:00
Jesper Dangaard Brouer
cb261b594b bpf: Run devmap xdp_prog on flush instead of bulk enqueue
This changes the devmap XDP program support to run the program when the
bulk queue is flushed instead of before the frame is enqueued. This has
a couple of benefits:

- It "sorts" the packets by destination devmap entry, and then runs the
  same BPF program on all the packets in sequence. This ensures that we
  keep the XDP program and destination device properties hot in I-cache.

- It makes the multicast implementation simpler because it can just
  enqueue packets using bq_enqueue() without having to deal with the
  devmap program at all.

The drawback is that if the devmap program drops the packet, the enqueue
step is redundant. However, arguably this is mostly visible in a
micro-benchmark, and with more mixed traffic the I-cache benefit should
win out. The performance impact of just this patch is as follows:

Using 2 10Gb i40e NIC, redirecting one to another, or into a veth interface,
which do XDP_DROP on veth peer. With xdp_redirect_map in sample/bpf, send
pkts via pktgen cmd:
./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

There are about +/- 0.1M deviation for native testing, the performance
improved for the base-case, but some drop back with xdp devmap prog attached.

Version          | Test                           | Generic | Native | Native + 2nd xdp_prog
5.12 rc4         | xdp_redirect_map   i40e->i40e  |    1.9M |   9.6M |  8.4M
5.12 rc4         | xdp_redirect_map   i40e->veth  |    1.7M |  11.7M |  9.8M
5.12 rc4 + patch | xdp_redirect_map   i40e->i40e  |    1.9M |   9.8M |  8.0M
5.12 rc4 + patch | xdp_redirect_map   i40e->veth  |    1.7M |  12.0M |  9.4M

When bq_xmit_all() is called from bq_enqueue(), another packet will
always be enqueued immediately after, so clearing dev_rx, xdp_prog and
flush_node in bq_xmit_all() is redundant. Move the clear to __dev_flush(),
and only check them once in bq_enqueue() since they are all modified
together.

This change also has the side effect of extending the lifetime of the
RCU-protected xdp_prog that lives inside the devmap entries: Instead of
just living for the duration of the XDP program invocation, the
reference now lives all the way until the bq is flushed. This is safe
because the bq flush happens at the end of the NAPI poll loop, so
everything happens between a local_bh_disable()/local_bh_enable() pair.
However, this is by no means obvious from looking at the call sites; in
particular, some drivers have an additional rcu_read_lock() around only
the XDP program invocation, which only confuses matters further.
Cleaning this up will be done in a separate patch series.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210519090747.1655268-2-liuhangbin@gmail.com
2021-05-26 09:46:16 +02:00
Frederic Weisbecker
a8ea6fc9b0 sched: Stop PF_NO_SETAFFINITY from being inherited by various init system threads
Commit:

  00b89fe0197f ("sched: Make the idle task quack like a per-CPU kthread")

... added PF_KTHREAD | PF_NO_SETAFFINITY to the idle kernel threads.

Unfortunately these properties are inherited to the init/0 children
through kernel_thread() calls: init/1 and kthreadd. There are several
side effects to that:

1) kthreadd affinity can not be reset anymore from userspace. Also
   PF_NO_SETAFFINITY propagates to all kthreadd children, including
   the unbound kthreads Therefore it's not possible anymore to overwrite
   the affinity of any of them. Here is an example of warning reported
   by rcutorture:

		WARNING: CPU: 0 PID: 116 at kernel/rcu/tree_nocb.h:1306 rcu_bind_current_to_nocb+0x31/0x40
		Call Trace:
		 rcu_torture_fwd_prog+0x62/0x730
		 kthread+0x122/0x140
		 ret_from_fork+0x22/0x30

2) init/1 does an exec() in the end which clears both
   PF_KTHREAD and PF_NO_SETAFFINITY so we are fine once kernel_init()
   escapes to userspace. But until then, no initcall or init code can
   successfully call sched_setaffinity() to init/1.

   Also PF_KTHREAD looks legit on init/1 before it calls exec() but
   we better be careful with unknown introduced side effects.

One way to solve the PF_NO_SETAFFINITY issue is to not inherit this flag
on copy_process() at all. The cases where it matters are:

* fork_idle(): explicitly set the flag already.
* fork() syscalls: userspace tasks that shouldn't be concerned by that.
* create_io_thread(): the callers explicitly attribute the flag to the
                      newly created tasks.
* kernel_thread():
	- Fix the issues on init/1 and kthreadd
	- Fix the issues on kthreadd children.
	- Usermode helper created by an unbound workqueue. This shouldn't
	  matter. In the worst case it gives more control to userspace
	  on setting affinity to these short living tasks although this can
	  be tuned with inherited unbound workqueues affinity already.

Fixes: 00b89fe0197f ("sched: Make the idle task quack like a per-CPU kthread")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20210525235849.441842-1-frederic@kernel.org
2021-05-26 08:58:53 +02:00
Daniel Borkmann
a703619127 bpf: No need to simulate speculative domain for immediates
In 801c6058d14a ("bpf: Fix leakage of uninitialized bpf stack under
speculation") we replaced masking logic with direct loads of immediates
if the register is a known constant. Given in this case we do not apply
any masking, there is also no reason for the operation to be truncated
under the speculative domain.

Therefore, there is also zero reason for the verifier to branch-off and
simulate this case, it only needs to do it for unknown but bounded scalars.
As a side-effect, this also enables few test cases that were previously
rejected due to simulation under zero truncation.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-05-25 22:08:53 +02:00
Daniel Borkmann
bb01a1bba5 bpf: Fix mask direction swap upon off reg sign change
Masking direction as indicated via mask_to_left is considered to be
calculated once and then used to derive pointer limits. Thus, this
needs to be placed into bpf_sanitize_info instead so we can pass it
to sanitize_ptr_alu() call after the pointer move. Piotr noticed a
corner case where the off reg causes masking direction change which
then results in an incorrect final aux->alu_limit.

Fixes: 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic mask")
Reported-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-05-25 22:08:53 +02:00
Daniel Borkmann
3d0220f686 bpf: Wrap aux data inside bpf_sanitize_info container
Add a container structure struct bpf_sanitize_info which holds
the current aux info, and update call-sites to sanitize_ptr_alu()
to pass it in. This is needed for passing in additional state
later on.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-05-25 22:08:53 +02:00
Daniel Borkmann
5c9d706f61 bpf: Fix BPF_LSM kconfig symbol dependency
Similarly as 6bdacdb48e94 ("bpf: Fix BPF_JIT kconfig symbol dependency") we
need to detangle the hard BPF_LSM dependency on NET. This was previously
implicit by its dependency on BPF_JIT which itself was dependent on NET (but
without any actual/real hard dependency code-wise). Given the latter was
lifted, so should be the former as BPF_LSMs could well exist on net-less
systems. This therefore also fixes a randconfig build error recently reported
by Randy:

  ld: kernel/bpf/bpf_lsm.o: in function `bpf_lsm_func_proto':
  bpf_lsm.c:(.text+0x1a0): undefined reference to `bpf_sk_storage_get_proto'
  ld: bpf_lsm.c:(.text+0x1b8): undefined reference to `bpf_sk_storage_delete_proto'
  [...]

Fixes: b24abcff918a ("bpf, kconfig: Add consolidated menu entry for bpf with core options")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
2021-05-25 21:16:23 +02:00
Pavel Begunkov
a82adc7650 futex: Deduplicate cond_resched() invocation in futex_wake_op()
After pagefaulting in futex_wake_op() both branches do cond_resched()
before retry. Deduplicate it as compilers cannot figure it out themself.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lore.kernel.org/r/9b2588c1fd33c91fb01c4e348a3b647ab2c8baab.1621258128.git.asml.silence@gmail.com
2021-05-25 17:30:15 +02:00
Yuan ZhaoXiong
130708331b cpu/hotplug: Simplify access to percpu cpuhp_state
It is unnecessary to invoke per_cpu_ptr() everytime to access cpuhp_state.
Use the available pointer instead.

Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/1621776690-13264-1-git-send-email-yuanzhaoxiong@baidu.com
2021-05-25 17:24:52 +02:00
Zhen Lei
8fb33b6055 bpf: Fix spelling mistakes
Fix some spelling mistakes in comments:
aother ==> another
Netiher ==> Neither
desribe ==> describe
intializing ==> initializing
funciton ==> function
wont ==> won't and move the word 'the' at the end to the next line
accross ==> across
pathes ==> paths
triggerred ==> triggered
excute ==> execute
ether ==> either
conervative ==> conservative
convetion ==> convention
markes ==> marks
interpeter ==> interpreter

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210525025659.8898-2-thunder.leizhen@huawei.com
2021-05-24 21:13:05 -07:00
Denis Salopek
3e87f192b4 bpf: Add lookup_and_delete_elem support to hashtab
Extend the existing bpf_map_lookup_and_delete_elem() functionality to
hashtab map types, in addition to stacks and queues.
Create a new hashtab bpf_map_ops function that does lookup and deletion
of the element under the same bucket lock and add the created map_ops to
bpf.h.

Signed-off-by: Denis Salopek <denis.salopek@sartura.hr>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/4d18480a3e990ffbf14751ddef0325eed3be2966.1620763117.git.denis.salopek@sartura.hr
2021-05-24 13:30:26 -07:00
Linus Torvalds
1434a31278 Merge branch 'for-5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:

 - "cgroup_disable=" boot param was being applied too late confusing
   some subsystems. Fix it by moving application to __setup() time.

 - Comment spelling fixes. Included here to lower the chance of trivial
   future merge conflicts.

* 'for-5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix spelling mistakes
  cgroup: disable controllers at parse time
2021-05-24 07:46:31 -10:00
Tejun Heo
c2a1197154 Merge branch 'for-5.13-fixes' into for-5.14 2021-05-24 13:43:56 -04:00
Linus Torvalds
5df7ae7bed Merge branch 'for-5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "One commit to fix spurious workqueue stall warnings across VM
  suspensions"

* 'for-5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  wq: handle VM suspension in stall detection
2021-05-24 07:28:09 -10:00
Zhen Lei
08b2b6fdf6 cgroup: fix spelling mistakes
Fix some spelling mistakes in comments:
hierarhcy ==> hierarchy
automtically ==> automatically
overriden ==> overridden
In absense of .. or ==> In absence of .. and
assocaited ==> associated
taget ==> target
initate ==> initiate
succeded ==> succeeded
curremt ==> current
udpated ==> updated

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-05-24 12:45:26 -04:00
Zhen Lei
6be2408a1e PM: hibernate: fix spelling mistakes
Fix some spelling mistakes in comments:

corresonds ==> corresponds
alocated ==> allocated
unitialized ==> uninitialized
Deompression ==> Decompression

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-05-24 16:17:07 +02:00
zhangyi (F)
3af3d772f7 block_dump: remove block_dump feature
We have already delete block_dump feature in mark_inode_dirty() because
it can be replaced by tracepoints, now we also remove the part in
submit_bio() for the same reason. The part of block dump feature in
submit_bio() dump the write process, write region and sectors on the
target disk into kernel message. it can be replaced by
block_bio_queue tracepoint in submit_bio_checks(), so we do not need
block_dump anymore, remove the whole block_dump feature.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210313030146.2882027-3-yi.zhang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-24 06:47:21 -06:00
Will Deacon
da3862e727 Export irq_set_affinity() for cleaning up drivers/perf
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCk1J0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofMAD/46QCJ9CDJZbBSuF5TBiA6mK90WWqXj
 tcGoZntClBUhFWE7+xR9S/si0z/mI1qysRK9QWzLD6UTBOEclwL4uQnWqinJCxoz
 ekNESFVdDo6JA3q4ybwV0D1Ey3JuewlUHuq4wRWGVCFbSknv2oVha9yL6BPUk+Or
 YRaZAOxaVpOO6XHpPPmFPn7ENOV7Eh7KAtKItFEZMKsIZNSTGYWHhTjBp4U4FgTQ
 Os2YF12p8eyTteQQpE445M6fn/2YQKBzUA+lRE5BxJigZPqydT2+oB/NGIuOEsyY
 rab943VBjbTZyT1AphdpExA3xtw3miNwfXG+ck2ZeZ2SD0RNwdNgnnyLQEmSdedk
 HU3kXL/omWlZfMJmmn3U93ydmC1CZlwR0JP/a0KcbgihctQ2/6KLtNjG+iCRha6s
 xlPhZgljKzKGg+cT6XhVtyxkGKHhntd3llLYzuS0+6Ny89kudFd0Fo47CTf31dS5
 NobDxoqMaGkFO0fgG4eD0xj9Al+1lsV/h+jTt7MDa6y9uQQNSKP9iLDoLupCRLwl
 8UmRfNkS0ylr6K2dtjmrNx3PMva44Ttj1bNH+pRoolgZwTkrG63QoomGb+SuzBZ7
 6ZdJ98Ig55GdV5gU7yFy5qWaK49/QqbYWo/QzpeB+7Ahf2K0bjLs3NSIkGzqsSt4
 jzNH3U0aFNcSPw==
 =Vzcx
 -----END PGP SIGNATURE-----

Merge tag 'irq-export-set-affinity' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-next/perf

Export irq_set_affinity() for cleaning up drivers/perf

Pull export of irq_set_affinity() from Thomas Gleixner, so we can convert
all new and exiting Arm PMU drivers to the new interface.

* tag 'irq-export-set-affinity' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Export affinity setter for modules
2021-05-24 10:58:14 +01:00
Linus Torvalds
0898678c74 Two locking fixes:
- Invoke the lockdep tracepoints in the correct place so the ordering
    is correct again.
 
  - Don't leave the mutex WAITER bit stale when the last waiter is dropping
    out early due to a signal as that forces all subsequent lock operations
    needlessly into the slowpath until it's cleaned up again.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCqVmQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUqXEACR3LsrJ+VfkktVlZK8DWygRcjLFvvo
 mV620OEdJcCOwy/Qs3qKkyIoMiba7ASrbIWoZa28+tbZZ/iovXotkRH5rh1MhDoU
 3S6QlpPeg7shyN5iDG0JlvDTVPQs6g4oC+8bAQmJIuUeQ7hPh72O49vIDSEF6mzG
 6j8C0l5tYvmojgJKY6PJYWSZ6MNVv/gCUWWwRdmShSYmdNR3W/GaN6jTFI6qVitS
 a3NE5ksVr1LC5Ro5QraVdmif/XlUxZ8UaEN6VyaXjBuOBO2UxUevm61khv0X5fpS
 IHpcDjZukgSwccXSzd9bttWJ5EKqLDC+nfFeOdJg2GFXfRZd+uGwVV3IN2U8r7fj
 pP9Wcy5dDJrFF7dVYnDU7y7IP2ZOwDoh98mQkVt90SV4zp2HcZnl3x5iqvxQrND3
 r3c88myDOZBCCroRIMxxlNpYWOozlVYtHi/mmFj3x97YoPQYwpuMunz+/i8b5j6B
 UvtM2VsevyiGZd9pzSZ/dl3Tf19VXrtY60Sc8qG6LdTukOldLBq6J9fOcUI2fHCZ
 kXiS+utT1nIWyvwRgoMcFOTOTgfzdDKRYkPu7pMVcNoRB91KgTmozGVCT4uIN3dF
 kHpm+FyGLgKDdL8AB7VTWSSTFgb2quZBeLGSr4OnVVSTJlQ3xfpKD5vtKBysYhf7
 6My7E5pCZhgr9Q==
 =7vqW
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2021-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "Two locking fixes:

   - Invoke the lockdep tracepoints in the correct place so the ordering
     is correct again

   - Don't leave the mutex WAITER bit stale when the last waiter is
     dropping out early due to a signal as that forces all subsequent
     lock operations needlessly into the slowpath until it's cleaned up
     again"

* tag 'locking-urgent-2021-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/mutex: clear MUTEX_FLAGS if wait_list is empty due to signal
  locking/lockdep: Correct calling tracepoints
2021-05-23 06:30:08 -10:00
Christophe Leroy
7ee3e97e00 kprobes: Allow architectures to override optinsn page allocation
Some architectures like powerpc require a non standard
allocation of optinsn page, because module pages are
too far from the kernel for direct branches.

Define weak alloc_optinsn_page() and free_optinsn_page(), that
fall back on alloc_insn_page() and free_insn_page() when not
overridden by the architecture.

Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/40a43d6df1fdf41ade36e9a46e60a4df774ca9f6.1620896780.git.christophe.leroy@csgroup.eu
2021-05-23 20:51:35 +10:00
Petr Mladek
0f90b88dbc watchdog: reliable handling of timestamps
Commit 9bf3bc949f8a ("watchdog: cleanup handling of false positives")
tried to handle a virtual host stopped by the host a more
straightforward and cleaner way.

But it introduced a risk of false softlockup reports.  The virtual host
might be stopped at any time, for example between
kvm_check_and_clear_guest_paused() and is_softlockup().  As a result,
is_softlockup() might read the updated jiffies and detects a softlockup.

A solution might be to put back kvm_check_and_clear_guest_paused() after
is_softlockup() and detect it.  But it would put back the cycle that
complicates the logic.

In fact, the handling of all the timestamps is not reliable.  The code
does not guarantee when and how many times the timestamps are read.  For
example, "period_ts" might be touched anytime also from NMI and re-read in
is_softlockup().  It works just by chance.

Fix all the problems by making the code even more explicit.

1. Make sure that "now" and "period_ts" timestamps are read only once.
   They might be changed at anytime by NMI or when the virtual guest is
   stopped by the host.  Note that "now" timestamp does this implicitly
   because "jiffies" is marked volatile.

2. "now" time must be read first.  The state of "period_ts" will
   decide whether it will be used or the period will get restarted.

3. kvm_check_and_clear_guest_paused() must be called before reading
   "period_ts".  It touches the variable when the guest was stopped.

As a result, "now" timestamp is used only when the watchdog was not
touched and the guest not stopped in the meantime.  "period_ts" is
restarted in all other situations.

Link: https://lkml.kernel.org/r/YKT55gw+RZfyoFf7@alley
Fixes: 9bf3bc949f8aeefeacea4b ("watchdog: cleanup handling of false positives")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-22 15:09:07 -10:00
Linus Torvalds
a0e31f3a38 Merge branch 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo fix from Eric Biederman:
 "During the merge window an issue with si_perf and the siginfo ABI came
  up. The alpha and sparc siginfo structure layout had changed with the
  addition of SIGTRAP TRAP_PERF and the new field si_perf.

  The reason only alpha and sparc were affected is that they are the
  only architectures that use si_trapno.

  Looking deeper it was discovered that si_trapno is used for only a few
  select signals on alpha and sparc, and that none of the other
  _sigfault fields past si_addr are used at all. Which means technically
  no regression on alpha and sparc.

  While the alignment concerns might be dismissed the abuse of si_errno
  by SIGTRAP TRAP_PERF does have the potential to cause regressions in
  existing userspace.

  While we still have time before userspace starts using and depending
  on the new definition siginfo for SIGTRAP TRAP_PERF this set of
  changes cleans up siginfo_t.

   - The si_trapno field is demoted from magic alpha and sparc status
     and made an ordinary union member of the _sigfault member of
     siginfo_t. Without moving it of course.

   - si_perf is replaced with si_perf_data and si_perf_type ending the
     abuse of si_errno.

   - Unnecessary additions to signalfd_siginfo are removed"

* 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo
  signal: Deliver all of the siginfo perf data in _perf
  signal: Factor force_sig_perf out of perf_sigtrap
  signal: Implement SIL_FAULT_TRAPNO
  siginfo: Move si_trapno inside the union inside _si_fault
2021-05-21 06:12:52 -10:00
Linus Torvalds
c1f47ebc9b Modules fixes for v5.13-rc3
- When CONFIG_MODULE_UNLOAD=n, module exit sections get sorted into the
   init region of the module in order to satisfy the requirements of
   jump_labels and static_calls. Previously, the exit section check was
   done in module_init_section(), but the solution there is not completely
   arch-indepedent as ARM is a special case and supplies its own
   module_init_section() function. Instead of pushing this logic further to
   the arch-specific code, switch to an arch-independent solution to check
   for module exit sections in the core module loader code in
   layout_sections() instead.
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEVrp26glSWYuDNrCUwEV+OM47wXIFAmCndboQHGpleXVAa2Vy
 bmVsLm9yZwAKCRDARX44zjvBcqzYD/9+W1lTEMXwEpayn71iMNs3ECMBY4ZaKbEd
 bxH1oLDADnIunS00tHBn4LO/t1tK18en/du9NtXfH1rmmH7jp9qDqJPxZbAOg6+i
 g8UxDAj1D+o6X5WfaVx8ygJ5JFTo927yk8rzQz4nqy8D7ZT87x4BvRaZF199jRMk
 MBQWDh9AfOC5DehauMKu4CjeEEWebPjG9QUQlg9ngQMrsGtGdOHv1Ex8zH66Oi4X
 xxOVqmRQu3yLMGfv03znHKvRSVXAponCZVT1VOiHBK9T1CaEgdP3eBE4mlTTAcLh
 X913OV69dQeNzoDFsECZfE4fmypym5CnvloCEg8Kx4zi5GN6TBO3RSU3EyRQChva
 7RgNFZsS43Q9d3Q3ZfL5HX9Db/kd4oex3tA3mvuAh4CkA9400x2H4FeHsfMrOfJB
 avxvgQhUUnfphQ0chIDHVtWfSAIWcLlNkl6VEx6MB5A/m4qJUz6VyZafoC2khE52
 98NzXNdmmRuuI+hrUqVsUsDH3ZybbAf362OqiImiRFjlVfbnZzUnSsVS3j4+Ckj8
 VWBy4fMivpBu5CT2P4CVhJAA2VfgzVJpZ7HrjSP4uSI1xcJxxsXb8hWbQ10NSynq
 tr3Q6gvWQN+/I4RLAv4XlXkvXPfIDN10DavhmY6ZBTuUTuRcTyP6Sd1TYvwUzMJi
 iGMTFFrXcg==
 =uBiQ
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull module fix from Jessica Yu:
 "When CONFIG_MODULE_UNLOAD=n, module exit sections get sorted into the
  init region of the module in order to satisfy the requirements of
  jump_labels and static_calls.

  Previously, the exit section check was done in module_init_section(),
  but the solution there is not completely arch-indepedent as ARM is a
  special case and supplies its own module_init_section() function.

  Instead of pushing this logic further to the arch-specific code,
  switch to an arch-independent solution to check for module exit
  sections in the core module loader code in layout_sections() instead"

* tag 'modules-for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: check for exit sections in layout_sections() instead of module_init_section()
2021-05-21 06:09:17 -10:00
Linus Torvalds
921dd23597 Merge branch 'urgent.2021.05.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull kcsan fix from Paul McKenney:
 "Fix for a regression introduced in this merge window by commit
  e36299efe7d7 ("kcsan, debugfs: Move debugfs file creation out of early
  init").

  The regression is not easy to trigger, requiring a KCSAN build using
  clang with CONFIG_LTO_CLANG=y. The fix is to simply make the
  kcsan_debugfs_init() function's type initcall-compatible. This has
  been posted to the relevant mailing lists:"

* 'urgent.2021.05.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  kcsan: Fix debugfs initcall return type
2021-05-20 14:43:33 -10:00
Yinjun Zhang
ceb11679d9 bpf, offload: Reorder offload callback 'prepare' in verifier
Commit 4976b718c355 ("bpf: Introduce pseudo_btf_id") switched the
order of resolve_pseudo_ldimm(), in which some pseudo instructions
are rewritten. Thus those rewritten instructions cannot be passed
to driver via 'prepare' offload callback.

Reorder the 'prepare' offload callback to fix it.

Fixes: 4976b718c355 ("bpf: Introduce pseudo_btf_id")
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210520085834.15023-1-simon.horman@netronome.com
2021-05-20 23:51:52 +02:00
Florent Revest
0af02eb2a7 bpf: Avoid using ARRAY_SIZE on an uninitialized pointer
The cppcheck static code analysis reported the following error:

    if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(bufs->tmp_bufs))) {
                                             ^
ARRAY_SIZE is a macro that expands to sizeofs, so bufs is not actually
dereferenced at runtime, and the code is actually safe. But to keep
things tidy, this patch removes the need for a call to ARRAY_SIZE by
extracting the size of the array into a macro. Cppcheck should no longer
be confused and the code ends up being a bit cleaner.

Fixes: e2d5b2bb769f ("bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20210517092830.1026418-2-revest@chromium.org
2021-05-20 23:48:38 +02:00
Florent Revest
8afcc19fbf bpf: Clarify a bpf_bprintf_prepare macro
The per-cpu buffers contain bprintf data rather than printf arguments.
The macro name and comment were a bit confusing, this rewords them in a
clearer way.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20210517092830.1026418-1-revest@chromium.org
2021-05-20 23:48:38 +02:00
Daniel Borkmann
6bdacdb48e bpf: Fix BPF_JIT kconfig symbol dependency
Randy reported a randconfig build error recently on i386:

  ld: arch/x86/net/bpf_jit_comp32.o: in function `do_jit':
  bpf_jit_comp32.c:(.text+0x28c9): undefined reference to `__bpf_call_base'
  ld: arch/x86/net/bpf_jit_comp32.o: in function `bpf_int_jit_compile':
  bpf_jit_comp32.c:(.text+0x3694): undefined reference to `bpf_jit_blind_constants'
  ld: bpf_jit_comp32.c:(.text+0x3719): undefined reference to `bpf_jit_binary_free'
  ld: bpf_jit_comp32.c:(.text+0x3745): undefined reference to `bpf_jit_binary_alloc'
  ld: bpf_jit_comp32.c:(.text+0x37d3): undefined reference to `bpf_jit_prog_release_other'
  [...]

The cause was that b24abcff918a ("bpf, kconfig: Add consolidated menu entry for
bpf with core options") moved BPF_JIT from net/Kconfig into kernel/bpf/Kconfig
and previously BPF_JIT was guarded by a 'if NET'. However, there is no actual
dependency on NET, it's just that menuconfig NET selects BPF. And the latter in
turn causes kernel/bpf/core.o to be built which contains above symbols. Randy's
randconfig didn't have NET set, and BPF wasn't either, but BPF_JIT otoh was.
Detangle this by making BPF_JIT depend on BPF instead. arm64 was the only arch
that pulled in its JIT in net/ via obj-$(CONFIG_NET), all others unconditionally
pull this dir in via obj-y. Do the same since CONFIG_NET guard there is really
useless as we compiled the JIT via obj-$(CONFIG_BPF_JIT) += bpf_jit_comp.o anyway.

Fixes: b24abcff918a ("bpf, kconfig: Add consolidated menu entry for bpf with core options")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
2021-05-20 23:48:37 +02:00
Sergey Senozhatsky
940d71c646 wq: handle VM suspension in stall detection
If VCPU is suspended (VM suspend) in wq_watchdog_timer_fn() then
once this VCPU resumes it will see the new jiffies value, while it
may take a while before IRQ detects PVCLOCK_GUEST_STOPPED on this
VCPU and updates all the watchdogs via pvclock_touch_watchdogs().
There is a small chance of misreported WQ stalls in the meantime,
because new jiffies is time_after() old 'ts + thresh'.

wq_watchdog_timer_fn()
{
	for_each_pool(pool, pi) {
		if (time_after(jiffies, ts + thresh)) {
			pr_emerg("BUG: workqueue lockup - pool");
		}
	}
}

Save jiffies at the beginning of this function and use that value
for stall detection. If VM gets suspended then we continue using
"old" jiffies value and old WQ touch timestamps. If IRQ at some
point restarts the stall detection cycle (pvclock_touch_watchdogs())
then old jiffies will always be before new 'ts + thresh'.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-05-20 12:58:30 -04:00
Shakeel Butt
45e1ba4083 cgroup: disable controllers at parse time
This patch effectively reverts the commit a3e72739b7a7 ("cgroup: fix
too early usage of static_branch_disable()"). The commit 6041186a3258
("init: initialize jump labels before command line option parsing") has
moved the jump_label_init() before parse_args() which has made the
commit a3e72739b7a7 unnecessary. On the other hand there are
consequences of disabling the controllers later as there are subsystems
doing the controller checks for different decisions. One such incident
is reported [1] regarding the memory controller and its impact on memory
reclaim code.

[1] https://lore.kernel.org/linux-mm/921e53f3-4b13-aab8-4a9e-e83ff15371e4@nec.com

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: NOMURA JUNICHI(野村 淳一) <junichi.nomura@nec.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Jun'ichi Nomura <junichi.nomura@nec.com>
2021-05-20 12:27:53 -04:00
Pu Lehui
3a2daa7248 bpf: Make some symbols static
The sparse tool complains as follows:

kernel/bpf/syscall.c:4567:29: warning:
 symbol 'bpf_sys_bpf_proto' was not declared. Should it be static?
kernel/bpf/syscall.c:4592:29: warning:
 symbol 'bpf_sys_close_proto' was not declared. Should it be static?

This symbol is not used outside of syscall.c, so marks it static.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210519064116.240536-1-pulehui@huawei.com
2021-05-19 10:47:43 -07:00
Masahiro Yamada
1699949d33 sched: Fix a stale comment in pick_next_task()
fair_sched_class->next no longer exists since commit:

  a87e749e8fa1 ("sched: Remove struct sched_class::next field").

Now the sched_class order is specified by the linker script.

Rewrite the comment in a more generic way.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210519063709.323162-1-masahiroy@kernel.org
2021-05-19 13:03:21 +02:00
Thomas Gleixner
006ae1970a Merge branch 'irq/affinity' into irq/core
Merge the export of irq_set_affinity() which is a standalone commit so it
can be pulled into other trees.
2021-05-19 11:04:47 +02:00
Thomas Gleixner
4d80d6ca5d genirq: Export affinity setter for modules
Perf modules abuse irq_set_affinity_hint() to set the affinity of system
PMU interrupts just because irq_set_affinity() was not exported.

The fact that irq_set_affinity_hint() actually sets the affinity is a
non-documented side effect and the name is clearly saying it's a hint.

To clean this up, export the real affinity setter.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20210518093117.968251441@linutronix.de
2021-05-19 11:01:51 +02:00
Qais Yousef
93b7385870 sched/uclamp: Fix locking around cpu_util_update_eff()
cpu_cgroup_css_online() calls cpu_util_update_eff() without holding the
uclamp_mutex or rcu_read_lock() like other call sites, which is
a mistake.

The uclamp_mutex is required to protect against concurrent reads and
writes that could update the cgroup hierarchy.

The rcu_read_lock() is required to traverse the cgroup data structures
in cpu_util_update_eff().

Surround the caller with the required locks and add some asserts to
better document the dependency in cpu_util_update_eff().

Fixes: 7226017ad37a ("sched/uclamp: Fix a bug in propagating uclamp value in new cgroups")
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210510145032.1934078-3-qais.yousef@arm.com
2021-05-19 10:53:02 +02:00
Qais Yousef
0c18f2ecfc sched/uclamp: Fix wrong implementation of cpu.uclamp.min
cpu.uclamp.min is a protection as described in cgroup-v2 Resource
Distribution Model

	Documentation/admin-guide/cgroup-v2.rst

which means we try our best to preserve the minimum performance point of
tasks in this group. See full description of cpu.uclamp.min in the
cgroup-v2.rst.

But the current implementation makes it a limit, which is not what was
intended.

For example:

	tg->cpu.uclamp.min = 20%

	p0->uclamp[UCLAMP_MIN] = 0
	p1->uclamp[UCLAMP_MIN] = 50%

	Previous Behavior (limit):

		p0->effective_uclamp = 0
		p1->effective_uclamp = 20%

	New Behavior (Protection):

		p0->effective_uclamp = 20%
		p1->effective_uclamp = 50%

Which is inline with how protections should work.

With this change the cgroup and per-task behaviors are the same, as
expected.

Additionally, we remove the confusing relationship between cgroup and
!user_defined flag.

We don't want for example RT tasks that are boosted by default to max to
change their boost value when they attach to a cgroup. If a cgroup wants
to limit the max performance point of tasks attached to it, then
cpu.uclamp.max must be set accordingly.

Or if they want to set different boost value based on cgroup, then
sysctl_sched_util_clamp_min_rt_default must be used to NOT boost to max
and set the right cpu.uclamp.min for each group to let the RT tasks
obtain the desired boost value when attached to that group.

As it stands the dependency on !user_defined flag adds an extra layer of
complexity that is not required now cpu.uclamp.min behaves properly as
a protection.

The propagation model of effective cpu.uclamp.min in child cgroups as
implemented by cpu_util_update_eff() is still correct. The parent
protection sets an upper limit of what the child cgroups will
effectively get.

Fixes: 3eac870a3247 (sched/uclamp: Use TG's clamps to restrict TASK's clamps)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210510145032.1934078-2-qais.yousef@arm.com
2021-05-19 10:53:02 +02:00
Alexei Starovoitov
3abea08924 bpf: Add bpf_sys_close() helper.
Add bpf_sys_close() helper to be used by the syscall/loader program to close
intermediate FDs and other cleanup.
Note this helper must never be allowed inside fdget/fdput bracketing.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-11-alexei.starovoitov@gmail.com
2021-05-19 00:33:40 +02:00
Alexei Starovoitov
3d78417b60 bpf: Add bpf_btf_find_by_name_kind() helper.
Add new helper:
long bpf_btf_find_by_name_kind(char *name, int name_sz, u32 kind, int flags)
Description
	Find BTF type with given name and kind in vmlinux BTF or in module's BTFs.
Return
	Returns btf_id and btf_obj_fd in lower and upper 32 bits.

It will be used by loader program to find btf_id to attach the program to
and to find btf_ids of ksyms.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-10-alexei.starovoitov@gmail.com
2021-05-19 00:33:40 +02:00
Alexei Starovoitov
387544bfa2 bpf: Introduce fd_idx
Typical program loading sequence involves creating bpf maps and applying
map FDs into bpf instructions in various places in the bpf program.
This job is done by libbpf that is using compiler generated ELF relocations
to patch certain instruction after maps are created and BTFs are loaded.
The goal of fd_idx is to allow bpf instructions to stay immutable
after compilation. At load time the libbpf would still create maps as usual,
but it wouldn't need to patch instructions. It would store map_fds into
__u32 fd_array[] and would pass that pointer to sys_bpf(BPF_PROG_LOAD).

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-9-alexei.starovoitov@gmail.com
2021-05-19 00:33:40 +02:00
Alexei Starovoitov
c571bd752e bpf: Make btf_load command to be bpfptr_t compatible.
Similar to prog_load make btf_load command to be availble to
bpf_prog_type_syscall program.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-7-alexei.starovoitov@gmail.com
2021-05-19 00:33:40 +02:00
Alexei Starovoitov
af2ac3e13e bpf: Prepare bpf syscall to be used from kernel and user space.
With the help from bpfptr_t prepare relevant bpf syscall commands
to be used from kernel and user space.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-4-alexei.starovoitov@gmail.com
2021-05-19 00:33:40 +02:00
Alexei Starovoitov
79a7f8bdb1 bpf: Introduce bpf_sys_bpf() helper and program type.
Add placeholders for bpf_sys_bpf() helper and new program type.
Make sure to check that expected_attach_type is zero for future extensibility.
Allow tracing helper functions to be used in this program type, since they will
only execute from user context via bpf_prog_test_run.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-2-alexei.starovoitov@gmail.com
2021-05-19 00:33:39 +02:00
Eric W. Biederman
0683b53197 signal: Deliver all of the siginfo perf data in _perf
Don't abuse si_errno and deliver all of the perf data in _perf member
of siginfo_t.

Note: The data field in the perf data structures in a u64 to allow a
pointer to be encoded without needed to implement a 32bit and 64bit
version of the same structure.  There already exists a 32bit and 64bit
versions siginfo_t, and the 32bit version can not include a 64bit
member as it only has 32bit alignment.  So unsigned long is used in
siginfo_t instead of a u64 as unsigned long can encode a pointer on
all architectures linux supports.

v1: https://lkml.kernel.org/r/m11rarqqx2.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210503203814.25487-10-ebiederm@xmission.com
v3: https://lkml.kernel.org/r/20210505141101.11519-11-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-4-ebiederm@xmission.com
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-05-18 16:20:54 -05:00
Eric W. Biederman
af5eeab7e8 signal: Factor force_sig_perf out of perf_sigtrap
Separate filling in siginfo for TRAP_PERF from deciding that
siginal needs to be sent.

There are enough little details that need to be correct when
properly filling in siginfo_t that it is easy to make mistakes
if filling in the siginfo_t is in the same function with other
logic.  So factor out force_sig_perf to reduce the cognative
load of on reviewers, maintainers and implementors.

v1: https://lkml.kernel.org/r/m17dkjqqxz.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210505141101.11519-10-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-3-ebiederm@xmission.com
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-05-18 16:20:54 -05:00
Eric W. Biederman
9abcabe311 signal: Implement SIL_FAULT_TRAPNO
Now that si_trapno is part of the union in _si_fault and available on
all architectures, add SIL_FAULT_TRAPNO and update siginfo_layout to
return SIL_FAULT_TRAPNO when the code assumes si_trapno is valid.

There is room for future changes to reduce when si_trapno is valid but
this is all that is needed to make si_trapno and the other members of
the the union in _sigfault mutually exclusive.

Update the code that uses siginfo_layout to deal with SIL_FAULT_TRAPNO
and have the same code ignore si_trapno in in all other cases.

v1: https://lkml.kernel.org/r/m1o8dvs7s7.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210505141101.11519-6-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-2-ebiederm@xmission.com
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-05-18 16:20:34 -05:00
Eric W. Biederman
add0b32ef9 siginfo: Move si_trapno inside the union inside _si_fault
It turns out that linux uses si_trapno very sparingly, and as such it
can be considered extra information for a very narrow selection of
signals, rather than information that is present with every fault
reported in siginfo.

As such move si_trapno inside the union inside of _si_fault.  This
results in no change in placement, and makes it eaiser
to extend _si_fault in the future as this reduces the number of
special cases.  In particular with si_trapno included in the union it
is no longer a concern that the union must be pointer aligned on most
architectures because the union follows immediately after si_addr
which is a pointer.

This change results in a difference in siginfo field placement on
sparc and alpha for the fields si_addr_lsb, si_lower, si_upper,
si_pkey, and si_perf.  These architectures do not implement the
signals that would use si_addr_lsb, si_lower, si_upper, si_pkey, and
si_perf.  Further these architecture have not yet implemented the
userspace that would use si_perf.

The point of this change is in fact to correct these placement issues
before sparc or alpha grow userspace that cares.  This change was
discussed[1] and the agreement is that this change is currently safe.

[1]: https://lkml.kernel.org/r/CAK8P3a0+uKYwL1NhY6Hvtieghba2hKYGD6hcKx5n8=4Gtt+pHA@mail.gmail.com
Acked-by: Marco Elver <elver@google.com>
v1: https://lkml.kernel.org/r/m1tunns7yf.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210505141101.11519-5-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20210517195748.8880-1-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-05-18 16:17:03 -05:00
Arnd Bergmann
976aac5f88 kcsan: Fix debugfs initcall return type
clang with CONFIG_LTO_CLANG points out that an initcall function should
return an 'int' due to the changes made to the initcall macros in commit
3578ad11f3fb ("init: lto: fix PREL32 relocations"):

kernel/kcsan/debugfs.c:274:15: error: returning 'void' from a function with incompatible result type 'int'
late_initcall(kcsan_debugfs_init);
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
include/linux/init.h:292:46: note: expanded from macro 'late_initcall'
 #define late_initcall(fn)               __define_initcall(fn, 7)

Fixes: e36299efe7d7 ("kcsan, debugfs: Move debugfs file creation out of early init")
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-18 10:58:02 -07:00
Valentin Schneider
00b89fe019 sched: Make the idle task quack like a per-CPU kthread
For all intents and purposes, the idle task is a per-CPU kthread. It isn't
created via the same route as other pcpu kthreads however, and as a result
it is missing a few bells and whistles: it fails kthread_is_per_cpu() and
it doesn't have PF_NO_SETAFFINITY set.

Fix the former by giving the idle task a kthread struct along with the
KTHREAD_IS_PER_CPU flag. This requires some extra iffery as init_idle()
call be called more than once on the same idle task.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210510151024.2448573-2-valentin.schneider@arm.com
2021-05-18 12:53:53 +02:00
Peter Zijlstra
90a0ff4ec9 sched,stats: Further simplify sched_info
There's no point doing delta==0 updates.

Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-05-18 12:53:53 +02:00
Zqiang
3a010c4932 locking/mutex: clear MUTEX_FLAGS if wait_list is empty due to signal
When a interruptible mutex locker is interrupted by a signal
without acquiring this lock and removed from the wait queue.
if the mutex isn't contended enough to have a waiter
put into the wait queue again, the setting of the WAITER
bit will force mutex locker to go into the slowpath to
acquire the lock every time, so if the wait queue is empty,
the WAITER bit need to be clear.

Fixes: 040a0a371005 ("mutex: Add support for wound/wait style locks")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210517034005.30828-1-qiang.zhang@windriver.com
2021-05-18 12:53:51 +02:00