67134 Commits

Author SHA1 Message Date
Richard Palethorpe
685c3f2fba vsock: Refactor vsock_*_getsockopt to resemble sock_getsockopt
In preparation for sharing the implementation of sock_get_timeout.

Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
Cc: Richard Palethorpe <rpalethorpe@richiejp.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08 16:21:53 +01:00
Antoine Tenart
146e5e7333 net-sysfs: try not to restart the syscall if it will fail eventually
Due to deadlocks in the networking subsystem spotted 12 years ago[1],
a workaround was put in place[2] to avoid taking the rtnl lock when it
was not available and restarting the syscall (back to VFS, letting
userspace spin). The following construction is found a lot in the net
sysfs and sysctl code:

  if (!rtnl_trylock())
          return restart_syscall();

This can be problematic when multiple userspace threads use such
interfaces in a short period, making them to spin a lot. This happens
for example when adding and moving virtual interfaces: userspace
programs listening on events, such as systemd-udevd and NetworkManager,
do trigger actions reading files in sysfs. It gets worse when a lot of
virtual interfaces are created concurrently, say when creating
containers at boot time.

Returning early without hitting the above pattern when the syscall will
fail eventually does make things better. While it is not a fix for the
issue, it does ease things.

[1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/
    https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/
    and https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/
[2] Rightfully, those deadlocks are *hard* to solve.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08 15:24:02 +01:00
Davide Caratti
454d3e1ae0 net/sched: sch_ets: properly init all active DRR list handles
leaf classes of ETS qdiscs are served in strict priority or deficit round
robin (DRR), depending on the value of 'nstrict'. Since this value can be
changed while traffic is running, we need to be sure that the active list
of DRR classes can be updated at any time, so:

1) call INIT_LIST_HEAD(&alist) on all leaf classes in .init(), before the
   first packet hits any of them.
2) ensure that 'alist' is not overwritten with zeros when a leaf class is
   no more strict priority nor DRR (i.e. array elements beyond 'nbands').

Link: https://lore.kernel.org/netdev/YS%2FoZ+f0Nr8eQkzH@dcaratti.users.ipa.redhat.com
Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08 15:20:05 +01:00
Paolo Abeni
612f71d732 mptcp: fix possible stall on recvmsg()
recvmsg() can enter an infinite loop if the caller provides the
MSG_WAITALL, the data present in the receive queue is not sufficient to
fulfill the request, and no more data is received by the peer.

When the above happens, mptcp_wait_data() will always return with
no wait, as the MPTCP_DATA_READY flag checked by such function is
set and never cleared in such code path.

Leveraging the above syzbot was able to trigger an RCU stall:

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu:    0-...!: (10499 ticks this GP) idle=0af/1/0x4000000000000000 softirq=10678/10678 fqs=1
        (t=10500 jiffies g=13089 q=109)
rcu: rcu_preempt kthread starved for 10497 jiffies! g13089 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
rcu:    Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt     state:R  running task     stack:28696 pid:   14 ppid:     2 flags:0x00004000
Call Trace:
 context_switch kernel/sched/core.c:4955 [inline]
 __schedule+0x940/0x26f0 kernel/sched/core.c:6236
 schedule+0xd3/0x270 kernel/sched/core.c:6315
 schedule_timeout+0x14a/0x2a0 kernel/time/timer.c:1881
 rcu_gp_fqs_loop+0x186/0x810 kernel/rcu/tree.c:1955
 rcu_gp_kthread+0x1de/0x320 kernel/rcu/tree.c:2128
 kthread+0x405/0x4f0 kernel/kthread.c:327
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
rcu: Stack dump where RCU GP kthread last ran:
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 8510 Comm: syz-executor827 Not tainted 5.15.0-rc2-next-20210920-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:bytes_is_nonzero mm/kasan/generic.c:84 [inline]
RIP: 0010:memory_is_nonzero mm/kasan/generic.c:102 [inline]
RIP: 0010:memory_is_poisoned_n mm/kasan/generic.c:128 [inline]
RIP: 0010:memory_is_poisoned mm/kasan/generic.c:159 [inline]
RIP: 0010:check_region_inline mm/kasan/generic.c:180 [inline]
RIP: 0010:kasan_check_range+0xc8/0x180 mm/kasan/generic.c:189
Code: 38 00 74 ed 48 8d 50 08 eb 09 48 83 c0 01 48 39 d0 74 7a 80 38 00 74 f2 48 89 c2 b8 01 00 00 00 48 85 d2 75 56 5b 5d 41 5c c3 <48> 85 d2 74 5e 48 01 ea eb 09 48 83 c0 01 48 39 d0 74 50 80 38 00
RSP: 0018:ffffc9000cd676c8 EFLAGS: 00000283
RAX: ffffed100e9a110e RBX: ffffed100e9a110f RCX: ffffffff88ea062a
RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff888074d08870
RBP: ffffed100e9a110e R08: 0000000000000001 R09: ffff888074d08877
R10: ffffed100e9a110e R11: 0000000000000000 R12: ffff888074d08000
R13: ffff888074d08000 R14: ffff888074d08088 R15: ffff888074d08000
FS:  0000555556d8e300(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
S:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000180 CR3: 0000000068909000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
 test_and_clear_bit include/asm-generic/bitops/instrumented-atomic.h:83 [inline]
 mptcp_release_cb+0x14a/0x210 net/mptcp/protocol.c:3016
 release_sock+0xb4/0x1b0 net/core/sock.c:3204
 mptcp_wait_data net/mptcp/protocol.c:1770 [inline]
 mptcp_recvmsg+0xfd1/0x27b0 net/mptcp/protocol.c:2080
 inet6_recvmsg+0x11b/0x5e0 net/ipv6/af_inet6.c:659
 sock_recvmsg_nosec net/socket.c:944 [inline]
 ____sys_recvmsg+0x527/0x600 net/socket.c:2626
 ___sys_recvmsg+0x127/0x200 net/socket.c:2670
 do_recvmmsg+0x24d/0x6d0 net/socket.c:2764
 __sys_recvmmsg net/socket.c:2843 [inline]
 __do_sys_recvmmsg net/socket.c:2866 [inline]
 __se_sys_recvmmsg net/socket.c:2859 [inline]
 __x64_sys_recvmmsg+0x20b/0x260 net/socket.c:2859
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc200d2dc39
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 41 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc5758e5a8 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc200d2dc39
RDX: 0000000000000002 RSI: 00000000200017c0 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000f0b5ff
R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
R13: 00007ffc5758e5d0 R14: 00007ffc5758e5c0 R15: 0000000000000003

Fix the issue by replacing the MPTCP_DATA_READY bit with direct
inspection of the msk receive queue.

Reported-and-tested-by: syzbot+3360da629681aa0d22fe@syzkaller.appspotmail.com
Fixes: 7a6a6cbc3e59 ("mptcp: recvmsg() can drain data from multiple subflow")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08 14:55:54 +01:00
Jakub Kicinski
ba882580f2 eth: platform: add a helper for loading netdev->dev_addr
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it got through appropriate helpers.

There is a handful of drivers which pass netdev->dev_addr as
the destination buffer to eth_platform_get_mac_address().
Add a helper which takes a dev pointer instead, so it can call
an appropriate helper.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08 14:54:33 +01:00
Jakub Kicinski
da8f606e15 ethernet: un-export nvmem_get_mac_address()
nvmem_get_mac_address() is only called from of_net.c
we don't need the export.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-08 14:54:33 +01:00
Jakub Kicinski
9fe1155233 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07 15:24:06 -07:00
Linus Torvalds
1da38549dd Bug fixes for NFSD error handling paths
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmFfUVQACgkQM2qzM29m
 f5fW9w/+MthSLnUW4edoq45d9pH7jYJrtSo54NavIknXXWYaSnDydFsV/msAsJH8
 kNmwk0JAmhQ6GIkRLm4gZ2cHT+cCtlU/1gJWamvstUGM6XUpmwODdD8nacmXUh4q
 fgh9yJooe2GERIhv2/04XA8dP7UcqyZeWAGOpUZNlYEBF/Pcp1i8fJHkbJ2zEueH
 AtTwQY5atuJVQYeno7hSd38p7whWMPF37pbL8u72fbJkOefAy0/UW3AdUiMkKTOT
 TT/1bgNhOAEo20F9vspVaYAOhC8rAGaWr4j82N1QvgBtJhGt9bayQEIZQ5e+HdCg
 It4d5qtzE0zZQ/ARYsQxfF7AgNitGYEfjVu6F3hxeHFKJQCSQoxuPbBl2FiVUl7I
 JeVgPRRfYLjOjEG2E3NCWQXuzy0MzPFKqnNrvtfTE41vz1Bzrnx9Feu9GEffAn4l
 K59pIWYcVgSaC1nu8ba/sfZTVjpKShsxcTB/GJl9cgCkenZG1bqbqNCwnzcH1s3u
 zXyJZ8CjncLWHkcm2bi/xZ3jdRAyOwVCth37wI5KTBXvEiPG3yKloQifi9yKU0Zi
 a93l7hs1swcj2GfutWVjVwVsi2d1YSRRGpVgmK5pbOAhSFBU+TXOUfGo5VG5JsUW
 LA3enCmuXrcnrsFABf43mwikLw2w8/rwgXANS6LE8vaZ7A/c07Q=
 =CTLP
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "Bug fixes for NFSD error handling paths"

* tag 'nfsd-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Keep existing listeners on portlist error
  SUNRPC: fix sign error causing rpcsec_gss drops
  nfsd: Fix a warning for nfsd_file_close_inode
  nfsd4: Handle the NFSv4 READDIR 'dircount' hint being zero
  nfsd: fix error handling of register_pernet_subsys() in init_nfsd()
2021-10-07 14:11:40 -07:00
Pablo Neira Ayuso
7b1394892d netfilter: nft_dynset: relax superfluous check on set updates
Relax this condition to make add and update commands idempotent for sets
with no timeout. The eval function already checks if the set element
timeout is available and updates it if the update command is used.

Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07 19:53:15 +02:00
Dust Li
2232642ec3 ipvs: add sysctl_run_estimation to support disable estimation
estimation_timer will iterate the est_list to do estimation
for each ipvs stats. When there are lots of services, the
list can be very large.
We found that estimation_timer() run for more then 200ms on a
machine with 104 CPU and 50K services.

yunhong-cgl jiang report the same phenomenon before:
https://www.spinics.net/lists/lvs-devel/msg05426.html

In some cases(for example a large K8S cluster with many ipvs services),
ipvs estimation may not be needed. So adding a sysctl blob to allow
users to disable this completely.

Default is: 1 (enable)

Cc: yunhong-cgl jiang <xintian1976@gmail.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07 19:52:58 +02:00
Florian Westphal
68a3765c65 netfilter: nf_tables: skip netdev events generated on netns removal
syzbot reported following (harmless) WARN:

 WARNING: CPU: 1 PID: 2648 at net/netfilter/core.c:468
  nft_netdev_unregister_hooks net/netfilter/nf_tables_api.c:230 [inline]
  nf_tables_unregister_hook include/net/netfilter/nf_tables.h:1090 [inline]
  __nft_release_basechain+0x138/0x640 net/netfilter/nf_tables_api.c:9524
  nft_netdev_event net/netfilter/nft_chain_filter.c:351 [inline]
  nf_tables_netdev_event+0x521/0x8a0 net/netfilter/nft_chain_filter.c:382

reproducer:
unshare -n bash -c 'ip link add br0 type bridge; nft add table netdev t ; \
 nft add chain netdev t ingress \{ type filter hook ingress device "br0" \
 priority 0\; policy drop\; \}'

Problem is that when netns device exit hooks create the UNREGISTER
event, the .pre_exit hook for nf_tables core has already removed the
base hook.  Notifier attempts to do this again.

The need to do base hook unregister unconditionally was needed in the past,
because notifier was last stage where reg->dev dereference was safe.

Now that nf_tables does the hook removal in .pre_exit, this isn't
needed anymore.

Reported-and-tested-by: syzbot+154bd5be532a63aa778b@syzkaller.appspotmail.com
Fixes: 767d1216bff825 ("netfilter: nftables: fix possible UAF over chains from packet path in netns")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07 19:37:38 +02:00
Vegard Nossum
77076934af netfilter: Kconfig: use 'default y' instead of 'm' for bool config option
This option, NF_CONNTRACK_SECMARK, is a bool, so it can never be 'm'.

Fixes: 33b8e77605620 ("[NETFILTER]: Add CONFIG_NETFILTER_ADVANCED option")
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07 19:37:25 +02:00
Juhee Kang
902c0b1887 netfilter: xt_IDLETIMER: fix panic that occurs when timer_type has garbage value
Currently, when the rule related to IDLETIMER is added, idletimer_tg timer
structure is initialized by kmalloc on executing idletimer_tg_create
function. However, in this process timer->timer_type is not defined to
a specific value. Thus, timer->timer_type has garbage value and it occurs
kernel panic. So, this commit fixes the panic by initializing
timer->timer_type using kzalloc instead of kmalloc.

Test commands:
    # iptables -A OUTPUT -j IDLETIMER --timeout 1 --label test
    $ cat /sys/class/xt_idletimer/timers/test
      Killed

Splat looks like:
    BUG: KASAN: user-memory-access in alarm_expires_remaining+0x49/0x70
    Read of size 8 at addr 0000002e8c7bc4c8 by task cat/917
    CPU: 12 PID: 917 Comm: cat Not tainted 5.14.0+ #3 79940a339f71eb14fc81aee1757a20d5bf13eb0e
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
    Call Trace:
     dump_stack_lvl+0x6e/0x9c
     kasan_report.cold+0x112/0x117
     ? alarm_expires_remaining+0x49/0x70
     __asan_load8+0x86/0xb0
     alarm_expires_remaining+0x49/0x70
     idletimer_tg_show+0xe5/0x19b [xt_IDLETIMER 11219304af9316a21bee5ba9d58f76a6b9bccc6d]
     dev_attr_show+0x3c/0x60
     sysfs_kf_seq_show+0x11d/0x1f0
     ? device_remove_bin_file+0x20/0x20
     kernfs_seq_show+0xa4/0xb0
     seq_read_iter+0x29c/0x750
     kernfs_fop_read_iter+0x25a/0x2c0
     ? __fsnotify_parent+0x3d1/0x570
     ? iov_iter_init+0x70/0x90
     new_sync_read+0x2a7/0x3d0
     ? __x64_sys_llseek+0x230/0x230
     ? rw_verify_area+0x81/0x150
     vfs_read+0x17b/0x240
     ksys_read+0xd9/0x180
     ? vfs_write+0x460/0x460
     ? do_syscall_64+0x16/0xc0
     ? lockdep_hardirqs_on+0x79/0x120
     __x64_sys_read+0x43/0x50
     do_syscall_64+0x3b/0xc0
     entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x7f0cdc819142
    Code: c0 e9 c2 fe ff ff 50 48 8d 3d 3a ca 0a 00 e8 f5 19 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
    RSP: 002b:00007fff28eee5b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f0cdc819142
    RDX: 0000000000020000 RSI: 00007f0cdc032000 RDI: 0000000000000003
    RBP: 00007f0cdc032000 R08: 00007f0cdc031010 R09: 0000000000000000
    R10: 0000000000000022 R11: 0000000000000246 R12: 00005607e9ee31f0
    R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000

Fixes: 68983a354a65 ("netfilter: xtables: Add snapshot of hardidletimer target")
Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07 19:35:57 +02:00
Mike Manning
8d6c414cd2 net: prefer socket bound to interface when not in VRF
The commit 6da5b0f027a8 ("net: ensure unbound datagram socket to be
chosen when not in a VRF") modified compute_score() so that a device
match is always made, not just in the case of an l3mdev skb, then
increments the score also for unbound sockets. This ensures that
sockets bound to an l3mdev are never selected when not in a VRF.
But as unbound and bound sockets are now scored equally, this results
in the last opened socket being selected if there are matches in the
default VRF for an unbound socket and a socket bound to a dev that is
not an l3mdev. However, handling prior to this commit was to always
select the bound socket in this case. Reinstate this handling by
incrementing the score only for bound sockets. The required isolation
due to choosing between an unbound socket and a socket bound to an
l3mdev remains in place due to the device match always being made.
The same approach is taken for compute_score() for stream sockets.

Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF")
Fixes: e78190581aff ("net: ensure unbound stream socket to be chosen when not in a VRF")
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/cf0a8523-b362-1edf-ee78-eef63cbbb428@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07 07:27:55 -07:00
Jakub Kicinski
7671b026bb Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-10-07

We've added 7 non-merge commits during the last 8 day(s) which contain
a total of 8 files changed, 38 insertions(+), 21 deletions(-).

The main changes are:

1) Fix ARM BPF JIT to preserve caller-saved regs for DIV/MOD JIT-internal
   helper call, from Johan Almbladh.

2) Fix integer overflow in BPF stack map element size calculation when
   used with preallocation, from Tatsuhiko Yasumatsu.

3) Fix an AF_UNIX regression due to added BPF sockmap support related
   to shutdown handling, from Jiang Wang.

4) Fix a segfault in libbpf when generating light skeletons from objects
   without BTF, from Kumar Kartikeya Dwivedi.

5) Fix a libbpf memory leak in strset to free the actual struct strset
   itself, from Andrii Nakryiko.

6) Dual-license bpf_insn.h similarly as we did for libbpf and bpftool,
   with ACKs from all contributors, from Luca Boccassi.
====================

Link: https://lore.kernel.org/r/20211007135010.21143-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07 07:11:33 -07:00
Jakub Kicinski
d9eb44904e eth: fwnode: add a helper for loading netdev->dev_addr
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it got through appropriate helpers.

There is a handful of drivers which pass netdev->dev_addr as
the destination buffer to device_get_mac_address(). Add a helper
which takes a dev pointer instead, so it can call an appropriate
helper.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07 13:39:51 +01:00
Jakub Kicinski
0a14501ed8 eth: fwnode: remove the addr len from mac helpers
All callers pass in ETH_ALEN and the function itself
will return -EINVAL for any other address length.
Just assume it's ETH_ALEN like all other mac address
helpers (nvm, of, platform).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07 13:39:51 +01:00
Jakub Kicinski
8017c4d817 eth: fwnode: change the return type of mac address helpers
fwnode_get_mac_address() and device_get_mac_address()
return a pointer to the buffer that was passed to them
on success or NULL on failure. None of the callers
care about the actual value, only if it's NULL or not.

These semantics differ from of_get_mac_address() which
returns an int so to avoid confusion make the device
helpers return an errno.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07 13:39:51 +01:00
Jakub Kicinski
433baf0719 device property: move mac addr helpers to eth.c
Move the mac address helpers out, eth.c already contains
a bunch of similar helpers.

Suggested-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07 13:39:51 +01:00
Jakub Kicinski
d466effe28 of: net: add a helper for loading netdev->dev_addr
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it got through appropriate helpers.

There are roughly 40 places where netdev->dev_addr is passed
as the destination to a of_get_mac_address() call. Add a helper
which takes a dev pointer instead, so it can call an appropriate
helper.

Note that of_get_mac_address() already assumes the address is
6 bytes long (ETH_ALEN) so use eth_hw_addr_set().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07 13:39:51 +01:00
Jakub Kicinski
e330fb1459 of: net: move of_net under net/
Rob suggests to move of_net.c from under drivers/of/ somewhere
to the networking code.

Suggested-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07 13:39:51 +01:00
David S. Miller
578f393227 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/
ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2021-10-07

1) Fix a sysbot reported shift-out-of-bounds in xfrm_get_default.
   From Pavel Skripkin.

2) Fix XFRM_MSG_MAPPING ABI breakage. The new XFRM_MSG_MAPPING
   messages were accidentally not paced at the end.
   Fix by Eugene Syromiatnikov.

3) Fix the uapi for the default policy, use explicit field and macros
   and make it accessible to userland.
   From Nicolas Dichtel.

4) Fix a missing rcu lock in xfrm_notify_userpolicy().
   From Nicolas Dichtel.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-07 12:44:41 +01:00
Ido Schimmel
353407d917 ethtool: Add ability to control transceiver modules' power mode
Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and
'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver
modules parameters and retrieve their status.

The first parameter to control is the power mode of the module. It is
only relevant for paged memory modules, as flat memory modules always
operate in low power mode.

When a paged memory module is in low power mode, its power consumption
is reduced to the minimum, the management interface towards the host is
available and the data path is deactivated.

User space can choose to put modules that are not currently in use in
low power mode and transition them to high power mode before putting the
associated ports administratively up. This is useful for user space that
favors reduced power consumption and lower temperatures over reduced
link up times. In QSFP-DD modules the transition from low power mode to
high power mode can take a few seconds and this transition is only
expected to get longer with future / more complex modules.

User space can control the power mode of the module via the power mode
policy attribute ('ETHTOOL_A_MODULE_POWER_MODE_POLICY'). Possible
values:

* high: Module is always in high power mode.

* auto: Module is transitioned by the host to high power mode when the
  first port using it is put administratively up and to low power mode
  when the last port using it is put administratively down.

The operational power mode of the module is available to user space via
the 'ETHTOOL_A_MODULE_POWER_MODE' attribute. The attribute is not
reported to user space when a module is not plugged-in.

The user API is designed to be generic enough so that it could be used
for modules with different memory maps (e.g., SFF-8636, CMIS).

The only implementation of the device driver API in this series is for a
MAC driver (mlxsw) where the module is controlled by the device's
firmware, but it is designed to be generic enough so that it could also
be used by implementations where the module is controlled by the CPU.

CMIS testing
============

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x03 (ModuleReady)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : Off

The module is not in low power mode, as it is not forced by hardware
(LowPwrAllowRequestHW is off) or by software (LowPwrRequestSW is off).

The power mode can be queried from the kernel. In case
LowPwrAllowRequestHW was on, the kernel would need to take into account
the state of the LowPwrRequestHW signal, which is not visible to user
space.

 $ ethtool --show-module swp11
 Module parameters for swp11:
 power-mode-policy high
 power-mode high

Change the power mode policy to 'auto':

 # ethtool --set-module swp11 power-mode-policy auto

Query the power mode again:

 $ ethtool --show-module swp11
 Module parameters for swp11:
 power-mode-policy auto
 power-mode low

Verify with the data read from the EEPROM:

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x01 (ModuleLowPwr)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : On

Put the associated port administratively up which will instruct the host
to transition the module to high power mode:

 # ip link set dev swp11 up

Query the power mode again:

 $ ethtool --show-module swp11
 Module parameters for swp11:
 power-mode-policy auto
 power-mode high

Verify with the data read from the EEPROM:

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x03 (ModuleReady)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : Off

Put the associated port administratively down which will instruct the
host to transition the module to low power mode:

 # ip link set dev swp11 down

Query the power mode again:

 $ ethtool --show-module swp11
 Module parameters for swp11:
 power-mode-policy auto
 power-mode low

Verify with the data read from the EEPROM:

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x01 (ModuleLowPwr)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : On

SFF-8636 testing
================

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) enabled
 Power set                                 : Off
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.7733 mW / -1.12 dBm
 Transmit avg optical power (Channel 2)    : 0.7649 mW / -1.16 dBm
 Transmit avg optical power (Channel 3)    : 0.7790 mW / -1.08 dBm
 Transmit avg optical power (Channel 4)    : 0.7837 mW / -1.06 dBm
 Rcvr signal avg optical power(Channel 1)  : 0.9302 mW / -0.31 dBm
 Rcvr signal avg optical power(Channel 2)  : 0.9079 mW / -0.42 dBm
 Rcvr signal avg optical power(Channel 3)  : 0.8993 mW / -0.46 dBm
 Rcvr signal avg optical power(Channel 4)  : 0.8778 mW / -0.57 dBm

The module is not in low power mode, as it is not forced by hardware
(Power override is on) or by software (Power set is off).

The power mode can be queried from the kernel. In case Power override
was off, the kernel would need to take into account the state of the
LPMode signal, which is not visible to user space.

 $ ethtool --show-module swp13
 Module parameters for swp13:
 power-mode-policy high
 power-mode high

Change the power mode policy to 'auto':

 # ethtool --set-module swp13 power-mode-policy auto

Query the power mode again:

 $ ethtool --show-module swp13
 Module parameters for swp13:
 power-mode-policy auto
 power-mode low

Verify with the data read from the EEPROM:

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) not enabled
 Power set                                 : On
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 2)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 3)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 4)    : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 1)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 2)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 3)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 4)  : 0.0000 mW / -inf dBm

Put the associated port administratively up which will instruct the host
to transition the module to high power mode:

 # ip link set dev swp13 up

Query the power mode again:

 $ ethtool --show-module swp13
 Module parameters for swp13:
 power-mode-policy auto
 power-mode high

Verify with the data read from the EEPROM:

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) enabled
 Power set                                 : Off
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.7934 mW / -1.01 dBm
 Transmit avg optical power (Channel 2)    : 0.7859 mW / -1.05 dBm
 Transmit avg optical power (Channel 3)    : 0.7885 mW / -1.03 dBm
 Transmit avg optical power (Channel 4)    : 0.7985 mW / -0.98 dBm
 Rcvr signal avg optical power(Channel 1)  : 0.9325 mW / -0.30 dBm
 Rcvr signal avg optical power(Channel 2)  : 0.9034 mW / -0.44 dBm
 Rcvr signal avg optical power(Channel 3)  : 0.9086 mW / -0.42 dBm
 Rcvr signal avg optical power(Channel 4)  : 0.8885 mW / -0.51 dBm

Put the associated port administratively down which will instruct the
host to transition the module to low power mode:

 # ip link set dev swp13 down

Query the power mode again:

 $ ethtool --show-module swp13
 Module parameters for swp13:
 power-mode-policy auto
 power-mode low

Verify with the data read from the EEPROM:

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) not enabled
 Power set                                 : On
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 2)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 3)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 4)    : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 1)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 2)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 3)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 4)  : 0.0000 mW / -inf dBm

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-06 17:47:49 -07:00
Eric Dumazet
d343679919 rtnetlink: fix if_nlmsg_stats_size() under estimation
rtnl_fill_statsinfo() is filling skb with one mandatory if_stats_msg structure.

nlmsg_put(skb, pid, seq, type, sizeof(struct if_stats_msg), flags);

But if_nlmsg_stats_size() never considered the needed storage.

This bug did not show up because alloc_skb(X) allocates skb with
extra tailroom, because of added alignments. This could very well
be changed in the future to have deterministic behavior.

Fixes: 10c9ead9f3c6 ("rtnetlink: add new RTM_GETSTATS message to dump link stats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Acked-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-06 15:09:46 +01:00
Jiang Wang
d0c6416bd7 unix: Fix an issue in unix_shutdown causing the other end read/write failures
Commit 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap") sets
unix domain socket peer state to TCP_CLOSE in unix_shutdown. This could
happen when the local end is shutdown but the other end is not. Then,
the other end will get read or write failures which is not expected.
Fix the issue by setting the local state to shutdown.

Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211004232530.2377085-1-jiang.wang@bytedance.com
2021-10-06 14:40:21 +02:00
Kumar Kartikeya Dwivedi
c48e51c8b0 bpf: selftests: Add selftests for module kfunc support
This adds selftests that tests the success and failure path for modules
kfuncs (in presence of invalid kfunc calls) for both libbpf and
gen_loader. It also adds a prog_test kfunc_btf_id_list so that we can
add module BTF ID set from bpf_testmod.

This also introduces  a couple of test cases to verifier selftests for
validating whether we get an error or not depending on if invalid kfunc
call remains after elimination of unreachable instructions.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-10-memxor@gmail.com
2021-10-05 17:07:42 -07:00
Kumar Kartikeya Dwivedi
0e32dfc80b bpf: Enable TCP congestion control kfunc from modules
This commit moves BTF ID lookup into the newly added registration
helper, in a way that the bbr, cubic, and dctcp implementation set up
their sets in the bpf_tcp_ca kfunc_btf_set list, while the ones not
dependent on modules are looked up from the wrapper function.

This lifts the restriction for them to be compiled as built in objects,
and can be loaded as modules if required. Also modify Makefile.modfinal
to call resolve_btfids for each module.

Note that since kernel kfunc_ids never overlap with module kfunc_ids, we
only match the owner for module btf id sets.

See following commits for background on use of:

 CONFIG_X86 ifdef:
 569c484f9995 (bpf: Limit static tcp-cc functions in the .BTF_ids list to x86)

 CONFIG_DYNAMIC_FTRACE ifdef:
 7aae231ac93b (bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE)

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-6-memxor@gmail.com
2021-10-05 17:07:41 -07:00
Kumar Kartikeya Dwivedi
2357672c54 bpf: Introduce BPF support for kernel module function calls
This change adds support on the kernel side to allow for BPF programs to
call kernel module functions. Userspace will prepare an array of module
BTF fds that is passed in during BPF_PROG_LOAD using fd_array parameter.
In the kernel, the module BTFs are placed in the auxilliary struct for
bpf_prog, and loaded as needed.

The verifier then uses insn->off to index into the fd_array. insn->off
0 is reserved for vmlinux BTF (for backwards compat), so userspace must
use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is
sorted based on offset in an array, and each offset corresponds to one
descriptor, with a max limit up to 256 such module BTFs.

We also change existing kfunc_tab to distinguish each element based on
imm, off pair as each such call will now be distinct.

Another change is to check_kfunc_call callback, which now include a
struct module * pointer, this is to be used in later patch such that the
kfunc_id and module pointer are matched for dynamically registered BTF
sets from loadable modules, so that same kfunc_id in two modules doesn't
lead to check_kfunc_call succeeding. For the duration of the
check_kfunc_call, the reference to struct module exists, as it returns
the pointer stored in kfunc_btf_tab.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com
2021-10-05 17:07:41 -07:00
Jakub Kicinski
d0f1c248b4 bluetooth-next pull request for net-next:
- Add support for MediaTek MT7922 and MT7921
  - Enable support for AOSP extention in Qualcomm WCN399x and Realtek
    8822C/8852A.
  - Add initial support for link quality and audio/codec offload.
  - Rework of sockets sendmsg to avoid locking issues.
  - Add vhci suspend/resume emulation.
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmFXkGYZHGx1aXoudm9u
 LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKVQdD/9dtVeMRjzAQbvPI5InAi4N
 AjEy7IUAV27AE0QVZW/Q201BNnFixAivDWiQXHFiTV1ocrmX/qiW15AsKOTlpVKx
 BAzy97KVLPHNoNBN3XV9PwP8OovX7zkWTL/XPHzg1lIAfniWiInuQDrlU/F3TKOO
 2yJOFTy3x9RrwZcXfZUHltBo6766SC40zW4H+3WA42jljOPKXR1jH6lSIzSezBFt
 qsaw/CS/aW1Z8JAA8fhZurCmoljHMgRNOsnh8AfHPCYsUZSsw9ZE6wMrDUvjXBtm
 Zp9pI+h3mwc9tW/BGSZSpcktUDdXlxo9cXSCrXtxHjmrUxAsNGtHmXE7adogWSHs
 PaXfst8qFdsqp+kjrx+ZbMksfhhq2/ysYNIFWvSGim3VBBw5x74tu/VebSw19PjC
 1ZWzOt/5J5WCBD6BTGApCQg+YTg1u9koPRG441ZwIJ82eYgIQx80Y78uj/tG7mUH
 HC3GMxEwgQzYnQsDfDI936umNAgcdGw0DL7Tu71z4zM+Tn0WcvcQgHNk42zN5gZI
 XMeulxdj++pUoBYweDhlgJ88lr+gTeVwvfrYGWHpNvKsfmEcQJoqBvxTS1OJvMM5
 WfTwJgJCG+o4Jfj4oH2haUeMEUFQXikV0C7Dlw6MmhmcLmzhpbaMRdLQvcDocbTA
 hxhxSYhK6LQTRWv2h71MMw==
 =QRkQ
 -----END PGP SIGNATURE-----

Merge tag 'for-net-next-2021-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

 - Add support for MediaTek MT7922 and MT7921
 - Enable support for AOSP extention in Qualcomm WCN399x and Realtek
   8822C/8852A.
 - Add initial support for link quality and audio/codec offload.
 - Rework of sockets sendmsg to avoid locking issues.
 - Add vhci suspend/resume emulation.

====================

Link: https://lore.kernel.org/r/20211001230850.3635543-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-05 07:41:16 -07:00
Eric Dumazet
7707a4d01a netlink: annotate data races around nlk->bound
While existing code is correct, KCSAN is reporting
a data-race in netlink_insert / netlink_sendmsg [1]

It is correct to read nlk->bound without a lock, as netlink_autobind()
will acquire all needed locks.

[1]
BUG: KCSAN: data-race in netlink_insert / netlink_sendmsg

write to 0xffff8881031c8b30 of 1 bytes by task 18752 on cpu 0:
 netlink_insert+0x5cc/0x7f0 net/netlink/af_netlink.c:597
 netlink_autobind+0xa9/0x150 net/netlink/af_netlink.c:842
 netlink_sendmsg+0x479/0x7c0 net/netlink/af_netlink.c:1892
 sock_sendmsg_nosec net/socket.c:703 [inline]
 sock_sendmsg net/socket.c:723 [inline]
 ____sys_sendmsg+0x360/0x4d0 net/socket.c:2392
 ___sys_sendmsg net/socket.c:2446 [inline]
 __sys_sendmsg+0x1ed/0x270 net/socket.c:2475
 __do_sys_sendmsg net/socket.c:2484 [inline]
 __se_sys_sendmsg net/socket.c:2482 [inline]
 __x64_sys_sendmsg+0x42/0x50 net/socket.c:2482
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff8881031c8b30 of 1 bytes by task 18751 on cpu 1:
 netlink_sendmsg+0x270/0x7c0 net/netlink/af_netlink.c:1891
 sock_sendmsg_nosec net/socket.c:703 [inline]
 sock_sendmsg net/socket.c:723 [inline]
 __sys_sendto+0x2a8/0x370 net/socket.c:2019
 __do_sys_sendto net/socket.c:2031 [inline]
 __se_sys_sendto net/socket.c:2027 [inline]
 __x64_sys_sendto+0x74/0x90 net/socket.c:2027
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x00 -> 0x01

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 18751 Comm: syz-executor.0 Not tainted 5.14.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: da314c9923fe ("netlink: Replace rhash_portid with bound")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-05 13:11:09 +01:00
Florian Westphal
549017aa1b netlink: remove netlink_broadcast_filtered
No users in tree since commit a3498436b3a0 ("netns: restrict uevents"),
so remove this functionality.

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-05 13:07:03 +01:00
Eric Dumazet
a56d447f19 net/sched: sch_taprio: properly cancel timer from taprio_destroy()
There is a comment in qdisc_create() about us not calling ops->reset()
in some cases.

err_out4:
	/*
	 * Any broken qdiscs that would require a ops->reset() here?
	 * The qdisc was never in action so it shouldn't be necessary.
	 */

As taprio sets a timer before actually receiving a packet, we need
to cancel it from ops->destroy, just in case ops->reset has not
been called.

syzbot reported:

ODEBUG: free active (active state 0) object type: hrtimer hint: advance_sched+0x0/0x9a0 arch/x86/include/asm/atomic64_64.h:22
WARNING: CPU: 0 PID: 8441 at lib/debugobjects.c:505 debug_print_object+0x16e/0x250 lib/debugobjects.c:505
Modules linked in:
CPU: 0 PID: 8441 Comm: syz-executor813 Not tainted 5.14.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:505
Code: ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 af 00 00 00 48 8b 14 dd e0 d3 e3 89 4c 89 ee 48 c7 c7 e0 c7 e3 89 e8 5b 86 11 05 <0f> 0b 83 05 85 03 92 09 01 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e c3
RSP: 0018:ffffc9000130f330 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: ffff88802baeb880 RSI: ffffffff815d87b5 RDI: fffff52000261e58
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff815d25ee R11: 0000000000000000 R12: ffffffff898dd020
R13: ffffffff89e3ce20 R14: ffffffff81653630 R15: dffffc0000000000
FS:  0000000000f0d300(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffb64b3e000 CR3: 0000000036557000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __debug_check_no_obj_freed lib/debugobjects.c:987 [inline]
 debug_check_no_obj_freed+0x301/0x420 lib/debugobjects.c:1018
 slab_free_hook mm/slub.c:1603 [inline]
 slab_free_freelist_hook+0x171/0x240 mm/slub.c:1653
 slab_free mm/slub.c:3213 [inline]
 kfree+0xe4/0x540 mm/slub.c:4267
 qdisc_create+0xbcf/0x1320 net/sched/sch_api.c:1299
 tc_modify_qdisc+0x4c8/0x1a60 net/sched/sch_api.c:1663
 rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5571
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
 netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
 netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
 sock_sendmsg_nosec net/socket.c:704 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:724
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2403
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2457
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2486
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80

Fixes: 44d4775ca518 ("net/sched: sch_taprio: reset child qdiscs before freeing them")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Davide Caratti <dcaratti@redhat.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-05 12:45:27 +01:00
Eric Dumazet
0854a05133 net: bridge: fix under estimation in br_get_linkxstats_size()
Commit de1799667b00 ("net: bridge: add STP xstats")
added an additional nla_reserve_64bit() in br_fill_linkxstats(),
but forgot to update br_get_linkxstats_size() accordingly.

This can trigger the following in rtnl_stats_get()

	WARN_ON(err == -EMSGSIZE);

Fixes: de1799667b00 ("net: bridge: add STP xstats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-05 12:39:07 +01:00
Eric Dumazet
dbe0b88064 net: bridge: use nla_total_size_64bit() in br_get_linkxstats_size()
bridge_fill_linkxstats() is using nla_reserve_64bit().

We must use nla_total_size_64bit() instead of nla_total_size()
for corresponding data structure.

Fixes: 1080ab95e3c7 ("net: bridge: add support for IGMP/MLD stats and export them via netlink")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-05 12:39:07 +01:00
Chuck Lever
22a027e8c0 SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
This is an operational low memory situation that needs to be
flagged. The new tracepoint records a timestamp and the nfsd thread
that failed to allocate pages.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04 15:40:15 -04:00
Chuck Lever
45f1358468 svcrdma: Split svcrmda_wc_{read,write} tracepoints
There are currently three separate purposes being served by single
tracepoints. Split them up, as was done with wc_send.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04 15:40:15 -04:00
Chuck Lever
eef2d8d47c svcrdma: Split the svcrdma_wc_send() tracepoint
There are currently three separate purposes being served by a single
tracepoint here. They need to be split up.

svcrdma_wc_send:
 - status is always zero, so there's no value in recording it.
 - vendor_err is meaningless unless status is not zero, so
   there's no value in recording it.
 - This tracepoint is needed only when developing modifications,
   so it should be left disabled most of the time.

svcrdma_wc_send_flush:
 - As above, needed only rarely, and not an error.

svcrdma_wc_send_err:
 - This tracepoint can be left persistently enabled because
   completion errors are run-time problems (except for FLUSHED_ERR).
 - Tracepoint name now ends in _err to reflect its purpose.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04 15:40:15 -04:00
Chuck Lever
8dcc5721da svcrdma: Split the svcrdma_wc_receive() tracepoint
There are currently three separate purposes being served by a single
tracepoint here. They need to be split up.

svcrdma_wc_recv:
 - status is always zero, so there's no value in recording it.
 - vendor_err is meaningless unless status is not zero, so
   there's no value in recording it.
 - This tracepoint is needed only when developing modifications,
   so it should be left disabled most of the time.

svcrdma_wc_recv_flush:
 - As above, needed only rarely, and not an error.

svcrdma_wc_recv_err:
 - received is always zero, so there's no value in recording it.
 - This tracepoint can be left enabled because completion
   errors are run-time problems (except for FLUSHED_ERR).
 - Tracepoint name now ends in _err to reflect its purpose.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-10-04 15:40:15 -04:00
Andrew Lunn
b44d52a50b dsa: tag_dsa: Fix mask for trunked packets
A packet received on a trunk will have bit 2 set in Forward DSA tagged
frame. Bit 1 can be either 0 or 1 and is otherwise undefined and bit 0
indicates the frame CFI. Masking with 7 thus results in frames as
being identified as being from a trunk when in fact they are not. Fix
the mask to just look at bit 2.

Fixes: 5b60dadb71db ("net: dsa: tag_dsa: Support reception of packets from LAG devices")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-04 13:43:55 +01:00
MichelleJin
23b0826048 net: ipv6: fix use after free of struct seg6_pernet_data
sdata->tun_src should be freed before sdata is freed
because sdata->tun_src is allocated after sdata allocation.
So, kfree(sdata) and kfree(rcu_dereference_raw(sdata->tun_src)) are
changed code order.

Fixes: f04ed7d277e8 ("net: ipv6: check return value of rhashtable_init")

Signed-off-by: MichelleJin <shjy180909@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-04 13:40:19 +01:00
Justin Iurman
8cb3bf8bff ipv6: ioam: Add support for the ip6ip6 encapsulation
This patch adds support for the ip6ip6 encapsulation by providing three encap
modes: inline, encap and auto.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-04 12:53:35 +01:00
Justin Iurman
7b34e449e0 ipv6: ioam: Prerequisite patch for ioam6_iptunnel
This prerequisite patch provides some minor edits (alignments, renames) and a
minor modification inside a function to facilitate the next patch by using
existing nla_* functions.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-04 12:53:35 +01:00
Justin Iurman
52d0378645 ipv6: ioam: Distinguish input and output for hop-limit
This patch anticipates the support for the IOAM insertion inside in-transit
packets, by making a difference between input and output in order to determine
the right value for its hop-limit (inherited from the IPv6 hop-limit).

Input case: happens before ip6_forward, the IPv6 hop-limit is not decremented
yet -> decrement the IOAM hop-limit to reflect the new hop inside the trace.

Output case: happens after ip6_forward, the IPv6 hop-limit has already been
decremented -> keep the same value for the IOAM hop-limit.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-04 12:53:35 +01:00
Trond Myklebust
33c3214bf4 SUNRPC: xprt_clear_locked() only needs release memory semantics
The clearing of the XPRT_LOCKED bit has to happen after we clear
xprt->snd_task, but we don't require any extra memory barriers after
that.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03 20:49:06 -04:00
Trond Myklebust
6dbcbe3f78 SUNRPC: Remove WQ_HIGHPRI from xprtiod
Don't let xprtiod pre-empt softirq.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03 20:49:05 -04:00
Trond Myklebust
47dd8796a3 SUNRPC: Add cond_resched() at the appropriate point in __rpc_execute()
Allow tasks that need to pre-empt rpciod/xprtiod to do so when it is
safe.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03 20:49:05 -04:00
Trond Myklebust
ea7a1019d8 SUNRPC: Partial revert of commit 6f9f17287e78
The premise of commit 6f9f17287e78 ("SUNRPC: Mitigate cond_resched() in
xprt_transmit()") was that cond_resched() is expensive and unnecessary
when there has been just a single send.
The point of cond_resched() is to ensure that tasks that should pre-empt
this one get a chance to do so when it is safe to do so. The code prior
to commit 6f9f17287e78 failed to take into account that it was keeping a
rpc_task pinned for longer than it needed to, and so rather than doing a
full revert, let's just move the cond_resched.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-10-03 20:49:05 -04:00
Jeremy Kerr
1e5e9250d4 mctp: Add input reassembly tests
Add multi-packet route input tests, for message reassembly. These will
feed packets to be received by a bound socket, or dropped.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-03 14:35:41 +01:00
Jeremy Kerr
8892c04907 mctp: Add route input to socket tests
Add a few tests for single-packet route inputs, testing the
mctp_route_input function.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-03 14:35:41 +01:00
Jeremy Kerr
b504db408c mctp: Add packet rx tests
Add a few tests for the initial packet ingress through
mctp_pkttype_receive function; mainly packet header sanity checks. Full
input routing checks will be added as a separate change.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-03 14:35:41 +01:00