7644 Commits

Author SHA1 Message Date
Cong Wang
0739cd28f2 net: Introduce skb_send_sock() for sock_map
We only have skb_send_sock_locked() which requires callers
to use lock_sock(). Introduce a variant skb_send_sock()
which locks on its own, callers do not need to lock it
any more. This will save us from adding a ->sendmsg_locked
for each protocol.

To reuse the code, pass function pointers to __skb_send_sock()
and build skb_send_sock() and skb_send_sock_locked() on top.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-4-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang
b01fd6e802 skmsg: Introduce a spinlock to protect ingress_msg
Currently we rely on lock_sock to protect ingress_msg,
it is too big for this, we can actually just use a spinlock
to protect this list like protecting other skb queues.

__tcp_bpf_recvmsg() is still special because of peeking,
it still has to use lock_sock.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-3-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Cong Wang
37f0e514db skmsg: Lock ingress_skb when purging
Currently we purge the ingress_skb queue only when psock
refcnt goes down to 0, so locking the queue is not necessary,
but in order to be called during ->close, we have to lock it
here.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-2-xiyou.wangcong@gmail.com
2021-04-01 10:56:13 -07:00
Ong Boon Leong
622d13694b xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory model
xdp_return_frame() may be called outside of NAPI context to return
xdpf back to page_pool. xdp_return_frame() calls __xdp_return() with
napi_direct = false. For page_pool memory model, __xdp_return() calls
xdp_return_frame_no_direct() unconditionally and below false negative
kernel BUG throw happened under preempt-rt build:

[  430.450355] BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/3884
[  430.451678] caller is __xdp_return+0x1ff/0x2e0
[  430.452111] CPU: 0 PID: 3884 Comm: modprobe Tainted: G     U      E     5.12.0-rc2+ #45

Changes in v2:
 - This patch fixes the issue by making xdp_return_frame_no_direct() is
   only called if napi_direct = true, as recommended for better by
   Jesper Dangaard Brouer. Thanks!

Fixes: 2539650fadbf ("xdp: Helpers for disabling napi_direct of xdp_return_frame")
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 15:15:23 -07:00
Tong Zhu
d47ec7a0a7 neighbour: Disregard DEAD dst in neigh_update
After a short network outage, the dst_entry is timed out and put
in DST_OBSOLETE_DEAD. We are in this code because arp reply comes
from this neighbour after network recovers. There is a potential
race condition that dst_entry is still in DST_OBSOLETE_DEAD.
With that, another neighbour lookup causes more harm than good.

In best case all packets in arp_queue are lost. This is
counterproductive to the original goal of finding a better path
for those packets.

I observed a worst case with 4.x kernel where a dst_entry in
DST_OBSOLETE_DEAD state is associated with loopback net_device.
It leads to an ethernet header with all zero addresses.
A packet with all zero source MAC address is quite deadly with
mac80211, ath9k and 802.11 block ack.  It fails
ieee80211_find_sta_by_ifaddr in ath9k (xmit.c). Ath9k flushes tx
queue (ath_tx_complete_aggr). BAW (block ack window) is not
updated. BAW logic is damaged and ath9k transmission is disabled.

Signed-off-by: Tong Zhu <zhutong@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31 14:10:46 -07:00
Paolo Abeni
9adc89af72 net: let skb_orphan_partial wake-up waiters.
Currently the mentioned helper can end-up freeing the socket wmem
without waking-up any processes waiting for more write memory.

If the partially orphaned skb is attached to an UDP (or raw) socket,
the lack of wake-up can hang the user-space.

Even for TCP sockets not calling the sk destructor could have bad
effects on TSQ.

Address the issue using skb_orphan to release the sk wmem before
setting the new sock_efree destructor. Additionally bundle the
whole ownership update in a new helper, so that later other
potential users could avoid duplicate code.

v1 -> v2:
 - use skb_orphan() instead of sort of open coding it (Eric)
 - provide an helper for the ownership change (Eric)

Fixes: f6ba8d33cfbb ("netem: fix skb_orphan_partial()")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-30 13:57:28 -07:00
Xiongfeng Wang
bb2882bc6c net: core: Correct function name netevent_unregister_notifier() in the kerneldoc
Fix the following W=1 kernel build warning(s):

 net/core/netevent.c:45: warning: expecting prototype for netevent_unregister_notifier(). Prototype was for unregister_netevent_notifier() instead

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-28 17:56:56 -07:00
Xiongfeng Wang
af82508743 net: core: Correct function name dev_uc_flush() in the kerneldoc
Fix the following W=1 kernel build warning(s):

 net/core/dev_addr_lists.c:732: warning: expecting prototype for dev_uc_flush(). Prototype was for dev_uc_init() instead

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-28 17:56:56 -07:00
Martin KaFai Lau
7bd1590d4e bpf: selftests: Add kfunc_call test
This patch adds a few kernel function bpf_kfunc_call_test*() for the
selftest's test_run purpose.  They will be allowed for tc_cls prog.

The selftest calling the kernel function bpf_kfunc_call_test*()
is also added in this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015252.1551395-1-kafai@fb.com
2021-03-26 20:41:52 -07:00
Dmitry Vyukov
6c996e1994 net: change netdev_unregister_timeout_secs min value to 1
netdev_unregister_timeout_secs=0 can lead to printing the
"waiting for dev to become free" message every jiffy.
This is too frequent and unnecessary.
Set the min value to 1 second.

Also fix the merge issue introduced by
"net: make unregister netdev warning timeout configurable":
it changed "refcnt != 1" to "refcnt".

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: 5aa3afe107d9 ("net: make unregister netdev warning timeout configurable")
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 17:24:06 -07:00
Lu Wei
897b9fae7a net: core: Fix a typo in dev_addr_lists.c
Modify "funciton" to "function" in net/core/dev_addr_lists.c.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 17:05:07 -07:00
David S. Miller
241949e488 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-03-24

The following pull-request contains BPF updates for your *net-next* tree.

We've added 37 non-merge commits during the last 15 day(s) which contain
a total of 65 files changed, 3200 insertions(+), 738 deletions(-).

The main changes are:

1) Static linking of multiple BPF ELF files, from Andrii.

2) Move drop error path to devmap for XDP_REDIRECT, from Lorenzo.

3) Spelling fixes from various folks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 16:30:46 -07:00
David S. Miller
efd13b71a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 15:31:22 -07:00
Pablo Neira Ayuso
ddb94eafab net: resolve forwarding path from virtual netdevice and HW destination address
This patch adds dev_fill_forward_path() which resolves the path to reach
the real netdevice from the IP forwarding side. This function takes as
input the netdevice and the destination hardware address and it walks
down the devices calling .ndo_fill_forward_path() for each device until
the real device is found.

For instance, assuming the following topology:

               IP forwarding
              /             \
           br0              eth0
           / \
       eth1  eth2
        .
        .
        .
       ethX
 ab💿ef🆎cd:ef

where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity.
ethX is the interface in another box which is connected to the eth1
bridge port.

For packets going through IP forwarding to br0 whose destination MAC
address is ab💿ef🆎cd:ef, dev_fill_forward_path() provides the
following path:

	br0 -> eth1

.ndo_fill_forward_path for br0 looks up at the FDB for the bridge port
from the destination MAC address to get the bridge port eth1.

This information allows to create a fast path that bypasses the classic
bridge and IP forwarding paths, so packets go directly from the bridge
port eth1 to eth0 (wan interface) and vice versa.

             fast path
      .------------------------.
     /                          \
    |           IP forwarding   |
    |          /             \  \/
    |       br0               eth0
    .       / \
     -> eth1  eth2
        .
        .
        .
       ethX
 ab💿ef🆎cd:ef

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 12:48:38 -07:00
Dmitry Vyukov
5aa3afe107 net: make unregister netdev warning timeout configurable
netdev_wait_allrefs() issues a warning if refcount does not drop to 0
after 10 seconds. While 10 second wait generally should not happen
under normal workload in normal environment, it seems to fire falsely
very often during fuzzing and/or in qemu emulation (~10x slower).
At least it's not possible to understand if it's really a false
positive or not. Automated testing generally bumps all timeouts
to very high values to avoid flake failures.
Add net.core.netdev_unregister_timeout_secs sysctl to make
the timeout configurable for automated testing systems.
Lowering the timeout may also be useful for e.g. manual bisection.
The default value matches the current behavior.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23 17:22:50 -07:00
Eric Dumazet
add2d73631 net: set initial device refcount to 1
When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the
initial net device refcount was 0.

When CONFIG_PCPU_DEV_REFCNT is not set, this means
the first dev_hold() triggers an illegal refcount
operation (addition on 0)

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4

Fix is to change initial (and final) refcount to be 1.

Also add a missing kerneldoc piece, as reported by
Stephen Rothwell.

Fixes: 919067cc845f ("net: add CONFIG_PCPU_DEV_REFCNT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <groeck@google.com>
Tested-by: Guenter Roeck <groeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 16:57:36 -07:00
Antoine Tenart
7f08ec6e04 net-sysfs: remove possible sleep from an RCU read-side critical section
xps_queue_show is mostly made of an RCU read-side critical section and
calls bitmap_zalloc with GFP_KERNEL in the middle of it. That is not
allowed as this call may sleep and such behaviours aren't allowed in RCU
read-side critical sections. Fix this by using GFP_NOWAIT instead.

Fixes: 5478fcd0f483 ("net: embed nr_ids in the xps maps")
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 13:28:13 -07:00
Vladimir Oltean
744b837663 net: move the ptype_all and ptype_base declarations to include/linux/netdevice.h
ptype_all and ptype_base are declared in net/core/dev.c as non-static,
because they are used by net-procfs.c too. However, a "make W=1" build
complains that there was no previous declaration of ptype_all and
ptype_base in a header file, so this way of declaring things constitutes
a violation of coding style.

Let's move the extern declarations of ptype_all and ptype_base to the
linux/netdevice.h file, which is included by net-procfs.c too.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 13:14:45 -07:00
Vladimir Oltean
5da9ace340 net: make xps_needed and xps_rxqs_needed static
Since their introduction in commit 04157469b7b8 ("net: Use static_key
for XPS maps"), xps_needed and xps_rxqs_needed were never used outside
net/core/dev.c, so I don't really understand why they were exported as
symbols in the first place.

This is needed in order to silence a "make W=1" warning about these
static keys not being declared as static variables, but not having a
previous declaration in a header file nonetheless.

Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 13:13:55 -07:00
Eric Dumazet
919067cc84 net: add CONFIG_PCPU_DEV_REFCNT
I was working on a syzbot issue, claiming one device could not be
dismantled because its refcount was -1

unregister_netdevice: waiting for sit0 to become free. Usage count = -1

It would be nice if syzbot could trigger a warning at the time
this reference count became negative.

This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults
to per cpu variables (as before this patch) on SMP builds.

v2: free_dev label in alloc_netdev_mqs() is moved to avoid
    a compiler warning (-Wunused-label), as reported
    by kernel test robot <lkp@intel.com>

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-19 13:38:46 -07:00
Xiong Zhenwu
a835f9034e /net/core/: fix misspellings using codespell tool
A typo is found out by codespell tool in 1734th line of drop_monitor.c:

$ codespell ./net/core/
./net/core/drop_monitor.c:1734: guarnateed  ==> guaranteed

Fix a typo found by codespell.

Signed-off-by: Xiong Zhenwu <xiong.zhenwu@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 19:13:41 -07:00
Antoine Tenart
75b2758abc net: NULL the old xps map entries when freeing them
In __netif_set_xps_queue, old map entries from the old dev_maps are
freed but their corresponding entry in the old dev_maps aren't NULLed.
Fix this.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
2d05bf0153 net: fix use after free in xps
When setting up an new dev_maps in __netif_set_xps_queue, we remove and
free maps from unused CPUs/rx-queues near the end of the function; by
calling remove_xps_queue. However it's possible those maps are also part
of the old not-freed-yet dev_maps, which might be used concurrently.
When that happens, a map can be freed while its corresponding entry in
the old dev_maps table isn't NULLed, leading to: "BUG: KASAN:
use-after-free" in different places.

This fixes the map freeing logic for unused CPUs/rx-queues, to also NULL
the map entries from the old dev_maps table.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
2db6cdaeba net-sysfs: move the xps cpus/rxqs retrieval in a common function
Most of the xps_cpus_show and xps_rxqs_show functions share the same
logic. Having it in two different functions does not help maintenance.
This patch moves their common logic into a new function, xps_queue_show,
to improve this.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
d7be87a687 net-sysfs: move the rtnl unlock up in the xps show helpers
Now that nr_ids and num_tc are stored in the xps dev_maps, which are RCU
protected, we do not have the need to protect the maps in the rtnl lock.
Move the rtnl unlock up so we reduce the rtnl locking section.

We also increase the reference count on the subordinate device if any,
as we don't want this device to be freed while we use it (now that the
rtnl lock isn't protecting it in the whole function).

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
132f743b01 net: improve queue removal readability in __netif_set_xps_queue
Improve the readability of the loop removing tx-queue from unused
CPUs/rx-queues in __netif_set_xps_queue. The change should only be
cosmetic.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
402fbb992e net: add an helper to copy xps maps to the new dev_maps
This patch adds an helper, xps_copy_dev_maps, to copy maps from dev_maps
to new_dev_maps at a given index. The logic should be the same, with an
improved code readability and maintenance.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
044ab86d43 net: move the xps maps to an array
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in
net_device. That will simplify a lot the code removing the need for lots
of if/else conditionals as the correct map will be available using its
offset in the array.

This should not modify the xps maps behaviour in any way.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
6f36158e05 net: remove the xps possible_mask
Remove the xps possible_mask. It was an optimization but we can just
loop from 0 to nr_ids now that it is embedded in the xps dev_maps. That
simplifies the code a bit.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
5478fcd0f4 net: embed nr_ids in the xps maps
Embed nr_ids (the number of cpu for the xps cpus map, and the number of
rxqs for the xps cpus map) in dev_maps. That will help not accessing out
of bound memory if those values change after dev_maps was allocated.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
255c04a87f net: embed num_tc in the xps maps
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when
allocating the map. But later updates of dev->num_tc can lead to having
a mismatch between the maps and how they're accessed. In such cases the
map values do not make any sense and out of bound accesses can occur
(that can be easily seen using KASAN).

This patch aims at fixing this by embedding num_tc into the maps, using
the value at the time the map is created. This brings two improvements:
- The maps can be accessed using the embedded num_tc, so we know for
  sure we won't have out of bound accesses.
- Checks can be made before accessing the maps so we know the values
  retrieved will make sense.

We also update __netif_set_xps_queue to conditionally copy old maps from
dev_maps in the new one only if the number of traffic classes from both
maps match.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
73f5e52b15 net-sysfs: make xps_cpus_show and xps_rxqs_show consistent
Make the implementations of xps_cpus_show and xps_rxqs_show to converge,
as the two share the same logic but diverted over time. This should not
modify their behaviour but will help future changes and improve
maintenance.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
d9a063d207 net-sysfs: store the return of get_netdev_queue_index in an unsigned int
In net-sysfs, get_netdev_queue_index returns an unsigned int. Some of
its callers use an unsigned long to store the returned value. Update the
code to be consistent, this should only be cosmetic.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
ea4fe7e842 net-sysfs: convert xps_cpus_show to bitmap_zalloc
Use bitmap_zalloc instead of zalloc_cpumask_var in xps_cpus_show to
align with xps_rxqs_show. This will improve maintenance and allow us to
factorize the two functions. The function should behave the same.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Jiri Bohac
6c015a2256 net: check all name nodes in __dev_alloc_name
__dev_alloc_name(), when supplied with a name containing '%d',
will search for the first available device number to generate a
unique device name.

Since commit ff92741270bf8b6e78aa885f166b68c7a67ab13a ("net:
introduce name_node struct to be used in hashlist") network
devices may have alternate names.  __dev_alloc_name() does take
these alternate names into account, possibly generating a name
that is already taken and failing with -ENFILE as a result.

This demonstrates the bug:

    # rmmod dummy 2>/dev/null
    # ip link property add dev lo altname dummy0
    # modprobe dummy numdummies=1
    modprobe: ERROR: could not insert 'dummy': Too many open files in system

Instead of creating a device named dummy1, modprobe fails.

Fix this by checking all the names in the d->name_node list, not just d->name.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: ff92741270bf ("net: introduce name_node struct to be used in hashlist")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:40:53 -07:00
Wei Wang
cb03835793 net: fix race between napi kthread mode and busy poll
Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
determine if the kthread owns this napi and could call napi->poll() on
it. However, if socket busy poll is enabled, it is possible that the
busy poll thread grabs this SCHED bit (after the previous napi->poll()
invokes napi_complete_done() and clears SCHED bit) and tries to poll
on the same napi. napi_disable() could grab the SCHED bit as well.
This patch tries to fix this race by adding a new bit
NAPI_STATE_SCHED_THREADED in napi->state. This bit gets set in
____napi_schedule() if the threaded mode is enabled, and gets cleared
in napi_complete_done(), and we only poll the napi in kthread if this
bit is set. This helps distinguish the ownership of the napi between
kthread and other scenarios and fixes the race issue.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Reported-by: Martin Zaharinov <micron10@gmail.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wei Wang <weiwan@google.com>
Cc: Alexander Duyck <alexanderduyck@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-17 14:31:17 -07:00
Martin Willi
3a5ca85707 can: dev: Move device back to init netns on owning netns delete
When a non-initial netns is destroyed, the usual policy is to delete
all virtual network interfaces contained, but move physical interfaces
back to the initial netns. This keeps the physical interface visible
on the system.

CAN devices are somewhat special, as they define rtnl_link_ops even
if they are physical devices. If a CAN interface is moved into a
non-initial netns, destroying that netns lets the interface vanish
instead of moving it back to the initial netns. default_device_exit()
skips CAN interfaces due to having rtnl_link_ops set. Reproducer:

  ip netns add foo
  ip link set can0 netns foo
  ip netns delete foo

WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60
CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1
Workqueue: netns cleanup_net
[<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14)
[<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8)
[<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114)
[<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac)
[<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60)
[<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380)
[<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438)
[<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8)
[<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c)
[<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c)

To properly restore physical CAN devices to the initial netns on owning
netns exit, introduce a flag on rtnl_link_ops that can be set by drivers.
For CAN devices setting this flag, default_device_exit() considers them
non-virtual, applying the usual namespace move.

The issue was introduced in the commit mentioned below, as at that time
CAN devices did not have a dellink() operation.

Fixes: e008b5fc8dc7 ("net: Simplfy default_device_exit and improve batching.")
Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-16 08:40:04 +01:00
Lorenzo Bianconi
8f64860f8b net: export dev_set_threaded symbol
For wireless devices (e.g. mt76 driver) multiple net_devices belongs to
the same wireless phy and the napi object is registered in a dummy
netdevice related to the wireless phy.
Export dev_set_threaded in order to be reused in device drivers enabling
threaded NAPI.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-15 12:35:23 -07:00
Manu Bretelle
6503b9f29a bpf: Add getter and setter for SO_REUSEPORT through bpf_{g,s}etsockopt
Augment the current set of options that are accessible via
bpf_{g,s}etsockopt to also support SO_REUSEPORT.

Signed-off-by: Manu Bretelle <chantra@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210310182305.1910312-1-chantra@fb.com
2021-03-15 17:22:22 +01:00
Alexander Lobakin
f96533cded flow_dissector: constify raw input data argument
Flow Dissector code never modifies the input buffer, neither skb nor
raw data.
Make 'data' argument const for all of the Flow dissector's functions.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:46:32 -07:00
Alexander Lobakin
d0eed5c325 gro: give 'hash' variable in dev_gro_receive() a less confusing name
'hash' stores not the flow hash, but the index of the GRO bucket
corresponding to it.
Change its name to 'bucket' to avoid confusion while reading lines
like '__set_bit(hash, &napi->gro_bitmask)'.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:41:09 -07:00
Alexander Lobakin
9dc2c31337 gro: consistentify napi->gro_hash[x] access in dev_gro_receive()
GRO bucket index doesn't change through the entire function.
Store a pointer to the corresponding bucket instead of its member
and use it consistently through the function.
It is performance-safe since &gro_list->list == gro_list.

Misc: remove superfluous braces around single-line branches.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:41:08 -07:00
Alexander Lobakin
0ccf4d50d1 gro: simplify gro_list_prepare()
gro_list_prepare() always returns &napi->gro_hash[bucket].list,
without any variations. Moreover, it uses 'napi' argument only to
have access to this list, and calculates the bucket index for the
second time (firstly it happens at the beginning of
dev_gro_receive()) to do that.
Given that dev_gro_receive() already has an index to the needed
list, just pass it as the first argument to eliminate redundant
calculations, and make gro_list_prepare() return void.
Also, both arguments of gro_list_prepare() can be constified since
this function can only modify the skbs from the bucket list.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:41:08 -07:00
Alexander Lobakin
a25f822285 flow_dissector: fix byteorder of dissected ICMP ID
flow_dissector_key_icmp::id is of type u16 (CPU byteorder),
ICMP header has its ID field in network byteorder obviously.
Sparse says:

net/core/flow_dissector.c:178:43: warning: restricted __be16 degrades to integer

Convert ID value to CPU byteorder when storing it into
flow_dissector_key_icmp.

Fixes: 5dec597e5cd0 ("flow_dissector: extract more ICMP information")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:30:20 -07:00
Tonghao Zhang
b80350f393 net: sock: simplify tw proto registration
Introduce the new function tw_prot_init (inspired by
req_prot_init) to simplify "proto_register" function.

tw_prot_cleanup will take care of a partially initialized
timewait_sock_ops.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-11 16:05:08 -08:00
Ido Schimmel
9398e9c0b1 drop_monitor: Perform cleanup upon probe registration failure
In the rare case that drop_monitor fails to register its probe on the
'napi_poll' tracepoint, it will not deactivate its hysteresis timer as
part of the error path. If the hysteresis timer was armed by the shortly
lived 'kfree_skb' probe and user space retries to initiate tracing, a
warning will be emitted for trying to initialize an active object [1].

Fix this by properly undoing all the operations that were done prior to
probe registration, in both software and hardware code paths.

Note that syzkaller managed to fail probe registration by injecting a
slab allocation failure [2].

[1]
ODEBUG: init active (active state 0) object type: timer_list hint: sched_send_work+0x0/0x60 include/linux/list.h:135
WARNING: CPU: 1 PID: 8649 at lib/debugobjects.c:505 debug_print_object+0x16e/0x250 lib/debugobjects.c:505
Modules linked in:
CPU: 1 PID: 8649 Comm: syz-executor.0 Not tainted 5.11.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:505
[...]
Call Trace:
 __debug_object_init+0x524/0xd10 lib/debugobjects.c:588
 debug_timer_init kernel/time/timer.c:722 [inline]
 debug_init kernel/time/timer.c:770 [inline]
 init_timer_key+0x2d/0x340 kernel/time/timer.c:814
 net_dm_trace_on_set net/core/drop_monitor.c:1111 [inline]
 set_all_monitor_traces net/core/drop_monitor.c:1188 [inline]
 net_dm_monitor_start net/core/drop_monitor.c:1295 [inline]
 net_dm_cmd_trace+0x720/0x1220 net/core/drop_monitor.c:1339
 genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:739
 genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
 genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:800
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
 netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338
 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2348
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2402
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2435
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae

[2]
 FAULT_INJECTION: forcing a failure.
 name failslab, interval 1, probability 0, space 0, times 1
 CPU: 1 PID: 8645 Comm: syz-executor.0 Not tainted 5.11.0-syzkaller #0
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
 Call Trace:
  dump_stack+0xfa/0x151
  should_fail.cold+0x5/0xa
  should_failslab+0x5/0x10
  __kmalloc+0x72/0x3f0
  tracepoint_add_func+0x378/0x990
  tracepoint_probe_register+0x9c/0xe0
  net_dm_cmd_trace+0x7fc/0x1220
  genl_family_rcv_msg_doit+0x228/0x320
  genl_rcv_msg+0x328/0x580
  netlink_rcv_skb+0x153/0x420
  genl_rcv+0x24/0x40
  netlink_unicast+0x533/0x7d0
  netlink_sendmsg+0x856/0xd90
  sock_sendmsg+0xcf/0x120
  ____sys_sendmsg+0x6e8/0x810
  ___sys_sendmsg+0xf3/0x170
  __sys_sendmsg+0xe5/0x1b0
  do_syscall_64+0x2d/0x70
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: 70c69274f354 ("drop_monitor: Initialize timer and work item upon tracing enable")
Fixes: 8ee2267ad33e ("drop_monitor: Convert to using devlink tracepoint")
Reported-by: syzbot+779559d6503f3a56213d@syzkaller.appspotmail.com
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 15:19:02 -08:00
David S. Miller
547fd08377 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-03-10

The following pull-request contains BPF updates for your *net* tree.

We've added 8 non-merge commits during the last 5 day(s) which contain
a total of 11 files changed, 136 insertions(+), 17 deletions(-).

The main changes are:

1) Reject bogus use of vmlinux BTF as map/prog creation BTF, from Alexei Starovoitov.

2) Fix allocation failure splat in x86 JIT for large progs. Also fix overwriting
   percpu cgroup storage from tracing programs when nested, from Yonghong Song.

3) Fix rx queue retrieval in XDP for multi-queue veth, from Maciej Fijalkowski.

4) Fix bpf_check_mtu() helper API before freeze to have mtu_len as custom skb/xdp
   L3 input length, from Jesper Dangaard Brouer.

5) Fix inode_storage's lookup_elem return value upon having bad fd, from Tal Lossos.

6) Fix bpftool and libbpf cross-build on MacOS, from Georgi Valkov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 15:14:56 -08:00
Yunsheng Lin
1ddc3229ad skbuff: remove some unnecessary operation in skb_segment_list()
gro list uses skb_shinfo(skb)->frag_list to link two skb together,
and NAPI_GRO_CB(p)->last->next is used when there are more skb,
see skb_gro_receive_list(). gso expects that each segmented skb is
linked together using skb->next, so only the first skb->next need
to set to skb_shinfo(skb)-> frag_list when doing gso list segment.

It is the same reason that nskb->next does not need to be set to
list_skb before goto the error handling, because nskb->next already
pointers to list_skb.

And nskb is also the last skb at the end of loop, so remove tail
variable and use nskb instead.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 12:45:15 -08:00
Gustavo A. R. Silva
b1866bfff9 net: core: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 12:45:15 -08:00
Daniel Borkmann
a188bb5638 net, bpf: Fix ip6ip6 crash with collect_md populated skbs
I ran into a crash where setting up a ip6ip6 tunnel device which was /not/
set to collect_md mode was receiving collect_md populated skbs for xmit.

The BPF prog was populating the skb via bpf_skb_set_tunnel_key() which is
assigning special metadata dst entry and then redirecting the skb to the
device, taking ip6_tnl_start_xmit() -> ipxip6_tnl_xmit() -> ip6_tnl_xmit()
and in the latter it performs a neigh lookup based on skb_dst(skb) where
we trigger a NULL pointer dereference on dst->ops->neigh_lookup() since
the md_dst_ops do not populate neigh_lookup callback with a fake handler.

Transform the md_dst_ops into generic dst_blackhole_ops that can also be
reused elsewhere when needed, and use them for the metadata dst entries as
callback ops.

Also, remove the dst_md_discard{,_out}() ops and rely on dst_discard{,_out}()
from dst_init() which free the skb the same way modulo the splat. Given we
will be able to recover just fine from there, avoid any potential splats
iff this gets ever triggered in future (or worse, panic on warns when set).

Fixes: f38a9eb1f77b ("dst: Metadata destinations")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 12:24:18 -08:00