64407 Commits

Author SHA1 Message Date
Eric Dumazet
add2d73631 net: set initial device refcount to 1
When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the
initial net device refcount was 0.

When CONFIG_PCPU_DEV_REFCNT is not set, this means
the first dev_hold() triggers an illegal refcount
operation (addition on 0)

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4

Fix is to change initial (and final) refcount to be 1.

Also add a missing kerneldoc piece, as reported by
Stephen Rothwell.

Fixes: 919067cc845f ("net: add CONFIG_PCPU_DEV_REFCNT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <groeck@google.com>
Tested-by: Guenter Roeck <groeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 16:57:36 -07:00
Nikolay Aleksandrov
0353b4a96b net: bridge: when suppression is enabled exclude RARP packets
Recently we had an interop issue where RARP packets got suppressed with
bridge neigh suppression enabled, but the check in the code was meant to
suppress GARP. Exclude RARP packets from it which would allow some VMWare
setups to work, to quote the report:
"Those RARP packets usually get generated by vMware to notify physical
switches when vMotion occurs. vMware may use random sip/tip or just use
sip=tip=0. So the RARP packet sometimes get properly flooded by the vtep
and other times get dropped by the logic"

Reported-by: Amer Abdalamer <amer@nvidia.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 13:30:24 -07:00
Antoine Tenart
7f08ec6e04 net-sysfs: remove possible sleep from an RCU read-side critical section
xps_queue_show is mostly made of an RCU read-side critical section and
calls bitmap_zalloc with GFP_KERNEL in the middle of it. That is not
allowed as this call may sleep and such behaviours aren't allowed in RCU
read-side critical sections. Fix this by using GFP_NOWAIT instead.

Fixes: 5478fcd0f483 ("net: embed nr_ids in the xps maps")
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 13:28:13 -07:00
Bhaskar Chowdhury
aa785f93fc net: l2tp: Fix a typo
s/verifed/verified/

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 13:17:49 -07:00
Vladimir Oltean
744b837663 net: move the ptype_all and ptype_base declarations to include/linux/netdevice.h
ptype_all and ptype_base are declared in net/core/dev.c as non-static,
because they are used by net-procfs.c too. However, a "make W=1" build
complains that there was no previous declaration of ptype_all and
ptype_base in a header file, so this way of declaring things constitutes
a violation of coding style.

Let's move the extern declarations of ptype_all and ptype_base to the
linux/netdevice.h file, which is included by net-procfs.c too.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 13:14:45 -07:00
Vladimir Oltean
5da9ace340 net: make xps_needed and xps_rxqs_needed static
Since their introduction in commit 04157469b7b8 ("net: Use static_key
for XPS maps"), xps_needed and xps_rxqs_needed were never used outside
net/core/dev.c, so I don't really understand why they were exported as
symbols in the first place.

This is needed in order to silence a "make W=1" warning about these
static keys not being declared as static variables, but not having a
previous declaration in a header file nonetheless.

Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 13:13:55 -07:00
Vladimir Oltean
f5fcca89f5 net: bridge: declare br_vlan_tunnel_lookup argument tunnel_id as __be64
The only caller of br_vlan_tunnel_lookup, br_handle_ingress_vlan_tunnel,
extracts the tunnel_id from struct ip_tunnel_info::struct ip_tunnel_key::
tun_id which is a __be64 value.

The exact endianness does not seem to matter, because the tunnel id is
just used as a lookup key for the VLAN group's tunnel hash table, and
the value is not interpreted directly per se. Moreover,
rhashtable_lookup_fast treats the key argument as a const void *.

Therefore, there is no functional change associated with this patch,
just one to silence "make W=1" builds.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 13:11:59 -07:00
Bhaskar Chowdhury
f44773058c openvswitch: Fix a typo
s/subsytem/subsystem/

Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 12:59:46 -07:00
Vladimir Oltean
a50a151e31 net: ipconfig: ic_dev can be NULL in ic_close_devs
ic_close_dev contains a generalization of the logic to not close a
network interface if it's the host port for a DSA switch. This logic is
disguised behind an iteration through the lowers of ic_dev in
ic_close_dev.

When no interface for ipconfig can be found, ic_dev is NULL, and
ic_close_dev:
- dereferences a NULL pointer when assigning selected_dev
- would attempt to search through the lower interfaces of a NULL
  net_device pointer

So we should protect against that case.

The "lower_dev" iterator variable was shortened to "lower" in order to
keep the 80 character limit.

Fixes: f68cbaed67cb ("net: ipconfig: avoid use-after-free in ic_close_devs")
Fixes: 46acf7bdbc72 ("Revert "net: ipv4: handle DSA enabled master network devices"")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Heiko Thiery <heiko.thiery@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 12:57:51 -07:00
Vladimir Oltean
abee13f53e net/sched: cls_flower: use nla_get_be32 for TCA_FLOWER_KEY_FLAGS
The existing code is functionally correct: iproute2 parses the ip_flags
argument for tc-flower and really packs it as big endian into the
TCA_FLOWER_KEY_FLAGS netlink attribute. But there is a problem in the
fact that W=1 builds complain:

net/sched/cls_flower.c:1047:15: warning: cast to restricted __be32

This is because we should use the dedicated helper for obtaining a
__be32 pointer to the netlink attribute, not a u32 one. This ensures
type correctness for be32_to_cpu.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 12:48:20 -07:00
Vladimir Oltean
6215afcb9a net/sched: cls_flower: use ntohs for struct flow_dissector_key_ports
A make W=1 build complains that:

net/sched/cls_flower.c:214:20: warning: cast from restricted __be16
net/sched/cls_flower.c:214:20: warning: incorrect type in argument 1 (different base types)
net/sched/cls_flower.c:214:20:    expected unsigned short [usertype] val
net/sched/cls_flower.c:214:20:    got restricted __be16 [usertype] dst

This is because we use htons on struct flow_dissector_key_ports members
src and dst, which are defined as __be16, so they are already in network
byte order, not host. The byte swap function for the other direction
should have been used.

Because htons and ntohs do the same thing (either both swap, or none
does), this change has no functional effect except to silence the
warnings.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 12:48:20 -07:00
Alexander Lobakin
227d72063f dsa: simplify Kconfig symbols and dependencies
1. Remove CONFIG_HAVE_NET_DSA.

CONFIG_HAVE_NET_DSA is a legacy leftover from the times when drivers
should have selected CONFIG_NET_DSA manually.
Currently, all drivers has explicit 'depends on NET_DSA', so this is
no more needed.

2. CONFIG_HAVE_NET_DSA dependencies became CONFIG_NET_DSA's ones.

 - dropped !S390 dependency which was introduced to be sure NET_DSA
   can select CONFIG_PHYLIB. DSA migrated to Phylink almost 3 years
   ago and the PHY library itself doesn't depend on !S390 since
   commit 870a2b5e4fcd ("phylib: remove !S390 dependeny from Kconfig");
 - INET dependency is kept to be sure we can select NET_SWITCHDEV;
 - NETDEVICES dependency is kept to be sure we can select PHYLINK.

3. DSA drivers menu now depends on NET_DSA.

Instead on 'depends on NET_DSA' on every single driver, the entire
menu now depends on it. This eliminates a lot of duplicated lines
from Kconfig with no loss (when CONFIG_NET_DSA=m, drivers also can
be only m or n).
This also has a nice side effect that there's no more empty menu on
configurations without DSA.

4. Kbuild will now descend into 'drivers/net/dsa' only when
   CONFIG_NET_DSA is y or m.

This is safe since no objects inside this folder can be built without
DSA core, as well as when CONFIG_NET_DSA=m, no objects can be
built-in.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 12:15:37 -07:00
Sai Kalyaan Palla
b29648ad5b net: decnet: Fixed multiple coding style issues
Made changes to coding style as suggested by checkpatch.pl
changes are of the type:
	open brace '{' following struct go on the same line
	do not use assignment in if condition

Signed-off-by: Sai Kalyaan Palla <saikalyaan63@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-20 18:50:51 -07:00
Oliver Hartkopp
b5f020f82a can: isotp: tx-path: zero initialize outgoing CAN frames
Commit d4eb538e1f48 ("can: isotp: TX-path: ensure that CAN frame flags are
initialized") ensured the TX flags to be properly set for outgoing CAN
frames.

In fact the root cause of the issue results from a missing initialization
of outgoing CAN frames created by isotp. This is no problem on the CAN bus
as the CAN driver only picks the correctly defined content from the struct
can(fd)_frame. But when the outgoing frames are monitored (e.g. with
candump) we potentially leak some bytes in the unused content of
struct can(fd)_frame.

Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol")
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20210319100619.10858-1-socketcan@hartkopp.net
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-20 20:21:35 +01:00
David Brazdil
1f935e8e72 selinux: vsock: Set SID for socket returned by accept()
For AF_VSOCK, accept() currently returns sockets that are unlabelled.
Other socket families derive the child's SID from the SID of the parent
and the SID of the incoming packet. This is typically done as the
connected socket is placed in the queue that accept() removes from.

Reuse the existing 'security_sk_clone' hook to copy the SID from the
parent (server) socket to the child. There is no packet SID in this
case.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: David Brazdil <dbrazdil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-19 13:46:55 -07:00
Eric Dumazet
919067cc84 net: add CONFIG_PCPU_DEV_REFCNT
I was working on a syzbot issue, claiming one device could not be
dismantled because its refcount was -1

unregister_netdevice: waiting for sit0 to become free. Usage count = -1

It would be nice if syzbot could trigger a warning at the time
this reference count became negative.

This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults
to per cpu variables (as before this patch) on SMP builds.

v2: free_dev label in alloc_netdev_mqs() is moved to avoid
    a compiler warning (-Wunused-label), as reported
    by kernel test robot <lkp@intel.com>

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-19 13:38:46 -07:00
Xin Long
8ff0b1f08e sctp: move sk_route_caps check and set into sctp_outq_flush_transports
The sk's sk_route_caps is set in sctp_packet_config, and later it
only needs to change when traversing the transport_list in a loop,
as the dst might be changed in the tx path.

So move sk_route_caps check and set into sctp_outq_flush_transports
from sctp_packet_transmit. This also fixes a dst leak reported by
Chen Yi:

  https://bugzilla.kernel.org/show_bug.cgi?id=212227

As calling sk_setup_caps() in sctp_packet_transmit may also set the
sk_route_caps for the ctrl sock in a netns. When the netns is being
deleted, the ctrl sock's releasing is later than dst dev's deleting,
which will cause this dev's deleting to hang and dmesg error occurs:

  unregister_netdevice: waiting for xxx to become free. Usage count = 1

Reported-by: Chen Yi <yiche@redhat.com>
Fixes: bcd623d8e9fa ("sctp: call sk_setup_caps in sctp_packet_transmit instead")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-19 11:34:49 -07:00
Kurt Kanzenbach
497cc00224 taprio: Handle short intervals and large packets
When using short intervals e.g. below one millisecond, large packets won't be
transmitted at all. The software implementations checks whether the packet can
be fit into the remaining interval. Therefore, it takes the packet length and
the transmission speed into account. That is correct.

However, for large packets it may be that the transmission time exceeds the
interval resulting in no packet transmission. The same situation works fine with
hardware offloading applied.

The problem has been observed with the following schedule and iperf3:

|tc qdisc replace dev lan1 parent root handle 100 taprio \
|   num_tc 8 \
|   map 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 \
|   queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
|   base-time $base \
|   sched-entry S 0x40 500000 \
|   sched-entry S 0xbf 500000 \
|   clockid CLOCK_TAI \
|   flags 0x00

[...]

|root@tsn:~# iperf3 -c 192.168.2.105
|Connecting to host 192.168.2.105, port 5201
|[  5] local 192.168.2.121 port 52610 connected to 192.168.2.105 port 5201
|[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
|[  5]   0.00-1.00   sec  45.2 KBytes   370 Kbits/sec    0   1.41 KBytes
|[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes

After debugging, it seems that the packet length stored in the SKB is about
7000-8000 bytes. Using a 100 Mbit/s link the transmission time is about 600us
which larger than the interval of 500us.

Therefore, segment the SKB into smaller chunks if the packet is too big. This
yields similar results than the hardware offload:

|root@tsn:~# iperf3 -c 192.168.2.105
|Connecting to host 192.168.2.105, port 5201
|- - - - - - - - - - - - - - - - - - - - - - - - -
|[ ID] Interval           Transfer     Bitrate         Retr
|[  5]   0.00-10.00  sec  48.9 MBytes  41.0 Mbits/sec    0             sender
|[  5]   0.00-10.02  sec  48.7 MBytes  40.7 Mbits/sec                  receiver

Furthermore, the segmentation can be skipped for the full offload case, as the
driver or the hardware is expected to handle this.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-19 11:32:44 -07:00
Alexander Lobakin
5588796e89 ethernet: avoid retpoline overhead on TEB (GENEVE, NvGRE, VxLAN) GRO
The two most popular headers going after Ethernet are IPv4 and IPv6.
Retpoline overhead for them is addressed only in dev_gro_receive(),
when they lie right after the outermost Ethernet header.
Use the indirect call wrappers in TEB (Transparent Ethernet Bridging,
such as GENEVE, NvGRE, VxLAN etc.) GRO receive code to reduce the
penalty when processing the inner headers.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 19:51:12 -07:00
Alexander Lobakin
4a6e7ec93a vlan/8021q: avoid retpoline overhead on GRO
The two most popular headers going after VLAN are IPv4 and IPv6.
Retpoline overhead for them is addressed only in dev_gro_receive(),
when they lie right after the outermost Ethernet header.
Use the indirect call wrappers in VLAN GRO receive code to reduce
the penalty on receiving tagged frames (when hardware stripping is
off or not available).

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 19:51:12 -07:00
David S. Miller
84f4aced67 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Several patches to testore use of memory barriers instead of RCU to
   ensure consistent access to ruleset, from Mark Tomlinson.

2) Fix dump of expectation via ctnetlink, from Florian Westphal.

3) GRE helper works for IPv6, from Ludovic Senecaux.

4) Set error on unsupported flowtable flags.

5) Use delayed instead of deferrable workqueue in the flowtable,
   from Yinjun Zhang.

6) Fix spurious EEXIST in case of add-after-delete flowtable in
   the same batch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 19:19:06 -07:00
Xiong Zhenwu
a835f9034e /net/core/: fix misspellings using codespell tool
A typo is found out by codespell tool in 1734th line of drop_monitor.c:

$ codespell ./net/core/
./net/core/drop_monitor.c:1734: guarnateed  ==> guaranteed

Fix a typo found by codespell.

Signed-off-by: Xiong Zhenwu <xiong.zhenwu@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 19:13:41 -07:00
Xiong Zhenwu
7f1330c1b1 /net/hsr: fix misspellings using codespell tool
A typo is found out by codespell tool in 111th line of hsr_debugfs.c:

$ codespell ./net/hsr/

net/hsr/hsr_debugfs.c:111: Debufs  ==> Debugfs

Fix typos found by codespell.

Signed-off-by: Xiong Zhenwu <xiong.zhenwu@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 19:13:41 -07:00
Tobias Waldekranz
cc76ce9e8d net: dsa: Add helper to resolve bridge port from DSA port
In order for a driver to be able to query a bridge for information
about itself, e.g. reading out port flags, it has to use a netdev that
is known to the bridge. In the simple case, that is just the netdev
representing the port, e.g. swp0 or swp1 in this example:

   br0
   / \
swp0 swp1

But in the case of an offloaded lag, this will be the bond or team
interface, e.g. bond0 in this example:

     br0
     /
  bond0
   / \
swp0 swp1

Add a helper that hides some of this complexity from the
drivers. Then, redefine dsa_port_offloads_bridge_port using the helper
to avoid double accounting of the set of possible offloaded uppers.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 16:24:06 -07:00
Antoine Tenart
75b2758abc net: NULL the old xps map entries when freeing them
In __netif_set_xps_queue, old map entries from the old dev_maps are
freed but their corresponding entry in the old dev_maps aren't NULLed.
Fix this.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
2d05bf0153 net: fix use after free in xps
When setting up an new dev_maps in __netif_set_xps_queue, we remove and
free maps from unused CPUs/rx-queues near the end of the function; by
calling remove_xps_queue. However it's possible those maps are also part
of the old not-freed-yet dev_maps, which might be used concurrently.
When that happens, a map can be freed while its corresponding entry in
the old dev_maps table isn't NULLed, leading to: "BUG: KASAN:
use-after-free" in different places.

This fixes the map freeing logic for unused CPUs/rx-queues, to also NULL
the map entries from the old dev_maps table.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
2db6cdaeba net-sysfs: move the xps cpus/rxqs retrieval in a common function
Most of the xps_cpus_show and xps_rxqs_show functions share the same
logic. Having it in two different functions does not help maintenance.
This patch moves their common logic into a new function, xps_queue_show,
to improve this.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
d7be87a687 net-sysfs: move the rtnl unlock up in the xps show helpers
Now that nr_ids and num_tc are stored in the xps dev_maps, which are RCU
protected, we do not have the need to protect the maps in the rtnl lock.
Move the rtnl unlock up so we reduce the rtnl locking section.

We also increase the reference count on the subordinate device if any,
as we don't want this device to be freed while we use it (now that the
rtnl lock isn't protecting it in the whole function).

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
132f743b01 net: improve queue removal readability in __netif_set_xps_queue
Improve the readability of the loop removing tx-queue from unused
CPUs/rx-queues in __netif_set_xps_queue. The change should only be
cosmetic.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
402fbb992e net: add an helper to copy xps maps to the new dev_maps
This patch adds an helper, xps_copy_dev_maps, to copy maps from dev_maps
to new_dev_maps at a given index. The logic should be the same, with an
improved code readability and maintenance.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
044ab86d43 net: move the xps maps to an array
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in
net_device. That will simplify a lot the code removing the need for lots
of if/else conditionals as the correct map will be available using its
offset in the array.

This should not modify the xps maps behaviour in any way.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
6f36158e05 net: remove the xps possible_mask
Remove the xps possible_mask. It was an optimization but we can just
loop from 0 to nr_ids now that it is embedded in the xps dev_maps. That
simplifies the code a bit.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
5478fcd0f4 net: embed nr_ids in the xps maps
Embed nr_ids (the number of cpu for the xps cpus map, and the number of
rxqs for the xps cpus map) in dev_maps. That will help not accessing out
of bound memory if those values change after dev_maps was allocated.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
255c04a87f net: embed num_tc in the xps maps
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when
allocating the map. But later updates of dev->num_tc can lead to having
a mismatch between the maps and how they're accessed. In such cases the
map values do not make any sense and out of bound accesses can occur
(that can be easily seen using KASAN).

This patch aims at fixing this by embedding num_tc into the maps, using
the value at the time the map is created. This brings two improvements:
- The maps can be accessed using the embedded num_tc, so we know for
  sure we won't have out of bound accesses.
- Checks can be made before accessing the maps so we know the values
  retrieved will make sense.

We also update __netif_set_xps_queue to conditionally copy old maps from
dev_maps in the new one only if the number of traffic classes from both
maps match.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
73f5e52b15 net-sysfs: make xps_cpus_show and xps_rxqs_show consistent
Make the implementations of xps_cpus_show and xps_rxqs_show to converge,
as the two share the same logic but diverted over time. This should not
modify their behaviour but will help future changes and improve
maintenance.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
d9a063d207 net-sysfs: store the return of get_netdev_queue_index in an unsigned int
In net-sysfs, get_netdev_queue_index returns an unsigned int. Some of
its callers use an unsigned long to store the returned value. Update the
code to be consistent, this should only be cosmetic.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart
ea4fe7e842 net-sysfs: convert xps_cpus_show to bitmap_zalloc
Use bitmap_zalloc instead of zalloc_cpumask_var in xps_cpus_show to
align with xps_rxqs_show. This will improve maintenance and allow us to
factorize the two functions. The function should behave the same.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Jiri Bohac
6c015a2256 net: check all name nodes in __dev_alloc_name
__dev_alloc_name(), when supplied with a name containing '%d',
will search for the first available device number to generate a
unique device name.

Since commit ff92741270bf8b6e78aa885f166b68c7a67ab13a ("net:
introduce name_node struct to be used in hashlist") network
devices may have alternate names.  __dev_alloc_name() does take
these alternate names into account, possibly generating a name
that is already taken and failing with -ENFILE as a result.

This demonstrates the bug:

    # rmmod dummy 2>/dev/null
    # ip link property add dev lo altname dummy0
    # modprobe dummy numdummies=1
    modprobe: ERROR: could not insert 'dummy': Too many open files in system

Instead of creating a device named dummy1, modprobe fails.

Fix this by checking all the names in the d->name_node list, not just d->name.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: ff92741270bf ("net: introduce name_node struct to be used in hashlist")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:40:53 -07:00
Jakub Kicinski
dcc32f4f18 ipv6: weaken the v4mapped source check
This reverts commit 6af1799aaf3f1bc8defedddfa00df3192445bbf3.

Commit 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped
source address") introduced an input check against v4mapped addresses.
Use of such addresses on the wire is indeed questionable and not
allowed on public Internet. As the commit pointed out

  https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02

lists potential issues.

Unfortunately there are applications which use v4mapped addresses,
and breaking them is a clear regression. For example v4mapped
addresses (or any semi-valid addresses, really) may be used
for uni-direction event streams or packet export.

Since the issue which sparked the addition of the check was with
TCP and request_socks in particular push the check down to TCPv6
and DCCP. This restores the ability to receive UDPv6 packets with
v4mapped address as the source.

Keep using the IPSTATS_MIB_INHDRERRORS statistic to minimize the
user-visible changes.

Fixes: 6af1799aaf3f ("ipv6: drop incoming packets having a v4mapped source address")
Reported-by: Sunyi Shao <sunyishao@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 11:19:23 -07:00
Yonghong Song
97a19caf1b bpf: net: Emit anonymous enum with BPF_TCP_CLOSE value explicitly
The selftest failed to compile with clang-built bpf-next.
Adding LLVM=1 to your vmlinux and selftest build will use clang.
The error message is:
  progs/test_sk_storage_tracing.c:38:18: error: use of undeclared identifier 'BPF_TCP_CLOSE'
          if (newstate == BPF_TCP_CLOSE)
                          ^
  1 error generated.
  make: *** [Makefile:423: /bpf-next/tools/testing/selftests/bpf/test_sk_storage_tracing.o] Error 1

The reason for the failure is that BPF_TCP_CLOSE, a value of
an anonymous enum defined in uapi bpf.h, is not defined in
vmlinux.h. gcc does not have this problem. Since vmlinux.h
is derived from BTF which is derived from vmlinux DWARF,
that means gcc-produced vmlinux DWARF has BPF_TCP_CLOSE
while llvm-produced vmlinux DWARF does not have.

BPF_TCP_CLOSE is referenced in net/ipv4/tcp.c as
  BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
The following test mimics the above BUILD_BUG_ON, preprocessed
with clang compiler, and shows gcc DWARF contains BPF_TCP_CLOSE while
llvm DWARF does not.

  $ cat t.c
  enum {
    BPF_TCP_ESTABLISHED = 1,
    BPF_TCP_CLOSE = 7,
  };
  enum {
    TCP_ESTABLISHED = 1,
    TCP_CLOSE = 7,
  };

  int test() {
    do {
      extern void __compiletime_assert_767(void) ;
      if ((int)BPF_TCP_CLOSE != (int)TCP_CLOSE) __compiletime_assert_767();
    } while (0);
    return 0;
  }
  $ clang t.c -O2 -c -g && llvm-dwarfdump t.o | grep BPF_TCP_CLOSE
  $ gcc t.c -O2 -c -g && llvm-dwarfdump t.o | grep BPF_TCP_CLOSE
                    DW_AT_name    ("BPF_TCP_CLOSE")

Further checking clang code find clang actually tried to
evaluate condition at compile time. If it is definitely
true/false, it will perform optimization and the whole if condition
will be removed before generating IR/debuginfo.

This patch explicited add an expression after the
above mentioned BUILD_BUG_ON in net/ipv4/tcp.c like
  (void)BPF_TCP_ESTABLISHED
to enable generation of debuginfo for the anonymous
enum which also includes BPF_TCP_CLOSE.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210317174132.589276-1-yhs@fb.com
2021-03-17 18:45:40 -07:00
Pablo Neira Ayuso
0ce7cf4127 netfilter: nftables: update table flags from the commit phase
Do not update table flags from the preparation phase. Store the flags
update into the transaction, then update the flags from the commit
phase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 01:35:39 +01:00
Pablo Neira Ayuso
86fe2c19ee netfilter: nftables: skip hook overlap logic if flowtable is stale
If the flowtable has been previously removed in this batch, skip the
hook overlap checks. This fixes spurious EEXIST errors when removing and
adding the flowtable in the same batch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 01:08:54 +01:00
Pablo Neira Ayuso
1b9cd7690a netfilter: flowtable: refresh timeout after dst and writable checks
Refresh the timeout (and retry hardware offload) once the skbuff dst
is confirmed to be current and after the skbuff is made writable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:44:00 +01:00
Pablo Neira Ayuso
e5075c0bad netfilter: flowtable: call dst_check() to fall back to classic forwarding
In case the route is stale, pass up the packet to the classic forwarding
path for re-evaluation and schedule this flow entry for removal.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:44:00 +01:00
Pablo Neira Ayuso
f4401262b9 netfilter: flowtable: fast NAT functions never fail
Simplify existing fast NAT routines by returning void. After the
skb_try_make_writable() call consolidation, these routines cannot ever
fail.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:44:00 +01:00
Pablo Neira Ayuso
4f08f173d0 netfilter: flowtable: move FLOW_OFFLOAD_DIR_MAX away from enumeration
This allows to remove the default case which should not ever happen and
that was added to avoid gcc warnings on unhandled FLOW_OFFLOAD_DIR_MAX
enumeration case.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:44:00 +01:00
Pablo Neira Ayuso
2babb46c8c netfilter: flowtable: move skb_try_make_writable() before NAT in IPv4
For consistency with the IPv6 flowtable datapath and to make sure the
skbuff is writable right before the NAT header updates.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:44:00 +01:00
Pablo Neira Ayuso
2fc11745c3 netfilter: flowtable: consolidate skb_try_make_writable() call
Fetch the layer 4 header size to be mangled by NAT when building the
tuple, then use it to make writable the network and the transport
headers. After this update, the NAT routines now assumes that the skbuff
area is writable. Do the pointer refetch only after the single
skb_try_make_writable() call.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:43:59 +01:00
Gustavo A. R. Silva
c2168e6bd7 netfilter: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple
warnings by explicitly adding multiple break statements instead of just
letting the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:32:48 +01:00
Yinjun Zhang
740b486a8d netfilter: flowtable: Make sure GC works periodically in idle system
Currently flowtable's GC work is initialized as deferrable, which
means GC cannot work on time when system is idle. So the hardware
offloaded flow may be deleted for timeout, since its used time is
not timely updated.

Resolve it by initializing the GC work as delayed work instead of
deferrable.

Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-03-18 00:32:21 +01:00