66488 Commits

Author SHA1 Message Date
Paolo Abeni
ff5a0b421c mptcp: faster active backup recovery
The msk can use backup subflows to transmit in-sequence data
only if there are no other active subflow. On active backup
scenario, the MPTCP connection can do forward progress only
due to MPTCP retransmissions - rtx can pick backup subflows.

This patch introduces a new flag flow MPTCP subflows: if the
underlying TCP connection made no progresses for long time,
and there are other less problematic subflows available, the
given subflow become stale.

Stale subflows are not considered active: if all non backup
subflows become stale, the MPTCP scheduler can pick backup
subflows for plain transmissions.

Stale subflows can return in active state, as soon as any reply
from the peer is observed.

Active backup scenarios can now leverage the available b/w
with no restrinction.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14 11:37:25 +01:00
Paolo Abeni
6da14d74e2 mptcp: cleanup sysctl data and helpers
Reorder the data in mptcp_pernet to avoid wasting space
with no reasons and constify the access helpers.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14 11:37:25 +01:00
Paolo Abeni
1e1d9d6f11 mptcp: handle pending data on closed subflow
The PM can close active subflow, e.g. due to ingress RM_ADDR
option. Such subflow could carry data still unacked at the
MPTCP-level, both in the write and the rtx_queue, which has
never reached the other peer.

Currently the mptcp-level retransmission will deliver such data,
but at a very low rate (at most 1 DSM for each MPTCP rtx interval).

We can speed-up the recovery a lot, moving all the unacked in the
tcp write_queue, so that it will be pushed again via other
subflows, at the speed allowed by them.

Also make available the new helper for later patches.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14 11:37:25 +01:00
Paolo Abeni
71b7dec27f mptcp: less aggressive retransmission strategy
The current mptcp re-inject strategy is very aggressive,
we have mptcp-level retransmissions even on single subflow
connection, if the link in-use is lossy.

Let's be a little more conservative: we do retransmit
only if at least a subflow has write and rtx queue empty.

Additionally use the backup subflows only if the active
subflows are stale - no progresses in at least an rtx period
and ignore stale subflows for rtx timeout update

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14 11:37:25 +01:00
Paolo Abeni
33d41c9cd7 mptcp: more accurate timeout
As reported by Maxim, we have a lot of MPTCP-level
retransmissions when multilple links with different latencies
are in use.

This patch refactor the mptcp-level timeout accounting so that
the maximum of all the active subflow timeout is used. To avoid
traversing the subflow list multiple times, the update is
performed inside the packet scheduler.

Additionally clean-up a bit timeout handling.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-14 11:37:25 +01:00
Lukas Bulwahn
d8d9ba8dc9 net: 802: remove dead leftover after ipx driver removal
Commit 7a2e838d28cf ("staging: ipx: delete it from the tree") removes the
ipx driver and the config IPX. Since then, there is some dead leftover in
./net/802/, that was once used by the IPX driver, but has no other user.

Remove this dead leftover.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-13 16:30:35 -07:00
Changbin Du
afa79d08c6 net: in_irq() cleanup
Replace the obsolete and ambiguos macro in_irq() with new
macro in_hardirq().

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Link: https://lore.kernel.org/r/20210813145749.86512-1-changbin.du@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-13 14:09:19 -07:00
Rao Shoaib
876c14ad01 af_unix: fix holding spinlock in oob handling
syzkaller found that OOB code was holding spinlock
while calling a function in which it could sleep.

Reported-by: syzbot+8760ca6c1ee783ac4abd@syzkaller.appspotmail.com
Fixes: 314001f0bf92 ("af_unix: Add OOB support")
Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com>
Link: https://lore.kernel.org/r/20210811220652.567434-1-Rao.Shoaib@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-13 10:31:22 -07:00
Jakub Kicinski
f4083a752a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h
  9e26680733d5 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp")
  9e518f25802c ("bnxt_en: 1PPS functions to configure TSIO pins")
  099fdeda659d ("bnxt_en: Event handler for PPS events")

kernel/bpf/helpers.c
include/linux/bpf-cgroup.h
  a2baf4e8bb0f ("bpf: Fix potentially incorrect results with bpf_get_local_storage()")
  c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current")

drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
  5957cc557dc5 ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray")
  2d0b41a37679 ("net/mlx5: Refcount mlx5_irq with integer")

MAINTAINERS
  7b637cd52f02 ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo")
  7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-13 06:41:22 -07:00
Kees Cook
8c89f7b3d3 mac80211: Use flex-array for radiotap header bitmap
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
intentionally writing across neighboring fields.

The it_present member of struct ieee80211_radiotap_header is treated as a
flexible array (multiple u32s can be conditionally present). In order for
memcpy() to reason (or really, not reason) about the size of operations
against this struct, use of bytes beyond it_present need to be treated
as part of the flexible array. Add a trailing flexible array and
initialize its initial index via pointer arithmetic.

Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210806215305.2875621-1-keescook@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-08-13 09:58:25 +02:00
Kees Cook
5cafd3784a mac80211: radiotap: Use BIT() instead of shifts
IEEE80211_RADIOTAP_EXT has a value of 31, which means if shift was ever
cast to 64-bit, the result would become sign-extended. As a matter of
robustness, just replace all the open-coded shifts with BIT().

Suggested-by: David Sterba <dsterba@suse.cz>
Link: https://lore.kernel.org/lkml/20210728092323.GW5047@twin.jikos.cz/
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210806215112.2874773-1-keescook@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-08-13 09:58:25 +02:00
dingsenjie
0323689d30 mac80211: Remove unnecessary variable and label
The variable ret and label just used as return, so we delete it and
use the return statement instead of the goto statement.

Signed-off-by: dingsenjie <dingsenjie@yulong.com>
Link: https://lore.kernel.org/r/20210805064349.202148-1-dingsenjie@163.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-08-13 09:58:25 +02:00
Johannes Berg
779969e3c8 mac80211: include <linux/rbtree.h>
This is needed for the rbtree, and we shouldn't just rely
on it getting included somewhere implicitly. Include it
explicitly.

Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20210715180234.512d64dee655.Ia51c29a9fb1e651e06bc00eabec90974103d333e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-08-13 09:58:23 +02:00
Johan Almbladh
79f5962bae mac80211: Fix monitor MTU limit so that A-MSDUs get through
The maximum MTU was set to 2304, which is the maximum MSDU size. While
this is valid for normal WLAN interfaces, it is too low for monitor
interfaces. A monitor interface may receive and inject MPDU frames, and
the maximum MPDU frame size is larger than 2304. The MPDU may also
contain an A-MSDU frame, in which case the size may be much larger than
the MTU limit. Since the maximum size of an A-MSDU depends on the PHY
mode of the transmitting STA, it is not possible to set an exact MTU
limit for a monitor interface. Now the maximum MTU for a monitor
interface is unrestricted.

Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Link: https://lore.kernel.org/r/20210628123246.2070558-1-johan.almbladh@anyfinetworks.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-08-13 09:51:14 +02:00
Dan Carpenter
4a11174d6d mac80211: remove unnecessary NULL check in ieee80211_register_hw()
The address "&sband->iftype_data[i]" points to an array at the end of
struct.  It can't be NULL and so the check can be removed.

Fixes: bac2fd3d7534 ("mac80211: remove use of ieee80211_get_he_sta_cap()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/YNmgHi7Rh3SISdog@mwanda
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-08-13 09:51:03 +02:00
YueHaibing
deebea0ae3 mac80211: Reject zero MAC address in sta_info_insert_check()
As commit 52dba8d7d5ab ("mac80211: reject zero MAC address in add station")
said, we don't consider all-zeroes to be a valid MAC address in most places,
so also reject it here.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20210626130334.13624-1-yuehaibing@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-08-13 09:50:43 +02:00
Jakub Kicinski
a9a507013a Merge tag 'ieee802154-for-davem-2021-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan
Stefan Schmidt says:

====================
ieee802154 for net 2021-08-12

Mostly fixes coming from bot reports. Dongliang Mu tackled some syzkaller
reports in hwsim again and Takeshi Misawa a memory leak  in  ieee802154 raw.

* tag 'ieee802154-for-davem-2021-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan:
  net: Fix memory leak in ieee802154_raw_deliver
  ieee802154: hwsim: fix GPF in hwsim_new_edge_nl
  ieee802154: hwsim: fix GPF in hwsim_set_edge_lqi
====================

Link: https://lore.kernel.org/r/20210812183912.1663996-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-12 11:50:17 -07:00
Longpeng(Mike)
49b0b6ffe2 vsock/virtio: avoid potential deadlock when vsock device remove
There's a potential deadlock case when remove the vsock device or
process the RESET event:

  vsock_for_each_connected_socket:
      spin_lock_bh(&vsock_table_lock) ----------- (1)
      ...
          virtio_vsock_reset_sock:
              lock_sock(sk) --------------------- (2)
      ...
      spin_unlock_bh(&vsock_table_lock)

lock_sock() may do initiative schedule when the 'sk' is owned by
other thread at the same time, we would receivce a warning message
that "scheduling while atomic".

Even worse, if the next task (selected by the scheduler) try to
release a 'sk', it need to request vsock_table_lock and the deadlock
occur, cause the system into softlockup state.
  Call trace:
   queued_spin_lock_slowpath
   vsock_remove_bound
   vsock_remove_sock
   virtio_transport_release
   __vsock_release
   vsock_release
   __sock_release
   sock_close
   __fput
   ____fput

So we should not require sk_lock in this case, just like the behavior
in vhost_vsock or vmci.

Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko")
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20210812053056.1699-1-longpeng2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-12 10:57:27 -07:00
Vladimir Oltean
724395f4dc net: dsa: tag_8021q: don't broadcast during setup/teardown
Currently, on my board with multiple sja1105 switches in disjoint trees
described in commit f66a6a69f97a ("net: dsa: permit cross-chip bridging
between all trees in the system"), rebooting the board triggers the
following benign warnings:

[   12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
[   12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
[   12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
[   12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
[   12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
[   12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT

Basically switch 1 calls dsa_tag_8021q_unregister, and switch 1's TX and
RX VLANs cannot be found on switch 2's CPU port.

But why would switch 2 even attempt to delete switch 1's TX and RX
tag_8021q VLANs from its CPU port? Well, because we use dsa_broadcast,
and it is supposed that it had added those VLANs in the first place
(because in dsa_port_tag_8021q_vlan_match, all CPU ports match
regardless of their tree index or switch index).

The two trees probe asynchronously, and when switch 1 probed, it called
dsa_broadcast which did not notify the tree of switch 2, because that
didn't probe yet. But during unbind, switch 2's tree _is_ probed, so it
_is_ notified of the deletion.

Before jumping to introduce a synchronization mechanism between the
probing across disjoint switch trees, let's take a step back and see
whether we _need_ to do that in the first place.

The RX and TX VLANs of switch 1 would be needed on switch 2's CPU port
only if switch 1 and 2 were part of a cross-chip bridge. And
dsa_tag_8021q_bridge_join takes care precisely of that (but if probing
was synchronous, the bridge_join would just end up bumping the VLANs'
refcount, because they are already installed by the setup path).

Since by the time the ports are bridged, all DSA trees are already set
up, and we don't need the tag_8021q VLANs of one switch installed on the
other switches during probe time, the answer is that we don't need to
fix the synchronization issue.

So make the setup and teardown code paths call dsa_port_notify, which
notifies only the local tree, and the bridge code paths call
dsa_broadcast, which let the other trees know as well.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-12 11:46:21 +01:00
Vladimir Oltean
ab97462beb net: dsa: print more information when a cross-chip notifier fails
Currently this error message does not say a lot:

[   32.693498] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[   32.699725] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[   32.705931] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[   32.712139] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[   32.718347] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[   32.724554] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT

but in this form, it is immediately obvious (at least to me) what the
problem is, even without further looking at the code:

[   12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
[   12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
[   12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
[   12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
[   12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
[   12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-12 11:46:21 +01:00
Nick Richardson
769afb3fda pktgen: Add output for imix results
The bps for imix mode is calculated by:
sum(imix_entry.size) / time_elapsed

The actual counts of each imix_entry are displayed under the
"Current:" section of the interface output in the following format:
imix_size_counts: size_1,count_1 size_2,count_2 ... size_n,count_n

Example (count = 200000):
imix_weights: 256,1 859,3 205,2
imix_size_counts: 256,32082 859,99796 205,68122
Result: OK: 17992362(c17964678+d27684) usec, 200000 (859byte,0frags)
  11115pps 47Mb/sec (47977140bps) errors: 0

Summary of changes:
Calculate bps based on imix counters when in IMIX mode.
Add output for IMIX counters.

Signed-off-by: Nick Richardson <richardsonnick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-12 10:50:21 +01:00
Nick Richardson
9014903132 pktgen: Add imix distribution bins
In order to represent the distribution of imix packet sizes, a
pre-computed data structure is used. It features 100 (IMIX_PRECISION)
"bins". Contiguous ranges of these bins represent the respective
packet size of each imix entry. This is done to avoid the overhead of
selecting the correct imix packet size based on the corresponding weights.

Example:
imix_weights 40,7 576,4 1500,1
total_weight = 7 + 4 + 1 = 12

pkt_size 40 occurs 7/total_weight = 58% of the time
pkt_size 576 occurs 4/total_weight = 33% of the time
pkt_size 1500 occurs 1/total_weight = 9% of the time

We generate a random number between 0-100 and select the corresponding
packet size based on the specified weights.
Eg. random number = 358723895 % 100 = 65
Selects the packet size corresponding to index:65 in the pre-computed
imix_distribution array.
An example of the  pre-computed array is below:

The imix_distribution will look like the following:
0        ->  0 (index of imix_entry.size == 40)
1        ->  0 (index of imix_entry.size == 40)
2        ->  0 (index of imix_entry.size == 40)
[...]    ->  0 (index of imix_entry.size == 40)
57       ->  0 (index of imix_entry.size == 40)
58       ->  1 (index of imix_entry.size == 576)
[...]    ->  1 (index of imix_entry.size == 576)
90       ->  1 (index of imix_entry.size == 576)
91       ->  2 (index of imix_entry.size == 1500)
[...]    ->  2 (index of imix_entry.size == 1500)
99       ->  2 (index of imix_entry.size == 1500)

Create and use "bin" representation of the imix distribution.

Signed-off-by: Nick Richardson <richardsonnick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-12 10:50:20 +01:00
Nick Richardson
52a62f8603 pktgen: Parse internet mix (imix) input
Adds "imix_weights" command for specifying internet mix distribution.

The command is in this format:
"imix_weights size_1,weight_1 size_2,weight_2 ... size_n,weight_n"
where the probability that packet size_i is picked is:
weight_i / (weight_1 + weight_2 + .. + weight_n)

The user may provide up to 100 imix entries (size_i,weight_i) in this
command.

The user specified imix entries will be displayed in the "Params"
section of the interface output.

Values for clone_skb > 0 is not supported in IMIX mode.

Summary of changes:
Add flag for enabling internet mix mode.
Add command (imix_weights) for internet mix input.
Return -ENOTSUPP when clone_skb > 0 in IMIX mode.
Display imix_weights in Params.
Create data structures to store imix entries and distribution.

Signed-off-by: Nick Richardson <richardsonnick@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-12 10:50:20 +01:00
Hoang Le
86704993e6 Revert "tipc: Return the correct errno code"
This reverts commit 0efea3c649f0 because of:
- The returning -ENOBUF error is fine on socket buffer allocation.
- There is side effect in the calling path
tipc_node_xmit()->tipc_link_xmit() when checking error code returning.

Fixes: 0efea3c649f0 ("tipc: Return the correct errno code")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-12 09:44:31 +01:00
Nikolay Aleksandrov
6c4110d9f4 net: bridge: vlan: fix global vlan option range dumping
When global vlan options are equal sequentially we compress them in a
range to save space and reduce processing time. In order to have the
proper range end id we need to update range_end if the options are equal
otherwise we get ranges with the same end vlan id as the start.

Fixes: 743a53d9636a ("net: bridge: vlan: add support for dumping global vlan options")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210810092139.11700-1-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11 16:02:34 -07:00
Jeremy Kerr
83f0a0b728 mctp: Specify route types, require rtm_type in RTM_*ROUTE messages
This change adds a 'type' attribute to routes, which can be parsed from
a RTM_NEWROUTE message. This will help to distinguish local vs. peer
routes in a future change.

This means userspace will need to set a correct rtm_type in RTM_NEWROUTE
and RTM_DELROUTE messages; we currently only accept RTN_UNICAST.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://lore.kernel.org/r/20210810023834.2231088-1-jk@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11 16:01:17 -07:00
Eric Dumazet
b69dd5b378 net: igmp: increase size of mr_ifc_count
Some arches support cmpxchg() on 4-byte and 8-byte only.
Increase mr_ifc_count width to 32bit to fix this problem.

Fixes: 4a2b285e7e10 ("net: igmp: fix data-race in igmp_ifc_timer_expire()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20210811195715.3684218-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11 15:54:10 -07:00
Neal Cardwell
6de035fec0 tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets
Currently if BBR congestion control is initialized after more than 2B
packets have been delivered, depending on the phase of the
tp->delivered counter the tracking of BBR round trips can get stuck.

The bug arises because if tp->delivered is between 2^31 and 2^32 at
the time the BBR congestion control module is initialized, then the
initialization of bbr->next_rtt_delivered to 0 will cause the logic to
believe that the end of the round trip is still billions of packets in
the future. More specifically, the following check will fail
repeatedly:

  !before(rs->prior_delivered, bbr->next_rtt_delivered)

and thus the connection will take up to 2B packets delivered before
that check will pass and the connection will set:

  bbr->round_start = 1;

This could cause many mechanisms in BBR to fail to trigger, for
example bbr_check_full_bw_reached() would likely never exit STARTUP.

This bug is 5 years old and has not been observed, and as a practical
matter this would likely rarely trigger, since it would require
transferring at least 2B packets, or likely more than 3 terabytes of
data, before switching congestion control algorithms to BBR.

This patch is a stable candidate for kernels as far back as v4.9,
when tcp_bbr.c was added.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Kevin Yang <yyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210811024056.235161-1-ncardwell@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11 15:00:15 -07:00
Willy Tarreau
6922110d15 net: linkwatch: fix failure to restore device state across suspend/resume
After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed
that my Ethernet port to which a bond and a VLAN interface are attached
appeared to remain up after resuming from suspend with the cable unplugged
(and that problem still persists with 5.10-LTS).

It happens that the following happens:

  - the network driver (e1000e here) prepares to suspend, calls e1000e_down()
    which calls netif_carrier_off() to signal that the link is going down.
  - netif_carrier_off() adds a link_watch event to the list of events for
    this device
  - the device is completely stopped.
  - the machine suspends
  - the cable is unplugged and the machine brought to another location
  - the machine is resumed
  - the queued linkwatch events are processed for the device
  - the device doesn't yet have the __LINK_STATE_PRESENT bit and its events
    are silently dropped
  - the device is resumed with its link down
  - the upper VLAN and bond interfaces are never notified that the link had
    been turned down and remain up
  - the only way to provoke a change is to physically connect the machine
    to a port and possibly unplug it.

The state after resume looks like this:
  $ ip -br li | egrep 'bond|eth'
  bond0            UP             e8:6a:64:64:64:64 <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP>
  eth0             DOWN           e8:6a:64:64:64:64 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP>
  eth0.2@eth0      UP             e8:6a:64:64:64:64 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>

Placing an explicit call to netdev_state_change() either in the suspend
or the resume code in the NIC driver worked around this but the solution
is not satisfying.

The issue in fact really is in link_watch that loses events while it
ought not to. It happens that the test for the device being present was
added by commit 124eee3f6955 ("net: linkwatch: add check for netdevice
being present to linkwatch_do_dev") in 4.20 to avoid an access to
devices that are not present.

Instead of dropping events, this patch proceeds slightly differently by
postponing their handling so that they happen after the device is fully
resumed.

Fixes: 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev")
Link: https://lists.openwall.net/netdev/2018/03/15/62
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20210809160628.22623-1-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11 14:43:16 -07:00
Vladimir Oltean
a72808b658 net: dsa: create a helper for locating EtherType DSA headers on TX
Create a similar helper for locating the offset to the DSA header
relative to skb->data, and make the existing EtherType header taggers to
use it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:44:58 +01:00
Vladimir Oltean
5d928ff486 net: dsa: create a helper for locating EtherType DSA headers on RX
It seems that protocol tagging driver writers are always surprised about
the formula they use to reach their EtherType header on RX, which
becomes apparent from the fact that there are comments in multiple
drivers that mention the same information.

Create a helper that returns a void pointer to skb->data - 2, as well as
centralize the explanation why that is the case.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:44:58 +01:00
Vladimir Oltean
6bef794da6 net: dsa: create a helper which allocates space for EtherType DSA headers
Hide away the memmove used by DSA EtherType header taggers to shift the
MAC SA and DA to the left to make room for the header, after they've
called skb_push(). The call to skb_push() is still left explicit in
drivers, to be symmetric with dsa_strip_etype_header, and because not
all callers can be refactored to do it (for example, brcm_tag_xmit_ll
has common code for a pre-Ethernet DSA tag and an EtherType DSA tag).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:44:58 +01:00
Vladimir Oltean
f1dacd7aea net: dsa: create a helper that strips EtherType DSA headers on RX
All header taggers open-code a memmove that is fairly not all that
obvious, and we can hide the details behind a helper function, since the
only thing specific to the driver is the length of the header tag.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:44:58 +01:00
Parav Pandit
9c4a7665b4 devlink: Add APIs to publish, unpublish individual parameter
Enable drivers to publish/unpublish individual parameter.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:34:21 +01:00
Parav Pandit
b40c51efef devlink: Add API to register and unregister single parameter
Currently device configuration parameters can be registered as an array.
Due to this a constant array must be registered. A single driver
supporting multiple devices each with different device capabilities end
up registering all parameters even if it doesn't support it.

One possible workaround a driver can do is, it registers multiple single
entry arrays to overcome such limitation.

Better is to provide a API that enables driver to register/unregister a
single parameter. This also further helps in two ways.
(1) to reduce the memory of devlink_param_entry by avoiding in registering
parameters which are not supported by the device.
(2) avoid generating multiple parameter add, delete, publish, unpublish,
init value notifications for such unsupported parameters

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:34:21 +01:00
Parav Pandit
699784f7b7 devlink: Create a helper function for one parameter registration
Create and use a helper function for one parameter registration.
Subsequent patch also will reuse this for driver facing routine to
register a single parameter.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:34:21 +01:00
Parav Pandit
076b2a9dbb devlink: Add new "enable_vnet" generic device param
Add new device generic parameter to enable/disable creation of
VDPA net auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
              name enable_vnet value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
VDPA net functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:34:21 +01:00
Parav Pandit
8ddaabee3c devlink: Add new "enable_rdma" generic device param
Add new device generic parameter to enable/disable creation of
RDMA auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
              name enable_rdma value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
RDMA functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:34:21 +01:00
Parav Pandit
f13a5ad881 devlink: Add new "enable_eth" generic device param
Add new device generic parameter to enable/disable creation of
Ethernet auxiliary device and associated device functionality
in the devlink instance.

User who prefers to disable such functionality can disable it using below
example.

$ devlink dev param set pci/0000:06:00.0 \
              name enable_eth value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0

At this point devlink instance do not create auxiliary device for the
Ethernet functionality.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 14:34:21 +01:00
Nikolay Aleksandrov
dc002875c2 net: bridge: vlan: use br_rports_fill_info() to export mcast router ports
Embed the standard multicast router port export by br_rports_fill_info()
into a new global vlan attribute BRIDGE_VLANDB_GOPTS_MCAST_ROUTER_PORTS.
In order to have the same format for the global bridge mcast context and
the per-vlan mcast context we need a double-nesting:
 - BRIDGE_VLANDB_GOPTS_MCAST_ROUTER_PORTS
   - MDBA_ROUTER

Currently we don't compare router lists, if any router port exists in
the bridge mcast contexts we consider their option sets as different and
export them separately.

In addition we export the router port vlan id when dumping similar to
the router port notification format.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 13:34:41 +01:00
Nikolay Aleksandrov
e04d377ff6 net: bridge: mcast: use the proper multicast context when dumping router ports
When we are dumping the router ports of a vlan mcast context we need to
use the bridge/vlan and port/vlan's multicast contexts to check if
IPv4/IPv6 router port is present and later to dump the vlan id.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 13:34:41 +01:00
Nikolay Aleksandrov
a97df080b6 net: bridge: vlan: add support for mcast router global option
Add support to change and retrieve global vlan multicast router state
which is used for the bridge itself. We just need to pass multicast context
to br_multicast_set_router instead of bridge device and the rest of the
logic remains the same.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 13:34:41 +01:00
Nikolay Aleksandrov
62938182c3 net: bridge: vlan: add support for mcast querier global option
Add support to change and retrieve global vlan multicast querier state.
We just need to pass multicast context to br_multicast_set_querier
instead of bridge device and the rest of the logic remains the same.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 13:34:41 +01:00
Nikolay Aleksandrov
cb486ce995 net: bridge: mcast: querier and query state affect only current context type
It is a minor optimization and better behaviour to make sure querier and
query sending routines affect only the matching multicast context
depending if vlan snooping is enabled (vlan ctx vs bridge ctx).
It also avoids sending unnecessary extra query packets.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 13:34:41 +01:00
Nikolay Aleksandrov
4d5b4e84c7 net: bridge: mcast: move querier state to the multicast context
We need to have the querier state per multicast context in order to have
per-vlan control, so remove the internal option bit and move it to the
multicast context. Also annotate the lockless reads of the new variable.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 13:34:41 +01:00
Nikolay Aleksandrov
941121ee22 net: bridge: vlan: add support for mcast startup query interval global option
Add support to change and retrieve global vlan multicast startup query
interval option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 13:34:41 +01:00
Nikolay Aleksandrov
425214508b net: bridge: vlan: add support for mcast query response interval global option
Add support to change and retrieve global vlan multicast query response
interval option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 13:34:41 +01:00
Nikolay Aleksandrov
d6c08aba4f net: bridge: vlan: add support for mcast query interval global option
Add support to change and retrieve global vlan multicast query interval
option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 13:34:41 +01:00
Nikolay Aleksandrov
cd9269d463 net: bridge: vlan: add support for mcast querier interval global option
Add support to change and retrieve global vlan multicast querier interval
option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 13:34:41 +01:00
Nikolay Aleksandrov
2da0aea21f net: bridge: vlan: add support for mcast membership interval global option
Add support to change and retrieve global vlan multicast membership
interval option.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11 13:34:41 +01:00