16535 Commits

Author SHA1 Message Date
Chuck Lever
2fd5532044 net/handshake: Add a kernel API for requesting a TLSv1.3 handshake
To enable kernel consumers of TLS to request a TLS handshake, add
support to net/handshake/ to request a handshake upcall.

This patch also acts as a template for adding handshake upcall
support for other kernel transport layer security providers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19 18:48:48 -07:00
Jakub Kicinski
eb6fba7555 net: skbuff: hide wifi_acked when CONFIG_WIRELESS not set
Datacenter kernel builds will very likely not include WIRELESS,
so let them shave 2 bits off the skb by hiding the wifi fields.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19 13:04:30 +01:00
Tzung-Bi Shih
73db1b8f2b netfilter: conntrack: fix wrong ct->timeout value
(struct nf_conn)->timeout is an interval before the conntrack
confirmed.  After confirmed, it becomes a timestamp.

It is observed that timeout of an unconfirmed conntrack:
- Set by calling ctnetlink_change_timeout(). As a result,
  `nfct_time_stamp` was wrongly added to `ct->timeout` twice.
- Get by calling ctnetlink_dump_timeout(). As a result,
  `nfct_time_stamp` was wrongly subtracted.

Call Trace:
 <TASK>
 dump_stack_lvl
 ctnetlink_dump_timeout
 __ctnetlink_glue_build
 ctnetlink_glue_build
 __nfqnl_enqueue_packet
 nf_queue
 nf_hook_slow
 ip_mc_output
 ? __pfx_ip_finish_output
 ip_send_skb
 ? __pfx_dst_output
 udp_send_skb
 udp_sendmsg
 ? __pfx_ip_generic_getfrag
 sock_sendmsg

Separate the 2 cases in:
- Setting `ct->timeout` in __nf_ct_set_timeout().
- Getting `ct->timeout` in ctnetlink_dump_timeout().

Pablo appends:

Update ctnetlink to set up the timeout _after_ the IPS_CONFIRMED flag is
set on, otherwise conntrack creation via ctnetlink breaks.

Note that the problem described in this patch occurs since the
introduction of the nfnetlink_queue conntrack support, select a
sufficiently old Fixes: tag for -stable kernel to pick up this fix.

Fixes: a4b4766c3ceb ("netfilter: nfnetlink_queue: rename related to nfqueue attaching conntrack info")
Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-19 12:08:38 +02:00
Jakub Kicinski
92e8c732d8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Unbreak br_netfilter physdev match support, from Florian Westphal.

2) Use GFP_KERNEL_ACCOUNT for stateful/policy objects, from Chen Aotian.

3) Use IS_ENABLED() in nf_reset_trace(), from Florian Westphal.

4) Fix validation of catch-all set element.

5) Tighten requirements for catch-all set elements.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements
  netfilter: nf_tables: validate catch-all set elements
  netfilter: nf_tables: fix ifdef to also consider nf_tables=m
  netfilter: nf_tables: Modify nla_memdup's flag to GFP_KERNEL_ACCOUNT
  netfilter: br_netfilter: fix recent physdev match breakage
====================

Link: https://lore.kernel.org/r/20230418145048.67270-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-18 20:46:31 -07:00
Felix Fietkau
9ae708f001 wifi: mac80211: remove ieee80211_tx_status_8023
It is unused and should not be used. In order to avoid limitations in
4-address mode, the driver should always use ieee80211_tx_status_ext for
802.3 frames with a valid sta pointer.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20230417133751.79160-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-04-18 14:48:01 +02:00
Heiner Kallweit
cb18e5595d net: add macro netif_subqueue_completed_wake
Add netif_subqueue_completed_wake, complementing the subqueue versions
netif_subqueue_try_stop and netif_subqueue_maybe_stop.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-18 12:59:01 +02:00
Pablo Neira Ayuso
d46fc89414 netfilter: nf_tables: validate catch-all set elements
catch-all set element might jump/goto to chain that uses expressions
that require validation.

Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-18 09:12:22 +02:00
Xin Long
bd4b281894 sctp: delete the obsolete code for the host name address param
In the latest RFC9260, the Host Name Address param has been deprecated.
For INIT chunk:

  Note 3: An INIT chunk MUST NOT contain the Host Name Address
  parameter.  The receiver of an INIT chunk containing a Host Name
  Address parameter MUST send an ABORT chunk and MAY include an
  "Unresolvable Address" error cause.

For Supported Address Types:

  The value indicating the Host Name Address parameter MUST NOT be
  used when sending this parameter and MUST be ignored when receiving
  this parameter.

Currently Linux SCTP doesn't really support Host Name Address param,
but only saves some flag and print debug info, which actually won't
even be triggered due to the verification in sctp_verify_param().
This patch is to delete those dead code.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-17 08:28:20 +01:00
Jakub Kicinski
8c48eea3ad page_pool: allow caching from safely localized NAPI
Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.

Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).

If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.

Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.

With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).

The CPU use of relevant functions decreases as expected:

  page_pool_refill_alloc_cache   1.17% -> 0%
  _raw_spin_lock                 2.41% -> 0.98%

Only consider lockless path to be safe when NAPI is scheduled
- in practice this should cover majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.

The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-14 18:56:12 -07:00
Haiyang Zhang
80f6215b45 net: mana: Add support for jumbo frame
During probe, get the hardware-allowed max MTU by querying the device
configuration. Users can select MTU up to the device limit.
When XDP is in use, limit MTU settings so the buffer size is within
one page. And, when MTU is set to a too large value, XDP is not allowed
to run.
Also, to prevent changing MTU fails, and leaves the NIC in a bad state,
pre-allocate all buffers before starting the change. So in low memory
condition, it will return error, without affecting the NIC.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 08:56:19 +01:00
Haiyang Zhang
2fbbd712ba net: mana: Enable RX path to handle various MTU sizes
Update RX data path to allocate and use RX queue DMA buffers with
proper size based on potentially various MTU sizes.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 08:56:19 +01:00
Haiyang Zhang
a2917b2349 net: mana: Refactor RX buffer allocation code to prepare for various MTU
Move out common buffer allocation code from mana_process_rx_cqe() and
mana_alloc_rx_wqe() to helper functions.
Refactor related variables so they can be changed in one place, and buffer
sizes are in sync.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-14 08:56:19 +01:00
Vladimir Oltean
f62af20bed net/sched: mqprio: allow per-TC user input of FP adminStatus
IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each
packet priority can be assigned to a "frame preemption status" value of
either "express" or "preemptible". Express priorities are transmitted by
the local device through the eMAC, and preemptible priorities through
the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge
layer).

The FP adminStatus is defined per packet priority, but 802.1Q clause
12.30.1.1.1 framePreemptionAdminStatus also says that:

| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.

It is impossible to ignore the cognitive dissonance in the standard
here, because it practically means that the FP adminStatus only takes
distinct values per traffic class, even though it is defined per
priority.

I can see no valid use case which is prevented by having the kernel take
the FP adminStatus as input per traffic class (what we do here).
In addition, this also enforces the above constraint by construction.
User space network managers which wish to expose FP adminStatus per
priority are free to do so; they must only observe the prio_tc_map of
the netdev (which presumably is also under their control, when
constructing the mqprio netlink attributes).

The reason for configuring frame preemption as a property of the Qdisc
layer is that the information about "preemptible TCs" is closest to the
place which handles the num_tc and prio_tc_map of the netdev. If the
UAPI would have been any other layer, it would be unclear what to do
with the FP information when num_tc collapses to 0. A key assumption is
that only mqprio/taprio change the num_tc and prio_tc_map of the netdev.
Not sure if that's a great assumption to make.

Having FP in tc-mqprio can be seen as an implementation of the use case
defined in 802.1Q Annex S.2 "Preemption used in isolation". There will
be a separate implementation of FP in tc-taprio, for the other use
cases.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 22:22:10 -07:00
Vladimir Oltean
c54876cd59 net/sched: pass netlink extack to mqprio and taprio offload
With the multiplexed ndo_setup_tc() model which lacks a first-class
struct netlink_ext_ack * argument, the only way to pass the netlink
extended ACK message down to the device driver is to embed it within the
offload structure.

Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload.

Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload
structure, and since device drivers might effectively reuse their mqprio
implementation for the mqprio portion of taprio, we make taprio set the
extack in both offload structures to point at the same netlink extack
message.

In fact, the taprio handling is a bit more tricky, for 2 reasons.

First is because the offload structure has a longer lifetime than the
extack structure. The driver is supposed to populate the extack
synchronously from ndo_setup_tc() and leave it alone afterwards.
To not have any use-after-free surprises, we zero out the extack pointer
when we leave taprio_enable_offload().

The second reason is because taprio does overwrite the extack message on
ndo_setup_tc() error. We need to switch to the weak form of setting an
extack message, which preserves a potential message set by the driver.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 22:22:10 -07:00
Jakub Kicinski
c2865b1122 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZDhSiwAKCRDbK58LschI
 g8cbAQCH4xrquOeDmYyGXFQGchHZAIj++tKg8ABU4+hYeJtrlwEA6D4W6wjoSZRk
 mLSptZ9qro8yZA86BvyPvlBT1h9ELQA=
 =StAc
 -----END PGP SIGNATURE-----

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-04-13

We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).

The main changes are:

1) Rework BPF verifier log behavior and implement it as a rotating log
   by default with the option to retain old-style fixed log behavior,
   from Andrii Nakryiko.

2) Adds support for using {FOU,GUE} encap with an ipip device operating
   in collect_md mode and add a set of BPF kfuncs for controlling encap
   params, from Christian Ehrig.

3) Allow BPF programs to detect at load time whether a particular kfunc
   exists or not, and also add support for this in light skeleton,
   from Alexei Starovoitov.

4) Optimize hashmap lookups when key size is multiple of 4,
   from Anton Protopopov.

5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
   tasks to be stored in BPF maps, from David Vernet.

6) Add support for stashing local BPF kptr into a map value via
   bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
   for new cgroups, from Dave Marchevsky.

7) Fix BTF handling of is_int_ptr to skip modifiers to work around
   tracing issues where a program cannot be attached, from Feng Zhou.

8) Migrate a big portion of test_verifier unit tests over to
   test_progs -a verifier_* via inline asm to ease {read,debug}ability,
   from Eduard Zingerman.

9) Several updates to the instruction-set.rst documentation
   which is subject to future IETF standardization
   (https://lwn.net/Articles/926882/), from Dave Thaler.

10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
    known bits information propagation, from Daniel Borkmann.

11) Add skb bitfield compaction work related to BPF with the overall goal
    to make more of the sk_buff bits optional, from Jakub Kicinski.

12) BPF selftest cleanups for build id extraction which stand on its own
    from the upcoming integration work of build id into struct file object,
    from Jiri Olsa.

13) Add fixes and optimizations for xsk descriptor validation and several
    selftest improvements for xsk sockets, from Kal Conley.

14) Add BPF links for struct_ops and enable switching implementations
    of BPF TCP cong-ctls under a given name by replacing backing
    struct_ops map, from Kui-Feng Lee.

15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
    offset stack read as earlier Spectre checks cover this,
    from Luis Gerhorst.

16) Fix issues in copy_from_user_nofault() for BPF and other tracers
    to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
    and Alexei Starovoitov.

17) Add --json-summary option to test_progs in order for CI tooling to
    ease parsing of test results, from Manu Bretelle.

18) Batch of improvements and refactoring to prep for upcoming
    bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
    from Martin KaFai Lau.

19) Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations,
    from Quentin Monnet.

20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
    the module name from BTF of the target and searching kallsyms of
    the correct module, from Viktor Malik.

21) Improve BPF verifier handling of '<const> <cond> <non_const>'
    to better detect whether in particular jmp32 branches are taken,
    from Yonghong Song.

22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
    A built-in cc or one from a kernel module is already able to write
    to app_limited, from Yixin Shen.

Conflicts:

Documentation/bpf/bpf_devel_QA.rst
  b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
  0f10f647f455 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/

include/net/ip_tunnels.h
  bc9d003dc48c3 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
  ac931d4cdec3d ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/

net/bpf/test_run.c
  e5995bc7e2ba ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
  294635a8165a ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================

Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 16:43:38 -07:00
Jakub Kicinski
800e68c44f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

tools/testing/selftests/net/config
  62199e3f1658 ("selftests: net: Add VXLAN MDB test")
  3a0385be133e ("selftests: add the missing CONFIG_IP_SCTP in net config")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 16:04:28 -07:00
Jesper Dangaard Brouer
67f245c2ec mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type
via mapping table.

The mlx5 hardware can also identify and RSS hash IPSEC.  This indicate
hash includes SPI (Security Parameters Index) as part of IPSEC hash.

Extend xdp core enum xdp_rss_hash_type with IPSEC hash type.

Fixes: bc8d405b1ba9 ("net/mlx5e: Support RX XDP metadata")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132892548.340624.11185734579430124869.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13 11:15:10 -07:00
Jesper Dangaard Brouer
0cd917a4a8 xdp: rss hash types representation
The RSS hash type specifies what portion of packet data NIC hardware used
when calculating RSS hash value. The RSS types are focused on Internet
traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often get hash
value zero and no RSS type. For L3 focused on IPv4 vs. IPv6, and L4
primarily TCP vs UDP, but some hardware supports SCTP.

Hardware RSS types are differently encoded for each hardware NIC. Most
hardware represent RSS hash type as a number. Determining L3 vs L4 often
requires a mapping table as there often isn't a pattern or sorting
according to ISO layer.

The patch introduce a XDP RSS hash type (enum xdp_rss_hash_type) that
contains both BITs for the L3/L4 types, and combinations to be used by
drivers for their mapping tables. The enum xdp_rss_type_bits get exposed
to BPF via BTF, and it is up to the BPF-programmer to match using these
defines.

This proposal change the kfunc API bpf_xdp_metadata_rx_hash() adding
a pointer value argument for provide the RSS hash type.
Change signature for all xmo_rx_hash calls in drivers to make it compile.

The RSS type implementations for each driver comes as separate patches.

Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13 11:15:10 -07:00
Johannes Berg
d00800a289 wifi: mac80211: add flush_sta method
Some drivers like iwlwifi might have per-STA queues, so we
may want to flush/drop just those queues rather than all
when removing a station. Add a separate method for that.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-04-13 16:32:31 +02:00
Christian Ehrig
c50e96099e bpf,fou: Add bpf_skb_{set,get}_fou_encap kfuncs
Add two new kfuncs that allow a BPF tc-hook, installed on an ipip
device in collect-metadata mode, to control FOU encap parameters on a
per-packet level. The set of kfuncs is registered with the fou module.

The bpf_skb_set_fou_encap kfunc is supposed to be used in tandem and after
a successful call to the bpf_skb_set_tunnel_key bpf-helper. UDP source and
destination ports can be controlled by passing a struct bpf_fou_encap. A
source port of zero will auto-assign a source port. enum bpf_fou_encap_type
is used to specify if the egress path should FOU or GUE encap the packet.

On the ingress path bpf_skb_get_fou_encap can be used to read UDP source
and destination ports from the receiver's point of view and allows for
packet multiplexing across different destination ports within a single
BPF program and ipip device.

Signed-off-by: Christian Ehrig <cehrig@cloudflare.com>
Link: https://lore.kernel.org/r/e17c94a646b63e78ce0dbf3f04b2c33dc948a32d.1680874078.git.cehrig@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-12 16:40:39 -07:00
Christian Ehrig
ac931d4cde ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices
Today ipip devices in collect-metadata mode don't allow for sending FOU
or GUE encapsulated packets. This patch lifts the restriction by adding
a struct ip_tunnel_encap to the tunnel metadata.

On the egress path, the members of this struct can be set by the
bpf_skb_set_fou_encap kfunc via a BPF tc-hook. Instead of dropping packets
wishing to use additional UDP encapsulation, ip_md_tunnel_xmit now
evaluates the contents of this struct and adds the corresponding FOU or
GUE header. Furthermore, it is making sure that additional header bytes
are taken into account for PMTU discovery.

On the ingress path, an ipip device in collect-metadata mode will fill this
struct and a BPF tc-hook can obtain the information via a call to the
bpf_skb_get_fou_encap kfunc.

The minor change to ip_tunnel_encap, which now takes a pointer to
struct ip_tunnel_encap instead of struct ip_tunnel, allows us to control
FOU encap type and parameters on a per packet-level.

Signed-off-by: Christian Ehrig <cehrig@cloudflare.com>
Link: https://lore.kernel.org/r/cfea47de655d0f870248abf725932f851b53960a.1680874078.git.cehrig@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-12 16:40:39 -07:00
Jakub Kicinski
301f227fc8 net: piggy back on the memory barrier in bql when waking queues
Drivers call netdev_tx_completed_queue() right before
netif_txq_maybe_wake(). If BQL is enabled netdev_tx_completed_queue()
should issue a memory barrier, so we can depend on that separating
the stop check from the consumer index update, instead of adding
another barrier in netif_txq_maybe_wake().

This matters more than the barriers on the xmit path, because
the wake condition is almost always true. So we issue the
consumer side barrier often.

Wrap netdev_tx_completed_queue() in a local helper to issue
the barrier even if BQL is disabled. Keep the same semantics
as netdev_tx_completed_queue() (barrier only if bytes != 0)
to make it clear that the barrier is conditional.

Plus since macro gets pkt/byte counts as arguments now -
we can skip waking if there were no packets completed.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10 17:56:18 -07:00
Jakub Kicinski
c91c46de6b net: provide macros for commonly copied lockless queue stop/wake code
A lot of drivers follow the same scheme to stop / start queues
without introducing locks between xmit and NAPI tx completions.
I'm guessing they all copy'n'paste each other's code.
The original code dates back all the way to e1000 and Linux 2.6.19.

Smaller drivers shy away from the scheme and introduce a lock
which may cause deadlocks in netpoll.

Provide macros which encapsulate the necessary logic.

The macros do not prevent false wake ups, the extra barrier
required to close that race is not worth it. See discussion in:
https://lore.kernel.org/all/c39312a2-4537-14b4-270c-9fe1fbb91e89@gmail.com/

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10 17:56:18 -07:00
Luiz Augusto von Dentz
b62e72200e Bluetooth: Fix printing errors if LE Connection times out
This fixes errors like bellow when LE Connection times out since that
is actually not a controller error:

 Bluetooth: hci0: Opcode 0x200d failed: -110
 Bluetooth: hci0: request failed to create LE connection: err -110

Instead the code shall properly detect if -ETIMEDOUT is returned and
send HCI_OP_LE_CREATE_CONN_CANCEL to give up on the connection.

Link: https://github.com/bluez/bluez/issues/340
Fixes: 8e8b92ee60de ("Bluetooth: hci_sync: Add hci_le_create_conn_sync")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-10 10:21:33 -07:00
Vladimir Oltean
5a17818682 net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub
There was a sort of rush surrounding commit 88c0a6b503b7 ("net: create a
netdev notifier for DSA to reject PTP on DSA master"), due to a desire
to convert DSA's attempt to deny TX timestamping on a DSA master to
something that doesn't block the kernel-wide API conversion from
ndo_eth_ioctl() to ndo_hwtstamp_set().

What was required was a mechanism that did not depend on ndo_eth_ioctl(),
and what was provided was a mechanism that did not depend on
ndo_eth_ioctl(), while at the same time introducing something that
wasn't absolutely necessary - a new netdev notifier.

There have been objections from Jakub Kicinski that using notifiers in
general when they are not absolutely necessary creates complications to
the control flow and difficulties to maintainers who look at the code.
So there is a desire to not use notifiers.

In addition to that, the notifier chain gets called even if there is no
DSA in the system and no one is interested in applying any restriction.

Take the model of udp_tunnel_nic_ops and introduce a stub mechanism,
through which net/core/dev_ioctl.c can call into DSA even when
CONFIG_NET_DSA=m.

Compared to the code that existed prior to the notifier conversion, aka
what was added in commits:
- 4cfab3566710 ("net: dsa: Add wrappers for overloaded ndo_ops")
- 3369afba1e46 ("net: Call into DSA netdevice_ops wrappers")

this is different because we are not overloading any struct
net_device_ops of the DSA master anymore, but rather, we are exposing a
rather specific functionality which is orthogonal to which API is used
to enable it - ndo_eth_ioctl() or ndo_hwtstamp_set().

Also, what is similar is that both approaches use function pointers to
get from built-in code to DSA.

There is no point in replicating the function pointers towards
__dsa_master_hwtstamp_validate() once for every CPU port (dev->dsa_ptr).
Instead, it is sufficient to introduce a singleton struct dsa_stubs,
built into the kernel, which contains a single function pointer to
__dsa_master_hwtstamp_validate().

I find this approach preferable to what we had originally, because
dev->dsa_ptr->netdev_ops->ndo_do_ioctl() used to require going through
struct dsa_port (dev->dsa_ptr), and so, this was incompatible with any
attempts to add any data encapsulation and hide DSA data structures from
the outside world.

Link: https://lore.kernel.org/netdev/20230403083019.120b72fd@kernel.org/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-09 15:35:49 +01:00
Hangbin Liu
4598380f9c bonding: fix ns validation on backup slaves
When arp_validate is set to 2, 3, or 6, validation is performed for
backup slaves as well. As stated in the bond documentation, validation
involves checking the broadcast ARP request sent out via the active
slave. This helps determine which slaves are more likely to function in
the event of an active slave failure.

However, when the target is an IPv6 address, the NS message sent from
the active interface is not checked on backup slaves. Additionally,
based on the bond_arp_rcv() rule b, we must reverse the saddr and daddr
when checking the NS message.

Note that when checking the NS message, the destination address is a
multicast address. Therefore, we must convert the target address to
solicited multicast in the bond_get_targets_ip6() function.

Prior to the fix, the backup slaves had a mii status of "down", but
after the fix, all of the slaves' mii status was updated to "UP".

Fixes: 4e24be018eb9 ("bonding: add new parameter ns_targets")
Reviewed-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-07 08:47:20 +01:00
Jakub Kicinski
d9c960675a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/google/gve/gve.h
  3ce934558097 ("gve: Secure enough bytes in the first TX desc for all TCP pkts")
  75eaae158b1b ("gve: Add XDP DROP and TX support for GQI-QPL format")
https://lore.kernel.org/all/20230406104927.45d176f5@canb.auug.org.au/
https://lore.kernel.org/all/c5872985-1a95-0bc8-9dcc-b6f23b439e9d@tessares.net/

Adjacent changes:

net/can/isotp.c
  051737439eae ("can: isotp: fix race between isotp_sendsmg() and isotp_release()")
  96d1c81e6a04 ("can: isotp: add module parameter for maximum pdu size")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-06 12:01:20 -07:00
Kal Conley
d769ccaf95 xsk: Fix unaligned descriptor validation
Make sure unaligned descriptors that straddle the end of the UMEM are
considered invalid. Currently, descriptor validation is broken for
zero-copy mode which only checks descriptors at page granularity.
For example, descriptors in zero-copy mode that overrun the end of the
UMEM but not a page boundary are (incorrectly) considered valid. The
UMEM boundary check needs to happen before the page boundary and
contiguity checks in xp_desc_crosses_non_contig_pg(). Do this check in
xp_unaligned_validate_desc() instead like xp_check_unaligned() already
does.

Fixes: 2b43470add8c ("xsk: Introduce AF_XDP buffer allocation API")
Signed-off-by: Kal Conley <kal.conley@dectris.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230405235920.7305-2-kal.conley@dectris.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-04-06 09:53:05 -07:00
Kuniyuki Iwashima
0a78cf7264 raw: Fix NULL deref in raw_get_next().
Dae R. Jeong reported a NULL deref in raw_get_next() [0].

It seems that the repro was running these sequences in parallel so
that one thread was iterating on a socket that was being freed in
another netns.

  unshare(0x40060200)
  r0 = syz_open_procfs(0x0, &(0x7f0000002080)='net/raw\x00')
  socket$inet_icmp_raw(0x2, 0x3, 0x1)
  pread64(r0, &(0x7f0000000000)=""/10, 0xa, 0x10000000007f)

After commit 0daf07e52709 ("raw: convert raw sockets to RCU"), we
use RCU and hlist_nulls_for_each_entry() to iterate over SOCK_RAW
sockets.  However, we should use spinlock for slow paths to avoid
the NULL deref.

Also, SOCK_RAW does not use SLAB_TYPESAFE_BY_RCU, and the slab object
is not reused during iteration in the grace period.  In fact, the
lockless readers do not check the nulls marker with get_nulls_value().
So, SOCK_RAW should use hlist instead of hlist_nulls.

Instead of adding an unnecessary barrier by sk_nulls_for_each_rcu(),
let's convert hlist_nulls to hlist and use sk_for_each_rcu() for
fast paths and sk_for_each() and spinlock for /proc/net/raw.

[0]:
general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
CPU: 2 PID: 20952 Comm: syz-executor.0 Not tainted 6.2.0-g048ec869bafd-dirty #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline]
RIP: 0010:sock_net include/net/sock.h:649 [inline]
RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline]
RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline]
RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995
Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef
RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206
RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000
RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338
RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9
R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78
R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030
FS:  00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055bb9614b35f CR3: 000000003c672000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 seq_read_iter+0x4c6/0x10f0 fs/seq_file.c:225
 seq_read+0x224/0x320 fs/seq_file.c:162
 pde_read fs/proc/inode.c:316 [inline]
 proc_reg_read+0x23f/0x330 fs/proc/inode.c:328
 vfs_read+0x31e/0xd30 fs/read_write.c:468
 ksys_pread64 fs/read_write.c:665 [inline]
 __do_sys_pread64 fs/read_write.c:675 [inline]
 __se_sys_pread64 fs/read_write.c:672 [inline]
 __x64_sys_pread64+0x1e9/0x280 fs/read_write.c:672
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x4e/0xa0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x478d29
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f843ae8dbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 0000000000791408 RCX: 0000000000478d29
RDX: 000000000000000a RSI: 0000000020000000 RDI: 0000000000000003
RBP: 00000000f477909a R08: 0000000000000000 R09: 0000000000000000
R10: 000010000000007f R11: 0000000000000246 R12: 0000000000791740
R13: 0000000000791414 R14: 0000000000791408 R15: 00007ffc2eb48a50
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline]
RIP: 0010:sock_net include/net/sock.h:649 [inline]
RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline]
RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline]
RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995
Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef
RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206
RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000
RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338
RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9
R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78
R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030
FS:  00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f92ff166000 CR3: 000000003c672000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: 0daf07e52709 ("raw: convert raw sockets to RCU")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Dae R. Jeong <threeearcat@gmail.com>
Link: https://lore.kernel.org/netdev/ZCA2mGV_cmq7lIfV@dragonet/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-04 18:56:58 -07:00
Vladimir Oltean
88c0a6b503 net: create a netdev notifier for DSA to reject PTP on DSA master
The fact that PTP 2-step TX timestamping is broken on DSA switches if
the master also timestamps the same packets is documented by commit
f685e609a301 ("net: dsa: Deny PTP on master if switch supports it").
We attempt to help the users avoid shooting themselves in the foot by
making DSA reject the timestamping ioctls on an interface that is a DSA
master, and the switch tree beneath it contains switches which are aware
of PTP.

The only problem is that there isn't an established way of intercepting
ndo_eth_ioctl calls, so DSA creates avoidable burden upon the network
stack by creating a struct dsa_netdevice_ops with overlaid function
pointers that are manually checked from the relevant call sites. There
used to be 2 such dsa_netdevice_ops, but now, ndo_eth_ioctl is the only
one left.

There is an ongoing effort to migrate driver-visible hardware timestamping
control from the ndo_eth_ioctl() based API to a new ndo_hwtstamp_set()
model, but DSA actively prevents that migration, since dsa_master_ioctl()
is currently coded to manually call the master's legacy ndo_eth_ioctl(),
and so, whenever a network device driver would be converted to the new
API, DSA's restrictions would be circumvented, because any device could
be used as a DSA master.

The established way for unrelated modules to react on a net device event
is via netdevice notifiers. So we create a new notifier which gets
called whenever there is an attempt to change hardware timestamping
settings on a device.

Finally, there is another reason why a netdev notifier will be a good
idea, besides strictly DSA, and this has to do with PHY timestamping.

With ndo_eth_ioctl(), all MAC drivers must manually call
phy_has_hwtstamp() before deciding whether to act upon SIOCSHWTSTAMP,
otherwise they must pass this ioctl to the PHY driver via
phy_mii_ioctl().

With the new ndo_hwtstamp_set() API, it will be desirable to simply not
make any calls into the MAC device driver when timestamping should be
performed at the PHY level.

But there exist drivers, such as the lan966x switch, which need to
install packet traps for PTP regardless of whether they are the layer
that provides the hardware timestamps, or the PHY is. That would be
impossible to support with the new API.

The proposal there, too, is to introduce a netdev notifier which acts as
a better cue for switching drivers to add or remove PTP packet traps,
than ndo_hwtstamp_set(). The one introduced here "almost" works there as
well, except for the fact that packet traps should only be installed if
the PHY driver succeeded to enable hardware timestamping, whereas here,
we need to deny hardware timestamping on the DSA master before it
actually gets enabled. This is why this notifier is called "PRE_", and
the notifier that would get used for PHY timestamping and packet traps
would be called NETDEV_CHANGE_HWTSTAMP. This isn't a new concept, for
example NETDEV_CHANGEUPPER and NETDEV_PRECHANGEUPPER do the same thing.

In expectation of future netlink UAPI, we also pass a non-NULL extack
pointer to the netdev notifier, and we make DSA populate it with an
informative reason for the rejection. To avoid making it go to waste, we
make the ioctl-based dev_set_hwtstamp() create a fake extack and print
the message to the kernel log.

Link: https://lore.kernel.org/netdev/20230401191215.tvveoi3lkawgg6g4@skbuf/
Link: https://lore.kernel.org/netdev/20230310164451.ls7bbs6pdzs4m6pw@skbuf/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-03 10:04:27 +01:00
Jakub Kicinski
ce7928f7cf Major stack changes:
* TC offload support for drivers below mac80211
  * reduced neighbor report (RNR) handling for AP mode
  * mac80211 mesh fast-xmit and fast-rx support
  * support for another mesh A-MSDU format
    (seems nobody got the spec right)
 
 Major driver changes:
 
 Kalle moved the drivers that were just plain C files
 in drivers/net/wireless/ to legacy/ and virtual/ dirs.
 
 hwsim
  * multi-BSSID support
  * some FTM support
 
 ath11k
  * MU-MIMO parameters support
  * ack signal support for management packets
 
 rtl8xxxu
  * support for RTL8710BU aka RTL8188GU chips
 
 rtw89
  * support for various newer firmware APIs
 
 ath10k
  * enabled threaded NAPI on WCN3990
 
 iwlwifi
  * lots of work for multi-link/EHT (wifi7)
  * hardware timestamping support for some devices/firwmares
  * TX beacon protection on newer hardware
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpeA8sTs3M8SN2hR410qiO8sPaAAFAmQl9gkACgkQ10qiO8sP
 aACnTw/+KHEjBePPNSiVOziE6tlKSIruabrsSbBBWWx4btFBT0YAnqiwf4Zyp67r
 VaRWPxRNqjriKMPYDn/bpCrRgkqeN50iCV9n+/xh4TaiLokea0IuA/KDLNM+vBPf
 0VJBpNdW8tKLM0YluGWuoWUAlkjujBzjHaW0sKoII/PLh6evLGEIhqdveE8052Bc
 P3UMu3+FHXg+E1+1tpn/AK+b6bGQeBN3KG3h7uocwR4QE6xROdFPAmXDohbJ98pG
 6EmCDuuFiYPwkKu3GsxmKFNbq7NatPoFtghBPQi4Sn9wGgWWlJcK7hlNTDDW5O7c
 ZqyqY8BiUnh9A1ZcUCKLiHYx+mjxiE2P5wlfDbiZGm7k2JkrBu+AebRvtD4ie/UF
 db8T321kZJ5XKI89yQT2DXzfrvhbw2D85eFudQpIyRguoPKicICLWOyw84+esNTI
 2vqnbPVnYrLQ/X7SzKcdIh74fcSvL8Dj0EiUAzvhPDbgIvexiYDetBXVss8tXBsn
 bpCRND05tpCqqW+LyulglmYTAbhZdCYOy4DAB+g6dtxHOR8A0eLbUHocm0zlPvqd
 sc4pYpXHx4x+X3FPHpfxOqKUO87P7SvJ++d3Y3e09/ObLPSkn65ihsGJpMN/fj1e
 BkJ2yF0KUfJMNhSl5p8VELT3XZevT0EIEKMvwoIgSsWPR8cxqoI=
 =Zi6X
 -----END PGP SIGNATURE-----

Merge tag 'wireless-next-2023-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Major stack changes:

 * TC offload support for drivers below mac80211
 * reduced neighbor report (RNR) handling for AP mode
 * mac80211 mesh fast-xmit and fast-rx support
 * support for another mesh A-MSDU format
   (seems nobody got the spec right)

Major driver changes:

Kalle moved the drivers that were just plain C files
in drivers/net/wireless/ to legacy/ and virtual/ dirs.

hwsim
 * multi-BSSID support
 * some FTM support

ath11k
 * MU-MIMO parameters support
 * ack signal support for management packets

rtl8xxxu
 * support for RTL8710BU aka RTL8188GU chips

rtw89
 * support for various newer firmware APIs

ath10k
 * enabled threaded NAPI on WCN3990

iwlwifi
 * lots of work for multi-link/EHT (wifi7)
 * hardware timestamping support for some devices/firwmares
 * TX beacon protection on newer hardware

* tag 'wireless-next-2023-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (181 commits)
  wifi: clean up erroneously introduced file
  wifi: iwlwifi: mvm: correctly use link in iwl_mvm_sta_del()
  wifi: iwlwifi: separate AP link management queues
  wifi: iwlwifi: mvm: free probe_resp_data later
  wifi: iwlwifi: bump FW API to 75 for AX devices
  wifi: iwlwifi: mvm: move max_agg_bufsize into host TLC lq_sta
  wifi: iwlwifi: mvm: send full STA during HW restart
  wifi: iwlwifi: mvm: rework active links counting
  wifi: iwlwifi: mvm: update mac config when assigning chanctx
  wifi: iwlwifi: mvm: use the correct link queue
  wifi: iwlwifi: mvm: clean up mac_id vs. link_id in MLD sta
  wifi: iwlwifi: mvm: fix station link data leak
  wifi: iwlwifi: mvm: initialize max_rc_amsdu_len per-link
  wifi: iwlwifi: mvm: use appropriate link for rate selection
  wifi: iwlwifi: mvm: use the new lockdep-checking macros
  wifi: iwlwifi: mvm: remove chanctx WARN_ON
  wifi: iwlwifi: mvm: avoid sending MAC context for idle
  wifi: iwlwifi: mvm: remove only link-specific AP keys
  wifi: iwlwifi: mvm: skip inactive links
  wifi: iwlwifi: mvm: adjust iwl_mvm_scan_respect_p2p_go_iter() for MLO
  ...
====================

Link: https://lore.kernel.org/r/20230330205612.921134-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-30 23:52:20 -07:00
Kuniyuki Iwashima
8cdc3223e7 ipv6: Remove in6addr_any alternatives.
Some code defines the IPv6 wildcard address as a local variable and
use it with memcmp() or ipv6_addr_equal().

Let's use in6addr_any and ipv6_addr_any() instead.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-29 08:22:52 +01:00
Bobby Eshleman
634f1a7110 vsock: support sockmap
This patch adds sockmap support for vsock sockets. It is intended to be
usable by all transports, but only the virtio and loopback transports
are implemented.

SOCK_STREAM, SOCK_DGRAM, and SOCK_SEQPACKET are all supported.

Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-29 08:19:38 +01:00
Thomas Gleixner
bc9d3a9f2a net: dst: Switch to rcuref_t reference counting
Under high contention dst_entry::__refcnt becomes a significant bottleneck.

atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
high retry rates on contention.

Switch the reference count to rcuref_t which results in a significant
performance gain. Rename the reference count member to __rcuref to reflect
the change.

The gain depends on the micro-architecture and the number of concurrent
operations and has been measured in the range of +25% to +130% with a
localhost memtier/memcached benchmark which amplifies the problem
massively.

Running the memtier/memcached benchmark over a real (1Gb) network
connection the conversion on top of the false sharing fix for struct
dst_entry::__refcnt results in a total gain in the 2%-5% range over the
upstream baseline.

Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de
Link: https://lore.kernel.org/r/20230323102800.215027837@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28 18:52:28 -07:00
Wangyang Guo
d288a162dd net: dst: Prevent false sharing vs. dst_entry:: __refcnt
dst_entry::__refcnt is highly contended in scenarios where many connections
happen from and to the same IP. The reference count is an atomic_t, so the
reference count operations have to take the cache-line exclusive.

Aside of the unavoidable reference count contention there is another
significant problem which is caused by that: False sharing.

perf top identified two affected read accesses. dst_entry::lwtstate and
rtable::rt_genid.

dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into
a seperate cacheline vs. the read mostly members located at the beginning
of the struct.

That prevents false sharing vs. the struct members in the first 64
bytes of the structure, but there is also

  dst_entry::lwtstate

which is located after the reference count and in the same cache line. This
member is read after a reference count has been acquired.

struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a
size of 112 bytes, which means that the struct members of rtable which
follow the dst member share the same cache line as dst_entry::__refcnt.
Especially

  rtable::rt_genid

is also read by the contexts which have a reference count acquired
already.

When dst_entry:__refcnt is incremented or decremented via an atomic
operation these read accesses stall. This was found when analysing the
memtier benchmark in 1:100 mode, which amplifies the problem extremly.

Move the rt[6i]_uncached[_list] members out of struct rtable and struct
rt6_info into struct dst_entry to provide padding and move the lwtstate
member after that so it ends up in the same cache line.

The resulting improvement depends on the micro-architecture and the number
of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached
benchmark.

[ tglx: Rearrange struct ]

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230323102800.042297517@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-28 18:52:22 -07:00
Eric Van Hensbergen
46c30cb8f5
9p: Add additional debug flags and open modes
Add some additional debug flags to assist with debugging
cache changes.  Also add some additional open modes so we
can track cache state in fids more directly.

Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2023-03-27 02:33:48 +00:00
Jakub Kicinski
dc0a7b5200 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
  6e9d51b1a5cb ("net/mlx5e: Initialize link speed to zero")
  1bffcea42926 ("net/mlx5e: Add devlink hairpin queues parameters")
https://lore.kernel.org/all/20230324120623.4ebbc66f@canb.auug.org.au/
https://lore.kernel.org/all/20230321211135.47711-1-saeed@kernel.org/

Adjacent changes:

drivers/net/phy/phy.c
  323fe43cf9ae ("net: phy: Improved PHY error reporting in state machine")
  4203d84032e2 ("net: phy: Ensure state transitions are processed from phy_stop()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-24 10:10:20 -07:00
Jaewan Kim
5097f84437 wifi: nl80211: make nl80211_send_chandef non-static
Expose nl80211_send_chandef functionality for mac80211_hwsim or vendor
netlink can use it where needed.

Signed-off-by: Jaewan Kim <jaewan@google.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://lore.kernel.org/r/20230322131637.2633968-3-jaewan@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-03-24 11:45:51 +01:00
Aloka Dixit
dbbb27e183 cfg80211: support RNR for EMA AP
As per IEEE Std 802.11ax-2021, 11.1.3.8.3 Discovery of a nontransmitted
BSSID profile, an EMA AP that transmits a Beacon frame carrying a partial
list of nontransmitted BSSID profiles should include in the frame
a Reduced Neighbor Report element carrying information for at least the
nontransmitted BSSIDs that are not present in the Multiple BSSID element
carried in that frame.
Add new nested attribute NL80211_ATTR_EMA_RNR_ELEMS to support the above.
Number of RNR elements must be more than or equal to the number of
MBSSID elements. This attribute can be used only when EMA is enabled.
Userspace is responsible for splitting the RNR into multiple elements such
that each element excludes the non-transmitting profiles already included
in the MBSSID element (%NL80211_ATTR_MBSSID_ELEMS) at the same index.
Each EMA beacon will be generated by adding MBSSID and RNR elements
at the same index. If the userspace provides more RNR elements than the
number of MBSSID elements then these will be added in every EMA beacon.

Signed-off-by: Aloka Dixit <quic_alokad@quicinc.com>
Link: https://lore.kernel.org/r/20230323113801.6903-2-quic_alokad@quicinc.com
[Johannes: validate elements]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-03-24 11:12:48 +01:00
Jakub Kicinski
e346df60b8 Merge branch 'main' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:

====================
netfilter updates for net-next

This pull request contains changes for the *net-next* tree.

1. Change IPv6 stack to keep conntrack references until ipsec policy
   checks are done, like ipv4, from Madhu Koriginja.
   This update was missed when IPv6 NAT support was added 10 years ago.

2. get rid of old 'compat' structure layout in nf_nat_redirect
   core and move the conversion to the only user that needs the
   old layout for abi reasons. From Jeremy Sowden.

3. Compact some common code paths in nft_redir, also from Jeremy.

4. Time to remove the 'default y' knob so iptables 32bit compat interface
   isn't compiled in by default anymore, from myself.

5. Move ip(6)tables builtin icmp matches to the udptcp one.
   This has the advantage that icmp/icmpv6 match doesn't load the
   iptables/ip6tables modules anymore when iptables-nft is used.
   Also from myself.

* 'main' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: keep conntrack reference until IPsecv6 policy checks are done
  xtables: move icmp/icmpv6 logic to xt_tcpudp
  netfilter: xtables: disable 32bit compat interface by default
  netfilter: nft_masq: deduplicate eval call-backs
  netfilter: nft_redir: use `struct nf_nat_range2` throughout and deduplicate eval call-backs
====================

Link: https://lore.kernel.org/r/20230322210802.6743-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-23 21:55:24 -07:00
Kui-Feng Lee
8fb1a76a0f net: Update an existing TCP congestion control algorithm.
This feature lets you immediately transition to another congestion
control algorithm or implementation with the same name.  Once a name
is updated, new connections will apply this new algorithm.

The purpose is to update a customized algorithm implemented in BPF
struct_ops with a new version on the flight.  The following is an
example of using the userspace API implemented in later BPF patches.

   link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
   .......
   err = bpf_link__update_map(link, skel->maps.ca_update_2);

We first load and register an algorithm implemented in BPF struct_ops,
then swap it out with a new one using the same name. After that, newly
created connections will apply the updated algorithm, while older ones
retain the previous version already applied.

This patch also takes this chance to refactor the ca validation into
the new tcp_validate_congestion_control() function.

Cc: netdev@vger.kernel.org, Eric Dumazet <edumazet@google.com>
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Link: https://lore.kernel.org/r/20230323032405.3735486-3-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-22 22:53:00 -07:00
Jakub Kicinski
e4d264e87a Extend packet offload to fully support libreswan
The following patches are an outcome of Raed's work to add packet
 offload support to libreswan [1].
 
 The series includes:
  * Priority support to IPsec policies
  * Statistics per-SA (visible through "ip -s xfrm state ..." command)
  * Support to IKE policy holes
  * Fine tuning to acquire logic.
 
 Thanks
 
 [1] https://github.com/libreswan/libreswan/pull/986
 Link: https://lore.kernel.org/all/cover.1678714336.git.leon@kernel.org
 Signed-off-by: Leon Romanovsky <leon@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQT1m3YD37UfMCUQBNwp8NhrnBAZsQUCZBgoggAKCRAp8NhrnBAZ
 sQPvAP4/muXSlz26gU632wL+wjpFkvM5fX1iu8AwmdhOtX+gAAD/edtCVXj4QxD4
 0xGMjwrz4tH5WqdUzqDQjhZrBYNWmAk=
 =oNEQ
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-libreswan-mlx5' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Leon Romanovsky says:

====================
Extend packet offload to fully support libreswan

The following patches are an outcome of Raed's work to add packet
offload support to libreswan [1].

The series includes:
 * Priority support to IPsec policies
 * Statistics per-SA (visible through "ip -s xfrm state ..." command)
 * Support to IKE policy holes
 * Fine tuning to acquire logic.

[1] https://github.com/libreswan/libreswan/pull/986
Link: https://lore.kernel.org/all/cover.1678714336.git.leon@kernel.org

* tag 'ipsec-libreswan-mlx5' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5e: Update IPsec per SA packets/bytes count
  net/mlx5e: Use one rule to count all IPsec Tx offloaded traffic
  net/mlx5e: Support IPsec acquire default SA
  net/mlx5e: Allow policies with reqid 0, to support IKE policy holes
  xfrm: copy_to_user_state fetch offloaded SA packets/bytes statistics
  xfrm: add new device offload acquire flag
  net/mlx5e: Use chains for IPsec policy priority offload
  net/mlx5: fs_core: Allow ignore_flow_level on TX dest
  net/mlx5: fs_chains: Refactor to detach chains from tc usage
====================

Link: https://lore.kernel.org/r/20230320094722.1009304-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-22 21:56:33 -07:00
Kiran K
294d749b5d Bluetooth: btintel: Iterate only bluetooth device ACPI entries
Current flow interates over entire ACPI table entries looking for
Bluetooth Per Platform Antenna Gain(PPAG) entry. This patch iterates
over ACPI entries relvant to Bluetooth device only.

Fixes: c585a92b2f9c ("Bluetooth: btintel: Set Per Platform Antenna Gain(PPAG)")
Signed-off-by: Kiran K <kiran.k@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-03-22 16:05:55 -07:00
Jeremy Sowden
6f56ad1b92 netfilter: nft_redir: use struct nf_nat_range2 throughout and deduplicate eval call-backs
`nf_nat_redirect_ipv4` takes a `struct nf_nat_ipv4_multi_range_compat`,
but converts it internally to a `struct nf_nat_range2`.  Change the
function to take the latter, factor out the code now shared with
`nf_nat_redirect_ipv6`, move the conversion to the xt_REDIRECT module,
and update the ipv4 range initialization in the nft_redir module.

Replace a bare hex constant for 127.0.0.1 with a macro.

Remove `WARN_ON`.  `nf_nat_setup_info` calls `nf_ct_is_confirmed`:

	/* Can't setup nat info for confirmed ct. */
	if (nf_ct_is_confirmed(ct))
		return NF_ACCEPT;

This means that `ct` cannot be null or the kernel will crash, and
implies that `ctinfo` is `IP_CT_NEW` or `IP_CT_RELATED`.

nft_redir has separate ipv4 and ipv6 call-backs which share much of
their code, and an inet one switch containing a switch that calls one of
the others based on the family of the packet.  Merge the ipv4 and ipv6
ones into the inet one in order to get rid of the duplicate code.

Const-qualify the `priv` pointer since we don't need to write through
it.

Assign `priv->flags` to the range instead of OR-ing it in.

Set the `NF_NAT_RANGE_PROTO_SPECIFIED` flag once during init, rather
than on every eval.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-03-22 21:48:59 +01:00
Aloka Dixit
bd54f3c290 wifi: mac80211: generate EMA beacons in AP mode
Add APIs to generate an array of beacons for an EMA AP (enhanced
multiple BSSID advertisements), each including a single MBSSID element.
EMA profile periodicity equals the count of elements.

- ieee80211_beacon_get_template_ema_list() - Generate and return all
EMA beacon templates. Drivers must call ieee80211_beacon_free_ema_list()
to free the memory. No change in the prototype for the existing API,
ieee80211_beacon_get_template(), which should be used for non-EMA AP.

- ieee80211_beacon_get_template_ema_index() - Generate a beacon which
includes the multiple BSSID element at the given index. Drivers can use
this function in a loop until NULL is returned which indicates end of
available MBSSID elements.

- ieee80211_beacon_free_ema_list() - free the memory allocated for the
list of EMA beacon templates.

Modify existing functions ieee80211_beacon_get_ap(),
ieee80211_get_mbssid_beacon_len() and ieee80211_beacon_add_mbssid()
to accept a new parameter for EMA index.

Signed-off-by: Aloka Dixit <quic_alokad@quicinc.com>
Co-developed-by: John Crispin <john@phrozen.org>
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20221206005040.3177-2-quic_alokad@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-03-22 13:54:57 +01:00
Felix Fietkau
fe4a6d2db3 wifi: mac80211: implement support for yet another mesh A-MSDU format
MT7996 hardware supports mesh A-MSDU subframes in hardware, but uses a
big-endian length field

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
Link: https://lore.kernel.org/r/20230314095956.62085-7-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-03-22 13:31:19 +01:00
Felix Fietkau
61587f1556 wifi: mac80211: add support for letting drivers register tc offload support
On newer MediaTek SoCs (e.g. MT7986), WLAN->WLAN or WLAN->Ethernet flows can
be offloaded by the SoC. In order to support that, the .ndo_setup_tc op is
needed.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20230321091248.30947-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-03-22 13:20:03 +01:00
Eric Dumazet
09eed1192c neighbour: switch to standard rcu, instead of rcu_bh
rcu_bh is no longer a win, especially for objects freed
with standard call_rcu().

Switch neighbour code to no longer disable BH when not necessary.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-21 21:32:18 -07:00
Raed Salem
e0aeb9b90a xfrm: add new device offload acquire flag
During XFRM acquire flow, a default SA is created to be updated later,
once acquire netlink message is handled in user space. When the relevant
policy is offloaded this default SA is also offloaded to IPsec offload
supporting driver, however this SA does not have context suitable for
offloading in HW, nor is interesting to offload to HW, consequently needs
a special driver handling apart from other offloaded SA(s).
Add a special flag that marks such SA so driver can handle it correctly.

Signed-off-by: Raed Salem <raeds@nvidia.com>
Link: https://lore.kernel.org/r/f5da0834d8c6b82ab9ba38bd4a0c55e71f0e3dab.1678714336.git.leon@kernel.org
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-20 11:29:33 +02:00
Eric Dumazet
e9d9da9154 tcp: preserve const qualifier in tcp_sk()
We can change tcp_sk() to propagate its argument const qualifier,
thanks to container_of_const().

We have two places where a const sock pointer has to be upgraded
to a write one. We have been using const qualifier for lockless
listeners to clearly identify points where writes could happen.

Add tcp_sk_rw() helper to better document these.

tcp_inbound_md5_hash(), __tcp_grow_window(), tcp_reset_check()
and tcp_rack_reo_wnd() get an additional const qualififer
for their @tp local variables.

smc_check_reset_syn_req() also needs a similar change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-18 12:23:34 +00:00