Commit Graph

1218395 Commits

Author SHA1 Message Date
Phil Sutter
1baf0152f7 netfilter: nf_tables: audit log object reset once per table
When resetting multiple objects at once (via dump request), emit a log
message per table (or filled skb) and resurrect the 'entries' parameter
to contain the number of objects being logged for.

To test the skb exhaustion path, perform some bulk counter and quota
adds in the kselftest.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Paul Moore <paul@paul-moore.com> (Audit)
Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 13:43:40 +02:00
Geert Uytterhoeven
2915240edd neighbor: tracing: Move pin6 inside CONFIG_IPV6=y section
When CONFIG_IPV6=n, and building with W=1:

    In file included from include/trace/define_trace.h:102,
		     from include/trace/events/neigh.h:255,
		     from net/core/net-traces.c:51:
    include/trace/events/neigh.h: In function ‘trace_event_raw_event_neigh_create’:
    include/trace/events/neigh.h:42:34: error: variable ‘pin6’ set but not used [-Werror=unused-but-set-variable]
       42 |                 struct in6_addr *pin6;
	  |                                  ^~~~
    include/trace/trace_events.h:402:11: note: in definition of macro ‘DECLARE_EVENT_CLASS’
      402 |         { assign; }                                                     \
	  |           ^~~~~~
    include/trace/trace_events.h:44:30: note: in expansion of macro ‘PARAMS’
       44 |                              PARAMS(assign),                   \
	  |                              ^~~~~~
    include/trace/events/neigh.h:23:1: note: in expansion of macro ‘TRACE_EVENT’
       23 | TRACE_EVENT(neigh_create,
	  | ^~~~~~~~~~~
    include/trace/events/neigh.h:41:9: note: in expansion of macro ‘TP_fast_assign’
       41 |         TP_fast_assign(
	  |         ^~~~~~~~~~~~~~
    In file included from include/trace/define_trace.h:103,
		     from include/trace/events/neigh.h:255,
		     from net/core/net-traces.c:51:
    include/trace/events/neigh.h: In function ‘perf_trace_neigh_create’:
    include/trace/events/neigh.h:42:34: error: variable ‘pin6’ set but not used [-Werror=unused-but-set-variable]
       42 |                 struct in6_addr *pin6;
	  |                                  ^~~~
    include/trace/perf.h:51:11: note: in definition of macro ‘DECLARE_EVENT_CLASS’
       51 |         { assign; }                                                     \
	  |           ^~~~~~
    include/trace/trace_events.h:44:30: note: in expansion of macro ‘PARAMS’
       44 |                              PARAMS(assign),                   \
	  |                              ^~~~~~
    include/trace/events/neigh.h:23:1: note: in expansion of macro ‘TRACE_EVENT’
       23 | TRACE_EVENT(neigh_create,
	  | ^~~~~~~~~~~
    include/trace/events/neigh.h:41:9: note: in expansion of macro ‘TP_fast_assign’
       41 |         TP_fast_assign(
	  |         ^~~~~~~~~~~~~~

Indeed, the variable pin6 is declared and initialized unconditionally,
while it is only used and needlessly re-initialized when support for
IPv6 is enabled.

Fix this by dropping the unused variable initialization, and moving the
variable declaration inside the existing section protected by a check
for CONFIG_IPV6.

Fixes: fc651001d2 ("neighbor: Add tracepoint to __neigh_create")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Simon Horman <horms@kernel.org> # build-tested
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 11:16:43 +01:00
Phil Sutter
c4eee56e14 net: skb_find_text: Ignore patterns extending past 'to'
Assume that caller's 'to' offset really represents an upper boundary for
the pattern search, so patterns extending past this offset are to be
rejected.

The old behaviour also was kind of inconsistent when it comes to
fragmentation (or otherwise non-linear skbs): If the pattern started in
between 'to' and 'from' offsets but extended to the next fragment, it
was not found if 'to' offset was still within the current fragment.

Test the new behaviour in a kselftest using iptables' string match.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: f72b948dcb ("[NET]: skb_find_text ignores to argument")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 11:09:55 +01:00
Bagas Sanjaya
1db34aa58d Revert "net: wwan: iosm: enable runtime pm support for 7560"
Runtime power management support breaks Intel LTE modem where dmesg dump
showes timeout errors:

```
[   72.027442] iosm 0000:01:00.0: msg timeout
[   72.531638] iosm 0000:01:00.0: msg timeout
[   73.035414] iosm 0000:01:00.0: msg timeout
[   73.540359] iosm 0000:01:00.0: msg timeout
```

Furthermore, when shutting down with `poweroff` and modem attached, the
system rebooted instead of powering down as expected. The modem works
again only after power cycling.

Revert runtime power management support for IOSM driver as introduced by
commit e4f5073d53 ("net: wwan: iosm: enable runtime pm support for
7560").

Fixes: e4f5073d53 ("net: wwan: iosm: enable runtime pm support for 7560")
Reported-by: Martin <mwolf@adiumentum.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217996
Link: https://lore.kernel.org/r/267abf02-4b60-4a2e-92cd-709e3da6f7d3@gmail.com/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 11:08:52 +01:00
David S. Miller
37fb1c81d2 netfilter next pull request 2023-10-18
-----BEGIN PGP SIGNATURE-----
 
 iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmUvltYNHGZ3QHN0cmxl
 bi5kZQAKCRBwkajZrV/2AGbwEADT9pI6dIyEJt0xMte+Ld41E+lfBgLXxv/cAFQo
 5NjP5SF75iMh0PoRrw1f9DJJNGo4C2OhcWcltosYLdE2xoDNjYw/J2oi/WdYHLe7
 tEFZWB0KfOBl17Gh0l9Gue8MyVKKamNDuMvfd/MQB9af7VQ6lPi/hlWyjAk4hTGu
 6Ly1asHM1kXFt6Z0mKMRQomXKZ46mEVWX8a11wtfLnd3vmCqrHlKwGKlNVi9U5ir
 aQnitw60OmCvbYLBdjDugPibSTeoha674vxVtjWuwq+1uQrkx0js/ASKPGrT2a33
 yo2/nuprdZc135YCXAlccb4TwgKTMT1R1pf9JY5fhSIPKfY49yR0JKUckXpieefy
 t/VAJWVsVvXg27XF1LCommMOhbM0c+WE/BpJ8ASTSt+79nXhLcISDirkgUcGvUKZ
 LNp1sa3QqwR7auTTzd9aKIWvZIN82MLOJJpYkhNOqYheCAm9Y2O1kClQB7t7Zi3r
 K7L9clpexYWcJQmDox2SwU7nVWeqJP6e2Y2f4W+B9aKuiyHNF/B/pEE5LO6hqZ8C
 WJ+TZZVaxDTxtVtC2W3tFFhtd/ZzuIPryXNL127XN83L0iQZ+STDe6CzTXyysD3h
 14iCr4pY4u5y5w0jqilQYdMCyCK2LUKmUoZUKGpwSlDQtEZU6BZsFnE12C6+v9N+
 BXr7aA==
 =/3kX
 -----END PGP SIGNATURE-----

Merge tag 'nf-next-23-10-18' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter next pull request 2023-10-18

This series contains initial netfilter skb drop_reason support, from
myself.

First few patches fix up a few spots to make sure we won't trip
when followup patches embed error numbers in the upper bits
(we already do this in some places).

Then, nftables and bridge netfilter get converted to call kfree_skb_reason
directly to let tooling pinpoint exact location of packet drops,
rather than the existing NF_DROP catchall in nf_hook_slow().

I would like to eventually convert all netfilter modules, but as some
callers cannot deal with NF_STOLEN (notably act_ct), more preparation
work is needed for this.

Last patch gets rid of an ugly 'de-const' cast in nftables.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 11:05:21 +01:00
Gavrilov Ilia
1d30162f35 net: pktgen: Fix interface flags printing
Device flags are displayed incorrectly:
1) The comparison (i == F_FLOW_SEQ) is always false, because F_FLOW_SEQ
is equal to (1 << FLOW_SEQ_SHIFT) == 2048, and the maximum value
of the 'i' variable is (NR_PKT_FLAG - 1) == 17. It should be compared
with FLOW_SEQ_SHIFT.

2) Similarly to the F_IPSEC flag.

3) Also add spaces to the print end of the string literal "spi:%u"
to prevent the output from merging with the flag that follows.

Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.

Fixes: 99c6d3d20d ("pktgen: Remove brute-force printing of flags")
Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 10:12:30 +01:00
David S. Miller
810799a066 Merge branch 'ethtool-forced-speed'
Paul Greenwalt says:

====================
ethtool: Add link mode maps for forced speeds

The following patch set was initially a part of [1]. As the purpose of the
original series was to add the support of the new hardware to the intel ice
driver, the refactoring of advertised link modes mapping was extracted to a
new set.

The patch set adds a common mechanism for mapping Ethtool forced speeds
with Ethtool supported link modes, which can be used in drivers code.

[1] https://lore.kernel.org/netdev/20230823180633.2450617-1-pawel.chmielewski@intel.com

Changelog:
v4->v5:
Separated ethtool and qede changes into two patches, fixed indentation,
and moved ethtool_forced_speed_maps_init() from ioctl.c to ethtool.h

v3->v4:
Moved the macro for setting fields into the common header file

v2->v3:
Fixed whitespaces, added missing line at end of file

v1->v2:
Fixed formatting, typo, moved declaration of iterator to loop line.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:36:36 +01:00
Pawel Chmielewski
982b0192db ice: Refactor finding advertised link speed
Refactor ice_get_link_ksettings to using forced speed to link modes
mapping.

Suggested-by : Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:36:35 +01:00
Paul Greenwalt
a5b65cd2a3 qede: Refactor qede_forced_speed_maps_init()
Refactor qede_forced_speed_maps_init() to use commen implementation
ethtool_forced_speed_maps_init().

The qede driver was compile tested only.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:36:35 +01:00
Paul Greenwalt
26c5334d34 ethtool: Add forced speed to supported link modes maps
The need to map Ethtool forced speeds to Ethtool supported link modes is
common among drivers. To support this, add a common structure for forced
speed maps and a function to init them.  This is solution was originally
introduced in commit 1d4e4ecccb ("qede: populate supported link modes
maps on module init") for qede driver.

ethtool_forced_speed_maps_init() should be called during driver init
with an array of struct ethtool_forced_speed_map to populate the mapping.

Definitions for maps themselves are left in the driver code, as the sets
of supported link modes may vary between the devices.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:36:35 +01:00
Florian Westphal
2560016721 netfilter: nf_tables: de-constify set commit ops function argument
The set backend using this already has to work around this via ugly
cast, don't spread this pattern.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
Florian Westphal
cf8b7c1a5b netfilter: bridge: convert br_netfilter to NF_DROP_REASON
errno is 0 because these hooks are called from prerouting and forward.
There is no socket that the errno would ever be propagated to.

Other netfilter modules (e.g. nf_nat, conntrack, ...) can be converted
in a similar way.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
Florian Westphal
e0d4593140 netfilter: make nftables drops visible in net dropmonitor
net_dropmonitor blames core.c:nf_hook_slow.
Add NF_DROP_REASON() helper and use it in nft_do_chain().

The helper releases the skb, so exact drop location becomes
available. Calling code will observe the NF_STOLEN verdict
instead.

Adjust nf_hook_slow so we can embed an erro value wih
NF_STOLEN verdicts, just like we do for NF_DROP.

After this, drop in nftables can be pinpointed to a drop due
to a rule or the chain policy.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
Florian Westphal
35c038b0a4 netfilter: nf_nat: mask out non-verdict bits when checking return value
Same as previous change: we need to mask out the non-verdict bits, as
upcoming patches may embed an errno value in NF_STOLEN verdicts too.

NF_DROP could already do this, but not all called functions do this.

Checks that only test ret vs NF_ACCEPT are fine, the 'errno parts'
are always 0 for those.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
Florian Westphal
6291b3a67a netfilter: conntrack: convert nf_conntrack_update to netfilter verdicts
This function calls helpers that can return nf-verdicts, but then
those get converted to -1/0 as thats what the caller expects.

Theoretically NF_DROP could have an errno number set in the upper 24
bits of the return value. Or any of those helpers could return
NF_STOLEN, which would result in use-after-free.

This is fine as-is, the called functions don't do this yet.

But its better to avoid possible future problems if the upcoming
patchset to add NF_DROP_REASON() support gains further users, so remove
the 0/-1 translation from the picture and pass the verdicts down to
the caller.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
Florian Westphal
4d26ab0086 netfilter: nf_tables: mask out non-verdict bits when checking return value
nftables trace infra must mask out the non-verdict bit parts of the
return value, else followup changes that 'return errno << 8 | NF_STOLEN'
will cause breakage.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
Florian Westphal
e15e502710 netfilter: xt_mangle: only check verdict part of return value
These checks assume that the caller only returns NF_DROP without
any errno embedded in the upper bits.

This is fine right now, but followup patches will start to propagate
such errors to allow kfree_skb_drop_reason() in the called functions,
those would then indicate 'errno << 8 | NF_STOLEN'.

To not break things we have to mask those parts out.

Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18 10:26:43 +02:00
David S. Miller
a0a8602247 Merge branch 'devlink-deadlock'
Jiri Pirko says:

====================
devlink: fix a deadlock when taking devlink instance lock while holding RTNL lock

devlink_port_fill() may be called sometimes with RTNL lock held.
When putting the nested port function devlink instance attrs,
current code takes nested devlink instance lock. In that case lock
ordering is wrong.

Patch #1 is a dependency of patch #2.
Patch #2 converts the peernet2id_alloc() call to rely in RCU so it could
         called without devlink instance lock.
Patch #3 takes device reference for devlink instance making sure that
         device does not disappear before devlink_release() is called.
Patch #4 benefits from the preparations done in patches #2 and #3 and
         removes the problematic nested devlink lock aquisition.
Patched #5-#7 improve documentation to reflect this issue so it is
              avoided in the future.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:02 +01:00
Jiri Pirko
5d77371e8c devlink: document devlink_rel_nested_in_notify() function
Add a documentation for devlink_rel_nested_in_notify() describing the
devlink instance locking consequences.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
bb11cf9b2c Documentation: devlink: add a note about RTNL lock into locking section
Add a note describing the locking order of taking RTNL lock with devlink
instance lock.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
b6f23b319a Documentation: devlink: add nested instance section
Add a part talking about nested devlink instances describing
the helpers and locking ordering.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
b5f4e37133 devlink: don't take instance lock for nested handle put
Lockdep reports following issue:

WARNING: possible circular locking dependency detected
------------------------------------------------------
devlink/8191 is trying to acquire lock:
ffff88813f32c250 (&devlink->lock_key#14){+.+.}-{3:3}, at: devlink_rel_devlink_handle_put+0x11e/0x2d0

                           but task is already holding lock:
ffffffff8511eca8 (rtnl_mutex){+.+.}-{3:3}, at: unregister_netdev+0xe/0x20

                           which lock already depends on the new lock.

                           the existing dependency chain (in reverse order) is:

                           -> #3 (rtnl_mutex){+.+.}-{3:3}:
       lock_acquire+0x1c3/0x500
       __mutex_lock+0x14c/0x1b20
       register_netdevice_notifier_net+0x13/0x30
       mlx5_lag_add_mdev+0x51c/0xa00 [mlx5_core]
       mlx5_load+0x222/0xc70 [mlx5_core]
       mlx5_init_one_devl_locked+0x4a0/0x1310 [mlx5_core]
       mlx5_init_one+0x3b/0x60 [mlx5_core]
       probe_one+0x786/0xd00 [mlx5_core]
       local_pci_probe+0xd7/0x180
       pci_device_probe+0x231/0x720
       really_probe+0x1e4/0xb60
       __driver_probe_device+0x261/0x470
       driver_probe_device+0x49/0x130
       __driver_attach+0x215/0x4c0
       bus_for_each_dev+0xf0/0x170
       bus_add_driver+0x21d/0x590
       driver_register+0x133/0x460
       vdpa_match_remove+0x89/0xc0 [vdpa]
       do_one_initcall+0xc4/0x360
       do_init_module+0x22d/0x760
       load_module+0x51d7/0x6750
       init_module_from_file+0xd2/0x130
       idempotent_init_module+0x326/0x5a0
       __x64_sys_finit_module+0xc1/0x130
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0

                           -> #2 (mlx5_intf_mutex){+.+.}-{3:3}:
       lock_acquire+0x1c3/0x500
       __mutex_lock+0x14c/0x1b20
       mlx5_register_device+0x3e/0xd0 [mlx5_core]
       mlx5_init_one_devl_locked+0x8fa/0x1310 [mlx5_core]
       mlx5_devlink_reload_up+0x147/0x170 [mlx5_core]
       devlink_reload+0x203/0x380
       devlink_nl_cmd_reload+0xb84/0x10e0
       genl_family_rcv_msg_doit+0x1cc/0x2a0
       genl_rcv_msg+0x3c9/0x670
       netlink_rcv_skb+0x12c/0x360
       genl_rcv+0x24/0x40
       netlink_unicast+0x435/0x6f0
       netlink_sendmsg+0x7a0/0xc70
       sock_sendmsg+0xc5/0x190
       __sys_sendto+0x1c8/0x290
       __x64_sys_sendto+0xdc/0x1b0
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0

                           -> #1 (&dev->lock_key#8){+.+.}-{3:3}:
       lock_acquire+0x1c3/0x500
       __mutex_lock+0x14c/0x1b20
       mlx5_init_one_devl_locked+0x45/0x1310 [mlx5_core]
       mlx5_devlink_reload_up+0x147/0x170 [mlx5_core]
       devlink_reload+0x203/0x380
       devlink_nl_cmd_reload+0xb84/0x10e0
       genl_family_rcv_msg_doit+0x1cc/0x2a0
       genl_rcv_msg+0x3c9/0x670
       netlink_rcv_skb+0x12c/0x360
       genl_rcv+0x24/0x40
       netlink_unicast+0x435/0x6f0
       netlink_sendmsg+0x7a0/0xc70
       sock_sendmsg+0xc5/0x190
       __sys_sendto+0x1c8/0x290
       __x64_sys_sendto+0xdc/0x1b0
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0

                           -> #0 (&devlink->lock_key#14){+.+.}-{3:3}:
       check_prev_add+0x1af/0x2300
       __lock_acquire+0x31d7/0x4eb0
       lock_acquire+0x1c3/0x500
       __mutex_lock+0x14c/0x1b20
       devlink_rel_devlink_handle_put+0x11e/0x2d0
       devlink_nl_port_fill+0xddf/0x1b00
       devlink_port_notify+0xb5/0x220
       __devlink_port_type_set+0x151/0x510
       devlink_port_netdevice_event+0x17c/0x220
       notifier_call_chain+0x97/0x240
       unregister_netdevice_many_notify+0x876/0x1790
       unregister_netdevice_queue+0x274/0x350
       unregister_netdev+0x18/0x20
       mlx5e_vport_rep_unload+0xc5/0x1c0 [mlx5_core]
       __esw_offloads_unload_rep+0xd8/0x130 [mlx5_core]
       mlx5_esw_offloads_rep_unload+0x52/0x70 [mlx5_core]
       mlx5_esw_offloads_unload_rep+0x85/0xc0 [mlx5_core]
       mlx5_eswitch_unload_sf_vport+0x41/0x90 [mlx5_core]
       mlx5_devlink_sf_port_del+0x120/0x280 [mlx5_core]
       genl_family_rcv_msg_doit+0x1cc/0x2a0
       genl_rcv_msg+0x3c9/0x670
       netlink_rcv_skb+0x12c/0x360
       genl_rcv+0x24/0x40
       netlink_unicast+0x435/0x6f0
       netlink_sendmsg+0x7a0/0xc70
       sock_sendmsg+0xc5/0x190
       __sys_sendto+0x1c8/0x290
       __x64_sys_sendto+0xdc/0x1b0
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0

                           other info that might help us debug this:

Chain exists of:
                             &devlink->lock_key#14 --> mlx5_intf_mutex --> rtnl_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(rtnl_mutex);
                               lock(mlx5_intf_mutex);
                               lock(rtnl_mutex);
  lock(&devlink->lock_key#14);

Problem is taking the devlink instance lock of nested instance when RTNL
is already held.

To fix this, don't take the devlink instance lock when putting nested
handle. Instead, rely on the preparations done by previous two patches
to be able to access device pointer and obtain netns id without devlink
instance lock held.

Fixes: c137743bce ("devlink: introduce object and nested devlink relationship infra")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
a380687200 devlink: take device reference for devlink object
In preparation to allow to access device pointer without devlink
instance lock held, make sure the device pointer is usable until
devlink_release() is called.

Fixes: c137743bce ("devlink: introduce object and nested devlink relationship infra")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
c503bc7df6 devlink: call peernet2id_alloc() with net pointer under RCU read lock
peernet2id_alloc() allows to be called lockless with peer net pointer
obtained in RCU critical section and makes sure to return ns ID if net
namespaces is not being removed concurrently. Benefit from
read_pnet_rcu() helper addition, use it to obtain net pointer under RCU
read lock and pass it to peernet2id_alloc() to get ns ID.

Fixes: c137743bce ("devlink: introduce object and nested devlink relationship infra")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jiri Pirko
2034d90ae4 net: treat possible_net_t net pointer as an RCU one and add read_pnet_rcu()
Make the net pointer stored in possible_net_t structure annotated as
an RCU pointer. Change the access helpers to treat it as such.
Introduce read_pnet_rcu() helper to allow caller to dereference
the net pointer under RCU read lock.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-18 09:23:01 +01:00
Jakub Kicinski
ee2a35fedb mlx5-updates-2023-10-10
1) Adham Faris, Increase max supported channels number to 256
 
 2) Leon Romanovsky, Allow IPsec soft/hard limits in bytes
 
 3) Shay Drory, Replace global mlx5_intf_lock with
    HCA devcom component lock
 
 4) Wei Zhang, Optimize SF creation flow
 
 During SF creation, HCA state gets changed from INVALID to
 IN_USE step by step. Accordingly, FW sends vhca event to
 driver to inform about this state change asynchronously.
 Each vhca event is critical because all related SW/FW
 operations are triggered by it.
 
 Currently there is only a single mlx5 general event handler
 which not only handles vhca event but many other events.
 This incurs huge bottleneck because all events are forced
 to be handled in serial manner.
 
 Moreover, all SFs share same table_lock which inevitably
 impacts each other when they are created in parallel.
 
 This series will solve this issue by:
 
 1. A dedicated vhca event handler is introduced to eliminate
    the mutual impact with other mlx5 events.
 2. Max FW threads work queues are employed in the vhca event
    handler to fully utilize FW capability.
 3. Redesign SF active work logic to completely remove
    table_lock.
 
 With above optimization, SF creation time is reduced by 25%,
 i.e. from 80s to 60s when creating 100 SFs.
 
 Patches summary:
 
 Patch 1 - implement dedicated vhca event handler with max FW
           cmd threads of work queues.
 Patch 2 - remove table_lock by redesigning SF active work
           logic.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmUqzPEACgkQSD+KveBX
 +j52rwf/c/LnKjPdcWztDIj4YQys9VfyWm9rkYmuiQ9XTYjzY7Y9RTW9XvrJSTdA
 GsD8EmWnqysr0HyRqksuhR6QHydU+fOrOIYxA0OeIbqXXmhIoDqnqTPQftK1q8Cq
 xZ4wDWowwoOGs5BXYsj6m+53ukyB91/dVHS8qiqL6SzDhm6pEEqjeoum7bxM1lKF
 HxLHNLWp3wolLn361Qlvd7SyOcDu3/c7+DYUJ04Hc2TcJEc7G5ZkBtqMNUZ599Z8
 06lXXj1FgIGtqzuddHLpS71gvp5yvtiw/2ujmIX7YuH8zHDqGRhKUYiJ1eRk80iw
 aYPESXC0OsQDWgNFc13f5eTQgc2p/Q==
 =Qaae
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2023-10-10' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2023-10-10

1) Adham Faris, Increase max supported channels number to 256

2) Leon Romanovsky, Allow IPsec soft/hard limits in bytes

3) Shay Drory, Replace global mlx5_intf_lock with
   HCA devcom component lock

4) Wei Zhang, Optimize SF creation flow

During SF creation, HCA state gets changed from INVALID to
IN_USE step by step. Accordingly, FW sends vhca event to
driver to inform about this state change asynchronously.
Each vhca event is critical because all related SW/FW
operations are triggered by it.

Currently there is only a single mlx5 general event handler
which not only handles vhca event but many other events.
This incurs huge bottleneck because all events are forced
to be handled in serial manner.

Moreover, all SFs share same table_lock which inevitably
impacts each other when they are created in parallel.

This series will solve this issue by:

1. A dedicated vhca event handler is introduced to eliminate
   the mutual impact with other mlx5 events.
2. Max FW threads work queues are employed in the vhca event
   handler to fully utilize FW capability.
3. Redesign SF active work logic to completely remove
   table_lock.

With above optimization, SF creation time is reduced by 25%,
i.e. from 80s to 60s when creating 100 SFs.

Patches summary:

Patch 1 - implement dedicated vhca event handler with max FW
          cmd threads of work queues.
Patch 2 - remove table_lock by redesigning SF active work
          logic.

* tag 'mlx5-updates-2023-10-10' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5e: Allow IPsec soft/hard limits in bytes
  net/mlx5e: Increase max supported channels number to 256
  net/mlx5e: Preparations for supporting larger number of channels
  net/mlx5e: Refactor mlx5e_rss_init() and mlx5e_rss_free() API's
  net/mlx5e: Refactor mlx5e_rss_set_rxfh() and mlx5e_rss_get_rxfh()
  net/mlx5e: Refactor rx_res_init() and rx_res_free() APIs
  net/mlx5e: Use PTR_ERR_OR_ZERO() to simplify code
  net/mlx5: Use PTR_ERR_OR_ZERO() to simplify code
  net/mlx5: fix config name in Kconfig parameter documentation
  net/mlx5: Remove unused declaration
  net/mlx5: Replace global mlx5_intf_lock with HCA devcom component lock
  net/mlx5: Refactor LAG peer device lookout bus logic to mlx5 devcom
  net/mlx5: Avoid false positive lockdep warning by adding lock_class_key
  net/mlx5: Redesign SF active work to remove table_lock
  net/mlx5: Parallelize vhca event handling
====================

Link: https://lore.kernel.org/r/20231014171908.290428-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 18:27:27 -07:00
Jakub Kicinski
f6c7b42243 ipsec-2023-10-17
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmUuRZ4ACgkQrB3Eaf9P
 W7e7eg//fgQux/sJoq2Qf7T8cvVt3NSpOLXB43NRsIb9nUyMNN/cYTtypYM/+0FM
 f0zxmwtUkj9Mx/IL2nNzK/h8lMtJeKfy14tt8lAX2ux5oJltCcTiLgdp3C4HWxOh
 9DrXMGIryr0apedkcStSzhRoJBd8giFViAQZE8NwO5EKG8WLJNVvw0YzlNgM0WOk
 5y/sN1uXeUmUCYDkOGcdt6yIyV+GIo/nEg/XOY8LkaQDzKveDypydj0dPGL/Dj8U
 MhczTCf1WdbTHh0dmITIdX/yZ8/bPNfV3EzAtAaYgceVh1DSq8/F9buRXz2oxxvh
 r8Igv+640+SoWEtIsVoXHx7KIZ7LckasrJv1IoKRXGVV25LjnR+bxGVQNyUjdd6+
 Cb11Q/m66fp/k7B+++b6eFVlKvr9O4jBogb3kfGpqMiwm6LNM+CJicdAfM8/GEgM
 ejnVNSEaChhdgGvrXVbKhI2ACatwbPrt6d4PN0d9fpUbZJnuBZl4QHElLNSBznjO
 xJUF8LvPjFGx6vj3YFnBg5f3uNH35nR2jQxPsx+yWdBEFqbxBt7yKr/pftwP1gxD
 ZLHF8JbOT4J+96ClDRFnlWFdY+Phq0fg5S5WhPhNGQ4rj9rsjAS8ZMcHWZeA7iZ2
 0w5P/DlFmmnw6CC+Xr+wQJlgB1tZ2sdC8nAK6gxQJ3eFVv4wnhk=
 =yHtO
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-2023-10-17' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2023-10-17

1) Fix a slab-use-after-free in xfrm_policy_inexact_list_reinsert.
   From Dong Chenchen.

2) Fix data-races in the xfrm interfaces dev->stats fields.
   From Eric Dumazet.

3) Fix a data-race in xfrm_gen_index.
   From Eric Dumazet.

4) Fix an inet6_dev refcount underflow.
   From Zhang Changzhong.

5) Check the return value of pskb_trim in esp_remove_trailer
   for esp4 and esp6. From Ma Ke.

6) Fix a data-race in xfrm_lookup_with_ifid.
   From Eric Dumazet.

* tag 'ipsec-2023-10-17' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: fix a data-race in xfrm_lookup_with_ifid()
  net: ipv4: fix return value check in esp_remove_trailer
  net: ipv6: fix return value check in esp_remove_trailer
  xfrm6: fix inet6_dev refcount underflow problem
  xfrm: fix a data-race in xfrm_gen_index()
  xfrm: interface: use DEV_STATS_INC()
  net: xfrm: skip policies marked as dead while reinserting policies
====================

Link: https://lore.kernel.org/r/20231017083723.1364940-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 18:21:13 -07:00
Justin Stitt
d4b14c1da5 hamradio: replace deprecated strncpy with strscpy_pad
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

We expect both hi.data.modename and hi.data.drivername to be
NUL-terminated based on its usage with sprintf:
|       sprintf(hi.data.modename, "%sclk,%smodem,fclk=%d,bps=%d%s",
|               bc->cfg.intclk ? "int" : "ext",
|               bc->cfg.extmodem ? "ext" : "int", bc->cfg.fclk, bc->cfg.bps,
|               bc->cfg.loopback ? ",loopback" : "");

Note that this data is copied out to userspace with:
|       if (copy_to_user(data, &hi, sizeof(hi)))
... however, the data was also copied FROM the user here:
|       if (copy_from_user(&hi, data, sizeof(hi)))

Considering the above, a suitable replacement is strscpy_pad() as it
guarantees NUL-termination on the destination buffer while also
NUL-padding (which is good+wanted behavior when copying data to
userspace).

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231016-strncpy-drivers-net-hamradio-baycom_epp-c-v2-1-39f72a72de30@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:59:55 -07:00
Jakub Kicinski
5294df643b docs: netlink: clean up after deprecating version
Jiri moved version to legacy specs in commit 0f07415ebb ("netlink:
specs: don't allow version to be specified for genetlink").
Update the documentation.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231016214540.1822392-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:59:51 -07:00
Jakub Kicinski
9fea94d3a8 tools: ynl: fix converting flags to names after recent cleanup
I recently cleaned up specs to not specify enum-as-flags
when target enum is already defined as flags.
YNL Python library did not convert flags, unfortunately,
so this caused breakage for Stan and Willem.

Note that the nlspec.py abstraction already hides the differences
between flags and enums (value vs user_value), so the changes
are pretty trivial.

Fixes: 0629f22ec1 ("ynl: netdev: drop unnecessary enum-as-flags")
Reported-and-tested-by: Willem de Bruijn <willemb@google.com>
Reported-and-tested-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/all/ZS10NtQgd_BJZ3RU@google.com/
Link: https://lore.kernel.org/r/20231016213937.1820386-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:59:46 -07:00
Dan Carpenter
c53647a5df net: usb: smsc95xx: Fix an error code in smsc95xx_reset()
Return a negative error code instead of success.

Fixes: 2f7ca802bd ("net: Add SMSC LAN9500 USB2.0 10/100 ethernet adapter driver")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/147927f0-9ada-45cc-81ff-75a19dd30b76@moroto.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:53:36 -07:00
Jakub Kicinski
9fe1450f6d Merge branch 'net-remove-last-of-the-phylink-validate-methods-and-clean-up'
Russell King says:

====================
net: remove last of the phylink validate methods and clean up

This four patch series removes the last of the phylink MAC .validate
methods which can be found in the Freescale fman driver. fman has a
requirement that half duplex may not be supported in RGMII mode,
which is currently handled in its .validate method.

In order to keep this functionality when removing the .validate method,
we need to replace that with equivalent functionality, for which I
propose the optional .mac_get_caps method in the first patch.

The advantage of this approach over the .validate callback is that MAC
drivers only have to deal with the MAC_* capabilities, and don't need
to call back into phylink functions to do the masking of the ethtool
linkmodes etc - which then becomes internal to phylink. This can be
seen in the fourth patch where we make a load of these methods static.
====================

Link: https://lore.kernel.org/r/ZS1Z5DDfHyjMryYu@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:51:53 -07:00
Russell King (Oracle)
743f639762 net: phylink: remove a bunch of unused validation methods
Remove exports for phylink_caps_to_linkmodes(),
phylink_get_capabilities(), phylink_validate_mask_caps() and
phylink_generic_validate(). Also, as phylink_generic_validate() is no
longer called, we can remove its implementation as well.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/E1qsPkK-009wip-W9@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:51:53 -07:00
Russell King (Oracle)
da5f6b80ad net: phylink: remove .validate() method
The MAC .validate() method is no longer used, so remove it from the
phylink_mac_ops structure, and remove the callsite in
phylink_validate_mac_and_pcs().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/E1qsPkF-009wij-QM@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:51:53 -07:00
Russell King (Oracle)
2141297d42 net: fman: convert to .mac_get_caps()
Convert fman to use the .mac_get_caps() method rather than the
.validate() method.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Sean Anderson <sean.anderson@seco.com>
Link: https://lore.kernel.org/r/E1qsPkA-009wid-Kv@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:51:19 -07:00
Russell King (Oracle)
b6f9774719 net: phylink: provide mac_get_caps() method
Provide a new method, mac_get_caps() to get the MAC capabilities for
the specified interface mode. This is for MACs which have special
requirements, such as not supporting half-duplex in certain interface
modes, and will replace the validate() method.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/E1qsPk5-009wiX-G5@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:51:02 -07:00
Jakub Kicinski
73b24e7ce8 eth: bnxt: fix backward compatibility with older devices
Recent FW interface update bumped the size of struct hwrm_func_cfg_input
above 128B which is the max some devices support.

Probe on Stratus (BCM957452) with FW 20.8.3.11 fails with:

   bnxt_en ...: Unable to reserve tx rings
   bnxt_en ...: 2nd rings reservation failed.
   bnxt_en ...: Not enough rings available.

Once probe is fixed other errors pop up:

   bnxt_en ...: Failed to set async event completion ring.

This is because __hwrm_send() rejects requests larger than
bp->hwrm_max_ext_req_len with -E2BIG. Since the driver doesn't
actually access any of the new fields, yet, trim the length.
It should be safe.

Similar workaround exists for backing_store_cfg_input.
Although that one mins() to a constant of 256, not 128
we'll effectively use here. Michael explains: "the backing
store cfg command is supported by relatively newer firmware
that will accept 256 bytes at least."

To make debugging easier in the future add a warning
for oversized requests.

Fixes: 754fbf604f ("bnxt_en: Update firmware interface to 1.10.2.171")
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231016171640.1481493-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:50:55 -07:00
Eric Dumazet
cbfbfe3aee tun: prevent negative ifindex
After commit 956db0a13b ("net: warn about attempts to register
negative ifindex") syzbot is able to trigger the following splat.

Negative ifindex are not supported.

WARNING: CPU: 1 PID: 6003 at net/core/dev.c:9596 dev_index_reserve+0x104/0x210
Modules linked in:
CPU: 1 PID: 6003 Comm: syz-executor926 Not tainted 6.6.0-rc4-syzkaller-g19af4a4ed414 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : dev_index_reserve+0x104/0x210
lr : dev_index_reserve+0x100/0x210
sp : ffff800096a878e0
x29: ffff800096a87930 x28: ffff0000d04380d0 x27: ffff0000d04380f8
x26: ffff0000d04380f0 x25: 1ffff00012d50f20 x24: 1ffff00012d50f1c
x23: dfff800000000000 x22: ffff8000929c21c0 x21: 00000000ffffffea
x20: ffff0000d04380e0 x19: ffff800096a87900 x18: ffff800096a874c0
x17: ffff800084df5008 x16: ffff80008051f9c4 x15: 0000000000000001
x14: 1fffe0001a087198 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
x8 : ffff0000d41c9bc0 x7 : 0000000000000000 x6 : 0000000000000000
x5 : ffff800091763d88 x4 : 0000000000000000 x3 : ffff800084e04748
x2 : 0000000000000001 x1 : 00000000fead71c7 x0 : 0000000000000000
Call trace:
dev_index_reserve+0x104/0x210
register_netdevice+0x598/0x1074 net/core/dev.c:10084
tun_set_iff+0x630/0xb0c drivers/net/tun.c:2850
__tun_chr_ioctl+0x788/0x2af8 drivers/net/tun.c:3118
tun_chr_ioctl+0x38/0x4c drivers/net/tun.c:3403
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl fs/ioctl.c:857 [inline]
__arm64_sys_ioctl+0x14c/0x1c8 fs/ioctl.c:857
__invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51
el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155
el0_svc+0x58/0x16c arch/arm64/kernel/entry-common.c:678
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:696
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:595
irq event stamp: 11348
hardirqs last enabled at (11347): [<ffff80008a716574>] __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:151 [inline]
hardirqs last enabled at (11347): [<ffff80008a716574>] _raw_spin_unlock_irqrestore+0x38/0x98 kernel/locking/spinlock.c:194
hardirqs last disabled at (11348): [<ffff80008a627820>] el1_dbg+0x24/0x80 arch/arm64/kernel/entry-common.c:436
softirqs last enabled at (11138): [<ffff8000887ca53c>] spin_unlock_bh include/linux/spinlock.h:396 [inline]
softirqs last enabled at (11138): [<ffff8000887ca53c>] release_sock+0x15c/0x1b0 net/core/sock.c:3531
softirqs last disabled at (11136): [<ffff8000887ca41c>] spin_lock_bh include/linux/spinlock.h:356 [inline]
softirqs last disabled at (11136): [<ffff8000887ca41c>] release_sock+0x3c/0x1b0 net/core/sock.c:3518

Fixes: fb7589a162 ("tun: Add ability to create tun device with given index")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20231016180851.3560092-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:44:51 -07:00
Icenowy Zheng
278be83601 LoongArch: Disable WUC for pgprot_writecombine() like ioremap_wc()
Currently the code disables WUC only disables it for ioremap_wc(), which
is only used when mapping writecombine pages like ioremap() (mapped to
the kernel space). But for VRAM mapped in TTM/GEM, it is mapped with a
crafted pgprot by the pgprot_writecombine() function, in which case WUC
isn't disabled now.

Disable WUC for pgprot_writecombine() (fallback to SUC) if needed, like
ioremap_wc().

This improves the AMDGPU driver's stability (solves some misrendering)
on Loongson-3A5000/3A6000 machines.

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-10-18 08:42:52 +08:00
Huacai Chen
477a0ebec1 LoongArch: Replace kmap_atomic() with kmap_local_page() in copy_user_highpage()
Replace kmap_atomic()/kunmap_atomic() calls with kmap_local_page()/
kunmap_local() in copy_user_highpage() which can be invoked from both
preemptible and atomic context [1].

[1] https://lore.kernel.org/all/20201029222652.302358281@linutronix.de/

Suggested-by: Deepak R Varma <drv@mailo.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-10-18 08:42:52 +08:00
Huacai Chen
449c2756c2 LoongArch: Export symbol invalid_pud_table for modules building
Export symbol invalid_pud_table for modules building (such as the KVM
module) if 4-level page tables enabled. Otherwise we get:

ERROR: modpost: "invalid_pud_table" [arch/loongarch/kvm/kvm.ko] undefined!

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-10-18 08:42:52 +08:00
Tiezhu Yang
00c2ca84c6 LoongArch: Use SYM_CODE_* to annotate exception handlers
As described in include/linux/linkage.h,

  FUNC -- C-like functions (proper stack frame etc.)
  CODE -- non-C code (e.g. irq handlers with different, special stack etc.)

  SYM_FUNC_{START, END} -- use for global functions
  SYM_CODE_{START, END} -- use for non-C (special) functions

So use SYM_CODE_* to annotate exception handlers.

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-10-18 08:42:52 +08:00
Jakub Kicinski
99e79b677b Merge branch 'bridge-add-a-limit-on-learned-fdb-entries'
Johannes Nixdorf says:

====================
bridge: Add a limit on learned FDB entries

Introduce a limit on the amount of learned FDB entries on a bridge,
configured by netlink with a build time default on bridge creation in
the kernel config.

For backwards compatibility the kernel config default is disabling the
limit (0).

Without any limit a malicious actor may OOM a kernel by spamming packets
with changing MAC addresses on their bridge port, so allow the bridge
creator to limit the number of entries.

Currently the manual entries are identified by the bridge flags
BR_FDB_LOCAL or BR_FDB_ADDED_BY_USER, atomically bundled under the new
flag BR_FDB_DYNAMIC_LEARNED. This means the limit also applies to
entries created with BR_FDB_ADDED_BY_EXT_LEARN but none of BR_FDB_LOCAL
or BR_FDB_ADDED_BY_USER, e.g. ones added by SWITCHDEV_FDB_ADD_TO_BRIDGE.

Link to the corresponding iproute2 changes:
https://lore.kernel.org/r/20230919-fdb_limit-v4-1-b4d2dc4df30f@avm.de

v4: https://lore.kernel.org/r/20230919-fdb_limit-v4-0-39f0293807b8@avm.de/
v3: https://lore.kernel.org/r/20230905-fdb_limit-v3-0-7597cd500a82@avm.de/
v2: https://lore.kernel.org/netdev/20230619071444.14625-1-jnixdorf-oss@avm.de/
v1: https://lore.kernel.org/netdev/20230515085046.4457-1-jnixdorf-oss@avm.de/
====================

Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-0-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:04 -07:00
Johannes Nixdorf
6f84090333 selftests: forwarding: bridge_fdb_learning_limit: Add a new selftest
Add a suite covering the fdb_n_learned and fdb_max_learned bridge
features, touching all special cases in accounting at least once.

Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de>
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-5-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:02 -07:00
Johannes Nixdorf
19297c3ab2 net: bridge: Set strict_start_type for br_policy
Set any new attributes added to br_policy to be parsed strictly, to
prevent userspace from passing garbage.

Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-4-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:02 -07:00
Johannes Nixdorf
ddd1ad6882 net: bridge: Add netlink knobs for number / max learned FDB entries
The previous patch added accounting and a limit for the number of
dynamically learned FDB entries per bridge. However it did not provide
means to actually configure those bounds or read back the count. This
patch does that.

Two new netlink attributes are added for the accounting and limit of
dynamically learned FDB entries:
 - IFLA_BR_FDB_N_LEARNED (RO) for the number of entries accounted for
   a single bridge.
 - IFLA_BR_FDB_MAX_LEARNED (RW) for the configured limit of entries for
   the bridge.

The new attributes are used like this:

 # ip link add name br up type bridge fdb_max_learned 256
 # ip link add name v1 up master br type veth peer v2
 # ip link set up dev v2
 # mausezahn -a rand -c 1024 v2
 0.01 seconds (90877 packets per second
 # bridge fdb | grep -v permanent | wc -l
 256
 # ip -d link show dev br
 13: br: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 [...]
     [...] fdb_n_learned 256 fdb_max_learned 256

Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-3-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:01 -07:00
Johannes Nixdorf
bdb4dfda3b net: bridge: Track and limit dynamically learned FDB entries
A malicious actor behind one bridge port may spam the kernel with packets
with a random source MAC address, each of which will create an FDB entry,
each of which is a dynamic allocation in the kernel.

There are roughly 2^48 different MAC addresses, further limited by the
rhashtable they are stored in to 2^31. Each entry is of the type struct
net_bridge_fdb_entry, which is currently 128 bytes big. This means the
maximum amount of memory allocated for FDB entries is 2^31 * 128B =
256GiB, which is too much for most computers.

Mitigate this by maintaining a per bridge count of those automatically
generated entries in fdb_n_learned, and a limit in fdb_max_learned. If
the limit is hit new entries are not learned anymore.

For backwards compatibility the default setting of 0 disables the limit.

User-added entries by netlink or from bridge or bridge port addresses
are never blocked and do not count towards that limit.

Introduce a new fdb entry flag BR_FDB_DYNAMIC_LEARNED to keep track of
whether an FDB entry is included in the count. The flag is enabled for
dynamically learned entries, and disabled for all other entries. This
should be equivalent to BR_FDB_ADDED_BY_USER and BR_FDB_LOCAL being unset,
but contrary to the two flags it can be toggled atomically.

Atomicity is required here, as there are multiple callers that modify the
flags, but are not under a common lock (br_fdb_update is the exception
for br->hash_lock, br_fdb_external_learn_add for RTNL).

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de>
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-2-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:01 -07:00
Johannes Nixdorf
cbf51acbc5 net: bridge: Set BR_FDB_ADDED_BY_USER early in fdb_add_entry
In preparation of the following fdb limit for dynamically learned entries,
allow fdb_create to detect that the entry was added by the user. This
way it can skip applying the limit in this case.

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Johannes Nixdorf <jnixdorf-oss@avm.de>
Link: https://lore.kernel.org/r/20231016-fdb_limit-v5-1-32cddff87758@avm.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:39:01 -07:00
Neal Cardwell
1c2709cfff tcp: fix excessive TLP and RACK timeouts from HZ rounding
We discovered from packet traces of slow loss recovery on kernels with
the default HZ=250 setting (and min_rtt < 1ms) that after reordering,
when receiving a SACKed sequence range, the RACK reordering timer was
firing after about 16ms rather than the desired value of roughly
min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer
calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels
with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the
exact same issue.

This commit fixes the TLP transmit timer and RACK reordering timer
floor calculation to more closely match the intended 2ms floor even on
kernels with HZ=250. It does this by adding in a new
TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies,
instead of the current approach of converting to jiffies and then
adding th TCP_TIMEOUT_MIN value of 2 jiffies.

Our testing has verified that on kernels with HZ=1000, as expected,
this does not produce significant changes in behavior, but on kernels
with the default HZ=250 the latency improvement can be large. For
example, our tests show that for HZ=250 kernels at low RTTs this fix
roughly halves the latency for the RACK reorder timer: instead of
mostly firing at 16ms it mostly fires at 8ms.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: bb4d991a28 ("tcp: adjust tail loss probe timeout")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231015174700.2206872-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17 17:25:42 -07:00
Linus Torvalds
06dc10eae5 fbdev fixes and cleanups for 6.6-rc7:
various minor fixes, cleanups and annotations for atyfb, sa1100fb,
 omapfb, uvesafb and mmp.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQS86RI+GtKfB8BJu973ErUQojoPXwUCZS7h/AAKCRD3ErUQojoP
 X041AQCov386rrMGpY3v4qmU2X0EIAeQjlgs8hElA3YWIWyg2gEA3bt/PFvod8wV
 Vhl9gQQJropICSb1zbUZ0aa0rhil6QQ=
 =5LoM
 -----END PGP SIGNATURE-----

Merge tag 'fbdev-for-6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev

Pull fbdev fixes and cleanups from Helge Deller:
 "Various minor fixes, cleanups and annotations for atyfb, sa1100fb,
  omapfb, uvesafb and mmp"

* tag 'fbdev-for-6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
  fbdev: core: syscopyarea: fix sloppy typing
  fbdev: core: cfbcopyarea: fix sloppy typing
  fbdev: uvesafb: Call cn_del_callback() at the end of uvesafb_exit()
  fbdev: uvesafb: Remove uvesafb_exec() prototype from include/video/uvesafb.h
  fbdev: sa1100fb: mark sa1100fb_init() static
  fbdev: omapfb: fix some error codes
  fbdev: atyfb: only use ioremap_uc() on i386 and ia64
  fbdev: mmp: Annotate struct mmp_path with __counted_by
  fbdev: mmp: Annotate struct mmphw_ctrl with __counted_by
2023-10-17 17:14:22 -07:00