8575 Commits

Author SHA1 Message Date
Yonghong Song
3144bfa507 bpf: Fix a compilation failure with clang lto build
When building the kernel with clang lto (CONFIG_LTO_CLANG_FULL=y), the
following compilation error will appear:

  $ make LLVM=1 LLVM_IAS=1 -j
  ...
  ld.lld: error: ld-temp.o <inline asm>:26889:1: symbol 'cgroup_storage_map_btf_ids' is already defined
  cgroup_storage_map_btf_ids:;
  ^
  make[1]: *** [/.../bpf-next/scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1

In local_storage.c, we have
  BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
Commit c4bcfb38a95e ("bpf: Implement cgroup storage available to
non-cgroup-attached bpf progs") added the above identical BTF_ID_LIST_SINGLE
definition in bpf_cgrp_storage.c. With duplicated definitions, llvm linker
complains with lto build.

Also, extracting btf_id of 'struct bpf_local_storage_map' is defined four times
for sk, inode, task and cgrp local storages. Let us define a single global one
with a different name than cgroup_storage_map_btf_ids, which also fixed
the lto compilation error.

Fixes: c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221130052147.1591625-1-yhs@fb.com
2022-11-30 17:13:25 -08:00
Pengcheng Yang
a351d6087b bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes
When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS
flag from the msg->flags. If apply_bytes is used and it is larger than
the current data being processed, sk_psock_msg_verdict() will not be
called when sendmsg() is called again. At this time, the msg->flags is 0,
and we lost the BPF_F_INGRESS flag.

So we need to save the BPF_F_INGRESS flag in sk_psock and use it when
redirection.

Fixes: 8934ce2fd081 ("bpf: sockmap redirect ingress support")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/1669718441-2654-3-git-send-email-yangpc@wangsu.com
2022-12-01 01:07:32 +01:00
Joel Fernandes (Google)
483c26ff63 net: Use call_rcu_hurry() for dst_release()
In a networking test on ChromeOS, kernels built with the new
CONFIG_RCU_LAZY=y Kconfig option fail a networking test in the teardown
phase.

This failure may be reproduced as follows: ip netns del <name>

The CONFIG_RCU_LAZY=y Kconfig option was introduced by earlier commits
in this series for the benefit of certain battery-powered systems.
This Kconfig option causes call_rcu() to delay its callbacks in order
to batch them.  This means that a given RCU grace period covers more
callbacks, thus reducing the number of grace periods, in turn reducing
the amount of energy consumed, which increases battery lifetime which
can be a very good thing.  This is not a subtle effect: In some important
use cases, the battery lifetime is increased by more than 10%.

This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.

Delaying callbacks is normally not a problem because most callbacks do
nothing but free memory.  If the system is short on memory, a shrinker
will kick all currently queued lazy callbacks out of their laziness,
thus freeing their memory in short order.  Similarly, the rcu_barrier()
function, which blocks until all currently queued callbacks are invoked,
will also kick lazy callbacks, thus enabling rcu_barrier() to complete
in a timely manner.

However, there are some cases where laziness is not a good option.
For example, synchronize_rcu() invokes call_rcu(), and blocks until
the newly queued callback is invoked.  It would not be a good for
synchronize_rcu() to block for ten seconds, even on an idle system.
Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of
call_rcu().  The arrival of a non-lazy call_rcu_hurry() callback on a
given CPU kicks any lazy callbacks that might be already queued on that
CPU.  After all, if there is going to be a grace period, all callbacks
might as well get full benefit from it.

Yes, this could be done the other way around by creating a
call_rcu_lazy(), but earlier experience with this approach and
feedback at the 2022 Linux Plumbers Conference shifted the approach
to call_rcu() being lazy with call_rcu_hurry() for the few places
where laziness is inappropriate.

Returning to the test failure, use of ftrace showed that this failure
cause caused by the aadded delays due to this new lazy behavior of
call_rcu() in kernels built with CONFIG_RCU_LAZY=y.

Therefore, make dst_release() use call_rcu_hurry() in order to revert
to the old test-failure-free behavior.

[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: David Ahern <dsahern@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: <netdev@vger.kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-30 13:17:25 -08:00
Jakub Kicinski
f2bb566f5c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/lib/bpf/ringbuf.c
  927cbb478adf ("libbpf: Handle size overflow for ringbuf mmap")
  b486d19a0ab0 ("libbpf: checkpatch: Fixed code alignments in ringbuf.c")
https://lore.kernel.org/all/20221121122707.44d1446a@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29 13:04:52 -08:00
Jiri Pirko
7666dbec72 net: devlink: add WARN_ON_ONCE to check return value of unregister_netdevice_notifier_net() call
As the return value is not 0 only in case there is no such notifier
block registered, add a WARN_ON_ONCE() to yell about it.

Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20221125100255.1786741-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-29 13:37:53 +01:00
Miquel Raynal
4c47867bc7 of: net: export of_get_mac_address_nvmem()
Export

	of_get_mac_addr_nvmem()

and rename it to

	of_get_mac_address_nvmem()

in order to fit the convention followed by the existing exported helpers
of the same kind.

This way, OF compatible drivers using eg. fwnode_get_mac_address() can
do a direct call to it instead of calling of_get_mac_address() just for
the nvmem step, avoiding to repeat an expensive DT lookup which has
already been done once.

Eventually, fwnode_get_mac_address() should probably be updated to
perform the nvmem lookup directly, but as of today, nvmem cells seem not
to be supported by ACPI yet which would defeat this kind of extension.

Suggested-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-29 10:45:53 +01:00
Jakub Kicinski
d6dc62fca6 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY4AC5QAKCRDbK58LschI
 g1e0AQCfAqduTy7mYd02jDNCV0wLphNp9FbPiP9OrQT37ABpKAEA1ulj1X59bX3d
 HnZdDKuatcPZT9MV5hDLM7MFJ9GjOA4=
 =fNmM
 -----END PGP SIGNATURE-----

Daniel Borkmann says:

====================
bpf-next 2022-11-25

We've added 101 non-merge commits during the last 11 day(s) which contain
a total of 109 files changed, 8827 insertions(+), 1129 deletions(-).

The main changes are:

1) Support for user defined BPF objects: the use case is to allocate own
   objects, build own object hierarchies and use the building blocks to
   build own data structures flexibly, for example, linked lists in BPF,
   from Kumar Kartikeya Dwivedi.

2) Add bpf_rcu_read_{,un}lock() support for sleepable programs,
   from Yonghong Song.

3) Add support storing struct task_struct objects as kptrs in maps,
   from David Vernet.

4) Batch of BPF map documentation improvements, from Maryam Tahhan
   and Donald Hunter.

5) Improve BPF verifier to propagate nullness information for branches
   of register to register comparisons, from Eduard Zingerman.

6) Fix cgroup BPF iter infra to hold reference on the start cgroup,
   from Hou Tao.

7) Fix BPF verifier to not mark fentry/fexit program arguments as trusted
   given it is not the case for them, from Alexei Starovoitov.

8) Improve BPF verifier's realloc handling to better play along with dynamic
   runtime analysis tools like KASAN and friends, from Kees Cook.

9) Remove legacy libbpf mode support from bpftool,
   from Sahid Orentino Ferdjaoui.

10) Rework zero-len skb redirection checks to avoid potentially breaking
    existing BPF test infra users, from Stanislav Fomichev.

11) Two small refactorings which are independent and have been split out
    of the XDP queueing RFC series, from Toke Høiland-Jørgensen.

12) Fix a memory leak in LSM cgroup BPF selftest, from Wang Yufen.

13) Documentation on how to run BPF CI without patch submission,
    from Daniel Müller.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================

Link: https://lore.kernel.org/r/20221125012450.441-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-28 19:42:17 -08:00
Greg Kroah-Hartman
23680f0b7d driver core: make struct class.dev_uevent() take a const *
The dev_uevent() in struct class should not be modifying the device that
is passed into it, so mark it as a const * and propagate the function
signature changes out into all relevant subsystems that use this
callback.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Russ Weight <russell.h.weight@intel.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Raed Salem <raeds@nvidia.com>
Cc: Chen Zhongjin <chenzhongjin@huawei.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Avihai Horon <avihaih@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jakob Koschel <jakobkoschel@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Wang Yufen <wangyufen@huawei.com>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: linux-pm@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20221123122523.1332370-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-24 17:12:15 +01:00
Jakub Kicinski
06ccc8ec70 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
ipsec 2022-11-23

1) Fix "disable_policy" on ipv4 early demuxP Packets after
   the initial packet in a flow might be incorectly dropped
   on early demux if there are no matching policies.
   From Eyal Birger.

2) Fix a kernel warning in case XFRM encap type is not
   available. From Eyal Birger.

3) Fix ESN wrap around for GSO to avoid a double usage of a
    sequence number. From Christian Langrock.

4) Fix a send_acquire race with pfkey_register.
   From Herbert Xu.

5) Fix a list corruption panic in __xfrm_state_delete().
   Thomas Jarosch.

6) Fix an unchecked return value in xfrm6_init().
   Chen Zhongjin.

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: Fix ignored return value in xfrm6_init()
  xfrm: Fix oops in __xfrm_state_delete()
  af_key: Fix send_acquire race with pfkey_register
  xfrm: replay: Fix ESN wrap around for GSO
  xfrm: lwtunnel: squelch kernel warning in case XFRM encap type is not available
  xfrm: fix "disable_policy" on ipv4 early demux
====================

Link: https://lore.kernel.org/r/20221123093117.434274-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-23 19:18:59 -08:00
Moshe Shemesh
815bc3ac75 devlink: remove redundant health state set to error
Reporter health_state is set twice to error in devlink_health_report().
Remove second time as it is redundant.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/1668933412-5498-1-git-send-email-moshe@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-22 20:41:38 -08:00
Greg Kroah-Hartman
02a476d932 kobject: make kobject_get_ownership() take a constant kobject *
The call, kobject_get_ownership(), does not modify the kobject passed
into it, so make it const.  This propagates down into the kobj_type
function callbacks so make the kobject passed into them also const,
ensuring that nothing in the kobject is being changed here.

This helps make it more obvious what calls and callbacks do, and do not,
modify structures passed to them.

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: linux-nfs@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20221121094649.1556002-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-22 17:34:29 +01:00
Eric Dumazet
32634819ad net: fix __sock_gen_cookie()
I was mistaken how atomic64_try_cmpxchg(&sk_cookie, &res, new)
is working.

I was assuming @res would contain the final sk_cookie value,
regardless of the success of our cmpxchg()

We could do something like:

if (atomic64_try_cmpxchg(&sk_cookie, &res, new)
	res = new;

But we can avoid a conditional and read sk_cookie again.

atomic64_cmpxchg(&sk_cookie, res, new);
res = atomic64_read(&sk_cookie);

Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1527347 ("Error handling issues")
Fixes: 4ebf802cf1c6 ("net: __sock_gen_cookie() cleanup")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221118043843.3703186-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-21 20:36:30 -08:00
Stanislav Fomichev
114039b342 bpf: Move skb->len == 0 checks into __bpf_redirect
To avoid potentially breaking existing users.

Both mac/no-mac cases have to be amended; mac_header >= network_header
is not enough (verified with a new test, see next patch).

Fixes: fd1894224407 ("bpf: Don't redirect packets with invalid pkt_len")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20221121180340.1983627-1-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-11-21 12:47:04 -08:00
Greg Kroah-Hartman
05df6ab8eb Merge 6.1-rc6 into driver-core-next
We need the kernfs changes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-21 10:21:53 +01:00
Maryam Tahhan
d1e91173cd bpf, docs: DEVMAPs and XDP_REDIRECT
Add documentation for BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH
including kernel version introduced, usage and examples.

Add documentation that describes XDP_REDIRECT.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221115144921.165483-1-mtahhan@redhat.com
2022-11-18 23:16:31 +01:00
Daniel Xu
52d1aa8b82 netfilter: conntrack: Fix data-races around ct mark
nf_conn:mark can be read from and written to in parallel. Use
READ_ONCE()/WRITE_ONCE() for reads and writes to prevent unwanted
compiler optimizations.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-18 15:21:00 +01:00
Eric Dumazet
fd896e38e5 net: fix napi_disable() logic error
Dan reported a new warning after my recent patch:

New smatch warnings:
net/core/dev.c:6409 napi_disable() error: uninitialized symbol 'new'.

Indeed, we must first wait for STATE_SCHED and STATE_NPSVC to be cleared,
to make sure @new variable has been initialized properly.

Fixes: 4ffa1d1c6842 ("net: adopt try_cmpxchg() in napi_{enable|disable}()")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-18 12:04:39 +00:00
Thomas Zeitlhofer
8207f253a0 net: neigh: decrement the family specific qlen
Commit 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit
per-device") introduced the length counter qlen in struct neigh_parms.
There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and
while the family specific qlen is incremented in pneigh_enqueue(), the
mentioned commit decrements always the IPv4/ARP specific qlen,
regardless of the currently processed family, in pneigh_queue_purge()
and neigh_proxy_process().

As a result, with IPv6/ND, the family specific qlen is only incremented
(and never decremented) until it exceeds PROXY_QLEN, and then, according
to the check in pneigh_enqueue(), neighbor solicitations are not
answered anymore. As an example, this is noted when using the
subnet-router anycast address to access a Linux router. After a certain
amount of time (in the observed case, qlen exceeded PROXY_QLEN after two
days), the Linux router stops answering neighbor solicitations for its
subnet-router anycast address and effectively becomes unreachable.

Another result with IPv6/ND is that the IPv4/ARP specific qlen is
decremented more often than incremented. This leads to negative qlen
values, as a signed integer has been used for the length counter qlen,
and potentially to an integer overflow.

Fix this by introducing the helper function neigh_parms_qlen_dec(),
which decrements the family specific qlen. Thereby, make use of the
existing helper function neigh_get_dev_parms_rcu(), whose definition
therefore needs to be placed earlier in neighbour.c. Take the family
member from struct neigh_table to determine the currently processed
family and appropriately call neigh_parms_qlen_dec() from
pneigh_queue_purge() and neigh_proxy_process().

Additionally, use an unsigned integer for the length counter qlen.

Fixes: 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device")
Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-18 10:29:50 +00:00
Michal Wilczynski
f2fc15e271 devlink: Allow to set up parent in devl_rate_leaf_create()
Currently the driver is able to create leaf nodes for the devlink-rate,
but is unable to set parent for them. This wasn't as issue before the
possibility to export hierarchy from the driver. After adding the export
feature, in order for the driver to supply correct hierarchy, it's
necessary for it to be able to supply a parent name to
devl_rate_leaf_create().

Introduce a new parameter 'parent_name' in devl_rate_leaf_create().

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:27 -08:00
Michal Wilczynski
04d674f04e devlink: Allow for devlink-rate nodes parent reassignment
Currently it's not possible to reassign the parent of the node using one
command. As the previous commit introduced a way to export entire
hierarchy from the driver, being able to modify and reassign parents
become important. This way user might easily change QoS settings without
interrupting traffic.

Example command:
devlink port function rate set pci/0000:4b:00.0/1 parent node_custom_1

This reassigns leaf node parent to node_custom_1.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:26 -08:00
Michal Wilczynski
caba177d7f devlink: Enable creation of the devlink-rate nodes from the driver
Intel 100G card internal firmware hierarchy for Hierarchicial QoS is very
rigid and can't be easily removed. This requires an ability to export
default hierarchy to allow user to modify it. Currently the driver is
only able to create the 'leaf' nodes, which usually represent the vport.
This is not enough for HQoS implemented in Intel hardware.

Introduce new function devl_rate_node_create() that allows for creation
of the devlink-rate nodes from the driver.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:26 -08:00
Michal Wilczynski
6e2d7e84fc devlink: Introduce new attribute 'tx_weight' to devlink-rate
To fully utilize offload capabilities of Intel 100G card QoS capabilities
new attribute 'tx_weight' needs to be introduced. This attribute allows
for usage of Weighted Fair Queuing arbitration scheme among siblings.
This arbitration scheme can be used simultaneously with the strict
priority.

Introduce new attribute in devlink-rate that will allow for configuration
of Weighted Fair Queueing. New attribute is optional.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:26 -08:00
Michal Wilczynski
cd50223683 devlink: Introduce new attribute 'tx_priority' to devlink-rate
To fully utilize offload capabilities of Intel 100G card QoS capabilities
new attribute 'tx_priority' needs to be introduced. This attribute allows
for usage of strict priority arbiter among siblings. This arbitration
scheme attempts to schedule nodes based on their priority as long as the
nodes remain within their bandwidth limit.

Introduce new attribute in devlink-rate that will allow for configuration
of strict priority. New attribute is optional.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:25 -08:00
Jason A. Donenfeld
e8a533cbeb treewide: use get_random_u32_inclusive() when possible
These cases were done with this Coccinelle:

@@
expression H;
expression L;
@@
- (get_random_u32_below(H) + L)
+ get_random_u32_inclusive(L, H + L - 1)

@@
expression H;
expression L;
expression E;
@@
  get_random_u32_inclusive(L,
  H
- + E
- - E
  )

@@
expression H;
expression L;
expression E;
@@
  get_random_u32_inclusive(L,
  H
- - E
- + E
  )

@@
expression H;
expression L;
expression E;
expression F;
@@
  get_random_u32_inclusive(L,
  H
- - E
  + F
- + E
  )

@@
expression H;
expression L;
expression E;
expression F;
@@
  get_random_u32_inclusive(L,
  H
- + E
  + F
- - E
  )

And then subsequently cleaned up by hand, with several automatic cases
rejected if it didn't make sense contextually.

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-18 02:18:02 +01:00
Jason A. Donenfeld
8032bf1233 treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
  (E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-18 02:15:15 +01:00
Eric Dumazet
6c1c509778 net: add atomic_long_t to net_device_stats fields
Long standing KCSAN issues are caused by data-race around
some dev->stats changes.

Most performance critical paths already use per-cpu
variables, or per-queue ones.

It is reasonable (and more correct) to use atomic operations
for the slow paths.

This patch adds an union for each field of net_device_stats,
so that we can convert paths that are not yet protected
by a spinlock or a mutex.

netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64

Note that the memcpy() we were using on 64bit arches
had no provision to avoid load-tearing,
while atomic_long_read() is providing the needed protection
at no cost.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16 12:48:44 +00:00
Eric Dumazet
4ebf802cf1 net: __sock_gen_cookie() cleanup
Adopt atomic64_try_cmpxchg() and remove the loop,
to make the intent more obvious.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16 12:42:01 +00:00
Eric Dumazet
4ffa1d1c68 net: adopt try_cmpxchg() in napi_{enable|disable}()
This makes code a bit cleaner.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16 12:42:01 +00:00
Eric Dumazet
1462160c74 net: adopt try_cmpxchg() in napi_schedule_prep() and napi_complete_done()
This makes the code slightly more efficient.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16 12:42:01 +00:00
Eric Dumazet
6af645a5b2 net: net_{enable|disable}_timestamp() optimizations
Adopting atomic_try_cmpxchg() makes the code cleaner.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16 12:42:00 +00:00
Eric Dumazet
57fc05e8e8 net: mm_account_pinned_pages() optimization
Adopt atomic_long_try_cmpxchg() in mm_account_pinned_pages()
as it is slightly more efficient.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16 12:42:00 +00:00
Vladimir Oltean
8c55facecd net: linkwatch: only report IF_OPER_LOWERLAYERDOWN if iflink is actually down
RFC 2863 says:

   The lowerLayerDown state is also a refinement on the down state.
   This new state indicates that this interface runs "on top of" one or
   more other interfaces (see ifStackTable) and that this interface is
   down specifically because one or more of these lower-layer interfaces
   are down.

DSA interfaces are virtual network devices, stacked on top of the DSA
master, but they have a physical MAC, with a PHY that reports a real
link status.

But since DSA (perhaps improperly) uses an iflink to describe the
relationship to its master since commit c084080151e1 ("dsa: set ->iflink
on slave interfaces to the ifindex of the parent"), default_operstate()
will misinterpret this to mean that every time the carrier of a DSA
interface is not ok, it is because of the master being not ok.

In fact, since commit c0a8a9c27493 ("net: dsa: automatically bring user
ports down when master goes down"), DSA cannot even in theory be in the
lowerLayerDown state, because it just calls dev_close_many(), thereby
going down, when the master goes down.

We could revert the commit that creates an iflink between a DSA user
port and its master, especially since now we have an alternative
IFLA_DSA_MASTER which has less side effects. But there may be tooling in
use which relies on the iflink, which has existed since 2009.

We could also probably do something local within DSA to overwrite what
rfc2863_policy() did, in a way similar to hsr_set_operstate(), but this
seems like a hack.

What seems appropriate is to follow the iflink, and check the carrier
status of that interface as well. If that's down too, yes, keep
reporting lowerLayerDown, otherwise just down.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16 09:45:00 +00:00
Kuniyuki Iwashima
ba6aac1516 udp: Access &udp_table via net.
We will soon introduce an optional per-netns hash table
for UDP.

This means we cannot use udp_table directly in most places.

Instead, access it via net->ipv4.udp_table.

The access will be valid only while initialising udp_table
itself and creating/destroying each netns.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-16 09:43:35 +00:00
Felix Fietkau
570d0a588d net: dsa: add support for DSA rx offloading via metadata dst
If a metadata dst is present with the type METADATA_HW_PORT_MUX on a dsa cpu
port netdev, assume that it carries the port number and that there is no DSA
tag present in the skb data.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-15 20:22:07 -08:00
Toke Høiland-Jørgensen
32637e3300 bpf: Expand map key argument of bpf_redirect_map to u64
For queueing packets in XDP we want to add a new redirect map type with
support for 64-bit indexes. To prepare fore this, expand the width of the
'key' argument to the bpf_redirect_map() helper. Since BPF registers are
always 64-bit, this should be safe to do after the fact.

Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221108140601.149971-3-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 09:00:27 -08:00
Kumar Kartikeya Dwivedi
6728aea721 bpf: Refactor btf_struct_access
Instead of having to pass multiple arguments that describe the register,
pass the bpf_reg_state into the btf_struct_access callback. Currently,
all call sites simply reuse the btf and btf_id of the reg they want to
check the access of. The only exception to this pattern is the callsite
in check_ptr_to_map_access, hence for that case create a dummy reg to
simulate PTR_TO_BTF_ID access.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Steen Hegelund
70ea86a0df net: flow_offload: add support for ARP frame matching
This adds a new flow_rule_match_arp function that allows drivers
to be able to dissect ARP frames.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-14 11:24:16 +00:00
Jakub Kicinski
f4c4ca70de bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEET63h6RnJhTJHuKTjXOwUVIRcSScFAmNu2EkACgkQXOwUVIRc
 SSebKhAA0ffmp5jJgEJpQYNABGLYIJcwKkBrGClDbMJLtwCjevGZJajT9fpbCLb1
 eK6EIhdfR0NTO+0KtUVkZ8WMa81OmLEJYdTNtJfNE23ENMpssiAWhlhDF8AoXeKv
 Bo3j719gn3Cw9PWXQoircH3wpj+5RMDnjxy4iYlA5yNrvzC7XVmssMF+WALvQnuK
 CGrfR57hxdgmphmasRqeCzEoriwihwPsG3k6eQN8rf7ZytLhs90tMVgT9L3Cd2u9
 DafA0Xl8mZdz2mHhThcJhQVq4MUymZj44ufuHDiOs1j6nhUlWToyQuvegPOqxKti
 uLGtZul0ls+3UP0Lbrv1oEGU/MWMxyDz4IBc0EVs0k3ItQbmSKs6r9WuPFGd96Sb
 GHk68qFVySeLGN0LfKe3rCHJ9ZoIOPYJg9qT8Rd5bOhetgGwSsxZTxUI39BxkFup
 CEqwIDnts1TMU37GDjj+vssKW91k4jEzMZVtRfsL3J36aJs28k/Ez4AqLXg6WU6u
 ADqFaejVPcXbN9rX90onIYxxiL28gZSeT+i8qOPELZtqTQmNWz+tC/ySVuWnD8Mn
 Nbs7PZ1IWiNZpsKS8pZnpd6j4mlBeJnwXkPKiFy+xHGuwRSRdYl6G9e5CtlRely/
 rwQ8DtaOpRYMrGhnmBEdAOCa9t/iqzrzHzjoigjJ7iAST4ToJ5s=
 =Y+/e
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Andrii Nakryiko says:

====================
bpf-next 2022-11-11

We've added 49 non-merge commits during the last 9 day(s) which contain
a total of 68 files changed, 3592 insertions(+), 1371 deletions(-).

The main changes are:

1) Veristat tool improvements to support custom filtering, sorting, and replay
   of results, from Andrii Nakryiko.

2) BPF verifier precision tracking fixes and improvements,
   from Andrii Nakryiko.

3) Lots of new BPF documentation for various BPF maps, from Dave Tucker,
   Donald Hunter, Maryam Tahhan, Bagas Sanjaya.

4) BTF dedup improvements and libbpf's hashmap interface clean ups, from
   Eduard Zingerman.

5) Fix veth driver panic if XDP program is attached before veth_open, from
   John Fastabend.

6) BPF verifier clean ups and fixes in preparation for follow up features,
   from Kumar Kartikeya Dwivedi.

7) Add access to hwtstamp field from BPF sockops programs,
   from Martin KaFai Lau.

8) Various fixes for BPF selftests and samples, from Artem Savkov,
   Domenico Cerasuolo, Kang Minchul, Rong Tao, Yang Jihong.

9) Fix redirection to tunneling device logic, preventing skb->len == 0, from
   Stanislav Fomichev.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits)
  selftests/bpf: fix veristat's singular file-or-prog filter
  selftests/bpf: Test skops->skb_hwtstamp
  selftests/bpf: Fix incorrect ASSERT in the tcp_hdr_options test
  bpf: Add hwtstamp field for the sockops prog
  selftests/bpf: Fix xdp_synproxy compilation failure in 32-bit arch
  bpf, docs: Document BPF_MAP_TYPE_ARRAY
  docs/bpf: Document BPF map types QUEUE and STACK
  docs/bpf: Document BPF ARRAY_OF_MAPS and HASH_OF_MAPS
  docs/bpf: Document BPF_MAP_TYPE_CPUMAP map
  docs/bpf: Document BPF_MAP_TYPE_LPM_TRIE map
  libbpf: Hashmap.h update to fix build issues using LLVM14
  bpf: veth driver panics when xdp prog attached before veth_open
  selftests: Fix test group SKIPPED result
  selftests/bpf: Tests for btf_dedup_resolve_fwds
  libbpf: Resolve unambigous forward declarations
  libbpf: Hashmap interface update to allow both long and void* keys/values
  samples/bpf: Fix sockex3 error: Missing BPF prog type
  selftests/bpf: Fix u32 variable compared with less than zero
  Documentation: bpf: Escape underscore in BPF type name prefix
  selftests/bpf: Use consistent build-id type for liburandom_read.so
  ...
====================

Link: https://lore.kernel.org/r/20221111233733.1088228-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11 18:33:04 -08:00
Eric Dumazet
be3ed48683 net: gro: no longer use skb_vlan_tag_present()
We can remove a conditional test in gro_list_prepare()
by comparing vlan_all fields of the two skbs.

Notes:

While comparing the vlan_proto is not strictly needed,
because part of the following compare_ether_header() call,
using 32bit word is actually faster than using 16bit values.

napi_reuse_skb() makes sure to clear skb->vlan_all,
as it already calls __vlan_hwaccel_clear_tag()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11 18:18:05 -08:00
Eric Dumazet
354259fa73 net: remove skb->vlan_present
skb->vlan_present seems redundant.

We can instead derive it from this boolean expression:

vlan_present = skb->vlan_proto != 0 || skb->vlan_tci != 0

Add a new union, to access both fields in a single load/store
when possible.

	union {
		u32	vlan_all;
		struct {
		__be16	vlan_proto;
		__u16	vlan_tci;
		};
	};

This allows following patch to remove a conditional test in GRO stack.

Note:
  We move remcsum_offload to keep TC_AT_INGRESS_MASK
  and SKB_MONO_DELIVERY_TIME_MASK unchanged.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11 18:18:05 -08:00
Martin KaFai Lau
9bb053490f bpf: Add hwtstamp field for the sockops prog
The bpf-tc prog has already been able to access the
skb_hwtstamps(skb)->hwtstamp.  This patch extends the same hwtstamp
access to the sockops prog.

In sockops, the skb is also available to the bpf prog during
the BPF_SOCK_OPS_PARSE_HDR_OPT_CB event.  There is a use case
that the hwtstamp will be useful to the sockops prog to better
measure the one-way-delay when the sender has put the tx
timestamp in the tcp header option.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221107230420.4192307-2-martin.lau@linux.dev
2022-11-11 13:18:14 -08:00
Jakub Kicinski
966a9b4903 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/can/pch_can.c
  ae64438be192 ("can: dev: fix skb drop check")
  1dd1b521be85 ("can: remove obsolete PCH CAN driver")
https://lore.kernel.org/all/20221110102509.1f7d63cc@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-10 17:43:53 -08:00
Ido Schimmel
1fb22ed671 devlink: Fix warning when unregistering a port
When a devlink port is unregistered, its type is expected to be unset or
otherwise a WARNING is generated [1]. This was supposed to be handled by
cited commit by clearing the type upon 'NETDEV_PRE_UNINIT'.

The assumption was that no other events can be generated for the netdev
after this event, but this proved to be wrong. After the event is
generated, netdev_wait_allrefs_any() will rebroadcast a
'NETDEV_UNREGISTER' until the netdev's reference count drops to 1. This
causes devlink to set the port type back to Ethernet.

Fix by only setting and clearing the port type upon 'NETDEV_POST_INIT'
and 'NETDEV_PRE_UNINIT', respectively. For all other events, preserve
the port type.

[1]
WARNING: CPU: 0 PID: 11 at net/core/devlink.c:9998 devl_port_unregister+0x2f6/0x390 net/core/devlink.c:9998
Modules linked in:
CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc3-next-20221107-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Workqueue: netns cleanup_net
RIP: 0010:devl_port_unregister+0x2f6/0x390 net/core/devlink.c:9998
[...]
Call Trace:
 <TASK>
 __nsim_dev_port_del+0x1bb/0x240 drivers/net/netdevsim/dev.c:1433
 nsim_dev_port_del_all drivers/net/netdevsim/dev.c:1443 [inline]
 nsim_dev_reload_destroy+0x171/0x510 drivers/net/netdevsim/dev.c:1660
 nsim_dev_reload_down+0x6b/0xd0 drivers/net/netdevsim/dev.c:968
 devlink_reload+0x1c2/0x6b0 net/core/devlink.c:4501
 devlink_pernet_pre_exit+0x104/0x1c0 net/core/devlink.c:12609
 ops_pre_exit_list net/core/net_namespace.c:159 [inline]
 cleanup_net+0x451/0xb10 net/core/net_namespace.c:594
 process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
 worker_thread+0x665/0x1080 kernel/workqueue.c:2436
 kthread+0x2e4/0x3a0 kernel/kthread.c:376
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
 </TASK>

Fixes: 02a68a47eade ("net: devlink: track netdev with devlink_port assigned")
Reported-by: syzbot+85e47e1a08b3e159b159@syzkaller.appspotmail.com
Reported-by: syzbot+c2ca18f0fccdd1f09c66@syzkaller.appspotmail.com
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20221110085150.520800-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-10 13:15:04 -08:00
Richard Gobert
e081ecf084 gro: avoid checking for a failed search
After searching for a protocol handler in dev_gro_receive, checking for
failure is redundant. Skip the failure code after finding the
corresponding handler.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221108123320.GA59373@debian
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-11-10 11:26:47 +01:00
Ido Schimmel
2640a82bbc devlink: Add packet traps for 802.1X operation
Add packet traps for 802.1X operation. The "eapol" control trap is used
to trap EAPOL packets and is required for the correct operation of the
control plane. The "locked_port" drop trap can be enabled to gain
visibility into packets that were dropped by the device due to the
locked bridge port check.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-09 19:06:14 -08:00
Jiri Pirko
15feb56e30 net: devlink: move netdev notifier block to dest namespace during reload
The notifier block tracking netdev changes in devlink is registered
during devlink_alloc() per-net, it is then unregistered
in devlink_free(). When devlink moves from net namespace to another one,
the notifier block needs to move along.

Fix this by adding forgotten call to move the block.

Reported-by: Ido Schimmel <idosch@idosch.org>
Fixes: 02a68a47eade ("net: devlink: track netdev with devlink_port assigned")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-09 13:45:59 -08:00
Jiri Pirko
3e52fba03a net: introduce a helper to move notifier block to different namespace
Currently, net_dev() netdev notifier variant follows the netdev with
per-net notifier from namespace to namespace. This is implemented
by move_netdevice_notifiers_dev_net() helper.

For devlink it is needed to re-register per-net notifier during
devlink reload. Introduce a new helper called
move_netdevice_notifier_net() and share the unregister/register code
with existing move_netdevice_notifiers_dev_net() helper.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-09 13:45:59 -08:00
Greg Kroah-Hartman
fa627348cf driver core: class: make namespace and get_ownership take const *
The callbacks in struct class namespace() and get_ownership() do not
modify the struct device passed to them, so mark the pointer as constant
and fix up all callbacks in the kernel to have the correct function
signature.

This helps make it more obvious what calls and callbacks do, and do not,
modify structures passed to them.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20221001165426.2690912-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-09 15:49:32 +01:00
David S. Miller
3ca6c3b43c rxrpc changes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmNq0Q8ACgkQ+7dXa6fL
 C2toRxAAmmvce10i3hcS6ke0PvB4gPu6ZSuaQWO3KxpP9jz6lV7M+cfFOh9N3neG
 uEe6ms4Kzt/BJIBm+aMdXW84648sV5vOqdrNGBOb2cJikaiTkj9x730klSdwOVr2
 epEELoj/IEWZZz/d9U05uq26VUtnxsc/Enzkq/GIaENSVauYWaZXrHdKzrzUZYjk
 gEbspFSpQEJqu5slRl2XGos4tMHHvTIkehoLH9KM4YmC5WGf1kKYz/6v38PIhc/9
 mEBsUqQlTVsUPNcOXWBY24HJKY91CBgowhbTQIxyJNydHPJYPVJ8U5nNp1g1CYmu
 URdvvX8IyIR0zX2RcVlc9vnWQ+p5NoTjxjwc1iKjnBsofCmqDucie6Iz2vis7Zl6
 6s6N1FZSYQTX0fbBbf00efWaG/3I/ynRhcW+zM9NcozHzpRxyuptDlKSOVORXRG7
 gy7+sID2y5dLqCg9ukTIx1y9Njt+uryosBOajCMaaAy0VgXEsETFO8UxbodUAu6N
 ubmPwGO42bY//c+fJWRAjT9tjhzp2fWK4rgrgd3VG4cYrjq2W21EMwyjzilVp2dM
 ZlvWoWJptIqEhPtWU8nf3i759XE+FOWKt9ns1FupKB+0msht1p2HBj88bue8TrKk
 CcV1dY9cohNzgRFXvXcgSLvSCioT31Q//mGmXWLif7teOXIUN4A=
 =q04p
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-next-20221108' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

rxrpc changes

David Howells says:

====================
rxrpc: Increasing SACK size and moving away from softirq, part 1

AF_RXRPC has some issues that need addressing:

 (1) The SACK table has a maximum capacity of 255, but for modern networks
     that isn't sufficient.  This is hard to increase in the upstream code
     because of the way the application thread is coupled to the softirq
     and retransmission side through a ring buffer.  Adjustments to the rx
     protocol allows a capacity of up to 8192, and having a ring
     sufficiently large to accommodate that would use an excessive amount
     of memory as this is per-call.

 (2) Processing ACKs in softirq mode causes the ACKs get conflated, with
     only the most recent being considered.  Whilst this has the upside
     that the retransmission algorithm only needs to deal with the most
     recent ACK, it causes DATA transmission for a call to be very bursty
     because DATA packets cannot be transmitted in softirq mode.  Rather
     transmission must be delegated to either the application thread or a
     workqueue, so there tend to be sudden bursts of traffic for any
     particular call due to scheduling delays.

 (3) All crypto in a single call is done in series; however, each DATA
     packet is individually encrypted so encryption and decryption of large
     calls could be parallelised if spare CPU resources are available.

This is the first of a number of sets of patches that try and address them.
The overall aims of these changes include:

 (1) To get rid of the TxRx ring and instead pass the packets round in
     queues (eg. sk_buff_head).  On the Tx side, each ACK packet comes with
     a SACK table that can be parsed as-is, so there's no particular need
     to maintain our own; we just have to refer to the ACK.

     On the Rx side, we do need to maintain a SACK table with one bit per
     entry - but only if packets go missing - and we don't want to have to
     perform a complex transformation to get the information into an ACK
     packet.

 (2) To try and move almost all processing of received packets out of the
     softirq handler and into a high-priority kernel I/O thread.  Only the
     transferral of packets would be left there.  I would still use the
     encap_rcv hook to receive packets as there's a noticeable performance
     drop from letting the UDP socket put the packets into its own queue
     and then getting them out of there.

 (3) To make the I/O thread also do all the transmission.  The app thread
     would be responsible for packaging the data into packets and then
     buffering them for the I/O thread to transmit.  This would make it
     easier for the app thread to run ahead of the I/O thread, and would
     mean the I/O thread is less likely to have to wait around for a new
     packet to come available for transmission.

 (4) To logically partition the socket/UAPI/KAPI side of things from the
     I/O side of things.  The local endpoint, connection, peer and call
     objects would belong to the I/O side.  The socket side would not then
     touch the private internals of calls and suchlike and would not change
     their states.  It would only look at the send queue, receive queue and
     a way to pass a message to cause an abort.

 (5) To remove as much locking, synchronisation, barriering and atomic ops
     as possible from the I/O side.  Exclusion would be achieved by
     limiting modification of state to the I/O thread only.  Locks would
     still need to be used in communication with the UDP socket and the
     AF_RXRPC socket API.

 (6) To provide crypto offload kernel threads that, when there's slack in
     the system, can see packets that need crypting and provide
     parallelisation in dealing with them.

 (7) To remove the use of system timers.  Since each timer would then send
     a poke to the I/O thread, which would then deal with it when it had
     the opportunity, there seems no point in using system timers if,
     instead, a list of timeouts can be sensibly consulted.  An I/O thread
     only then needs to schedule with a timeout when it is idle.

 (8) To use zero-copy sendmsg to send packets.  This would make use of the
     I/O thread being the sole transmitter on the socket to manage the
     dead-reckoning sequencing of the completion notifications.  There is a
     problem with zero-copy, though: the UDP socket doesn't handle running
     out of option memory very gracefully.

With regard to this first patchset, the changes made include:

 (1) Some fixes, including a fallback for proc_create_net_single_write(),
     setting ack.bufferSize to 0 in ACK packets and a fix for rxrpc
     congestion management, which shouldn't be saving the cwnd value
     between calls.

 (2) Improvements in rxrpc tracepoints, including splitting the timer
     tracepoint into a set-timer and a timer-expired trace.

 (3) Addition of a new proc file to display some stats.

 (4) Some code cleanups, including removing some unused bits and
     unnecessary header inclusions.

 (5) A change to the recently added UDP encap_err_rcv hook so that it has
     the same signature as {ip,ipv6}_icmp_error(), and then just have rxrpc
     point its UDP socket's hook directly at those.

 (6) Definition of a new struct, rxrpc_txbuf, that is used to hold
     transmissible packets of DATA and ACK type in a single 2KiB block
     rather than using an sk_buff.  This allows the buffer to be on a
     number of queues simultaneously more easily, and also guarantees that
     the entire block is in a single unit for zerocopy purposes and that
     the data payload is aligned for in-place crypto purposes.

 (7) ACK txbufs are allocated at proposal and queued for later transmission
     rather than being stored in a single place in the rxrpc_call struct,
     which means only a single ACK can be pending transmission at a time.
     The queue is then drained at various points.  This allows the ACK
     generation code to be simplified.

 (8) The Rx ring buffer is removed.  When a jumbo packet is received (which
     comprises a number of ordinary DATA packets glued together), it used
     to be pointed to by the ring multiple times, with an annotation in a
     side ring indicating which subpacket was in that slot - but this is no
     longer possible.  Instead, the packet is cloned once for each
     subpacket, barring the last, and the range of data is set in the skb
     private area.  This makes it easier for the subpackets in a jumbo
     packet to be decrypted in parallel.

 (9) The Tx ring buffer is removed.  The side annotation ring that held the
     SACK information is also removed.  Instead, in the event of packet
     loss, the SACK data attached an ACK packet is parsed.

(10) Allocate an skcipher request when needed in the rxkad security class
     rather than caching one in the rxrpc_call struct.  This deals with a
     race between externally-driven call disconnection getting rid of the
     skcipher request and sendmsg/recvmsg trying to use it because they
     haven't seen the completion yet.  This is also needed to support
     parallelisation as the skcipher request cannot be used by two or more
     threads simultaneously.

(11) Call udp_sendmsg() and udpv6_sendmsg() directly rather than going
     through kernel_sendmsg() so that we can provide our own iterator
     (zerocopy explicitly doesn't work with a KVEC iterator).  This also
     lets us avoid the overhead of the security hook.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-09 14:03:49 +00:00
Andy Ren
bd039b5ea2 net/core: Allow live renaming when an interface is up
Allow a network interface to be renamed when the interface
is up.

As described in the netconsole documentation [1], when netconsole is
used as a built-in, it will bring up the specified interface as soon as
possible. As a result, user space will not be able to rename the
interface since the kernel disallows renaming of interfaces that are
administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set
by the kernel.

The original solution [2] to this problem was to add a new parameter to
the netconsole configuration parameters that allows renaming of
the interface used by netconsole while it is administratively up.
However, during the discussion that followed, it became apparent that we
have no reason to keep the current restriction and instead we should
allow user space to rename interfaces regardless of their administrative
state:

1. The restriction was put in place over 20 years ago when renaming was
only possible via IOCTL and before rtnetlink started notifying user
space about such changes like it does today.

2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version
5.2 and no regressions were reported.

3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about
the administrative state of interface.

Therefore, allow user space to rename running interfaces by removing the
restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in
possible triage by emitting a message to the kernel log that an
interface was renamed while UP.

[1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst
[2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/

Signed-off-by: Andy Ren <andy.ren@getcruise.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-09 13:08:12 +00:00