linux

iv/linux

Author	SHA1	Message	Date
Jakub Kicinski	af046fd169	Merge branch 'for-uring-ubufops' into HEAD Pavel Begunkov says: ==================== implement io_uring notification (ubuf_info) stacking (net part) To have per request buffer notifications each zerocopy io_uring send request allocates a new ubuf_info. However, as an skb can carry only one uarg, it may force the stack to create many small skbs hurting performance in many ways. The patchset implements notification, i.e. an io_uring's ubuf_info extension, stacking. It attempts to link ubuf_info's into a list, allowing to have multiple of them per skb. liburing/examples/send-zerocopy shows up 6 times performance improvement for TCP with 4KB bytes per send, and levels it with MSG_ZEROCOPY. Without the patchset it requires much larger sends to utilise all potential. bytes \| before \| after (Kqps) 1200 \| 195 \| 1023 4000 \| 193 \| 1386 8000 \| 154 \| 1058 ==================== Link: https://lore.kernel.org/all/cover.1713369317.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-22 17:15:39 -07:00
Pavel Begunkov	65bada80de	net: add callback for setting a ubuf_info to skb At the moment an skb can only have one ubuf_info associated with it, which might be a performance problem for zerocopy sends in cases like TCP via io_uring. Add a callback for assigning ubuf_info to skb, this way we will implement smarter assignment later like linking ubuf_info together. Note, it's an optional callback, which should be compatible with skb_zcopy_set(), that's because the net stack might potentially decide to clone an skb and take another reference to ubuf_info whenever it wishes. Also, a correct implementation should always be able to bind to an skb without prior ubuf_info, otherwise we could end up in a situation when the send would not be able to progress. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-22 16:21:59 -07:00
Pavel Begunkov	7ab4f16f9e	net: extend ubuf_info callback to ops structure We'll need to associate additional callbacks with ubuf_info, introduce a structure holding ubuf_info callbacks. Apart from a more smarter io_uring notification management introduced in next patches, it can be used to generalise msg_zerocopy_put_abort() and also store ->sg_from_iter, which is currently passed in struct msghdr. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-22 16:21:35 -07:00
Thomas Weißschuh	bfa858f220	sysctl: treewide: constify ctl_table_header::ctl_table_arg To be able to constify instances of struct ctl_tables it is necessary to remove ways through which non-const versions are exposed from the sysctl core. One of these is the ctl_table_arg member of struct ctl_table_header. Constify this reference as a prerequisite for the full constification of struct ctl_table instances. No functional change. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-22 08:56:31 +01:00
Eric Dumazet	ba0f780694	neighbour: no longer hold RTNL in neigh_dump_info() neigh_dump_table() is already relying on RCU protection. pneigh_dump_table() is using its own protection (tbl->lock) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-19 12:39:20 +01:00
Eric Dumazet	7e4975f7e7	neighbour: fix neigh_dump_info() return value Change neigh_dump_table() and pneigh_dump_table() to either return 0 or -EMSGSIZE if not enough space was available in the skb. Then neigh_dump_info() can do the same. This allows NLMSG_DONE to be appended to the current skb at the end of a dump, saving a couple of recvmsg() system calls. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-19 12:39:20 +01:00
Eric Dumazet	f8f2eb9de6	neighbour: add RCU protection to neigh_tables[] In order to remove RTNL protection from neightbl_dump_info() and neigh_dump_info() later, we need to add RCU protection to neigh_tables[]. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-19 12:39:20 +01:00
Jason Xing	f7b60cce84	net: rps: locklessly access rflow->cpu This is the last member in struct rps_dev_flow which should be protected locklessly. So finish it. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-19 11:38:03 +01:00
Jason Xing	f00bf5dc83	net: rps: protect filter locklessly As we can see, rflow->filter can be written/read concurrently, so lockless access is needed. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-19 11:38:03 +01:00
Jason Xing	84b6823cd9	net: rps: protect last_qtail with rps_input_queue_tail_save() helper Removing one unnecessary reader protection and add another writer protection to finish the locklessly proctection job. Note: the removed READ_ONCE() is not needed because we only have to protect the locklessly reader in the different context (rps_may_expire_flow()). Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-19 11:38:03 +01:00
Jakub Kicinski	41e3ddb291	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: include/trace/events/rpcgss.h `386f4a7379` ("trace: events: cleanup deprecated strncpy uses") `a4833e3aba` ("SUNRPC: Fix rpcgss_context trace event acceptor field") Adjacent changes: drivers/net/ethernet/intel/ice/ice_tc_lib.c `2cca35f5dd` ("ice: Fix checking for unsupported keys on non-tunnel device") `784feaa65d` ("ice: Add support for PFCP hardware offload in switchdev") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-18 13:12:24 -07:00
Zheng Li	eabf425bc6	neighbour: guarantee the localhost connections be established successfully even the ARP table is full Inter-process communication on localhost should be established successfully even the ARP table is full, many processes on server machine use the localhost to communicate such as command-line interface (CLI), servers hope all CLI commands can be executed successfully even the arp table is full. Right now CLI commands got timeout when the arp table is full. Set the parameter of exempt_from_gc to be true for LOOPBACK net device to keep localhost neigh in arp table, not removed by gc. the steps of reproduced: server with "gc_thresh3 = 1024" setting, ping server from more than 1024 same netmask Lan IPv4 addresses, run "ssh localhost" on console interface, then the command will get timeout. Signed-off-by: Zheng Li <James.Z.Li@Dell.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240416095343.540-1-lizheng043@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-04-18 12:01:03 +02:00
Eric Dumazet	1514b06aff	netns: no longer hold RTNL in rtnl_net_dumpid() - rtnl_net_dumpid() is already fully RCU protected, RTNL is not needed there. - Fix return value at the end of a dump, so that NLMSG_DONE can be appended to current skb, saving one recvmsg() system call. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/20240416140739.967941-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-17 18:30:28 -07:00
Eric Dumazet	0f022d32c3	net/sched: Fix mirred deadlock on device recursion When the mirred action is used on a classful egress qdisc and a packet is mirrored or redirected to self we hit a qdisc lock deadlock. See trace below. [..... other info removed for brevity....] [ 82.890906] [ 82.890906] ============================================ [ 82.890906] WARNING: possible recursive locking detected [ 82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G W [ 82.890906] -------------------------------------------- [ 82.890906] ping/418 is trying to acquire lock: [ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: __dev_queue_xmit+0x1778/0x3550 [ 82.890906] [ 82.890906] but task is already holding lock: [ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: __dev_queue_xmit+0x1778/0x3550 [ 82.890906] [ 82.890906] other info that might help us debug this: [ 82.890906] Possible unsafe locking scenario: [ 82.890906] [ 82.890906] CPU0 [ 82.890906] ---- [ 82.890906] lock(&sch->q.lock); [ 82.890906] lock(&sch->q.lock); [ 82.890906] [ 82.890906] * DEADLOCK * [ 82.890906] [..... other info removed for brevity....] Example setup (eth0->eth0) to recreate tc qdisc add dev eth0 root handle 1: htb default 30 tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \ action mirred egress redirect dev eth0 Another example(eth0->eth1->eth0) to recreate tc qdisc add dev eth0 root handle 1: htb default 30 tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \ action mirred egress redirect dev eth1 tc qdisc add dev eth1 root handle 1: htb default 30 tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \ action mirred egress redirect dev eth0 We fix this by adding an owner field (CPU id) to struct Qdisc set after root qdisc is entered. When the softirq enters it a second time, if the qdisc owner is the same CPU, the packet is dropped to break the loop. Reported-by: Mingshuai Ren <renmingshuai@huawei.com> Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/ Fixes: `3bcb846ca4` ("net: get rid of spin_trylock() in net_tx_action()") Fixes: `e578d9c025` ("net: sched: use counter to break reclassify loops") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240415210728.36949-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-17 18:22:52 -07:00
Heiner Kallweit	9382b4f338	net: constify net_class AFAICS all users of net_class take a const struct class * argument. Therefore fully constify net_class. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-15 10:46:01 +01:00
Jason Xing	4d0470b9ad	net: save some cycles when doing skb_attempt_defer_free() Normally, we don't face these two exceptions very often meanwhile we have some chance to meet the condition where the current cpu id is the same as skb->alloc_cpu. One simple test that can help us see the frequency of this statement 'cpu == raw_smp_processor_id()': 1. running iperf -s and iperf -c [ip] -P [MAX CPU] 2. using BPF to capture skb_attempt_defer_free() I can see around 4% chance that happens to satisfy the statement. So moving this statement at the beginning can save some cycles in most cases. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-15 10:34:59 +01:00
Jakub Kicinski	3db3b62955	net: dev_addr_lists: move locking out of init/exit in kunit We lock and unlock rtnl in init/exit for convenience, but it started causing problems if the exit is handled by a different thread. To avoid having to futz with disabling locking assertions move the locking into the test cases. We don't use ASSERTs so it should be safe. ============= dev-addr-list-test (6 subtests) ============== [PASSED] dev_addr_test_basic [PASSED] dev_addr_test_sync_one [PASSED] dev_addr_test_add_del [PASSED] dev_addr_test_del_main [PASSED] dev_addr_test_add_set [PASSED] dev_addr_test_add_excl =============== [PASSED] dev-addr-list-test ================ Link: https://lore.kernel.org/all/20240403131936.787234-7-linux@roeck-us.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-15 10:26:35 +01:00
Wander Lairson Costa	f1e197a665	drop_monitor: replace spin_lock by raw_spin_lock trace_drop_common() is called with preemption disabled, and it acquires a spin_lock. This is problematic for RT kernels because spin_locks are sleeping locks in this configuration, which causes the following splat: BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 449, name: rcuc/47 preempt_count: 1, expected: 0 RCU nest depth: 2, expected: 2 5 locks held by rcuc/47/449: #0: ff1100086ec30a60 ((softirq_ctrl.lock)){+.+.}-{2:2}, at: __local_bh_disable_ip+0x105/0x210 #1: ffffffffb394a280 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0xbf/0x130 #2: ffffffffb394a280 (rcu_read_lock){....}-{1:2}, at: __local_bh_disable_ip+0x11c/0x210 #3: ffffffffb394a160 (rcu_callback){....}-{0:0}, at: rcu_do_batch+0x360/0xc70 #4: ff1100086ee07520 (&data->lock){+.+.}-{2:2}, at: trace_drop_common.constprop.0+0xb5/0x290 irq event stamp: 139909 hardirqs last enabled at (139908): [<ffffffffb1df2b33>] _raw_spin_unlock_irqrestore+0x63/0x80 hardirqs last disabled at (139909): [<ffffffffb19bd03d>] trace_drop_common.constprop.0+0x26d/0x290 softirqs last enabled at (139892): [<ffffffffb07a1083>] __local_bh_enable_ip+0x103/0x170 softirqs last disabled at (139898): [<ffffffffb0909b33>] rcu_cpu_kthread+0x93/0x1f0 Preemption disabled at: [<ffffffffb1de786b>] rt_mutex_slowunlock+0xab/0x2e0 CPU: 47 PID: 449 Comm: rcuc/47 Not tainted 6.9.0-rc2-rt1+ #7 Hardware name: Dell Inc. PowerEdge R650/0Y2G81, BIOS 1.6.5 04/15/2022 Call Trace: <TASK> dump_stack_lvl+0x8c/0xd0 dump_stack+0x14/0x20 __might_resched+0x21e/0x2f0 rt_spin_lock+0x5e/0x130 ? trace_drop_common.constprop.0+0xb5/0x290 ? skb_queue_purge_reason.part.0+0x1bf/0x230 trace_drop_common.constprop.0+0xb5/0x290 ? preempt_count_sub+0x1c/0xd0 ? _raw_spin_unlock_irqrestore+0x4a/0x80 ? __pfx_trace_drop_common.constprop.0+0x10/0x10 ? rt_mutex_slowunlock+0x26a/0x2e0 ? skb_queue_purge_reason.part.0+0x1bf/0x230 ? __pfx_rt_mutex_slowunlock+0x10/0x10 ? skb_queue_purge_reason.part.0+0x1bf/0x230 trace_kfree_skb_hit+0x15/0x20 trace_kfree_skb+0xe9/0x150 kfree_skb_reason+0x7b/0x110 skb_queue_purge_reason.part.0+0x1bf/0x230 ? __pfx_skb_queue_purge_reason.part.0+0x10/0x10 ? mark_lock.part.0+0x8a/0x520 ... trace_drop_common() also disables interrupts, but this is a minor issue because we could easily replace it with a local_lock. Replace the spin_lock with raw_spin_lock to avoid sleeping in atomic context. Signed-off-by: Wander Lairson Costa <wander@redhat.com> Reported-by: Hu Chunyu <chuhu@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-15 09:54:15 +01:00
Eric Dumazet	32affa5578	fib: rules: no longer hold RTNL in fib_nl_dumprule() - fib rules are already RCU protected, RTNL is not needed to get them. - Fix return value at the end of a dump, so that NLMSG_DONE can be appended to current skb, saving one recvmsg() system call. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240411133340.1332796-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-12 19:09:31 -07:00
Mina Almasry	a580ea994f	net: mirror skb frag ref/unref helpers Refactor some of the skb frag ref/unref helpers for improved clarity. Implement napi_pp_get_page() to be the mirror counterpart of napi_pp_put_page(). Implement skb_page_ref() to be the mirror of skb_page_unref(). Improve __skb_frag_ref() to become a mirror counterpart of __skb_frag_unref(). Previously unref could handle pp & non-pp pages, while the ref could only handle non-pp pages. Now both the ref & unref helpers can correctly handle both pp & non-pp pages. Now that __skb_frag_ref() can handle both pp & non-pp pages, remove skb_pp_frag_ref(), and use __skb_frag_ref() instead. This lets us remove pp specific handling from skb_try_coalesce. Additionally, since __skb_frag_ref() can now handle both pp & non-pp pages, a latent issue in skb_shift() should now be fixed. Previously this function would do a non-pp ref & pp unref on potential pp frags (fragfrom). After this patch, skb_shift() should correctly do a pp ref/unref on pp frags. Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240410190505.1225848-3-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-11 19:29:23 -07:00
Mina Almasry	f6d827b180	net: move skb ref helpers to new header Add a new header, linux/skbuff_ref.h, which contains all the skb_*_ref() helpers. Many of the consumers of skbuff.h do not actually use any of the skb ref helpers, and we can speed up compilation a bit by minimizing this header file. Additionally in the later patch in the series we add page_pool support to skb_frag_ref(), which requires some page_pool dependencies. We can now add these dependencies to skbuff_ref.h instead of a very ubiquitous skbuff.h Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://lore.kernel.org/r/20240410190505.1225848-2-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-11 19:29:22 -07:00
Pavel Begunkov	d8415a165c	net: use SKB_CONSUMED in skb_attempt_defer_free() skb_attempt_defer_free() is used to free already processed skbs, so pass SKB_CONSUMED as the reason in kfree_skb_napi_cache(). Suggested-by: Jason Xing <kerneljasonxing@gmail.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/bcf5dbdda79688b074ab7ae2238535840a6d3fc2.1712711977.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-10 19:27:32 -07:00
Pavel Begunkov	7cb31c46b9	net: cache for same cpu skb_attempt_defer_free Optimise skb_attempt_defer_free() when run by the same CPU the skb was allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can disable softirqs and put the buffer into cpu local caches. CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1% throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles, the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note, I'd expect the win doubled with rx only benchmarks, as the optimisation is for the receive path, but the test spends >55% of CPU doing writes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/a887463fb219d973ec5ad275e31194812571f1f5.1712711977.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-10 19:27:32 -07:00
Mina Almasry	f58f3c9563	net: remove napi_frag_unref With the changes in the last patches, napi_frag_unref() is now reduandant. Remove it and use skb_page_unref directly. Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20240408153000.2152844-4-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-09 18:20:32 -07:00
Eric Dumazet	4308811ba9	net: display more skb fields in skb_dump() Print these additional fields in skb_dump() to ease debugging. - mac_len - csum_start (in v2, at Willem suggestion) - csum_offset (in v2, at Willem suggestion) - priority - mark - alloc_cpu - vlan_all - encapsulation - inner_protocol - inner_mac_header - inner_network_header - inner_transport_header Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-08 14:06:31 +01:00
Eric Dumazet	db77cdc696	net: dqs: use sysfs_emit() in favor of sprintf() Commit `6025b9135f` ("net: dqs: add NIC stall detector based on BQL") added three sysfs files. Use the recommended sysfs_emit() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Breno Leitao <leitao@debian.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-08 11:04:08 +01:00
Maxime Chevallier	6916e461e7	net: phy: Introduce ethernet link topology representation Link topologies containing multiple network PHYs attached to the same net_device can be found when using a PHY as a media converter for use with an SFP connector, on which an SFP transceiver containing a PHY can be used. With the current model, the transceiver's PHY can't be used for operations such as cable testing, timestamping, macsec offload, etc. The reason being that most of the logic for these configuration, coming from either ethtool netlink or ioctls tend to use netdev->phydev, which in multi-phy systems will reference the PHY closest to the MAC. Introduce a numbering scheme allowing to enumerate PHY devices that belong to any netdev, which can in turn allow userspace to take more precise decisions with regard to each PHY's configuration. The numbering is maintained per-netdev, in a phy_device_list. The numbering works similarly to a netdevice's ifindex, with identifiers that are only recycled once INT_MAX has been reached. This prevents races that could occur between PHY listing and SFP transceiver removal/insertion. The identifiers are assigned at phy_attach time, as the numbering depends on the netdevice the phy is attached to. The PHY index can be re-used for PHYs that are persistent. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-06 18:25:14 +01:00
Jakub Kicinski	9f06f87fef	net: skbuff: generalize the skb->decrypted bit The ->decrypted bit can be reused for other crypto protocols. Remove the direct dependency on TLS, add helpers to clean up the ifdefs leaking out everywhere. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-06 17:34:31 +01:00
Jakub Kicinski	cf1ca1f66d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/ip_gre.c `17af420545` ("erspan: make sure erspan_base_hdr is present in skb->head") `5832c4a77d` ("ip_tunnel: convert __be16 tunnel flags to bitmaps") https://lore.kernel.org/all/20240402103253.3b54a1cf@canb.auug.org.au/ Adjacent changes: net/ipv6/ip6_fib.c `d21d40605b` ("ipv6: Fix infinite recursion in fib6_dump_done().") `5fc68320c1` ("ipv6: remove RTNL protection from inet6_dump_fib()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-04 18:01:07 -07:00
Jakub Kicinski	1cfa2f10f4	bpf-for-netdev -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZg7vEAAKCRDbK58LschI gxAeAQD18uHqGbcCCnOVnaETdi3bQcSBpobuclThpN1WdMILcwEA+5Vz9Lxmt6BH qF/vbVHAOeCZt3LTOJ/jx1GRVstVWQk= =+QKv -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-04-04 We've added 7 non-merge commits during the last 5 day(s) which contain a total of 9 files changed, 75 insertions(+), 24 deletions(-). The main changes are: 1) Fix x86 BPF JIT under retbleed=stuff which causes kernel panics due to incorrect destination IP calculation and incorrect IP for relocations, from Uros Bizjak and Joan Bruguera Micó. 2) Fix BPF arena file descriptor leaks in the verifier, from Anton Protopopov. 3) Defer bpf_link deallocation to after RCU grace period as currently running multi-{kprobes,uprobes} programs might still access cookie information from the link, from Andrii Nakryiko. 4) Fix a BPF sockmap lock inversion deadlock in map_delete_elem reported by syzkaller, from Jakub Sitnicki. 5) Fix resolve_btfids build with musl libc due to missing linux/types.h include, from Natanael Copa. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, sockmap: Prevent lock inversion deadlock in map delete elem x86/bpf: Fix IP for relocating call depth accounting x86/bpf: Fix IP after emitting call depth accounting bpf: fix possible file descriptor leaks in verifier tools/resolve_btfids: fix build with musl libc bpf: support deferring bpf_link dealloc to after RCU grace period bpf: put uprobe link's path and task in release callback ==================== Link: https://lore.kernel.org/r/20240404183258.4401-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-04 11:37:39 -07:00
Marcelo Tosatti	2f3c7195a7	net: enable timestamp static key if CPU For systems that use CPU isolation (via nohz_full), creating or destroying a socket with SO_TIMESTAMP, SO_TIMESTAMPNS or SO_TIMESTAMPING with flag SOF_TIMESTAMPING_RX_SOFTWARE will cause a static key to be enabled/disabled. This in turn causes undesired IPIs to isolated CPUs. So enable the static key unconditionally, if CPU isolation is enabled, thus avoiding the IPIs. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/ZgrUiLLtbEUf9SFn@tpad Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-03 19:14:53 -07:00
Alexander Lobakin	39806b96c8	page_pool: try direct bulk recycling Now that the checks for direct recycling possibility live inside the Page Pool core, reuse them when performing bulk recycling. page_pool_put_page_bulk() can be called from process context as well, page_pool_napi_local() takes care of this at the very beginning. Under high .ndo_xdp_xmit() traffic load, the win is 2-3% Pps assuming the sending driver uses xdp_return_frame_bulk() on Tx completion. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20240329165507.3240110-3-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-02 18:13:49 -07:00
Alexander Lobakin	4a96a4e807	page_pool: check for PP direct cache locality later Since we have pool->p.napi (Jakub) and pool->cpuid (Lorenzo) to check whether it's safe to use direct recycling, we can use both globally for each page instead of relying solely on @allow_direct argument. Let's assume that @allow_direct means "I'm sure it's local, don't waste time rechecking this" and when it's false, try the mentioned params to still recycle the page directly. If neither is true, we'll lose some CPU cycles, but then it surely won't be hotpath. On the other hand, paths where it's possible to use direct cache, but not possible to safely set @allow_direct, will benefit from this move. The whole propagation of @napi_safe through a dozen of skb freeing functions can now go away, which saves us some stack space. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20240329165507.3240110-2-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-02 18:13:49 -07:00
Jakub Sitnicki	ff91059932	bpf, sockmap: Prevent lock inversion deadlock in map delete elem syzkaller started using corpuses where a BPF tracing program deletes elements from a sockmap/sockhash map. Because BPF tracing programs can be invoked from any interrupt context, locks taken during a map_delete_elem operation must be hardirq-safe. Otherwise a deadlock due to lock inversion is possible, as reported by lockdep: CPU0 CPU1 ---- ---- lock(&htab->buckets[i].lock); local_irq_disable(); lock(&host->lock); lock(&htab->buckets[i].lock); <Interrupt> lock(&host->lock); Locks in sockmap are hardirq-unsafe by design. We expects elements to be deleted from sockmap/sockhash only in task (normal) context with interrupts enabled, or in softirq context. Detect when map_delete_elem operation is invoked from a context which is _not_ hardirq-unsafe, that is interrupts are disabled, and bail out with an error. Note that map updates are not affected by this issue. BPF verifier does not allow updating sockmap/sockhash from a BPF tracing program today. Fixes: `604326b41a` ("bpf, sockmap: convert to generic sk_msg interface") Reported-by: xingwei lee <xrivendell7@gmail.com> Reported-by: yue sun <samsun1006219@gmail.com> Reported-by: syzbot+bc922f476bd65abbd466@syzkaller.appspotmail.com Reported-by: syzbot+d4066896495db380182e@syzkaller.appspotmail.com Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: syzbot+d4066896495db380182e@syzkaller.appspotmail.com Acked-by: John Fastabend <john.fastabend@gmail.com> Closes: https://syzkaller.appspot.com/bug?extid=d4066896495db380182e Closes: https://syzkaller.appspot.com/bug?extid=bc922f476bd65abbd466 Link: https://lore.kernel.org/bpf/20240402104621.1050319-1-jakub@cloudflare.com	2024-04-02 16:31:05 +02:00
Eric Dumazet	c62fdf5b11	net: rps: add rps_input_queue_head_add() helper process_backlog() can batch increments of sd->input_queue_head, saving some memory bandwidth. Also add READ_ONCE()/WRITE_ONCE() annotations around sd->input_queue_head accesses. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 11:28:32 +01:00
Eric Dumazet	36b83ffcf2	net: rps: change input_queue_tail_incr_save() input_queue_tail_incr_save() is incrementing the sd queue_tail and save it in the flow last_qtail. Two issues here : - no lock protects the write on last_qtail, we should use appropriate annotations. - We can perform this write after releasing the per-cpu backlog lock, to decrease this lock hold duration (move away the cache line miss) Also move input_queue_head_incr() and rps helpers to include/net/rps.h, while adding rps_ prefix to better reflect their role. v2: Fixed a build issue (Jakub and kernel build bots) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 11:28:32 +01:00
Eric Dumazet	f7efd01fe2	net: enqueue_to_backlog() cleanup We can remove a goto and a label by reversing a condition. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 11:28:32 +01:00
Eric Dumazet	a7ae7b0b2e	net: make softnet_data.dropped an atomic_t If under extreme cpu backlog pressure enqueue_to_backlog() has to drop a packet, it could do this without dirtying a cache line and potentially slowing down the target cpu. Move sd->dropped into a separate cache line, and make it atomic. In non pressure mode, this field is not touched, no need to consume valuable space in a hot cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 11:28:32 +01:00
Eric Dumazet	95e48d862a	net: enqueue_to_backlog() change vs not running device If the device attached to the packet given to enqueue_to_backlog() is not running, we drop the packet. But we accidentally increase sd->dropped, giving false signals to admins: sd->dropped should be reserved to cpu backlog pressure, not to temporary glitches at device dismantles. While we are at it, perform the netif_running() test before we get the rps lock, and use REASON_DEV_READY drop reason instead of NOT_SPECIFIED. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 11:28:31 +01:00
Eric Dumazet	2fe50a4d72	net: move dev_xmit_recursion() helpers to net/core/dev.h Move dev_xmit_recursion() and friends to net/core/dev.h They are only used from net/core/dev.c and net/core/filter.c. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 11:28:31 +01:00
Eric Dumazet	b9495b564d	net: move kick_defer_list_purge() to net/core/dev.h kick_defer_list_purge() is defined in net/core/dev.c and used from net/core/skubff.c Because we need softnet_data, include <linux/netdevice.h> from net/core/dev.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 11:28:31 +01:00
Michal Swiatkowski	6dd514f481	pfcp: always set pfcp metadata In PFCP receive path set metadata needed by flower code to do correct classification based on this metadata. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 10:49:28 +01:00
Alexander Lobakin	5b2be2ab76	net: net_test: add tests for IP tunnel flags conversion helpers Now that there are helpers for converting IP tunnel flags between the old __be16 format and the bitmap format, make sure they work as expected by adding a couple of tests to the networking testing suite. The helpers are all inline, so no dependencies on the related CONFIG_* (or a standalone module) are needed. Cover three possible cases: 1. No bits past BIT(15) are set, VTI/SIT bits are not set. This conversion is almost a direct assignment. 2. No bits past BIT(15) are set, but VTI/SIT bit is set. During the conversion, it must be transformed into BIT(16) in the bitmap, but still compatible with the __be16 format. 3. The bitmap has bits past BIT(15) set (not the VTI/SIT one). The result will be truncated. Note that currently __IP_TUNNEL_FLAG_NUM is 17 (incl. special), which means that the result of this case is currently semi-false-positive. When BIT(17) is finally here, it will be adjusted accordingly. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 10:49:28 +01:00
Alexander Lobakin	5832c4a77d	ip_tunnel: convert __be16 tunnel flags to bitmaps Historically, tunnel flags like TUNNEL_CSUM or TUNNEL_ERSPAN_OPT have been defined as __be16. Now all of those 16 bits are occupied and there's no more free space for new flags. It can't be simply switched to a bigger container with no adjustments to the values, since it's an explicit Endian storage, and on LE systems (__be16)0x0001 equals to (__be64)0x0001000000000000. We could probably define new 64-bit flags depending on the Endianness, i.e. (__be64)0x0001 on BE and (__be64)0x00010000... on LE, but that would introduce an Endianness dependency and spawn a ton of Sparse warnings. To mitigate them, all of those places which were adjusted with this change would be touched anyway, so why not define stuff properly if there's no choice. Define IP_TUNNEL__BIT counterparts as a bit number instead of the value already coded and a fistful of <16 <-> bitmap> converters and helpers. The two flags which have a different bit position are SIT_ISATAP_BIT and VTI_ISVTI_BIT, as they were defined not as __cpu_to_be16(), but as (__force __be16), i.e. had different positions on LE and BE. Now they both have strongly defined places. Change all __be16 fields which were used to store those flags, to IP_TUNNEL_DECLARE_FLAGS() -> DECLARE_BITMAP(__IP_TUNNEL_FLAG_NUM) -> unsigned long[1] for now, and replace all TUNNEL_ occurrences to their bitmap counterparts. Use the converters in the places which talk to the userspace, hardware (NFP) or other hosts (GRE header). The rest must explicitly use the new flags only. This must be done at once, otherwise there will be too many conversions throughout the code in the intermediate commits. Finally, disable the old __be16 flags for use in the kernel code (except for the two 'irregular' flags mentioned above), to prevent any accidental (mis)use of them. For the userspace, nothing is changed, only additions were made. Most noticeable bloat-o-meter difference (.text): vmlinux: 307/-1 (306) gre.ko: 62/0 (62) ip_gre.ko: 941/-217 (724) [] ip_tunnel.ko: 390/-900 (-510) [] ip_vti.ko: 138/0 (138) ip6_gre.ko: 534/-18 (516) [] ip6_tunnel.ko: 118/-10 (108) [] gre_flags_to_tnl_flags() grew, but still is inlined [] ip_tunnel_find() got uninlined, hence such decrease The average code size increase in non-extreme case is 100-200 bytes per module, mostly due to sizeof(long) > sizeof(__be16), as %__IP_TUNNEL_FLAG_NUM is less than %BITS_PER_LONG and the compilers are able to expand the majority of bitmap_() calls here into direct operations on scalars. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-04-01 10:49:28 +01:00
Johannes Berg	e8058a49e6	netlink: introduce type-checking attribute iteration There are, especially with multi-attr arrays, many cases of needing to iterate all attributes of a specific type in a netlink message or a nested attribute. Add specific macros to support that case. Also convert many instances using this spatch: @@ iterator nla_for_each_attr; iterator name nla_for_each_attr_type; identifier nla; expression head, len, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_attr(nla, head, len, rem) +nla_for_each_attr_type(nla, ATTR, head, len, rem) { <... T x; ...> -if (nla_type(nla) == ATTR) { ... -} } @@ identifier nla; iterator nla_for_each_nested; iterator name nla_for_each_nested_type; expression attr, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_nested(nla, attr, rem) +nla_for_each_nested_type(nla, ATTR, attr, rem) { <... T x; ...> -if (nla_type(nla) == ATTR) { ... -} } @@ iterator nla_for_each_attr; iterator name nla_for_each_attr_type; identifier nla; expression head, len, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_attr(nla, head, len, rem) +nla_for_each_attr_type(nla, ATTR, head, len, rem) { <... T x; ...> -if (nla_type(nla) != ATTR) continue; ... } @@ identifier nla; iterator nla_for_each_nested; iterator name nla_for_each_nested_type; expression attr, rem; expression ATTR; type T; identifier x; @@ -nla_for_each_nested(nla, attr, rem) +nla_for_each_nested_type(nla, ATTR, attr, rem) { <... T x; ...> -if (nla_type(nla) != ATTR) continue; ... } Although I had to undo one bad change this made, and I also adjusted some other code for whitespace and to use direct variable initialization now. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20240328203144.b5a6c895fb80.I1869b44767379f204998ff44dd239803f39c23e0@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-03-29 15:06:02 -07:00
Eric Dumazet	1abe267f17	net: add sk_wake_async_rcu() helper While looking at UDP receive performance, I saw sk_wake_async() was no longer inlined. This matters at least on AMD Zen1-4 platforms (see SRSO) This might be because rcu_read_lock() and rcu_read_unlock() are no longer nops in recent kernels ? Add sk_wake_async_rcu() variant, which must be called from contexts already holding rcu lock. As SOCK_FASYNC is deprecated in modern days, use unlikely() to give a hint to the compiler. sk_wake_async_rcu() is properly inlined from __udp_enqueue_schedule_skb() and sock_def_readable(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240328144032.1864988-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-03-29 15:03:11 -07:00
Eric Dumazet	5086f0fe46	net: do not consume a cacheline for system_page_pool There is no reason to consume a full cacheline to store system_page_pool. We can eventually move it to softnet_data later for full locality control. Fixes: `2b0cfa6e49` ("net: add generic percpu page_pool allocator") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lorenzo Bianconi <lorenzo@kernel.org> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20240328173448.2262593-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-03-29 12:27:05 -07:00
Kuniyuki Iwashima	42f298c06b	af_unix: Link struct unix_edge when queuing skb. Just before queuing skb with inflight fds, we call scm_stat_add(), which is a good place to set up the preallocated struct unix_vertex and struct unix_edge in UNIXCB(skb).fp. Then, we call unix_add_edges() and construct the directed graph as follows: 1. Set the inflight socket's unix_sock to unix_edge.predecessor. 2. Set the receiver's unix_sock to unix_edge.successor. 3. Set the preallocated vertex to inflight socket's unix_sock.vertex. 4. Link inflight socket's unix_vertex.entry to unix_unvisited_vertices. 5. Link unix_edge.vertex_entry to the inflight socket's unix_vertex.edges. Let's say we pass the fd of AF_UNIX socket A to B and the fd of B to C. The graph looks like this: +-------------------------+ \| unix_unvisited_vertices \| <-------------------------. +-------------------------+ \| + \| \| +--------------+ +--------------+ \| +--------------+ \| \| unix_sock A \| <---. .---> \| unix_sock B \| <-\|-. .---> \| unix_sock C \| \| +--------------+ \| \| +--------------+ \| \| \| +--------------+ \| .-+ \| vertex \| \| \| .-+ \| vertex \| \| \| \| \| vertex \| \| \| +--------------+ \| \| \| +--------------+ \| \| \| +--------------+ \| \| \| \| \| \| \| \| \| \| +--------------+ \| \| \| +--------------+ \| \| \| \| '-> \| unix_vertex \| \| \| '-> \| unix_vertex \| \| \| \| \| +--------------+ \| \| +--------------+ \| \| \| `---> \| entry \| +---------> \| entry \| +-' \| \| \|--------------\| \| \| \|--------------\| \| \| \| edges \| <-. \| \| \| edges \| <-. \| \| +--------------+ \| \| \| +--------------+ \| \| \| \| \| \| \| \| \| .----------------------' \| \| .----------------------' \| \| \| \| \| \| \| \| \| +--------------+ \| \| \| +--------------+ \| \| \| \| unix_edge \| \| \| \| \| unix_edge \| \| \| \| +--------------+ \| \| \| +--------------+ \| \| `-> \| vertex_entry \| \| \| `-> \| vertex_entry \| \| \| \|--------------\| \| \| \|--------------\| \| \| \| predecessor \| +---' \| \| predecessor \| +---' \| \|--------------\| \| \|--------------\| \| \| successor \| +-----' \| successor \| +-----' +--------------+ +--------------+ Henceforth, we denote such a graph as A -> B (-> C). Now, we can express all inflight fd graphs that do not contain embryo sockets. We will support the particular case later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-03-29 08:27:50 -07:00
Kuniyuki Iwashima	29b64e3540	af_unix: Allocate struct unix_edge for each inflight AF_UNIX fd. As with the previous patch, we preallocate to skb's scm_fp_list an array of struct unix_edge in the number of inflight AF_UNIX fds. There we just preallocate memory and do not use immediately because sendmsg() could fail after this point. The actual use will be in the next patch. When we queue skb with inflight edges, we will set the inflight socket's unix_sock as unix_edge->predecessor and the receiver's unix_sock as successor, and then we will link the edge to the inflight socket's unix_vertex.edges. Note that we set NULL to cloned scm_fp_list.edges in scm_fp_dup() so that MSG_PEEK does not change the shape of the directed graph. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-03-29 08:27:44 -07:00
Kuniyuki Iwashima	1fbfdfaa59	af_unix: Allocate struct unix_vertex for each inflight AF_UNIX fd. We will replace the garbage collection algorithm for AF_UNIX, where we will consider each inflight AF_UNIX socket as a vertex and its file descriptor as an edge in a directed graph. This patch introduces a new struct unix_vertex representing a vertex in the graph and adds its pointer to struct unix_sock. When we send a fd using the SCM_RIGHTS message, we allocate struct scm_fp_list to struct scm_cookie in scm_fp_copy(). Then, we bump each refcount of the inflight fds' struct file and save them in scm_fp_list.fp. After that, unix_attach_fds() inexplicably clones scm_fp_list of scm_cookie and sets it to skb. (We will remove this part after replacing GC.) Here, we add a new function call in unix_attach_fds() to preallocate struct unix_vertex per inflight AF_UNIX fd and link each vertex to skb's scm_fp_list.vertices. When sendmsg() succeeds later, if the socket of the inflight fd is still not inflight yet, we will set the preallocated vertex to struct unix_sock.vertex and link it to a global list unix_unvisited_vertices under spin_lock(&unix_gc_lock). If the socket is already inflight, we free the preallocated vertex. This is to avoid taking the lock unnecessarily when sendmsg() could fail later. In the following patch, we will similarly allocate another struct per edge, which will finally be linked to the inflight socket's unix_vertex.edges. And then, we will count the number of edges as unix_vertex.out_degree. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20240325202425.60930-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-03-29 08:27:33 -07:00

1 2 3 4 5 ...

9260 Commits