iv/linux

Author	SHA1	Message	Date
Eric Dumazet	1b3ef46cb7	net: remove dev_base_lock dev_base_lock is not needed anymore, all remaining users also hold RTNL. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:14 +00:00
Eric Dumazet	e51b962438	net: remove dev_base_lock from register_netdevice() and friends. RTNL already protects writes to dev->reg_state, we no longer need to hold dev_base_lock to protect the readers. unlist_netdevice() second argument can be removed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:14 +00:00
Eric Dumazet	2dd4d828d6	net: remove dev_base_lock from do_setlink() We hold RTNL here, and dev->link_mode readers already are using READ_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:14 +00:00
Eric Dumazet	6a2968ee1e	net: add netdev_set_operstate() helper dev_base_lock is going away, add netdev_set_operstate() helper so that hsr does not have to know core internals. Remove dev_base_lock acquisition from rfc2863_policy() v3: use an "unsigned int" for dev->operstate, so that try_cmpxchg() can work on all arches. ( https://lore.kernel.org/oe-kbuild-all/202402081918.OLyGaea3-lkp@intel.com/ ) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:13 +00:00
Eric Dumazet	e154bb7a6e	net-sysfs: convert netstat_show() to RCU dev_get_stats() can be called from RCU, there is no need to acquire dev_base_lock. Change dev_isalive() comment to reflect we no longer use dev_base_lock from net/core/net-sysfs.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:13 +00:00
Eric Dumazet	004d138364	net-sysfs: convert dev->operstate reads to lockless ones operstate_show() can omit dev_base_lock acquisition only to read dev->operstate. Annotate accesses to dev->operstate. Writers still acquire dev_base_lock for mutual exclusion. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:13 +00:00
Eric Dumazet	c7d52737e7	net-sysfs: use dev_addr_sem to remove races in address_show() Using dev_base_lock does not prevent readers from seeing garbage. Use dev_addr_sem instead. v4: place dev_addr_sem extern in net/core/dev.h (Jakub Kicinski) Link: https://lore.kernel.org/netdev/20240212175845.10f6680a@kernel.org/ Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:13 +00:00
Eric Dumazet	12692e3df2	net-sysfs: convert netdev_show() to RCU Make clear dev_isalive() can be called with RCU protection. Then convert netdev_show() to RCU, to remove dev_base_lock dependency. Also add RCU to broadcast_show(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:13 +00:00
Eric Dumazet	4d42b37def	net: convert dev->reg_state to u8 Prepares things so that dev->reg_state reads can be lockless, by adding WRITE_ONCE() on write side. READ_ONCE()/WRITE_ONCE() do not support bitfields. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:13 +00:00
Eric Dumazet	a6473fe9b6	dev: annotate accesses to dev->link Following patch will read dev->link locklessly, annotate the write from do_setlink(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:13 +00:00
Eric Dumazet	f694eee9e1	ip_tunnel: annotate data-races around t->parms.link t->parms.link is read locklessly, annotate these reads and opposite writes accordingly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:13 +00:00
Eric Dumazet	1c07dbb0cc	net: annotate data-races around dev->name_assign_type name_assign_type_show() runs locklessly, we should annotate accesses to dev->name_assign_type. Alternative would be to grab devnet_rename_sem semaphore from name_assign_type_show(), but this would not bring more accuracy. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 11:20:13 +00:00
David S. Miller	e1a00373e1	linux-can-next-for-6.9-20240213 Merge tag 'linux-can-next-for-6.9-20240213' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== linux-can-next-for-6.9-20240213 this is a pull request of 23 patches for net-next/master. The first patch is by Nicolas Maier and targets the CAN Broadcast Manager (bcm), it adds message flags to distinguish between own, local and remote traffic. Oliver Hartkopp contributes a patch for the CAN ISOTP protocol that adds dynamic flow control parameters. Stefan Mätje's patch series adds support for the esd PCIe/402 CAN interface family. Markus Schneider-Pargmann contributes 14 patches for the m_can to optimize for the SPI attached tcan4x5x controller. A patch by Vincent Mailhol replaces Wolfgang Grandegger by Vincent Mailhol as the CAN drivers Co-Maintainer. Jimmy Assarsson's patch adds support for the Kvaser M.2 PCIe 4xCAN adapter. A patch by Daniil Dulov removes a redundant NULL check in the softing driver. Oliver Hartkopp contributes a patch to add CANXL virtual CAN network identifier support. A patch by myself removes Naga Sureshkumar Relli as the maintainer of the xilinx_can driver, as their email bounces. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-14 10:00:35 +00:00
Lorenzo Bianconi	27accb3cc0	veth: rely on skb_pp_cow_data utility routine Rely on skb_pp_cow_data utility routine and remove duplicated code. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/029cc14cce41cb242ee7efdcf32acc81f1ce4e9f.1707729884.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-13 19:22:30 -08:00
Lorenzo Bianconi	e6d5dbdd20	xdp: add multi-buff support for xdp running in generic mode Similar to native xdp, do not always linearize the skb in netif_receive_generic_xdp routine but create a non-linear xdp_buff to be processed by the eBPF program. This allows adding multi-buffer support for xdp running in generic mode. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/1044d6412b1c3e95b40d34993fd5f37cd2f319fd.1707729884.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-13 19:22:30 -08:00
Lorenzo Bianconi	4d2bb0bfe8	xdp: rely on skb pointer reference in do_xdp_generic and netif_receive_generic_xdp Rely on skb pointer reference instead of the skb pointer in do_xdp_generic and netif_receive_generic_xdp routine signatures. This is a preliminary patch to add multi-buff support for xdp running in generic mode where we will need to reallocate the skb to avoid linearization and we will need to make it visible to do_xdp_generic() caller. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/c09415b1f48c8620ef4d76deed35050a7bddf7c2.1707729884.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-13 19:22:30 -08:00
Lorenzo Bianconi	2b0cfa6e49	net: add generic percpu page_pool allocator Introduce a generic percpu page_pool allocator. Moreover, add page_pool_create_percpu() and a cpuid field in the page_pool struct in order to recycle the page in the page_pool "hot" cache if napi_pp_put_page() is running on the same cpu. This is a preliminary patch to add xdp multi-buff support for xdp running in generic mode. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/80bc4285228b6f4220cd03de1999d86e46e3fcbd.1707729884.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-13 19:22:30 -08:00
Eric Dumazet	0bef512012	net: add netdev_lockdep_set_classes() to virtual drivers Based on a syzbot report, it appears many virtual drivers do not yet use netdev_lockdep_set_classes(), triggering lockdep false positives. WARNING: possible recursive locking detected 6.8.0-rc4-next-20240212-syzkaller #0 Not tainted syz-executor.0/19016 is trying to acquire lock: ffff8880162cb298 (_xmit_ETHER#2){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline] ffff8880162cb298 (_xmit_ETHER#2){+.-.}-{2:2}, at: __netif_tx_lock include/linux/netdevice.h:4452 [inline] ffff8880162cb298 (_xmit_ETHER#2){+.-.}-{2:2}, at: sch_direct_xmit+0x1c4/0x5f0 net/sched/sch_generic.c:340 but task is already holding lock: ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline] ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: __netif_tx_lock include/linux/netdevice.h:4452 [inline] ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: sch_direct_xmit+0x1c4/0x5f0 net/sched/sch_generic.c:340 other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(_xmit_ETHER#2); lock(_xmit_ETHER#2); * DEADLOCK * May be due to missing lock nesting notation 9 locks held by syz-executor.0/19016: #0: ffffffff8f385208 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock net/core/rtnetlink.c:79 [inline] #0: ffffffff8f385208 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x82c/0x1040 net/core/rtnetlink.c:6603 #1: ffffc90000a08c00 ((&in_dev->mr_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0xc0/0x600 kernel/time/timer.c:1697 #2: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:298 [inline] #2: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:750 [inline] #2: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1360 net/ipv4/ip_output.c:228 #3: ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: local_bh_disable include/linux/bottom_half.h:20 [inline] #3: 
ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: rcu_read_lock_bh include/linux/rcupdate.h:802 [inline] #3: ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x2c4/0x3b10 net/core/dev.c:4284 #4: ffff8880416e3258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: spin_trylock include/linux/spinlock.h:361 [inline] #4: ffff8880416e3258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: qdisc_run_begin include/net/sch_generic.h:195 [inline] #4: ffff8880416e3258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: __dev_xmit_skb net/core/dev.c:3771 [inline] #4: ffff8880416e3258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: __dev_queue_xmit+0x1262/0x3b10 net/core/dev.c:4325 #5: ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline] #5: ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: __netif_tx_lock include/linux/netdevice.h:4452 [inline] #5: ffff8880223db4d8 (_xmit_ETHER#2){+.-.}-{2:2}, at: sch_direct_xmit+0x1c4/0x5f0 net/sched/sch_generic.c:340 #6: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:298 [inline] #6: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:750 [inline] #6: ffffffff8e131520 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1360 net/ipv4/ip_output.c:228 #7: ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: local_bh_disable include/linux/bottom_half.h:20 [inline] #7: ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: rcu_read_lock_bh include/linux/rcupdate.h:802 [inline] #7: ffffffff8e131580 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x2c4/0x3b10 net/core/dev.c:4284 #8: ffff888014d9d258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: spin_trylock include/linux/spinlock.h:361 [inline] #8: ffff888014d9d258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: qdisc_run_begin 
include/net/sch_generic.h:195 [inline] #8: ffff888014d9d258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: __dev_xmit_skb net/core/dev.c:3771 [inline] #8: ffff888014d9d258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+...}-{2:2}, at: __dev_queue_xmit+0x1262/0x3b10 net/core/dev.c:4325 stack backtrace: CPU: 1 PID: 19016 Comm: syz-executor.0 Not tainted 6.8.0-rc4-next-20240212-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 check_deadlock kernel/locking/lockdep.c:3062 [inline] validate_chain+0x15c1/0x58e0 kernel/locking/lockdep.c:3856 __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137 lock_acquire+0x1e4/0x530 kernel/locking/lockdep.c:5754 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:351 [inline] __netif_tx_lock include/linux/netdevice.h:4452 [inline] sch_direct_xmit+0x1c4/0x5f0 net/sched/sch_generic.c:340 __dev_xmit_skb net/core/dev.c:3784 [inline] __dev_queue_xmit+0x1912/0x3b10 net/core/dev.c:4325 neigh_output include/net/neighbour.h:542 [inline] ip_finish_output2+0xe66/0x1360 net/ipv4/ip_output.c:235 iptunnel_xmit+0x540/0x9b0 net/ipv4/ip_tunnel_core.c:82 ip_tunnel_xmit+0x20ee/0x2960 net/ipv4/ip_tunnel.c:831 erspan_xmit+0x9de/0x1460 net/ipv4/ip_gre.c:720 __netdev_start_xmit include/linux/netdevice.h:4989 [inline] netdev_start_xmit include/linux/netdevice.h:5003 [inline] xmit_one net/core/dev.c:3555 [inline] dev_hard_start_xmit+0x242/0x770 net/core/dev.c:3571 sch_direct_xmit+0x2b6/0x5f0 net/sched/sch_generic.c:342 __dev_xmit_skb net/core/dev.c:3784 [inline] __dev_queue_xmit+0x1912/0x3b10 net/core/dev.c:4325 neigh_output include/net/neighbour.h:542 [inline] ip_finish_output2+0xe66/0x1360 net/ipv4/ip_output.c:235 igmpv3_send_cr net/ipv4/igmp.c:723 [inline] 
igmp_ifc_timer_expire+0xb71/0xd90 net/ipv4/igmp.c:813 call_timer_fn+0x17e/0x600 kernel/time/timer.c:1700 expire_timers kernel/time/timer.c:1751 [inline] __run_timers+0x621/0x830 kernel/time/timer.c:2038 run_timer_softirq+0x67/0xf0 kernel/time/timer.c:2051 __do_softirq+0x2bc/0x943 kernel/softirq.c:554 invoke_softirq kernel/softirq.c:428 [inline] __irq_exit_rcu+0xf2/0x1c0 kernel/softirq.c:633 irq_exit_rcu+0x9/0x30 kernel/softirq.c:645 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1076 [inline] sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1076 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702 RIP: 0010:resched_offsets_ok kernel/sched/core.c:10127 [inline] RIP: 0010:__might_resched+0x16f/0x780 kernel/sched/core.c:10142 Code: 00 4c 89 e8 48 c1 e8 03 48 ba 00 00 00 00 00 fc ff df 48 89 44 24 38 0f b6 04 10 84 c0 0f 85 87 04 00 00 41 8b 45 00 c1 e0 08 <01> d8 44 39 e0 0f 85 d6 00 00 00 44 89 64 24 1c 48 8d bc 24 a0 00 RSP: 0018:ffffc9000ee069e0 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8880296a9e00 RDX: dffffc0000000000 RSI: ffff8880296a9e00 RDI: ffffffff8bfe8fa0 RBP: ffffc9000ee06b00 R08: ffffffff82326877 R09: 1ffff11002b5ad1b R10: dffffc0000000000 R11: ffffed1002b5ad1c R12: 0000000000000000 R13: ffff8880296aa23c R14: 000000000000062a R15: 1ffff92001dc0d44 down_write+0x19/0x50 kernel/locking/rwsem.c:1578 kernfs_activate fs/kernfs/dir.c:1403 [inline] kernfs_add_one+0x4af/0x8b0 fs/kernfs/dir.c:819 __kernfs_create_file+0x22e/0x2e0 fs/kernfs/file.c:1056 sysfs_add_file_mode_ns+0x24a/0x310 fs/sysfs/file.c:307 create_files fs/sysfs/group.c:64 [inline] internal_create_group+0x4f4/0xf20 fs/sysfs/group.c:152 internal_create_groups fs/sysfs/group.c:192 [inline] sysfs_create_groups+0x56/0x120 fs/sysfs/group.c:218 create_dir lib/kobject.c:78 [inline] kobject_add_internal+0x472/0x8d0 lib/kobject.c:240 kobject_add_varg lib/kobject.c:374 [inline] kobject_init_and_add+0x124/0x190 
lib/kobject.c:457 netdev_queue_add_kobject net/core/net-sysfs.c:1706 [inline] netdev_queue_update_kobjects+0x1f3/0x480 net/core/net-sysfs.c:1758 register_queue_kobjects net/core/net-sysfs.c:1819 [inline] netdev_register_kobject+0x265/0x310 net/core/net-sysfs.c:2059 register_netdevice+0x1191/0x19c0 net/core/dev.c:10298 bond_newlink+0x3b/0x90 drivers/net/bonding/bond_netlink.c:576 rtnl_newlink_create net/core/rtnetlink.c:3506 [inline] __rtnl_newlink net/core/rtnetlink.c:3726 [inline] rtnl_newlink+0x158f/0x20a0 net/core/rtnetlink.c:3739 rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6606 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543 netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367 netlink_sendmsg+0xa3c/0xd70 net/netlink/af_netlink.c:1908 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 __sys_sendto+0x3a4/0x4f0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0xde/0x100 net/socket.c:2199 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6d/0x75 RIP: 0033:0x7fc3fa87fa9c Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240212140700.2795436-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-13 18:45:06 -08:00
Eric Dumazet	c74e103991	net: bridge: use netdev_lockdep_set_classes() br_set_lockdep_class() is missing many details. Use generic netdev_lockdep_set_classes() to not worry anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240212140700.2795436-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-13 18:45:06 -08:00
Eric Dumazet	9a3c93af54	vlan: use netdev_lockdep_set_classes() vlan uses vlan_dev_set_lockdep_class() which lacks qdisc_tx_busylock initialization. Use generic netdev_lockdep_set_classes() to not worry anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240212140700.2795436-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-13 18:45:06 -08:00
Eric Dumazet	3e41af9076	rtnetlink: use xarray iterator to implement rtnl_dump_ifinfo() Adopt net->dev_by_index as I did in commit `0e0939c0ad` ("net-procfs: use xarray iterator to implement /proc/net/dev") This makes sure an existing device is always visible in the dump, regardless of concurrent insertions/deletions. v2: added suggestions from Jakub Kicinski and Ido Schimmel, thanks for the help ! Link: https://lore.kernel.org/all/20240209142441.6c56435b@kernel.org/ Link: https://lore.kernel.org/all/ZckR-XOsULLI9EHc@shredder/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20240211214404.1882191-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-13 18:34:50 -08:00
Eric Dumazet	f383ced24d	vlan: use xarray iterator to implement /proc/net/vlan/config Adopt net->dev_by_index as I did in commit `0e0939c0ad` ("net-procfs: use xarray iterator to implement /proc/net/dev") Not only this removes quadratic behavior, it also makes sure an existing vlan device is always visible in the dump, regardless of concurrent net->dev_base_head changes. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20240211214404.1882191-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-13 18:34:49 -08:00
Stephen Hemminger	32c7eec21c	net: sched: codel replace GPLv2/BSD boilerplate The prologue to codel is using BSD-3 clause and GPL-2 boilerplate language. Replace it by using SPDX. The automated treewide scan in commit `d2912cb15b` ("treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500") did not pick up dual licensed code. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Dave Taht <dave.taht@gmail.com> Link: https://lore.kernel.org/r/20240211172532.6568-1-stephen@networkplumber.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-02-13 13:45:19 +01:00
Oliver Hartkopp	c83c22ec14	can: canxl: add virtual CAN network identifier support CAN XL data frames contain an 8-bit virtual CAN network identifier (VCID). A VCID value of zero represents an 'untagged' CAN XL frame. To receive and send these optional VCIDs via CAN_RAW sockets a new socket option CAN_RAW_XL_VCID_OPTS is introduced to define/access VCID content: - tx: set the outgoing VCID value by the kernel (one fixed 8-bit value) - tx: pass through VCID values from the user space (e.g. for traffic replay) - rx: apply VCID receive filter (value/mask) to be passed to the user space With the 'tx pass through' option CAN_RAW_XL_VCID_TX_PASS all valid VCID values can be sent, e.g. to replay fully qualified CAN XL traffic. The VCID value provided for the CAN_RAW_XL_VCID_TX_SET option will override the VCID value in the struct canxl_frame.prio defined for CAN_RAW_XL_VCID_TX_PASS when both flags are set. With a rx_vcid_mask of zero all possible VCID values (0x00 - 0xFF) are passed to the user space when the CAN_RAW_XL_VCID_RX_FILTER flag is set. Without this flag only untagged CAN XL frames (VCID = 0x00) are delivered to the user space (default). The 8-bit VCID is stored inside the CAN XL prio element (only in CAN XL frames!) to not interfere with other CAN content or the CAN filters provided by the CAN_RAW sockets and kernel infrastructure. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/all/20240212213550.18516-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2024-02-13 11:47:13 +01:00
Harshit Mogalapalli	86fe596b58	net: sched: Remove NET_ACT_IPT from Kconfig After commit `ba24ea1291` ("net/sched: Retire ipt action"), NET_ACT_IPT is not needed anymore as the action is retired and the code is removed. Clean up the Kconfig part as well. Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240209180656.867546-1-harshit.m.mogalapalli@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-02-13 11:24:35 +01:00
Guillaume Nault	a3522a2edb	ipv4: Set the routing scope properly in ip_route_output_ports(). Set scope automatically in ip_route_output_ports() (using the socket SOCK_LOCALROUTE flag). This way, callers don't have to overload the tos with the RTO_ONLINK flag, like RT_CONN_FLAGS() does. For callers that don't pass a struct sock, this doesn't change anything as the scope is still set to RT_SCOPE_UNIVERSE when sk is NULL. Callers that passed a struct sock and used RT_CONN_FLAGS(sk) or RT_CONN_FLAGS_TOS(sk, tos) for the tos are modified to use ip_sock_tos(sk) and RT_TOS(tos) respectively, as overloading tos with the RTO_ONLINK flag now becomes unnecessary. In drivers/net/amt.c, all ip_route_output_ports() calls use a 0 tos parameter, ignoring the SOCK_LOCALROUTE flag of the socket. But the sk parameter is a kernel socket, which doesn't have any configuration path for setting SOCK_LOCALROUTE anyway. Therefore, ip_route_output_ports() will continue to initialise scope with RT_SCOPE_UNIVERSE and amt.c doesn't need to be modified. Also, remove RT_CONN_FLAGS() and RT_CONN_FLAGS_TOS() from route.h as these macros are now unused. The objective is to eventually remove RTO_ONLINK entirely to allow converting ->flowi4_tos to dscp_t. This will ensure proper isolation between the DSCP and ECN bits, thus minimising the risk of introducing bugs where TOS values interfere with ECN. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/dacfd2ab40685e20959ab7b53c427595ba229e7d.1707496938.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-12 17:33:05 -08:00
Oliver Hartkopp	e1aa35e163	can: isotp: support dynamic flow control parameters The ISO15765-2 standard allows the PDU communication parameters blocksize (BS) and Separation Time minimum (STmin) to be taken either from the first received flow control (FC) frame ("static") or from every received FC frame ("dynamic"). Add a new CAN_ISOTP_DYN_FC_PARMS flag to support dynamic FC parameters. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/all/20231208165729.3011-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2024-02-12 16:55:40 +01:00
Nicolas Maier	fec846fa7e	can: bcm: add recvmsg flags for own, local and remote traffic CAN RAW sockets allow userspace to tell if a received CAN frame comes from the same socket, another socket on the same host, or another host. See commit `1e55659ce6` ("can-raw: add msg_flags to distinguish local traffic"). However, this feature is missing in CAN BCM sockets. Add the same feature to CAN BCM sockets. When reading a received frame (opcode RX_CHANGED) using recvmsg, two flags in msg->msg_flags may be set following the previous convention (from CAN RAW), to distinguish between 'own', 'local' and 'remote' CAN traffic. Update the documentation to reflect this change. Signed-off-by: Nicolas Maier <nicolas.maier.dev@gmail.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/all/20240120081018.2319-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2024-02-12 16:55:17 +01:00
Eric Dumazet	1ebb85f9c0	netfilter: conntrack: expedite rcu in nf_conntrack_cleanup_net_list nf_conntrack_cleanup_net_list() is calling synchronize_net() while RTNL is not held. This effectively calls synchronize_rcu(). synchronize_rcu() is much slower than synchronize_rcu_expedited(), and cleanup_net() is currently single threaded. In many workloads we want cleanup_net() to be faster, in order to free memory and various sysfs and procfs entries as fast as possible. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-12 12:17:03 +00:00
Eric Dumazet	78c3253f27	net: use synchronize_rcu_expedited in cleanup_net() cleanup_net() is calling synchronize_rcu() right before acquiring RTNL. synchronize_rcu() is much slower than synchronize_rcu_expedited(), and cleanup_net() is currently single threaded. In many workloads we want cleanup_net() to be fast, in order to free memory and various sysfs and procfs entries as fast as possible. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-12 12:17:03 +00:00
Eric Dumazet	2cd0c51e3b	ipv4/fib: use synchronize_net() when holding RTNL tnode_free() should use synchronize_net() instead of synchronize_rcu() to release RTNL sooner. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-12 12:17:03 +00:00
Eric Dumazet	48ebf6ebbc	bridge: vlan: use synchronize_net() when holding RTNL br_vlan_flush() and nbp_vlan_flush() should use synchronize_net() instead of synchronize_rcu() to release RTNL sooner. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-12 12:17:02 +00:00
Eric Dumazet	4cd582ffa5	net: use synchronize_net() in dev_change_name() dev_change_name() holds RTNL, we better use synchronize_net() instead of plain synchronize_rcu(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-12 12:17:02 +00:00
Eric Dumazet	17ef8efc00	ipv6: mcast: remove one synchronize_net() barrier in ipv6_mc_down() As discussed in the past (commit `2d3916f318` ("ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report()")) I think the synchronize_net() call in ipv6_mc_down() is not needed. Under load, synchronize_net() can last between 200 usec and 5 ms. KASAN seems to agree as well. Fixes: `f185de28d9` ("mld: add new workqueues for process mld events") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Taehee Yoo <ap420073@gmail.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-12 12:17:02 +00:00
Kui-Feng Lee	768e06a8bc	net/ipv6: set expires in modify_prefix_route() if RTF_EXPIRES is set. Make the decision to set or clean the expires of a route based on the RTF_EXPIRES flag, rather than the value of the "expires" argument. This patch makes no logical difference, but makes inet6_addr_modify() and modify_prefix_route() consistent. The function inet6_addr_modify() is the only caller of modify_prefix_route(), and it passes the RTF_EXPIRES flag and an expiration value. The RTF_EXPIRES flag is turned on or off based on the value of valid_lft. The RTF_EXPIRES flag is turned on if valid_lft is a finite value (not infinite, not 0xffffffff). Even if valid_lft is 0, the RTF_EXPIRES flag remains on. The expiration value being passed is equal to the valid_lft value if the flag is on. However, if the valid_lft value is infinite, the expiration value becomes 0 and the RTF_EXPIRES flag is turned off. Despite this, modify_prefix_route() decides to set the expiration value if the received expiration value is not zero. This mixing of infinite and zero cases creates an inconsistency. Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-12 10:24:12 +00:00
Kui-Feng Lee	5eb902b8e7	net/ipv6: Remove expired routes with a separated list of routes. FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree can be expensive if the number of routes in a table is big, even if most of them are permanent. Checking routes in a separated list of routes having expiration will avoid this potential issue. Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-12 10:24:12 +00:00
Kui-Feng Lee	60df43d3a7	net/ipv6: Remove unnecessary clean. The route here is newly created. It is unnecessary to call fib6_clean_expires() on it. Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-12 10:24:12 +00:00
Kui-Feng Lee	129e406e18	net/ipv6: set expires in rt6_add_dflt_router(). Pass the duration of a lifetime (in seconds) to the function rt6_add_dflt_router() so that it can properly set the expiration time. The function ndisc_router_discovery() is the only one that calls rt6_add_dflt_router(), and it will later set the expiration time for the route created by rt6_add_dflt_router(). However, there is a gap of time between calling rt6_add_dflt_router() and setting the expiration time in ndisc_router_discovery(). During this period, there is a possibility that a new route may be removed from the routing table. By setting the correct expiration time in rt6_add_dflt_router(), we can prevent this from happening. The reason for setting RTF_EXPIRES in rt6_add_dflt_router() is to start the Garbage Collection (GC) timer, as it only activates when a route with RTF_EXPIRES is added to a table. Suggested-by: David Ahern <dsahern@kernel.org> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2024-02-12 10:24:12 +00:00
Jakub Kicinski	f6ce9a1f6a	Merge branch 'for-io_uring-add-napi-busy-polling-support' Merge netdev bits of io_uring busy polling support. Jens Axboe says: ==================== io_uring: add napi busy polling support I finally got around to testing this patchset in its current form, and results look fine to me. It Works. Using the basic ping/pong test that's part of the liburing addition, without enabling NAPI I get: Stock settings, no NAPI, 100k packets: rtt(us) min/avg/max/mdev = 31.730/37.006/87.960/0.497 and with -t10 -b enabled: rtt(us) min/avg/max/mdev = 23.250/29.795/63.511/1.203 In short, this patchset enables per io_uring NAPI enablement, rather than need to enable that globally. This allows targeted NAPI usage with io_uring. Here's Stefan's v15 posting, which predates this one: https://lore.kernel.org/io-uring/20230608163839.2891748-1-shr@devkernel.io/ ==================== Link: https://lore.kernel.org/r/20240206163422.646218-1-axboe@kernel.dk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-09 10:35:34 -08:00
Stefan Roesch	b4e8ae5c8c	net: add napi_busy_loop_rcu() This adds the napi_busy_loop_rcu() function. This function assumes that the calling function is already holding the rcu read lock and napi_busy_loop() does not need to take the rcu read lock. Add a NAPI_F_NO_SCHED flag, which tells __napi_busy_loop() to abort if we need to reschedule rather than drop the RCU read lock and reschedule. Signed-off-by: Stefan Roesch <shr@devkernel.io> Link: https://lore.kernel.org/r/20230608163839.2891748-3-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-09 10:01:09 -08:00
Stefan Roesch	13d381b440	net: split off __napi_busy_poll from napi_busy_poll This splits off the key part of the napi_busy_poll function into its own function, __napi_busy_poll, and changes the prefer_busy_poll bool to be flag based to allow passing in more flags in the future. This is done in preparation for an additional napi_busy_poll() function, that doesn't take the rcu_read_lock(). The new function is introduced in the next patch. Signed-off-by: Stefan Roesch <shr@devkernel.io> Link: https://lore.kernel.org/r/20230608163839.2891748-2-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-09 10:01:09 -08:00
Eric Dumazet	e7689879d1	ethtool: do not use rtnl in ethnl_default_dumpit() for_each_netdev_dump() can be used with RCU protection, no need for rtnl if we are going to use dev_hold()/dev_put(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240207153514.3640952-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-08 19:08:51 -08:00
Eric Dumazet	0e0939c0ad	net-procfs: use xarray iterator to implement /proc/net/dev In commit `759ab1edb5` ("net: store netdevs in an xarray") Jakub added net->dev_by_index to map ifindex to netdevices. We can get rid of the old hash table (net->dev_index_head), one patch at a time, if performance is acceptable. This patch replaces unpleasant code with something more readable. As a bonus, /proc/net/dev gets netdevices sorted by their ifindex. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240207165318.3814525-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-08 19:06:09 -08:00
Vladimir Oltean	36f75f74dc	net: dsa: tag_sja1105: remove "inline" keyword The convention is to not use the "inline" keyword for functions in C files, but to let the compiler choose. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240206112927.4134375-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-08 19:03:57 -08:00
Vladimir Oltean	83acbb9d07	net: dsa: remove "inline" from dsa_user_netpoll_send_skb() The convention is to not use "inline" functions in C files, and let the compiler decide whether to inline or not. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240206112927.4134375-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-08 19:03:57 -08:00
Jakub Kicinski	3be042cf46	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/stmicro/stmmac/common.h `38cc3c6dcc` ("net: stmmac: protect updates of 64-bit statistics counters") `fd5a6a7131` ("net: stmmac: est: Per Tx-queue error count for HLBF") `c5c3e1bfc9` ("net: stmmac: Offload queueMaxSDU from tc-taprio") drivers/net/wireless/microchip/wilc1000/netdev.c `c901388028` ("wifi: fill in MODULE_DESCRIPTION()s for wilc1000") `328efda22a` ("wifi: wilc1000: do not realloc workqueue everytime an interface is added") net/unix/garbage.c `11498715f2` ("af_unix: Remove io_uring code for GC.") `1279f9d9de` ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-08 15:30:33 -08:00
Paolo Abeni	63e4b9d693	netfilter pull request 24-02-08 Merge tag 'nf-24-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Narrow down target/match revision to u8 in nft_compat. 2) Bail out with unused flags in nft_compat. 3) Restrict layer 4 protocol to u16 in nft_compat. 4) Remove static in pipapo get command that slipped through when reducing set memory footprint. 5) Follow up incremental fix for the ipset performance regression, this includes the missing gc cancellation, from Jozsef Kadlecsik. 6) Allow to filter by zone 0 in ctnetlink, do not interpret zone 0 as no filtering, from Felix Huettner. 7) Reject direction for NFT_CT_ID. 8) Use timestamp to check for set element expiration while transaction is handled to prevent garbage collection from removing set elements that were just added by this transaction. Packet path and netlink dump/get path still use current time to check for expiration. 9) Restore NF_REPEAT in nfnetlink_queue, from Florian Westphal. 
10) map_index needs to be percpu and per-set, not just percpu. At this time it's possible for a pipapo set to fill the all-zero part with ones and take the 'might have bits set' as 'start-from-zero' area. From Florian Westphal. This includes three patches: - Change scratchpad area to a structure that provides space for a per-set-and-cpu toggle and uses it instead of the percpu one. - Add a new free helper to prepare for the next patch. - Remove the scratch_aligned pointer and make the AVX2 implementation use the exact same memory addresses for read/store of the matching state. netfilter pull request 24-02-08 * tag 'nf-24-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_set_pipapo: remove scratch_aligned pointer netfilter: nft_set_pipapo: add helper to release pcpu scratch area netfilter: nft_set_pipapo: store index in scratch maps netfilter: nft_set_rbtree: skip end interval element from gc netfilter: nfnetlink_queue: un-break NF_REPEAT netfilter: nf_tables: use timestamp to check for set element timeout netfilter: nft_ct: reject direction for ct id netfilter: ctnetlink: fix filtering for zone 0 netfilter: ipset: Missing gc cancellations fixed netfilter: nft_set_pipapo: remove static in nft_pipapo_get() netfilter: nft_compat: restrict match/target protocol to u16 netfilter: nft_compat: reject unused compat flag netfilter: nft_compat: narrow down revision to unsigned 8-bits ==================== Link: https://lore.kernel.org/r/20240208112834.1433-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-02-08 12:56:40 +01:00
Florian Westphal	5a8cdf6fd8	netfilter: nft_set_pipapo: remove scratch_aligned pointer use ->scratch for both avx2 and the generic implementation. After previous change the scratch->map member is always aligned properly for AVX2, so we can just use scratch->map in AVX2 too. The alignoff delta is stored in the scratchpad so we can reconstruct the correct address to free the area again. Fixes: `7400b06396` ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2024-02-08 12:24:02 +01:00
Florian Westphal	47b1c03c3c	netfilter: nft_set_pipapo: add helper to release pcpu scratch area After next patch simple kfree() is not enough anymore, so add a helper for it. Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2024-02-08 12:10:19 +01:00
Florian Westphal	76313d1a4a	netfilter: nft_set_pipapo: store index in scratch maps Pipapo needs a scratchpad area to keep state during matching. This state can be large and thus cannot reside on stack. Each set preallocates percpu areas for this. On each match stage, one scratchpad half starts with all-zero and the other is initialized to all-ones. At the end of each stage, the half that starts with all-ones is always zero. Before next field is tested, pointers to the two halves are swapped, i.e. resmap pointer turns into fill pointer and vice versa. After the last field has been processed, pipapo stashes the index toggle in a percpu variable, on the assumption that the next packet will start with the all-zero half and sets all bits in the other to 1. This isn't reliable. There can be multiple sets and we can't be sure that the upper and lower halves of every set's scratch map are always in sync (lookups can be conditional), so one set might have swapped while another might not have been queried. Thus we need to keep the index per-set-and-cpu, just like the scratchpad. Note that this bug fix is incomplete: there is a related issue. avx2 and normal implementation might use slightly different areas of the map array space due to the avx2 alignment requirements, so m->scratch (generic/fallback implementation) and ->scratch_aligned (avx) may partially overlap. scratch and scratch_aligned are not distinct objects, the latter is just the aligned address of the former. After this change, a write to scratch_align->map_index may write to scratch->map, so this issue becomes more prominent: we can set to 1 a bit in the supposedly-all-zero area of scratch->map[]. A followup patch will remove scratch_aligned and make the generic and avx code use the same (aligned) area. It's done in a separate change to ease review. 
Fixes: `3c4287f620` ("nf_tables: Add set type for arbitrary concatenation of ranges") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2024-02-08 12:10:19 +01:00
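
The index problem described in 76313d1a4a can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `struct scratch`, `sketch_lookup()`, `fill_half_is_clean()`, `MAP_WORDS` and the "keep bit 0" match rule are hypothetical names and simplifications invented for illustration. It models two scratch halves per set, a toggle that selects the result half, and a check that the other half is still all-zero when a lookup begins.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAP_WORDS 2	/* illustrative size, not the kernel's */

/* Hypothetical per-set (and, in the kernel, per-cpu) scratchpad. */
struct scratch {
	bool map_index;				/* which half is the result map */
	unsigned long map[2][MAP_WORDS];	/* result half and fill half */
};

/* True if the half about to serve as fill map is still all-zero. */
static bool fill_half_is_clean(const struct scratch *s, bool idx)
{
	for (int w = 0; w < MAP_WORDS; w++)
		if (s->map[!idx][w])
			return false;
	return true;
}

/*
 * Model one lookup of nfields match stages. 'index' is the toggle to use:
 * either the scratchpad's own map_index or a variable shared between sets.
 * Returns false if the all-zero assumption was already broken on entry.
 */
static bool sketch_lookup(struct scratch *s, bool *index, int nfields)
{
	bool idx = *index;
	bool clean;

	assert(nfields > 0);
	clean = fill_half_is_clean(s, idx);

	/* The result half starts as all-ones. */
	memset(s->map[idx], 0xff, sizeof(s->map[idx]));

	for (int f = 0; f < nfields; f++) {
		unsigned long *res = s->map[idx];
		unsigned long *fill = s->map[!idx];

		for (int w = 0; w < MAP_WORDS; w++) {
			fill[w] |= res[w] & 1UL;	/* simplified "match" rule */
			res[w] = 0;			/* consumed half ends all-zero */
		}
		idx = !idx;				/* swap result and fill halves */
	}

	*index = idx;					/* stash the toggle for the next packet */
	return clean;
}
```

In this model, passing `&set.map_index` for each set keeps every lookup clean, while passing one shared toggle to two sets that are queried alternately (one field each) leaves the third lookup with a fill half that already has bits set, which mirrors the failure mode the commit message describes.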

76027 Commits