iv/linux

Author	SHA1	Message	Date
FUJITA Tomonori	3082412242	net: tn40xx: add phylink support This patch adds support for multiple PHY hardware with phylink. The adapters with TN40xx chips use multiple PHY hardware: AMCC QT2025, TI TLK10232, Aqrate AQR105, and Marvell 88X3120, 88X3310, and MV88E2010. For now, the PCI ID table of this driver enables only adapters using the QT2025 PHY. I've tested this driver and the QT2025 PHY driver (SFP+ 10G SR) with an Edimax EN-9320 10G adapter. Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Reviewed-by: Hans-Frieder Vogt <hfdevel@gmx.net> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20240623235507.108147-8-fujita.tomonori@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 18:44:19 -07:00
FUJITA Tomonori	7fdbd2f2bb	net: tn40xx: add mdio bus support This patch adds support for the mdio bus. A later patch adds PHYLIB support on top of this. Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Link: https://patch.msgid.link/20240623235507.108147-7-fujita.tomonori@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 18:44:19 -07:00
FUJITA Tomonori	37c4947af4	net: tn40xx: add basic Rx handling This patch adds basic Rx handling. The Rx logic uses three major data structures: two ring buffers shared with the NIC and one database. One ring buffer is used to tell the NIC which memory to store received packets in. The other is used to get information from the NIC about received packets. The database is used to keep the information about DMA mapping. After a packet arrives, the db is used to pass the packet to the network stack. Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Reviewed-by: Hans-Frieder Vogt <hfdevel@gmx.net> Link: https://patch.msgid.link/20240623235507.108147-6-fujita.tomonori@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 18:44:19 -07:00
FUJITA Tomonori	dd2a0ff554	net: tn40xx: add basic Tx handling This patch adds device specific structures to initialize the hardware with basic Tx handling. The original driver loads the embedded firmware in the header file. This driver is implemented to use the firmware APIs. The Tx logic uses three major data structures: two ring buffers shared with the NIC and one database. One ring buffer is used to send the NIC information about packets to be sent. The other is used to get information from the NIC about packets that have been sent. The database is used to keep the information about DMA mapping. After a packet is sent, the db is used to free the resources used for the packet. Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Link: https://patch.msgid.link/20240623235507.108147-5-fujita.tomonori@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 18:44:19 -07:00
FUJITA Tomonori	ffa28c748b	net: tn40xx: add register defines This adds several defines to handle registers in Tehuti Networks TN40xx chips for later patches. Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Reviewed-by: Hans-Frieder Vogt <hfdevel@gmx.net> Link: https://patch.msgid.link/20240623235507.108147-4-fujita.tomonori@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 18:44:19 -07:00
FUJITA Tomonori	ab61adc600	net: tn40xx: add pci driver for Tehuti Networks TN40xx chips This just adds the scaffolding for an ethernet driver for Tehuti Networks TN40xx chips. Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240623235507.108147-3-fujita.tomonori@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 18:44:19 -07:00
FUJITA Tomonori	eee5528890	PCI: Add Edimax Vendor ID to pci_ids.h Add the Edimax Vendor ID (0x1432) for an ethernet driver for Tehuti Networks TN40xx chips. This ID can be used for Realtek 8180 and Ralink rt28xx wireless drivers. Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20240623235507.108147-2-fujita.tomonori@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 18:44:19 -07:00
Chris Packham	c0c68e4d52	dt-bindings: net: dsa: mediatek,mt7530: Minor wording fixes Update the mt7530 binding with some minor updates that make the document easier to read. Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Acked-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://patch.msgid.link/20240624211858.1990601-1-chris.packham@alliedtelesis.co.nz Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:54:53 -07:00
Jakub Kicinski	a425a973e9	Merge branch 'gve-add-flow-steering-support' Ziwei Xiao says: ==================== gve: Add flow steering support To support flow steering in the GVE driver, two adminq changes need to be made in advance. The first one adds an adminq mutex lock, which allows the incoming flow steering operations to temporarily drop the rtnl_lock to reduce the latency of registering flow rules among several NICs at the same time. This could be achieved by future changes that reduce the drivers' dependencies on the rtnl lock for particular ethtool ops. The second one adds the extended adminq command so that we can support larger adminq commands such as the configure_flow_rule command. That patch adds a new function called gve_adminq_execute_extended_cmd with the __maybe_unused attribute. That attribute will be removed in the third patch of this series, which uses the previously unused function. The other three patches are needed for the actual flow steering feature support in the driver. ==================== Link: https://patch.msgid.link/20240625001232.1476315-1-ziweixiao@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:48:35 -07:00
Jeroen de Borst	6f3bc48756	gve: Add flow steering ethtool support Implement the ethtool commands that can be used to configure and query flow-steering rules. A large part of this change consists of translating the ethtool representation of 'ntuples' to our internal gve_flow_rule and vice-versa in the new created gve_flow_rule.c Considering the possible large amount of flow rules, the driver doesn't store all the rules locally. When the user runs 'ethtool -n <nic>' to check the registered rules, the driver will send adminq command to query a limited amount of rules/rule ids(that filled in a 4096 bytes dma memory) at a time as a cache for the ethtool queries. The adminq query commands will be repeated for several times until the ethtool has queried all the needed rules. Signed-off-by: Jeroen de Borst <jeroendb@google.com> Co-developed-by: Ziwei Xiao <ziweixiao@google.com> Signed-off-by: Ziwei Xiao <ziweixiao@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240625001232.1476315-6-ziweixiao@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:48:33 -07:00
Jeroen de Borst	57718b60df	gve: Add flow steering adminq commands Add new adminq commands for the driver to configure and query flow rules that are stored in the device. Flow steering rules are assigned with a location that determines the relative order of the rules. Flow rules can run up to an order of millions. In such cases, storing a full copy of the rules in the driver to prepare for the ethtool query is infeasible while querying them from the device is better. That needs to be optimized too so that we don't send a lot of adminq commands. The solution here is to store a limited number of rules/rule ids in the driver in a cache. Use dma_pool to allocate 4k bytes which lets device write at most 46 flow rules(4096/88) or 1024 rule ids(4096/4) at a time. For configuring flow rules, there are 3 sub-commands: - ADD which adds a rule at the location supplied - DEL which deletes the rule at the location supplied - RESET which clears all currently active rules in the device For querying flow rules, there are also 3 sub-commands: - QUERY_RULES corresponds to ETHTOOL_GRXCLSRULE. It fills the rules in the allocated cache after querying the device - QUERY_RULES_IDS corresponds to ETHTOOL_GRXCLSRLALL. It fills the rule_ids in the allocated cache after querying the device - QUERY_RULES_STATS corresponds to ETHTOOL_GRXCLSRLCNT. It queries the device's current flow rule number and the supported max flow rule limit Signed-off-by: Jeroen de Borst <jeroendb@google.com> Co-developed-by: Ziwei Xiao <ziweixiao@google.com> Signed-off-by: Ziwei Xiao <ziweixiao@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240625001232.1476315-5-ziweixiao@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:48:33 -07:00
Jeroen de Borst	3519c00557	gve: Add flow steering device option Add a new device option to signal to the driver that the device supports flow steering. This device option also carries the maximum number of flow steering rules that the device can store. Signed-off-by: Jeroen de Borst <jeroendb@google.com> Co-developed-by: Ziwei Xiao <ziweixiao@google.com> Signed-off-by: Ziwei Xiao <ziweixiao@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240625001232.1476315-4-ziweixiao@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:48:33 -07:00
Jeroen de Borst	fcfe6318db	gve: Add adminq extended command The adminq command is limited to 64 bytes per entry and it's 56 bytes for the command itself at maximum. To support larger commands, we need to dma_alloc a separate memory to put the command in that memory and send the dma memory address instead of the actual command. Introduce an extended adminq command to wrap the real command with the inner opcode and the allocated dma memory address specified. Once the device receives it, it can get the real command from the given dma memory address. As designed with the device, all the extended commands will use inner opcode larger than 0xFF. Signed-off-by: Jeroen de Borst <jeroendb@google.com> Co-developed-by: Ziwei Xiao <ziweixiao@google.com> Signed-off-by: Ziwei Xiao <ziweixiao@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240625001232.1476315-3-ziweixiao@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:48:33 -07:00
Ziwei Xiao	1108566ca5	gve: Add adminq mutex lock We were depending on the rtnl_lock to make sure there is only one adminq command running at a time. But some commands may take too long to hold the rtnl_lock, such as the upcoming flow steering operations. For such situations, it can temporarily drop the rtnl_lock, and replace it for these operations with a new adminq lock, which can ensure the adminq command execution to be thread-safe. Signed-off-by: Ziwei Xiao <ziweixiao@google.com> Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240625001232.1476315-2-ziweixiao@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:48:33 -07:00
Jakub Kicinski	63173885cc	Merge branch 'ethtool-provide-the-dim-profile-fine-tuning-channel' Heng Qi says: ==================== ethtool: provide the dim profile fine-tuning channel The NetDIM library provides excellent acceleration for many modern network cards. However, the default profiles of DIM limits its maximum capabilities for different NICs, so providing a way which the NIC can be custom configured is necessary. Currently, the way is based on the commonly used "ethtool -C". For example, on the server side, the virtio-net NIC with rx dim enabled has 8 queues and runs nginx. The client uses the following command to send traffic to the server: ./wrk http://server_ip:80 -c 64 -t 5 -d 30 Then adjust the default rx-profile for server dim to {.usec = 1, .pkts = 256, .comps = n/a,}, {.usec = 8, .pkts = 256, .comps = n/a,}, {.usec = 30, .pkts = 256, .comps = n/a,}, {.usec = 64, .pkts = 256, .comps = n/a,}, {.usec = 128, .pkts = 256, .comps = n/a,} The server PPS is improved by 20%+. ==================== Link: https://patch.msgid.link/20240621101353.107425-1-hengqi@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:15:10 -07:00
Heng Qi	dcb67f6a9e	virtio-net: support dim profile fine-tuning Virtio-net has different types of back-end device implementations. In order to effectively optimize the dim library's gains for different device implementations, let's use the new interface params to initialize and query dim results from a customized profile list. Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240621101353.107425-6-hengqi@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:15:06 -07:00
Heng Qi	13ba28c5cd	dim: add new interfaces for initialization and getting results DIM-related mode and work have been collected in one same place, so new interfaces are added to provide convenience. Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240621101353.107425-5-hengqi@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:15:06 -07:00
Heng Qi	f750dfe825	ethtool: provide customized dim profile management The NetDIM library, currently leveraged by an array of NICs, delivers excellent acceleration benefits. Nevertheless, NICs vary significantly in their dim profile list prerequisites. Specifically, virtio-net backends may present diverse sw or hw device implementation, making a one-size-fits-all parameter list impractical. On Alibaba Cloud, the virtio DPU's performance under the default DIM profile falls short of expectations, partly due to a mismatch in parameter configuration. I also noticed that ice/idpf/ena and other NICs have customized profilelist or placed some restrictions on dim capabilities. Motivated by this, I tried adding new params for "ethtool -C" that provides a per-device control to modify and access a device's interrupt parameters. Usage ======== The target NIC is named ethx. Assume that ethx only declares support for rx profile setting (with DIM_PROFILE_RX flag set in profile_flags) and supports modification of usec and pkt fields. 1. Query the currently customized list of the device $ ethtool -c ethx ... rx-profile: {.usec = 1, .pkts = 256, .comps = n/a,}, {.usec = 8, .pkts = 256, .comps = n/a,}, {.usec = 64, .pkts = 256, .comps = n/a,}, {.usec = 128, .pkts = 256, .comps = n/a,}, {.usec = 256, .pkts = 256, .comps = n/a,} tx-profile: n/a 2. Tune $ ethtool -C ethx rx-profile 1,1,n_2,n,n_3,3,n_4,4,n_n,5,n "n" means do not modify this field. $ ethtool -c ethx ... rx-profile: {.usec = 1, .pkts = 1, .comps = n/a,}, {.usec = 2, .pkts = 256, .comps = n/a,}, {.usec = 3, .pkts = 3, .comps = n/a,}, {.usec = 4, .pkts = 4, .comps = n/a,}, {.usec = 256, .pkts = 5, .comps = n/a,} tx-profile: n/a 3. Hint If the device does not support some type of customized dim profiles, the corresponding "n/a" will display. If the "n/a" field is being modified, -EOPNOTSUPP will be reported. Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240621101353.107425-4-hengqi@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:15:06 -07:00
Heng Qi	b65e697a7c	dim: make DIMLIB dependent on NET DIMLIB's capabilities are supplied by the dim, net_dim, and rdma_dim objects, and dim's interfaces solely act as a base for net_dim and rdma_dim and are not explicitly used anywhere else. rdma_dim is utilized by the infiniband driver, while net_dim is for network devices, excluding the soc/fsl driver. In this patch, net_dim relies on some NET's interfaces, thus DIMLIB needs to explicitly depend on the NET Kconfig. The soc/fsl driver uses the functions provided by net_dim, so it also needs to depend on NET. Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240621101353.107425-3-hengqi@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:15:06 -07:00
Heng Qi	0e942053e4	linux/dim: move useful macros to .h file Useful macros will be used effectively elsewhere. These will be utilized in subsequent patches. Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240621101353.107425-2-hengqi@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:15:06 -07:00
Jakub Kicinski	c84f93243e	Merge branch 'ravb-add-mii-support-for-r-car-v4m' Geert Uytterhoeven says: ==================== ravb: Add MII support for R-Car V4M All EtherAVB instances on R-Car Gen3/Gen4 SoCs support the RGMII interface. In addition, the first two EtherAVB instances on R-Car V4M also support the MII interface, but this is not yet supported by the driver. This patch series adds support for MII on R-Car Gen4, after the customary cleanup. The corresponding pin control support is available in [1]. Compile-tested only, as all AVB interfaces on the Gray Hawk Single development board are connected to RGMII PHYs. No regressions on R-Car V4H. [1] "[PATCH/RFC] pinctrl: renesas: r8a779h0: Add AVB MII pins and groups" https://lore.kernel.org/4a0a12227f2145ef53b18bc08f45b19dcd745fc6.1718378739.git.geert+renesas@glider.be/ v1: https://lore.kernel.org/f0ef3e00aec461beb33869ab69ccb44a23d78f51.1718378166.git.geert+renesas@glider.be ==================== Link: https://patch.msgid.link/cover.1719234830.git.geert+renesas@glider.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:07:06 -07:00
Geert Uytterhoeven	6e0713cc82	ravb: Add MII support for R-Car V4M All EtherAVB instances on R-Car Gen3/Gen4 SoCs support the RGMII interface. In addition, the first two EtherAVB instances on R-Car V4M also support the MII interface, but this is not yet supported by the driver. Add support for MII on R-Car Gen4 by adding an R-Car Gen4-specific EMAC initialization function that selects the MII clock instead of the RGMII clock when the PHY interface is MII. Note that all implementations of EtherAVB on R-Car Gen4 SoCs have the APSR register, but only MII-capable instances are documented to have the MIISELECT bit, which has a documented value of zero when reserved. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Link: https://patch.msgid.link/3a21d1d6680864aa85afff9260234c2b8054020a.1719234830.git.geert+renesas@glider.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:07:04 -07:00
Geert Uytterhoeven	8d653d26ff	ravb: Improve ravb_hw_info instance order Move ravb_gen2_hw_info before ravb_gen3_hw_info to match ravb_match_table[] order. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Link: https://patch.msgid.link/a76febe3737e26365a784e9193da9363f22aa550.1719234830.git.geert+renesas@glider.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 17:07:04 -07:00
Li RongQing	d891317fe4	virtio_net: Remove u64_stats_update_begin()/end() for stats fetch This place only fetches the stats, so u64_stats_update_begin()/end() should not be used. The fetcher of the stats runs in the same context as the updater of the stats, so no protection is needed. Suggested-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://lore.kernel.org/20240621094552.53469-1-lirongqing@baidu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 16:42:26 -07:00
Yujie Liu	c4532232fa	selftests: net: remove unneeded IP_GRE config It seems that there is no definition for config IP_GRE, and it is not a dependency of other configs, so remove it. linux$ find -name Kconfig | xargs grep "IP_GRE" <-- nothing There is an IPV6_GRE config defined in net/ipv6/Kconfig. It only depends on NET_IPGRE_DEMUX but not IP_GRE. Signed-off-by: Yujie Liu <yujie.liu@intel.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240624055539.2092322-1-yujie.liu@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 08:37:55 -07:00
James Chapman	a8a8d89dbd	l2tp: remove incorrect __rcu attribute This fixes a sparse warning. Fixes: d18d3f0a24fc ("l2tp: replace hlist with simple list for per-tunnel session list") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202406220754.evK8Hrjw-lkp@intel.com/ Signed-off-by: James Chapman <jchapman@katalix.com> Link: https://patch.msgid.link/20240624082945.1925009-1-jchapman@katalix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-25 08:29:42 -07:00
Elad Yifee	73cfd947db	net: ethernet: mtk_eth_soc: ppe: prevent ppe update for non-mtk devices Introduce an additional validation to ensure that the PPE index is modified exclusively for mtk_eth ingress devices. This primarily addresses the issue related to WED operation with multiple PPEs. Fixes: dee4dd10c79a ("net: ethernet: mtk_eth_soc: ppe: add support for multiple PPEs") Signed-off-by: Elad Yifee <eladwf@gmail.com> Link: https://lore.kernel.org/r/20240623175113.24437-1-eladwf@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 15:35:53 +02:00
Paolo Abeni	1d70687592	Merge branch 'net-macb-wol-enhancements' Vineeth Karumanchi says: ==================== net: macb: WOL enhancements - Add provisioning for queue tie-off and queue disable during suspend. - Add support for ARP packet types to WoL. - Advertise WoL attributes by default. - Extend MACB supported WoL modes to the PHY supported WoL modes. - Deprecate magic-packet property. v6: https://lore.kernel.org/netdev/20240617070413.2291511-1-vineeth.karumanchi@amd.com/ v5: https://lore.kernel.org/netdev/20240611162827.887162-1-vineeth.karumanchi@amd.com/ v4: https://lore.kernel.org/lkml/20240610053936.622237-1-vineeth.karumanchi@amd.com/ v3: https://lore.kernel.org/netdev/20240605102457.4050539-1-vineeth.karumanchi@amd.com/ v2: https://lore.kernel.org/netdev/20240222153848.2374782-1-vineeth.karumanchi@amd.com/ v1: https://lore.kernel.org/lkml/20240130104845.3995341-1-vineeth.karumanchi@amd.com/#t ==================== Link: https://lore.kernel.org/r/20240621045735.3031357-1-vineeth.karumanchi@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:53:09 +02:00
Vineeth Karumanchi	783bfe279e	dt-bindings: net: cdns,macb: Deprecate magic-packet property WOL modes such as magic-packet should be an OS policy. By default, advertise supported modes and use ethtool to activate the required mode. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:53:07 +02:00
Vineeth Karumanchi	0cb8de39a7	net: macb: Add ARP support to WOL Extend wake-on LAN support with an ARP packet. Currently, if PHY supports WOL, ethtool ignores the modes supported by MACB. This change extends the WOL modes with MACB supported modes. Advertise wake-on LAN supported modes by default without relying on dt node. By default, wake-on LAN will be in disabled state. Using ethtool, users can enable/disable or choose packet types. For wake-on LAN via ARP, ensure the IP address is assigned and report an error otherwise. Co-developed-by: Harini Katakam <harini.katakam@amd.com> Signed-off-by: Harini Katakam <harini.katakam@amd.com> Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev> Tested-by: Claudiu Beznea <claudiu.beznea@tuxon.dev> # on SAMA7G5 Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:53:07 +02:00
Vineeth Karumanchi	3650a8cc5b	net: macb: Enable queue disable Enable queue disable for Versal devices. Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:53:07 +02:00
Vineeth Karumanchi	759cc793eb	net: macb: queue tie-off or disable during WOL suspend When GEM is used as a wake device, it is not mandatory for the RX DMA to be active. The RX engine in IP only needs to receive and identify a wake packet through an interrupt. The wake packet is of no further significance; hence, it is not required to be copied into memory. By disabling RX DMA during suspend, we can avoid unnecessary DMA processing of any incoming traffic. During suspend, perform either of the below operations: - tie-off/dummy descriptor: Disable unused queues by connecting them to a looped descriptor chain without free slots. - queue disable: The newer IP version allows disabling individual queues. Co-developed-by: Harini Katakam <harini.katakam@amd.com> Signed-off-by: Harini Katakam <harini.katakam@amd.com> Signed-off-by: Vineeth Karumanchi <vineeth.karumanchi@amd.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev> Tested-by: Claudiu Beznea <claudiu.beznea@tuxon.dev> # on SAMA7G5 Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:53:07 +02:00
Paolo Abeni	7e7c714a36	Merge branch 'af_unix-remove-spin_lock_nested-and-convert-to-lock_cmp_fn' Kuniyuki Iwashima says: ==================== af_unix: Remove spin_lock_nested() and convert to lock_cmp_fn. This series removes spin_lock_nested() in AF_UNIX and instead defines the locking orders as functions tied to each lock by lockdep_set_lock_cmp_fn(). When the defined function returns a negative value, lockdep considers it will not cause deadlock. (See ->cmp_fn() in check_deadlock() and check_prev_add().) When we cannot define the total ordering, we return -1 for the allowed ordering and otherwise 0 as undefined. [0] [0]: https://lore.kernel.org/netdev/thzkgbuwuo3knevpipu4rzsh5qgmwhklihypdgziiruabvh46f@uwdkpcfxgloo/ Changes: v4: * Patch 4 * Make unix_state_lock_cmp_fn() symmetric. v3: https://lore.kernel.org/netdev/20240614200715.93150-1-kuniyu@amazon.com/ * Patch 3 * Cache sk->sk_state * s/unix_state_lock()/unix_state_unlock()/ * Patch 8 * Add embryo -> listener locking order v2: https://lore.kernel.org/netdev/20240611222905.34695-1-kuniyu@amazon.com/ * Patch 1 & 2 * Use (((l) > (r)) - ((l) < (r))) for comparison v1: https://lore.kernel.org/netdev/20240610223501.73191-1-kuniyu@amazon.com/ ==================== Link: https://lore.kernel.org/r/20240620205623.60139-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:21 +02:00
Kuniyuki Iwashima	22e5751b05	af_unix: Don't use spin_lock_nested() in copy_peercred(). When (AF_UNIX, SOCK_STREAM) socket connect()s to a listening socket, the listener's sk_peer_pid/sk_peer_cred are copied to the client in copy_peercred(). Then, two sk_peer_locks are held there; one is client's and another is listener's. However, the latter is not needed because we hold the listener's unix_state_lock() there and unix_listen() cannot update the cred concurrently. Let's drop the unnecessary spin_lock() and use the bare spin_lock() for the client to protect concurrent read by getsockopt(SO_PEERCRED). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:19 +02:00
Kuniyuki Iwashima	e4bd881d98	af_unix: Remove put_pid()/put_cred() in copy_peercred(). When (AF_UNIX, SOCK_STREAM) socket connect()s to a listening socket, the listener's sk_peer_pid/sk_peer_cred are copied to the client in copy_peercred(). Then, the client's sk_peer_pid and sk_peer_cred are always NULL, so we need not call put_pid() and put_cred() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima	faf489e689	af_unix: Set sk_peer_pid/sk_peer_cred locklessly for new socket. init_peercred() is called in 3 places: 1. socketpair() : both sockets 2. connect() : child socket 3. listen() : listening socket The first two need not hold sk_peer_lock because no one can touch the socket. Let's set cred/pid without holding lock for the two cases and rename the old init_peercred() to update_peercred() to properly reflect the use case. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima	8647ece481	af_unix: Define locking order for U_RECVQ_LOCK_EMBRYO in unix_collect_skb(). While GC is cleaning up cyclic references by SCM_RIGHTS, unix_collect_skb() collects skb in the socket's recvq. If the socket is TCP_LISTEN, we need to collect skb in the embryo's queue. Then, both the listener's recvq lock and the embryo's one are held. The locking is always done in the listener -> embryo order. Let's define it as unix_recvq_lock_cmp_fn() instead of using spin_lock_nested(). Note that the reverse order is defined for consistency. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima	7202cb5916	af_unix: Remove U_LOCK_GC_LISTENER. Commit 1971d13ffa84 ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().") added U_LOCK_GC_LISTENER for the old GC, but it's no longer needed for the new GC. Let's remove U_LOCK_GC_LISTENER and unix_state_lock_nested() as there's no user. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima	c4da4661d9	af_unix: Remove U_LOCK_DIAG. sk_diag_dump_icons() acquires embryo's lock by unix_state_lock_nested() to fetch its peer. The embryo's ->peer is set to NULL only when its parent listener is close()d. Then, unix_release_sock() is called for each embryo after unlinking skb by skb_dequeue(). In sk_diag_dump_icons(), we hold the parent's recvq lock, so we need not acquire unix_state_lock_nested(), and peer is always non-NULL. Let's remove unnecessary unix_state_lock_nested() and non-NULL test for peer. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima	b380b18102	af_unix: Don't acquire unix_state_lock() for sock_i_ino(). sk_diag_dump_peer() and sk_diag_dump() call unix_state_lock() for sock_i_ino() which reads SOCK_INODE(sk->sk_socket)->i_ino, but it's protected by sk->sk_callback_lock. Let's remove unnecessary unix_state_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima	98f706de44	af_unix: Define locking order for U_LOCK_SECOND in unix_stream_connect(). While a SOCK_(STREAM|SEQPACKET) socket connect()s to another, we hold two locks of them by unix_state_lock() and unix_state_lock_nested() in unix_stream_connect(). Before unix_state_lock_nested(), the following is guaranteed by checking sk->sk_state: 1. The first socket is TCP_LISTEN 2. The second socket is not the first one 3. Simultaneous connect() must fail So, the client state can be TCP_CLOSE or TCP_LISTEN or TCP_ESTABLISHED. Let's define the expected states as unix_state_lock_cmp_fn() instead of using unix_state_lock_nested(). Note that 2. is detected by debug_spin_lock_before() and 3. cannot be expressed as lock_cmp_fn. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima	1ca27e0c8c	af_unix: Don't retry after unix_state_lock_nested() in unix_stream_connect(). When a SOCK_(STREAM|SEQPACKET) socket connect()s to another one, we need to lock the two sockets to check their states in unix_stream_connect(). We use unix_state_lock() for the server and unix_state_lock_nested() for the client with a tricky sk->sk_state check to avoid deadlock. The possible deadlock scenarios are the following: 1) Self connect() 2) Simultaneous connect() The former is simple, attempt to grab the same lock, and the latter is AB-BA deadlock. After the server's unix_state_lock(), we check the server socket's state, and if it's not TCP_LISTEN, connect() fails with -EINVAL. Then, we avoid the former deadlock by checking the client's state before unix_state_lock_nested(). If its state is not TCP_LISTEN, we can make sure that the client and the server are not identical based on the state. Also, the latter deadlock can be avoided in the same way. Due to the server sk->sk_state requirement, AB-BA deadlock could happen only with TCP_LISTEN sockets. So, if the client's state is TCP_LISTEN, we can give up the second lock to avoid the deadlock.

CPU 1                   CPU 2                   CPU 3
connect(A -> B)         connect(B -> A)         listen(A)
---                     ---                     ---
unix_state_lock(B)
B->sk_state == TCP_LISTEN
READ_ONCE(A->sk_state) == TCP_CLOSE
                          ^^^^^^^^^
                          ok, will lock A       unix_state_lock(A)
           .--------------'                     WRITE_ONCE(A->sk_state, TCP_LISTEN)
           |                                    unix_state_unlock(A)
           |
           |            unix_state_lock(A)
           |            A->sk_state == TCP_LISTEN
           |            READ_ONCE(B->sk_state) == TCP_LISTEN
           v                                      ^^^^^^^^^^
unix_state_lock_nested(A)                         Don't lock B !!

Currently, while checking the client's state, we also check if it's TCP_ESTABLISHED, but this is unlikely and can be checked after we know the state is not TCP_CLOSE. Moreover, if it happens after the second lock, we now jump to the restart label, but it's unlikely that the server is not found during the retry, so the jump is mostly to revisit the client state check.
Let's remove the retry logic and check the state against TCP_CLOSE first. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima	ed99822817	af_unix: Define locking order for U_LOCK_SECOND in unix_state_double_lock(). unix_dgram_connect() and unix_dgram_{send,recv}msg() lock the socket and peer in ascending order of the socket address. Let's define the order as unix_state_lock_cmp_fn() instead of using unix_state_lock_nested(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:18 +02:00
Kuniyuki Iwashima	3955802f16	af_unix: Define locking order for unix_table_double_lock(). When created, AF_UNIX socket is put into net->unx.table.buckets[], and the hash is stored in sk->sk_hash. * unbound socket : 0 <= sk_hash <= UNIX_HASH_MOD When bind() is called, the socket could be moved to another bucket. * pathname socket : 0 <= sk_hash <= UNIX_HASH_MOD * abstract socket : UNIX_HASH_MOD + 1 <= sk_hash <= UNIX_HASH_MOD * 2 + 1 Then, we call unix_table_double_lock() which locks a single bucket or two. Let's define the order as unix_table_lock_cmp_fn() instead of using spin_lock_nested(). The locking is always done in ascending order of sk->sk_hash, which is the index of buckets/locks array allocated by kvmalloc_array(). sk_hash_A < sk_hash_B <=> &locks[sk_hash_A].dep_map < &locks[sk_hash_B].dep_map So, the relation of two sk->sk_hash can be derived from the addresses of dep_map in the array of locks. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2024-06-25 11:10:18 +02:00
Jakub Kicinski	bf2468f9af	Merge branch 'locking-introduce-nested-bh-locking' Sebastian Andrzej Siewior says: ==================== locking: Introduce nested-BH locking. Disabling bottom halves acts as a per-CPU BKL. On PREEMPT_RT, code within a local_bh_disable() section remains preemptible. As a result, high-priority tasks (or threaded interrupts) will be blocked by long-running lower-priority tasks (or threaded interrupts), which includes softirq sections. The proposed way out is to introduce explicit per-CPU locks for resources which are protected by local_bh_disable() and use those only on PREEMPT_RT so there is no additional overhead for !PREEMPT_RT builds. The series introduces the infrastructure and converts large parts of networking, which is the largest stakeholder here. Once this is done, the per-CPU lock from local_bh_disable() on PREEMPT_RT can be lifted. Performance testing. Baseline is net-next as of commit 93bda33046e7a ("Merge branch 'net-constify-ctl_table-arguments-of-utility-functions'") plus v6.10-rc1. A 10GiG link is used between two hosts. The command xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e was invoked on the receiving side with an ixgbe. The sending side uses pktgen_sample03_burst_single_flow.sh on i40e.

Baseline:
| eth1->?                 9,018,604 rx/s                  0 err,drop/s
| receive total           9,018,604 pkt/s                 0 drop/s                0 error/s
|   cpu:7                 9,018,604 pkt/s                 0 drop/s                0 error/s
| enqueue to cpu 3        9,018,602 pkt/s                 0 drop/s             7.00 bulk-avg
|   cpu:7->3              9,018,602 pkt/s                 0 drop/s             7.00 bulk-avg
| kthread total           9,018,606 pkt/s                 0 drop/s          214,698 sched
|   cpu:3                 9,018,606 pkt/s                 0 drop/s          214,698 sched
| xdp_stats                       0 pass/s        9,018,606 drop/s                0 redir/s
|   cpu:3                         0 pass/s        9,018,606 drop/s                0 redir/s
| redirect_err                    0 error/s
| xdp_exception                   0 hit/s

perf top --sort cpu,symbol --no-children:
| 18.14% 007 [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
| 13.29% 007 [k] ixgbe_poll
| 12.66% 003 [k] cpu_map_kthread_run
|  7.23% 003 [k] page_frag_free
|  6.76% 007 [k] xdp_do_redirect
|  3.76% 007 [k] cpu_map_redirect
|  3.13% 007 [k] bq_flush_to_queue
|  2.51% 003 [k] xdp_return_frame
|  1.93% 007 [k] try_to_wake_up
|  1.78% 007 [k] _raw_spin_lock
|  1.74% 007 [k] cpu_map_enqueue
|  1.56% 003 [k] bpf_prog_57cd311f2e27366b_cpumap_drop

With this series applied:
| eth1->?                10,329,340 rx/s                  0 err,drop/s
| receive total          10,329,340 pkt/s                 0 drop/s                0 error/s
|   cpu:6                10,329,340 pkt/s                 0 drop/s                0 error/s
| enqueue to cpu 3       10,329,338 pkt/s                 0 drop/s             8.00 bulk-avg
|   cpu:6->3             10,329,338 pkt/s                 0 drop/s             8.00 bulk-avg
| kthread total          10,329,321 pkt/s                 0 drop/s           96,297 sched
|   cpu:3                10,329,321 pkt/s                 0 drop/s           96,297 sched
| xdp_stats                       0 pass/s       10,329,321 drop/s                0 redir/s
|   cpu:3                         0 pass/s       10,329,321 drop/s                0 redir/s
| redirect_err                    0 error/s
| xdp_exception                   0 hit/s

perf top --sort cpu,symbol --no-children:
| 20.90% 006 [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
| 12.62% 006 [k] ixgbe_poll
|  9.82% 003 [k] page_frag_free
|  8.73% 003 [k] cpu_map_bpf_prog_run_xdp
|  6.63% 006 [k] xdp_do_redirect
|  4.94% 003 [k] cpu_map_kthread_run
|  4.28% 006 [k] cpu_map_redirect
|  4.03% 006 [k] bq_flush_to_queue
|  3.01% 003 [k] xdp_return_frame
|  1.95% 006 [k] _raw_spin_lock
|  1.94% 003 [k] bpf_prog_57cd311f2e27366b_cpumap_drop

This diff appears to be noise.
v8: https://lore.kernel.org/all/20240619072253.504963-1-bigeasy@linutronix.de v7: https://lore.kernel.org/all/20240618072526.379909-1-bigeasy@linutronix.de v6: https://lore.kernel.org/all/20240612170303.3896084-1-bigeasy@linutronix.de v5: https://lore.kernel.org/all/20240607070427.1379327-1-bigeasy@linutronix.de v4: https://lore.kernel.org/all/20240604154425.878636-1-bigeasy@linutronix.de v3: https://lore.kernel.org/all/20240529162927.403425-1-bigeasy@linutronix.de v2: https://lore.kernel.org/all/20240503182957.1042122-1-bigeasy@linutronix.de v1: https://lore.kernel.org/all/20231215171020.687342-1-bigeasy@linutronix.de ==================== Link: https://patch.msgid.link/20240620132727.660738-1-bigeasy@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-24 16:41:26 -07:00
Sebastian Andrzej Siewior	3f9fe37d9e	net: Move per-CPU flush-lists to bpf_net_context on PREEMPT_RT. The per-CPU flush lists, which are accessed from within the NAPI callback (xdp_do_flush() for instance), are per-CPU. They are subject to the same problem as struct bpf_redirect_info. Add the per-CPU lists cpu_map_flush_list, dev_map_flush_list and xskmap_map_flush_list to struct bpf_net_context. Add wrappers for the access. The lists are initialized on first usage (similar to bpf_net_ctx_get_ri()). Cc: "Björn Töpel" <bjorn@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Yonghong Song <yonghong.song@linux.dev> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20240620132727.660738-16-bigeasy@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-24 16:41:24 -07:00
Sebastian Andrzej Siewior	401cb7dae8	net: Reference bpf_redirect_info via task_struct on PREEMPT_RT. The XDP redirect process is two staged: - bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the packet and makes decisions. While doing that, the per-CPU variable bpf_redirect_info is used. - Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info and it may also access other per-CPU variables like xskmap_flush_list. At the very end of the NAPI callback, xdp_do_flush() is invoked which does not access bpf_redirect_info but will touch the individual per-CPU lists. The per-CPU variables are only used in the NAPI callback hence disabling bottom halves is the only protection mechanism. Users from preemptible context (like cpu_map_kthread_run()) explicitly disable bottom halves for protection reasons. Without locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. PREEMPT_RT has forced-threaded interrupts enabled and every NAPI-callback runs in a thread. If each thread has its own data structure then locking can be avoided. Create a struct bpf_net_context which contains struct bpf_redirect_info. Define the variable on stack, use bpf_net_ctx_set() to save a pointer to it, bpf_net_ctx_clear() removes it again. The bpf_net_ctx_set() may nest. For instance a function can be used from within NET_RX_SOFTIRQ/ net_rx_action which uses bpf_net_ctx_set() and NET_TX_SOFTIRQ which does not. Therefore only the first invocation updates the pointer. Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct bpf_redirect_info. The returned data structure is zero initialized to ensure nothing is leaked from stack. This is done on first usage of the struct. bpf_net_ctx_set() sets bpf_redirect_info::kern_flags to 0 to note that initialisation is required. First invocation of bpf_net_ctx_get_ri() will memset() the data structure and update bpf_redirect_info::kern_flags. 
bpf_redirect_info::nh is excluded from memset because it is only used once BPF_F_NEIGH is set which also sets the nh member. The kern_flags is moved past nh to exclude it from memset. The pointer to bpf_net_context is saved in the task's task_struct. Always using the bpf_net_context approach has the advantage that there are almost zero differences between PREEMPT_RT and non-PREEMPT_RT builds. Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Yonghong Song <yonghong.song@linux.dev> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20240620132727.660738-15-bigeasy@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-24 16:41:24 -07:00
Sebastian Andrzej Siewior	78f520b7bb	net: Use nested-BH locking for bpf_scratchpad. bpf_scratchpad is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Add a local_lock_t to the data structure and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT. Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20240620132727.660738-14-bigeasy@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior	d1542d4ae4	seg6: Use nested-BH locking for seg6_bpf_srh_states. The access to seg6_bpf_srh_states is protected by disabling preemption. Based on the code, the entry point is input_action_end_bpf() and every other function (the bpf helper functions bpf_lwt_seg6_*()), that is accessing seg6_bpf_srh_states, should be called from within input_action_end_bpf(). input_action_end_bpf() accesses seg6_bpf_srh_states first at the top of the function and then disables preemption. This looks wrong because if preemption needs to be disabled as part of the locking mechanism then the variable shouldn't be accessed beforehand. Looking at how it is used via test_lwt_seg6local.sh, input_action_end_bpf() is always invoked from softirq context. If this is always the case then the preempt_disable() statement is superfluous. If this is not always invoked from softirq then disabling only preemption is not sufficient. Replace the preempt_disable() statement with nested-BH locking. This is not an equivalent replacement as it assumes that the invocation of input_action_end_bpf() always occurs in softirq context and thus the preempt_disable() is superfluous. Add a local_lock_t to the data structure and use local_lock_nested_bh() for locking. Add lockdep_assert_held() to ensure the lock is held while the per-CPU variable is referenced in the helper functions. 
Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: David Ahern <dsahern@kernel.org> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20240620132727.660738-13-bigeasy@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-24 16:41:23 -07:00
Sebastian Andrzej Siewior	3414adbd6a	lwt: Don't disable migration prio invoking BPF. There is no need to explicitly disable migration if bottom halves are also disabled. Disabling BH implies disabling migration. Remove migrate_disable() and rely solely on disabling BH to remain on the same CPU. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20240620132727.660738-12-bigeasy@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-06-24 16:41:23 -07:00


1281266 Commits