The out-of-tree driver is hosted on SourceForge; as this does not apply
to the kernel driver, remove references to it. Also make some minor
formatting changes around this section.
Suggested-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Update the email address for support to use Intel Wired LAN, the mailing
list used for kernel development.
Suggested-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Eric Dumazet says:
====================
net: rps/rfs improvements
Jason Xing attempted to optimize napi_schedule_rps() by avoiding
unneeded NET_RX_SOFTIRQ raises: [1], [2]
Implementing this properly is quite complex. I chose to implement
the idea and added a similar optimization in ____napi_schedule().
Overall, in an intensive RPC workload with 32 TX/RX queues and RFS,
I was able to observe a ~10% reduction in NET_RX_SOFTIRQ
invocations.
While this had no impact on throughput or cpu costs in this synthetic
benchmark, we know that raising NET_RX_SOFTIRQ from a softirq handler
can force __do_softirq() to wake up ksoftirqd when need_resched() is
true. This can have a latency impact on stressed hosts.
[1] https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
[2] https://lore.kernel.org/netdev/20230328142112.12493-1-kerneljasonxing@gmail.com/
====================
Link: https://lore.kernel.org/r/20230328235021.1048163-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
____napi_schedule() adds a napi into current cpu softnet_data poll_list,
then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it.
The idea of this patch is to not raise NET_RX_SOFTIRQ when called
indirectly from net_rx_action(), because we can process the poll_list
from that point without going through the full softirq loop.
This needs a change in net_rx_action() to make sure we restart
its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
being raised.
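A simplified sketch of the resulting helper (error paths and unrelated
details elided; the in_net_rx_action flag comes from the preparation
patch described further below):

    static inline void ____napi_schedule(struct softnet_data *sd,
                                         struct napi_struct *napi)
    {
            list_add_tail(&napi->poll_list, &sd->poll_list);
            /* If not called from net_rx_action() we have to raise
             * NET_RX_SOFTIRQ; otherwise net_rx_action() is already
             * running on this cpu and will pick up the new entry.
             */
            if (!sd->in_net_rx_action)
                    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
    }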
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Based on an initial patch from Jason Xing.
The idea is to not raise NET_RX_SOFTIRQ from napi_schedule_rps()
when we have queued a packet onto another cpu's backlog.
We can do this only when called indirectly from net_rx_action(),
which guarantees that our rps_ipi_list will be processed before
we exit from net_rx_action().
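A hedged sketch of the result (CONFIG_RPS guards omitted):

    static void napi_schedule_rps(struct softnet_data *sd)
    {
            struct softnet_data *mysd = this_cpu_ptr(&softnet_data);

            if (sd != mysd) {
                    /* chain the remote cpu's sd into our rps_ipi_list */
                    sd->rps_ipi_next = mysd->rps_ipi_list;
                    mysd->rps_ipi_list = sd;

                    /* If not called from net_rx_action() we have to
                     * raise NET_RX_SOFTIRQ so the ipi list is flushed.
                     */
                    if (!mysd->in_net_rx_action)
                            __raise_softirq_irqoff(NET_RX_SOFTIRQ);
                    return;
            }
            __napi_schedule_irqoff(&mysd->backlog);
    }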
Link: https://lore.kernel.org/lkml/20230325152417.5403-1-kerneljasonxing@gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We want to make two optimizations in napi_schedule_rps() and
____napi_schedule() which require knowing whether these helpers are
called from net_rx_action() or from other contexts.
sd.in_net_rx_action is only read/written by the owning cpu.
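The flag maintenance amounts to bracketing the poll loop (a sketch,
with the budgeted loop elided):

    static __latent_entropy void net_rx_action(struct softirq_action *h)
    {
            struct softnet_data *sd = this_cpu_ptr(&softnet_data);

            sd->in_net_rx_action = true;
            /* ... budgeted poll loop over sd->poll_list ... */
            sd->in_net_rx_action = false;
    }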
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
napi_schedule_rps() return value is ignored, remove it.
Change the comment to clarify the intent.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Merge tag 'mlx5-updates-2023-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-03-28
Dragos Tatulea says:
====================
net/mlx5e: RX, Drop page_cache and fully use page_pool
For page allocation on the rx path, the mlx5e driver has been using an
internal page cache in tandem with the page pool. The internal page
cache uses a queue for page recycling which has the issue of head of
queue blocking.
This patch series drops the internal page_cache altogether and uses the
page_pool to implement everything that was done by the page_cache
before:
* Let the page_pool handle dma mapping and unmapping.
* Use fragmented pages with fragment counter instead of tracking via
page ref.
* Enable skb recycling.
The patch series has the following effects on the rx path:
* Improved performance for cases where page recycling was low
due to head of queue blocking in the internal page_cache. The test
for this was running a single iperf TCP stream to an rx queue
bound to the same cpu as the application.
|-------------+--------+--------+------+---------|
| rq type | before | after | unit | diff |
|-------------+--------+--------+------+---------|
| striding rq | 30.1 | 31.4 | Gbps | 4.14 % |
| legacy rq | 30.2 | 33.0 | Gbps | 8.48 % |
|-------------+--------+--------+------+---------|
* Small XDP performance degradation. The test was an XDP drop
program running on a single rx queue with small incoming packets;
it looks like this:
|-------------+----------+----------+------+---------|
| rq type | before | after | unit | diff |
|-------------+----------+----------+------+---------|
| striding rq | 19725449 | 18544617 | pps | -6.37 % |
| legacy rq | 19879931 | 18631841 | pps | -6.70 % |
|-------------+----------+----------+------+---------|
This will be handled in a different patch series by adding support for
multi-packet per page.
* For other cases the performance is roughly the same.
The above numbers were obtained on the following system:
24 core Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
32 GB RAM
ConnectX-7 single port
The breakdown of the patch series is as follows:
* Preparations for introducing the mlx5e_frag_page struct.
* Delete the mlx5e_page_cache struct.
* Enable dma mapping from page_pool.
* Enable skb recycling and fragment counting.
* Do deferred release of pages (just before alloc) to ensure better
page_pool cache utilization.
====================
* tag 'mlx5-updates-2023-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: RX, Remove unnecessary recycle parameter and page_cache stats
net/mlx5e: RX, Break the wqe bulk refill in smaller chunks
net/mlx5e: RX, Increase WQE bulk size for legacy rq
net/mlx5e: RX, Split off release path for xsk buffers for legacy rq
net/mlx5e: RX, Defer page release in legacy rq for better recycling
net/mlx5e: RX, Change wqe last_in_page field from bool to bit flags
net/mlx5e: RX, Defer page release in striding rq for better recycling
net/mlx5e: RX, Rename xdp_xmit_bitmap to a more generic name
net/mlx5e: RX, Enable skb page recycling through the page_pool
net/mlx5e: RX, Enable dma map and sync from page_pool allocator
net/mlx5e: RX, Remove internal page_cache
net/mlx5e: RX, Store SHAMPO header pages in array
net/mlx5e: RX, Remove alloc unit layout constraint for striding rq
net/mlx5e: RX, Remove alloc unit layout constraint for legacy rq
net/mlx5e: RX, Remove mlx5e_alloc_unit argument in page allocation
====================
Link: https://lore.kernel.org/r/20230328205623.142075-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
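As a rough, hedged illustration of the conversion described in the
series above, using the generic page_pool API (pool_size, node, dev and
the skb path are placeholders; the mlx5e specifics differ):

    struct page_pool_params pp_params = {
            .order     = 0,
            .flags     = PP_FLAG_DMA_MAP,   /* pool handles dma mapping */
            .pool_size = pool_size,
            .nid       = node,
            .dev       = dev,
            .dma_dir   = DMA_FROM_DEVICE,
    };
    struct page_pool *pool = page_pool_create(&pp_params);

    /* on rx completion, mark built skbs so their page fragments are
     * returned to the pool instead of the page allocator:
     */
    skb_mark_for_recycle(skb);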
clang 16.0.0 with W=1 reports:
drivers/net/ethernet/amazon/ena/ena_netdev.c:1901:6: error: variable 'tx_bytes' set but not used [-Werror,-Wunused-but-set-variable]
    u32 tx_bytes = 0;
The variable is not used so remove it.
Signed-off-by: Simon Horman <horms@kernel.org>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Link: https://lore.kernel.org/r/20230328151958.410687-1-horms@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'h' and 'f' letters are swapped, so it unlocks the wrong lock.
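A hedged illustration of the fix, assuming the driver's h2fq
(host-to-fw queue) and f2hq (fw-to-host queue) lock naming:

    -       mutex_unlock(&mbox->h2fq_lock);
    +       mutex_unlock(&mbox->f2hq_lock);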
Fixes: 577f0d1b1c ("octeon_ep: add separate mailbox command and response queues")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/251aa2a2-913e-4868-aac9-0a90fc3eeeda@kili.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a DFL (Device Feature List) device driver for the ToD (Time of
Day) device on Intel FPGA cards.
The Intel FPGA Time of Day (ToD) IP within the FPGA DFL bus is exposed
as a PTP Hardware Clock (PHC) device to the Linux PTP stack, so that
the system clock can be synchronized to its ToD information using the
phc2sys utility of the Linux PTP stack. The DFL is a hardware list
within the FPGA which defines a linked list of feature headers within
the device MMIO space to provide an extensible way of adding subdevice
features.
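A minimal sketch of how such a driver exposes a PHC to the PTP stack
(all names below are illustrative, not the actual driver's; error
handling trimmed):

    static int dfl_tod_register_phc(struct dfl_tod *tod, struct device *dev)
    {
            tod->clock_info = (struct ptp_clock_info) {
                    .owner     = THIS_MODULE,
                    .name      = "dfl_tod",
                    .max_adj   = 500000000,        /* assumed limit */
                    .gettime64 = dfl_tod_gettime,  /* read ToD registers */
                    .settime64 = dfl_tod_settime,
                    .adjtime   = dfl_tod_adjtime,
                    .adjfine   = dfl_tod_adjfine,
            };

            tod->clock = ptp_clock_register(&tod->clock_info, dev);
            return PTR_ERR_OR_ZERO(tod->clock);
    }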
Signed-off-by: Raghavendra Khadatare <raghavendrax.anand.khadatare@intel.com>
Signed-off-by: Tianfei Zhang <tianfei.zhang@intel.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20230328142455.481146-1-tianfei.zhang@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The HNS3 driver supports Wake-on-LAN, which can wake the server from
the power-off state to the power-on state via a magic packet or magic
secure packet.
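A hedged sketch of the ethtool get_wol side (the priv field name is an
assumption; see the v3->v4 note below for the netdev-to-private helper):

    static void hns3_get_wol(struct net_device *netdev,
                             struct ethtool_wolinfo *wol)
    {
            struct hns3_nic_priv *priv = netdev_priv(netdev);

            wol->supported = WAKE_MAGIC | WAKE_MAGICSECURE;
            /* return the wol configuration stored in the driver */
            wol->wolopts = priv->wol_opts;  /* assumed field */
    }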
ChangeLog:
v1->v2:
Deleted the debugfs function that overlaps with the ethtool function
from suggestion of Andrew Lunn.
v2->v3:
Return the wol configuration stored in driver,
suggested by Alexander H Duyck.
v3->v4:
Add a helper to go from netdev to the local struct,
suggested by Simon Horman and Jakub Kicinski.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Hao Lan <lanhao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree says:
====================
sfc: support TC decap rules
This series adds support for offloading tunnel decapsulation TC rules to
ef100 NICs, allowing matching encapsulated packets to be decapsulated in
hardware and redirected to VFs.
For now an encap match must be on precisely the following fields:
ethertype (IPv4 or IPv6), source IP, destination IP, ipproto UDP,
UDP destination port. This simplifies checking for overlaps in the
driver; the hardware supports a wider range of match fields which
future driver work may expose.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A 'foreign' rule is one for which the net_dev is not the sfc netdevice
or any of its representors. The driver registers indirect flow blocks
for tunnel netdevs so that it can offload decap rules. For example:
    tc filter add dev vxlan0 parent ffff: protocol ipv4 flower \
        enc_src_ip 10.1.0.2 enc_dst_ip 10.1.0.1 \
        enc_key_id 1000 enc_dst_port 4789 \
        action tunnel_key unset \
        action mirred egress redirect dev $REPRESENTOR
When notified of a rule like this, register an encap match on the IP
and dport tuple (creating an Outer Rule table entry) and insert an MAE
action rule to perform the decapsulation and deliver to the representee.
Moved efx_tc_delete_rule() below efx_tc_flower_release_encap_match() to
avoid the need for a forward declaration.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a hashtable to detect duplicate and conflicting matches. If a match
is not a duplicate, call the MAE functions to add/remove it from the
Outer Rule table.
Calling code is not added yet, so mark the new functions as unused.
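A hedged sketch of the duplicate check built on the kernel's rhashtable
API (the efx_tc_* names here are illustrative):

    struct efx_tc_encap_match *old;
    int rc = 0;

    old = rhashtable_lookup_get_insert_fast(&efx->tc->encap_match_ht,
                                            &encap->linkage,
                                            efx_tc_encap_match_ht_params);
    if (IS_ERR(old))
            return PTR_ERR(old);
    if (old) {
            /* an identical match already exists: share its OR entry */
            refcount_inc(&old->ref);
            kfree(encap);
    } else {
            /* new match: program an Outer Rule table entry via the MAE */
            rc = efx_mae_register_encap_match(efx, encap);
    }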
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An encap match corresponds to an entry in the exact-match Outer Rule
table; the lookup response includes the encap type (protocol) allowing
the hardware to continue parsing into the inner headers.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Translate the fields from flow dissector into struct efx_tc_match.
In efx_tc_flower_replace(), reject filters that match on them, because
only 'foreign' filters (i.e. those for which the ingress dev is not
the sfc netdev or any of its representors, e.g. a tunnel netdev) can
use them.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the MAE caps check to validate that the hardware supports these
outer-header matches where used by the driver.
Extend efx_mae_populate_match_criteria() to fill in the outer rule ID
and VNI match fields.
Nothing yet populates these match fields, nor creates outer rules.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Includes an explanation of the lifetime of the 'cursor' action-set `act`.
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu says:
====================
macvlan: Allow some packets to bypass broadcast queue
This patch series allows some packets to bypass the broadcast
queue on receive. Currently all multicast packets are queued
on receive and then processed in a work queue. This is to avoid
an unbounded amount of work occurring in the receive path, as
one broadcast packet could easily translate into 4,000 packets.
However, for multicast packets with just one receiver (possible
for IPv6 ND), this introduces unnecessary latency as the packet
will go to exactly one device.
This series allows such multicast packets to be processed inline.
It also adds a toggle which lets the admin control what threshold
to set between queueing and not queueing. A follow-up patch for
iproute will be posted.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the broadcast cutoff configurable through netlink. Note
that macvlan is weird because there is no central device for
us to configure (the lowerdev could be anything). So all the
options are duplicated over what could be thousands of child
devices.
IFLA_MACVLAN_BC_QUEUE_LEN took the approach of taking the maximum
of all child device settings. This is unnecessary, as we could
simply store the option in the port device and use the value from
the last child device that gets updated.
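A hedged sketch of the changelink side (IFLA_MACVLAN_BC_CUTOFF is the
attribute this patch adds; the port field name is assumed):

    if (data && data[IFLA_MACVLAN_BC_CUTOFF])
            /* stored once in the port; the last updated child wins */
            port->bc_cutoff =
                    nla_get_s32(data[IFLA_MACVLAN_BC_CUTOFF]);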
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As it stands, all broadcast and multicast packets are queued and
processed in a work queue. This is so that we don't overwhelm
the receive softirq path by generating thousands of packets or
more (see commit 412ca1550c "macvlan: Move broadcasts into a
work queue").
As such all multicast packets will be delayed, even if they will
be received by a single macvlan device. As using a workqueue
is not free in terms of latency, we should avoid this where possible.
This patch adds a new filter to determine which addresses should
be delayed and which should not. This is done using a crude
counter of how many times an address has been added to the macvlan
port (ha->synced). For now, if an address has been added more than
once, it will be considered broadcast. This could be tuned further
by making the threshold configurable.
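A rough sketch of the receive-side decision (macvlan_multicast_rx is an
illustrative name for the inline delivery path):

    if (ha->synced > 1) {
            /* multiple receivers: keep using the work queue */
            macvlan_broadcast_enqueue(port, src, skb);
    } else {
            /* single receiver: process inline to avoid the
             * workqueue latency
             */
            macvlan_multicast_rx(port, src, skb);
    }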
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Matthieu Baerts says:
====================
mptcp: a couple of cleanups and improvements
Patch 1 removes an unneeded address copy in subflow_syn_recv_sock().
Patch 2 simplifies subflow_syn_recv_sock() to postpone some actions and
to avoid a bunch of conditionals.
Patch 3 stops reporting limits that are not taken into account when the
userspace PM is used.
Patch 4 adds a new test to validate that the 'subflows' field reported
by the kernel is correct. Such info can be retrieved via Netlink (e.g.
with ss) or getsockopt(SOL_MPTCP, MPTCP_INFO).
---
Changes in v2:
- Patch 3/4's commit message has been updated to use the correct SHA
- Rebased on latest net-next
- Link to v1: https://lore.kernel.org/r/20230324-upstream-net-next-20230324-misc-features-v1-0-5a29154592bd@tessares.net
====================
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds tests for the mptcp_info fields in endpoint_tests(). Add
a new function chk_mptcp_info() to check that a given mptcp_info field
has the given expected value.
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/330
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only the in-kernel PM uses the limits on the number of addresses and
subflows allowed per connection.
It then makes more sense not to display such info when other PMs are
used, so as not to confuse userspace by showing limits that are not in
use.
While at it, we can get rid of the "val" variable and add indentation
instead.
It would have been good to have made this modification directly in
commit 4d25247d3a ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
but as it slightly changes the behaviour, it is fine not to backport it
to stable.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Postpone the msk cloning to the child process creation
so that we can avoid a bunch of conditionals.
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/61
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the syn_recv fallback path, the msk is unused. We can skip
setting the socket address.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima says:
====================
ipv6: Random cleanup for in6addr_any.
The first patch removes in6addr_any alternatives and the second
removes redundant initialisation of a local variable.
Changes:
v2: Use ipv6_addr_any() in patch 1. (David Ahern)
v1: https://lore.kernel.org/netdev/20230322012204.33157-1-kuniyu@amazon.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We'll call memset(&tmp, 0, sizeof(tmp)) later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some code defines the IPv6 wildcard address as a local variable and
uses it with memcmp() or ipv6_addr_equal().
Let's use in6addr_any and ipv6_addr_any() instead.
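A representative before/after (not a specific call site):

    /* before: open-coded wildcard address */
    static bool addr_is_any_before(const struct in6_addr *a)
    {
            struct in6_addr zero = IN6ADDR_ANY_INIT;

            return !memcmp(a, &zero, sizeof(zero));
    }

    /* after: shared constant / helper */
    static bool addr_is_any_after(const struct in6_addr *a)
    {
            return ipv6_addr_any(a);
            /* or: ipv6_addr_equal(a, &in6addr_any) */
    }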
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bobby Eshleman says:
====================
Add support for sockmap to vsock.
We're testing usage of vsock as a way to redirect guest-local UDS
requests to the host and this patch series greatly improves the
performance of such a setup.
Compared to copying packets via userspace, this improves throughput by
121% in basic testing.
Tested as follows.
Setup: guest unix dgram sender -> guest vsock redirector -> host vsock
server
Threads: 1
Payload: 64k
No sockmap:
- 76.3 MB/s
- The guest vsock redirector was
"socat VSOCK-CONNECT:2:1234 UNIX-RECV:/path/to/sock"
Using sockmap (this patch):
- 168.8 MB/s (+121%)
- The guest redirector was a simple sockmap echo server,
redirecting unix ingress to vsock 2:1234 egress.
- Same sender and server programs
*Note: these numbers are from RFC v1
Only the virtio transport has been tested. The loopback transport was
used in writing bpf/selftests, but not thoroughly tested otherwise.
This series requires the skb patch.
Changes in v4:
- af_vsock: fix parameter alignment in vsock_dgram_recvmsg()
- af_vsock: add TCP_ESTABLISHED comment in vsock_dgram_connect()
- vsock/bpf: change ret type to bool
Changes in v3:
- vsock/bpf: Refactor wait logic in vsock_bpf_recvmsg() to avoid
backwards goto
- vsock/bpf: Check psock before acquiring slock
- vsock/bpf: Return bool instead of int of 0 or 1
- vsock/bpf: Wrap macro args __sk/__psock in parens
- vsock/bpf: Place comment trailer */ on separate line
Changes in v2:
- vsock/bpf: rename vsock_dgram_* -> vsock_*
- vsock/bpf: change sk_psock_{get,put} and {lock,release}_sock() order
to minimize slock hold time
- vsock/bpf: use "new style" wait
- vsock/bpf: fix bug in wait log
- vsock/bpf: add check that recvmsg sk_type is one of dgram, seqpacket,
or stream. Return an error if not one of the three.
- virtio/vsock: comment __skb_recv_datagram() usage
- virtio/vsock: do not init copied in read_skb()
- vsock/bpf: add ifdef guard around struct proto in dgram_recvmsg()
- selftests/bpf: add vsock loopback config for aarch64
- selftests/bpf: add vsock loopback config for s390x
- selftests/bpf: remove vsock device from vmtest.sh qemu machine
- selftests/bpf: remove CONFIG_VIRTIO_VSOCKETS=y from config.x86_64
- vsock/bpf: move transport-related (e.g., if (!vsk->transport)) checks
out of fast path
====================
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a test case testing the redirection from connectible AF_VSOCK
sockets to connectible AF_UNIX sockets.
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add vsock loopback to the test kernel.
This allows sockmap for vsock to be tested.
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds sockmap support for vsock sockets. It is intended to be
usable by all transports, but only the virtio and loopback transports
are implemented.
SOCK_STREAM, SOCK_DGRAM, and SOCK_SEQPACKET are all supported.
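The integration follows the usual sockmap pattern of swapping in a
psock-aware struct proto (a hedged sketch; the actual vsock_bpf.c
details may differ):

    static struct proto vsock_bpf_prot;

    static void vsock_bpf_rebuild_protos(struct proto *prot,
                                         const struct proto *base)
    {
            *prot                  = *base;
            prot->close            = sock_map_close;     /* sockmap core */
            prot->recvmsg          = vsock_bpf_recvmsg;  /* psock-aware */
            prot->sock_is_readable = sk_msg_is_readable;
    }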
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Donald Hunter says:
====================
ynl: add support for user headers and struct attrs
Add support for user headers and struct attrs to YNL. This patchset adds
features to ynl and adds a partial spec for openvswitch that demonstrates
use of the features.
Patches 1-4 add features to ynl.
Patch 5 adds partial openvswitch specs that demonstrate the new features.
Patches 6-7 add documentation for legacy structs and for sub-type.
====================
Link: https://lore.kernel.org/r/20230327083138.96044-1-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a definition for sub-type to the protocol spec doc and a description of
its usage for C arrays in genetlink-legacy.
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Describe the genetlink-legacy support for using struct definitions
for fixed headers and for binary attributes.
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for netlink families that add an optional fixed header structure
after the genetlink header and before any attributes. The fixed header can be
specified on a per-op basis, or once for all operations, which serves as a
default value that can be overridden.
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for decoding attributes that contain C structs.
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for decoding C arrays from binary blobs in genetlink-legacy
messages.
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add Python classes for struct definitions to nlspec.
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-03-20
mlx5 dynamic msix
This patch series adds support for dynamic msix vector allocation in mlx5.
Eli Cohen says:
================
The following series of patches modifies mlx5_core to work with the
dynamic MSIX API. Currently, mlx5_core allocates all the interrupt
vectors it needs and distributes them amongst the consumers. With the
introduction of dynamic MSIX support, which allows for allocation of
interrupts more than once, we now allocate vectors as we need them.
This allows other drivers running on top of mlx5_core to allocate
interrupt vectors for their own use. An example of this is mlx5_vdpa,
which uses these vectors to propagate interrupts directly from the
hardware to the vCPU [1].
As preparation for using this series, a use-after-free issue is fixed
in lib/cpu_rmap.c and the allocator for rmap entries has been modified.
A complementary API for irq_cpu_rmap_add() has also been introduced.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/patch/?id=0f2bf1fcae96a83b8c5581854713c9fc3407556e
================
* tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Provide external API for allocating vectors
net/mlx5: Use one completion vector if eth is disabled
net/mlx5: Refactor calculation of required completion vectors
net/mlx5: Move devlink registration before mlx5_load
net/mlx5: Use dynamic msix vectors allocation
net/mlx5: Refactor completion irq request/release code
net/mlx5: Improve naming of pci function vectors
net/mlx5: Use newer affinity descriptor
net/mlx5: Modify struct mlx5_irq to use struct msi_map
net/mlx5: Fix wrong comment
net/mlx5e: Coding style fix, add empty line
lib: cpu_rmap: Add irq_cpu_rmap_remove to complement irq_cpu_rmap_add
lib: cpu_rmap: Use allocator for rmap entries
lib: cpu_rmap: Avoid use after free on rmap->obj array entries
====================
Link: https://lore.kernel.org/r/20230324231341.29808-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We don't state explicitly that reverts need to be submitted
as a patch. It occasionally comes up.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230327172646.2622943-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
clang with W=1 reports
drivers/net/ethernet/8390/axnet_cs.c:653:9: error: variable
'xfer_count' set but not used [-Werror,-Wunused-but-set-variable]
    int xfer_count = count;
    ^
This variable is not used so remove it.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230327235423.1777590-1-trix@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit ce1fdb0656. It turned
out this actually introduces a race condition. netif_running() is not a
suitable check for get_stats.
Reported-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230327152112.15635-1-wsa+renesas@sang-engineering.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thomas Gleixner says:
====================
net, refcount: Address dst_entry reference count scalability issues
This is version 3 of this series. Version 2 can be found here:
https://lore.kernel.org/lkml/20230307125358.772287565@linutronix.de
Wangyang and Arjan reported a bottleneck in the networking code related to
struct dst_entry::__refcnt. Performance tanks massively when concurrency on
a dst_entry increases.
This happens when there are a large number of connections to or from
the same IP address. The memtier benchmark, when run on the same host
as memcached, amplifies this massively. But even over real network
connections this issue can be observed at an obviously smaller scale
(due to the network bandwidth limitations in my setup, i.e. 1Gb). How
to reproduce:
Run memcached with -t $N and memtier_benchmark with -t $M and --ratio=1:100
on the same machine. Localhost connections amplify the problem.
Start with the defaults for $N and $M and increase them. Depending on
your machine this will tank at some point. But even in reasonably small
$N, $M scenarios the refcount operations and the resulting false sharing
fallout becomes visible in perf top. At some point it becomes the
dominating issue.
There are two factors which make this reference count a scalability issue:
1) False sharing
dst_entry::__refcnt is located at offset 64 of dst_entry, which puts
it into a separate cacheline vs. the read-mostly members located at
the beginning of the struct.
That prevents false sharing vs. the struct members in the first 64
bytes of the structure, but there is also
dst_entry::lwtstate
which is located after the reference count and in the same cache
line. This member is read after a reference count has been acquired.
The other problem is struct rtable, which embeds a struct dst_entry
at offset 0. struct dst_entry has a size of 112 bytes, which means
that the struct members of rtable which follow the dst member share
the same cache line as dst_entry::__refcnt. Especially
rtable::rt_genid
is also read by the contexts which have a reference count acquired
already.
When dst_entry::__refcnt is incremented or decremented via an atomic
operation, these read accesses stall and contribute to the performance
problem.
2) atomic_inc_not_zero()
A reference on dst_entry::__refcnt is acquired via
atomic_inc_not_zero() and released via atomic_dec_return().
atomic_inc_not_zero() is implemented via an atomic_try_cmpxchg() loop,
which exposes O(N^2) behaviour under contention with N concurrent
operations. Contention scalability degrades with even a small
number of contenders and gets worse from there.
Lightweight instrumentation exposed an average of 8!! retry loops per
atomic_inc_not_zero() invocation in an inc()/dec() loop running
concurrently on 112 CPUs.
There is nothing which can be done to make atomic_inc_not_zero() more
scalable.
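For reference, the generic shape of such an operation is an
atomic_try_cmpxchg() retry loop along these lines (a simplified sketch):

    static inline bool inc_not_zero_sketch(atomic_t *v)
    {
            int c = atomic_read(v);

            do {
                    if (!c)
                            return false;
                    /* each failed atomic_try_cmpxchg() reloads c and
                     * retries; N contenders can force each other to
                     * retry, giving the O(N^2) behaviour above
                     */
            } while (!atomic_try_cmpxchg(v, &c, c + 1));

            return true;
    }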
The following series addresses these issues:
1) Reorder and pad struct dst_entry to prevent the false sharing.
2) Implement and use a reference count implementation which avoids the
atomic_inc_not_zero() problem.
It is slightly less performant in the case of the final 0 -> -1
transition, but the deconstruction of these objects is a low
frequency event. get()/put() pairs are in the hotpath and that's
what this implementation optimizes for.
The algorithm of this reference count is only suitable for RCU
managed objects. Therefore it cannot replace the refcount_t
algorithm, which is also based on atomic_inc_not_zero(), due to a
subtle race condition related to the 0 -> -1 transition and the final
verdict to mark the reference count dead. See details in patch 2/3.
It might be just my lack of imagination which declares this to be
impossible and I'd be happy to be proven wrong.
As a bonus the new rcuref implementation provides underflow/overflow
detection and mitigation while being performance wise on par with
open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
the non-contended case.
The combination of these two changes results in performance gains in micro
benchmarks and also localhost and networked memtier benchmarks talking to
memcached. It's hard to quantify the benchmark results as they depend
heavily on the micro-architecture and the number of concurrent operations.
The overall gain of both changes for localhost memtier ranges from 1.2X to
3.2X, and is in the +2% to +5% range for networked operations on a 1Gb
connection.
A micro benchmark which enforces maximized concurrency shows a gain between
1.2X and 4.7X!!!
Obviously this is focussed on a particular problem and therefore needs to
be discussed in detail. It also requires wider testing outside of the cases
which this is focussed on.
Though the false sharing issue is obvious and should be addressed
independently of the more focussed reference count changes.
The series is also available from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rcuref
Changes vs. V2:
- Rename __refcnt to __rcuref (Linus)
- Fix comments and changelogs (Mark, Qiuxu)
- Fixup kernel doc of generated atomic_add_negative() variants
I want to say thanks to Wangyang, who analyzed the issue and provided
the initial fix for the false sharing problem. Further thanks go to
Arjan, Peter, Marc, Will and Borislav for valuable input and for
providing test results on machines which I do not have access to, and
to Linus, Eric, Qiuxu and Mark for helpful feedback.
====================
Link: https://lore.kernel.org/r/20230323102649.764958589@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Under high contention dst_entry::__refcnt becomes a significant bottleneck.
atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
high retry rates on contention.
Switch the reference count to rcuref_t which results in a significant
performance gain. Rename the reference count member to __rcuref to reflect
the change.
The gain depends on the micro-architecture and the number of concurrent
operations and has been measured in the range of +25% to +130% with a
localhost memtier/memcached benchmark which amplifies the problem
massively.
Running the memtier/memcached benchmark over a real (1Gb) network
connection the conversion on top of the false sharing fix for struct
dst_entry::__refcnt results in a total gain in the 2%-5% range over the
upstream baseline.
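The conversion pattern, roughly (a sketch of the hold/release sides;
the rcuref API requires the object to be RCU managed):

    /* struct dst_entry: atomic_t __refcnt becomes rcuref_t __rcuref */

    static inline void dst_hold(struct dst_entry *dst)
    {
            /* lockless fast path, no cmpxchg retry loop */
            WARN_ON(!rcuref_get(&dst->__rcuref));
    }

    void dst_release(struct dst_entry *dst)
    {
            /* rcuref_put() returns true only for the final reference,
             * when the caller must deconstruct the object
             */
            if (dst && rcuref_put(&dst->__rcuref))
                    call_rcu(&dst->rcu_head, dst_destroy_rcu);
    }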
Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de
Link: https://lore.kernel.org/r/20230323102800.215027837@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>