Jakub Kicinski
6f80fcdfbc Merge branch 'mlxsw-use-page-pool-for-rx-buffers-allocation'
Petr Machata says:

====================
mlxsw: Use page pool for Rx buffers allocation

Amit Cohen writes:

After converting to NAPI to process events from hardware, the next step is
to use page pool for Rx buffer allocation, which also enhances
performance.

To simplify this change, first use page pool to allocate one contiguous
buffer for each packet; later, memory consumption can be improved by using
fragmented buffers.

This set significantly enhances mlxsw driver performance: the CPU can
handle about 370% of the packets per second it previously handled.

The next planned improvement is using XDP to optimize telemetry.

Patch set overview:
Patches #1-#2 are small preparations for page pool usage
Patch #3 initializes page pool, but does not use it yet
Patch #4 converts the driver to use page pool for buffer allocation
Patch #5 is an optimization for buffer access
Patch #6 cleans up an unused structure
Patch #7 uses napi_consume_skb() as part of Tx completion
====================

Link: https://lore.kernel.org/r/cover.1718709196.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:38:13 -07:00
Amit Cohen
d94ae6415b mlxsw: pci: Use napi_consume_skb() to free SKB as part of Tx completion
Currently, as part of Tx completion, the driver calls dev_kfree_skb_any()
to free the SKB. For this flow, the correct function is napi_consume_skb().
This function and dev_consume_skb_any() were added to be used for consumed
SKBs, which were not dropped, so the skb:kfree_skb tracepoint is not
triggered, and we can get better diagnostics about dropped packets.
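
A minimal sketch of the pattern, assuming a Tx completion handler that is
passed the NAPI budget (the function and parameter names are illustrative,
not the mlxsw-specific code):

#include <linux/skbuff.h>

/* Tx completion: budget == 0 signals non-NAPI (e.g. netpoll) context, in
 * which case napi_consume_skb() falls back to dev_consume_skb_any().
 */
static void tx_complete_one(struct sk_buff *skb, int budget)
{
        napi_consume_skb(skb, budget);  /* consumed, not dropped */
}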

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/a9f9f3dc884c0d1be4bd4c9d72030c88c7ac004f.1718709196.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:38:11 -07:00
Amit Cohen
e8441b1f6b mlxsw: pci: Do not store SKB for RDQ elements
The previous patch used page pool to allocate buffers for RDQ. With that
change, 'elem_info->u.rdq.skb' is no longer used: we do not allocate an
SKB before getting the packet; instead, we hold a page pointer and build
the SKB around it once the packet is received.

Remove the union and store SKB pointer for SDQ only.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/23a531008936dc9a1a298643fb1e4f9a7b8e6eb3.1718709196.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:38:11 -07:00
Amit Cohen
0f3cd437a1 mlxsw: pci: Optimize data buffer access
Before accessing the data buffer, call net_prefetch() to load it into the
cache. This change improves driver performance: the CPU can handle about
7.1% more packets per second.
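
A one-line illustration of the idea (the buffer variables are hypothetical):

/* warm the cache before the SKB is built and the headers are parsed */
net_prefetch(page_address(page) + rx_headroom);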

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/1fa07c510890866a6f201163ab7e78890ba28b3b.1718709196.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:38:11 -07:00
Amit Cohen
b5b60bb491 mlxsw: pci: Use page pool for Rx buffers allocation
As part of driver init, all Rx queues are filled with buffers for
hardware usage. Later, when a packet is received, a new buffer should be
allocated for the hardware in place of the received buffer.
A packet's processing time includes this allocation time, which can be
reduced using page pool.

With page pool, DMA mapping is done only when a buffer is first allocated.
Since subsequent buffer allocations avoid DMA mapping, this results in a
performance improvement. The purpose of page pool is to allocate pages
quickly from a cache without locking. This lockless guarantee naturally
comes from running under a NAPI.

Use page pool to allocate the data buffer only, so hardware will use it to
fill the packet. At completion time, attach the data buffer (now filled
with the packet payload) to a new SKB which is built around the received
buffer. Building the SKB at completion time avoids a cache miss per
packet, as the SKB is now allocated right before the packet is handled by
the networking stack.

A page pool per Rx queue enhances Rx-side performance by reclaiming
buffers back to each queue's specific pool. This change significantly
improves driver performance: the CPU can handle about 345% of the packets
per second it previously handled.
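
A rough sketch of the Rx flow described above, assuming one page per packet
(names outside the page pool and SKB APIs, such as the cq/headroom/byte_count
fields, are illustrative):

struct page *page;
struct sk_buff *skb;
dma_addr_t dma_addr;

/* refill: hand a pool page to the hardware descriptor */
page = page_pool_dev_alloc_pages(cq->page_pool);
if (!page)
        return -ENOMEM;
dma_addr = page_pool_get_dma_addr(page);
/* ... write dma_addr into the RDQ WQE ... */

/* completion: build the SKB around the already-filled buffer */
skb = napi_build_skb(page_address(page), PAGE_SIZE);
if (!skb) {
        page_pool_recycle_direct(cq->page_pool, page);
        return -ENOMEM;
}
skb_reserve(skb, headroom);
skb_put(skb, byte_count);
skb_mark_for_recycle(skb);      /* return the page to the pool on free */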

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/1cf788a8f43c70aae6d526018ef77becb27ad6d3.1718709196.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:38:10 -07:00
Amit Cohen
5642c6a086 mlxsw: pci: Initialize page pool per CQ
The next patch will use page pool to allocate buffers for RDQ. Initialize
a page pool for each CQ, which is mapped 1:1 to an RDQ. A page pool per Rx
queue enhances Rx-side performance by reclaiming buffers back to each
queue's specific pool.

When only one NAPI instance is the consumer of pages from a page pool, it
is recommended to pass it as part of 'page_pool_params'; page pool API
calls can then skip special locking. The mlxsw driver holds a NAPI instance
per CQ, so add a page pool per CQ and use the existing NAPI instance.

For now, pages are not allocated from the pool; the next patch will use it.

Some notes regarding 'page_pool_params':
* Use PP_FLAG_DMA_MAP to let the page pool handle DMA mapping. For now, do
  not use the sync flag, as only the device writes to this memory and we
  read it only after the device finishes writing there. This will probably
  change when XDP support is added.
* Define 'order' according to the maximum MTU, taking software overhead
  into account. Some rounding up is done, which means that we allocate more
  pages than we really need. This can be improved later by using fragmented
  buffers.
* Use pool_size = MLXSW_PCI_WQE_COUNT. This will be the size of 'ptr_ring',
  and should be the maximum number of packets that the page pool will
  allocate memory for. In our case, this is the queue size, defined as
  MLXSW_PCI_WQE_COUNT.
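
A sketch of the resulting pool creation along these lines (the queue and
device field names, e.g. q->u.cq.napi, are assumed, not exact):

struct page_pool_params pp_params = {
        .flags          = PP_FLAG_DMA_MAP,     /* pool handles DMA mapping */
        .order          = order,               /* from max MTU + SW overhead */
        .pool_size      = MLXSW_PCI_WQE_COUNT, /* ptr_ring size == queue size */
        .nid            = dev_to_node(&pdev->dev),
        .dev            = &pdev->dev,
        .napi           = &q->u.cq.napi,       /* single-NAPI consumer */
        .dma_dir        = DMA_FROM_DEVICE,     /* device writes, CPU reads */
};
struct page_pool *pool;

pool = page_pool_create(&pp_params);
if (IS_ERR(pool))
        return PTR_ERR(pool);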

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/02e5856ae7c572d4293ce6bb92c286ee6cfec800.1718709196.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:38:10 -07:00
Amit Cohen
7555b7f338 mlxsw: pci: Store CQ pointer as part of RDQ structure
The next patches will add page pool support to the mlxsw driver. The page
pool will be used to allocate buffers for RDQ and will use the NAPI
instance of the appropriate CQ (RDQ is mapped 1:1 to a CQ).

To allow pool initialization as part of CQ init, when NAPI is initialized,
the page_pool structure will be part of the CQ structure. Later,
allocations for RDQ will be done from the pool in the appropriate CQ. To
allow access to the appropriate pool, set a CQ pointer as part of RDQ
initialization.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/d60918ca1e142a554af1df9c1152cdac83854a3b.1718709196.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:38:10 -07:00
Amit Cohen
39fa294f58 mlxsw: pci: Split NAPI setup/teardown into two steps
mlxsw_pci_cq_napi_setup() includes both NAPI initialization and
enablement, and similarly for the teardown function. The next patches will
add page pool support to the mlxsw driver, which will use the NAPI
instance for the page pool.

Page pool initialization should be done before NAPI enablement; likewise,
page pool destruction should be done after NAPI disablement.

As preparation, split NAPI setup/teardown into two steps, so that page
pool setup can be done between the phases.
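
A hedged outline of the resulting ordering (the page pool helpers and the
poll function name are illustrative, not the exact mlxsw symbols):

/* setup: NAPI init, then the page pool that uses it, then enable */
netif_napi_add(netdev, &q->u.cq.napi, mlxsw_pci_napi_poll);
err = mlxsw_pci_cq_page_pool_init(q);
if (err)
        goto err_page_pool_init;
napi_enable(&q->u.cq.napi);

/* teardown: mirror order */
napi_disable(&q->u.cq.napi);
mlxsw_pci_cq_page_pool_fini(q);
netif_napi_del(&q->u.cq.napi);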

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/8dbf37e859f07247498fca17109b8858ff2b0498.1718709196.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:38:10 -07:00
Lukasz Majewski
89f5e60777 net: hsr: cosmetic: Remove extra white space
This change just removes extra (i.e. unneeded) white space in the
prp_drop_frame() function.

No functional changes.

Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240618125817.1111070-1-lukma@denx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:32:57 -07:00
Jiri Pirko
c8bd1f7f3e virtio_net: add support for Byte Queue Limits
Add support for Byte Queue Limits (BQL).

Tested on qemu emulated virtio_net device with 1, 2 and 4 queues.
Tested with fq_codel and pfifo_fast. Super netperf with 50 threads is
running in the background. Netperf TCP_RR results:

NOBQL FQC 1q:  159.56  159.33  158.50  154.31    avg: 157.925
NOBQL FQC 2q:  184.64  184.96  174.73  174.15    avg: 179.62
NOBQL FQC 4q:  994.46  441.96  416.50  499.56    avg: 588.12
NOBQL PFF 1q:  148.68  148.92  145.95  149.48    avg: 148.2575
NOBQL PFF 2q:  171.86  171.20  170.42  169.42    avg: 170.725
NOBQL PFF 4q: 1505.23 1137.23 2488.70 3507.99    avg: 2159.7875
  BQL FQC 1q: 1332.80 1297.97 1351.41 1147.57    avg: 1282.4375
  BQL FQC 2q:  768.30  817.72  864.43  974.40    avg: 856.2125
  BQL FQC 4q:  945.66  942.68  878.51  822.82    avg: 897.4175
  BQL PFF 1q:  149.69  151.49  149.40  147.47    avg: 149.5125
  BQL PFF 2q: 2059.32  798.74 1844.12  381.80    avg: 1270.995
  BQL PFF 4q: 1871.98 4420.02 4916.59 13268.16   avg: 6119.1875
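
For reference, BQL support boils down to reporting bytes queued and bytes
completed per Tx queue; a generic sketch, not the virtio_net-specific code
(txq, pkts and bytes are illustrative variables):

/* transmit path, after handing the skb to the device/virtqueue */
netdev_tx_sent_queue(txq, skb->len);

/* completion path, after reclaiming 'pkts'/'bytes' worth of buffers */
netdev_tx_completed_queue(txq, pkts, bytes);

/* queue reset */
netdev_tx_reset_queue(txq);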

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20240618144456.1688998-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:28:19 -07:00
Jeff Johnson
2b0cd6b727 net: smc9194: add missing MODULE_DESCRIPTION() macro
With ARCH=m68k, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/ethernet/smsc/smc9194.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.
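
The fix itself is a single macro invocation next to the existing
MODULE_LICENSE(); the description text below is illustrative, not
necessarily the exact string used:

MODULE_DESCRIPTION("SMC 9194 Ethernet network driver");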

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240618-md-m68k-drivers-net-ethernet-smsc-v1-1-ad3d7200421e@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:22:33 -07:00
Jeff Johnson
5e736135ad net: ethernet: mac89x0: add missing MODULE_DESCRIPTION() macro
With ARCH=m68k, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/ethernet/cirrus/mac89x0.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240618-md-m68k-drivers-net-ethernet-cirrus-v1-1-07f5bd0b64cb@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:22:00 -07:00
Jeff Johnson
deb9d57662 net: amd: add missing MODULE_DESCRIPTION() macros
With ARCH=m68k, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/ethernet/amd/a2065.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/ethernet/amd/ariadne.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/ethernet/amd/atarilance.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/ethernet/amd/hplance.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/ethernet/amd/7990.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/ethernet/amd/mvme147.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/ethernet/amd/sun3lance.o

Add the missing invocation of the MODULE_DESCRIPTION() macro to all
files which have a MODULE_LICENSE().

This includes drivers/net/ethernet/amd/lance.c which, although it did
not produce a warning with the m68k allmodconfig configuration, may
cause this warning with other configurations.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240618-md-m68k-drivers-net-ethernet-amd-v1-1-50ee7a9ad50e@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:21:40 -07:00
Jeff Johnson
e6fed01554 net: arcnet: com20020-isa: add missing MODULE_DESCRIPTION() macro
With ARCH=m68k, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/arcnet/com20020-isa.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240618-md-m68k-drivers-net-arcnet-v1-1-90e42bc58102@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19 17:21:12 -07:00
Matthias Schiffer
3d49ee2127 net: dsa: mt7530: add support for bridge port isolation
Remove a pair of ports from the port matrix when both ports have the
isolated flag set.
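
A hedged sketch of the check inside the port-matrix update loop (the field
and variable names are assumed, not the exact mt7530 code):

/* Leave the other port out of this port's matrix when both ends are
 * isolated bridge ports, so isolated ports cannot forward to each other.
 */
if (p->isolated && other_p->isolated)
        continue;

port_bitmap |= BIT(other_port);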

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 13:10:01 +01:00
Matthias Schiffer
c25c961fc7 net: dsa: mt7530: factor out bridge join/leave logic
As preparation for implementing bridge port isolation, move the logic to
add and remove bits in the port matrix into a new helper
mt7530_update_port_member(), which is called from
mt7530_port_bridge_join() and mt7530_port_bridge_leave().

Another part of the preparation is using dsa_port_offloads_bridge_dev()
instead of dsa_port_offloads_bridge() to check for bridge membership, as
we don't have a struct dsa_bridge in mt7530_port_bridge_flags().

The port matrix setting is slightly streamlined, now always first setting
the mt7530_port's pm field and then writing the port matrix from that
field into the hardware register, instead of duplicating the bit
manipulation for both the struct field and the register.

mt7530_port_bridge_join() was previously using |= to update the port
matrix with the port bitmap, which was unnecessary, as pm would only
have the CPU port set before joining a bridge; a simple assignment can
be used for both joining and leaving (and will also work when individual
bits are added/removed in port_bitmap with regard to the previous port
matrix, which is what happens with port isolation).

No functional change intended.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 13:10:01 +01:00
David S. Miller
65e4efa049 Merge branch 'net-drop-rx-socket-tracepoint'
Yan Zhai says:

====================
net: pass receive socket to drop tracepoint

We set up our production packet drop monitoring around the kfree_skb
tracepoint. While this tracepoint is extremely valuable for diagnosing
critical problems, it also has some limitations with drops on the local
receive path: this tracepoint can only inspect the dropped skb itself,
but such an skb might not carry enough information to:

1. determine in which netns/container this skb gets dropped
2. determine by which socket/service this skb ought to be received

The 1st issue is because skb->dev is the only member field with a valid
netns reference. But skb->dev can get cleared or reused. For example,
tcp_v4_rcv will clear skb->dev, and in later processing it might be reused
for the OFO tree.

The 2nd issue is because there is no reference on an skb that reliably
points to a receiving socket. skb->sk usually points to the local
sending socket, and it only points to a receive socket briefly after the
early demux stage, yet the socket can get stolen later. For certain drop
reasons like TCP OFO_MERGE, Zerowindow, UDP at PROTO_MEM error, etc., it
is hard to infer which receiving socket is impacted. This cannot be
overcome by simply looking at the packet header, because of
complications like sk lookup programs. In the past, single-purpose
tracepoints like trace_udp_fail_queue_rcv_skb, trace_sock_rcvqueue_full,
etc. were added as needed to provide more visibility. This could be
handled in a more generic way.

In this change set we propose a new 'sk_skb_reason_drop' call as a drop-in
replacement for kfree_skb_reason at various local input paths. It accepts
an extra receiving socket argument. Both issues above can be resolved
via this new argument.

V4->V5: rename rx_skaddr to rx_sk to be more clear visually, suggested
by Jesper Dangaard Brouer.

V3->V4: adjusted the TP_STRUCT field order to align better, suggested by
Steven Rostedt.

V2->V3: fixed drop_monitor function signatures; fixed a few uninitialized sks;
Added a few missing report tags from test bots (also noticed by Dan
Carpenter and Simon Horman).

V1->V2: instead of using skb->cb, directly add the needed argument to
trace_kfree_skb tracepoint. Also renamed functions as Eric Dumazet
suggested.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 12:44:23 +01:00
Yan Zhai
e2e7d78d9a af_packet: use sk_skb_reason_drop to free rx packets
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202406011859.Aacus8GV-lkp@intel.com/
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 12:44:22 +01:00
Yan Zhai
fc0cc92488 udp: use sk_skb_reason_drop to free rx packets
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202406011751.NpVN0sSk-lkp@intel.com/
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 12:44:22 +01:00
Yan Zhai
46a02aa357 tcp: use sk_skb_reason_drop to free rx packets
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202406011539.jhwBd7DX-lkp@intel.com/
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 12:44:22 +01:00
Yan Zhai
ce9a2424e9 net: raw: use sk_skb_reason_drop to free rx packets
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.

Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 12:44:22 +01:00
Yan Zhai
7467de1763 ping: use sk_skb_reason_drop to free rx packets
Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.

Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 12:44:22 +01:00
Yan Zhai
ba8de796ba net: introduce sk_skb_reason_drop function
The long-used destructors kfree_skb and kfree_skb_reason do not pass the
receiving socket to the packet drop tracepoint trace_kfree_skb.
This makes it hard to track packet drops of a certain netns (container)
or a socket (user application).

The naming of these destructors is also not consistent with most sk/skb
operating functions, i.e. functions named "sk_xxx" or "skb_xxx".
Introduce a new function, sk_skb_reason_drop, as a drop-in replacement for
kfree_skb_reason on the local receiving path. Callers can now pass the
receiving socket to the tracepoints.

kfree_skb and kfree_skb_reason are still usable, but they are now just
inline helpers that call sk_skb_reason_drop.

Note it is not feasible to do the same to consume_skb. Packets not
dropped can flow through multiple receive handlers, and have multiple
receiving sockets. Leave it untouched for now.
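
A call-site sketch of the conversion on a receive path (the drop reason
shown is only an example):

/* before */
kfree_skb_reason(skb, SKB_DROP_REASON_TCP_OFOMERGE);

/* after: also hand the receiving socket to the drop tracepoint */
sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_TCP_OFOMERGE);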

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 12:44:22 +01:00
Yan Zhai
c53795d48e net: add rx_sk to trace_kfree_skb
skb does not include enough information to find out receiving
sockets/services and netns/containers on packet drops. In theory
skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
stack for OOO packet lookup. Similarly, skb->sk often identifies a local
sender, and tells nothing about a receiver.

Allow passing an extra receiving socket to the tracepoint to improve
the visibility on receiving drops.

Signed-off-by: Yan Zhai <yan@cloudflare.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 12:44:22 +01:00
David S. Miller
6f46fc9bc2 Merge branch 'am65x-ptp'
Diogo Ivo says:

====================
Enable PTP timestamping/PPS for AM65x SR1.0 devices

This patch series enables support for PTP in AM65x SR1.0 devices.

This feature relies heavily on the Industrial Ethernet Peripheral
(IEP) hardware module, which implements a hardware counter through
which time is kept. This hardware block is the basis for exposing
a PTP hardware clock to userspace and for issuing timestamps for
incoming/outgoing packets, allowing for time synchronization.

The IEP also has compare registers that fire an interrupt when the
counter reaches the value stored in a compare register. This feature
allows us to support PPS events in the kernel.

The changes are separated into five patches:
 - PATCH 01/05: Register SR1.0 devices with the IEP infrastructure to
		expose a PHC clock to userspace, allowing time to be
		adjusted using standard PTP tools. The code for issuing/
		collecting packet timestamps is already present in the
		current state of the driver, so only this needs to be
		done.
 - PATCH 02/05: Remove unnecessary spinlock synchronization.
 - PATCH 03/05: Document IEP interrupt in DT binding.
 - PATCH 04/05: Add support for IEP compare event/interrupt handling
		to enable PPS events.
 - PATCH 05/05: Add the interrupts to the IOT2050 device tree.

Currently, every compare event generates two interrupts, the first
corresponding to the actual event and the second being a spurious
but otherwise harmless interrupt. The root cause of this has been
identified and solved in the platform's SDK. A forward port
of the SDK's patches also fixes the problem upstream but is not
included here since its upstreaming is out of the scope of this
series. If someone from TI would be willing to chime in and help
get the interrupt changes upstream, that would be great!

Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
---
Changes in v4:
- Remove unused 'flags' variables in patch 02/05
- Add patch 03/05 describing IEP interrupt in DT binding
- Link to v3: https://lore.kernel.org/r/20240607-iep-v3-0-4824224105bc@siemens.com

Changes in v3:
- Collect Reviewed-by tags
- Add patch 02/04 removing spinlocks from IEP driver
- Use mutex-based synchronization when accessing HW registers
- Link to v2: https://lore.kernel.org/r/20240604-iep-v2-0-ea8e1c0a5686@siemens.com

Changes in v2:
- Collect Reviewed-by tags
- PATCH 01/03: Limit line length to 80 characters
- PATCH 02/03: Proceed with limited functionality if getting IRQ fails,
	       limit line length to 80 characters
- Link to v1: https://lore.kernel.org/r/20240529-iep-v1-0-7273c07592d3@siemens.com
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 10:56:10 +01:00
Diogo Ivo
71be1189c9 arm64: dts: ti: iot2050: Add IEP interrupts for SR1.0 devices
Add the interrupts needed for PTP Hardware Clock support via IEP
in SR1.0 devices.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 10:56:10 +01:00
Diogo Ivo
f18ad402cd net: ti: icss-iep: Enable compare events
The IEP module supports compare events, in which a value is written to a
hardware register and when the IEP counter reaches the written value an
interrupt is generated. Add handling for this interrupt in order to
support PPS events.

Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 10:56:10 +01:00
Diogo Ivo
5056860cf8 dt-bindings: net: Add IEP interrupt
The IEP interrupt is used in order to support both capture events, where
an incoming external signal gets timestamped on arrival, and compare
events, where an interrupt is generated internally when the IEP counter
reaches a programmed value.

Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 10:56:10 +01:00
Diogo Ivo
5758e03cf6 net: ti: icss-iep: Remove spinlock-based synchronization
As all sources of concurrency in hardware register access occur in
non-interrupt context eliminate spinlock-based synchronization and
rely on the mutex-based synchronization that is already present.

Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 10:56:10 +01:00
Diogo Ivo
5e1e43893b net: ti: icssg-prueth: Enable PTP timestamping support for SR1.0 devices
Enable PTP support for AM65x SR1.0 devices by registering with the IEP
infrastructure in order to expose a PTP clock to userspace.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Diogo Ivo <diogo.ivo@siemens.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 10:56:09 +01:00
Hongfu Li
9f1f70dd85 rds: Simplify the allocation of slab caches
Use the new KMEM_CACHE() macro instead of calling kmem_cache_create()
directly to simplify the creation of slab caches.
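
The conversion pattern, using an illustrative RDS cache (exact call sites
may differ):

/* before */
rds_conn_slab = kmem_cache_create("rds_connection",
                                  sizeof(struct rds_connection),
                                  0, 0, NULL);

/* after: name, size and alignment are all derived from the struct */
rds_conn_slab = KMEM_CACHE(rds_connection, 0);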

Signed-off-by: Hongfu Li <lihongfu@kylinos.cn>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19 10:47:40 +01:00
Haiyang Zhang
382d1741b5 net: mana: Add support for page sizes other than 4KB on ARM64
As defined by the MANA hardware spec, the queue size for DMA is 4KB
minimum and a power of 2. In addition, the HWC queue size has to be
exactly 4KB.

To support page sizes other than 4KB on ARM64, define the minimal
queue size as a macro separate from PAGE_SIZE, which we had always
assumed to be 4KB before supporting ARM64.

Also, add MANA-specific macros and update code related to size
alignment, DMA region calculations, etc.
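
A hedged sketch of the kind of macros this adds (names follow the
description above; the exact definitions may differ):

/* MANA queues must be sized and aligned to 4KB units regardless of the
 * CPU PAGE_SIZE (16KB or 64KB on some ARM64 configurations).
 */
#define MANA_PAGE_SHIFT         12
#define MANA_PAGE_SIZE          BIT(MANA_PAGE_SHIFT)
#define MANA_PAGE_ALIGN(x)      ALIGN((x), MANA_PAGE_SIZE)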

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1718655446-6576-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-18 18:21:18 -07:00
Jakub Kicinski
2c6a4b969c Merge branch 'net-mlx4_en-use-ethtool_puts-sprintf'
Kamal Heib says:

====================
net/mlx4_en: Use ethtool_puts/sprintf

This patchset updates the mlx4_en driver to use the ethtool_puts and
ethtool_sprintf helper functions.

Signed-off-by: Kamal Heib <kheib@redhat.com>
====================

Link: https://lore.kernel.org/r/20240617172329.239819-1-kheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-18 18:19:23 -07:00
Kamal Heib
6c7dd432dc net/mlx4_en: Use ethtool_puts/sprintf to fill stats strings
Use the ethtool_puts/ethtool_sprintf helpers to print the stats strings
into the ethtool strings interface.
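
The helpers take a cursor into the ethtool strings buffer and advance it;
a generic sketch (the array and count names are illustrative):

/* fixed strings: ethtool_puts() replaces strcpy() plus manual
 * pointer arithmetic
 */
for (i = 0; i < NUM_MAIN_STATS; i++)
        ethtool_puts(&data, main_strings[i]);

/* formatted per-ring strings */
for (i = 0; i < rx_ring_num; i++)
        ethtool_sprintf(&data, "rx%d_packets", i);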

Signed-off-by: Kamal Heib <kheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240617172329.239819-4-kheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-18 18:19:18 -07:00
Kamal Heib
4454929c34 net/mlx4_en: Use ethtool_puts to fill selftest strings
Use the ethtool_puts helper to print the selftest strings into the
ethtool strings interface.

Signed-off-by: Kamal Heib <kheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240617172329.239819-3-kheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-18 18:19:18 -07:00
Kamal Heib
e52e010395 net/mlx4_en: Use ethtool_puts to fill priv flags strings
Use the ethtool_puts helper to print the priv flags strings into the
ethtool strings interface.

Signed-off-by: Kamal Heib <kheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240617172329.239819-2-kheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-18 18:19:18 -07:00
Christophe JAILLET
8c379e3ce4 net: microchip: Constify struct vcap_operations
"struct vcap_operations" are not modified in these drivers.

Constifying this structure moves some data to a read-only section, so
increase overall security.

In order to do it, "struct vcap_control" also needs to be adjusted to this
new const qualifier.

As an example, on a x86_64, with allmodconfig:
Before:
======
   text	   data	    bss	    dec	    hex	filename
  15176	   1094	     16	  16286	   3f9e	drivers/net/ethernet/microchip/lan966x/lan966x_vcap_impl.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
  15268	    998	     16	  16282	   3f9a	drivers/net/ethernet/microchip/lan966x/lan966x_vcap_impl.o
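
The change itself follows the usual constification pattern; the instance
name below is taken from the object file above, the rest is sketched:

/* before: writable ops table in a writable data section */
static struct vcap_operations lan966x_vcap_ops = { /* callbacks */ };

/* after: read-only ops table; struct vcap_control must then hold a
 * 'const struct vcap_operations *' for this to compile.
 */
static const struct vcap_operations lan966x_vcap_ops = { /* callbacks */ };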

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/d8e76094d2e98ebb5bfc8205799b3a9db0b46220.1718524644.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-18 15:26:39 +02:00
Paolo Abeni
e845bb84fb Merge branch 'introduce-phy-mode-10g-qxgmii'
Luo Jie says:

====================
Introduce PHY mode 10G-QXGMII

This patch series adds the 10G-QXGMII mode for PHY drivers. The patch
series is split from the QCA8084 PHY driver patch series below.
https://lore.kernel.org/all/20231215074005.26976-1-quic_luoj@quicinc.com/

Per Andrew Lunn’s advice, these patches are submitted for acceptance
since they already include the necessary 'Reviewed-by:' tags. This way,
they need not wait for the QCA8084 series patches to conclude review.

Changes in v2:
	* remove PHY_INTERFACE_MODE_10G_QXGMII from the validation
	  workaround in phylink_validate_phy(). 10G_QXGMII will
	  be set in phy->possible_interfaces by the .config_init
	  method of a PHY driver that supports it.
====================

Link: https://lore.kernel.org/r/20240615120028.2384732-1-quic_luoj@quicinc.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-18 13:28:29 +02:00
Vladimir Oltean
5dfabcdd76 dt-bindings: net: ethernet-controller: add 10g-qxgmii mode
Add the new interface mode 10g-qxgmii, which is similar to
usxgmii but extends to 4 channels to support a maximum of 4
ports with link speeds of 10M/100M/1G/2.5G.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-18 13:28:26 +02:00
Vladimir Oltean
777b8afb81 net: phy: introduce core support for phy-mode = "10g-qxgmii"
10G-QXGMII is a MAC-to-PHY interface defined by the USXGMII multiport
specification. It uses the same signaling as USXGMII, but it multiplexes
4 ports over the link, resulting in a maximum speed of 2.5G per port.

Some in-tree SoCs like the NXP LS1028A use "usxgmii" when they mean
either the single-port USXGMII or the quad-port 10G-QXGMII variant, and
they could get away just fine with that thus far. But there is a need to
distinguish between the 2 as far as SerDes drivers are concerned.
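
A hedged sketch of where the distinction matters to a SerDes/PCS driver;
the helper names are made up for illustration:

switch (interface) {
case PHY_INTERFACE_MODE_USXGMII:
        /* single port over the lane, up to 10G */
        return serdes_cfg_usxgmii(serdes);
case PHY_INTERFACE_MODE_10G_QXGMII:
        /* same signaling, 4 ports multiplexed, up to 2.5G per port */
        return serdes_cfg_10g_qxgmii(serdes);
default:
        return -EOPNOTSUPP;
}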

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-18 13:28:26 +02:00
Furong Xu
041cc86b36 net: stmmac: Enable TSO on VLANs
The TSO engine works well when the frames are not VLAN tagged,
but it will produce broken segments when frames are VLAN tagged.

The first segment is all good, while the second through the last
segments are broken: they lack the required VLAN tag.

An example here:
========
// 1st segment of a VLAN Tagged TSO frame, nothing wrong.
MacSrc > MacDst, ethertype 802.1Q (0x8100), length 1518: vlan 100, p 1, ethertype IPv4 (0x0800), HostA:42643 > HostB:5201: Flags [.], seq 1:1449

// 2nd to last segments of a VLAN Tagged TSO frame, VLAN tag is missing.
MacSrc > MacDst, ethertype IPv4 (0x0800), length 1514: HostA:42643 > HostB:5201: Flags [.], seq 1449:2897
MacSrc > MacDst, ethertype IPv4 (0x0800), length 1514: HostA:42643 > HostB:5201: Flags [.], seq 2897:4345
MacSrc > MacDst, ethertype IPv4 (0x0800), length 1514: HostA:42643 > HostB:5201: Flags [.], seq 4345:5793
MacSrc > MacDst, ethertype IPv4 (0x0800), length 1514: HostA:42643 > HostB:5201: Flags [P.], seq 5793:7241

// normal VLAN Tagged non-TSO frame, nothing wrong.
MacSrc > MacDst, ethertype 802.1Q (0x8100), length 1022: vlan 100, p 1, ethertype IPv4 (0x0800), HostA:42643 > HostB:5201: Flags [P.], seq 7241:8193
MacSrc > MacDst, ethertype 802.1Q (0x8100), length 70: vlan 100, p 1, ethertype IPv4 (0x0800), HostA:42643 > HostB:5201: Flags [F.], seq 8193
========

When transmitting VLAN tagged TSO frames, never insert the VLAN tag via
HW; always insert the VLAN tag into the SKB payload. Then TSO works well
on VLANs for all MAC cores.

Tested on DWMAC CORE 5.10a, DWMAC CORE 5.20a and DWXGMAC CORE 3.20a
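
A hedged sketch of the transmit-side handling this implies (not the exact
stmmac hunk):

if (skb_vlan_tag_present(skb)) {
        /* Move the accelerated tag into the packet data so every
         * hardware-generated segment carries the VLAN header.
         */
        skb = __vlan_hwaccel_push_inside(skb);
        if (unlikely(!skb)) {
                dev->stats.tx_dropped++;
                return NETDEV_TX_OK;
        }
}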

Signed-off-by: Furong Xu <0x1207@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240615095611.517323-1-0x1207@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-18 13:00:01 +02:00
Kory Maincent
efb459303d net: Move dev_set_hwtstamp_phylib to net/core/dev.h
This declaration was added to the header so it could be called from ethtool.
ethtool is separated from the core for code organization, but it is not
really a separate entity; it controls very core things.
As ethtool is internal, it is not wise to have the declaration in
netdevice.h. Move the declaration to net/core/dev.h instead.

Remove the EXPORT_SYMBOL_GPL call as ethtool cannot be built as a module.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://lore.kernel.org/r/20240612-feature_ptp_netnext-v15-2-b2a086257b63@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17 18:25:53 -07:00
Jeff Johnson
0d9bb14427 net: dwc-xlgmac: fix missing MODULE_DESCRIPTION() warning
With ARCH=hexagon, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/net/ethernet/synopsys/dwc-xlgmac.o

With most other ARCH settings the MODULE_DESCRIPTION() is provided by
the macro invocation in dwc-xlgmac-pci.c. However, for hexagon, the
PCI bus is not enabled, and hence CONFIG_DWC_XLGMAC_PCI is not set.
As a result, dwc-xlgmac-pci.c is not compiled, and hence is not linked
into dwc-xlgmac.o.

To avoid this issue, relocate the MODULE_DESCRIPTION() and other
related macros from dwc-xlgmac-pci.c to dwc-xlgmac-common.c, since
that file already has an existing MODULE_LICENSE() and it is
unconditionally linked into dwc-xlgmac.o.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240616-md-hexagon-drivers-net-ethernet-synopsys-v1-1-55852b60aef8@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17 18:05:38 -07:00
Shradha Gupta
e275e19c91 net: mana: Use mana_cleanup_port_context() for rxq cleanup
To clean up rxqs in port context structures, instead of duplicating the
code, use the existing function mana_cleanup_port_context(), which does
exactly the cleanup that's needed.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Heng Qi <hengqi@linux.alibaba.com>
Link: https://lore.kernel.org/r/1718349548-28697-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17 18:01:01 -07:00
Willem de Bruijn
dd89a81d85 fou: remove warn in gue_gro_receive on unsupported protocol
Drop the WARN_ON_ONCE in gue_gro_receive if the encapsulated type is
not known or does not have a GRO handler.

Such a packet is easily constructed. Syzbot generates them and sets
off this warning.

Remove the warning as it is expected and not actionable.

The warning was previously reduced from WARN_ON to WARN_ON_ONCE in
commit 270136613bf7 ("fou: Do WARN_ON_ONCE in gue_gro_receive for bad
proto callbacks").
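
Reconstructed from the description (not the exact hunk), the check in
gue_gro_receive() becomes a plain bail-out:

ops = rcu_dereference(offloads[proto]);
if (!ops || !ops->callbacks.gro_receive)
        goto out;       /* expected with crafted packets; no warning */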

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240614122552.1649044-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-17 17:49:50 -07:00
David S. Miller
4314175af4 Merge branch 'net-smc-IPPROTO_SMC'
D. Wythe says:

====================
Introduce IPPROTO_SMC

This patch set allows creating smc sockets via AF_INET,
similar to the following code:

/* create v4 smc sock */
v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC);

/* create v6 smc sock */
v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC);

There are several reasons why we believe it is appropriate here:

1. For smc sockets, we actually use IPv4 (AF_INET) or IPv6 (AF_INET6)
addresses. There is no AF_SMC address at all.

2. Creating the smc socket in the AF_INET(6) path allows us to reuse
the infrastructure of the AF_INET(6) path, such as common eBPF hooks.
Otherwise, smc would have to implement it again in the AF_SMC path, e.g.:
  1. Replace IPPROTO_TCP with IPPROTO_SMC in the socket() syscall
     initiated by the user, without the use of LD_PRELOAD.
  2. Select whether immediate fallback is required based on the peer's
     port/IP before connect().

A very significant result is that we can now use eBPF to implement smc_run
instead of LD_PRELOAD, which is completely ineffective in scenarios of
static linking.

Another potential value is that we are attempting to optimize the
performance of fallback socks, where merging socks is an important part,
and it relies on the creation of SMC sockets under the AF_INET path.
(More information:
https://lore.kernel.org/netdev/1699442703-25015-1-git-send-email-alibuda@linux.alibaba.com/T/)

v2 -> v1:

- Code formatting, mainly including alignment and annotation repair.
- move inet_smc proto ops to inet_smc.c, avoiding af_smc.c becoming too bulky.
- Fix the issue where refactoring affects the initialization order.
- Fix compile warning (unused out_inet_prot) while CONFIG_IPV6 was not set.

v3 -> v2:

- Add Alibaba's copyright information to the newfile

v4 -> v3:

- Fix some spelling errors
- Align the function naming style by renaming smc_sock_init() to smc_sk_init()
- Reverse the order of the conditional checks on clcsock to make the code more intuitive

v5 -> v4:

- Fix some spelling errors
- Added comment, "/* CONFIG_IPV6 */", after the final #endif directive.
- Rename inet_smc.h and inet_smc.c to smc_inet.h and smc_inet.c
- Encapsulate the initialization and destruction of inet_smc in inet_smc.c,
  rather than implementing it directly in af_smc.c.
- Remove useless header files in smc_inet.h
- Make smc_inet_prot_xxx and smc_inet_sock_init() to be static, since it's
  only used in smc_inet.c

v6 -> v5:

- Wrapping lines to not exceed 80 characters
- Combine initialization and error handling of smc_inet6 into the same #if
  macro block.

v7 -> v6:

- Modify the value of IPPROTO_SMC to 256 so that it does not affect IPPROTO_MAX

v8 -> v7:

- Remove useless declarations.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17 13:14:09 +01:00
D. Wythe
d25a92ccae net/smc: Introduce IPPROTO_SMC
This patch allows creating smc sockets via AF_INET,
similar to the following code:

/* create v4 smc sock */
v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC);

/* create v6 smc sock */
v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC);

There are several reasons why we believe it is appropriate here:

1. For smc sockets, we actually use IPv4 (AF_INET) or IPv6 (AF_INET6)
addresses. There is no AF_SMC address at all.

2. Creating the smc socket in the AF_INET(6) path allows us to reuse
the infrastructure of the AF_INET(6) path, such as common eBPF hooks.
Otherwise, smc would have to implement it again in the AF_SMC path.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17 13:14:09 +01:00
D. Wythe
13543d02c9 net/smc: expose smc proto operations
Externalize smc proto operations (smc_xxx) to allow
access from files other than af_smc.c

This is in preparation for the subsequent implementation
of the AF_INET version of SMC.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17 13:14:09 +01:00
D. Wythe
d0e35656d8 net/smc: refactoring initialization of smc sock
This patch aims to isolate the shared components of SMC socket
allocation by introducing smc_sk_init() for sock initialization
and __smc_create_clcsk() for the initialization of clcsock.

This is in preparation for the subsequent implementation of the
AF_INET version of SMC.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17 13:14:09 +01:00
Jakub Kicinski
f22b4b55ed net: make for_each_netdev_dump() a little more bug-proof
I find the behavior of xa_for_each_start() slightly counter-intuitive.
It doesn't end the iteration by making the index point after the last
element. IOW calling xa_for_each_start() again after it "finished"
will run the body of the loop for the last valid element, instead
of doing nothing.

This works fine for netlink dumps if they terminate correctly
(i.e. coalesce or carefully handle NLM_DONE), but as we keep getting
reminded legacy dumps are unlikely to go away.

Fixing this generically at the xa_for_each_start() level seems hard -
there is no index reserved for "end of iteration".
ifindexes are 31 bits wide, though, and the iterator is an unsigned long,
so for for_each_netdev_dump() it's safe to advance to the next index.
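
A sketch of an iterator shaped to avoid the problem, along the lines
described above (not necessarily the exact macro):

#define for_each_netdev_dump(net, d, ifindex)                           \
        for (; (d = xa_find(&(net)->dev_by_index, &(ifindex),           \
                            ULONG_MAX, XA_PRESENT)); (ifindex)++)

Advancing the index after each element means that a dump re-entered after
the last element finds nothing, instead of replaying that element.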

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17 13:11:02 +01:00