IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
More register macros need to be adjusted for the 3rd GMAC on MT7988.
Account for added bit in SYSCFG0_SGMII_MASK.
Fixes: 445eb6448ed3 ("net: ethernet: mtk_eth_soc: add basic support for MT7988 SoC")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/1c8da012e2ca80939906d85f314138c552139f0f.1692721443.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As we already added the exception tracing for XDP_TX, I think it is
necessary to add the exception tracing for other XDP actions, such
as XDP_REDIRECT, XDP_ABORTED and unknown error actions.
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230822065255.606739-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All callers of build_skb() (*) in bnxt are in NAPI context.
The budget checking is somewhat convoluted because in the shared
completion queue cases Rx packets are discarded by netpoll
by forcing an error (E). But that happens before skb allocation.
Only a call chain starting at __bnxt_poll_work() can lead to
an skb allocation and it checks budget (b).
* bnxt_rx_multi_page_skb
* bnxt_rx_skb
` bp->rx_skb_func
* bnxt_tpa_end
` bnxt_rx_pkt
E bnxt_force_rx_discard
E bnxt_poll_nitroa0
b __bnxt_poll_work
Use napi_build_skb() to take advantage of the skb cache.
In iperf tests with HW-GRO enabled it barely makes a difference
but in cases where HW-GRO is not as effective (or disabled) it
can give even a >10% boost (20.7Gbps -> 23.1Gbps).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove debug logs in port vlan management, since there are already multiple
tracepoints defined for those operations in DSA
Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF programs that run on connect can rewrite the connect address. For
the connect system call this isn't a problem, because a copy of the address
is made when it is moved into kernel space. However, kernel_connect
simply passes through the address it is given, so the caller may observe
its address value unexpectedly change.
A practical example where this is problematic is where NFS is combined
with a system such as Cilium which implements BPF-based load balancing.
A common pattern in software-defined storage systems is to have an NFS
mount that connects to a persistent virtual IP which in turn maps to an
ephemeral server IP. This is usually done to achieve high availability:
if your server goes down you can quickly spin up a replacement and remap
the virtual IP to that endpoint. With BPF-based load balancing, mounts
will forget the virtual IP address when the address rewrite occurs
because a pointer to the only copy of that address is passed down the
stack. Server failover then breaks, because clients have forgotten the
virtual IP address. Reconnects fail and mounts remain broken. This patch
was tested by setting up a scenario like this and ensuring that NFS
reconnects worked after applying the patch.
Signed-off-by: Jordan Rife <jrife@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The virtio_net driver currently deals with different versions and types
of virtio net headers, such as virtio_net_hdr_mrg_rxbuf,
virtio_net_hdr_v1_hash, etc. Due to these variations, the code relies
on multiple type casts to convert memory between different structures,
potentially leading to bugs when there are changes in these structures.
Introduces the "struct skb_vnet_common_hdr" as a unifying header
structure using a union. With this approach, various virtio net header
structures can be converted by accessing different members of this
structure, thus eliminating the need for type casting and reducing the
risk of potential bugs.
For example following code:
static struct sk_buff *page_to_skb(struct virtnet_info *vi,
struct receive_queue *rq,
struct page *page, unsigned int offset,
unsigned int len, unsigned int truesize,
unsigned int headroom)
{
[...]
struct virtio_net_hdr_mrg_rxbuf *hdr;
[...]
hdr_len = vi->hdr_len;
[...]
ok:
hdr = skb_vnet_hdr(skb);
memcpy(hdr, hdr_p, hdr_len);
[...]
}
When VIRTIO_NET_F_HASH_REPORT feature is enabled, hdr_len = 20
But the sizeof(*hdr) is 12,
memcpy(hdr, hdr_p, hdr_len); will copy 20 bytes to the hdr,
which make a potential risk of bug. And this risk can be avoided by
introducing struct skb_vnet_common_hdr.
Change log
v1->v2
feedback from Willem de Bruijn <willemdebruijn.kernel@gmail.com>
feedback from Simon Horman <horms@kernel.org>
1. change to use net-next tree.
2. move skb_vnet_common_hdr inside kernel file instead of the UAPI header.
v2->v3
feedback from Willem de Bruijn <willemdebruijn.kernel@gmail.com>
1. fix typo in commit message.
2. add original struct virtio_net_hdr into union
3. remove virtio_net_hdr_mrg_rxbuf variable in receive_buf;
Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert list_for_each() to list_for_each_entry() where applicable.
No functional changed.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Pavlu says:
====================
Convert mlx4 to use auxiliary bus
This series converts the mlx4 drivers to use auxiliary bus, similarly to
how mlx5 was converted [1]. The first 6 patches are preparatory changes,
the remaining 4 are the final conversion.
Initial motivation for this change was to address a problem related to
loading mlx4_en/mlx4_ib by mlx4_core using request_module_nowait(). When
doing such a load in initrd, the operation is asynchronous to any init
control and can get unexpectedly affected/interrupted by an eventual
root switch. Using an auxiliary bus leaves these module loads to udevd
which better integrates with systemd processing. [2]
General benefit is to get rid of custom interface logic and instead use
a common facility available for this task. An obvious risk is that some
new bug is introduced by the conversion.
Leon Romanovsky was kind enough to check for me that the series passes
their verification tests.
Changes since v2 [3]:
* Use 'void *' as the event param of mlx4_dispatch_event().
Changes since v1 [4]:
* Fix a missing definition of the err variable in mlx4_en_add().
* Remove not needed comments about the event type in mlx4_en_event()
and mlx4_ib_event().
[1] https://lore.kernel.org/netdev/20201101201542.2027568-1-leon@kernel.org/
[2] https://lore.kernel.org/netdev/0a361ac2-c6bd-2b18-4841-b1b991f0635e@suse.com/
[3] https://lore.kernel.org/netdev/20230813145127.10653-1-petr.pavlu@suse.com/
[4] https://lore.kernel.org/netdev/20230804150527.6117-1-petr.pavlu@suse.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After the conversion to use the auxiliary bus, the custom device
management is not needed anymore and can be deleted.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the auxiliary bus to perform device management of the infiniband
part of the mlx4 driver.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the auxiliary bus to perform device management of the ethernet part
of the mlx4 driver.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an auxiliary virtual bus to model the mlx4 driver structure. The
code is added along the current custom device management logic.
Subsequent patches switch mlx4_en and mlx4_ib to the auxiliary bus and
the old interface is then removed.
Structure mlx4_priv gains a new adev dynamic array to keep track of its
auxiliary devices. Access to the array is protected by the global
mlx4_intf mutex.
Functions mlx4_register_device() and mlx4_unregister_device() are
updated to expose auxiliary devices on the bus in order to load mlx4_en
and/or mlx4_ib. Functions mlx4_register_auxiliary_driver() and
mlx4_unregister_auxiliary_driver() are added to substitute
mlx4_register_interface() and mlx4_unregister_interface(), respectively.
Function mlx4_do_bond() is adjusted to walk over the adev array and
re-adds a specific auxiliary device if its driver sets the
MLX4_INTFF_BONDING flag.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mlx4_core driver has a logic that allows a sub-driver to set the
MLX4_INTFF_BONDING flag which then causes that function mlx4_do_bond()
asks the sub-driver to fully re-probe a device when its bonding
configuration changes.
Performing this operation is disallowed in mlx4_register_interface()
when it is detected that any mlx4 device is multifunction (SRIOV). The
code then resets MLX4_INTFF_BONDING in the driver flags.
Move this check directly into mlx4_do_bond(). It provides a better
separation as mlx4_core no longer directly modifies the sub-driver flags
and it will allow to get rid of explicitly keeping track of all mlx4
devices by the intf.c code when it is switched to an auxiliary bus.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function mlx4_en_queue_bond_work() is used in mlx4_en to start a bond
reconfiguration. It gathers data about a new port map setting, takes
a reference on the netdev that triggered the change and queues a work
object on mlx4_en_priv.mdev.workqueue to perform the operation. The
scheduled work is mlx4_en_bond_work() which calls
mlx4_bond()/mlx4_unbond() and consequently mlx4_do_bond().
At the same time, function mlx4_change_port_types() in mlx4_core might
be invoked to change the port type configuration. As part of its logic,
it re-registers the whole device by calling mlx4_unregister_device(),
followed by mlx4_register_device().
The two operations can result in concurrent access to the data about
currently active interfaces on the device.
Functions mlx4_register_device() and mlx4_unregister_device() lock the
intf_mutex to gain exclusive access to this data. The current
implementation of mlx4_do_bond() doesn't do that which could result in
an unexpected behavior. An updated version of mlx4_do_bond() for use
with an auxiliary bus goes and locks the intf_mutex when accessing a new
auxiliary device array.
However, doing so can then result in the following deadlock:
* A two-port mlx4 device is configured as an Ethernet bond.
* One of the ports is changed from eth to ib, for instance, by writing
into a mlx4_port<x> sysfs attribute file.
* mlx4_change_port_types() is called to update port types. It invokes
mlx4_unregister_device() to unregister the device which locks the
intf_mutex and starts removing all associated interfaces.
* Function mlx4_en_remove() gets invoked and starts destroying its first
netdev. This triggers mlx4_en_netdev_event() which recognizes that the
configured bond is broken. It runs mlx4_en_queue_bond_work() which
takes a reference on the netdev. Removing the netdev now cannot
proceed until the work is completed.
* Work function mlx4_en_bond_work() gets scheduled. It calls
mlx4_unbond() -> mlx4_do_bond(). The latter function tries to lock the
intf_mutex but that is not possible because it is held already by
mlx4_unregister_device().
This particular case could be possibly solved by unregistering the
mlx4_en_netdev_event() notifier in mlx4_en_remove() earlier, but it
seems better to decouple mlx4_en more and break this reference order.
Avoid then this scenario by recognizing that the bond reconfiguration
operates only on a mlx4_dev. The logic to queue and execute the bond
work can be moved into the mlx4_core driver. Only a reference on the
respective mlx4_dev object is needed to be taken during the work's
lifetime. This removes a call from mlx4_en that can directly result in
needing to lock the intf_mutex, it remains a privilege of the core
driver.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mlx4_interface.activate callback was introduced in commit
79857cd31fe7 ("net/mlx4: Postpone the registration of net_device"). It
dealt with a situation when a netdev notifier received a NETDEV_REGISTER
event for a new net_device created by mlx4_en but the same device was
not yet visible to mlx4_get_protocol_dev(). The callback can be removed
now that mlx4_get_protocol_dev() is gone.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a notifier to implement mlx4_dispatch_event() in preparation to
switch mlx4_en and mlx4_ib to be an auxiliary device.
A problem is that if the mlx4_interface.event callback was replaced with
something as mlx4_adrv.event then the implementation of
mlx4_dispatch_event() would need to acquire a lock on a given device
before executing this callback. That is necessary because otherwise
there is no guarantee that the associated driver cannot get unbound when
the callback is running. However, taking this lock is not possible
because mlx4_dispatch_event() can be invoked from the hardirq context.
Using an atomic notifier allows the driver to accurately record when it
wants to receive these events and solves this problem.
A handler registration is done by both mlx4_en and mlx4_ib at the end of
their mlx4_interface.add callback. This matches the current situation
when mlx4_add_device() would enable events for a given device
immediately after this callback, by adding the device on the
mlx4_priv.list.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function mlx4_dispatch_event() takes an 'unsigned long' as its event
parameter. The actual value is none (MLX4_DEV_EVENT_CATASTROPHIC_ERROR),
a pointer to mlx4_eqe (MLX4_DEV_EVENT_PORT_MGMT_CHANGE), or a 32-bit
integer (remaining events).
In preparation to switch mlx4_en and mlx4_ib to be an auxiliary device,
the mlx4_interface.event callback is replaced with a notifier and
function mlx4_dispatch_event() gets updated to invoke
atomic_notifier_call_chain(). This requires forwarding the input 'param'
value from the former function to the latter. A problem is that the
notifier call takes 'void *' as its 'param' value, compared to
'unsigned long' used by mlx4_dispatch_event(). Re-passing the value
would need either punning it to 'void *' or passing down the address of
the input 'param'. Both approaches create a number of unnecessary casts.
Change instead the input 'param' of mlx4_dispatch_event() from
'unsigned long' to 'void *'. A mlx4_eqe pointer can be passed directly,
callers using an int value are adjusted to pass its address.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the mlx4_en_dev.nb notifier_block member to netdev_nb in
preparation to add a mlx4 core notifier_block.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify the mlx4 driver interface by removing mlx4_get_protocol_dev()
and the associated mlx4_interface.get_dev callbacks. This is done in
preparation to use an auxiliary bus to model the mlx4 driver structure.
The change is motivated by the following situation:
* The mlx4_en interface is being initialized by mlx4_en_add() and
mlx4_en_activate().
* The latter activate function calls mlx4_en_init_netdev() ->
register_netdev() to register a new net_device.
* A netdev event NETDEV_REGISTER is raised for the device.
* The netdev notififier mlx4_ib_netdev_event() is called and it invokes
mlx4_ib_scan_netdevs() -> mlx4_get_protocol_dev() ->
mlx4_en_get_netdev() [via mlx4_interface.get_dev].
This chain creates a problem when mlx4_en gets switched to be an
auxiliary driver. It contains two device calls which would both need to
take a respective device lock.
Avoid this situation by updating mlx4_ib_scan_netdevs() to no longer
call mlx4_get_protocol_dev() but instead to utilize the information
passed in net_device.parent and net_device.dev_port. This data is
sufficient to determine that an updated port is one that the mlx4_ib
driver should take care of and to keep mlx4_ib_dev.iboe.netdevs up to
date.
Following that, update mlx4_ib_get_netdev() to also not call
mlx4_get_protocol_dev() and instead scan all current netdevs to find
find a matching one. Note that mlx4_ib_get_netdev() is called early from
ib_register_device() and cannot use data tracked in
mlx4_ib_dev.iboe.netdevs which is not at that point yet set.
Finally, remove function mlx4_get_protocol_dev() and the
mlx4_interface.get_dev callbacks (only mlx4_en_get_netdev()) as they
became unused.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8cd160a29415 ("qede: convert to new udp_tunnel_nic infra")
removed qede_udp_tunnel_{add,del}() but not the declarations.
Commit 0ebcebbef1cc ("qed: Read device port count from the shmem")
removed qed_device_num_engines() but not its declaration.
Commit 1e128c81290a ("qed: Add support for hardware offloaded FCoE.")
declared but never implemented qed_fcoe_set_pf_params().
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of the newer silicon versions in CN10K series supports a feature
where in the current PTP timestamp in HW can be updated atomically
without losing any cpu cycles unlike read/modify/write register.
This patch uses this feature so that PTP accuracy can be improved
while adjusting the master offset in HW. There is no need for SW
timecounter when using this feature. So removed references to SW
timecounter wherever appropriate.
Signed-off-by: Sai Krishna <saikrishnag@marvell.com>
Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJBBAABCAArFiEEgKkgxbID4Gn1hq6fcJGo2a1f9gAFAmTku5sNHGZ3QHN0cmxl
bi5kZQAKCRBwkajZrV/2ANYQEAC+Ub5YNzQ7tcABozPWRpno4i3rHxBHCdp1faDu
9ISdxwq62k4ynhrGb4UyVrw8PJDxlFKOtnmx1GnB7/FTwFpbIfqc4D/I0sO6RIn8
z4G7ph9afen1Qme9Y03/5XE/C+HYDBy8bK4efHUUWyiAQJcTQnrdwC6EokXHnsRK
zARvfyTD91IsIFZLkArqVe2VvvThhJL44Xci+vfPkTXQHI30nnYRGFn/gWnEbU2j
jTi4rHm58oAJbYuEt2YCn6O9TwtcnbvxT0VcIb7viiWeJ+dHhGhsx89Sy1Qd37ko
m3qZ7ZxR2+oEHWWpgnXrI6kMrN3ZH5DjR/pMlFnoiwnfgjfsnludMwzneRDszi9Q
97/e5EP7WqSr0VRAge7HmgCDapbFSdIRLa4ZpCyX7CdIY1nIHajk7PJNnjq+xJ2X
YHyjDY14HHi436nMTwKXzPECiqVgaOpqx9PgIlGmssTzfOYGO8+Q/bGy2cuLOz65
a++iHM9hcAQV6VJfOB45CtFQmIKC4rWf1eC7Ba/oFDRvbfiaLZ1t5vawoBugyDeY
5RbGWJobjlo3V0BnzFS56wBYNgdOqO7pfXzrvzpKZJyLdRFMIdwrioCCeerbNF+M
vEh7RuEiKW6ydN7jlBD9ZxxgeGxwPYML8H8Ru8BW4NcjOzxrzq0MkRz03+HYBqhj
akfj8g==
=OSUE
-----END PGP SIGNATURE-----
Merge tag 'nf-next-23-08-22' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter updates for net-next
First patch resolves a fortify warning by wrapping the to-be-copied
members via struct_group.
Second patch replaces array[0] with array[] in ebtables uapi.
Both changes from GONG Ruiqi.
The largest chunk is replacement of strncpy with strscpy_pad()
in netfilter, from Justin Stitt.
Last patch, from myself, aborts ruleset validation if a fatal
signal is pending, this speeds up process exit.
* tag 'nf-next-23-08-22' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: allow loop termination for pending fatal signal
netfilter: xtables: refactor deprecated strncpy
netfilter: x_tables: refactor deprecated strncpy
netfilter: nft_meta: refactor deprecated strncpy
netfilter: nft_osf: refactor deprecated strncpy
netfilter: nf_tables: refactor deprecated strncpy
netfilter: nf_tables: refactor deprecated strncpy
netfilter: ipset: refactor deprecated strncpy
netfilter: ebtables: replace zero-length array members
netfilter: ebtables: fix fortify warnings in size_entry_mwt()
====================
Link: https://lore.kernel.org/r/20230822154336.12888-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mat Martineau says:
====================
mptcp: Prepare MPTCP packet scheduler for BPF extension
The kernel's MPTCP packet scheduler has, to date, been a one-size-fits
all algorithm that is hard-coded. It attempts to balance latency and
throughput when transmitting data across multiple TCP subflows, and has
some limited tunability through sysctls. It has been a long-term goal of
the Linux MPTCP community to support customizable packet schedulers for
use cases that need to make different trade-offs regarding latency,
throughput, redundancy, and other metrics. BPF is well-suited for
configuring customized, per-packet scheduling decisions without having
to modify the kernel or manage out-of-tree kernel modules.
The first steps toward implementing BPF packet schedulers are to update
the existing MPTCP transmit loops to allow more flexible scheduling
decisions, and to add infrastructure for swappable packet schedulers.
The existing scheduling algorithm remains the default. BPF-related
changes will be in a future patch series.
This code has been in the MPTCP development tree for quite a while,
undergoing testing in our CI and community.
Patches 1 and 2 refactor the transmit code and do some related cleanup.
Patches 3-9 add infrastructure for registering and calling multiple
schedulers.
Patch 10 connects the in-kernel default scheduler to the new
infrastructure.
====================
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-0-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch defines the default packet scheduler mptcp_sched_default.
Register it in mptcp_sched_init(), which is invoked in mptcp_proto_init().
Skip deleting this default scheduler in mptcp_unregister_scheduler().
Set msk->sched to the default scheduler when the input parameter of
mptcp_init_sched() is NULL.
Invoke mptcp_sched_default_get_subflow in get_send() and get_retrans()
if the defaut scheduler is set or msk->sched is NULL.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-10-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds the multiple subflows support for __mptcp_retrans(). Use
get_retrans() wrapper instead of mptcp_subflow_get_retrans() in it.
Check the subflow scheduled flags to test which subflow or subflows are
picked by the scheduler, use them to send data.
Move msk_owned_by_me() and fallback checks into get_retrans() wrapper
from mptcp_subflow_get_retrans().
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-9-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds the multiple subflows support for __mptcp_push_pending
and __mptcp_subflow_push_pending. Use get_send() wrapper instead of
mptcp_subflow_get_send() in them.
Check the subflow scheduled flags to test which subflow or subflows are
picked by the scheduler, use them to send data.
Move msk_owned_by_me() and fallback checks into get_send() wrapper from
mptcp_subflow_get_send().
This commit allows the scheduler to set the subflow->scheduled bit in
multiple subflows, but it does not allow for sending redundant data.
Multiple scheduled subflows will send sequential data on each subflow.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-8-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch defines two packet scheduler wrappers mptcp_sched_get_send()
and mptcp_sched_get_retrans(), invoke get_subflow() of msk->sched in
them.
Set data->reinject to true in mptcp_sched_get_retrans(), set it false in
mptcp_sched_get_send().
If msk->sched is NULL, use default functions mptcp_subflow_get_send()
and mptcp_subflow_get_retrans() to send data.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-7-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds a new member scheduled in struct mptcp_subflow_context,
which will be set in the MPTCP scheduler context when the scheduler
picks this subflow to send data.
Add a new helper mptcp_subflow_set_scheduled() to set this flag using
WRITE_ONCE().
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-6-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds a new struct member sched in struct mptcp_sock.
And two helpers mptcp_init_sched() and mptcp_release_sched() to
init and release it.
Init it with the sysctl scheduler in mptcp_init_sock(), copy the
scheduler from the parent in mptcp_sk_clone(), and release it in
__mptcp_destroy_sock().
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-5-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds a new sysctl, named scheduler, to support for selection
of different schedulers. Export mptcp_get_scheduler helper to get this
sysctl.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-4-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch defines struct mptcp_sched_ops, which has three struct members,
name, owner and list, and four function pointers: init(), release() and
get_subflow().
The scheduler function get_subflow() have a struct mptcp_sched_data
parameter, which contains a reinject flag for retrans or not, a subflows
number and a mptcp_subflow_context array.
Add the scheduler registering, unregistering and finding functions to add,
delete and find a packet scheduler on the global list mptcp_sched_list.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-3-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the burst check conditions have moved out of the function
mptcp_subflow_get_send(), it makes all msk->last_snd useless.
This patch drops them as well as the macro MPTCP_RESET_SCHEDULER.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-2-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To support redundant package schedulers more easily, this patch refactors
__mptcp_push_pending() logic from:
For each dfrag:
While sends succeed:
Call the scheduler (selects subflow and msk->snd_burst)
Update subflow locks (push/release/acquire as needed)
Send the dfrag data with mptcp_sendmsg_frag()
Update already_sent, snd_nxt, snd_burst
Update msk->first_pending
Push/release on final subflow
->
While first_pending isn't empty:
Call the scheduler (selects subflow and msk->snd_burst)
Update subflow locks (push/release/acquire as needed)
For each pending dfrag:
While sends succeed:
Send the dfrag data with mptcp_sendmsg_frag()
Update already_sent, snd_nxt, snd_burst
Update msk->first_pending
Break if required by msk->snd_burst / etc
Push/release on final subflow
Refactors __mptcp_subflow_push_pending logic from:
For each dfrag:
While sends succeed:
Call the scheduler (selects subflow and msk->snd_burst)
Send the dfrag data with mptcp_subflow_delegate(), break
Send the dfrag data with mptcp_sendmsg_frag()
Update dfrag->already_sent, msk->snd_nxt, msk->snd_burst
Update msk->first_pending
->
While first_pending isn't empty:
Call the scheduler (selects subflow and msk->snd_burst)
Send the dfrag data with mptcp_subflow_delegate(), break
Send the dfrag data with mptcp_sendmsg_frag()
For each pending dfrag:
While sends succeed:
Send the dfrag data with mptcp_sendmsg_frag()
Update already_sent, snd_nxt, snd_burst
Update msk->first_pending
Break if required by msk->snd_burst / etc
Move the duplicate code from __mptcp_push_pending() and
__mptcp_subflow_push_pending() into a new helper function, named
__subflow_push_pending(). Simplify __mptcp_push_pending() and
__mptcp_subflow_push_pending() by invoking this helper.
Also move the burst check conditions out of the function
mptcp_subflow_get_send(), check them in __subflow_push_pending() in
the inner "for each pending dfrag" loop.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-1-0c860fb256a8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) aRFS ethtool stats
Improve aRFS observability by adding new set of counters. Each Rx
ring will have this set of counters listed below.
These counters are exposed through ethtool -S.
1.1) arfs_add: number of times a new rule has been created.
1.2) arfs_request_in: number of times a rule was requested to move from
its current Rx ring to a new Rx ring (incremented on the destination
Rx ring).
1.3) arfs_request_out: number of times a rule was requested to move out
from its current Rx ring (incremented on source/current Rx ring).
1.4) arfs_expired: number of times a rule has been expired by the
kernel and removed from HW.
1.5) arfs_err: number of times a rule creation or modification has
failed.
2) Supporting inline WQE when possible in SW steering
3) Misc cleanups and fixups to net-next branch
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmTjpQUACgkQSD+KveBX
+j5o5ggAtapFYgHRQKC7gZOcAs1oKHUWij4vfN0/TGk0EY8jvIPQaYm/OI/P6dc1
/6ENMKwyc5UmYuFHpdmYtdmS9hd2yjB7CGv0oSLwtYPXSeUK3/gJj6RU/KoA8oA4
Qhv7rxPl7ZBa1TQEw7DqIUKFJdaN78vQiNLPU96veKgoHohbhqO+2pOGpZtDhzjg
2jSvVJ213rgfT1ZuUiGeMviAhIrKZ0t5vvKxJtST72UrWOyDZdlPUdb7/kjfoDJO
ZkOTKZyNySYjtknhBdiR6a0CzP63wvkRLbtqE4HyUh38GpnNLB7Fgij22d5c14+1
9IRYIVSGQti5mcp/IXcpYU58+7g1ag==
=ecFF
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2023-08-16' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-08-16
1) aRFS ethtool stats
Improve aRFS observability by adding new set of counters. Each Rx
ring will have this set of counters listed below.
These counters are exposed through ethtool -S.
1.1) arfs_add: number of times a new rule has been created.
1.2) arfs_request_in: number of times a rule was requested to move from
its current Rx ring to a new Rx ring (incremented on the destination
Rx ring).
1.3) arfs_request_out: number of times a rule was requested to move out
from its current Rx ring (incremented on source/current Rx ring).
1.4) arfs_expired: number of times a rule has been expired by the
kernel and removed from HW.
1.5) arfs_err: number of times a rule creation or modification has
failed.
2) Supporting inline WQE when possible in SW steering
3) Misc cleanups and fixups to net-next branch
* tag 'mlx5-updates-2023-08-16' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Devcom, only use devcom after NULL check in mlx5_devcom_send_event()
net/mlx5: DR, Supporting inline WQE when possible
net/mlx5: Rename devlink port ops struct for PFs/VFs
net/mlx5: Remove VPORT_UPLINK handling from devlink_port.c
net/mlx5: Call mlx5_esw_offloads_rep_load/unload() for uplink port directly
net/mlx5: Update dead links in Kconfig documentation
net/mlx5: Remove health syndrome enum duplication
net/mlx5: DR, Remove unneeded local variable
net/mlx5: DR, Fix code indentation
net/mlx5: IRQ, consolidate irq and affinity mask allocation
net/mlx5e: Fix spelling mistake "Faided" -> "Failed"
net/mlx5e: aRFS, Introduce ethtool stats
net/mlx5e: aRFS, Warn if aRFS table does not exist for aRFS rule
net/mlx5e: aRFS, Prevent repeated kernel rule migrations requests
====================
Link: https://lore.kernel.org/r/20230821175739.81188-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dev_queue_xmit_nit() already uses rcu_read_lock() / rcu_read_unlock()
and nothing suggests that softIRQs should be disabled around it.
Therefore, remove the rcu_read_lock_bh() / rcu_read_unlock_bh()
surrounding it.
Tested using [1] with lockdep enabled.
[1]
#!/bin/bash
ip link add name vrf1 up type vrf table 100
ip link add name veth0 type veth peer name veth1
ip link set dev veth1 master vrf1
ip link set dev veth0 up
ip link set dev veth1 up
ip address add 192.0.2.1/24 dev veth0
ip address add 192.0.2.2/24 dev veth1
ip rule add pref 32765 table local
ip rule del pref 0
tcpdump -i vrf1 -c 20 -w /dev/null &
sleep 10
ping -i 0.1 -c 10 -q 192.0.2.2
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230821142339.1889961-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The function is not called from an atomic context so use GFP_KERNEL
instead of GFP_ATOMIC. The allocation of the per-CPU stats is already
performed with GFP_KERNEL.
Tested using test_vxlan_vnifiltering.sh with CONFIG_DEBUG_ATOMIC_SLEEP.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20230821141923.1889776-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 264a9c5c9dff ("net: sparx5: Remove unused GLAG handling in PGID")
removed sparx5_pgid_alloc_glag() but not its declaration.
Commit 27d293cceee5 ("net: microchip: sparx5: Add support for rule count by cookie")
removed vcap_rule_iter() but not its declaration.
Commit 8beef08f4618 ("net: microchip: sparx5: Adding initial VCAP API support")
declared but never implemented vcap_api_set_client().
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230821135556.43224-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit fbfb8031533c ("ionic: Add hardware init and device commands")
declared but never implemented ionic_q_rewind()/ionic_set_dma_mask().
Commit 969f84394604 ("ionic: sync the filters in the work task")
declared but never implemented ionic_rx_filters_need_sync().
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Acked-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20230821134717.51936-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 91a98917a883 ("net: dsa: microchip: move switch chip_id detection to ksz_common")
removed ksz8_switch_detect() but not its declaration.
Commit 6ec23aaaac43 ("net: dsa: microchip: move ksz_dev_ops to ksz_common.c")
declared but never implemented other functions.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20230821125501.19624-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
abort early so task can exit faster if a fatal signal is pending,
no need to continue validation in that case.
Signed-off-by: Florian Westphal <fw@strlen.de>
Prefer `strscpy_pad` as it's a more robust interface whilst maintaing
zero-padding behavior.
There may have existed a bug here due to both `tbl->repl.name` and
`info->name` having a size of 32 as defined below:
| #define XT_TABLE_MAXNAMELEN 32
This may lead to buffer overreads in some situations -- `strscpy` solves
this by guaranteeing NUL-termination of the dest buffer.
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Use `strscpy_pad` over `strncpy` for NUL-terminated strings.
We can also drop the + 1 from `NFT_OSF_MAXGENRELEN + 1` since `strscpy`
will guarantee NUL-termination.
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>