As with the previous patch, we preallocate to skb's scm_fp_list an
array of struct unix_edge, one entry per inflight AF_UNIX fd.
Here we only preallocate the memory and do not use it immediately,
because sendmsg() could still fail after this point. The actual use
will be in the next patch.
When we queue an skb with inflight edges, we will set the inflight
socket's unix_sock as unix_edge->predecessor and the receiver's
unix_sock as its successor, and then we will link the edge to the
inflight socket's unix_vertex.edges.
Note that we set the cloned scm_fp_list.edges to NULL in scm_fp_dup()
so that MSG_PEEK does not change the shape of the directed graph.
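As a rough sketch of the shape described here and in the previous patch
(only the relevant members are shown, and the exact layout in the final
patch may differ):

  struct unix_edge {
          struct unix_sock  *predecessor;  /* inflight (sender-side) socket */
          struct unix_sock  *successor;    /* receiver's socket */
          struct list_head  vertex_entry;  /* linked into unix_vertex.edges */
  };

  struct scm_fp_list {
          /* existing members omitted */
          struct list_head  vertices;      /* preallocated vertices (previous patch) */
          struct unix_edge  *edges;        /* preallocated edge array, one per fd */
  };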
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will replace the garbage collection algorithm for AF_UNIX, where
we will consider each inflight AF_UNIX socket as a vertex and its file
descriptor as an edge in a directed graph.
This patch introduces a new struct unix_vertex representing a vertex
in the graph and adds its pointer to struct unix_sock.
When we send a fd using the SCM_RIGHTS message, we allocate struct
scm_fp_list to struct scm_cookie in scm_fp_copy(). Then, we bump
each refcount of the inflight fds' struct file and save them in
scm_fp_list.fp.
After that, unix_attach_fds() unnecessarily clones the scm_fp_list of
scm_cookie and sets it to the skb. (We will remove this part after
replacing the GC.)
Here, we add a new function call in unix_attach_fds() to preallocate
struct unix_vertex per inflight AF_UNIX fd and link each vertex to
skb's scm_fp_list.vertices.
When sendmsg() succeeds later, if the socket of the inflight fd is
not yet inflight itself, we will set the preallocated vertex to struct
unix_sock.vertex and link it to a global list unix_unvisited_vertices
under spin_lock(&unix_gc_lock).
If the socket is already inflight, we free the preallocated vertex.
This is to avoid taking the lock unnecessarily when sendmsg() could
fail later.
In the following patch, we will similarly allocate another struct
per edge, which will finally be linked to the inflight socket's
unix_vertex.edges.
And then, we will count the number of edges as unix_vertex.out_degree.
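A minimal sketch of what this patch describes (illustrative only; the
exact members in the final patch may differ):

  struct unix_vertex {
          struct list_head   edges;       /* outgoing unix_edge list (next patch) */
          struct list_head   entry;       /* linked into unix_unvisited_vertices */
          unsigned long      out_degree;  /* number of outgoing edges (next patch) */
  };

  struct unix_sock {
          /* existing members omitted */
          struct unix_vertex *vertex;     /* set once the socket becomes inflight */
  };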
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240325202425.60930-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TCP ehash table is often sparsely populated.
inet_twsk_purge() spends too much time calling cond_resched().
This patch can reduce time spent in inet_twsk_purge() by 20x.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240327191206.508114-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As of commit 916444df305e ("ptp: deprecate gettime64() in favor of
gettimex64()"), new PTP drivers should implement gettimex64() instead.
In addition, this variant provides timestamps from the system clock. The
readings have to be recorded right before and after reading the lowest bits
of the PHC timestamp.
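For illustration, a gettimex64() implementation following that pattern
might look roughly like the sketch below; the driver name, register
offsets, and priv layout are made up, and it assumes reading the low
PHC word latches the high word:

  #define FOO_PHC_TIME_LO  0x10  /* hypothetical register offsets */
  #define FOO_PHC_TIME_HI  0x14

  static int foo_ptp_gettimex64(struct ptp_clock_info *ptp,
                                struct timespec64 *ts,
                                struct ptp_system_timestamp *sts)
  {
          struct foo_priv *priv = container_of(ptp, struct foo_priv, ptp_info);
          u32 lo, hi;

          ptp_read_system_prets(sts);                 /* system clock right before... */
          lo = readl(priv->base + FOO_PHC_TIME_LO);   /* ...reading the low PHC word */
          ptp_read_system_postts(sts);                /* ...and right after */
          hi = readl(priv->base + FOO_PHC_TIME_HI);

          *ts = ns_to_timespec64(((u64)hi << 32) | lo);
          return 0;
  }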
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
smc_hash_sk and smc_unhash_sk are only used in af_smc.c, so make them
static and drop the exported symbols. They can still be called via the
.prot->hash()/unhash() path.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Asbjørn Sloth Tønnesen says:
====================
make skip_sw actually skip software
During development of flower-route[1], which I
recently presented at FOSDEM[2], I noticed that
CPU usage would increase the more rules I installed
into the hardware for IP forwarding offloading.
We use TC flower offload for the hottest
prefixes, and leave the long tail to the normal (non-TC)
Linux network stack for slow-path IP forwarding.
We therefore need both the hardware and software
datapath to perform well.
I found that skip_sw rules are quite expensive
in the kernel datapath, since they must be evaluated
and matched upon before the kernel checks the
skip_sw flag.
This patchset optimizes the case where all rules
are skip_sw, by implementing a TC bypass for these
cases, where TC is only used as a control plane
for the hardware path.
v4:
- Rebased onto net-next, now that net-next is open again
v3: https://lore.kernel.org/netdev/20240306165813.656931-1-ast@fiberby.net/
- Patch 3:
- Fix source_inline
- Fix build failure when CONFIG_NET_CLS is enabled without CONFIG_NET_CLS_ACT.
v2: https://lore.kernel.org/netdev/20240305144404.569632-1-ast@fiberby.net/
- Patch 1:
- Add Reviewed-By from Jiri Pirko
- Patch 2:
- Move code, to avoid forward declaration (Jiri).
- Patch 3
- Refactor to use a static key.
- Add performance data for trapping, or sending
a packet to a non-existent chain (as suggested by Marcelo).
v1: https://lore.kernel.org/netdev/20240215160458.1727237-1-ast@fiberby.net/
[1] flower-route
https://github.com/fiberby-dk/flower-route
[2] FOSDEM talk
https://fosdem.org/2024/schedule/event/fosdem-2024-3337-flying-higher-hardware-offloading-with-bird/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TC filters come in 3 variants:
- no flag (try to process in hardware, but fall back to software)
- skip_hw (do not process the filter in hardware)
- skip_sw (do not process the filter in software)
However, skip_sw is implemented such that the skip_sw
flag is only checked after the filter has already been matched.
IMHO it's common, when using skip_sw, to use it on all rules.
So if all filters in a block are skip_sw filters, then
we can bail out early, and thus avoid having to match
the filters just to check for the skip_sw flag.
This patch adds a bypass for when only TC skip_sw rules
are used. The bypass is guarded by a static key, to avoid
harming other workloads.
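Conceptually, the fast path amounts to something like the sketch below;
the identifier names are assumptions, and the per-block counters come
from the two preceding patches:

  DEFINE_STATIC_KEY_FALSE(tcf_sw_bypass_key);  /* enabled when skip_sw-only blocks exist */

  /* early in the classification path, before any filter is matched */
  if (static_branch_unlikely(&tcf_sw_bypass_key) &&
      block && block->filtercnt && block->filtercnt == block->skipswcnt)
          return TC_ACT_UNSPEC;  /* all filters are skip_sw: nothing to do in software */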
There are 3 ways that a packet from a skip_sw ruleset can
end up in the kernel path. Although the goto-non-existent-chain
case only improves by a few percent, I believe it's worth
optimizing the trap and fall-through use cases.
+----------------------------+--------+--------+--------+
| Test description | Pre- | Post- | Rel. |
| | kpps | kpps | chg. |
+----------------------------+--------+--------+--------+
| basic forwarding + notrack | 3589.3 | 3587.9 | 1.00x |
| switch to eswitch mode | 3081.8 | 3094.7 | 1.00x |
| add ingress qdisc | 3042.9 | 3063.6 | 1.01x |
| tc forward in hw / skip_sw |37024.7 |37028.4 | 1.00x |
| tc forward in sw / skip_hw | 3245.0 | 3245.3 | 1.00x |
+----------------------------+--------+--------+--------+
| tests with only skip_sw rules below: |
+----------------------------+--------+--------+--------+
| 1 non-matching rule | 2694.7 | 3058.7 | 1.14x |
| 1 n-m rule, match trap | 2611.2 | 3323.1 | 1.27x |
| 1 n-m rule, goto non-chain | 2886.8 | 2945.9 | 1.02x |
| 5 non-matching rules | 1958.2 | 3061.3 | 1.56x |
| 5 n-m rules, match trap | 1911.9 | 3327.0 | 1.74x |
| 5 n-m rules, goto non-chain| 2883.1 | 2947.5 | 1.02x |
| 10 non-matching rules | 1466.3 | 3062.8 | 2.09x |
| 10 n-m rules, match trap | 1444.3 | 3317.9 | 2.30x |
| 10 n-m rules,goto non-chain| 2883.1 | 2939.5 | 1.02x |
| 25 non-matching rules | 838.5 | 3058.9 | 3.65x |
| 25 n-m rules, match trap | 824.5 | 3323.0 | 4.03x |
| 25 n-m rules,goto non-chain| 2875.8 | 2944.7 | 1.02x |
| 50 non-matching rules | 488.1 | 3054.7 | 6.26x |
| 50 n-m rules, match trap | 484.9 | 3318.5 | 6.84x |
| 50 n-m rules,goto non-chain| 2884.1 | 2939.7 | 1.02x |
+----------------------------+--------+--------+--------+
perf top (25 n-m skip_sw rules - pre patch):
20.39% [kernel] [k] __skb_flow_dissect
16.43% [kernel] [k] rhashtable_jhash2
10.58% [kernel] [k] fl_classify
10.23% [kernel] [k] fl_mask_lookup
4.79% [kernel] [k] memset_orig
2.58% [kernel] [k] tcf_classify
1.47% [kernel] [k] __x86_indirect_thunk_rax
1.42% [kernel] [k] __dev_queue_xmit
1.36% [kernel] [k] nft_do_chain
1.21% [kernel] [k] __rcu_read_lock
perf top (25 n-m skip_sw rules - post patch):
5.12% [kernel] [k] __dev_queue_xmit
4.77% [kernel] [k] nft_do_chain
3.65% [kernel] [k] dev_gro_receive
3.41% [kernel] [k] check_preemption_disabled
3.14% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
2.88% [kernel] [k] __netif_receive_skb_core.constprop.0
2.49% [kernel] [k] mlx5e_xmit
2.15% [kernel] [k] ip_forward
1.95% [kernel] [k] mlx5e_tc_restore_tunnel
1.92% [kernel] [k] vlan_gro_receive
Test setup:
DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
Data rate measured on switch (Extreme X690), and DUT connected as
a router on a stick, with pktgen and pktsink as VLANs.
Pktgen-dpdk was in range 36.6-37.7 Mpps 64B packets across all tests.
Full test data at https://files.fiberby.net/ast/2024/tc_skip_sw/v2_tests/
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain a count of filters per block.
Counter updates are protected by cb_lock, which is
also used to protect the offload counters.
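Roughly, the update looks like the following sketch (helper and field
names are assumptions):

  static void tcf_block_filter_cnt_update(struct tcf_block *block, bool add)
  {
          down_write(&block->cb_lock);  /* same rwsem that guards the offload counters */
          if (add)
                  block->filtercnt++;
          else
                  block->filtercnt--;
          up_write(&block->cb_lock);
  }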
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain a count of skip_sw filters.
This counter is protected by the cb_lock, and is updated
at the same time as offloadcnt.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Nguyen says:
====================
ice: use less resources in switchdev
Michal Swiatkowski says:
Switchdev is using one queue per created port representor. This can
quickly lead to an Rx queue shortage, as with subfunction support a
user can create a high number of PRs.
Save one MSI-X vector and one Rx queue per PR.
Refactor the switchdev slow path to use fewer resources (even no additional
resources). Do this by removing the control plane VSI and moving its
functionality to the PF VSI. Even with the current solution, the PF acts like
an uplink and can't be used simultaneously for other use cases (adding
filters can break the slow path).
In short, do Tx via the PF VSI and Rx packets using PF resources. Rx needs
additional code in the interrupt handler to choose the correct PR netdev.
The previous solution had to queue filters; it was way more elegant but
needed one queue per PR. Besides that, this refactor mostly simplifies
the switchdev configuration.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: count representor stats
ice: do switchdev slow-path Rx using PF VSI
ice: change repr::id values
ice: remove switchdev control plane VSI
ice: control default Tx rule in lag
ice: default Tx rule instead of to queue
ice: do Tx through PF netdev in slow-path
ice: remove eswitch changing queues algorithm
====================
Link: https://lore.kernel.org/r/20240325202623.1012287-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the user wants to add an ntuple filter to an RSS context, select
the appropriate VNIC belonging to the selected RSS context and add the
VNIC destination rule.
Make the necessary changes to bnxt_add_ntuple_cls_rule().
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240325222902.220712-13-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Refactor bnxt_cfg_rfs_ring_tbl_idx() to pass in the filter structure
pointer instead of the RX ring number. This will allow an ntuple
filter to be set up for the non-default RSS contexts in the next
patch.
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240325222902.220712-12-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Support up to 32 RSS contexts per device if supported by the device.
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240325222902.220712-11-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new bnxt_modify_rss() function to modify the RSS key and RSS
indirection table. The new function can modify the parameters for
the default context or additional contexts.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240325222902.220712-10-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Modify bnxt_rfs_capable() to check that there are enough resources
to support aRFS/ntuple filters for a new RSS context requested by
the user. Existing use cases in the driver will always set the
new parameter to false.
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240325222902.220712-9-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bnxt_rfs_capable() determines the number of VNICs and RSS_CTXs
required to support aRFS and then reserves the resources. We already
have functions bnxt_get_total_vnics() and bnxt_get_total_rss_ctxs()
to do that. Simplify the code by calling these functions. It is
also more correct to do the resource reservation after
bnxt_can_reserve_rings() returns true.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240325222902.220712-8-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We will need to dynamically allocate and change indirection tables
for additional RSS contexts. Add the rss_ctx pointer parameter to
bnxt_alloc_rss_indir_tbl() and bnxt_set_dflt_rss_indir_tbl().
Existing usage will always pass rss_ctx as NULL which means the
default RSS context.
When supporting additional RSS contexts in subsequent patches, we'll
pass the valid rss_ctx to these 2 functions.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240325222902.220712-7-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add struct bnxt_rss_ctx, related storage lists, required
defines, and its alloc/free functions.
Later patches will use them in order to support multiple
RSS contexts.
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240325222902.220712-6-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current VNIC structures are stored in an array bp->vnic_info[].
The index of the array (vnic_id) is passed to all the functions that
need to reference the VNIC.
This patch changes the scheme to pass the VNIC pointer instead of the
vnic index. Subsequent patches will create additional VNICs that
will not be stored in the bp->vnic_info[] array. Using the VNIC
pointer will work for all the VNICs.
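As an illustration, using one representative helper (the full patch
touches many more), the calling convention changes roughly like this:

  /* before: helpers take an index into bp->vnic_info[] */
  int bnxt_hwrm_vnic_set_rss(struct bnxt *bp, u16 vnic_id, bool set_rss);

  /* after: helpers take the VNIC pointer directly, which also works for
   * future VNICs that will not live in the bp->vnic_info[] array
   */
  int bnxt_hwrm_vnic_set_rss(struct bnxt *bp, struct bnxt_vnic_info *vnic,
                             bool set_rss);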
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240325222902.220712-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a pure refactoring patch. The new function
bnxt_hwrm_vnic_set_rss_p5() will set up the P5_PLUS specific RSS ring
table and then call bnxt_hwrm_vnic_cfg() to set up the VNIC for proper
RSS operations. This new function will be used later for additional
RSS contexts.
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240325222902.220712-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use a new default 1 second timeout value instead of the existing
1 msec value. The driver will keep track of the remaining time
before timeout and will pass this value to bnxt_hwrm_port_ts_query().
The firmware supports timeout values up to 65535 usecs. If the
timeout value passed to bnxt_hwrm_port_ts_query() is less than the
FW max value, we will use that value to precisely control the
specified timeout. If it is larger than the FW max value, we will
use the FW max value and any additional retry to reach the desired
timeout will be done in the context of bnxt_ptp_ts_aux_work().
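The clamping logic described above boils down to something like this
hypothetical helper (the name and the jiffies-based bookkeeping are
assumptions, not the actual driver code):

  #define BNXT_PTP_QTS_MAX_TMO_US  65535U  /* firmware cap, per the text above */

  static u32 bnxt_ptp_pick_fw_tmo_us(unsigned long deadline)
  {
          u32 remaining_us;

          if (time_after_eq(jiffies, deadline))
                  return 0;  /* the 1 second budget is exhausted */

          remaining_us = jiffies_to_usecs(deadline - jiffies);
          /* below the FW cap: let the FW wait precisely; above it: cap now
           * and retry from bnxt_ptp_ts_aux_work() until the budget runs out
           */
          return min(remaining_us, BNXT_PTP_QTS_MAX_TMO_US);
  }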
Link: https://lore.kernel.org/netdev/20240229070202.107488-2-michael.chan@broadcom.com/
Cc: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20240325222902.220712-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The caller can pass this new timeout parameter to the function to
specify the firmware timeout value when requesting the TX timestamp
from the firmware. This will allow the caller to precisely control
the timeout and will be used in the next patch. In this patch, the
parameter is 0 which means to use the current default value.
Cc: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20240325222902.220712-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Romain Gantois says:
====================
Fix missing PHY-to-MAC RX clock
There is an issue with some stmmac/PHY combinations that has been reported
some time ago in a couple of different series:
Clark Wang's report:
https://lore.kernel.org/all/20230202081559.3553637-1-xiaoning.wang@nxp.com/
Clément Léger's report:
https://lore.kernel.org/linux-arm-kernel/20230116103926.276869-4-clement.leger@bootlin.com/
Stmmac controllers require an RX clock signal from the MII bus to perform
their hardware initialization successfully. This causes issues with some
PHY/PCS devices. If these devices do not bring the clock signal up before
the MAC driver initializes its hardware, then said initialization will
fail. This can happen at probe time or when the system wakes up from a
suspended state.
This series introduces new flags for phy_device and phylink_pcs. These
flags allow MAC drivers to signal to PHY/PCS drivers that the RX clock
signal should be enabled as soon as possible, and that it should always
stay enabled.
I have included specific uses of these flags that fix the RZN1 GMAC1 stmmac
driver that I am currently working on and that is not yet upstream. I have
also included changes to the at803x PHY driver that should fix the issue
that Clark Wang was having.
====================
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-0-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stmmac controllers connected to an at803x PHY cannot resume properly after
suspend when WoL is enabled. This happens because the MAC requires an RX
clock generated by the PHY to initialize its hardware properly. But the RX
clock is cut when the PHY suspends and isn't brought up until the MAC
driver resumes the phylink.
Prevent the at803x PHY driver from going into suspend if the attached MAC
driver always requires an RX clock signal.
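A minimal sketch of that guard, assuming the requirement reaches the PHY
as the PHY_F_RXC_ALWAYS_ON dev_flags bit introduced later in this series
(the real at803x suspend path does more than this):

  static int at803x_suspend(struct phy_device *phydev)
  {
          /* the attached MAC needs the RX clock at all times: skip power-down */
          if (phydev->dev_flags & PHY_F_RXC_ALWAYS_ON)
                  return 0;

          return genphy_suspend(phydev);
  }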
Reported-by: Clark Wang <xiaoning.wang@nxp.com>
Link: https://lore.kernel.org/all/20230202081559.3553637-1-xiaoning.wang@nxp.com/
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
[rgantois: commit log]
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-6-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is a recurring issue with stmmac controllers where the MAC fails to
initialize its hardware if an RX clock signal isn't provided on the MAC/PHY
link.
This causes issues when PHY or PCS devices either go into suspend while
cutting the RX clock or do not bring the clock signal up early enough for
the MAC to initialize successfully.
Set the mac_requires_rxc flag in the stmmac phylink config so that PHY/PCS
drivers know to keep the RX clock up at all times.
Reported-by: Clark Wang <xiaoning.wang@nxp.com>
Link: https://lore.kernel.org/all/20230202081559.3553637-1-xiaoning.wang@nxp.com/
Reported-by: Clément Léger <clement.leger@bootlin.com>
Link: https://lore.kernel.org/linux-arm-kernel/20230116103926.276869-4-clement.leger@bootlin.com/
Co-developed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-5-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Global stmmac support for early initialization of PCS devices requires a
generic PCS reference that can be passed to phylink_pcs_pre_init().
Currently, the mac_device_info struct contains only one PCS field, which is
specific to the Lynx model.
As PCS models are hardware-specific, it is more appropriate to have a
generic PCS field in the mac_device_info struct.
Refactor the lynx_pcs field into a generic phylink_pcs field.
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-4-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When initializing attached PHYs, there are some cases where we don't expect
any PHY to be connected. The logic uses conditions based on various local
PCS configuration, but also calls into phylink_expects_phy() via
stmmac_init_phy(), which is enough to ensure we don't try to initialize a
PHY when using a Lynx PCS, as long as we have the phy_interface set to a
802.3z mode and are using in-band negotiation.
Drop the Lynx check, making the stmmac generic code more pcs_lynx-agnostic.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
[rgantois: commit log]
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-3-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some MAC drivers (e.g. stmmac) require a continuous receive clock signal to
be generated by a PCS that is handled by a standalone PCS driver.
Such a PCS driver does not have access to a PHY device, thus cannot check
the PHY_F_RXC_ALWAYS_ON flag. They cannot check mac_requires_rxc in the
phylink config either, since it is a private member. Therefore, a new flag
is needed to signal to the PCS that it should keep the RX clock signal up
at all times.
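A sketch of how a standalone PCS driver could honour such a flag; the
field, callback wiring, and helper names here are assumptions:

  static int foo_pcs_enable(struct phylink_pcs *pcs)
  {
          struct foo_pcs *fpcs = container_of(pcs, struct foo_pcs, pcs);

          /* the MAC declared mac_requires_rxc; phylink propagated it here */
          if (pcs->rxc_always_on)
                  foo_pcs_keep_rx_clock_running(fpcs);  /* hypothetical helper */

          return 0;
  }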
Co-developed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-2-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some MAC controllers (e.g. stmmac) require their connected PHY to
continuously provide a receive clock signal. This can cause issues in two
cases:
1. The clock signal hasn't been started yet by the time the MAC driver
initializes its hardware. This can make the initialization fail, as in
the case of the rzn1 GMAC1 driver.
2. The clock signal is cut during a power saving event. By the time the
MAC is brought back up, the clock signal is still not active since
phylink_start hasn't been called yet. This brings us back to case 1.
If a PHY driver reads this flag, it should ensure that the receive clock
signal is started as soon as possible, and that it isn't brought down when
the PHY goes into suspend.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
[rgantois: commit log]
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-1-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexander Lobakin says:
====================
compiler_types: add Endianness-dependent __counted_by_{le,be}
Some structures contain flexible arrays at the end and the counter for
them, but the counter has explicit Endianness and thus __counted_by()
can't be used directly.
To increase test coverage for potential problems without breaking
anything, introduce __counted_by_{le,be} defined depending on platform's
Endianness to either __counted_by() when applicable or noop otherwise.
The first user will be virtchnl2.h from idpf, just as an example, with 9
flex structures having Little Endian counters.
Maybe it would be a good idea to introduce such attributes on the compiler
level if possible, but for now let's stick with what we have.
====================
Link: https://lore.kernel.org/r/20240327142241.1745989-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both virtchnl2.h and its consumer idpf_virtchnl.c are very error-prone.
There are 10 structures with flexible arrays at the end, but 9 of them
have their flex member counter in Little Endian.
Make the code a bit more robust by applying __counted_by_le() to those
9. LE platforms are the main target for this driver, so they would
receive additional protection.
While we're here, add __counted_by() to virtchnl2_ptype::proto_id, as
its counter is `u8` regardless of the Endianness.
Compile test on x86_64 (LE) didn't reveal any new issues after applying
the attributes.
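Illustrative structs showing the two patterns (the names are made up;
the real virtchnl2.h structures are larger):

  struct example_le_counted {
          __le16  num_entries;  /* counter with explicit (LE) endianness */
          __le32  entries[] __counted_by_le(num_entries);
  };

  struct example_u8_counted {
          u8      proto_id_count;  /* plain u8 counter: endian-agnostic */
          u8      pad;
          __le16  proto_id[] __counted_by(proto_id_count);
  };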
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/20240327142241.1745989-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To ease maintenance of virtchnl2.h, which is already messy enough,
make it self-contained by adding the missing if_ether.h include needed
for the %ETH_ALEN usage.
At the same time, virtchnl2_lan_desc.h is not used anywhere in the
file, so move this include to idpf_txrx.h to speed up C preprocessing.
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/20240327142241.1745989-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some structures contain flexible arrays at the end and the counter for
them, but the counter has explicit Endianness and thus __counted_by()
can't be used directly.
To increase test coverage for potential problems without breaking
anything, introduce __counted_by_{le,be}() defined depending on
platform's Endianness to either __counted_by() when applicable or noop
otherwise.
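The idea boils down to something like the following sketch; the exact
endianness guard used in compiler_types.h may differ:

  #ifdef CONFIG_CPU_BIG_ENDIAN
  #define __counted_by_le(member)
  #define __counted_by_be(member)  __counted_by(member)
  #else
  #define __counted_by_le(member)  __counted_by(member)
  #define __counted_by_be(member)
  #endif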
Maybe it would be a good idea to introduce such attributes on the compiler
level if possible, but for now let's stick with what we have.
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/20240327142241.1745989-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__napi_alloc_skb() is napi_alloc_skb() with the added flexibility
of choosing gfp_mask. This is a NAPI function, so GFP_ATOMIC is
implied. The only practical choice the caller has is whether to
set __GFP_NOWARN. But that's a false choice, too, allocation failures
in atomic context will happen, and printing warnings in logs,
effectively for a packet drop, is both too much and very likely
non-actionable.
This leads me to a conclusion that most uses of napi_alloc_skb()
are simply misguided, and should use __GFP_NOWARN in the first
place. We also have a "standard" way of reporting allocation
failures via the queue stat API (qstats::rx-alloc-fail).
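In other words, the wrapper can simply become something like the sketch
below (assuming __napi_alloc_skb() stays as the backend):

  static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
                                               unsigned int length)
  {
          /* NAPI context: atomic allocation, and don't warn on failure --
           * drivers report drops via qstats::rx-alloc-fail instead
           */
          return __napi_alloc_skb(napi, length, GFP_ATOMIC | __GFP_NOWARN);
  }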
The direct motivation for this patch is that one of the drivers
used at Meta calls napi_alloc_skb() (so prior to this patch without
__GFP_NOWARN), and the resulting OOM warning is the top networking
warning in our fleet.
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240327040213.3153864-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
qed_init_pci() used pci_params.pm_cap only to cache the pci_dev.pm_cap.
Drop the cache and use pci_dev.pm_cap directly.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240325224931.1462051-1-helgaas@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This counter counts the number of times get_ptype_map is executed on the
admin queue, and was previously missing from the stats report.
Signed-off-by: John Fraker <jfraker@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240325223308.618671-1-jfraker@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Niklas Söderlund says:
====================
ravb: Support describing the MDIO bus
This series adds support to the binding and driver of the Renesas
Ethernet AVB to described the MDIO bus. Currently the driver uses
the OF node of the device itself when registering the MDIO bus.
This forces any MDIO bus properties the MDIO core should react on
to be set on the device OF node. This is confusing and none of
the MDIO bus properties are described in the Ethernet AVB bindings.
Patch 1/2 extends the bindings with an optional mdio child-node
to the device that can be used to contain the MDIO bus settings.
While patch 2/2 changes the driver to use this node (if present)
when registering the MDIO bus.
If the new optional mdio child-node is not present, the driver
falls back to the old behavior and uses the device OF node like before.
This change is fully backward compatible with existing usage
of the bindings.
====================
Link: https://lore.kernel.org/r/20240325153451.2366083-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver used the DT node of the device itself when registering the
MDIO bus. While this works, it creates a problem: it forces any MDIO bus
properties to also be set on the device's DT node. This mixes the
properties of two distinctly different things and is confusing.
This change adds support for an optional mdio node to be defined as a
child to the device DT node. The child node can then be used to describe
MDIO bus properties that the MDIO core can act on when registering the
bus.
If no mdio child node is found, the driver falls back to the old behavior
and registers the MDIO bus using the device DT node. This change is
backward compatible with old bindings in use.
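A sketch of the lookup-and-fallback in the MDIO registration path
(error handling trimmed; dev and priv are assumed to be the driver's
device and private context):

  struct device_node *mdio_node;
  int error;

  /* prefer the dedicated mdio child node... */
  mdio_node = of_get_child_by_name(dev->of_node, "mdio");
  if (!mdio_node)
          /* ...and fall back to the device node for old device trees */
          mdio_node = of_node_get(dev->of_node);

  error = of_mdiobus_register(priv->mii_bus, mdio_node);
  of_node_put(mdio_node);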
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240325153451.2366083-3-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Renesas Ethernet AVB bindings do not allow the MDIO bus to be
described. This has not been needed as only a single PHY is
supported and no MDIO bus properties have been needed.
Add an optional mdio node to the binding which allows the MDIO bus to be
described and allow bus properties to be set.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20240325153451.2366083-2-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sometimes it would be convenient to read the integer as hex, e.g. for
mask values.
Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240327123130.1322921-2-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata says:
====================
selftests: Fixes for kernel CI
As discussed on the bi-weekly call on Jan 30, and in mailing around
kernel CI effort, some changes are desirable in the suite of forwarding
selftests the better to work with the CI tooling. Namely:
- The forwarding selftests use a configuration file where names of
interfaces are defined and various variables can be overridden. There
is also forwarding.config.sample that users can use as a template to
refer to when creating the config file. What happens a fair bit is
that users either do not know about this at all, or simply forget, and
are confused by cryptic failures about interfaces that cannot be
created.
In patches #1 - #3 have lib.sh just be the single source of truth with
regards to which variables exist. That includes the topology variables
which were previously only in the sample file, and any "tweak
variables", such as what tools to use, sleep times, etc.
forwarding.config.sample then becomes just a placeholder with a couple
examples. Unless specific HW should be exercised, or specific tools
used, the defaults are usually just fine.
- Several net/forwarding/ selftests (and one net/ one) cannot be run on
veth pairs, they need an actual HW interface to run on. They are
generic in the sense that any capable HW should pass them, which is
why they have been put to net/forwarding/ as opposed to drivers/net/,
but they do not generalize to veth. The fact that these tests are in
net/forwarding/, but still complaining when run, is confusing.
In patches #4 - #6 move these tests to a new directory
drivers/net/hw.
- The following patches extend the codebase to handle well test results
other than pass and fail.
Patch #7 is preparatory. It converts several log_test_skip to XFAIL,
so that tests do not spuriously end up returning non-0 when they
are not supposed to.
In patches #8 - #10, introduce some missing ksft constants, then support
having those constants in RET, and then finally in EXIT_STATUS.
- The traffic scheduler tests generate a large amount of network traffic
to test the behavior of the scheduler. This demands a relatively
high-performance computer. On slow machines, such as with a debugging
kernel, the test would spuriously fail.
It can still be useful to "go through the motions" though, to possibly
catch bugs in setup of the scheduler graph and passing packets around.
Thus we still want to run the tests, just with lowered demands.
To that end, in patches #11 - #12, introduce an environment variable
KSFT_MACHINE_SLOW, with obvious meaning. Tests can then make checks
more lenient, such as mark failures as XFAIL. A helper, xfail_on_slow,
is provided to mark performance-sensitive parts of the selftest.
- In patch #13, use a similar mechanism to mark a NH group stats
selftest to XFAIL HW stats tests when run on VETH pairs.
- All these changes complicate the hitherto straightforward logging and
checking logic, so in patch #14, add a selftest that checks this
functionality in lib.sh.
v1 (vs. an RFC circulated through linux-kselftest):
- Patch #9:
- Clarify intended usage by s/set_ret/ret_set_ksft_status/,
s/nret/ksft_status/
====================
Link: https://lore.kernel.org/r/cover.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rerunning various scenarios to make sure lib.sh changes do not impact the
observable behavior is no fun. Add a selftest at least for the bare basics
-- the mechanics of setting RET, retmsg, and EXIT_STATUS.
Since the selftest itself uses lib.sh, it would be possible to break lib.sh
in such a way that invalidates result of the selftest. Since the metatest
only uses the bare basics (just pass/fail), hopefully such fundamental
breakages would be noticed.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/6d25cedbf2d4b83614944809a34fe023fbe8db38.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the NH group stats tests are currently run on a veth topology, the
HW-stats leg of each test is SKIP'ped. But kernel networking CI interprets
skips as a sign that tooling is missing, and prompts maintainer
investigation. Lack of capability to pass a test should be expressed as
XFAIL.
Selftests that require HW should normally be put in drivers/net/hw, but
doing so for the NH counter selftests would just lead to a lot of
duplicity.
So instead, introduce a helper, xfail_on_veth(), which can be used to mark
selftests that should XFAIL instead of FAILing when run on a veth topology.
On non-veth topology, they don't do anything.
Use the helper in the HW-stats part of router_mpath_nh_lib selftest.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/15f0ab9637aa0497f164ec30e83c1c8f53d53719.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Several tests in the suite use large amounts of traffic to e.g. cause
congestion and evaluate RED or shaper performance. These tests will not run
well on a slow machine, be it one with heavy debug kernel, or a VM, or e.g.
a single-board computer. Allow users to specify an environment variable,
KSFT_MACHINE_SLOW=yes, to indicate that the tests are being run on one such
machine.
Performance sensitive tests can then use a new helper, xfail_on_slow(), to
mark parts of the test that are sensitive to low-performance machines.
The helper can be used to just mark the whole suite, like so:
xfail_on_slow tests_run
... or, on the other side of the granularity spectrum, to override
individual checks:
xfail_on_slow check_err $? "Expected much, got little."
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/99a376a2d2ffdaeee7752b1910cb0c3ea5d80fbe.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>