This patch changes the CPT mailbox message format to
support the new CPT1 block in 98xx silicon.
cpt_rd_wr_reg ->
Modify the cpt_rd_wr_reg mailbox and its handler to
accommodate the new CPT1 block.
cpt_lf_alloc ->
Modify the cpt_lf_alloc mailbox and its handler to
configure LFs from a given block address when multiple
blocks of the same type are present. If a PF/VF needs to
configure LFs from both blocks, this mailbox should be
called twice.
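A rough sketch of the kind of change described above (a minimal illustration; the struct name and layout are not the exact octeontx2 definitions): the request gains a block-address selector so the AF knows which CPT block the PF/VF is targeting.

  struct cpt_lf_alloc_req_msg {
          struct mbox_msghdr hdr; /* common mailbox header */
          /* ... existing LF allocation parameters ... */
          int blkaddr;            /* selects CPT0 or CPT1 on 98xx */
  };

A PF/VF that needs LFs from both blocks would send this request twice, once per blkaddr.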
Signed-off-by: Mahipal Challa <mchalla@marvell.com>
Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The mdio_bus reset code first de-asserted the reset by allocating with
GPIOD_OUT_LOW, then asserted and de-asserted again. In other words, if
the reset signal defaulted to asserted, there'd be a short "spike"
before the reset.
Here is what happens depending on the pre-existing state of the reset
signal:
Reset (previously asserted):   ~~~|_|~~~~|_______
Reset (previously deasserted): _____|~~~~|_______
                                  ^ ^    ^
                                  A B    C
At point A, the low going transition is because the reset line is
requested using GPIOD_OUT_LOW. If the line is successfully requested,
the first thing we do is set it high _without_ any delay. This is
point B. So, a glitch occurs between A and B.
We then fsleep() and finally set the GPIO low at point C.
Requesting the line using GPIOD_OUT_HIGH eliminates the A and B
transitions. Instead we get:
Reset (previously asserted) :  ~~~~~~~~~~|______
Reset (previously deasserted): ____|~~~~~|______
                                   ^     ^
                                   A     C
Where A and C are the points described above in the code. Point B
has been eliminated.
The issue was found when we pulled down the reset signal for the
Marvell 88E1512P PHY (because it requires at least 50ms after POR with
an active clock). Looking at the reset signal with a scope revealed a
short spike, point B in the artwork above.
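A minimal sketch of the resulting sequence (simplified; identifiers are approximate, not a verbatim copy of the mdio_bus code):

  /* Request the line already asserted: no A/B glitch. */
  gpiod = devm_gpiod_get_optional(&bus->dev, "reset", GPIOD_OUT_HIGH);
  if (IS_ERR(gpiod))
          return PTR_ERR(gpiod);
  if (gpiod) {
          bus->reset_gpiod = gpiod;
          fsleep(bus->reset_delay_us);        /* hold reset asserted */
          gpiod_set_value_cansleep(gpiod, 0); /* point C: deassert */
  }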
Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20210202143239.10714-1-mike.looijmans@topic.nl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Probe should return an error code if platform_get_irq_byname() fails
but it returns success instead.
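The fix is essentially a one-liner; a hedged sketch of the pattern (identifiers approximate):

  irq_xtr = platform_get_irq_byname(pdev, "xtr");
  if (irq_xtr < 0)
          return irq_xtr; /* previously the error was not propagated */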
Fixes: 6c30384eb1de ("net: mscc: ocelot: register devlink ports")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/YBkXyFIl4V9hgxYM@mwanda
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are several error handling bugs in mscc_ocelot_init_ports(). I
went through the code, and carefully audited it and made fixes and
cleanups.
1) The ocelot_probe_port() function didn't have a mirror release function
so it was hard to follow. I created the ocelot_release_port()
function.
2) In the ocelot_probe_port() function, if the register_netdev() call
failed, then it led to a double free_netdev(dev) bug. Fix this by
setting "ocelot->ports[port] = NULL" on the error path.
3) I was concerned that the "port" which comes from of_property_read_u32()
might be out of bounds so I added a check for that.
4) In the original code if ocelot_regmap_init() failed then the driver
tried to continue but I think that should be a fatal error.
5) If ocelot_probe_port() failed then the most recent devlink was leaked.
The fix for this mostly came from Vladimir Oltean. Get rid of "registered_ports"
and just set a bit in "devlink_ports_registered" to say when the
devlink port has been registered (and needs to be unregistered on
error). There are fewer than 32 ports so a u32 is large enough for
this purpose.
6) The error handling if the final ocelot_port_devlink_init() failed had
two problems. The "while (port-- >= 0)" loop should have used a
"--port" pre-decrement instead of a post-decrement to avoid a buffer
underflow. The "if (!registered_ports[port])" condition was reversed,
leading to resource leaks and double frees (see the sketch below).
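A hedged sketch of the corrected unwind loop from point 6 (function and variable names are approximate): port must be a signed int, the pre-decrement stops the loop before the index underflows, and only ports whose devlink port was actually registered are torn down.

  while (--port >= 0) {
          if (!(devlink_ports_registered & BIT(port)))
                  continue;
          ocelot_port_devlink_teardown(ocelot, port);
  }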
Fixes: 6c30384eb1de ("net: mscc: ocelot: register devlink ports")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/YBkXhqRxHtRGzSnJ@mwanda
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Brian Vazquez says:
====================
net: use INDIRECT_CALL in some dst_ops
This patch series uses the INDIRECT_CALL wrappers in some dst_ops
functions to mitigate retpoline costs. Benefits depend on the
platform as described below.
Background: The kernel rewrites the retpoline code at
__x86_indirect_thunk_r11 depending on the CPU's requirements.
The INDIRECT_CALL wrappers provide hints on possible targets and
save the retpoline overhead using a direct call in case the
target matches one of the hints.
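For reference, the wrapper expands to a compare-and-direct-call, and the dst_check() patch in this series uses it roughly as follows (a sketch close to, but not necessarily identical to, the final code):

  /* include/linux/indirect_call_wrapper.h provides, roughly: */
  #define INDIRECT_CALL_1(f, f1, ...) \
          (likely(f == f1) ? f1(__VA_ARGS__) : f(__VA_ARGS__))

  /* dst_check() can then skip the retpoline for the common targets: */
  static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
  {
          if (dst->obsolete)
                  dst = INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check,
                                           ipv4_dst_check, dst, cookie);
          return dst;
  }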
The retpoline overhead for the following three cases has been
measured by Luigi Rizzo in microbenchmarks, using CPU performance
counters, and cover reasonably well the range of possible retpoline
overheads compared to a plain indirect call (in equal conditions,
specifically with predicted branch, hot cache):
- just "jmp *(%r11)" on modern platforms like Intel Cascadelake.
In this case the overhead is just 2 clock cycles:
- "lfence; jmp *(%r11)" on e.g. some recent AMD CPUs.
In this case the lfence is blocked until pending reads complete,
so the actual overhead depends on previous instructions.
In the best case we have measured 15 clock cycles of overhead.
- worst case, e.g. skylake, the full retpoline is used:
      __x86_indirect_thunk_r11:
              call set_up_target
      capture_speculation:
              pause
              lfence
              jmp capture_speculation
      .align 16
      set_up_target:
              mov %r11, (%rsp)
              ret
In this case the overhead has been measured at 35-40 clock cycles.
The actual time saved hence depends on the platform and current
clock speed (which varies heavily, especially when C-states are active).
Also note that actual benefit might be lower than expected if the
longer retpoline overlaps with some pending memory read.
MEASUREMENTS:
The INDIRECT_CALL wrappers in this patchset involve the processing
of incoming SYN and generation of syncookies. Hence, the test has been
run by configuring a receiving host with a single NIC rx queue, disabling
RPS and RFS so that all processing occurs on the same core.
An external source generates SYN fast enough to saturate the receiving CPU.
We ran two sets of experiments, with and without the dst_output patch,
comparing the number of syncookies generated over a 20s period
in multiple runs.
Assuming the CPU is saturated, the time per packet is
t = total_time/number_of_packets
and if the two datasets have statistically meaningful difference,
the difference in times between the two cases gives an estimate
of the benefits from one INDIRECT_CALL.
Here are the experimental results:
Skylake Syncookies over 20s (5 tests)
---------------------------------------------------
indirect 9166325 9182023 9170093 9134014 9171082
retpoline 9099308 9126350 9154841 9056377 9122376
Computing the stats on the ns_pkt = 20e6/total_packets gives the following:
$ ministat -c 95 -w 70 /tmp/sk-indirect /tmp/sk-retp
x /tmp/sk-indirect
+ /tmp/sk-retp
+----------------------------------------------------------------------+
|x xx x + x + + + +|
||______M__A_______|_|____________M_____A___________________| |
+----------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 2.17817e-06 2.18962e-06 2.181e-06 2.182292e-06 4.3252133e-09
+ 5 2.18464e-06 2.20839e-06 2.19241e-06 2.194974e-06 8.8695958e-09
Difference at 95.0% confidence
1.2682e-08 +/- 1.01766e-08
0.581132% +/- 0.466326%
(Student's t, pooled s = 6.97772e-09)
This suggests a difference of 13ns +/- 10ns
Our expectation from microbenchmarks was 35-40 cycles per call,
but part of the gains may be eaten by stalls from pending memory reads.
For Cascadelake:
Cascadelake Syncookies over 20s (5 tests)
---------------------------------------------------------
indirect 10339797 10297547 10366826 10378891 10384854
retpoline 10332674 10366805 10320374 10334272 10374087
Computing the stats on the ns_pkt = 20e6/total_packets gives no
meaningful difference even at just 80% (this was expected):
$ ministat -c 80 -w 70 /tmp/cl-indirect /tmp/cl-retp
x /tmp/cl-indirect
+ /tmp/cl-retp
+----------------------------------------------------------------------+
| x x + * x + + + x|
||______________|_M_________A_____A_______M________|___| |
+----------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 1.92588e-06 1.94221e-06 1.92923e-06 1.931716e-06 6.6936746e-09
+ 5 1.92788e-06 1.93791e-06 1.93531e-06 1.933188e-06 4.3734106e-09
No difference proven at 80.0% confidence
====================
Link: https://lore.kernel.org/r/20210201174132.3534118-1-brianvv@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch avoids the indirect call for the common case:
ip6_dst_check and ipv4_dst_check
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch avoids the indirect call for the common case:
ip6_mtu and ipv4_mtu
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch avoids the indirect call for the common case:
ip6_output and ip_output
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch avoids the indirect call for the common case:
ip_local_deliver and ip6_input
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This converts the driver to use the new tasklet API introduced in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")
It is unfortunate that we need to add a pointer to the driver context to
get back to the usbnet device, but the space will be reclaimed once
there are no more users of the old API left and we can remove the data
value and flag from the tasklet struct.
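For context, the conversion pattern looks roughly like this (simplified, not the exact usbnet diff; the hand-off to usbnet_bh() is illustrative): the callback now receives the tasklet pointer and recovers the driver context via from_tasklet(), which wraps container_of().

  /* before: tasklet_init(&dev->bh, usbnet_bh, (unsigned long)dev); */

  static void usbnet_bh_tasklet(struct tasklet_struct *t)
  {
          struct usbnet *dev = from_tasklet(dev, t, bh);

          usbnet_bh(&dev->delay); /* illustrative hand-off to the old handler */
  }

  tasklet_setup(&dev->bh, usbnet_bh_tasklet);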
Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Link: https://lore.kernel.org/r/20210130234637.26505-1-kernel@esmil.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
inet_gro_receive() and inet_gro_complete() are part
of the GRO engine, which cannot be modular.
Similarly, inet_gso_segment() does not need to be exported,
being part of the GSO stack.
In other words, net/ipv6/ip6_offload.o is part of vmlinux,
regardless of CONFIG_IPV6.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20210202154145.1568451-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'mac80211-next-for-net-next-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
This time, only RTNL locking reduction fallout.
- cfg80211_dev_rename() requires RTNL
- cfg80211_change_iface() and cfg80211_set_encryption()
require wiphy mutex (was missing in wireless extensions)
- cfg80211_destroy_ifaces() requires wiphy mutex
- netdev registration can fail due to notifiers, and then
notifiers are "unrolled", need to handle this properly
* tag 'mac80211-next-for-net-next-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next:
cfg80211: fix netdev registration deadlock
cfg80211: call cfg80211_destroy_ifaces() with wiphy lock held
wext: call cfg80211_set_encryption() with wiphy lock held
wext: call cfg80211_change_iface() with wiphy lock held
nl80211: call cfg80211_dev_rename() under RTNL
====================
Link: https://lore.kernel.org/r/20210202144106.38207-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mat Martineau says:
====================
mptcp: ADD_ADDR enhancements
This patch series from the MPTCP tree contains enhancements and
associated tests for the ADD_ADDR ("add address") MPTCP option. This
option allows already-connected MPTCP peers to share additional IP
addresses with each other, which can then be used to create additional
subflows within those MPTCP connections.
Patches 1 & 2 remove duplicated data in the per-connection path manager
structure.
Patches 3-6 initiate additional subflows when an address is added using
the netlink path manager interface and improve ADD_ADDR signaling
reliability, subject to configured limits. Self tests are also updated.
Patches 7-15 add new support for optional port numbers in ADD_ADDR. This
includes creating an additional in-kernel TCP listening socket for the
requested port number, validating the port number when processing
incoming subflow connections, including the port number in netlink
interfaces, and adding some new MIBs. New self test cases are added for
subflows connecting with alternate port numbers.
====================
Link: https://lore.kernel.org/r/20210201230920.66027-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds testcases for ADD_ADDR with port and the related MIB
counters check in chk_add_nr. The output looks like this:
24 signal address with port        syn[ ok ] - synack[ ok ] - ack[ ok ]
                                   add[ ok ] - echo [ ok ] - pt [ ok ]
                                   syn[ ok ] - synack[ ok ] - ack[ ok ]
                                   syn[ ok ] - ack [ ok ]
25 subflow and signal with port    syn[ ok ] - synack[ ok ] - ack[ ok ]
                                   add[ ok ] - echo [ ok ] - pt [ ok ]
                                   syn[ ok ] - synack[ ok ] - ack[ ok ]
                                   syn[ ok ] - ack [ ok ]
26 remove single address with port syn[ ok ] - synack[ ok ] - ack[ ok ]
                                   add[ ok ] - echo [ ok ] - pt [ ok ]
                                   syn[ ok ] - synack[ ok ] - ack[ ok ]
                                   syn[ ok ] - ack [ ok ]
                                   rm [ ok ] - sf [ ok ]
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds the mibs for ADD_ADDR with port:
MPTCP_MIB_PORTADD for received ADD_ADDR suboption with a port number.
MPTCP_MIB_PORTSYNRX, MPTCP_MIB_PORTSYNACKRX, MPTCP_MIB_PORTACKRX, for
received MP_JOIN's SYN or SYN/ACK or ACK with a port number which is
different from the msk's port number.
MPTCP_MIB_MISMATCHPORTSYNRX and MPTCP_MIB_MISMATCHPORTACKRX, for
received SYN or ACK MP_JOIN with a mismatched port-number.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds a new argument for pm_nl_ctl tool. We can use it like
this:
# pm_nl_ctl add 10.0.2.1 flags signal port 10100
# pm_nl_ctl dump
id 1 flags signal 10.0.2.1 10100
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds MPTCP_PM_ADDR_ATTR_PORT filling and parsing in PM
netlink.
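A hedged sketch of the fill/parse pattern (error handling trimmed; the attribute name comes from the patch, while the surrounding code and byte-order handling are illustrative):

  /* fill */
  if (addr->port &&
      nla_put_u16(skb, MPTCP_PM_ADDR_ATTR_PORT, ntohs(addr->port)))
          return -EMSGSIZE;

  /* parse */
  if (tb[MPTCP_PM_ADDR_ATTR_PORT])
          addr->port = htons(nla_get_u16(tb[MPTCP_PM_ADDR_ATTR_PORT]));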
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When dealing with the address lists local_addr_list or anno_list, we
should enable the use_port parameter of the addresses_equal() helper.
Enable it in address_zero() too.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds two new helpers, subflow_use_different_sport and
subflow_use_different_dport, to check whether the subflow's source or
destination port number is different from the msk's port number. When
receiving the MP_JOIN's SYN/SYNACK/ACK, we do these port number checks
and print out the different port numbers.
And furthermore, when receiving the MP_JOIN's SYN/ACK, we also use a new
helper mptcp_pm_sport_in_anno_list to check whether this port number is
announced. If it isn't, we need to abort this connection.
This patch also populates the local address's port field in
local_address.
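The new helpers essentially compare port numbers; a sketch of the idea (close to, but not guaranteed to match, the actual mptcp code):

  static bool subflow_use_different_sport(struct mptcp_sock *msk,
                                          const struct sock *sk)
  {
          return inet_sk(sk)->inet_sport !=
                 inet_sk((struct sock *)msk)->inet_sport;
  }

subflow_use_different_dport() does the same with inet_dport, and mptcp_pm_sport_in_anno_list() walks the announced-address list to check whether the source port was actually advertised.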
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds a new helper named subflow_req_create_thmac, which is
extracted from subflow_token_join_request. It initializes subflow_req's
local_nonce and thmac fields, which are the more expensive ones to populate.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch drops the unused parameter skb in subflow_token_join_request.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch creates a listening socket when an address with a port number
is added by PM netlink, binds the socket to the new port and listens
for new connections.
When the address is removed or the addresses are flushed by PM netlink,
the listening socket is released.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds testcases to create subflows or signal addresses for the
newly added IPv4 or IPv6 addresses.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch changes the numbers of removed addresses to negative values,
leaving the positive values for the numbers of added addresses.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch changes the conditions for sending an ACK for the ADD_ADDR:
send an ACK packet for any ADD_ADDR, not just when IPv6 addresses or
port numbers are included.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/139
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, when a new MPTCP endpoint is added, the existing MPTCP
sockets are not affected.
This patch implements a new function mptcp_nl_add_subflow_or_signal_addr,
invoked when an address is added from PM netlink. This function traverses
the MPTCP sockets list and invokes mptcp_pm_create_subflow_or_signal_addr
to try to create a subflow or signal an address for the newly added
address, if the local constraints allow it.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/19
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch drops the per-msk add_addr_signal_max,
add_addr_accept_max, local_addr_max and subflows_max fields in struct
mptcp_pm_data and uses the pernet *_max values instead. It also adds
four new helpers to get the pernet *_max values separately.
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch uses WRITE_ONCE() for all the pernet add_addr_signal_max,
add_addr_accept_max, local_addr_max and subflows_max fields in struct
pm_nl_pernet to avoid concurrency issues.
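Together with the getter helpers added in the previous patch, this gives the usual lockless publish/read pairing; a sketch assuming the readers use READ_ONCE() (illustrative, not the exact mptcp code):

  /* writer, under the pernet lock */
  WRITE_ONCE(pernet->add_addr_signal_max, max);

  /* lockless reader */
  unsigned int mptcp_pm_get_add_addr_signal_max(struct mptcp_sock *msk)
  {
          struct pm_nl_pernet *pernet;

          pernet = net_generic(sock_net((struct sock *)msk), pm_nl_pernet_id);
          return READ_ONCE(pernet->add_addr_signal_max);
  }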
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
According to the vendor driver, the new chip with XID 0x54b is
essentially the same as the one with XID 0x54a, but it doesn't need the
firmware.
So add support accordingly.
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/20210202044813.1304266-1-kai.heng.feng@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel says:
====================
Add notifications when route hardware flags change
Routes installed to the kernel can be programmed to capable devices, in
which case they are marked with one of two flags. RTM_F_OFFLOAD for
routes that offload traffic from the kernel and RTM_F_TRAP for routes
that trap packets to the kernel for processing (e.g., host routes).
These flags are of interest to routing daemons since they would like to
delay advertisement of routes until they are installed in hardware. This
allows them to avoid packet loss or misrouted packets. Currently,
routing daemons do not receive any notifications when these flags are
changed, requiring them to poll the kernel tables for changes which is
inefficient.
This series addresses the issue by having the kernel emit RTM_NEWROUTE
notifications whenever these flags change. The behavior is controlled by
two sysctls (net.ipv4.fib_notify_on_flag_change and
net.ipv6.fib_notify_on_flag_change) that default to 0 (no
notifications).
Note that even if route installation in hardware is improved to be more
synchronous, these notifications are still of interest. For example, a
multipath route can change from RTM_F_OFFLOAD to RTM_F_TRAP if its
neighbours become invalid. A routing daemon can choose to withdraw /
replace the route in that case. In addition, the deletion of a route
from the kernel can prompt the installation of an identical route
(already in the kernel, with a higher metric) to hardware.
For testing purposes, netdevsim is aligned to simulate a "real" driver
that programs routes to hardware.
Series overview:
Patches #1-#2 align netdevsim to perform route programming in a
non-atomic context
Patches #3-#5 add sysctl to control IPv4 notifications
Patches #6-#8 add sysctl to control IPv6 notifications
Patch #9 extends existing fib tests to set sysctls before running tests
Patch #10 adds test for fib notifications over netdevsim
====================
Link: https://lore.kernel.org/r/20210201194757.3463461-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add test to check fib notifications behavior.
The test checks route addition, route deletion and route replacement for
both IPv4 and IPv6.
When fib_notify_on_flag_change=0, expect single notification for route
addition/deletion/replacement.
When fib_notify_on_flag_change=1, expect:
- two notifications for route addition/replacement, the first without RTM_F_TRAP
and the second with RTM_F_TRAP.
- single notification for route deletion.
$ ./fib_notifications.sh
TEST: IPv4 route addition [ OK ]
TEST: IPv4 route deletion [ OK ]
TEST: IPv4 route replacement [ OK ]
TEST: IPv6 route addition [ OK ]
TEST: IPv6 route deletion [ OK ]
TEST: IPv6 route replacement [ OK ]
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Run the test cases with both `fib_notify_on_flag_change` sysctls set to
'1', and then with both sysctls set to '0' to verify there are no
regressions in the test when notifications are added.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel,
but not necessarily in hardware.
The asynchronous nature of route installation in hardware can lead
to a routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.
It is also possible for a route already installed in hardware to change
its action and therefore its flags. For example, a host route that is
trapping packets can be "promoted" to perform decapsulation following
the installation of an IPinIP/VXLAN tunnel.
Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed. The aim is to provide an indication to user-space
(e.g., routing daemons) about the state of the route in hardware.
Introduce a sysctl that controls this behavior.
Keep the default value at 0 (i.e., do not emit notifications) for several
reasons:
- Multiple RTM_NEWROUTE notifications per route might confuse existing
routing daemons.
- Convergence reasons in routing daemons.
- The extra notifications will negatively impact the insertion rate.
- Not all users are interested in these notifications.
Move fib6_info_hw_flags_set() into a C file because it is no longer a short
function.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With the next patch, mlxsw and netdevsim will fail to compile if
CONFIG_IPV6 is disabled.
Do not call fib6_info_hw_flags_set() when IPv6 is disabled.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The next patch will emit a notification when hardware flags are changed,
in case the fib_notify_on_flag_change sysctl is set to 1.
Reading the sysctl value requires access to the net struct.
This change is consistent with the IPv4 version, which gets 'net' struct
as its first argument.
Currently, the only callers of this function are mlxsw and netdevsim.
Patch the callers to pass net.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel,
but not necessarily in hardware.
The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.
It is also possible for a route already installed in hardware to change
its action and therefore its flags. For example, a host route that is
trapping packets can be "promoted" to perform decapsulation following
the installation of an IPinIP/VXLAN tunnel.
Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed. The aim is to provide an indication to user-space
(e.g., routing daemons) about the state of the route in hardware.
Introduce a sysctl that controls this behavior.
Keep the default value at 0 (i.e., do not emit notifications) for several
reasons:
- Multiple RTM_NEWROUTE notifications per route might confuse existing
routing daemons.
- Convergence reasons in routing daemons.
- The extra notifications will negatively impact the insertion rate.
- Not all users are interested in these notifications.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Acked-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Publish fib_nlmsg_size() to allow it to be used later on from
fib_alias_hw_flags_set().
Remove the inline keyword since it shouldn't be used inside C files.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fib_dump_info() does not change 'fri', so pass it as 'const'.
It will later allow us to invoke fib_dump_info() from
fib_alias_hw_flags_set().
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, netdevsim implements dummy FIB offload and marks notified
routes with RTM_F_TRAP flag. netdevsim does not defer route notifications
to a work queue because it does not need to program any hardware.
Given that netdevsim's purpose is to both give an example implementation
and allow developers to test their code, align netdevsim to a "real"
hardware device driver like mlxsw and have it also perform the route
"programming" in a non-atomic context.
It will be used to test route flags notifications which will be added in
the next patches.
The following changes are needed when route handling is performed in WQ:
- Handle the accounting in the main context, to be able to return an
error when adding a route while all the routes are already in use.
For FIB_EVENT_ENTRY_REPLACE, increase the counter before scheduling
the delayed work, and in case this event replaces an existing route,
decrease the counter as part of the delayed work.
- For IPv6, the fen6_info->rt->fib6_siblings list cannot be used because
it might change while the delayed work is handled.
Save an array with the nexthops as part of the fib6_event struct, and take
a reference for each nexthop to prevent them from being freed while the
event is queued.
- Change GFP_ATOMIC allocations to GFP_KERNEL.
- Use a single work item that handles an ordered list of routes (see the
sketch below). Routes must be processed in the order they were submitted
to avoid logical errors that could lead to unexpected failures.
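A generic sketch of the single-work-item pattern from the last bullet (names are illustrative, not the exact netdevsim code): events are queued under a spinlock from the notifier and drained in FIFO order by the work function.

  static void nsim_fib_event_work(struct work_struct *work)
  {
          struct nsim_fib_data *data = container_of(work, struct nsim_fib_data,
                                                    fib_event_work);
          struct nsim_fib_event *event, *tmp;
          LIST_HEAD(queue);

          spin_lock_bh(&data->fib_event_queue_lock);
          list_splice_init(&data->fib_event_queue, &queue);
          spin_unlock_bh(&data->fib_event_queue_lock);

          /* Process in submission order to preserve FIB semantics. */
          list_for_each_entry_safe(event, tmp, &queue, list) {
                  nsim_fib_event(event); /* hypothetical per-event handler */
                  list_del(&event->list);
                  kfree(event);
          }
  }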
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a route is added/deleted, the appropriate counter is increased/decreased
to maintain the number of routes.
The user can limit the number of routes; based on the counter, adding more
routes than the limit is forbidden.
Currently, a single lock protects the hashtable, the list and the accounting.
The counters will be handled from both atomic and non-atomic context, while
the hashtable and the list will only be used from non-atomic context and
will therefore be protected by a separate lock.
Protect the accounting with an atomic variable instead, so no lock is needed
(see the sketch below).
v2:
* Use atomic64_sub() in nsim_nexthop_account()'s error path
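A sketch of the lockless accounting described above (illustrative names; the real driver ties the limit to devlink resources): reserve the slot atomically and roll back if the limit would be exceeded.

  static int nsim_fib_account(struct nsim_fib_entry *entry, bool add)
  {
          int err = 0;

          if (add) {
                  if ((u64)atomic64_inc_return(&entry->num) > entry->max) {
                          err = -ENOSPC;
                          atomic64_dec(&entry->num);
                  }
          } else {
                  atomic64_dec(&entry->num);
          }

          return err;
  }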
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder says:
====================
net: ipa: don't disable NAPI in suspend
This is version 2 of a series that reworks the order in which things
happen during channel stop and suspend (and start and resume), in
order to address a hang that has been observed during suspend.
The introductory message on the first version of the series gave
some history which is omitted here.
The end result of this series is that we only enable NAPI and the
I/O completion interrupt on a channel when we start the channel for
the first time. And we only disable them when stopping the channel
"for good." In other words, NAPI and the completion interrupt
remain enabled while a channel is stopped for suspend.
One comment on version 1 of the series suggested *not* returning
early on success in a function, instead having both success and
error paths return from the same point at the end of the function
block. This has been addressed in this version.
In addition, this version consolidates things a little bit, but the
net result of the series is exactly the same as version 1 (with the
exception of the return fix mentioned above).
First, patch 6 in the first version was a small step to make patch 7
easier to understand. The two have been combined now.
Second, the previous version moved (and, for suspend/resume, eliminated)
the I/O completion interrupt and NAPI disable/enable control in separate
steps (patches). Now both are moved around together in patches 5 and
6, which eliminates the need for the final (NAPI-only) patch.
I won't repeat the patch summaries provided in v1:
https://lore.kernel.org/netdev/20210129202019.2099259-1-elder@linaro.org/
Many thanks to Willem de Bruijn for his thoughtful input.
====================
Link: https://lore.kernel.org/r/20210201172850.2221624-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Transactions to send data for a network device can be allocated at
any time up until the point the TX queue is stopped. It is possible
for ipa_start_xmit() to be called in one context just before
the transmit queue is stopped in another.
Update gsi_channel_trans_last() so that for TX channels the
allocated and pending transaction lists are checked--in addition
to the completed and polled lists--to determine the "last"
transaction. This means any transaction that has been allocated
before the TX queue is stopped will be allowed to complete before
we conclude the channel is quiesced.
Rework the function a bit to use a list pointer and gotos.
Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
No completion interrupts will occur while an endpoint is suspended,
nor when a channel has been stopped for suspend. So there's no need
to disable the interrupt during suspend and re-enable it when
resuming. Without any interrupts occurring, there is no need to
disable/re-enable NAPI for channel suspend/resume either.
We'll only enable NAPI and the interrupt when we first start the
channel, and disable it again only when it's "really" stopped.
To accomplish this, move the enable/disable calls out of
__gsi_channel_start() and __gsi_channel_stop(), and into
gsi_channel_start() and gsi_channel_stop() instead.
Add a call to napi_synchronize() to gsi_channel_suspend(), to ensure
NAPI polling is done before moving on.
Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Disable both the I/O completion interrupt and NAPI polling on a
channel *after* we successfully stop it rather than before. This
ensures a completion occurring just before the channel is stopped
gets processed.
Enable NAPI polling and the interrupt *before* starting a channel
rather than after, to be symmetric. A stopped channel won't
generate any completion interrupts anyway.
Enable NAPI before the interrupt and disable it afterward.
Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Open-code gsi_channel_freeze() and gsi_channel_thaw() in all callers
and get rid of these two functions. This is part of reworking the
sequence of things done during channel suspend/resume and start/stop.
Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Create a new function that does most of the work of starting a
channel. What's different is that it takes a flag indicating
whether the channel should really be started or not. Create
another new function __gsi_channel_stop() that behaves similarly.
IPA v3.5.1 implements suspend using a special SUSPEND endpoint
setting. If the endpoint is suspended when an I/O completes on the
underlying GSI channel, a SUSPEND interrupt is generated.
Newer versions of IPA do not implement the SUSPEND endpoint mode.
Instead, endpoint suspend is implemented by simply stopping the
underlying GSI channel. In this case, a completing I/O on a
*stopped* channel causes the SUSPEND interrupt condition.
These new functions put all activity related to starting or stopping
a channel (including "thawing/freezing" the channel) in one place,
whether or not the channel is actually started or stopped.
Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Create a new helper function that encapsulates issuing a set of
channel stop commands, retrying if appropriate, with a short delay
between attempts.
Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If an error occurs starting a channel, don't "thaw" it.
We should assume the channel remains in a non-started state.
Update the comment in gsi_channel_stop(); calls to this function
are no longer retried.
Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>