IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct xmit hash steps for layer3+4 as introduced by commit
49aefd131739 ("bonding: do not discard lowest hash bit for non layer3+4
hashing").
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With commit c1f897ce186a ("bonding: set default miimon value for non-arp
modes if not set") the miimon default was changed from zero to 100 if
arp_interval is also zero. Document this fact in bonding.rst.
Fixes: c1f897ce186a ("bonding: set default miimon value for non-arp modes if not set")
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The main usage of the struct thunderbolt_ip_frame_header is to handle
the packets on the media layer. The header is bound to the protocol
in which the byte ordering is crucial. However the data type definition
doesn't use that and sparse is unhappy, for example (17 altogether):
.../thunderbolt.c:718:23: warning: cast to restricted __le32
.../thunderbolt.c:966:42: warning: incorrect type in assignment (different base types)
.../thunderbolt.c:966:42: expected unsigned int [usertype] frame_count
.../thunderbolt.c:966:42: got restricted __le32 [usertype]
Switch to the bitwise types in the struct thunderbolt_ip_frame_header to
reduce this, but not completely solving (9 left), because the same data
type is used for Rx header handled locally (in CPU byte order).
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Letting the compiler remove these functions when the kernel is built
without CONFIG_PM_SLEEP support is simpler and less heavier for builds
than the use of __maybe_unused attributes.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some devlink instances may contain thousands of ports. Storing them in
linked list and looking them up is not scalable. Convert the linked list
into xarray.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sebastian Andrzej Siewior says:
====================
I started playing with HSR and run into a problem. Tested latest
upstream -rc and noticed more problems. Now it appears to work.
For testing I have a small three node setup with iperf and ping. While
iperf doesn't complain ping reports missing packets and duplicates.
====================
Link: https://lore.kernel.org/r/20221129164815.128922-1-bigeasy@linutronix.de/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This test adds a basic HSRv0 network with 3 nodes. In its current shape
it sends and forwards packets, announcements and so merges nodes based
on MAC A/B information.
It is able to detect duplicate packets and packetloss should any occur.
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
self_node_db is a list_head with one entry of struct hsr_node. The
purpose is to hold the two MAC addresses of the node itself.
It is convenient to recycle the structure. However having a list_head
and fetching always the first entry is not really optimal.
Created a new data strucure contaning the two MAC addresses named
hsr_self_node. Access that structure like an RCU protected pointer so
it can be replaced on the fly without blocking the reader.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
hsr_register_frame_out() compares new sequence_nr vs the old one
recorded in hsr_node::seq_out and if the new sequence_nr is higher then
it will be written to hsr_node::seq_out as the new value.
This operation isn't locked so it is possible that two frames with the
same sequence number arrive (via the two slave devices) and are fed to
hsr_register_frame_out() at the same time. Both will pass the check and
update the sequence counter later to the same value. As a result the
content of the same packet is fed into the stack twice.
This was noticed by running ping and observing DUP being reported from
time to time.
Instead of using the hsr_priv::seqnr_lock for the whole receive path (as
it is for sending in the master node) add an additional lock that is only
used for sequence number checks and updates.
Add a per-node lock that is used during sequence number reads and
updates.
Fixes: f421436a591d3 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sending frames via the hsr (master) device requires a sequence number
which is tracked in hsr_priv::sequence_nr and protected by
hsr_priv::seqnr_lock. Each time a new frame is sent, it will obtain a
new id and then send it via the slave devices.
Each time a packet is sent (via hsr_forward_do()) the sequence number is
checked via hsr_register_frame_out() to ensure that a frame is not
handled twice. This make sense for the receiving side to ensure that the
frame is not injected into the stack twice after it has been received
from both slave ports.
There is no locking to cover the sending path which means the following
scenario is possible:
CPU0 CPU1
hsr_dev_xmit(skb1) hsr_dev_xmit(skb2)
fill_frame_info() fill_frame_info()
hsr_fill_frame_info() hsr_fill_frame_info()
handle_std_frame() handle_std_frame()
skb1's sequence_nr = 1
skb2's sequence_nr = 2
hsr_forward_do() hsr_forward_do()
hsr_register_frame_out(, 2) // okay, send)
hsr_register_frame_out(, 1) // stop, lower seq duplicate
Both skbs (or their struct hsr_frame_info) received an unique id.
However since skb2 was sent before skb1, the higher sequence number was
recorded in hsr_register_frame_out() and the late arriving skb1 was
dropped and never sent.
This scenario has been observed in a three node HSR setup, with node1 +
node2 having ping and iperf running in parallel. From time to time ping
reported a missing packet. Based on tracing that missing ping packet did
not leave the system.
It might be possible (didn't check) to drop the sequence number check on
the sending side. But if the higher sequence number leaves on wire
before the lower does and the destination receives them in that order
and it will drop the packet with the lower sequence number and never
inject into the stack.
Therefore it seems the only way is to lock the whole path from obtaining
the sequence number and sending via dev_queue_xmit() and assuming the
packets leave on wire in the same order (and don't get reordered by the
NIC).
Cover the whole path for the master interface from obtaining the ID
until after it has been forwarded via hsr_forward_skb() to ensure the
skbs are sent to the NIC in the order of the assigned sequence numbers.
Fixes: f421436a591d3 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The hsr device is a software device. Its
net_device_ops::ndo_start_xmit() routine will process the packet and
then pass the resulting skb to dev_queue_xmit().
During processing, hsr acquires a lock with spin_lock_bh()
(hsr_add_node()) which needs to be promoted to the _irq() suffix in
order to avoid a potential deadlock.
Then there are the warnings in dev_queue_xmit() (due to
local_bh_disable() with disabled interrupts) left.
Instead trying to address those (there is qdisc and…) for netpoll sake,
just disable netpoll on hsr.
Disable netpoll on hsr and replace the _irqsave() locking with _bh().
Fixes: f421436a591d3 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Due to the hashed-MAC optimisation one problem become visible:
hsr_handle_sup_frame() walks over the list of available nodes and merges
two node entries into one if based on the information in the supervision
both MAC addresses belong to one node. The list-walk happens on a RCU
protected list and delete operation happens under a lock.
If the supervision arrives on both slave interfaces at the same time
then this delete operation can occur simultaneously on two CPUs. The
result is the first-CPU deletes the from the list and the second CPUs
BUGs while attempting to dereference a poisoned list-entry. This happens
more likely with the optimisation because a new node for the mac_B entry
is created once a packet has been received and removed (merged) once the
supervision frame has been received.
Avoid removing/ cleaning up a hsr_node twice by adding a `removed' field
which is set to true after the removal and checked before the removal.
Fixes: f266a683a4804 ("net/hsr: Better frame dispatch")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
hsr_forward_skb() a skb and keeps information in an on-stack
hsr_frame_info. hsr_get_node() assigns hsr_frame_info::node_src which is
from a RCU list. This pointer is used later in hsr_forward_do().
I don't see a reason why this pointer can't vanish midway since there is
no guarantee that hsr_forward_skb() is invoked from an RCU read section.
Use rcu_read_lock() to protect hsr_frame_info::node_src from its
assignment until it is no longer used.
Fixes: f266a683a4804 ("net/hsr: Better frame dispatch")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The hlist optimisation (which not only uses hlist_head instead of
list_head but also splits hsr_priv::node_db into an array of 256 slots)
does not consider the "node merge":
Upon starting the hsr network (with three nodes) a packet that is
sent from node1 to node3 will also be sent from node1 to node2 and then
forwarded to node3.
As a result node3 will receive 2 packets because it is not able
to filter out the duplicate. Each packet received will create a new
struct hsr_node with macaddress_A only set the MAC address it received
from (the two MAC addesses from node1).
At some point (early in the process) two supervision frames will be
received from node1. They will be processed by hsr_handle_sup_frame()
and one frame will leave early ("Node has already been merged") and does
nothing. The other frame will be merged as portB and have its MAC
address written to macaddress_B and the hsr_node (that was created for
it as macaddress_A) will be removed.
From now on HSR is able to identify a duplicate because both packets
sent from one node will result in the same struct hsr_node because
hsr_get_node() will find the MAC address either on macaddress_A or
macaddress_B.
Things get tricky with the optimisation: If sender's MAC address is
saved as macaddress_A then the lookup will work as usual. If the MAC
address has been merged into macaddress_B of another hsr_node then the
lookup won't work because it is likely that the data structure is in
another bucket. This results in creating a new struct hsr_node and not
recognising a possible duplicate.
A way around it would be to add another hsr_node::mac_list_B and attach
it to the other bucket to ensure that this hsr_node will be looked up
either via macaddress_A _or_ macaddress_B.
I however prefer to revert it because it sounds like an academic problem
rather than real life workload plus it adds complexity. I'm not an HSR
expert with what is usual size of a network but I would guess 40 to 60
nodes. With 10.000 nodes and assuming 60us for pass-through (from node
to node) then it would take almost 600ms for a packet to almost wrap
around which sounds a lot.
Revert the hash MAC addresses optimisation.
Fixes: 4acc45db71158 ("net: hsr: use hlist_head instead of list_head for mac addresses")
Cc: Juhee Kang <claudiajkang@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 9ed7bfc79542 ("sctp: fix memory leak in
sctp_stream_outq_migrate()"), sctp_sched_set_sched() is the only
place calling sched->free(), and it can actually be replaced by
sched->free_sid() on each stream, and yet there's already a loop
to traverse all streams in sctp_sched_set_sched().
This patch adds a function sctp_sched_free_sched() where it calls
sched->free_sid() for each stream to replace sched->free() calls
in sctp_sched_set_sched() and then deletes the unused free member
from struct sctp_sched_ops.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/e10aac150aca2686cb0bd0570299ec716da5a5c0.1669849471.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthieu Baerts says:
====================
mptcp: PM listener events + selftests cleanup
Thanks to the patch 6/11, the MPTCP path manager now sends Netlink events
when MPTCP listening sockets are created and closed. The reason why it is
needed is explained in the linked ticket [1]:
MPTCP for Linux, when not using the in-kernel PM, depends on the
userspace PM to create extra listening sockets before announcing
addresses and ports. Let's call these "PM listeners".
With the existing MPTCP netlink events, a userspace PM can create
PM listeners at startup time, or in response to an incoming connection.
Creating sockets in response to connections is not optimal: ADD_ADDRs
can't be sent until the sockets are created and listen()ed, and if all
connections are closed then it may not be clear to the userspace
PM daemon that PM listener sockets should be cleaned up.
Hence this feature request: to add MPTCP netlink events for listening
socket close & create, so PM listening sockets can be managed based
on application activity.
[1] https://github.com/multipath-tcp/mptcp_net-next/issues/313
Selftests for these new Netlink events have been added in patches 9,11/11.
The remaining patches introduce different cleanups and small improvements
in MPTCP selftests to ease the maintenance and the addition of new tests.
====================
Link: https://lore.kernel.org/r/20221130140637.409926-1-matthieu.baerts@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds test coverage for listening sockets created by the
in-kernel path manager in mptcp_join.sh.
It adds the listener event checking in the existing "remove single
address with port" test. The output looks like this:
003 remove single address with port syn[ ok ] - synack[ ok ] - ack[ ok ]
add[ ok ] - echo [ ok ] - pt [ ok ]
syn[ ok ] - synack[ ok ] - ack[ ok ]
syn[ ok ] - ack [ ok ]
rm [ ok ] - rmsf [ ok ] invert
CREATE_LISTENER 10.0.2.1:10100[ ok ]
CLOSE_LISTENER 10.0.2.1:10100 [ ok ]
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch moves evts_ns1 and evts_ns2 out of do_transfer() as two global
variables in mptcp_join.sh. Init them in init() and remove them in
cleanup().
Add a new helper reset_with_events() to save the outputs of 'pm_nl_ctl
events' command in them. And a new helper kill_events_pids() to kill
pids of 'pm_nl_ctl events' command. Use these helpers in userspace pm
tests.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds test coverage for listening sockets created by userspace
processes.
It adds a new test named test_listener() and a new verifying helper
verify_listener_events(). The new output looks like this:
CREATE_SUBFLOW 10.0.2.2 (ns2) => 10.0.2.1 (ns1) [OK]
DESTROY_SUBFLOW 10.0.2.2 (ns2) => 10.0.2.1 (ns1) [OK]
MP_PRIO TX [OK]
MP_PRIO RX [OK]
CREATE_LISTENER 10.0.2.2:37106 [OK]
CLOSE_LISTENER 10.0.2.2:37106 [OK]
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch makes server_evts and client_evts global in userspace_pm.sh,
then these two variables could be used in test_announce(), test_remove()
and test_subflows(). The local variable 'evts' in these three functions
then could be dropped.
Also move local variable 'file' as a global one.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some userspace pm tests failed since pm listener events have been added.
Now MPTCP_EVENT_LISTENER_CREATED event becomes the first item in the
events list like this:
type:15,family:2,sport:10006,saddr4:0.0.0.0
type:1,token:3701282876,server_side:1,family:2,saddr4:10.0.1.1,...
And no token value in this MPTCP_EVENT_LISTENER_CREATED event.
This patch fixes this by specifying the type 1 item to search for token
values.
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds two new MPTCP netlink event types for PM listening
socket create and close, named MPTCP_EVENT_LISTENER_CREATED and
MPTCP_EVENT_LISTENER_CLOSED.
Add a new function mptcp_event_pm_listener() to push the new events
with family, port and addr to userspace.
Invoke mptcp_event_pm_listener() with MPTCP_EVENT_LISTENER_CREATED in
mptcp_listen() and mptcp_pm_nl_create_listen_socket(), invoke it with
MPTCP_EVENT_LISTENER_CLOSED in __mptcp_close_ssk().
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Just to avoid classical Bash pitfall where variables are accidentally
overridden by other functions because the proper scope has not been
defined.
That's also what is done in other MPTCP selftests scripts where all non
local variables are defined at the beginning of the script and the
others are defined with the "local" keyword.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It is clearer to declare these global variables at the beginning of the
file as it is done in other MPTCP selftests rather than in functions in
the middle of the script.
So for uniformity reason, we can do the same here in mptcp_sockopt.sh.
Suggested-by: Geliang Tang <geliang.tang@suse.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The definition of 'rndh' was probably copied from one script to another
but some times, 'sec' was not defined, not used and/or not spelled
properly.
Here all the 'rndh' are now defined the same way.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some variables were set but never used.
This was not causing any issues except adding some confusion and having
shellcheck complaining about them.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A new "sandbox" net namespace is available where no other netfilter
rules have been added.
Use this new netns instead of re-using "ns1" and clean it.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I must have missed that these stats are only exposed
via the unstructured ethtool -S when they got merged.
Plumb in the structured form.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20221130013108.90062-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Arınç ÜNAL says:
====================
remove label = "cpu" from DSA dt-binding
With this patch series, we're completely getting rid of 'label = "cpu";'
which is not used by the DSA dt-binding at all.
Information for taking the patches for maintainers:
Patch 1: netdev maintainers (based off netdev/net-next.git main)
Patch 2-3: SoC maintainers (based off soc/soc.git soc/dt)
Patch 4: MIPS maintainers (based off mips/linux.git mips-next)
Patch 5: PowerPC maintainers (based off powerpc/linux.git next-test)
I've been meaning to submit this for a few months. Find the relevant
conversation here:
https://lore.kernel.org/netdev/20220913155408.GA3802998-robh@kernel.org/
Here's how I did it, for the interested (or suggestions):
Find the platforms which have got 'label = "cpu";' defined.
grep -rnw . -e 'label = "cpu";'
Remove the line where 'label = "cpu";' is included.
sed -i /'label = "cpu";'/,+d arch/arm/boot/dts/*
sed -i /'label = "cpu";'/,+d arch/arm64/boot/dts/freescale/*
sed -i /'label = "cpu";'/,+d arch/arm64/boot/dts/marvell/*
sed -i /'label = "cpu";'/,+d arch/arm64/boot/dts/mediatek/*
sed -i /'label = "cpu";'/,+d arch/arm64/boot/dts/rockchip/*
sed -i /'label = "cpu";'/,+d arch/mips/boot/dts/qca/*
sed -i /'label = "cpu";'/,+d arch/mips/boot/dts/ralink/*
sed -i /'label = "cpu";'/,+d arch/powerpc/boot/dts/turris1x.dts
sed -i /'label = "cpu";'/,+d Documentation/devicetree/bindings/net/qca,ar71xx.yaml
Restore the symlink files which typechange after running sed.
====================
Link: https://lore.kernel.org/r/20221130141040.32447-1-arinc.unal@arinc9.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is not used by the DSA dt-binding, so remove it from the examples.
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert BUG_ON() to WARN_ON_ONCE() and warn as well for unlikely
static key int overflow error-path.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the kernel was short on (atomic) memory and failed to allocate it -
don't proceed to creation of request socket. Otherwise the socket would
be unsigned and userspace likely doesn't expect that the TCP is not
MD5-signed anymore.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To do that, separate two scenarios:
- where it's the first MD5 key on the system, which means that enabling
of the static key may need to sleep;
- copying of an existing key from a listening socket to the request
socket upon receiving a signed TCP segment, where static key was
already enabled (when the key was added to the listening socket).
Now the life-time of the static branch for TCP-MD5 is until:
- last tcp_md5sig_info is destroyed
- last socket in time-wait state with MD5 key is closed.
Which means that after all sockets with TCP-MD5 keys are gone, the
system gets back the performance of disabled md5-key static branch.
While at here, provide static_key_fast_inc() helper that does ref
counter increment in atomic fashion (without grabbing cpus_read_lock()
on CONFIG_JUMP_LABEL=y). This is needed to add a new user for
a static_key when the caller controls the lifetime of another user.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a helper to allocate tcp_md5sig_info, that will help later to
do/allocate things when info allocated, once per socket.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1. With CONFIG_JUMP_LABEL=n static_key_slow_inc() doesn't have any
protection against key->enabled refcounter overflow.
2. With CONFIG_JUMP_LABEL=y static_key_slow_inc_cpuslocked()
still may turn the refcounter negative as (v + 1) may overflow.
key->enabled is indeed a ref-counter as it's documented in multiple
places: top comment in jump_label.h, Documentation/staging/static-keys.rst,
etc.
As -1 is reserved for static key that's in process of being enabled,
functions would break with negative key->enabled refcount:
- for CONFIG_JUMP_LABEL=n negative return of static_key_count()
breaks static_key_false(), static_key_true()
- the ref counter may become 0 from negative side by too many
static_key_slow_inc() calls and lead to use-after-free issues.
These flaws result in that some users have to introduce an additional
mutex and prevent the reference counter from overflowing themselves,
see bpf_enable_runtime_stats() checking the counter against INT_MAX / 2.
Prevent the reference counter overflow by checking if (v + 1) > 0.
Change functions API to return whether the increment was successful.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The open code is defined as a helper function(tp_to_dev) on r8169_main.c,
which the open code is &tp->pci_dev->dev. The helper function was added
in commit 1e1205b7d3e9 ("r8169: add helper tp_to_dev"). And then later,
commit f1e911d5d0df ("r8169: add basic phylib support") added
r8169_phylink_handler function but it didn't use the helper function.
Thus, tp_to_dev() replaces the open code. This patch doesn't change logic.
Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/20221129161244.5356-1-claudiajkang@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Vladimir Oltean says:
====================
Fix rtnl_mutex deadlock with DPAA2 and SFP modules
This patch set deliberately targets net-next and lacks Fixes: tags due
to caution on my part.
While testing some SFP modules on the Solidrun Honeycomb LX2K platform,
I noticed that rebooting causes a deadlock:
============================================
WARNING: possible recursive locking detected
6.1.0-rc5-07010-ga9b9500ffaac-dirty #656 Not tainted
--------------------------------------------
systemd-shutdow/1 is trying to acquire lock:
ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30
but task is already holding lock:
ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(rtnl_mutex);
lock(rtnl_mutex);
*** DEADLOCK ***
May be due to missing lock nesting notation
6 locks held by systemd-shutdow/1:
#0: ffffa62db6863c70 (system_transition_mutex){+.+.}-{4:4}, at: __do_sys_reboot+0xd4/0x260
#1: ffff2f2b0176f100 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xf4/0x260
#2: ffff2f2b017be900 (&dev->mutex){....}-{4:4}, at: device_shutdown+0x104/0x260
#3: ffff2f2b017680f0 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x40/0x260
#4: ffff2f2b0e1608f0 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x40/0x260
#5: ffffa62db6cf42f0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x1c/0x30
stack backtrace:
CPU: 6 PID: 1 Comm: systemd-shutdow Not tainted 6.1.0-rc5-07010-ga9b9500ffaac-dirty #656
Hardware name: SolidRun LX2160A Honeycomb (DT)
Call trace:
lock_acquire+0x68/0x84
__mutex_lock+0x98/0x460
mutex_lock_nested+0x2c/0x40
rtnl_lock+0x1c/0x30
sfp_bus_del_upstream+0x1c/0xac
phylink_destroy+0x1c/0x50
dpaa2_mac_disconnect+0x28/0x70
dpaa2_eth_remove+0x1dc/0x1f0
fsl_mc_driver_remove+0x24/0x60
device_remove+0x70/0x80
device_release_driver_internal+0x1f0/0x260
device_links_unbind_consumers+0xe0/0x110
device_release_driver_internal+0x138/0x260
device_release_driver+0x18/0x24
bus_remove_device+0x12c/0x13c
device_del+0x16c/0x424
fsl_mc_device_remove+0x28/0x40
__fsl_mc_device_remove+0x10/0x20
device_for_each_child+0x5c/0xac
dprc_remove+0x94/0xb4
fsl_mc_driver_remove+0x24/0x60
device_remove+0x70/0x80
device_release_driver_internal+0x1f0/0x260
device_release_driver+0x18/0x24
bus_remove_device+0x12c/0x13c
device_del+0x16c/0x424
fsl_mc_bus_remove+0x8c/0x10c
fsl_mc_bus_shutdown+0x10/0x20
platform_shutdown+0x24/0x3c
device_shutdown+0x15c/0x260
kernel_restart+0x40/0xa4
__do_sys_reboot+0x1e4/0x260
__arm64_sys_reboot+0x24/0x30
But fixing this appears to be not so simple. The patch set represents my
attempt to address it.
In short, the problem is that dpaa2_mac_connect() and dpaa2_mac_disconnect()
call 2 phylink functions in a row, one takes rtnl_lock() itself -
phylink_create(), and one which requires rtnl_lock() to be held by the
caller - phylink_fwnode_phy_connect(). The existing approach in the
drivers is too simple. We take rtnl_lock() when calling dpaa2_mac_connect(),
which is what results in the deadlock.
Fixing just that creates another problem. The drivers make use of
rtnl_lock() for serializing with other code paths too. I think I've
found all those code paths, and established other mechanisms for
serializing with them.
====================
Link: https://lore.kernel.org/r/20221129141221.872653-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
After the introduction of a private mac_lock that serializes access to
priv->mac (and port_priv->mac in the switch), the only remaining purpose
of rtnl_lock() is to satisfy the locking requirements of
phylink_fwnode_phy_connect() and phylink_disconnect_phy().
But the functions these live in, dpaa2_mac_connect() and
dpaa2_mac_disconnect(), have contradictory locking requirements.
While phylink_fwnode_phy_connect() wants rtnl_lock() to be held,
phylink_create() wants it to not be held.
Move the rtnl_lock() from top-level (in the dpaa2-eth and dpaa2-switch
drivers) to only surround the phylink calls that require it, in the
dpaa2-mac library code.
This is possible because dpaa2_mac_connect() and dpaa2_mac_disconnect()
run unlocked, and there isn't any danger of an AB/BA deadlock between
the rtnl_mutex and other private locks.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The dpaa2-switch driver uses a DPMAC in the same way as the dpaa2-eth
driver, so we need to duplicate the locking solution established by the
previous change to the switch driver as well.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The dpaa2 architecture permits dynamic connections between objects on
the fsl-mc bus, specifically between a DPNI object (represented by a
struct net_device) and a DPMAC object (represented by a struct phylink).
The DPNI driver is notified when those connections are created/broken
through the dpni_irq0_handler_thread() method. To ensure that ethtool
operations, as well as netdev up/down operations serialize with the
connection/disconnection of the DPNI with a DPMAC,
dpni_irq0_handler_thread() takes the rtnl_lock() to block those other
operations from taking place.
There is code called by dpaa2_mac_connect() which wants to acquire the
rtnl_mutex once again, see phylink_create() -> phylink_register_sfp() ->
sfp_bus_add_upstream() -> rtnl_lock(). So the strategy doesn't quite
work out, even though it's fairly simple.
Create a different strategy, where all code paths in the dpaa2-eth
driver access priv->mac only while they are holding priv->mac_lock.
The phylink instance is not created or connected to the PHY under the
priv->mac_lock, but only assigned to priv->mac then. This will eliminate
the reliance on the rtnl_mutex.
Add lockdep annotations and put comments where holding the lock is not
necessary, and priv->mac can be dereferenced freely.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
dpaa2_eth_connect_mac() is called both from dpaa2_eth_probe() and from
dpni_irq0_handler_thread().
It could happen that the DPNI gets connected to a DPMAC on the fsl-mc
bus exactly during probe, as soon as the "endpoint change" interrupt is
requested in dpaa2_eth_setup_irqs(). This will cause the
dpni_irq0_handler_thread() to register a phylink instance for that DPMAC.
Then, the probing function will also try to register a phylink instance
for the same DPMAC, operation which should fail (and this will fail the
probing of the driver).
Reorder dpaa2_eth_setup_irqs() and dpaa2_eth_connect_mac(), such that
dpni_irq0_handler_thread() never races with the DPMAC-related portion of
the probing path.
Also reorder dpaa2_eth_disconnect_mac() to be in the mirror position of
dpaa2_eth_connect_mac() in the teardown path.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The helper function will gain a lockdep annotation in a future patch.
Make sure to benefit from it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
DPNIs and DPSW objects can connect and disconnect at runtime from DPMAC
objects on the same fsl-mc bus. The DPMAC object also holds "ethtool -S"
unstructured counters. Those counters are only shown for the entity
owning the netdev (DPNI, DPSW) if it's connected to a DPMAC.
The ethtool stringset code path is split into multiple callbacks, but
currently, connecting and disconnecting the DPMAC takes the rtnl_lock().
This blocks the entire ethtool code path from running, see
ethnl_default_doit() -> rtnl_lock() -> ops->prepare_data() ->
strset_prepare_data().
This is going to be a problem if we are going to no longer require
rtnl_lock() when connecting/disconnecting the DPMAC, because the DPMAC
could appear between ops->get_sset_count() and ops->get_strings().
If it appears out of the blue, we will provide a stringset into an array
that was dimensioned thinking the DPMAC wouldn't be there => array
accessed out of bounds.
There isn't really a good way to work around that, and I don't want to
put too much pressure on the ethtool framework by playing locking games.
Just make the DPMAC counters be always available. They'll be zeroes if
the DPNI or DPSW isn't connected to a DPMAC.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The dpaa2-switch has the exact same locking requirements when connected
to a DPMAC, so it needs port_priv->mac to always point either to NULL,
or to a DPMAC with a fully initialized phylink instance.
Make the same preparatory change in the dpaa2-switch driver as in the
dpaa2-eth one.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There are 2 requirements for correct code:
- Any time the driver accesses the priv->mac pointer at runtime, it
either holds NULL to indicate a DPNI-DPNI connection (or unconnected
DPNI), or a struct dpaa2_mac whose phylink instance was fully
initialized (created and connected to the PHY). No changes are made to
priv->mac while it is being used. Currently, rtnl_lock() watches over
the call to dpaa2_eth_connect_mac(), so it serves the purpose of
serializing this with all readers of priv->mac.
- dpaa2_mac_connect() should run unlocked, because inside it are 2
phylink calls with incompatible locking requirements: phylink_create()
requires that the rtnl_mutex isn't held, and phylink_fwnode_phy_connect()
requires that the rtnl_mutex is held. The only way to solve those
contradictory requirements is to let dpaa2_mac_connect() take
rtnl_lock() when it needs to.
To solve both requirements, we need to identify the writer side of the
priv->mac pointer, which can be wrapped in a mutex private to the driver
in a future patch. The dpaa2_mac_connect() cannot be part of the writer
side critical section, because of an AB/BA deadlock with rtnl_lock().
So the strategy needs to be that where we prepare the DPMAC by calling
dpaa2_mac_connect(), and only make priv->mac point to it once it's fully
prepared. This ensures that the writer side critical section has the
absolute minimum surface it can.
The reverse strategy is adopted in the dpaa2_eth_disconnect_mac() code
path. This makes sure that priv->mac is NULL when we start tearing down
the DPMAC that we disconnected from, and concurrent code will simply not
see it.
No locking changes in this patch (concurrent code is still blocked by
the rtnl_mutex).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
dpaa2_mac_disconnect() will only be called with a NULL mac->phylink if
dpaa2_mac_connect() failed, or was never called.
The callers are these:
dpaa2_eth_disconnect_mac():
if (dpaa2_eth_is_type_phy(priv))
dpaa2_mac_disconnect(priv->mac);
dpaa2_switch_port_disconnect_mac():
if (dpaa2_switch_port_is_type_phy(port_priv))
dpaa2_mac_disconnect(port_priv->mac);
priv->mac can be NULL, but in that case, dpaa2_eth_is_type_phy() returns
false, and dpaa2_mac_disconnect() is never called. Similar for
dpaa2-switch.
When priv->mac is non-NULL, it means that dpaa2_mac_connect() returned
zero (success), and therefore, priv->mac->phylink is also a valid
pointer.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The phylink handling is intended to be hidden inside the dpaa2_mac
object. Move the phylink_start() call into dpaa2_mac_start(), and
phylink_stop() into dpaa2_mac_stop().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>