IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 4abbd2e3c1db671fa1286390f1310aec78386f1d upstream.
In the smc_listen_work(), if smc_listen_prfx_check() failed,
the real reason: SMC_CLC_DECL_DIFFPREFIX was dropped, and
SMC_CLC_DECL_NOSMCDEV was returned.
Althrough this is also kind of SMC_CLC_DECL_NOSMCDEV, but return
the real reason is much friendly for debugging.
Fixes: e49300a6bf62 ("net/smc: add listen processing for SMC-Rv2")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://lore.kernel.org/r/20231012123729.29307-1-dust.li@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9d1a3c74746428102d55371fbf74b484733937d9 upstream.
bacmp() is a wrapper around memcpy(), which contain compile-time
checks for buffer overflow. Since the hci_conn_request_evt() also calls
bt_dev_dbg() with an implicit NULL pointer check, the compiler is now
aware of a case where 'hdev' is NULL and treats this as meaning that
zero bytes are available:
In file included from net/bluetooth/hci_event.c:32:
In function 'bacmp',
inlined from 'hci_conn_request_evt' at net/bluetooth/hci_event.c:3276:7:
include/net/bluetooth/bluetooth.h:364:16: error: 'memcmp' specified bound 6 exceeds source size 0 [-Werror=stringop-overread]
364 | return memcmp(ba1, ba2, sizeof(bdaddr_t));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Add another NULL pointer check before the bacmp() to ensure the compiler
understands the code flow enough to not warn about it. Since the patch
that introduced the warning is marked for stable backports, this one
should also go that way to avoid introducing build regressions.
Fixes: 1ffc6f8cc332 ("Bluetooth: Reject connection with the device which has same BD_ADDR")
Cc: Kees Cook <keescook@chromium.org>
Cc: "Lee, Chun-Yi" <jlee@suse.com>
Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 35d91d95a0cd61ebb90e0246dc917fd25e519b8c upstream.
This fixes the following code style problem:
ERROR: that open brace { should be on the previous line
+ if (!bacmp(&hdev->bdaddr, &ev->bdaddr))
+ {
Fixes: 1ffc6f8cc332 ("Bluetooth: Reject connection with the device which has same BD_ADDR")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c7f59461f5a78994613afc112cdd73688aef9076 upstream.
Syzbot reports a warning as follows:
WARNING: CPU: 1 PID: 26946 at net/bluetooth/hci_conn.c:619
hci_conn_timeout+0x122/0x210 net/bluetooth/hci_conn.c:619
...
Call Trace:
<TASK>
process_one_work+0x884/0x15c0 kernel/workqueue.c:2630
process_scheduled_works kernel/workqueue.c:2703 [inline]
worker_thread+0x8b9/0x1290 kernel/workqueue.c:2784
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304
</TASK>
It is because the HCI_EV_SIMPLE_PAIR_COMPLETE event handler drops
hci_conn directly without check Simple Pairing whether be enabled. But
the Simple Pairing process can only be used if both sides have the
support enabled in the host stack.
Add hci_conn_ssp_enabled() for hci_conn in HCI_EV_IO_CAPA_REQUEST and
HCI_EV_SIMPLE_PAIR_COMPLETE event handlers to fix the problem.
Fixes: 0493684ed239 ("[Bluetooth] Disable disconnect timer during Simple Pairing")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1ffc6f8cc33268731fcf9629fc4438f6db1191fc upstream.
This change is used to relieve CVE-2020-26555. The description of
the CVE:
Bluetooth legacy BR/EDR PIN code pairing in Bluetooth Core Specification
1.0B through 5.2 may permit an unauthenticated nearby device to spoof
the BD_ADDR of the peer device to complete pairing without knowledge
of the PIN. [1]
The detail of this attack is in IEEE paper:
BlueMirror: Reflections on Bluetooth Pairing and Provisioning Protocols
[2]
It's a reflection attack. The paper mentioned that attacker can induce
the attacked target to generate null link key (zero key) without PIN
code. In BR/EDR, the key generation is actually handled in the controller
which is below HCI.
A condition of this attack is that attacker should change the
BR_ADDR of his hacking device (Host B) to equal to the BR_ADDR with
the target device being attacked (Host A).
Thus, we reject the connection with device which has same BD_ADDR
both on HCI_Create_Connection and HCI_Connection_Request to prevent
the attack. A similar implementation also shows in btstack project.
[3][4]
Cc: stable@vger.kernel.org
Link: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-26555 [1]
Link: https://ieeexplore.ieee.org/abstract/document/9474325/authors#authors [2]
Link: https://github.com/bluekitchen/btstack/blob/master/src/hci.c#L3523 [3]
Link: https://github.com/bluekitchen/btstack/blob/master/src/hci.c#L7297 [4]
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 33155c4aae5260475def6f7438e4e35564f4f3ba upstream.
This change is used to relieve CVE-2020-26555. The description of the
CVE:
Bluetooth legacy BR/EDR PIN code pairing in Bluetooth Core Specification
1.0B through 5.2 may permit an unauthenticated nearby device to spoof
the BD_ADDR of the peer device to complete pairing without knowledge
of the PIN. [1]
The detail of this attack is in IEEE paper:
BlueMirror: Reflections on Bluetooth Pairing and Provisioning Protocols
[2]
It's a reflection attack. The paper mentioned that attacker can induce
the attacked target to generate null link key (zero key) without PIN
code. In BR/EDR, the key generation is actually handled in the controller
which is below HCI.
Thus, we can ignore null link key in the handler of "Link Key Notification
event" to relieve the attack. A similar implementation also shows in
btstack project. [3]
v3: Drop the connection when null link key be detected.
v2:
- Used Link: tag instead of Closes:
- Used bt_dev_dbg instead of BT_DBG
- Added Fixes: tag
Cc: stable@vger.kernel.org
Fixes: 55ed8ca10f35 ("Bluetooth: Implement link key handling for the management interface")
Link: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-26555 [1]
Link: https://ieeexplore.ieee.org/abstract/document/9474325/authors#authors [2]
Link: https://github.com/bluekitchen/btstack/blob/master/src/hci.c#L3722 [3]
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5093bbfc10ab6636b32728e35813cbd79feb063c upstream.
Our current route lookups (mctp_route_lookup and mctp_route_lookup_null)
traverse the net's route list without the RCU read lock held. This means
the route lookup is subject to preemption, resulting in an potential
grace period expiry, and so an eventual kfree() while we still have the
route pointer.
Add the proper read-side critical section locks around the route
lookups, preventing premption and a possible parallel kfree.
The remaining net->mctp.routes accesses are already under a
rcu_read_lock, or protected by the RTNL for updates.
Based on an analysis from Sili Luo <rootlab@huawei.com>, where
introducing a delay in the route lookup could cause a UAF on
simultaneous sendmsg() and route deletion.
Reported-by: Sili Luo <rootlab@huawei.com>
Fixes: 889b7da23abf ("mctp: Add initial routing framework")
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/29c4b0e67dc1bf3571df3982de87df90cae9b631.1696837310.git.jk@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7563cf17dce0a875ba3d872acdc63a78ea344019 upstream.
Direct calls to ops->connect() can overwrite the address parameter when
used in conjunction with BPF SOCK_ADDR hooks. Recent changes to
kernel_connect() ensure that callers are insulated from such side
effects. This patch wraps the direct call to ops->connect() with
kernel_connect() to prevent unexpected changes to the address passed to
ceph_tcp_connect().
This change was originally part of a larger patch targeting the net tree
addressing all instances of unprotected calls to ops->connect()
throughout the kernel, but this change was split up into several patches
targeting various trees.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/netdev/20230821100007.559638-1-jrife@google.com/
Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/
Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect")
Signed-off-by: Jordan Rife <jrife@google.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b650d953cd391595e536153ce30b4aab385643ac ]
Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.
To reproduce: Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()). This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].
As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.
The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender. This problem has previously been identified in
RFC 7323, appendix F [1].
The Linux kernel currently adheres to never shrinking the window.
In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session. A receive buffer full condition should instead result
in a zero window and an indefinite wait.
In practice, this problem is largely hidden for most flows. It
is not applicable to mice flows. Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.
But this problem does show up for other types of flows. Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time. In these cases, we directly
encounter the problem described in [1].
RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].
This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf). This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window
Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323
Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 354a6e707e29cb0c007176ee5b8db8be7bd2dee0 ]
The protocol is used in a bit mask to determine if the protocol is
supported. Assert the provided protocol is less than the maximum
defined so it doesn't potentially perform a shift-out-of-bounds and
provide a clearer error for undefined protocols vs unsupported ones.
Fixes: 6a2968aaf50c ("NFC: basic NCI protocol implementation")
Reported-and-tested-by: syzbot+0839b78e119aae1fec78@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0839b78e119aae1fec78
Signed-off-by: Jeremy Cline <jeremy@jcline.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231009200054.82557-1-jeremy@jcline.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a950a5921db450c74212327f69950ff03419483a ]
SMC_STAT_PAYLOAD_SUB(_smc_stats, _tech, key, _len, _rc) will calculate
wrong bucket positions for payloads of exactly 4096 bytes and
(1 << (m + 12)) bytes, with m == SMC_BUF_MAX - 1.
Intended bucket distribution:
Assume l == size of payload, m == SMC_BUF_MAX - 1.
Bucket 0 : 0 < l <= 2^13
Bucket n, 1 <= n <= m-1 : 2^(n+12) < l <= 2^(n+13)
Bucket m : l > 2^(m+12)
Current solution:
_pos = fls64((l) >> 13)
[...]
_pos = (_pos < m) ? ((l == 1 << (_pos + 12)) ? _pos - 1 : _pos) : m
For l == 4096, _pos == -1, but should be _pos == 0.
For l == (1 << (m + 12)), _pos == m, but should be _pos == m - 1.
In order to avoid special treatment of these corner cases, the
calculation is adjusted. The new solution first subtracts the length by
one, and then calculates the correct bucket by shifting accordingly,
i.e. _pos = fls64((l - 1) >> 13), l > 0.
This not only fixes the issues named above, but also makes the whole
bucket assignment easier to follow.
Same is done for SMC_STAT_RMB_SIZE_SUB(_smc_stats, _tech, k, _len),
where the calculation of the bucket position is similar to the one
named above.
Fixes: e0e4b8fa5338 ("net/smc: Add SMC statistics support")
Suggested-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Nils Hoppmann <niho@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 31c07dffafce914c1d1543c135382a11ff058d93 ]
Sili Luo reported a race in nfc_llcp_sock_get(), leading to UAF.
Getting a reference on the socket found in a lookup while
holding a lock should happen before releasing the lock.
nfc_llcp_sock_get_sn() has a similar problem.
Finally nfc_llcp_recv_snl() needs to make sure the socket
found by nfc_llcp_sock_from_sn() does not disappear.
Fixes: 8f50020ed9b8 ("NFC: LLCP late binding")
Reported-by: Sili Luo <rootlab@huawei.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20231009123110.3735515-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 26c29961b142444cd99361644c30fa1e9b3da6be ]
syzbot uses panic_on_warn.
This means that the skb_dump() I added in the blamed commit are
not even called.
Rewrite this so that we get the needed skb dump before syzbot crashes.
Fixes: eeee4b77dc52 ("net: add more debug info in skb_checksum_help()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20231006173355.2254983-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d9c2ba65e651467de739324d978b04ed8729f483 ]
With patch [1], isotp_poll was updated to also queue the poller in the
so->wait queue, which is used for send state changes. Since the queue
now also contains polling tasks that are not interested in sending, the
queue fill state can no longer be used as an indication of send
readiness. As a consequence, nonblocking writes can lead to a race and
lock-up of the socket if there is a second task polling the socket in
parallel.
With this patch, isotp_sendmsg does not consult wq_has_sleepers but
instead tries to atomically set so->tx.state and waits on so->wait if it
is unable to do so. This behavior is in alignment with isotp_poll, which
also checks so->tx.state to determine send readiness.
V2:
- Revert direct exit to goto err_event_drop
[1] https://lore.kernel.org/all/20230331125511.372783-1-michal.sojka@cvut.cz
Reported-by: Maxime Jayat <maxime.jayat@mobile-devices.fr>
Closes: https://lore.kernel.org/linux-can/11328958-453f-447f-9af8-3b5824dfb041@munic.io/
Signed-off-by: Lukas Magel <lukas.magel@posteo.net>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Fixes: 79e19fa79cb5 ("can: isotp: isotp_ops: fix poll() to not report false EPOLLOUT events")
Link: https://github.com/pylessard/python-udsoncan/issues/178#issuecomment-1743786590
Link: https://lore.kernel.org/all/20230827092205.7908-1-lukas.magel@posteo.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit c889a99a21bf124c3db08d09df919f0eccc5ea4c upstream.
Similar to the change in commit 0bdf399342c5("net: Avoid address
overwrite in kernel_connect"), BPF hooks run on bind may rewrite the
address passed to kernel_bind(). This change
1) Makes a copy of the bind address in kernel_bind() to insulate
callers.
2) Replaces direct calls to sock->ops->bind() in net with kernel_bind()
Link: https://lore.kernel.org/netdev/20230912013332.2048422-1-jrife@google.com/
Fixes: 4fbac77d2d09 ("bpf: Hooks for sys_bind")
Cc: stable@vger.kernel.org
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jordan Rife <jrife@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a5efdbcece83af94180e8d7c0a6e22947318499d ]
The delegated action infrastructure is prone to the following
race: different CPUs can try to schedule different delegated
actions on the same subflow at the same time.
Each of them will check different bits via mptcp_subflow_delegate(),
and will try to schedule the action on the related per-cpu napi
instance.
Depending on the timing, both can observe an empty delegated list
node, causing the same entry to be added simultaneously on two different
lists.
The root cause is that the delegated actions infra does not provide
a single synchronization point. Address the issue reserving an additional
bit to mark the subflow as scheduled for delegation. Acquiring such bit
guarantee the caller to own the delegated list node, and being able to
safely schedule the subflow.
Clear such bit only when the subflow scheduling is completed, ensuring
proper barrier in place.
Additionally swap the meaning of the delegated_action bitmask, to allow
the usage of the existing helper to set multiple bit at once.
Fixes: bcd97734318d ("mptcp: use delegate action to schedule 3rd ack retrans")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-1-28de4ac663ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit c486640aa710ddd06c13a7f7162126e1552e8842 upstream.
rt6_check_neigh() uses read_lock() to protect n->nud_state reading.
This seems overkill and causes false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c73a72f4cbb47672c8cc7f7d7aba52f1cb15baca upstream.
I've added a flex array to struct nlmsghdr in
commit 738136a0e375 ("netlink: split up copies in the ack construction")
to allow accessing the data easily. It leads to warnings with clang,
if user space wraps this structure into another struct and the flex
array is not at the end of the container.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/all/20221114023927.GA685@u2004-local/
Link: https://lore.kernel.org/r/20221118033903.1651026-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e69761483361f3df455bc493c99af0ef1744a14f ]
Fix coverity issue 'Resource leak'.
We should clean the skb resource if nlmsg_put/append failed.
Fixes: 738136a0e375 ("netlink: split up copies in the ack construction")
Signed-off-by: Tao Chen <chentao.kernel@linux.alibaba.com>
Link: https://lore.kernel.org/r/bff442d62c87de6299817fe1897cc5a5694ba9cc.1667638204.git.chentao.kernel@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: d0f95894fda7 ("netlink: annotate data-races around sk->sk_err")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 738136a0e3757a8534df3ad97d6ff6d7f429f6c1 ]
Clean up the use of unsafe_memcpy() by adding a flexible array
at the end of netlink message header and splitting up the header
and data copies.
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: d0f95894fda7 ("netlink: annotate data-races around sk->sk_err")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1f4e803cd9c9166eb8b6c8b0b8e4124f7499fc07 ]
Currently, when hb_interval is changed by users, it won't take effect
until the next expiry of hb timer. As the default value is 30s, users
have to wait up to 30s to wait its hb_interval update to work.
This becomes pretty bad in containers where a much smaller value is
usually set on hb_interval. This patch improves it by resetting the
hb timer immediately once the value of hb_interval is updated by users.
Note that we don't address the already existing 'problem' when sending
a heartbeat 'on demand' if one hb has just been sent(from the timer)
mentioned in:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg590224.html
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/r/75465785f8ee5df2fb3acdca9b8fafdc18984098.1696172660.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2222a78075f0c19ca18db53fd6623afb4aff602d ]
During the 4-way handshake, the transport's state is set to ACTIVE in
sctp_process_init() when processing INIT_ACK chunk on client or
COOKIE_ECHO chunk on server.
In the collision scenario below:
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021]
when processing COOKIE_ECHO on 192.168.1.2, as it's in COOKIE_WAIT state,
sctp_sf_do_dupcook_b() is called by sctp_sf_do_5_2_4_dupcook() where it
creates a new association and sets its transport to ACTIVE then updates
to the old association in sctp_assoc_update().
However, in sctp_assoc_update(), it will skip the transport update if it
finds a transport with the same ipaddr already existing in the old asoc,
and this causes the old asoc's transport state not to move to ACTIVE
after the handshake.
This means if DATA retransmission happens at this moment, it won't be able
to enter PF state because of the check 'transport->state == SCTP_ACTIVE'
in sctp_do_8_2_transport_strike().
This patch fixes it by updating the transport in sctp_assoc_update() with
sctp_assoc_add_peer() where it updates the transport state if there is
already a transport with the same ipaddr exists in the old asoc.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/r/fd17356abe49713ded425250cc1ae51e9f5846c6.1696172325.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4720852ed9afb1c5ab84e96135cb5b73d5afde6f ]
This commit fixes poor delayed ACK behavior that can cause poor TCP
latency in a particular boundary condition: when an application makes
a TCP socket write that is an exact multiple of the MSS size.
The problem is that there is painful boundary discontinuity in the
current delayed ACK behavior. With the current delayed ACK behavior,
we have:
(1) If an app reads data when > 1*MSS is unacknowledged, then
tcp_cleanup_rbuf() ACKs immediately because of:
tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
(2) If an app reads all received data, and the packets were < 1*MSS,
and either (a) the app is not ping-pong or (b) we received two
packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately beecause
of:
((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
!inet_csk_in_pingpong_mode(sk))) &&
(3) *However*: if an app reads exactly 1*MSS of data,
tcp_cleanup_rbuf() does not send an immediate ACK. This is true
even if the app is not ping-pong and the 1*MSS of data had the PSH
bit set, suggesting the sending application completed an
application write.
Thus if the app is not ping-pong, we have this painful case where
>1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
write whose last skb is an exact multiple of 1*MSS can get a 40ms
delayed ACK. This means that any app that transfers data in one
direction and takes care to align write size or packet size with MSS
can suffer this problem. With receive zero copy making 4KB MSS values
more common, it is becoming more common to have application writes
naturally align with MSS, and more applications are likely to
encounter this delayed ACK problem.
The fix in this commit is to refine the delayed ACK heuristics with a
simple check: immediately ACK a received 1*MSS skb with PSH bit set if
the app reads all data. Why? If an skb has a len of exactly 1*MSS and
has the PSH bit set then it is likely the end of an application
write. So more data may not be arriving soon, and yet the data sender
may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
an ACK immediately if the app reads all of the data and is not
ping-pong. Note that this logic is also executed for the case where
len > MSS, but in that case this logic does not matter (and does not
hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
app reads data and there is more than an MSS of unACKed data.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Guo <guoxin0309@gmail.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 059217c18be6757b95bfd77ba53fb50b48b8a816 ]
This commit fixes quick-ack counting so that it only considers that a
quick-ack has been provided if we are sending an ACK that newly
acknowledges data.
The code was erroneously using the number of data segments in outgoing
skbs when deciding how many quick-ack credits to remove. This logic
does not make sense, and could cause poor performance in
request-response workloads, like RPC traffic, where requests or
responses can be multi-segment skbs.
When a TCP connection decides to send N quick-acks, that is to
accelerate the cwnd growth of the congestion control module
controlling the remote endpoint of the TCP connection. That quick-ack
decision is purely about the incoming data and outgoing ACKs. It has
nothing to do with the outgoing data or the size of outgoing data.
And in particular, an ACK only serves the intended purpose of allowing
the remote congestion control to grow the congestion window quickly if
the ACK is ACKing or SACKing new data.
The fix is simple: only count packets as serving the goal of the
quickack mechanism if they are ACKing/SACKing new data. We can tell
whether this is the case by checking inet_csk_ack_scheduled(), since
we schedule an ACK exactly when we are ACKing/SACKing new data.
Fixes: fc6415bcb0f5 ("[TCP]: Fix quick-ack decrementing with TSO.")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 08e50cf071847323414df0835109b6f3560d44f5 ]
It seems that tipc_crypto_key_revoke() could be be invoked by
wokequeue tipc_crypto_work_rx() under process context and
timer/rx callback under softirq context, thus the lock acquisition
on &tx->lock seems better use spin_lock_bh() to prevent possible
deadlock.
This flaw was found by an experimental static analysis tool I am
developing for irq-related deadlock.
tipc_crypto_work_rx() <workqueue>
--> tipc_crypto_key_distr()
--> tipc_bcast_xmit()
--> tipc_bcbase_xmit()
--> tipc_bearer_bc_xmit()
--> tipc_crypto_xmit()
--> tipc_ehdr_build()
--> tipc_crypto_key_revoke()
--> spin_lock(&tx->lock)
<timer interrupt>
--> tipc_disc_timeout()
--> tipc_bearer_xmit_skb()
--> tipc_crypto_xmit()
--> tipc_ehdr_build()
--> tipc_crypto_key_revoke()
--> spin_lock(&tx->lock) <deadlock here>
Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication")
Link: https://lore.kernel.org/r/20230927181414.59928-1-dg573847474@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0add5c597f3253a9c6108a0a81d57f44ab0d9d30 ]
Due to a small omission, the offload_failed flag is missing from ipv4
fibmatch results. Make sure it is set correctly.
The issue can be witnessed using the following commands:
echo "1 1" > /sys/bus/netdevsim/new_device
ip link add dummy1 up type dummy
ip route add 192.0.2.0/24 dev dummy1
echo 1 > /sys/kernel/debug/netdevsim/netdevsim1/fib/fail_route_offload
ip route add 198.51.100.0/24 dev dummy1
ip route
# 192.168.15.0/24 has rt_trap
# 198.51.100.0/24 has rt_offload_failed
ip route get 192.168.15.1 fibmatch
# Result has rt_trap
ip route get 198.51.100.1 fibmatch
# Result differs from the route shown by `ip route`, it is missing
# rt_offload_failed
ip link del dev dummy1
echo 1 > /sys/bus/netdevsim/del_device
Fixes: 36c5100e859d ("IPv4: Add "offload failed" indication to routes")
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230926182730.231208-1-bpoirier@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 087388278e0f301f4c61ddffb1911d3a180f84b8 ]
nft_rbtree_gc_elem() walks back and removes the end interval element that
comes before the expired element.
There is a small chance that we've cached this element as 'rbe_ge'.
If this happens, we hold and test a pointer that has been queued for
freeing.
It also causes spurious insertion failures:
$ cat test-testcases-sets-0044interval_overlap_0.1/testout.log
Error: Could not process rule: File exists
add element t s { 0 - 2 }
^^^^^^
Failed to insert 0 - 2 given:
table ip t {
set s {
type inet_service
flags interval,timeout
timeout 2s
gc-interval 2s
}
}
The set (rbtree) is empty. The 'failure' doesn't happen on next attempt.
Reason is that when we try to insert, the tree may hold an expired
element that collides with the range we're adding.
While we do evict/erase this element, we can trip over this check:
if (rbe_ge && nft_rbtree_interval_end(rbe_ge) && nft_rbtree_interval_end(new))
return -ENOTEMPTY;
rbe_ge was erased by the synchronous gc, we should not have done this
check. Next attempt won't find it, so retry results in successful
insertion.
Restart in-kernel to avoid such spurious errors.
Such restart are rare, unless userspace intentionally adds very large
numbers of elements with very short timeouts while setting a huge
gc interval.
Even in this case, this cannot loop forever, on each retry an existing
element has been removed.
As the caller is holding the transaction mutex, its impossible
for a second entity to add more expiring elements to the tree.
After this it also becomes feasible to remove the async gc worker
and perform all garbage collection from the commit path.
Fixes: c9e6978e2725 ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0d880dc6f032e0b541520e9926f398a77d3d433c ]
When adding/updating an object, the transaction handler emits suitable
audit log entries already, the one in nft_obj_notify() is redundant. To
fix that (and retain the audit logging from objects' 'update' callback),
Introduce an "audit log free" variant for internal use.
Fixes: c520292f29b8 ("audit: log nftables configuration change events once per table")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Paul Moore <paul@paul-moore.com> (Audit)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8e56b063c86569e51eed1c5681ce6361fa97fc7a ]
In Scenario A and B below, as the delayed INIT_ACK always changes the peer
vtag, SCTP ct with the incorrect vtag may cause packet loss.
Scenario A: INIT_ACK is delayed until the peer receives its own INIT_ACK
192.168.1.2 > 192.168.1.1: [INIT] [init tag: 1328086772]
192.168.1.1 > 192.168.1.2: [INIT] [init tag: 1414468151]
192.168.1.2 > 192.168.1.1: [INIT ACK] [init tag: 1328086772]
192.168.1.1 > 192.168.1.2: [INIT ACK] [init tag: 1650211246] *
192.168.1.2 > 192.168.1.1: [COOKIE ECHO]
192.168.1.1 > 192.168.1.2: [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: [COOKIE ACK]
Scenario B: INIT_ACK is delayed until the peer completes its own handshake
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] *
This patch fixes it as below:
In SCTP_CID_INIT processing:
- clear ct->proto.sctp.init[!dir] if ct->proto.sctp.init[dir] &&
ct->proto.sctp.init[!dir]. (Scenario E)
- set ct->proto.sctp.init[dir].
In SCTP_CID_INIT_ACK processing:
- drop it if !ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] &&
ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario B, Scenario C)
- drop it if ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir] &&
ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario A)
In SCTP_CID_COOKIE_ACK processing:
- clear ct->proto.sctp.init[dir] and ct->proto.sctp.init[!dir].
(Scenario D)
Also, it's important to allow the ct state to move forward with cookie_echo
and cookie_ack from the opposite dir for the collision scenarios.
There are also other Scenarios where it should allow the packet through,
addressed by the processing above:
Scenario C: new CT is created by INIT_ACK.
Scenario D: start INIT on the existing ESTABLISHED ct.
Scenario E: start INIT after the old collision on the existing ESTABLISHED
ct.
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
(both side are stopped, then start new connection again in hours)
192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 242308742]
Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9593c7cb6cf670ef724d17f7f9affd7a8d2ad0c5 ]
Commit b0e214d21203 ("netfilter: keep conntrack reference until
IPsecv6 policy checks are done") is a direct copy of the old
commit b59c270104f0 ("[NETFILTER]: Keep conntrack reference until
IPsec policy checks are done") but for IPv6. However, it also
copies a bug that this old commit had. That is: when the third
packet of 3WHS connection establishment contains payload, it is
added into socket receive queue without the XFRM check and the
drop of connection tracking context.
That leads to nf_conntrack module being impossible to unload as
it waits for all the conntrack references to be dropped while
the packet release is deferred in per-cpu cache indefinitely, if
not consumed by the application.
The issue for IPv4 was fixed in commit 6f0012e35160 ("tcp: add a
missing nf_reset_ct() in 3WHS handling") by adding a missing XFRM
check and correctly dropping the conntrack context. However, the
issue was introduced to IPv6 code afterwards. Fixing it the
same way for IPv6 now.
Fixes: b0e214d21203 ("netfilter: keep conntrack reference until IPsecv6 policy checks are done")
Link: https://lore.kernel.org/netdev/d589a999-d4dd-2768-b2d5-89dec64a4a42@ovn.org/
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230922210530.2045146-1-i.maximets@ovn.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9d4c75800f61e5d75c1659ba201b6c0c7ead3070 ]
Including the transhdrlen in length is a problem when the packet is
partially filled (e.g. something like send(MSG_MORE) happened previously)
when appending to an IPv4 or IPv6 packet as we don't want to repeat the
transport header or account for it twice. This can happen under some
circumstances, such as splicing into an L2TP socket.
The symptom observed is a warning in __ip6_append_data():
WARNING: CPU: 1 PID: 5042 at net/ipv6/ip6_output.c:1800 __ip6_append_data.isra.0+0x1be8/0x47f0 net/ipv6/ip6_output.c:1800
that occurs when MSG_SPLICE_PAGES is used to append more data to an already
partially occupied skbuff. The warning occurs when 'copy' is larger than
the amount of data in the message iterator. This is because the requested
length includes the transport header length when it shouldn't. This can be
triggered by, for example:
sfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP);
bind(sfd, ...); // ::1
connect(sfd, ...); // ::1 port 7
send(sfd, buffer, 4100, MSG_MORE);
sendfile(sfd, dfd, NULL, 1024);
Fix this by only adding transhdrlen into the length if the write queue is
empty in l2tp_ip6_sendmsg(), analogously to how UDP does things.
l2tp_ip_sendmsg() looks like it won't suffer from this problem as it builds
the UDP packet itself.
Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
Reported-by: syzbot+62cbf263225ae13ff153@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/0000000000001c12b30605378ce8@google.com/
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: David Ahern <dsahern@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: netdev@vger.kernel.org
cc: bpf@vger.kernel.org
cc: syzkaller-bugs@googlegroups.com
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5baa0433a15eadd729625004c37463acb982eca7 ]
n->output field can be read locklessly, while a writer
might change the pointer concurrently.
Add missing annotations to prevent load-store tearing.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 09eed1192cec1755967f2af8394207acdde579a1 ]
rcu_bh is no longer a win, especially for objects freed
with standard call_rcu().
Switch neighbour code to no longer disable BH when not necessary.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 5baa0433a15e ("neighbour: fix data-races around n->output")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b071af523579df7341cabf0f16fc661125e9a13f ]
We have many lockless accesses to n->nud_state.
Before adding another one in the following patch,
add annotations to readers and writers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 5baa0433a15e ("neighbour: fix data-races around n->output")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 31de4105f00d64570139bc5494a201b0bd57349f ]
The bpf_fib_lookup() also looks up the neigh table.
This was done before bpf_redirect_neigh() was added.
In the use case that does not manage the neigh table
and requires bpf_fib_lookup() to lookup a fib to
decide if it needs to redirect or not, the bpf prog can
depend only on using bpf_redirect_neigh() to lookup the
neigh. It also keeps the neigh entries fresh and connected.
This patch adds a bpf_fib_lookup flag, SKIP_NEIGH, to avoid
the double neigh lookup when the bpf prog always call
bpf_redirect_neigh() to do the neigh lookup. The params->smac
output is skipped together when SKIP_NEIGH is set because
bpf_redirect_neigh() will figure out the smac also.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230217205515.3583372-1-martin.lau@linux.dev
Stable-dep-of: 5baa0433a15e ("neighbour: fix data-races around n->output")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 25563b581ba3a1f263a00e8c9a97f5e7363be6fd ]
While looking at a related syzbot report involving neigh_periodic_work(),
I found that I forgot to add an annotation when deleting an
RCU protected item from a list.
Readers use rcu_deference(*np), we need to use either
rcu_assign_pointer() or WRITE_ONCE() on writer side
to prevent store tearing.
I use rcu_assign_pointer() to have lockdep support,
this was the choice made in neigh_flush_dev().
Fixes: 767e97e1e0db ("neigh: RCU conversion of struct neighbour")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b80e31baa43614e086a9d29dc1151932b1bd7fc5 ]
With a SOCKMAP/SOCKHASH map and an sk_msg program user can steer messages
sent from one TCP socket (s1) to actually egress from another TCP
socket (s2):
tcp_bpf_sendmsg(s1) // = sk_prot->sendmsg
tcp_bpf_send_verdict(s1) // __SK_REDIRECT case
tcp_bpf_sendmsg_redir(s2)
tcp_bpf_push_locked(s2)
tcp_bpf_push(s2)
tcp_rate_check_app_limited(s2) // expects tcp_sock
tcp_sendmsg_locked(s2) // ditto
There is a hard-coded assumption in the call-chain, that the egress
socket (s2) is a TCP socket.
However in commit 122e6c79efe1 ("sock_map: Update sock type checks for
UDP") we have enabled redirects to non-TCP sockets. This was done for the
sake of BPF sk_skb programs. There was no indention to support sk_msg
send-to-egress use case.
As a result, attempts to send-to-egress through a non-TCP socket lead to a
crash due to invalid downcast from sock to tcp_sock:
BUG: kernel NULL pointer dereference, address: 000000000000002f
...
Call Trace:
<TASK>
? show_regs+0x60/0x70
? __die+0x1f/0x70
? page_fault_oops+0x80/0x160
? do_user_addr_fault+0x2d7/0x800
? rcu_is_watching+0x11/0x50
? exc_page_fault+0x70/0x1c0
? asm_exc_page_fault+0x27/0x30
? tcp_tso_segs+0x14/0xa0
tcp_write_xmit+0x67/0xce0
__tcp_push_pending_frames+0x32/0xf0
tcp_push+0x107/0x140
tcp_sendmsg_locked+0x99f/0xbb0
tcp_bpf_push+0x19d/0x3a0
tcp_bpf_sendmsg_redir+0x55/0xd0
tcp_bpf_send_verdict+0x407/0x550
tcp_bpf_sendmsg+0x1a1/0x390
inet_sendmsg+0x6a/0x70
sock_sendmsg+0x9d/0xc0
? sockfd_lookup_light+0x12/0x80
__sys_sendto+0x10e/0x160
? syscall_enter_from_user_mode+0x20/0x60
? __this_cpu_preempt_check+0x13/0x20
? lockdep_hardirqs_on+0x82/0x110
__x64_sys_sendto+0x1f/0x30
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Reject selecting a non-TCP sockets as redirect target from a BPF sk_msg
program to prevent the crash. When attempted, user will receive an EACCES
error from send/sendto/sendmsg() syscall.
Fixes: 122e6c79efe1 ("sock_map: Update sock type checks for UDP")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20230920102055.42662-1-jakub@cloudflare.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit da9e915eaf5dadb1963b7738cdfa42ed55212445 ]
When data is peek'd off the receive queue we shouldn't considered it
copied from tcp_sock side. When we increment copied_seq this will confuse
tcp_data_ready() because copied_seq can be arbitrarily increased. From
application side it results in poll() operations not waking up when
expected.
Notice tcp stack without BPF recvmsg programs also does not increment
copied_seq.
We broke this when we moved copied_seq into recvmsg to only update when
actual copy was happening. But, it wasn't working correctly either before
because the tcp_data_ready() tried to use the copied_seq value to see
if data was read by user yet. See fixes tags.
Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq")
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230926035300.135096-3-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9b7177b1df64b8d7f85700027c324aadd6aded00 ]
Before fix e5c6de5fa0258 tcp_read_skb() would increment the tp->copied-seq
value. This (as described in the commit) would cause an error for apps
because once that is incremented the application might believe there is no
data to be read. Then some apps would stall or abort believing no data is
available.
However, the fix is incomplete because it introduces another issue in
the skb dequeue. The loop does tcp_recv_skb() in a while loop to consume
as many skbs as possible. The problem is the call is ...
tcp_recv_skb(sk, seq, &offset)
... where 'seq' is:
u32 seq = tp->copied_seq;
Now we can hit a case where we've yet incremented copied_seq from BPF side,
but then tcp_recv_skb() fails this test ...
if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
... so that instead of returning the skb we call tcp_eat_recv_skb() which
frees the skb. This is because the routine believes the SKB has been collapsed
per comment:
/* This looks weird, but this can happen if TCP collapsing
* splitted a fat GRO packet, while we released socket lock
* in skb_splice_bits()
*/
This can't happen here we've unlinked the full SKB and orphaned it. Anyways
it would confuse any BPF programs if the data were suddenly moved underneath
it.
To fix this situation do simpler operation and just skb_peek() the data
of the queue followed by the unlink. It shouldn't need to check this
condition and tcp_read_skb() reads entire skbs so there is no need to
handle the 'offset!=0' case as we would see in tcp_read_sock().
Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq")
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230926035300.135096-2-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 31db78a4923ef5e2008f2eed321811ca79e7f71b ]
When ieee80211_key_link() is called by ieee80211_gtk_rekey_add()
but returns 0 due to KRACK protection (identical key reinstall),
ieee80211_gtk_rekey_add() will still return a pointer into the
key, in a potential use-after-free. This normally doesn't happen
since it's only called by iwlwifi in case of WoWLAN rekey offload
which has its own KRACK protection, but still better to fix, do
that by returning an error code and converting that to success on
the cfg80211 boundary only, leaving the error for bad callers of
ieee80211_gtk_rekey_add().
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: fdf7cb4185b6 ("mac80211: accept key reinstall without changing anything")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e0275ea52169412b8faccb4e2f4fed8a057844c6 ]
iso_listen_cis shall only return -EADDRINUSE if the listening socket has
the destination set to BDADDR_ANY otherwise if the destination is set to
a specific address it is for broadcast which shall be ignored.
Fixes: f764a6c2c1e4 ("Bluetooth: ISO: Add broadcast support")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit cbaabbcdcbd355f0a1ccc09a925575c51c270750 ]
hci_req_prepare_suspend() has been deprecated in favor of
hci_suspend_sync().
Fixes: 182ee45da083 ("Bluetooth: hci_sync: Rework hci_suspend_notifier")
Signed-off-by: Yao Xiao <xiaoyao@rock-chips.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 37c20b2effe987b806c8de6d12978e4ffeff026f ]
Max Schulze reports crashes with brcmfmac. The reason seems
to be a race between userspace removing the CQM config and
the driver calling cfg80211_cqm_rssi_notify(), where if the
data is freed while cfg80211_cqm_rssi_notify() runs it will
crash since it assumes wdev->cqm_config is set. This can't
be fixed with a simple non-NULL check since there's nothing
we can do for locking easily, so use RCU instead to protect
the pointer, but that requires pulling the updates out into
an asynchronous worker so they can sleep and call back into
the driver.
Since we need to change the free anyway, also change it to
go back to the old settings if changing the settings fails.
Reported-and-tested-by: Max Schulze <max.schulze@online.de>
Closes: https://lore.kernel.org/r/ac96309a-8d8d-4435-36e6-6d152eb31876@online.de
Fixes: 4a4b8169501b ("cfg80211: Accept multiple RSSI thresholds for CQM")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a3ee4dc84c4e9d14cb34dad095fd678127aca5b6 ]
Add a work abstraction at the cfg80211 level that will always
hold the wiphy_lock() for any work executed and therefore also
can be canceled safely (without waiting) while holding that.
This improves on what we do now as with the new wiphy works we
don't have to worry about locking while cancelling them safely.
Also, don't let such works run while the device is suspended,
since they'll likely need to interact with the device. Flush
them before suspend though.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Stable-dep-of: 37c20b2effe9 ("wifi: cfg80211: fix cqm_config access race")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a993df0f9143e63eca38c96a30daf08db99a98a3 ]
This is a driver callback, and the driver should be able
to assume that it's called with the wiphy lock held. Move
the call up so that's true, it has no other effect since
the device is already unregistering and we cannot reach
this function through other paths.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Stable-dep-of: 37c20b2effe9 ("wifi: cfg80211: fix cqm_config access race")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e9da6df7492a981b071bafd169fb4c35b45f5ebf ]
Most code paths in cfg80211 already hold the wiphy lock,
mostly by virtue of being called from nl80211, so make
the auto-disconnect worker also hold it, aligning the
locking promises between different parts of cfg80211.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Stable-dep-of: 37c20b2effe9 ("wifi: cfg80211: fix cqm_config access race")
Signed-off-by: Sasha Levin <sashal@kernel.org>