IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
[ Upstream commit 300b655db1b5152d6101bcb6801d50899b20c2d6 ]
The initial default value of 0 for tp->rate_app_limited was incorrect,
since a flow is indeed application-limited until it first sends
data. Fixing the default to be 1 is generally correct but also
specifically will help user-space applications avoid using the initial
tcpi_delivery_rate value of 0 that persists until the connection has
some non-zero bandwidth sample.
Fixes: eb8329e0a04d ("tcp: export data delivery rate")
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3f4ca5fafc08881d7a57daa20449d171f2887043 ]
While one cpu is working on looking up the right socket from ehash
table, another cpu is done deleting the request socket and is about
to add (or is adding) the big socket from the table. It means that
we could miss both of them, even though it has little chance.
Let me draw a call trace map of the server side.
CPU 0 CPU 1
----- -----
tcp_v4_rcv() syn_recv_sock()
inet_ehash_insert()
-> sk_nulls_del_node_init_rcu(osk)
__inet_lookup_established()
-> __sk_nulls_add_node_rcu(sk, list)
Notice that the CPU 0 is receiving the data after the final ack
during 3-way shakehands and CPU 1 is still handling the final ack.
Why could this be a real problem?
This case is happening only when the final ack and the first data
receiving by different CPUs. Then the server receiving data with
ACK flag tries to search one proper established socket from ehash
table, but apparently it fails as my map shows above. After that,
the server fetches a listener socket and then sends a RST because
it finds a ACK flag in the skb (data), which obeys RST definition
in RFC 793.
Besides, Eric pointed out there's one more race condition where it
handles tw socket hashdance. Only by adding to the tail of the list
before deleting the old one can we avoid the race if the reader has
already begun the bucket traversal and it would possibly miss the head.
Many thanks to Eric for great help from beginning to end.
Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/lkml/20230112065336.41034-1-kerneljasonxing@gmail.com/
Link: https://lore.kernel.org/r/20230118015941.1313-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b9fb10d131b8c84af9bb14e2078d5c63600c7dea ]
lockdep complains with the following lock/unlock sequence:
lock_sock(sk);
write_lock_bh(&sk->sk_callback_lock);
[1] release_sock(sk);
[2] write_unlock_bh(&sk->sk_callback_lock);
We need to swap [1] and [2] to fix this issue.
Fixes: 0b2c59720e65 ("l2tp: close all race conditions in l2tp_tunnel_register()")
Reported-by: syzbot+bbd35b345c7cab0d9a08@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/netdev/20230114030137.672706-1-xiyou.wangcong@gmail.com/T/#m1164ff20628671b0f326a24cb106ab3239c70ce3
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1d80d57ffcb55488f0ec0b77928d4f82d16b6a90 ]
syzbot reports a possible deadlock in rfcomm_sk_state_change [1].
While rfcomm_sock_connect acquires the sk lock and waits for
the rfcomm lock, rfcomm_sock_release could have the rfcomm
lock and hit a deadlock for acquiring the sk lock.
Here's a simplified flow:
rfcomm_sock_connect:
lock_sock(sk)
rfcomm_dlc_open:
rfcomm_lock()
rfcomm_sock_release:
rfcomm_sock_shutdown:
rfcomm_lock()
__rfcomm_dlc_close:
rfcomm_k_state_change:
lock_sock(sk)
This patch drops the sk lock before calling rfcomm_dlc_open to
avoid the possible deadlock and holds sk's reference count to
prevent use-after-free after rfcomm_dlc_open completes.
Reported-by: syzbot+d7ce59...@syzkaller.appspotmail.com
Fixes: 1804fdf6e494 ("Bluetooth: btintel: Combine setting up MSFT extension")
Link: https://syzkaller.appspot.com/bug?extid=d7ce59b06b3eb14fd218 [1]
Signed-off-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0b2c59720e65885a394a017d0cf9cab118914682 ]
The code in l2tp_tunnel_register() is racy in several ways:
1. It modifies the tunnel socket _after_ publishing it.
2. It calls setup_udp_tunnel_sock() on an existing socket without
locking.
3. It changes sock lock class on fly, which triggers many syzbot
reports.
This patch amends all of them by moving socket initialization code
before publishing and under sock lock. As suggested by Jakub, the
l2tp lockdep class is not necessary as we can just switch to
bh_lock_sock_nested().
Fixes: 37159ef2c1ae ("l2tp: fix a lockdep splat")
Fixes: 6b9f34239b00 ("l2tp: fix races in tunnel creation")
Reported-by: syzbot+52866e24647f9a23403f@syzkaller.appspotmail.com
Reported-by: syzbot+94cc2a66fc228b23f360@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Guillaume Nault <gnault@redhat.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c4d48a58f32c5972174a1d01c33b296fe378cce0 ]
l2tp uses l2tp_tunnel_list to track all registered tunnels and
to allocate tunnel ID's. IDR can do the same job.
More importantly, with IDR we can hold the ID before a successful
registration so that we don't need to worry about late error
handling, it is not easy to rollback socket changes.
This is a preparation for the following fix.
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Guillaume Nault <gnault@redhat.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 0b2c59720e65 ("l2tp: close all race conditions in l2tp_tunnel_register()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit af295e854a4e3813ffbdef26dbb6a4d6226c3ea1 ]
When holding a reader-writer spin lock we cannot sleep. Calling
setup_udp_tunnel_sock() with write lock held violates this rule, because we
end up calling percpu_down_read(), which might sleep, as syzbot reports
[1]:
__might_resched.cold+0x222/0x26b kernel/sched/core.c:9890
percpu_down_read include/linux/percpu-rwsem.h:49 [inline]
cpus_read_lock+0x1b/0x140 kernel/cpu.c:310
static_key_slow_inc+0x12/0x20 kernel/jump_label.c:158
udp_tunnel_encap_enable include/net/udp_tunnel.h:187 [inline]
setup_udp_tunnel_sock+0x43d/0x550 net/ipv4/udp_tunnel_core.c:81
l2tp_tunnel_register+0xc51/0x1210 net/l2tp/l2tp_core.c:1509
pppol2tp_connect+0xcdc/0x1a10 net/l2tp/l2tp_ppp.c:723
Trim the writer-side critical section for sk_callback_lock down to the
minimum, so that it covers only operations on sk_user_data.
Also, when grabbing the sk_callback_lock, we always need to disable BH, as
Eric points out. Failing to do so leads to deadlocks because we acquire
sk_callback_lock in softirq context, which can get stuck waiting on us if:
1) it runs on the same CPU, or
CPU0
----
lock(clock-AF_INET6);
<Interrupt>
lock(clock-AF_INET6);
2) lock ordering leads to priority inversion
CPU0 CPU1
---- ----
lock(clock-AF_INET6);
local_irq_disable();
lock(&tcp_hashinfo.bhash[i].lock);
lock(clock-AF_INET6);
<Interrupt>
lock(&tcp_hashinfo.bhash[i].lock);
... as syzbot reports [2,3]. Use the _bh variants for write_(un)lock.
[1] https://lore.kernel.org/netdev/0000000000004e78ec05eda79749@google.com/
[2] https://lore.kernel.org/netdev/000000000000e38b6605eda76f98@google.com/
[3] https://lore.kernel.org/netdev/000000000000dfa31e05eda76f75@google.com/
v2:
- Check and set sk_user_data while holding sk_callback_lock for both
L2TP encapsulation types (IP and UDP) (Tetsuo)
Cc: Tom Parkin <tparkin@katalix.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: b68777d54fac ("l2tp: Serialize access to sk_user_data with sk_callback_lock")
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+703d9e154b3b58277261@syzkaller.appspotmail.com
Reported-by: syzbot+50680ced9e98a61f7698@syzkaller.appspotmail.com
Reported-by: syzbot+de987172bb74a381879b@syzkaller.appspotmail.com
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 0b2c59720e65 ("l2tp: close all race conditions in l2tp_tunnel_register()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b68777d54fac21fc833ec26ea1a2a84f975ab035 ]
sk->sk_user_data has multiple users, which are not compatible with each
other. Writers must synchronize by grabbing the sk->sk_callback_lock.
l2tp currently fails to grab the lock when modifying the underlying tunnel
socket fields. Fix it by adding appropriate locking.
We err on the side of safety and grab the sk_callback_lock also inside the
sk_destruct callback overridden by l2tp, even though there should be no
refs allowing access to the sock at the time when sk_destruct gets called.
v4:
- serialize write to sk_user_data in l2tp sk_destruct
v3:
- switch from sock lock to sk_callback_lock
- document write-protection for sk_user_data
v2:
- update Fixes to point to origin of the bug
- use real names in Reported/Tested-by tags
Cc: Tom Parkin <tparkin@katalix.com>
Fixes: 3557baabf280 ("[L2TP]: PPP over L2TP driver core")
Reported-by: Haowei Yan <g1042620637@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 0b2c59720e65 ("l2tp: close all race conditions in l2tp_tunnel_register()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3a415d59c1dbec9d772dbfab2d2520d98360caae ]
syzbot reported a nasty crash [1] in net_tx_action() which
made little sense until we got a repro.
This repro installs a taprio qdisc, but providing an
invalid TCA_RATE attribute.
qdisc_create() has to destroy the just initialized
taprio qdisc, and taprio_destroy() is called.
However, the hrtimer used by taprio had already fired,
therefore advance_sched() called __netif_schedule().
Then net_tx_action was trying to use a destroyed qdisc.
We can not undo the __netif_schedule(), so we must wait
until one cpu serviced the qdisc before we can proceed.
Many thanks to Alexander Potapenko for his help.
[1]
BUG: KMSAN: uninit-value in queued_spin_trylock include/asm-generic/qspinlock.h:94 [inline]
BUG: KMSAN: uninit-value in do_raw_spin_trylock include/linux/spinlock.h:191 [inline]
BUG: KMSAN: uninit-value in __raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline]
BUG: KMSAN: uninit-value in _raw_spin_trylock+0x92/0xa0 kernel/locking/spinlock.c:138
queued_spin_trylock include/asm-generic/qspinlock.h:94 [inline]
do_raw_spin_trylock include/linux/spinlock.h:191 [inline]
__raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline]
_raw_spin_trylock+0x92/0xa0 kernel/locking/spinlock.c:138
spin_trylock include/linux/spinlock.h:359 [inline]
qdisc_run_begin include/net/sch_generic.h:187 [inline]
qdisc_run+0xee/0x540 include/net/pkt_sched.h:125
net_tx_action+0x77c/0x9a0 net/core/dev.c:5086
__do_softirq+0x1cc/0x7fb kernel/softirq.c:571
run_ksoftirqd+0x2c/0x50 kernel/softirq.c:934
smpboot_thread_fn+0x554/0x9f0 kernel/smpboot.c:164
kthread+0x31b/0x430 kernel/kthread.c:376
ret_from_fork+0x1f/0x30
Uninit was created at:
slab_post_alloc_hook mm/slab.h:732 [inline]
slab_alloc_node mm/slub.c:3258 [inline]
__kmalloc_node_track_caller+0x814/0x1250 mm/slub.c:4970
kmalloc_reserve net/core/skbuff.c:358 [inline]
__alloc_skb+0x346/0xcf0 net/core/skbuff.c:430
alloc_skb include/linux/skbuff.h:1257 [inline]
nlmsg_new include/net/netlink.h:953 [inline]
netlink_ack+0x5f3/0x12b0 net/netlink/af_netlink.c:2436
netlink_rcv_skb+0x55d/0x6c0 net/netlink/af_netlink.c:2507
rtnetlink_rcv+0x30/0x40 net/core/rtnetlink.c:6108
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0xf3b/0x1270 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x1288/0x1440 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
____sys_sendmsg+0xabc/0xe90 net/socket.c:2482
___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2536
__sys_sendmsg net/socket.c:2565 [inline]
__do_sys_sendmsg net/socket.c:2574 [inline]
__se_sys_sendmsg net/socket.c:2572 [inline]
__x64_sys_sendmsg+0x367/0x540 net/socket.c:2572
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
CPU: 0 PID: 13 Comm: ksoftirqd/0 Not tainted 6.0.0-rc2-syzkaller-47461-gac3859c02d7f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022
Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a22b7388d658ecfcd226600c8c34ce4481e88655 ]
Peek at old qdisc and graft only when deleting a leaf class in the htb,
rather than when deleting the htb itself. Do not peek at the qdisc of the
netdev queue when destroying the htb. The caller may already have grafted a
new qdisc that is not part of the htb structure being destroyed.
This fix resolves two use cases.
1. Using tc to destroy the htb.
- Netdev was being prematurely activated before the htb was fully
destroyed.
2. Using tc to replace the htb with another qdisc (which also leads to
the htb being destroyed).
- Premature netdev activation like previous case. Newly grafted qdisc
was also getting accidentally overwritten when destroying the htb.
Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230113005528.302625-1-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4bb4db7f3187c6e3de6b229ffc87cdb30a2d22b6 ]
Fix a use-after-free that occurs in kfree_skb() called from
local_cleanup(). This could happen when killing nfc daemon (e.g. neard)
after detaching an nfc device.
When detaching an nfc device, local_cleanup() called from
nfc_llcp_unregister_device() frees local->rx_pending and decreases
local->ref by kref_put() in nfc_llcp_local_put().
In the terminating process, nfc daemon releases all sockets and it leads
to decreasing local->ref. After the last release of local->ref,
local_cleanup() called from local_release() frees local->rx_pending
again, which leads to the bug.
Setting local->rx_pending to NULL in local_cleanup() could prevent
use-after-free when local_cleanup() is called twice.
Found by a modified version of syzkaller.
BUG: KASAN: use-after-free in kfree_skb()
Call Trace:
dump_stack_lvl (lib/dump_stack.c:106)
print_address_description.constprop.0.cold (mm/kasan/report.c:306)
kasan_check_range (mm/kasan/generic.c:189)
kfree_skb (net/core/skbuff.c:955)
local_cleanup (net/nfc/llcp_core.c:159)
nfc_llcp_local_put.part.0 (net/nfc/llcp_core.c:172)
nfc_llcp_local_put (net/nfc/llcp_core.c:181)
llcp_sock_destruct (net/nfc/llcp_sock.c:959)
__sk_destruct (net/core/sock.c:2133)
sk_destruct (net/core/sock.c:2181)
__sk_free (net/core/sock.c:2192)
sk_free (net/core/sock.c:2203)
llcp_sock_release (net/nfc/llcp_sock.c:646)
__sock_release (net/socket.c:650)
sock_close (net/socket.c:1365)
__fput (fs/file_table.c:306)
task_work_run (kernel/task_work.c:179)
ptrace_notify (kernel/signal.c:2354)
syscall_exit_to_user_mode_prepare (kernel/entry/common.c:278)
syscall_exit_to_user_mode (kernel/entry/common.c:296)
do_syscall_64 (arch/x86/entry/common.c:86)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:106)
Allocated by task 4719:
kasan_save_stack (mm/kasan/common.c:45)
__kasan_slab_alloc (mm/kasan/common.c:325)
slab_post_alloc_hook (mm/slab.h:766)
kmem_cache_alloc_node (mm/slub.c:3497)
__alloc_skb (net/core/skbuff.c:552)
pn533_recv_response (drivers/nfc/pn533/usb.c:65)
__usb_hcd_giveback_urb (drivers/usb/core/hcd.c:1671)
usb_giveback_urb_bh (drivers/usb/core/hcd.c:1704)
tasklet_action_common.isra.0 (kernel/softirq.c:797)
__do_softirq (kernel/softirq.c:571)
Freed by task 1901:
kasan_save_stack (mm/kasan/common.c:45)
kasan_set_track (mm/kasan/common.c:52)
kasan_save_free_info (mm/kasan/genericdd.c:518)
__kasan_slab_free (mm/kasan/common.c:236)
kmem_cache_free (mm/slub.c:3809)
kfree_skbmem (net/core/skbuff.c:874)
kfree_skb (net/core/skbuff.c:931)
local_cleanup (net/nfc/llcp_core.c:159)
nfc_llcp_unregister_device (net/nfc/llcp_core.c:1617)
nfc_unregister_device (net/nfc/core.c:1179)
pn53x_unregister_nfc (drivers/nfc/pn533/pn533.c:2846)
pn533_usb_disconnect (drivers/nfc/pn533/usb.c:579)
usb_unbind_interface (drivers/usb/core/driver.c:458)
device_release_driver_internal (drivers/base/dd.c:1279)
bus_remove_device (drivers/base/bus.c:529)
device_del (drivers/base/core.c:3665)
usb_disable_device (drivers/usb/core/message.c:1420)
usb_disconnect (drivers/usb/core.c:2261)
hub_event (drivers/usb/core/hub.c:5833)
process_one_work (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:212 include/trace/events/workqueue.h:108 kernel/workqueue.c:2281)
worker_thread (include/linux/list.h:282 kernel/workqueue.c:2423)
kthread (kernel/kthread.c:319)
ret_from_fork (arch/x86/entry/entry_64.S:301)
Fixes: 3536da06db0b ("NFC: llcp: Clean local timers and works when removing a device")
Signed-off-by: Jisoo Jang <jisoo.jang@yonsei.ac.kr>
Link: https://lore.kernel.org/r/20230111131914.3338838-1-jisoo.jang@yonsei.ac.kr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 8ccc99362b60c6f27bb46f36fdaaccf4ef0303de upstream.
The referenced commit changed the error code returned by the kernel
when preventing a non-established socket from attaching the ktls
ULP. Before to such a commit, the user-space got ENOTCONN instead
of EINVAL.
The existing self-tests depend on such error code, and the change
caused a failure:
RUN global.non_established ...
tls.c:1673:non_established:Expected errno (22) == ENOTCONN (107)
non_established: Test failed at step #3
FAIL global.non_established
In the unlikely event existing applications do the same, address
the issue by restoring the prior error code in the above scenario.
Note that the only other ULP performing similar checks at init
time - smc_ulp_ops - also fails with ENOTCONN when trying to attach
the ULP to a non-established socket.
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Fixes: 2c02d41d71f9 ("net/ulp: prevent ULP without clone op from entering the LISTEN status")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/7bb199e7a93317fb6f8bf8b9b2dc71c18f337cde.1674042685.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9deb1e9fb88b1120a908676fa33bdf9e2eeaefce ]
It's not very useful to copy back an empty ethtool_stats struct and
return 0 if we didn't actually have any stats. This also allows for
further simplification of this function in the future commits.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9e17f99220d111ea031b44153fdfe364b0024ff2 ]
The 'TCA_MPLS_LABEL' attribute is of 'NLA_U32' type, but has a
validation type of 'NLA_VALIDATE_FUNCTION'. This is an invalid
combination according to the comment above 'struct nla_policy':
"
Meaning of `validate' field, use via NLA_POLICY_VALIDATE_FN:
NLA_BINARY Validation function called for the attribute.
All other Unused - but note that it's a union
"
This can trigger the warning [1] in nla_get_range_unsigned() when
validation of the attribute fails. Despite being of 'NLA_U32' type, the
associated 'min'/'max' fields in the policy are negative as they are
aliased by the 'validate' field.
Fix by changing the attribute type to 'NLA_BINARY' which is consistent
with the above comment and all other users of NLA_POLICY_VALIDATE_FN().
As a result, move the length validation to the validation function.
No regressions in MPLS tests:
# ./tdc.py -f tc-tests/actions/mpls.json
[...]
# echo $?
0
[1]
WARNING: CPU: 0 PID: 17743 at lib/nlattr.c:118
nla_get_range_unsigned+0x1d8/0x1e0 lib/nlattr.c:117
Modules linked in:
CPU: 0 PID: 17743 Comm: syz-executor.0 Not tainted 6.1.0-rc8 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
RIP: 0010:nla_get_range_unsigned+0x1d8/0x1e0 lib/nlattr.c:117
[...]
Call Trace:
<TASK>
__netlink_policy_dump_write_attr+0x23d/0x990 net/netlink/policy.c:310
netlink_policy_dump_write_attr+0x22/0x30 net/netlink/policy.c:411
netlink_ack_tlv_fill net/netlink/af_netlink.c:2454 [inline]
netlink_ack+0x546/0x760 net/netlink/af_netlink.c:2506
netlink_rcv_skb+0x1b7/0x240 net/netlink/af_netlink.c:2546
rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:6109
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x5e9/0x6b0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x739/0x860 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
____sys_sendmsg+0x38f/0x500 net/socket.c:2482
___sys_sendmsg net/socket.c:2536 [inline]
__sys_sendmsg+0x197/0x230 net/socket.c:2565
__do_sys_sendmsg net/socket.c:2574 [inline]
__se_sys_sendmsg net/socket.c:2572 [inline]
__x64_sys_sendmsg+0x42/0x50 net/socket.c:2572
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Link: https://lore.kernel.org/netdev/CAO4mrfdmjvRUNbDyP0R03_DrD_eFCLCguz6OxZ2TYRSv0K9gxA@mail.gmail.com/
Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC")
Reported-by: Wei Chen <harperchen1110@gmail.com>
Tested-by: Wei Chen <harperchen1110@gmail.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230107171004.608436-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c244c092f1ed2acfb5af3d3da81e22367d3dd733 ]
This unexpected behavior is observed:
node 1 | node 2
------ | ------
link is established | link is established
reboot | link is reset
up | send discovery message
receive discovery message |
link is established | link is established
send discovery message |
| receive discovery message
| link is reset (unexpected)
| send reset message
link is reset |
It is due to delayed re-discovery as described in function
tipc_node_check_dest(): "this link endpoint has already reset
and re-established contact with the peer, before receiving a
discovery message from that node."
However, commit 598411d70f85 has changed the condition for calling
tipc_node_link_down() which was the acceptance of new media address.
This commit fixes this by restoring the old and correct behavior.
Fixes: 598411d70f85 ("tipc: make resetting of links non-atomic")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 9ea4b476cea1b7d461d16dda25ca3c7e616e2d15 upstream.
When first_ip is 0, last_ip is 0xFFFFFFFF, and netmask is 31, the value of
an arithmetic expression 2 << (netmask - mask_bits - 1) is subject
to overflow due to a failure casting operands to a larger data type
before performing the arithmetic.
Note that it's harmless since the value will be checked at the next step.
Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.
Fixes: b9fed748185a ("netfilter: ipset: Check and reject crazy /0 input parameters")
Signed-off-by: Ilia.Gavrilov <Ilia.Gavrilov@infotecs.ru>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cb3e9864cdbe35ff6378966660edbcbac955fe17 upstream.
The total cork length created by ip6_append_data includes extension
headers, so we must exclude them when comparing them against the
IPV6_CHECKSUM offset which does not include extension headers.
Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Fixes: 357b40a18b04 ("[IPV6]: IPV6_CHECKSUM socket option can corrupt kernel memory")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 696e1a48b1a1b01edad542a1ef293665864a4dd0 upstream.
If the offset + length goes over the ethernet + vlan header, then the
length is adjusted to copy the bytes that are within the boundaries of
the vlan_ethhdr scratchpad area. The remaining bytes beyond ethernet +
vlan header are copied directly from the skbuff data area.
Fix incorrect arithmetic operator: subtract, not add, the size of the
vlan header in case of double-tagged packets to adjust the length
accordingly to address CVE-2023-0179.
Reported-by: Davide Ornaghi <d.ornaghi97@gmail.com>
Fixes: f6ae9f120dad ("netfilter: nft_payload: add C-VLAN support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2c02d41d71f90a5168391b6a5f2954112ba2307c upstream.
When an ULP-enabled socket enters the LISTEN status, the listener ULP data
pointer is copied inside the child/accepted sockets by sk_clone_lock().
The relevant ULP can take care of de-duplicating the context pointer via
the clone() operation, but only MPTCP and SMC implement such op.
Other ULPs may end-up with a double-free at socket disposal time.
We can't simply clear the ULP data at clone time, as TLS replaces the
socket ops with custom ones assuming a valid TLS ULP context is
available.
Instead completely prevent clone-less ULP sockets from entering the
LISTEN status.
Fixes: 734942cc4ea6 ("tcp: ULP infrastructure")
Reported-by: slipper <slipper.alive@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/4b80c3d1dbe3d0ab072f80450c202d9bc88b4b03.1672740602.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 96398560f26aa07e8f2969d73c8197e6a6d10407 upstream.
While experimenting with applying noqueue to a classful queue discipline,
we discovered a NULL pointer dereference in the __dev_queue_xmit()
path that generates a kernel OOPS:
# dev=enp0s5
# tc qdisc replace dev $dev root handle 1: htb default 1
# tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
# tc qdisc add dev $dev parent 1:1 handle 10: noqueue
# ping -I $dev -w 1 -c 1 1.1.1.1
[ 2.172856] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 2.173217] #PF: supervisor instruction fetch in kernel mode
...
[ 2.178451] Call Trace:
[ 2.178577] <TASK>
[ 2.178686] htb_enqueue+0x1c8/0x370
[ 2.178880] dev_qdisc_enqueue+0x15/0x90
[ 2.179093] __dev_queue_xmit+0x798/0xd00
[ 2.179305] ? _raw_write_lock_bh+0xe/0x30
[ 2.179522] ? __local_bh_enable_ip+0x32/0x70
[ 2.179759] ? ___neigh_create+0x610/0x840
[ 2.179968] ? eth_header+0x21/0xc0
[ 2.180144] ip_finish_output2+0x15e/0x4f0
[ 2.180348] ? dst_output+0x30/0x30
[ 2.180525] ip_push_pending_frames+0x9d/0xb0
[ 2.180739] raw_sendmsg+0x601/0xcb0
[ 2.180916] ? _raw_spin_trylock+0xe/0x50
[ 2.181112] ? _raw_spin_unlock_irqrestore+0x16/0x30
[ 2.181354] ? get_page_from_freelist+0xcd6/0xdf0
[ 2.181594] ? sock_sendmsg+0x56/0x60
[ 2.181781] sock_sendmsg+0x56/0x60
[ 2.181958] __sys_sendto+0xf7/0x160
[ 2.182139] ? handle_mm_fault+0x6e/0x1d0
[ 2.182366] ? do_user_addr_fault+0x1e1/0x660
[ 2.182627] __x64_sys_sendto+0x1b/0x30
[ 2.182881] do_syscall_64+0x38/0x90
[ 2.183085] entry_SYSCALL_64_after_hwframe+0x63/0xcd
...
[ 2.187402] </TASK>
Previously in commit d66d6c3152e8 ("net: sched: register noqueue
qdisc"), NULL was set for the noqueue discipline on noqueue init
so that __dev_queue_xmit() falls through for the noqueue case. This
also sets a bypass of the enqueue NULL check in the
register_qdisc() function for the struct noqueue_disc_ops.
Classful queue disciplines make it past the NULL check in
__dev_queue_xmit() because the discipline is set to htb (in this case),
and then in the call to __dev_xmit_skb(), it calls into htb_enqueue()
which grabs a leaf node for a class and then calls qdisc_enqueue() by
passing in a queue discipline which assumes ->enqueue() is not set to NULL.
Fix this by not allowing classes to be assigned to the noqueue
discipline. Linux TC Notes states that classes cannot be set to
the noqueue discipline. [1] Let's enforce that here.
Links:
1. https://linux-tc-notes.sourceforge.net/tc/doc/sch_noqueue.txt
Fixes: d66d6c3152e8 ("net: sched: register noqueue qdisc")
Cc: stable@vger.kernel.org
Signed-off-by: Frederick Lawler <fred@cloudflare.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20230109163906.706000-1-fred@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d3295fee3c756ece33ac0d935e172e68c0a4161b upstream.
Before, only the destructor from TCP request sock in IPv4 was called
even if the subflow was IPv6.
It is important to use the right destructor to avoid memory leaks with
some advanced IPv6 features, e.g. when the request socks contain
specific IPv6 options.
Fixes: 79c0949e9a09 ("mptcp: Add key generation and token tree")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 34b21d1ddc8ace77a8fa35c1b1e06377209e0dae upstream.
tcp_request_sock_ops structure is specific to IPv4. It should then not
be used with MPTCP subflows on top of IPv6.
For example, it contains the 'family' field, initialised to AF_INET.
This 'family' field is used by TCP FastOpen code to generate the cookie
but also by TCP Metrics, SELinux and SYN Cookies. Using the wrong family
will not lead to crashes but displaying/using/checking wrong things.
Note that 'send_reset' callback from request_sock_ops structure is used
in some error paths. It is then also important to use the correct one
for IPv4 or IPv6.
The slab name can also be different in IPv4 and IPv6, it will be used
when printing some log messages. The slab pointer will anyway be the
same because the object size is the same for both v4 and v6. A
BUILD_BUG_ON() has also been added to make sure this size is the same.
Fixes: cec37a6e41aa ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5e29dc36bd5e2166b834ceb19990d9e68a734d7d ]
When adding/deleting large number of elements in one step in ipset, it can
take a reasonable amount of time and can result in soft lockup errors. The
patch 5f7b51bf09ba ("netfilter: ipset: Limit the maximal range of
consecutive elements to add/delete") tried to fix it by limiting the max
elements to process at all. However it was not enough, it is still possible
that we get hung tasks. Lowering the limit is not reasonable, so the
approach in this patch is as follows: rely on the method used at resizing
sets and save the state when we reach a smaller internal batch limit,
unlock/lock and proceed from the saved state. Thus we can avoid long
continuous tasks and at the same time removed the limit to add/delete large
number of elements in one step.
The nfnl mutex is held during the whole operation which prevents one to
issue other ipset commands in parallel.
Fixes: 5f7b51bf09ba ("netfilter: ipset: Limit the maximal range of consecutive elements to add/delete")
Reported-by: syzbot+9204e7399656300bf271@syzkaller.appspotmail.com
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a31d47be64b9b74f8cfedffe03e0a8a1f9e51f23 ]
The hash:net,port,net set type supports /0 subnets. However, the patch
commit 5f7b51bf09baca8e titled "netfilter: ipset: Limit the maximal range
of consecutive elements to add/delete" did not take into account it and
resulted in an endless loop. The bug is actually older but the patch
5f7b51bf09baca8e brings it out earlier.
Handle /0 subnets properly in hash:net,port,net set types.
Fixes: 5f7b51bf09ba ("netfilter: ipset: Limit the maximal range of consecutive elements to add/delete")
Reported-by: Марк Коренберг <socketpair@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a2965c7be0522eaa18808684b7b82b248515511b ]
If asked to drop a packet via TC_ACT_SHOT it is unsafe to assume
res.class contains a valid pointer
Fixes: b0188d4dbe5f ("[NET_SCHED]: sch_atm: Lindent")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit df49908f3c52d211aea5e2a14a93bbe67a2cb3af ]
nfc_get_device() take reference for the device, add missing
nfc_put_device() to release it when not need anymore.
Also fix the style warnning by use error EOPNOTSUPP instead of
ENOTSUPP.
Fixes: 5ce3f32b5264 ("NFC: netlink: SE API implementation")
Fixes: 29e76924cf08 ("nfc: netlink: Add capability to reply to vendor_cmd with data")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 399ab7fe0fa0d846881685fd4e57e9a8ef7559f7 ]
Syzkaller reports a memory leak as follows:
====================================
BUG: memory leak
unreferenced object 0xffff88810c287f00 (size 256):
comm "syz-executor105", pid 3600, jiffies 4294943292 (age 12.990s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff814cf9f0>] kmalloc_trace+0x20/0x90 mm/slab_common.c:1046
[<ffffffff839c9e07>] kmalloc include/linux/slab.h:576 [inline]
[<ffffffff839c9e07>] kmalloc_array include/linux/slab.h:627 [inline]
[<ffffffff839c9e07>] kcalloc include/linux/slab.h:659 [inline]
[<ffffffff839c9e07>] tcf_exts_init include/net/pkt_cls.h:250 [inline]
[<ffffffff839c9e07>] tcindex_set_parms+0xa7/0xbe0 net/sched/cls_tcindex.c:342
[<ffffffff839caa1f>] tcindex_change+0xdf/0x120 net/sched/cls_tcindex.c:553
[<ffffffff8394db62>] tc_new_tfilter+0x4f2/0x1100 net/sched/cls_api.c:2147
[<ffffffff8389e91c>] rtnetlink_rcv_msg+0x4dc/0x5d0 net/core/rtnetlink.c:6082
[<ffffffff839eba67>] netlink_rcv_skb+0x87/0x1d0 net/netlink/af_netlink.c:2540
[<ffffffff839eab87>] netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
[<ffffffff839eab87>] netlink_unicast+0x397/0x4c0 net/netlink/af_netlink.c:1345
[<ffffffff839eb046>] netlink_sendmsg+0x396/0x710 net/netlink/af_netlink.c:1921
[<ffffffff8383e796>] sock_sendmsg_nosec net/socket.c:714 [inline]
[<ffffffff8383e796>] sock_sendmsg+0x56/0x80 net/socket.c:734
[<ffffffff8383eb08>] ____sys_sendmsg+0x178/0x410 net/socket.c:2482
[<ffffffff83843678>] ___sys_sendmsg+0xa8/0x110 net/socket.c:2536
[<ffffffff838439c5>] __sys_sendmmsg+0x105/0x330 net/socket.c:2622
[<ffffffff83843c14>] __do_sys_sendmmsg net/socket.c:2651 [inline]
[<ffffffff83843c14>] __se_sys_sendmmsg net/socket.c:2648 [inline]
[<ffffffff83843c14>] __x64_sys_sendmmsg+0x24/0x30 net/socket.c:2648
[<ffffffff84605fd5>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<ffffffff84605fd5>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
[<ffffffff84800087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
====================================
Kernel uses tcindex_change() to change an existing
filter properties.
Yet the problem is that, during the process of changing,
if `old_r` is retrieved from `p->perfect`, then
kernel uses tcindex_alloc_perfect_hash() to newly
allocate filter results, uses tcindex_filter_result_init()
to clear the old filter result, without destroying
its tcf_exts structure, which triggers the above memory leak.
To be more specific, there are only two source for the `old_r`,
according to the tcindex_lookup(). `old_r` is retrieved from
`p->perfect`, or `old_r` is retrieved from `p->h`.
* If `old_r` is retrieved from `p->perfect`, kernel uses
tcindex_alloc_perfect_hash() to newly allocate the
filter results. Then `r` is assigned with `cp->perfect + handle`,
which is newly allocated. So condition `old_r && old_r != r` is
true in this situation, and kernel uses tcindex_filter_result_init()
to clear the old filter result, without destroying
its tcf_exts structure
* If `old_r` is retrieved from `p->h`, then `p->perfect` is NULL
according to the tcindex_lookup(). Considering that `cp->h`
is directly copied from `p->h` and `p->perfect` is NULL,
`r` is assigned with `tcindex_lookup(cp, handle)`, whose value
should be the same as `old_r`, so condition `old_r && old_r != r`
is false in this situation, kernel ignores using
tcindex_filter_result_init() to clear the old filter result.
So only when `old_r` is retrieved from `p->perfect` does kernel use
tcindex_filter_result_init() to clear the old filter result, which
triggers the above memory leak.
Considering that there already exists a tc_filter_wq workqueue
to destroy the old tcindex_data by tcindex_partial_destroy_work()
at the end of tcindex_set_parms(), this patch solves
this memory leak bug by removing this old filter result
clearing part and delegating it to the tc_filter_wq workqueue.
Note that this patch doesn't introduce any other issues. If
`old_r` is retrieved from `p->perfect`, this patch just
delegates old filter result clearing part to the
tc_filter_wq workqueue; If `old_r` is retrieved from `p->h`,
kernel doesn't reach the old filter result clearing part, so
removing this part has no effect.
[Thanks to the suggestion from Jakub Kicinski, Cong Wang, Paolo Abeni
and Dmitry Vyukov]
Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()")
Link: https://lore.kernel.org/all/0000000000001de5c505ebc9ec59@google.com/
Reported-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com
Tested-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 123b99619cca94bdca0bf7bde9abe28f0a0dfe06 ]
Set timeout and garbage collection interval updates are ignored on
updates. Add transaction to update global set element timeout and
garbage collection interval.
Fixes: 96518518cc41 ("netfilter: add nftables")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f6594c372afd5cec8b1e9ee9ea8f8819d59c6fb1 ]
If a ruleset declares a set name that matches an existing set in the
kernel, then validate that this declaration really refers to the same
set, otherwise bail out with EEXIST.
Currently, the kernel reports success when adding a set that already
exists in the kernel. This usually results in EINVAL errors at a later
stage, when the user adds elements to the set, if the set declaration
mismatches the existing set representation in the kernel.
Add a new function to check that the set declaration really refers to
the same existing set in the kernel.
Fixes: 96518518cc41 ("netfilter: add nftables")
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a8fe4154fa5a1bae590b243ed60f871e5a5e1378 ]
Add a helper function to allocate and initialize the stateful expressions
that are defined in a set.
This patch allows to reuse this code from the set update path, to check
that type of the update matches the existing set in the kernel.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stable-dep-of: f6594c372afd ("netfilter: nf_tables: perform type checking for existing sets")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bed4a63ea4ae77cfe5aae004ef87379f0655260a ]
Add the following fields to the set description:
- key type
- data type
- object type
- policy
- gc_int: garbage collection interval)
- timeout: element timeout
This prepares for stricter set type checks on updates in a follow up
patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Stable-dep-of: f6594c372afd ("netfilter: nf_tables: perform type checking for existing sets")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 54c3f1a81421f85e60ae2eaae7be3727a09916ee ]
Anand hit a BUG() when pulling off headers on egress to a SW tunnel.
We get to skb_checksum_help() with an invalid checksum offset
(commit d7ea0d9df2a6 ("net: remove two BUG() from skb_checksum_help()")
converted those BUGs to WARN_ONs()).
He points out oddness in how skb_postpull_rcsum() gets used.
Indeed looks like we should pull before "postpull", otherwise
the CHECKSUM_PARTIAL fixup from skb_postpull_rcsum() will not
be able to do its job:
if (skb->ip_summed == CHECKSUM_PARTIAL &&
skb_checksum_start_offset(skb) < 0)
skb->ip_summed = CHECKSUM_NONE;
Reported-by: Anand Parthasarathy <anpartha@meta.com>
Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20221220004701.402165-1-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b18cba09e374637a0a3759d856a6bca94c133952 ]
Commit 9130b8dbc6ac ("SUNRPC: allow for upcalls for the same uid
but different gss service") introduced `auth` argument to
__gss_find_upcall(), but in gss_pipe_downcall() it was left as NULL
since it (and auth->service) was not (yet) determined.
When multiple upcalls with the same uid and different service are
ongoing, it could happen that __gss_find_upcall(), which returns the
first match found in the pipe->in_downcall list, could not find the
correct gss_msg corresponding to the downcall we are looking for.
Moreover, it might return a msg which is not sent to rpc.gssd yet.
We could see mount.nfs process hung in D state with multiple mount.nfs
are executed in parallel. The call trace below is of CentOS 7.9
kernel-3.10.0-1160.24.1.el7.x86_64 but we observed the same hang w/
elrepo kernel-ml-6.0.7-1.el7.
PID: 71258 TASK: ffff91ebd4be0000 CPU: 36 COMMAND: "mount.nfs"
#0 [ffff9203ca3234f8] __schedule at ffffffffa3b8899f
#1 [ffff9203ca323580] schedule at ffffffffa3b88eb9
#2 [ffff9203ca323590] gss_cred_init at ffffffffc0355818 [auth_rpcgss]
#3 [ffff9203ca323658] rpcauth_lookup_credcache at ffffffffc0421ebc
[sunrpc]
#4 [ffff9203ca3236d8] gss_lookup_cred at ffffffffc0353633 [auth_rpcgss]
#5 [ffff9203ca3236e8] rpcauth_lookupcred at ffffffffc0421581 [sunrpc]
#6 [ffff9203ca323740] rpcauth_refreshcred at ffffffffc04223d3 [sunrpc]
#7 [ffff9203ca3237a0] call_refresh at ffffffffc04103dc [sunrpc]
#8 [ffff9203ca3237b8] __rpc_execute at ffffffffc041e1c9 [sunrpc]
#9 [ffff9203ca323820] rpc_execute at ffffffffc0420a48 [sunrpc]
The scenario is like this. Let's say there are two upcalls for
services A and B, A -> B in pipe->in_downcall, B -> A in pipe->pipe.
When rpc.gssd reads pipe to get the upcall msg corresponding to
service B from pipe->pipe and then writes the response, in
gss_pipe_downcall the msg corresponding to service A will be picked
because only uid is used to find the msg and it is before the one for
B in pipe->in_downcall. And the process waiting for the msg
corresponding to service A will be woken up.
Actual scheduing of that process might be after rpc.gssd processes the
next msg. In rpc_pipe_generic_upcall it clears msg->errno (for A).
The process is scheduled to see gss_msg->ctx == NULL and
gss_msg->msg.errno == 0, therefore it cannot break the loop in
gss_create_upcall and is never woken up after that.
This patch adds a simple check to ensure that a msg which is not
sent to rpc.gssd yet is not chosen as the matching upcall upon
receiving a downcall.
Signed-off-by: minoura makoto <minoura@valinux.co.jp>
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
Tested-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Fixes: 9130b8dbc6ac ("SUNRPC: allow for upcalls for same uid but different gss service")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 3fff88186f047627bb128d65155f42517f8e448f upstream.
To ease the maintenance, it is often recommended to avoid having #ifdef
preprocessor conditions.
Here the section related to CONFIG_MPTCP was quite short but the next
commit needs to add more code around. It is then cleaner to move
specific MPTCP code to functions located in net/mptcp directory.
Now that mptcp_subflow_request_sock_ops structure can be static, it can
also be marked as "read only after init".
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 51fa7f8ebf0e25c7a9039fa3988a623d5f3855aa upstream.
These structures are initialised from the init hooks, so we can't make
them 'const'. But no writes occur afterwards, so we can use ro_after_init.
Also, remove bogus EXPORT_SYMBOL, the only access comes from ip
stack, not from kernel modules.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dfed913e8b55a0c2c4906f1242fd38fd9a116e49 upstream.
Currently, the kernel drops GSO VLAN tagged packet if it's created with
socket(AF_PACKET, SOCK_RAW, 0) plus virtio_net_hdr.
The reason is AF_PACKET doesn't adjust the skb network header if there is
a VLAN tag. Then after virtio_net_hdr_set_proto() called, the skb->protocol
will be set to ETH_P_IP/IPv6. And in later inet/ipv6_gso_segment() the skb
is dropped as network header position is invalid.
Let's handle VLAN packets by adjusting network header position in
packet_parse_headers(). The adjustment is safe and does not affect the
later xmit as tap device also did that.
In packet_snd(), packet_parse_headers() need to be moved before calling
virtio_net_hdr_set_proto(), so we can set correct skb->protocol and
network header first.
There is no need to update tpacket_snd() as it calls packet_parse_headers()
in tpacket_fill_skb(), which is already before calling virtio_net_hdr_*
functions.
skb->no_fcs setting is also moved upper to make all skb settings together
and keep consistency with function packet_sendmsg_spkt().
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20220425014502.985464-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 26273ade77f54716e30dfd40ac6e85ceb54ac0f9 upstream.
When a new request is allocated, the refcount will be zero if it is
reused, but if the request is newly allocated from slab, it is not fully
initialized before being added to idr.
If the p9_read_work got a response before the refcount initiated. It will
use a uninitialized req, which will result in a bad request data struct.
Here is the logs from syzbot.
Corrupted memory at 0xffff88807eade00b [ 0xff 0x07 0x00 0x00 0x00 0x00
0x00 0x00 . . . . . . . . ] (in kfence-#110):
p9_fcall_fini net/9p/client.c:248 [inline]
p9_req_put net/9p/client.c:396 [inline]
p9_req_put+0x208/0x250 net/9p/client.c:390
p9_client_walk+0x247/0x540 net/9p/client.c:1165
clone_fid fs/9p/fid.h:21 [inline]
v9fs_fid_xattr_set+0xe4/0x2b0 fs/9p/xattr.c:118
v9fs_xattr_set fs/9p/xattr.c:100 [inline]
v9fs_xattr_handler_set+0x6f/0x120 fs/9p/xattr.c:159
__vfs_setxattr+0x119/0x180 fs/xattr.c:182
__vfs_setxattr_noperm+0x129/0x5f0 fs/xattr.c:216
__vfs_setxattr_locked+0x1d3/0x260 fs/xattr.c:277
vfs_setxattr+0x143/0x340 fs/xattr.c:309
setxattr+0x146/0x160 fs/xattr.c:617
path_setxattr+0x197/0x1c0 fs/xattr.c:636
__do_sys_setxattr fs/xattr.c:652 [inline]
__se_sys_setxattr fs/xattr.c:648 [inline]
__ia32_sys_setxattr+0xc0/0x160 fs/xattr.c:648
do_syscall_32_irqs_on arch/x86/entry/common.c:112 [inline]
__do_fast_syscall_32+0x65/0xf0 arch/x86/entry/common.c:178
do_fast_syscall_32+0x33/0x70 arch/x86/entry/common.c:203
entry_SYSENTER_compat_after_hwframe+0x70/0x82
Below is a similar scenario, the scenario in the syzbot log looks more
complicated than this one, but this patch can fix it.
T21124 p9_read_work
======================== second trans =================================
p9_client_walk
p9_client_rpc
p9_client_prepare_req
p9_tag_alloc
req = kmem_cache_alloc(p9_req_cache, GFP_NOFS);
tag = idr_alloc
<< preempted >>
req->tc.tag = tag;
/* req->[refcount/tag] == uninitialized */
m->rreq = p9_tag_lookup(m->client, m->rc.tag);
/* increments uninitalized refcount */
refcount_set(&req->refcount, 2);
/* cb drops one ref */
p9_client_cb(req)
/* reader thread drops its ref:
request is incorrectly freed */
p9_req_put(req)
/* use after free and ref underflow */
p9_req_put(req)
To fix it, we can initialize the refcount to zero before add to idr.
Link: https://lkml.kernel.org/r/20221201033310.18589-1-schspa@gmail.com
Cc: stable@vger.kernel.org # 6.0+ due to 6cda12864cb0 ("9p: Drop kref usage")
Fixes: 728356dedeff ("9p: Add refcount to p9_req_t")
Reported-by: syzbot+8f1060e2aaf8ca55220b@syzkaller.appspotmail.com
Signed-off-by: Schspa Shi <schspa@gmail.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 64a8f8f7127da228d59a39e2c5e75f86590f90b4 ]
The value of an arithmetic expression "n * id.data" is subject
to possible overflow due to a failure to cast operands to a larger data
type before performing arithmetic. Used macro for multiplication instead
operator for avoiding overflow.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Maxim Korotkov <korotkov.maxim.s@gmail.com>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20221122122901.22294-1-korotkov.maxim.s@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ab0377803dafc58f1e22296708c1c28e309414d6 ]
The caller of del_timer_sync must prevent restarting of the timer, If
we have no this synchronization, there is a small probability that the
cancellation will not be successful.
And syzbot report the fellowing crash:
==================================================================
BUG: KASAN: use-after-free in hlist_add_head include/linux/list.h:929 [inline]
BUG: KASAN: use-after-free in enqueue_timer+0x18/0xa4 kernel/time/timer.c:605
Write at addr f9ff000024df6058 by task syz-fuzzer/2256
Pointer tag: [f9], memory tag: [fe]
CPU: 1 PID: 2256 Comm: syz-fuzzer Not tainted 6.1.0-rc5-syzkaller-00008-
ge01d50cbd6ee #0
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace.part.0+0xe0/0xf0 arch/arm64/kernel/stacktrace.c:156
dump_backtrace arch/arm64/kernel/stacktrace.c:162 [inline]
show_stack+0x18/0x40 arch/arm64/kernel/stacktrace.c:163
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x68/0x84 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:284 [inline]
print_report+0x1a8/0x4a0 mm/kasan/report.c:395
kasan_report+0x94/0xb4 mm/kasan/report.c:495
__do_kernel_fault+0x164/0x1e0 arch/arm64/mm/fault.c:320
do_bad_area arch/arm64/mm/fault.c:473 [inline]
do_tag_check_fault+0x78/0x8c arch/arm64/mm/fault.c:749
do_mem_abort+0x44/0x94 arch/arm64/mm/fault.c:825
el1_abort+0x40/0x60 arch/arm64/kernel/entry-common.c:367
el1h_64_sync_handler+0xd8/0xe4 arch/arm64/kernel/entry-common.c:427
el1h_64_sync+0x64/0x68 arch/arm64/kernel/entry.S:576
hlist_add_head include/linux/list.h:929 [inline]
enqueue_timer+0x18/0xa4 kernel/time/timer.c:605
mod_timer+0x14/0x20 kernel/time/timer.c:1161
mrp_periodic_timer_arm net/802/mrp.c:614 [inline]
mrp_periodic_timer+0xa0/0xc0 net/802/mrp.c:627
call_timer_fn.constprop.0+0x24/0x80 kernel/time/timer.c:1474
expire_timers+0x98/0xc4 kernel/time/timer.c:1519
To fix it, we can introduce a new active flags to make sure the timer will
not restart.
Reported-by: syzbot+6fd64001c20aa99e34a4@syzkaller.appspotmail.com
Signed-off-by: Schspa Shi <schspa@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit cb34b7cf17ecf33499c9298943f85af247abc1e9 ]
syzbot/KCSAN reported that multiple cpus are updating dev->stats.tx_error
concurrently.
This is because sit tunnels are NETIF_F_LLTX, meaning their ndo_start_xmit()
is not protected by a spinlock.
While original KCSAN report was about tx path, rx path has the same issue.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6c1c5097781f563b70a81683ea6fdac21637573b ]
Long standing KCSAN issues are caused by data-race around
some dev->stats changes.
Most performance critical paths already use per-cpu
variables, or per-queue ones.
It is reasonable (and more correct) to use atomic operations
for the slow paths.
This patch adds an union for each field of net_device_stats,
so that we can convert paths that are not yet protected
by a spinlock or a mutex.
netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64
Note that the memcpy() we were using on 64bit arches
had no provision to avoid load-tearing,
while atomic_long_read() is providing the needed protection
at no cost.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 07ec7b502800ba9f7b8b15cb01dd6556bb41aaca ]
syzkaller managed to trigger another case where skb->len == 0
when we enter __dev_queue_xmit:
WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 skb_assert_len include/linux/skbuff.h:2576 [inline]
WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 __dev_queue_xmit+0x2069/0x35e0 net/core/dev.c:4295
Call Trace:
dev_queue_xmit+0x17/0x20 net/core/dev.c:4406
__bpf_tx_skb net/core/filter.c:2115 [inline]
__bpf_redirect_no_mac net/core/filter.c:2140 [inline]
__bpf_redirect+0x5fb/0xda0 net/core/filter.c:2163
____bpf_clone_redirect net/core/filter.c:2447 [inline]
bpf_clone_redirect+0x247/0x390 net/core/filter.c:2419
bpf_prog_48159a89cb4a9a16+0x59/0x5e
bpf_dispatcher_nop_func include/linux/bpf.h:897 [inline]
__bpf_prog_run include/linux/filter.h:596 [inline]
bpf_prog_run include/linux/filter.h:603 [inline]
bpf_test_run+0x46c/0x890 net/bpf/test_run.c:402
bpf_prog_test_run_skb+0xbdc/0x14c0 net/bpf/test_run.c:1170
bpf_prog_test_run+0x345/0x3c0 kernel/bpf/syscall.c:3648
__sys_bpf+0x43a/0x6c0 kernel/bpf/syscall.c:5005
__do_sys_bpf kernel/bpf/syscall.c:5091 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5089 [inline]
__x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5089
do_syscall_64+0x54/0x70 arch/x86/entry/common.c:48
entry_SYSCALL_64_after_hwframe+0x61/0xc6
The reproducer doesn't really reproduce outside of syzkaller
environment, so I'm taking a guess here. It looks like we
do generate correct ETH_HLEN-sized packet, but we redirect
the packet to the tunneling device. Before we do so, we
__skb_pull l2 header and arrive again at skb->len == 0.
Doesn't seem like we can do anything better than having
an explicit check after __skb_pull?
Cc: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+f635e86ec3fa0a37e019@syzkaller.appspotmail.com
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20221027225537.353077-1-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e0c8bccd40fc1c19e1d246c39bcf79e357e1ada3 ]
Changheon Lee reported TCP socket leaks, with a nice repro.
It seems we leak TCP sockets with the following sequence:
1) SOF_TIMESTAMPING_TX_ACK is enabled on the socket.
Each ACK will cook an skb put in error queue, from __skb_tstamp_tx().
__skb_tstamp_tx() is using skb_clone(), unless
SOF_TIMESTAMPING_OPT_TSONLY was also requested.
2) If the application is also using MSG_ZEROCOPY, then we put in the
error queue cloned skbs that had a struct ubuf_info attached to them.
Whenever an struct ubuf_info is allocated, sock_zerocopy_alloc()
does a sock_hold().
As long as the cloned skbs are still in sk_error_queue,
socket refcount is kept elevated.
3) Application closes the socket, while error queue is not empty.
Since tcp_close() no longer purges the socket error queue,
we might end up with a TCP socket with at least one skb in
error queue keeping the socket alive forever.
This bug can be (ab)used to consume all kernel memory
and freeze the host.
We need to purge the error queue, with proper synchronization
against concurrent writers.
Fixes: 24bcbe1cc69f ("net: stream: don't purge sk_error_queue in sk_stream_kill_queues()")
Reported-by: Changheon Lee <darklight2357@icloud.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4feb2c44629e6f9b459b41a5a60491069d346a95 ]
One of the error paths in rxrpc_do_sendmsg() doesn't unlock the call mutex
before returning. Fix it to do this.
Note that this still doesn't get rid of the checker warning:
../net/rxrpc/sendmsg.c:617:5: warning: context imbalance in 'rxrpc_do_sendmsg' - wrong count at exit
I think the interplay between the socket lock and the call's user_mutex may
be too complicated for checker to analyse, especially as
rxrpc_new_client_call_for_sendmsg(), which it calls, returns with the
call's user_mutex if successful but unconditionally drops the socket lock.
Fixes: e754eba685aa ("rxrpc: Provide a cmsg to specify the amount of Tx data for a call")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>