linux

iv/linux

Author	SHA1	Message	Date
John Stultz	295f3fcaa8	pstore: Revert pmsg_lock back to a normal mutex [ Upstream commit 5239a89b06d6b199f133bf0ffea421683187f257 ] This reverts commit 76d62f24db07f22ccf9bc18ca793c27d4ebef721. So while priority inversion on the pmsg_lock is an occasional problem that an rt_mutex would help with, in uses where logging is writing to pmsg heavily from multiple threads, the pmsg_lock can be heavily contended. After this change landed, it was reported that cases where the mutex locking overhead was commonly adding on the order of 10s of usecs delay had suddenly jumped to ~msec delay with rtmutex. It seems the slight differences in the locks under this level of contention causes the normal mutexes to utilize the spinning optimizations, while the rtmutexes end up in the sleeping slowpath (which allows additional threads to pile on trying to take the lock). In this case, it devolves to a worse case senerio where the lock acquisition and scheduling overhead dominates, and each thread is waiting on the order of ~ms to do ~us of work. Obviously, having tons of threads all contending on a single lock for logging is non-optimal, so the proper fix is probably reworking pstore pmsg to have per-cpu buffers so we don't have contention. Additionally, Steven Rostedt has provided some furhter optimizations for rtmutexes that improves the rtmutex spinning path, but at least in my testing, I still see the test tripping into the sleeping path on rtmutexes while utilizing the spinning path with mutexes. But in the short term, lets revert the change to the rt_mutex and go back to normal mutexes to avoid a potentially major performance regression. And we can work on optimizations to both rtmutexes and finer-grained locking for pstore pmsg in the future. Cc: Wei Wang <wvw@google.com> Cc: Midas Chien<midaschieh@google.com> Cc: "Chunhui Li (李春辉)" <chunhui.li@mediatek.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Kees Cook <keescook@chromium.org> Cc: Anton Vorontsov <anton@enomsg.org> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tony Luck <tony.luck@intel.com> Cc: kernel-team@android.com Fixes: 76d62f24db07 ("pstore: Switch pmsg_lock to an rt_mutex to avoid priority inversion") Reported-by: "Chunhui Li (李春辉)" <chunhui.li@mediatek.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230308204043.2061631-1-jstultz@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:31 +09:00
Hans de Goede	2491b999a6	drivers: staging: rtl8723bs: Fix locking in rtw_scan_timeout_handler() [ Upstream commit 3f467036093fedd7e231924327455fc609b5ef02 ] Commit cc7ad0d77b51 ("drivers: staging: rtl8723bs: Fix deadlock in rtw_surveydone_event_callback()") besides fixing the deadlock also modified rtw_scan_timeout_handler() to use spin_[un]lock_irq() instead of spin_[un]lock_bh(). Disabling the IRQs is not necessary since all code taking this lock runs from either user contexts or from softirqs rtw_scan_timeout_handler() is the only function taking pmlmepriv->lock which uses spin_[un]lock_irq() for this. Switch back to spin_[un]lock_bh() to make it consistent with the rest of the code. Fixes: cc7ad0d77b51 ("drivers: staging: rtl8723bs: Fix deadlock in rtw_surveydone_event_callback()") Cc: Duoming Zhou <duoming@zju.edu.cn> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230221145326.7808-2-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:31 +09:00
Hans de Goede	209850f177	drivers: staging: rtl8723bs: Fix locking in _rtw_join_timeout_handler() [ Upstream commit 215792eda008f6a1e7ed9d77fa20d582d22bb114 ] Commit 041879b12ddb ("drivers: staging: rtl8192bs: Fix deadlock in rtw_joinbss_event_prehandle()") besides fixing the deadlock also modified _rtw_join_timeout_handler() to use spin_[un]lock_irq() instead of spin_[un]lock_bh(). _rtw_join_timeout_handler() calls rtw_do_join() which takes pmlmepriv->scanned_queue.lock using spin_[un]lock_bh(). This spin_unlock_bh() call re-enables softirqs which triggers an oops in kernel/softirq.c: __local_bh_enable_ip() when it calls lockdep_assert_irqs_enabled(): [ 244.506087] WARNING: CPU: 2 PID: 0 at kernel/softirq.c:376 __local_bh_enable_ip+0xa6/0x100 ... [ 244.509022] Call Trace: [ 244.509048] <IRQ> [ 244.509100] _rtw_join_timeout_handler+0x134/0x170 [r8723bs] [ 244.509468] ? __pfx__rtw_join_timeout_handler+0x10/0x10 [r8723bs] [ 244.509772] ? __pfx__rtw_join_timeout_handler+0x10/0x10 [r8723bs] [ 244.510076] call_timer_fn+0x95/0x2a0 [ 244.510200] __run_timers.part.0+0x1da/0x2d0 This oops is causd by the switch to spin_[un]lock_irq() which disables the IRQs for the entire duration of _rtw_join_timeout_handler(). Disabling the IRQs is not necessary since all code taking this lock runs from either user contexts or from softirqs, switch back to spin_[un]lock_bh() to fix this. Fixes: 041879b12ddb ("drivers: staging: rtl8192bs: Fix deadlock in rtw_joinbss_event_prehandle()") Cc: Duoming Zhou <duoming@zju.edu.cn> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230221145326.7808-1-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:31 +09:00
Randy Dunlap	1ffb2ca650	ipmi: ASPEED_BT_IPMI_BMC: select REGMAP_MMIO instead of depending on it [ Upstream commit 2a587b9ad052e7e92e508aea90c1e2ae433c1908 ] REGMAP is a hidden (not user visible) symbol. Users cannot set it directly thru "make *config", so drivers should select it instead of depending on it if they need it. Consistently using "select" or "depends on" can also help reduce Kconfig circular dependency issues. Therefore, change the use of "depends on REGMAP_MMIO" to "select REGMAP_MMIO", which will also set REGMAP. Fixes: eb994594bc22 ("ipmi: bt-bmc: Use a regmap for register access") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Andrew Jeffery <andrew@aj.id.au> Cc: Corey Minyard <minyard@acm.org> Cc: openipmi-developer@lists.sourceforge.net Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Message-Id: <20230226053953.4681-2-rdunlap@infradead.org> Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:31 +09:00
Kuniyuki Iwashima	43e4197dd5	tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. [ Upstream commit 50749f2dd6854a41830996ad302aef2ffaf011d8 ] syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY skbs. We can reproduce the problem with these sequences: sk = socket(AF_INET, SOCK_DGRAM, 0) sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE) sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1) sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53)) sk.close() sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct ubuf_info_msgzc indirectly holds a refcnt of the socket. When the skb is sent, __skb_tstamp_tx() clones it and puts the clone into the socket's error queue with the TX timestamp. When the original skb is received locally, skb_copy_ubufs() calls skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt. This additional count is decremented while freeing the skb, but struct ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is not called. The last refcnt is not released unless we retrieve the TX timestamped skb by recvmsg(). Since we clear the error queue in inet_sock_destruct() after the socket's refcnt reaches 0, there is a circular dependency. If we close() the socket holding such skbs, we never call sock_put() and leak the count, sk, and skb. TCP has the same problem, and commit e0c8bccd40fc ("net: stream: purge sk_error_queue in sk_stream_kill_queues()") tried to fix it by calling skb_queue_purge() during close(). However, there is a small chance that skb queued in a qdisc or device could be put into the error queue after the skb_queue_purge() call. In __skb_tstamp_tx(), the cloned skb should not have a reference to the ubuf to remove the circular dependency, but skb_clone() does not call skb_copy_ubufs() for zerocopy skb. So, we need to call skb_orphan_frags_rx() for the cloned skb to call skb_copy_ubufs(). [0]: BUG: memory leak unreferenced object 0xffff88800c6d2d00 (size 1152): comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00 ................ 02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............ backtrace: [<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024 [<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083 [<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline] [<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245 [<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515 [<00000000b9b11231>] sock_create net/socket.c:1566 [inline] [<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline] [<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline] [<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636 [<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline] [<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline] [<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647 [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd BUG: memory leak unreferenced object 0xffff888017633a00 (size 240): comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff .........-m..... backtrace: [<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497 [<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline] [<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596 [<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline] [<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370 [<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037 [<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652 [<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253 [<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819 [<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline] [<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734 [<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117 [<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline] [<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline] [<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125 [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY") Fixes: b5947e5d1e71 ("udp: msg_zerocopy") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:31 +09:00
Gencen Gan	1d2f799c16	net: amd: Fix link leak when verifying config failed [ Upstream commit d325c34d9e7e38d371c0a299d415e9b07f66a1fb ] After failing to verify configuration, it returns directly without releasing link, which may cause memory leak. Paolo Abeni thinks that the whole code of this driver is quite "suboptimal" and looks unmainatained since at least ~15y, so he suggests that we could simply remove the whole driver, please take it into consideration. Simon Horman suggests that the fix label should be set to "Linux-2.6.12-rc2" considering that the problem has existed since the driver was introduced and the commit above doesn't seem to exist in net/net-next. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Gan Gecen <gangecen@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:31 +09:00
Kuniyuki Iwashima	5d6e5c054e	netlink: Use copy_to_user() for optval in netlink_getsockopt(). [ Upstream commit d913d32cc2707e9cd24fe6fa6d7d470e9c728980 ] Brad Spencer provided a detailed report [0] that when calling getsockopt() for AF_NETLINK, some SOL_NETLINK options set only 1 byte even though such options require at least sizeof(int) as length. The options return a flag value that fits into 1 byte, but such behaviour confuses users who do not initialise the variable before calling getsockopt() and do not strictly check the returned value as char. Currently, netlink_getsockopt() uses put_user() to copy data to optlen and optval, but put_user() casts the data based on the pointer, char *optval. As a result, only 1 byte is set to optval. To avoid this behaviour, we need to use copy_to_user() or cast optval for put_user(). Note that this changes the behaviour on big-endian systems, but we document that the size of optval is int in the man page. $ man 7 netlink ... Socket options To set or get a netlink socket option, call getsockopt(2) to read or setsockopt(2) to write the option with the option level argument set to SOL_NETLINK. Unless otherwise noted, optval is a pointer to an int. Fixes: 9a4595bc7e67 ("[NETLINK]: Add set/getsockopt options to support more than 32 groups") Fixes: be0c22a46cfb ("netlink: add NETLINK_BROADCAST_ERROR socket option") Fixes: 38938bfe3489 ("netlink: add NETLINK_NO_ENOBUFS socket flag") Fixes: 0a6a3a23ea6e ("netlink: add NETLINK_CAP_ACK socket option") Fixes: 2d4bc93368f5 ("netlink: extended ACK reporting") Fixes: 89d35528d17d ("netlink: Add new socket option to enable strict checking on dumps") Reported-by: Brad Spencer <bspencer@blackberry.com> Link: https://lore.kernel.org/netdev/ZD7VkNWFfp22kTDt@datsun.rim.net/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Link: https://lore.kernel.org/r/20230421185255.94606-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:31 +09:00
Liu Jian	a789192f36	Revert "Bluetooth: btsdio: fix use after free bug in btsdio_remove due to unfinished work" [ Upstream commit db2bf510bd5d57f064d9e1db395ed86a08320c54 ] This reverts commit 1e9ac114c4428fdb7ff4635b45d4f46017e8916f. This patch introduces a possible null-ptr-def problem. Revert it. And the fixed bug by this patch have resolved by commit 73f7b171b7c0 ("Bluetooth: btsdio: fix use after free bug in btsdio_remove due to race condition"). Fixes: 1e9ac114c442 ("Bluetooth: btsdio: fix use after free bug in btsdio_remove due to unfinished work") Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:31 +09:00
Ziyang Xuan	a54ec573d9	ipv4: Fix potential uninit variable access bug in __ip_make_skb() [ Upstream commit 99e5acae193e369b71217efe6f1dad42f3f18815 ] Like commit ea30388baebc ("ipv6: Fix an uninit variable access bug in __ip6_make_skb()"). icmphdr does not in skb linear region under the scenario of SOCK_RAW socket. Access icmp_hdr(skb)->type directly will trigger the uninit variable access bug. Use a local variable icmp_type to carry the correct value in different scenarios. Fixes: 96793b482540 ("[IPV4]: Add ICMPMsgStats MIB (RFC 4293)") Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:31 +09:00
Davide Caratti	d0b43125ec	net/sched: sch_fq: fix integer overflow of "credit" [ Upstream commit 7041101ff6c3073fd8f2e99920f535b111c929cb ] if sch_fq is configured with "initial quantum" having values greater than INT_MAX, the first assignment of "credit" does signed integer overflow to a very negative value. In this situation, the syzkaller script provided by Cristoph triggers the CPU soft-lockup warning even with few sockets. It's not an infinite loop, but "credit" wasn't probably meant to be minus 2Gb for each new flow. Capping "initial quantum" to INT_MAX proved to fix the issue. v2: validation of "initial quantum" is done in fq_policy, instead of open coding in fq_change() _ suggested by Jakub Kicinski Reported-by: Christoph Paasch <cpaasch@apple.com> Link: https://github.com/multipath-tcp/mptcp_net-next/issues/377 Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler") Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/r/7b3a3c7e36d03068707a021760a194a8eb5ad41a.1682002300.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:31 +09:00
Florian Westphal	7a45b4e1c8	netfilter: nf_tables: don't write table validation state without mutex [ Upstream commit 9a32e9850686599ed194ccdceb6cd3dd56b2d9b9 ] The ->cleanup callback needs to be removed, this doesn't work anymore as the transaction mutex is already released in the ->abort function. Just do it after a successful validation pass, this either happens from commit or abort phases where transaction mutex is held. Fixes: f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Stanislav Fomichev	8913abddad	bpf: Don't EFAULT for getsockopt with optval=NULL [ Upstream commit 00e74ae0863827d944e36e56a4ce1e77e50edb91 ] Some socket options do getsockopt with optval=NULL to estimate the size of the final buffer (which is returned via optlen). This breaks BPF getsockopt assumptions about permitted optval buffer size. Let's enforce these assumptions only when non-NULL optval is provided. Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Reported-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7 Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Yan Wang	77f245ce05	net: stmmac:fix system hang when setting up tag_8021q VLAN for DSA ports [ Upstream commit 35226750f7ab9d49140d95bc7d38a2a9b0f4fdfc ] The system hang because of dsa_tag_8021q_port_setup()-> stmmac_vlan_rx_add_vid(). I found in stmmac_drv_probe() that cailing pm_runtime_put() disabled the clock. First, when the kernel is compiled with CONFIG_PM=y,The stmmac's resume/suspend is active. Secondly,stmmac as DSA master,the dsa_tag_8021q_port_setup() function will callback stmmac_vlan_rx_add_vid when DSA dirver starts. However, The system is hanged for the stmmac_vlan_rx_add_vid() accesses its registers after stmmac's clock is closed. I would suggest adding the pm_runtime_resume_and_get() to the stmmac_vlan_rx_add_vid().This guarantees that resuming clock output while in use. Fixes: b3dcb3127786 ("net: stmmac: correct clocks enabled in stmmac_vlan_rx_kill_vid()") Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Yan Wang <rk.code@outlook.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Chris Mi	a9e96eef82	net/mlx5: E-switch, Don't destroy indirect table in split rule [ Upstream commit 4c8189302567f75099a336b0efcff8291ec86ff4 ] Source port rewrite (forward to ovs internal port or statck device) isn't supported in the rule of split action. So there is no indirect table in split rule. The cited commit destroyes indirect table in split rule. The indirect table for other rules will be destroyed wrongly. It will cause traffic loss. Fix it by removing the destroy function in split rule. And also remove the destroy function in error flow. Fixes: 10742efc20a4 ("net/mlx5e: VF tunnel TX traffic offloading") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Joe Damato	05cf6f353d	ixgbe: Enable setting RSS table to default values [ Upstream commit e85d3d55875f7a1079edfbc4e4e98d6f8aea9ac7 ] ethtool uses `ETHTOOL_GRXRINGS` to compute how many queues are supported by RSS. The driver should return the smaller of either: - The maximum number of RSS queues the device supports, OR - The number of RX queues configured Prior to this change, running `ethtool -X $iface default` fails if the number of queues configured is larger than the number supported by RSS, even though changing the queue count correctly resets the flowhash to use all supported queues. Other drivers (for example, i40e) will succeed but the flow hash will reset to support the maximum number of queues supported by RSS, even if that amount is smaller than the configured amount. Prior to this change: $ sudo ethtool -L eth1 combined 20 $ sudo ethtool -x eth1 RX flow hash indirection table for eth1 with 20 RX ring(s): 0: 0 1 2 3 4 5 6 7 8: 8 9 10 11 12 13 14 15 16: 0 1 2 3 4 5 6 7 24: 8 9 10 11 12 13 14 15 32: 0 1 2 3 4 5 6 7 ... You can see that the flowhash was correctly set to use the maximum number of queues supported by the driver (16). However, asking the NIC to reset to "default" fails: $ sudo ethtool -X eth1 default Cannot set RX flow hash configuration: Invalid argument After this change, the flowhash can be reset to default which will use all of the available RSS queues (16) or the configured queue count, whichever is smaller. Starting with eth1 which has 10 queues and a flowhash distributing to all 10 queues: $ sudo ethtool -x eth1 RX flow hash indirection table for eth1 with 10 RX ring(s): 0: 0 1 2 3 4 5 6 7 8: 8 9 0 1 2 3 4 5 16: 6 7 8 9 0 1 2 3 ... Increasing the queue count to 48 resets the flowhash to distribute to 16 queues, as it did before this patch: $ sudo ethtool -L eth1 combined 48 $ sudo ethtool -x eth1 RX flow hash indirection table for eth1 with 16 RX ring(s): 0: 0 1 2 3 4 5 6 7 8: 8 9 10 11 12 13 14 15 16: 0 1 2 3 4 5 6 7 ... Due to the other bugfix in this series, the flowhash can be set to use queues 0-5: $ sudo ethtool -X eth1 equal 5 $ sudo ethtool -x eth1 RX flow hash indirection table for eth1 with 16 RX ring(s): 0: 0 1 2 3 4 0 1 2 8: 3 4 0 1 2 3 4 0 16: 1 2 3 4 0 1 2 3 ... Due to this bugfix, the flowhash can be reset to default and use 16 queues: $ sudo ethtool -X eth1 default $ sudo ethtool -x eth1 RX flow hash indirection table for eth1 with 16 RX ring(s): 0: 0 1 2 3 4 5 6 7 8: 8 9 10 11 12 13 14 15 16: 0 1 2 3 4 5 6 7 ... Fixes: 91cd94bfe4f0 ("ixgbe: add basic support for setting and getting nfc controls") Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Joe Damato	624b73f776	ixgbe: Allow flow hash to be set via ethtool [ Upstream commit 4f3ed1293feb9502dc254b05802faf1ad3317ac6 ] ixgbe currently returns `EINVAL` whenever the flowhash it set by ethtool because the ethtool code in the kernel passes a non-zero value for hfunc that ixgbe should allow. When ethtool is called with `ETHTOOL_SRXFHINDIR`, `ethtool_set_rxfh_indir` will call ixgbe's set_rxfh function with `ETH_RSS_HASH_NO_CHANGE`. This value should be accepted. When ethtool is called with `ETHTOOL_SRSSH`, `ethtool_set_rxfh` will call ixgbe's set_rxfh function with `rxfh.hfunc`, which appears to be hardcoded in ixgbe to always be `ETH_RSS_HASH_TOP`. This value should also be accepted. Before this patch: $ sudo ethtool -L eth1 combined 10 $ sudo ethtool -X eth1 default Cannot set RX flow hash configuration: Invalid argument After this patch: $ sudo ethtool -L eth1 combined 10 $ sudo ethtool -X eth1 default $ sudo ethtool -x eth1 RX flow hash indirection table for eth1 with 10 RX ring(s): 0: 0 1 2 3 4 5 6 7 8: 8 9 0 1 2 3 4 5 16: 6 7 8 9 0 1 2 3 24: 4 5 6 7 8 9 0 1 ... Fixes: 1c7cf0784e4d ("ixgbe: support for ethtool set_rxfh") Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Johannes Berg	e302e9ca14	wifi: iwlwifi: fw: fix memory leak in debugfs [ Upstream commit 3d90d2f4a018fe8cfd65068bc6350b6222be4852 ] Fix a memory leak that occurs when reading the fw_info file all the way, since we return NULL indicating no more data, but don't free the status tracking object. Fixes: 36dfe9ac6e8b ("iwlwifi: dump api version in yaml format") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230418122405.239e501b3b8d.I4268f87809ef91209cbcd748eee0863195e70fa2@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Johannes Berg	53b3b1f563	wifi: iwlwifi: mvm: check firmware response size [ Upstream commit 13513cec93ac9902d0b896976d8bab3758a9881c ] Check the firmware response size for responses to the memory read/write command in debugfs before using it. Fixes: 2b55f43f8e47 ("iwlwifi: mvm: Add mem debugfs entry") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230417113648.0d56fcaf68ee.I70e9571f3ed7263929b04f8fabad23c9b999e4ea@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Quan Zhou	aa11a89445	wifi: mt76: mt7921e: improve reliability of dma reset [ Upstream commit 87714bf6ed1589813e473db5471e6e9857755764 ] The hardware team has advised the driver that it is necessary to first put WFDMA into an idle state before resetting the WFDMA. Otherwise, the WFDMA may enter an unknown state where it cannot be polled with the right state successfully. To ensure that the DMA can work properly while a stressful cold reboot test was being made, we have reordered the programming sequence in the driver based on the hardware team's guidance. The patch would modify the WFDMA disabling flow from "DMA reset -> disabling DMASHDL -> disabling WFDMA -> polling and waiting until DMA idle" to "disabling WFDMA -> polling and waiting for DMA idle -> disabling DMASHDL -> DMA reset. Where he polling and waiting until WFDMA is idle is coordinated with the operation of disabling WFDMA. Even while WFDMA is being disabled, it can still handle Tx/Rx requests. The additional polling allows sufficient time for WFDMA to process the last T/Rx request. When the idle state of WFDMA is reached, it is a reliable indication that DMASHDL is also idle to ensure it is safe to disable it and perform the DMA reset. Fixes: 0a1059d0f060 ("mt76: mt7921: move mt7921_dma_reset in dma.c") Co-developed-by: Sean Wang <sean.wang@mediatek.com> Signed-off-by: Sean Wang <sean.wang@mediatek.com> Co-developed-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Deren Wu <deren.wu@mediatek.com> Co-developed-by: Wang Zhao <wang.zhao@mediatek.com> Signed-off-by: Wang Zhao <wang.zhao@mediatek.com> Signed-off-by: Quan Zhou <quan.zhou@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Ming Yen Hsieh	f8923ad9dd	wifi: mt76: fix 6GHz high channel not be scanned [ Upstream commit 23792cedaff02351b57bddd8957fc917fa88f2e0 ] mt76 scan command only support 64 channels currently. If the channel count is larger than 64(for 2+5+6GHz), some channels will not be scanned. Hence change the scan type to full channel scan in case of the command cannot include proper list for chip. Fixes: 399090ef9605 ("mt76: mt76_connac: move hw_scan and sched_scan routine in mt76_connac_mcu module") Reported-by: Ben Greear <greearb@candelatech.com> Tested-by: Isaac Konikoff <konikofi@candelatech.com> Signed-off-by: Ming Yen Hsieh <mingyen.hsieh@mediatek.com> Signed-off-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Quan Zhou	613b51663f	wifi: mt76: mt7921e: fix probe timeout after reboot [ Upstream commit c397fc1e6365a2a9e5540a85b2c1d4ea412aa0e2 ] In system warm reboot scene, due to the polling timeout(now 1000us) is too short to wait dma idle in time, it may make driver probe fail with error code -ETIMEDOUT. Meanwhile, we also found the dma may take around 70ms to enter idle state. Change the polling idle timeout to 100ms to avoid the probabilistic probe fail. Tested pass with 5000 times warm reboot on x86 platform. [4.477496] pci 0000:01:00.0: attach allowed to drvr mt7921e [internal device] [4.478306] mt7921e 0000:01:00.0: ASIC revision: 79610010 [4.480063] mt7921e: probe of 0000:01:00.0 failed with error -110 Fixes: 0a1059d0f060 ("mt76: mt7921: move mt7921_dma_reset in dma.c") Signed-off-by: Quan Zhou <quan.zhou@mediatek.com> Signed-off-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Deren Wu	5279aaf9f5	wifi: mt76: add flexible polling wait-interval support [ Upstream commit 35effe6c0c24adcf0f732bb1c3d75573d4c88e63 ] The default waiting unit is 10ms and the value is too much for data path related control. Provide a new API mt76_poll_msec_tick() to support different cases, such as 1ms polling waiting kick. Reviewed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Stable-dep-of: c397fc1e6365 ("wifi: mt76: mt7921e: fix probe timeout after reboot") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Kang Chen	ac9fec5b56	wifi: mt76: handle failure of vzalloc in mt7615_coredump_work [ Upstream commit 9e47dd9f64a47ae00ca0123017584c37209ee900 ] vzalloc may fails, dump might be null and will cause illegal address access later. Link: https://lore.kernel.org/all/Y%2Fy5Asxw3T3m4jCw@lore-desk Fixes: d2bf7959d9c0 ("mt76: mt7663: introduce coredump support") Signed-off-by: Kang Chen <void0red@gmail.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Emmanuel Grumbach	210e6d01cc	wifi: iwlwifi: make the loop for card preparation effective [ Upstream commit 28965ec0b5d9112585f725660e2ff13218505ace ] Since we didn't reset t to 0, only the first iteration of the loop did checked the ready bit several times. From the second iteration and on, we just tested the bit once and continued to the next iteration. Reported-and-tested-by: Lorenzo Zolfanelli <lorenzo@zolfa.nl> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216452 Fixes: 289e5501c314 ("iwlwifi: fix the preparation of the card") Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230416154301.615b683ab9c8.Ic52c3229d3345b0064fa34263293db095d88daf8@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:30 +09:00
Jan Kara	dff2a7b330	jdb2: Don't refuse invalidation of already invalidated buffers [ Upstream commit bd159398a2d2234de07d310132865706964aaaa7 ] When invalidating buffers under the partial tail page, jbd2_journal_invalidate_folio() returns -EBUSY if the buffer is part of the committing transaction as we cannot safely modify buffer state. However if the buffer is already invalidated (due to previous invalidation attempts from ext4_wait_for_tail_page_commit()), there's nothing to do and there's no point in returning -EBUSY. This fixes occasional warnings from ext4_journalled_invalidate_folio() triggered by generic/051 fstest when blocksize < pagesize. Fixes: 53e872681fed ("ext4: fix deadlock in journal_unmap_buffer()") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Tom Rix	358317ad9c	wifi: iwlwifi: fw: move memset before early return [ Upstream commit 8ce437dd5b2e4adef13aa4ecce07392f9966b1ab ] Clang static analysis reports this representative issue dbg.c:1455:6: warning: Branch condition evaluates to a garbage value if (!rxf_data.size) ^~~~~~~~~~~~~~ This check depends on iwl_ini_get_rxf_data() to clear rxf_data but the function can return early without doing the clear. So move the memset before the early return. Fixes: cc9b6012d34b ("iwlwifi: yoyo: use hweight_long instead of bit manipulating") Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230414130637.872a7175f1ff.I33802a77a91998276992b088fbe25f61c87c33ac@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Tom Rix	cccf85e047	wifi: iwlwifi: mvm: initialize seq variable [ Upstream commit 11e94d2bcd88dea5d9ce99555b6b172f5232d3e2 ] Clang static analysis reports this issue d3.c:567:22: warning: The left operand of '>' is a garbage value if (seq.tkip.iv32 > cur_rx_iv32) ~~~~~~~~~~~~~ ^ seq is never initialized. Call ieee80211_get_key_rx_seq() to initialize seq. Fixes: 0419e5e672d6 ("iwlwifi: mvm: d3: separate TKIP data from key iteration") Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230414130637.6dd372f84f93.If1f708c90e6424a935b4eba3917dfb7582e0dd0a@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Daniel Gabay	b3cecbb257	wifi: iwlwifi: yoyo: Fix possible division by zero [ Upstream commit ba30415118eee374a08b39a0460a1d1e52c24a25 ] Don't allow buffer allocation TLV with zero req_size since it leads later to division by zero in iwl_dbg_tlv_alloc_fragments(). Also, NPK/SRAM locations are allowed to have zero buffer req_size, don't discard them. Fixes: a9248de42464 ("iwlwifi: dbg_ini: add TLV allocation new API support") Signed-off-by: Daniel Gabay <daniel.gabay@intel.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230413213309.5d6688ed74d8.I5c2f3a882b50698b708d54f4524dc5bdf11e3d32@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Daniel Gabay	4636c35b7e	wifi: iwlwifi: yoyo: skip dump correctly on hw error [ Upstream commit 11195ab0d6f3202cf7af1a4c69570f59c377d8ad ] When NIC is in a bad state, reading data will return 28 bits as 0xa5a5a5a and the lowest 4 bits are not fixed value. Mask these bits in a few places to skip the dump correctly. Fixes: 89639e06d0f3 ("iwlwifi: yoyo: support for new DBGI_SRAM region") Signed-off-by: Daniel Gabay <daniel.gabay@intel.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230413213309.df6c0663179d.I36d8487b2419c6fefa65e5514855d94327c3b1eb@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Yu Kuai	34222897e0	md/raid10: don't call bio_start_io_acct twice for bio which experienced read error [ Upstream commit 7cddb055bfda5f7b0be931e8ea750fc28bc18a27 ] handle_read_error() will resumit r10_bio by raid10_read_request(), which will call bio_start_io_acct() again, while bio_end_io_acct() will only be called once. Fix the problem by don't account io again from handle_read_error(). Fixes: 528bc2cf2fcc ("md/raid10: enable io accounting") Suggested-by: Song Liu <song@kernel.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230314012258.2395894-1-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Yu Kuai	d6cfcf98b8	md/raid10: fix memleak of md thread [ Upstream commit f0ddb83da3cbbf8a1f9087a642c448ff52ee9abd ] In raid10_run(), if setup_conf() succeed and raid10_run() failed before setting 'mddev->thread', then in the error path 'conf->thread' is not freed. Fix the problem by setting 'mddev->thread' right after setup_conf(). Fixes: 43a521238aca ("md-cluster: choose correct label when clustered layout is not supported") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230310073855.1337560-7-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Yu Kuai	7f673fa34c	md/raid10: fix memleak for 'conf->bio_split' [ Upstream commit c9ac2acde53f5385de185bccf6aaa91cf9ac1541 ] In the error path of raid10_run(), 'conf' need be freed, however, 'conf->bio_split' is missed and memory will be leaked. Since there are 3 places to free 'conf', factor out a helper to fix the problem. Fixes: fc9977dd069e ("md/raid10: simplify the splitting of requests.") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230310073855.1337560-6-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Yu Kuai	8d09065802	md/raid10: fix leak of 'r10bio->remaining' for recovery [ Upstream commit 26208a7cffd0c7cbf14237ccd20c7270b3ffeb7e ] raid10_sync_request() will add 'r10bio->remaining' for both rdev and replacement rdev. However, if the read io fails, recovery_request_write() returns without issuing the write io, in this case, end_sync_request() is only called once and 'remaining' is leaked, cause an io hang. Fix the problem by decreasing 'remaining' according to if 'bio' and 'repl_bio' is valid. Fixes: 24afd80d99f8 ("md/raid10: handle recovery of replacement devices.") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230310073855.1337560-5-yukuai1@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Li Nan	901b4918fa	md/raid10: fix task hung in raid10d [ Upstream commit 72c215ed8731c88b2d7e09afc51fffc207ae47b8 ] commit fe630de009d0 ("md/raid10: avoid deadlock on recovery.") allowed normal io and sync io to exist at the same time. Task hung will occur as below: T1 T2 T3 T4 raid10d handle_read_error allow_barrier conf->nr_pending-- -> 0 //submit sync io raid10_sync_request raise_barrier ->will not be blocked ... //submit to drivers raid10_read_request wait_barrier conf->nr_pending++ -> 1 //retry read fail raid10_end_read_request reschedule_retry add to retry_list conf->nr_queued++ -> 1 //sync io fail end_sync_read __end_sync_read reschedule_retry add to retry_list conf->nr_queued++ -> 2 ... handle_read_error get form retry_list conf->nr_queued-- freeze_array wait nr_pending == nr_queued+1 ->1 ->2 //task hung retry read and sync io will be added to retry_list(nr_queued->2) if they fails. raid10d() called handle_read_error() and hung in freeze_array(). nr_queued will not decrease because raid10d is blocked, nr_pending will not increase because conf->barrier is not released. Fix it by moving allow_barrier() after raid10_read_request(). raise_barrier() will wait for nr_waiting to become 0. Therefore, sync io and regular io will not be issued at the same time. Also remove the check of nr_queued in stop_waiting_barrier. It can be 0 but don't need to be blocking. Remove the check for MD_RECOVERY_RUNNING as the check is redundent. Fixes: fe630de009d0 ("md/raid10: avoid deadlock on recovery.") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230222041000.3341651-2-linan666@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Yu Kuai	fc04998351	md/raid10: factor out code from wait_barrier() to stop_waiting_barrier() [ Upstream commit ed2e063f92c44c891ccd883e289dde6ca870edcc ] Currently the nasty condition in wait_barrier() is hard to read. This patch factors out the condition into a function. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Stable-dep-of: 72c215ed8731 ("md/raid10: fix task hung in raid10d") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Vishal Verma	39db562b3f	md: raid10 add nowait support [ Upstream commit c9aa889b035fca4598ae985a0f0c76ebbb547ad2 ] This adds nowait support to the RAID10 driver. Very similar to raid1 driver changes. It makes RAID10 driver return with EAGAIN for situations where it could wait for eg: - Waiting for the barrier, - Reshape operation, - Discard operation. wait_barrier() and regular_request_wait() fn are modified to return bool to support error for wait barriers. They returns true in case of wait or if wait is not required and returns false if wait was required but not performed to support nowait. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org> Stable-dep-of: 72c215ed8731 ("md/raid10: fix task hung in raid10d") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Mariusz Tkaczyk	74af08efa5	md: drop queue limitation for RAID1 and RAID10 [ Upstream commit a92ce0feffeed8b91f02dac85246d1205e4a64b6 ] As suggested by Neil Brown[1], this limitation seems to be deprecated. With plugging in use, writes are processed behind the raid thread and conf->pending_count is not increased. This limitation occurs only if caller doesn't use plugs. It can be avoided and often it is (with plugging). There are no reports that queue is growing to enormous size so remove queue limitation for non-plugged IOs too. [1] https://lore.kernel.org/linux-raid/162496301481.7211.18031090130574610495@noble.neil.brown.name Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org> Stable-dep-of: 72c215ed8731 ("md/raid10: fix task hung in raid10d") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Daniel Borkmann	337d1d88be	bpf, sockmap: Revert buggy deadlock fix in the sockhash and sockmap [ Upstream commit 8c5c2a4898e3d6bad86e29d471e023c8a19ba799 ] syzbot reported a splat and bisected it to recent commit ed17aa92dc56 ("bpf, sockmap: fix deadlocks in the sockhash and sockmap"): [...] WARNING: CPU: 1 PID: 9280 at kernel/softirq.c:376 __local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376 Modules linked in: CPU: 1 PID: 9280 Comm: syz-executor.1 Not tainted 6.2.0-syzkaller-13249-gd319f344561d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023 RIP: 0010:__local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376 [...] Call Trace: <TASK> spin_unlock_bh include/linux/spinlock.h:395 [inline] sock_map_del_link+0x2ea/0x510 net/core/sock_map.c:165 sock_map_unref+0xb0/0x1d0 net/core/sock_map.c:184 sock_hash_delete_elem+0x1ec/0x2a0 net/core/sock_map.c:945 map_delete_elem kernel/bpf/syscall.c:1536 [inline] __sys_bpf+0x2edc/0x53e0 kernel/bpf/syscall.c:5053 __do_sys_bpf kernel/bpf/syscall.c:5166 [inline] __se_sys_bpf kernel/bpf/syscall.c:5164 [inline] __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5164 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fe8f7c8c169 </TASK> [...] Revert for now until we have a proper solution. Fixes: ed17aa92dc56 ("bpf, sockmap: fix deadlocks in the sockhash and sockmap") Reported-by: syzbot+49f6cef45247ff249498@syzkaller.appspotmail.com Cc: Hsin-Wei Hung <hsinweih@uci.edu> Cc: Xin Liu <liuxin350@huawei.com> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/000000000000f1db9605f939720e@google.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:29 +09:00
Song Liu	12e70c6f4e	selftests/bpf: Fix leaked bpf_link in get_stackid_cannot_attach [ Upstream commit c1e07a80cf23d3a6e96172bc9a73bfa912a9fcbc ] skel->links.oncpu is leaked in one case. This causes test perf_branches fails when it runs after get_stackid_cannot_attach: ./test_progs -t get_stackid_cannot_attach,perf_branches 84 get_stackid_cannot_attach:OK test_perf_branches_common:PASS:test_perf_branches_load 0 nsec test_perf_branches_common:PASS:attach_perf_event 0 nsec test_perf_branches_common:PASS:set_affinity 0 nsec check_good_sample:FAIL:output not valid no valid sample from prog 146/1 perf_branches/perf_branches_hw:FAIL 146/2 perf_branches/perf_branches_no_hw:OK 146 perf_branches:FAIL All error logs: test_perf_branches_common:PASS:test_perf_branches_load 0 nsec test_perf_branches_common:PASS:attach_perf_event 0 nsec test_perf_branches_common:PASS:set_affinity 0 nsec check_good_sample:FAIL:output not valid no valid sample from prog 146/1 perf_branches/perf_branches_hw:FAIL 146 perf_branches:FAIL Summary: 1/1 PASSED, 0 SKIPPED, 1 FAILED Fix this by adding the missing bpf_link__destroy(). Fixes: 346938e9380c ("selftests/bpf: Add get_stackid_cannot_attach") Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230412210423.900851-3-song@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00
Ming Lei	103a427542	nvme-fcloop: fix "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage" [ Upstream commit 4f86a6ff6fbd891232dda3ca97fd1b9630b59809 ] fcloop_fcp_op() could be called from flush request's ->end_io(flush_end_io) in which the spinlock of fq->mq_flush_lock is grabbed with irq saved/disabled. So fcloop_fcp_op() can't call spin_unlock_irq(&tfcp_req->reqlock) simply which enables irq unconditionally. Fixes the warning by switching to spin_lock_irqsave()/spin_unlock_irqrestore() Fixes: c38dbbfab1bc ("nvme-fcloop: fix inconsistent lock state warnings") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00
Keith Busch	9fe41e6482	nvme: fix async event trace event [ Upstream commit 6622b76fe922b94189499a90ccdb714a4a8d0773 ] Mixing AER Event Type and Event Info has masking clashes. Just print the event type, but also include the event info of the AER result in the trace. Fixes: 09bd1ff4b15143b ("nvme-core: add async event trace helper") Reported-by: Nate Thornton <nate.thornton@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Minwoo Im <minwoo.im@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00
Michael Kelley	13475e6391	nvme: handle the persistent internal error AER [ Upstream commit 2c61c97fb12b806e1c8eb15f04c277ad097ec95e ] In the NVM Express Revision 1.4 spec, Figure 145 describes possible values for an AER with event type "Error" (value 000b). For a Persistent Internal Error (value 03h), the host should perform a controller reset. Add support for this error using code that already exists for doing a controller reset. As part of this support, introduce two utility functions for parsing the AER type and subtype. This new support was tested in a lab environment where we can generate the persistent internal error on demand, and observe both the Linux side and NVMe controller side to see that the controller reset has been done. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 6622b76fe922 ("nvme: fix async event trace event") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00
Damien Le Moal	30b9073583	nvmet: fix I/O Command Set specific Identify Controller [ Upstream commit a5a6ab0950b46ab1ef4a5c83c80234018b81b38a ] For an identify command with cns set to NVME_ID_CNS_CS_CTRL, the NVMe 2.0 specification states that: If the I/O Command Set specified by the CSI field does not have an Identify Controller data structure, then the controller shall return a zero filled data structure. If the host requests a data structure for an I/O Command Set that the controller does not support, the controller shall abort the command with a status code of Invalid Field in Command. However, the current implementation of this identify command in nvmet_execute_identify() only handles the ZNS command set, returning an error for the NVM command set, which is not compliant with the specifications as we do support this command set. Fix this by: 1) Renaming nvmet_execute_identify_cns_cs_ctrl() to nvmet_execute_identify_ctrl_zns() to continue handling the ZNS command set as is. 2) Introduce a nvmet_execute_identify_ctrl_ns() helper to handle the NVM command set, returning a zero filled nvme_id_ctrl_nvm data structure. 3) Modify nvmet_execute_identify() to call these helpers based on the csi specified, returning an error for unsupported command sets. Fixes: aaf2e048af27 ("nvmet: add ZBD over ZNS backend support") Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00
Damien Le Moal	42bcbc2a90	nvmet: fix Identify Active Namespace ID list handling [ Upstream commit 97416f67d55fb8b866ff1815ca7ef26b6dfa6a5e ] The identify command with cns set to NVME_ID_CNS_NS_ACTIVE_LIST does not depend on the command set. The execution of this command should thus not look at the csi field specified in the command. Simplify nvmet_execute_identify() to directly call nvmet_execute_identify_nslist() without the csi switch-case. Fixes: ab5d0b38c047 ("nvmet: add Command Set Identifier support") Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00
Damien Le Moal	92cf81746e	nvmet: fix Identify Controller handling [ Upstream commit 62904b3b333e7f3c0f879dc3513295eee5765c9f ] The identify command with cns set to NVME_ID_CNS_CTRL does not depend on the command set. The execution of this command should thus not look at the csi specified in the command. Simplify nvmet_execute_identify() to directly call nvmet_execute_identify_ctrl() without the csi switch-case. Fixes: ab5d0b38c047 ("nvmet: add Command Set Identifier support") Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00
Damien Le Moal	ac86d59eaa	nvmet: fix Identify Namespace handling [ Upstream commit 8c098aa00118c35108f0c19bd3cdc45e11574948 ] The identify command with cns set to NVME_ID_CNS_NS does not directly depend on the command set. The NVMe specifications is rather confusing here as it appears that this command only applies to the NVM command set. However, footnote 8 of Figure 273 in the NVMe 2.0 base specifications clearly state that this command applies to NVM command sets that support logical blocks, that is, NVM and ZNS. Both the NVM and ZNS command set specifications also list this identify as mandatory. The command handling should thus not look at the csi field since it is defined as unused for this command. Given that we do not support the KV command set, simply remove the csi switch-case for that command handling and call directly nvmet_execute_identify_ns() in nvmet_execute_identify(). Fixes: ab5d0b38c047 ("nvmet: add Command Set Identifier support") Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00
Damien Le Moal	c7e98afeca	nvmet: fix error handling in nvmet_execute_identify_cns_cs_ns() [ Upstream commit ab76e7206b672b2e8818cb121a04506956d6b223 ] Nvme specifications state that: If the I/O Command Set associated with the namespace identified by the NSID field does not support the Identify Namespace data structure specified by the CSI field, the controller shall abort the command with a status code of Invalid Field in Command. In other words, if nvmet_execute_identify_cns_cs_ns() is called for a target with a block device that is not zoned, we should not return any data and set the status to NVME_SC_INVALID_FIELD. While at it, it is also better to revalidate the ns block devie before checking if the block device is zoned, to ensure that nvmet_execute_identify_cns_cs_ns() operates against updated device characteristics. Fixes: aaf2e048af27 ("nvmet: add ZBD over ZNS backend support") Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00
Christoph Hellwig	537083b127	nvmet: move the call to nvmet_ns_changed out of nvmet_ns_revalidate [ Upstream commit da7837339641601f202f27515771dc0646083938 ] nvmet_ns_changed states via lockdep that the ns->subsys->lock must be held. The only caller of nvmet_ns_changed which does not acquire that lock is nvmet_ns_revalidate. nvmet_ns_revalidate has 3 callers, of which 2 do not acquire that lock: nvmet_execute_identify_cns_cs_ns and nvmet_execute_identify_ns. The other caller nvmet_ns_revalidate_size_store does acquire the lock. Move the call to nvmet_ns_changed from nvmet_ns_revalidate to the callers so that they can perform the correct locking as needed. This issue was found using a static type-based analyser and manually verified. Reported-by: Niels Dossche <dossche.niels@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Stable-dep-of: ab76e7206b67 ("nvmet: fix error handling in nvmet_execute_identify_cns_cs_ns()") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00
Chaitanya Kulkarni	080826d167	nvmet: use i_size_read() to set size for file-ns [ Upstream commit 2caecd62ea5160803b25d96cb1a14ce755c2c259 ] Instead of calling vfs_getattr() use i_size_read() to read the size of file so we can read the size of not only file type but also block type with one call. This is needed to implement buffered_io support for the NVMeOF block device backend. We also change return type of function nvmet_file_ns_revalidate() from int to void, since this function does not return any meaning value. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Stable-dep-of: ab76e7206b67 ("nvmet: fix error handling in nvmet_execute_identify_cns_cs_ns()") Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00
Xin Liu	f333854dce	bpf, sockmap: fix deadlocks in the sockhash and sockmap [ Upstream commit ed17aa92dc56b6d8883e4b7a8f1c6fbf5ed6cd29 ] When huang uses sched_switch tracepoint, the tracepoint does only one thing in the mounted ebpf program, which deletes the fixed elements in sockhash ([0]) It seems that elements in sockhash are rarely actively deleted by users or ebpf program. Therefore, we do not pay much attention to their deletion. Compared with hash maps, sockhash only provides spin_lock_bh protection. This causes it to appear to have self-locking behavior in the interrupt context. [0]:https://lore.kernel.org/all/CABcoxUayum5oOqFMMqAeWuS8+EzojquSOSyDA3J_2omY=2EeAg@mail.gmail.com/ Reported-by: Hsin-Wei Hung <hsinweih@uci.edu> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Xin Liu <liuxin350@huawei.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230406122622.109978-1-liuxin350@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-11 23:00:28 +09:00

1 2 3 4 5 ...

1060977 Commits