IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
[ Upstream commit 60be976ac45137657b7b505d7e0d44d0e51accb7 ]
dev_hard_header() returns the length of the header, so
we need to test for negative errors rather than non-zero.
Fixes: 889b7da23abf ("mctp: Add initial routing framework")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bd8c624c0cd59de0032752ba3001c107bba97f7b ]
VSI is set as default forwarding one when promisc mode is set for
PF interface, when PF is switched to switchdev mode or when VF
driver asks to enable allmulticast or promisc mode for the VF
interface (when vf-true-promisc-support priv flag is off).
The third case is buggy because in that case VSI associated with
VF remains as default one after VF removal.
Reproducer:
1. Create VF
echo 1 > sys/class/net/ens7f0/device/sriov_numvfs
2. Enable allmulticast or promisc mode on VF
ip link set ens7f0v0 allmulticast on
ip link set ens7f0v0 promisc on
3. Delete VF
echo 0 > sys/class/net/ens7f0/device/sriov_numvfs
4. Try to enable promisc mode on PF
ip link set ens7f0 promisc on
Although it looks that promisc mode on PF is enabled the opposite
is true because ice_vsi_sync_fltr() responsible for IFF_PROMISC
handling first checks if any other VSI is set as default forwarding
one and if so the function does not do anything. At this point
it is not possible to enable promisc mode on PF without re-probe
device.
To resolve the issue this patch clear default forwarding VSI
during ice_vsi_release() when the VSI to be released is the default
one.
Fixes: 01b5e89aab49 ("ice: Add VF promiscuous support")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alice Michael <alice.michael@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1effe8ca4e34c34cdd9318436a4232dcb582ebf4 ]
Fix a use-after-free when using page_pool with page fragments. We
encountered this problem during normal RX in the hns3 driver:
(1) Initially we have three descriptors in the RX queue. The first one
allocates PAGE1 through page_pool, and the other two allocate one
half of PAGE2 each. Page references look like this:
RX_BD1 _______ PAGE1
RX_BD2 _______ PAGE2
RX_BD3 _________/
(2) Handle RX on the first descriptor. Allocate SKB1, eventually added
to the receive queue by tcp_queue_rcv().
(3) Handle RX on the second descriptor. Allocate SKB2 and pass it to
netif_receive_skb():
netif_receive_skb(SKB2)
ip_rcv(SKB2)
SKB3 = skb_clone(SKB2)
SKB2 and SKB3 share a reference to PAGE2 through
skb_shinfo()->dataref. The other ref to PAGE2 is still held by
RX_BD3:
SKB2 ---+- PAGE2
SKB3 __/ /
RX_BD3 _________/
(3b) Now while handling TCP, coalesce SKB3 with SKB1:
tcp_v4_rcv(SKB3)
tcp_try_coalesce(to=SKB1, from=SKB3) // succeeds
kfree_skb_partial(SKB3)
skb_release_data(SKB3) // drops one dataref
SKB1 _____ PAGE1
\____
SKB2 _____ PAGE2
/
RX_BD3 _________/
In skb_try_coalesce(), __skb_frag_ref() takes a page reference to
PAGE2, where it should instead have increased the page_pool frag
reference, pp_frag_count. Without coalescing, when releasing both
SKB2 and SKB3, a single reference to PAGE2 would be dropped. Now
when releasing SKB1 and SKB2, two references to PAGE2 will be
dropped, resulting in underflow.
(3c) Drop SKB2:
af_packet_rcv(SKB2)
consume_skb(SKB2)
skb_release_data(SKB2) // drops second dataref
page_pool_return_skb_page(PAGE2) // drops one pp_frag_count
SKB1 _____ PAGE1
\____
PAGE2
/
RX_BD3 _________/
(4) Userspace calls recvmsg()
Copies SKB1 and releases it. Since SKB3 was coalesced with SKB1, we
release the SKB3 page as well:
tcp_eat_recv_skb(SKB1)
skb_release_data(SKB1)
page_pool_return_skb_page(PAGE1)
page_pool_return_skb_page(PAGE2) // drops second pp_frag_count
(5) PAGE2 is freed, but the third RX descriptor was still using it!
In our case this causes IOMMU faults, but it would silently corrupt
memory if the IOMMU was disabled.
Change the logic that checks whether pp_recycle SKBs can be coalesced.
We still reject differing pp_recycle between 'from' and 'to' SKBs, but
in order to avoid the situation described above, we also reject
coalescing when both 'from' and 'to' are pp_recycled and 'from' is
cloned.
The new logic allows coalescing a cloned pp_recycle SKB into a page
refcounted one, because in this case the release (4) will drop the right
reference, the one taken by skb_try_coalesce().
Fixes: 53e0961da1c7 ("page_pool: add frag page recycling support in page pool")
Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 012d69fbfcc739f846766c1da56ef8b493b803b5 ]
in commit 048939088220
("vrf: add mac header for tunneled packets when sniffer is attached")
an Ethernet header was cooked for traffic originating from tunnel devices.
However, the header is added based on whether the mac_header is unset
and ignores cases where the device doesn't expose a mac header to upper
layers, such as in ip tunnels like ipip and gre.
Traffic originating from such devices still appears garbled when capturing
on the vrf device.
Fix by observing whether the original device exposes a header to upper
layers, similar to the logic done in af_packet.
In addition, skb->mac_len needs to be adjusted after adding the Ethernet
header for the skb_push/pull() surrounding dev_queue_xmit_nit() to work
on these packets.
Fixes: 048939088220 ("vrf: add mac header for tunneled packets when sniffer is attached")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9381fe8c849cfbe50245ac01fc077554f6eaa0e2 ]
The memory size of tls_ctx->rx.iv for AES128-CCM is 12 setting in
tls_set_sw_offload(). The return value of crypto_aead_ivsize()
for "ccm(aes)" is 16. So memcpy() require 16 bytes from 12 bytes
memory space will trigger slab-out-of-bounds bug as following:
==================================================================
BUG: KASAN: slab-out-of-bounds in decrypt_internal+0x385/0xc40 [tls]
Read of size 16 at addr ffff888114e84e60 by task tls/10911
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
print_report.cold+0x5e/0x5db
? decrypt_internal+0x385/0xc40 [tls]
kasan_report+0xab/0x120
? decrypt_internal+0x385/0xc40 [tls]
kasan_check_range+0xf9/0x1e0
memcpy+0x20/0x60
decrypt_internal+0x385/0xc40 [tls]
? tls_get_rec+0x2e0/0x2e0 [tls]
? process_rx_list+0x1a5/0x420 [tls]
? tls_setup_from_iter.constprop.0+0x2e0/0x2e0 [tls]
decrypt_skb_update+0x9d/0x400 [tls]
tls_sw_recvmsg+0x3c8/0xb50 [tls]
Allocated by task 10911:
kasan_save_stack+0x1e/0x40
__kasan_kmalloc+0x81/0xa0
tls_set_sw_offload+0x2eb/0xa20 [tls]
tls_setsockopt+0x68c/0x700 [tls]
__sys_setsockopt+0xfe/0x1b0
Replace the crypto_aead_ivsize() with prot->iv_size + prot->salt_size
when memcpy() iv value in TLS_1_3_VERSION scenario.
Fixes: f295b3ae9f59 ("net/tls: Add support of AES128-CCM based ciphers")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 55ebf0d60e3cc6c9e8593399e185842c00e12f36 ]
A userspace triggerable infinite loop could happen in
mlx5_cvq_kick_handler() if userspace keeps sending a huge amount of
cvq requests.
Fixing this by introducing a quota and re-queue the work if we're out
of the budget (currently the implicit budget is one) . While at it,
using a per device work struct to avoid on demand memory allocation
for cvq.
Fixes: 5262912ef3cfc ("vdpa/mlx5: Add support for control VQ and MAC setting")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20220329042109.4029-1-jasowang@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit edf747affc41a18ccc3a616813d4c2b6d38b46ce ]
Add code to register to hardware asynchronous events. Use this
mechanism to track link status events coming from the device and update
the config struct.
After doing link status change, call the vdpa callback to notify of the
link status change.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Link: https://lore.kernel.org/r/20210909123635.30884-4-elic@nvidia.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 218bdd20e56cab41a68481bc10c551ae3e0a24fb ]
A subesequent patch will use the same workqueue for executing other
work not related to control VQ. Rename the workqueue and the work queue
entry used to convey information to the workqueue.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Link: https://lore.kernel.org/r/20210909123635.30884-3-elic@nvidia.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 16ed828b872d12ccba8f07bcc446ae89ba662f9c ]
The error handling path of the probe releases a resource that is not freed
in the remove function. In some cases, a ioremap() must be undone.
Add the missing iounmap() call in the remove function.
Link: https://lore.kernel.org/r/247066a3104d25f9a05de8b3270fc3c848763bcc.1647673264.git.christophe.jaillet@wanadoo.fr
Fixes: 45804fbb00ee ("[SCSI] 53c700: Amiga Zorro NCR53c710 SCSI")
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit eaba83b5b8506bbc9ee7ca2f10aeab3fff3719e7 ]
In commit edb854a3680b ("scsi: core: Reallocate device's budget map on
queue depth change"), the sbitmap for the device budget map may be
reallocated after the slave device depth is configured.
When the sbitmap is reallocated we use the result from
scsi_device_max_queue_depth() for the sbitmap size, but don't resize to
match the actual device queue depth.
Fix by resizing the sbitmap after reallocating the budget sbitmap. We do
this instead of init'ing the sbitmap to the device queue depth as the user
may want to change the queue depth later via sysfs or other.
Link: https://lore.kernel.org/r/1647423870-143867-1-git-send-email-john.garry@huawei.com
Fixes: edb854a3680b ("scsi: core: Reallocate device's budget map on queue depth change")
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bc5519c18a32ce855bb51b9f5eceb77a9489d080 ]
Commit 2e27f576abc6 ("scsi: scsi_ioctl: Call scsi_cmd_ioctl() from
scsi_ioctl()") seems to have a typo as it is checking ret instead of cmd in
the if statement checking for CDROMCLOSETRAY and CDROMEJECT. This changes
the behaviour of these ioctls as the cdrom_ioctl handling of these is more
restrictive than the scsi_ioctl version.
Link: https://lore.kernel.org/r/20220323002242.21157-1-kgroeneveld@lenbrook.com
Fixes: 2e27f576abc6 ("scsi: scsi_ioctl: Call scsi_cmd_ioctl() from scsi_ioctl()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Groeneveld <kgroeneveld@lenbrook.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 792f232d57ff28bbd5f9c4abe0466b23d5879dc8 ]
The vmbus driver relies on the panic notifier infrastructure to perform
some operations when a panic event is detected. Since vmbus can be built
as module, it is required that the driver handles both registering and
unregistering such panic notifier callback.
After commit 74347a99e73a ("x86/Hyper-V: Unload vmbus channel in hv panic callback")
though, the panic notifier registration is done unconditionally in the module
initialization routine whereas the unregistering procedure is conditionally
guarded and executes only if HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE capability
is set.
This patch fixes that by unconditionally unregistering the panic notifier
in the module's exit routine as well.
Fixes: 74347a99e73a ("x86/Hyper-V: Unload vmbus channel in hv panic callback")
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20220315203535.682306-1-gpiccoli@igalia.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1647b54ed55d4d48c7199d439f8834626576cbe9 ]
This post-op should be a pre-op so that we do not pass -1 as the bit
number to test_bit(). The current code will loop downwards from 63 to
-1. After changing to a pre-op, it loops from 63 to 0.
Fixes: 71c37505e7ea ("drm/amdgpu/gfx: move more common KIQ code to amdgpu_gfx.c")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ea6fa4961aab8f90a8aa03575a98b4bda368d4b6 ]
To prevent an infinite loop in mc146818_get_time(),
commit 211e5db19d15 ("rtc: mc146818: Detect and handle broken RTCs")
added a check for RTC availability. Together with a later fix, it
checked if bit 6 in register 0x0d is cleared.
This, however, caused a false negative on a motherboard with an AMD
SB710 southbridge; according to the specification [1], bit 6 of register
0x0d of this chipset is a scratchbit. This caused a regression in Linux
5.11 - the RTC was determined broken by the kernel and not used by
rtc-cmos.c [3]. This problem was also reported in Fedora [4].
As a better alternative, check whether the UIP ("Update-in-progress")
bit is set for longer then 10ms. If that is the case, then apparently
the RTC is either absent (and all register reads return 0xff) or broken.
Also limit the number of loop iterations in mc146818_get_time() to 10 to
prevent an infinite loop there.
The functions mc146818_get_time() and mc146818_does_rtc_work() will be
refactored later in this patch series, in order to fix a separate
problem with reading / setting the RTC alarm time. This is done so to
avoid a confusion about what is being fixed when.
In a previous approach to this problem, I implemented a check whether
the RTC_HOURS register contains a value <= 24. This, however, sometimes
did not work correctly on my Intel Kaby Lake laptop. According to
Intel's documentation [2], "the time and date RAM locations (0-9) are
disconnected from the external bus" during the update cycle so reading
this register without checking the UIP bit is incorrect.
[1] AMD SB700/710/750 Register Reference Guide, page 308,
https://developer.amd.com/wordpress/media/2012/10/43009_sb7xx_rrg_pub_1.00.pdf
[2] 7th Generation Intel ® Processor Family I/O for U/Y Platforms [...] Datasheet
Volume 1 of 2, page 209
Intel's Document Number: 334658-006,
https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/7th-and-8th-gen-core-family-mobile-u-y-processor-lines-i-o-datasheet-vol-1.pdf
[3] Functions in arch/x86/kernel/rtc.c apparently were using it.
[4] https://bugzilla.redhat.com/show_bug.cgi?id=1936688
Fixes: 211e5db19d15 ("rtc: mc146818: Detect and handle broken RTCs")
Fixes: ebb22a059436 ("rtc: mc146818: Dont test for bit 0-5 in Register D")
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20211210200131.153887-5-mat.jonczyk@o2.pl
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0dd8d6cb9eddfe637bcd821bbfd40ebd5a0737b9 ]
There are 4 users of mc146818_get_time() and none of them was checking
the return value from this function. Change this.
Print the appropriate warnings in callers of mc146818_get_time() instead
of in the function mc146818_get_time() itself, in order not to add
strings to rtc-mc146818-lib.c, which is kind of a library.
The callers of alpha_rtc_read_time() and cmos_read_time() may use the
contents of (struct rtc_time *) even when the functions return a failure
code. Therefore, set the contents of (struct rtc_time *) to 0x00,
which looks more sensible then 0xff and aligns with the (possibly
stale?) comment in cmos_read_time:
/*
* If pm_trace abused the RTC for storage, set the timespec to 0,
* which tells the caller that this RTC value is unusable.
*/
For consistency, do this in mc146818_get_time().
Note: hpet_rtc_interrupt() may call mc146818_get_time() many times a
second. It is very unlikely, though, that the RTC suddenly stops
working and mc146818_get_time() would consistently fail.
Only compile-tested on alpha.
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: linux-alpha@vger.kernel.org
Cc: x86@kernel.org
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20211210200131.153887-4-mat.jonczyk@o2.pl
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d35786b3a28dee20b12962ae2dd365892a99ed1a ]
No function is checking mc146818_get_time() return values yet, so
correct them to make them more customary.
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20211210200131.153887-3-mat.jonczyk@o2.pl
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 6c8e2a256915a223f6289f651d6b926cd7135c9e upstream.
Problem:
=======
Userspace might read the zero-page instead of actual data from a direct IO
read on a block device if the buffers have been called madvise(MADV_FREE)
on earlier (this is discussed below) due to a race between page reclaim on
MADV_FREE and blkdev direct IO read.
- Race condition:
==============
During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
if the page is not dirty, then discards its rmap PTE(s) (vs. remap back
if the page is dirty).
However, after try_to_unmap_one() returns to shrink_page_list(), it might
keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
_one_ page reference, from the isolation for page reclaim).
Well, blkdev_direct_IO() gets references for all pages, and on READ
operations it only sets them dirty _later_.
So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
IO read from block devices, and page reclaim happens during
__blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
returns, but BEFORE the pages are set dirty, the situation happens.
The direct IO read eventually completes. Now, when userspace reads the
buffers, the PTE is no longer there and the page fault handler
do_anonymous_page() services that with the zero-page, NOT the data!
A synthetic reproducer is provided.
- Page faults:
===========
If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
happen, because that faults-in all pages as writeable, so
do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
direct IO. The userspace reads don't fault as the PTE is there (thus
zero-page is not used/setup).
But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
is no longer there; the subsequent page faults can't help:
The data-read from the block device probably won't generate faults due to
DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
different virtual addresses (not user-mapped addresses) because `struct
bio_vec` stores `struct page` to figure addresses out (which are different
from user-mapped addresses) for the read.
Thus userspace reads (to user-mapped addresses) still fault, then
do_anonymous_page() gets another `struct page` that would address/ map to
other memory than the `struct page` used by `struct bio_vec` for the read.
(The original `struct page` is not available, since it wasn't freed, as
page_ref_freeze() failed due to more page refs. And even if it were
available, its data cannot be trusted anymore.)
Solution:
========
One solution is to check for the expected page reference count in
try_to_unmap_one().
There should be one reference from the isolation (that is also checked in
shrink_page_list() with page_ref_freeze()) plus one or more references
from page mapping(s) (put in discard: label). Further references mean
that rmap/PTE cannot be unmapped/nuked.
(Note: there might be more than one reference from mapping due to
fork()/clone() without CLONE_VM, which use the same `struct page` for
references, until the copy-on-write page gets copied.)
So, additional page references (e.g., from direct IO read) now prevent the
rmap/PTE from being unmapped/dropped; similarly to the page is not freed
per shrink_page_list()/page_ref_freeze()).
- Races and Barriers:
==================
The new check in try_to_unmap_one() should be safe in races with
bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
done under the PTE lock.
The fast path doesn't take the lock, but it checks if the PTE has changed
and if so, it drops the reference and leaves the page for the slow path
(which does take that lock).
The fast path requires synchronization w/ full memory barrier: it writes
the page reference count first then it reads the PTE later, while
try_to_unmap() writes PTE first then it reads page refcount.
And a second barrier is needed, as the page dirty flag should not be read
before the page reference count (as in __remove_mapping()). (This can be
a load memory barrier only; no writes are involved.)
Call stack/comments:
- try_to_unmap_one()
- page_vma_mapped_walk()
- map_pte() # see pte_offset_map_lock():
pte_offset_map()
spin_lock()
- ptep_get_and_clear() # write PTE
- smp_mb() # (new barrier) GUP fast path
- page_ref_count() # (new check) read refcount
- page_vma_mapped_walk_done() # see pte_unmap_unlock():
pte_unmap()
spin_unlock()
- bio_iov_iter_get_pages()
- __bio_iov_iter_get_pages()
- iov_iter_get_pages()
- get_user_pages_fast()
- internal_get_user_pages_fast()
# fast path
- lockless_pages_from_mm()
- gup_{pgd,p4d,pud,pmd,pte}_range()
ptep = pte_offset_map() # not _lock()
pte = ptep_get_lockless(ptep)
page = pte_page(pte)
try_grab_compound_head(page) # inc refcount
# (RMW/barrier
# on success)
if (pte_val(pte) != pte_val(*ptep)) # read PTE
put_compound_head(page) # dec refcount
# go slow path
# slow path
- __gup_longterm_unlocked()
- get_user_pages_unlocked()
- __get_user_pages_locked()
- __get_user_pages()
- follow_{page,p4d,pud,pmd}_mask()
- follow_page_pte()
ptep = pte_offset_map_lock()
pte = *ptep
page = vm_normal_page(pte)
try_grab_page(page) # inc refcount
pte_unmap_unlock()
- Huge Pages:
==========
Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
(aka lazyfree) pages are PageAnon() && !PageSwapBacked()
(madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
thus should reach shrink_page_list() -> split_huge_page_to_list() before
try_to_unmap[_one](), so it deals with normal pages only.
(And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens,
which should not or be rare, the page refcount should be greater than
mapcount: the head page is referenced by tail pages. That also prevents
checking the head `page` then incorrectly call page_remove_rmap(subpage)
for a tail page, that isn't even in the shrink_page_list()'s page_list (an
effect of split huge pmd/pmvw), as it might happen today in this unlikely
scenario.)
MADV_FREE'd buffers:
===================
So, back to the "if MADV_FREE pages are used as buffers" note. The case
is arguable, and subject to multiple interpretations.
The madvise(2) manual page on the MADV_FREE advice value says:
1) 'After a successful MADV_FREE ... data will be lost when
the kernel frees the pages.'
2) 'the free operation will be canceled if the caller writes
into the page' / 'subsequent writes ... will succeed and
then [the] kernel cannot free those dirtied pages'
3) 'If there is no subsequent write, the kernel can free the
pages at any time.'
Thoughts, questions, considerations... respectively:
1) Since the kernel didn't actually free the page (page_ref_freeze()
failed), should the data not have been lost? (on userspace read.)
2) Should writes performed by the direct IO read be able to cancel
the free operation?
- Should the direct IO read be considered as 'the caller' too,
as it's been requested by 'the caller'?
- Should the bio technique to dirty pages on return to userspace
(bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
be considered in another/special way here?
3) Should an upcoming write from a previously requested direct IO
read be considered as a subsequent write, so the kernel should
not free the pages? (as it's known at the time of page reclaim.)
And lastly:
Technically, the last point would seem a reasonable consideration and
balance, as the madvise(2) manual page apparently (and fairly) seem to
assume that 'writes' are memory access from the userspace process (not
explicitly considering writes from the kernel or its corner cases; again,
fairly).. plus the kernel fix implementation for the corner case of the
largely 'non-atomic write' encompassed by a direct IO read operation, is
relatively simple; and it helps.
Reproducer:
==========
@ test.c (simplified, but works)
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
int main() {
int fd, i;
char *buf;
fd = open(DEV, O_RDONLY | O_DIRECT);
buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
buf[i] = 1; // init to non-zero
madvise(buf, BUF_SIZE, MADV_FREE);
read(fd, buf, BUF_SIZE);
for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
printf("%p: 0x%x\n", &buf[i], buf[i]);
return 0;
}
@ block/fops.c (formerly fs/block_dev.c)
+#include <linux/swap.h>
...
... __blkdev_direct_IO[_simple](...)
{
...
+ if (!strcmp(current->comm, "good"))
+ shrink_all_memory(ULONG_MAX);
+
ret = bio_iov_iter_get_pages(...);
+
+ if (!strcmp(current->comm, "bad"))
+ shrink_all_memory(ULONG_MAX);
...
}
@ shell
# NUM_PAGES=4
# PAGE_SIZE=$(getconf PAGE_SIZE)
# yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
# DEV=$(losetup -f --show test.img)
# gcc -DDEV=\"$DEV\" \
-DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
-DPAGE_SIZE=${PAGE_SIZE} \
test.c -o test
# od -tx1 $DEV
0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
*
0040000
# mv test good
# ./good
0x7f7c10418000: 0x79
0x7f7c10419000: 0x79
0x7f7c1041a000: 0x79
0x7f7c1041b000: 0x79
# mv good bad
# ./bad
0x7fa1b8050000: 0x0
0x7fa1b8051000: 0x0
0x7fa1b8052000: 0x0
0x7fa1b8053000: 0x0
Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap
do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].
- v5.17-rc3:
# for i in {1..1000}; do ./good; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
# mv good bad
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
4000 0x0
# free | grep Swap
Swap: 0 0 0
- v4.5:
# for i in {1..1000}; do ./good; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
# mv good bad
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
2702 0x0
1298 0x79
# swapoff -av
swapoff /swap
# for i in {1..1000}; do ./bad; done \
| cut -d: -f2 | sort | uniq -c
4000 0x79
Ceph/TCMalloc:
=============
For documentation purposes, the use case driving the analysis/fix is Ceph
on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
release unused memory to the system from the mmap'ed page heap (might be
committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
-> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
TCMalloc_SystemCommit() -> do nothing.
Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
just 'disappeared' on Ceph on later Ubuntu releases but is still present
in the kernel, and can be hit by other use cases.
The observed issue seems to be the old Ceph bug #22464 [1], where checksum
mismatches are observed (and instrumentation with buffer dumps shows
zero-pages read from mmap'ed/MADV_FREE'd page ranges).
The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
mostly worked around with a retry mechanism, but other parts of Ceph could
still hit that (rocksdb). Anyway, it's less likely to be hit again as
TCMalloc switched out of MADV_FREE by default.
(Some kernel versions/reports from the Ceph bug, and relation with
the MADV_FREE introduction/changes; TCMalloc versions not checked.)
- 4.4 good
- 4.5 (madv_free: introduction)
- 4.9 bad
- 4.10 good? maybe a swapless system
- 4.12 (madv_free: no longer free instantly on swapless systems)
- 4.13 bad
[1] https://tracker.ceph.com/issues/22464
Thanks:
======
Several people contributed to analysis/discussions/tests/reproducers in
the first stages when drilling down on ceph/tcmalloc/linux kernel:
- Dan Hill
- Dan Streetman
- Dongdong Tao
- Gavin Guo
- Gerald Yang
- Heitor Alves de Siqueira
- Ioanna Alifieraki
- Jay Vosburgh
- Matthew Ruffell
- Ponnuvel Palaniyappan
Reviews, suggestions, corrections, comments:
- Minchan Kim
- Yu Zhao
- Huang, Ying
- John Hubbard
- Christoph Hellwig
[mfo@canonical.com: v4]
Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Hill <daniel.hill@canonical.com>
Cc: Dan Streetman <dan.streetman@canonical.com>
Cc: Dongdong Tao <dongdong.tao@canonical.com>
Cc: Gavin Guo <gavin.guo@canonical.com>
Cc: Gerald Yang <gerald.yang@canonical.com>
Cc: Heitor Alves de Siqueira <halves@canonical.com>
Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
Cc: <stable@vger.kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[mfo: backport: replace folio/test_flag with page/flag equivalents;
real Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)") in v4.]
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a9fe7fa7d874a536e0540469f314772c054a0323 ]
This change fixes the following:
1) The flags variable is not initialized. Always use raw_spin_lock_irqsave
and raw_spin_unlock_irqrestore to serialize patching.
2) flush_kernel_vmap_range is primarily intended for DMA flushes. Since
__patch_text_multiple is often called with interrupts disabled, it is
better to directly call flush_kernel_dcache_range_asm and
flush_kernel_icache_range_asm. This avoids an extra call.
3) The final call to flush_icache_range is unnecessary.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 939fc856676c266c3bc347c1c1661872a3725c0f ]
Add the missing logic to allow Lasi, WAX and Dino to set the
CPU affinity. This fixes IRQ migration to other CPUs when a
CPU is shutdown which currently holds the IRQs for one of those
chips.
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d9142e1cf3bbdaf21337767114ecab26fe702d47 ]
selftest net tls test cases need TLS=m without this the test hangs.
Enabling config TLS solves this problem and runs to complete.
- CONFIG_TLS=m
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0bae835b63c53f86cdc524f5962e39409585b22c ]
In a low memory situation, allow the NFS writeback code to fail without
getting stuck in infinite loops in mempool_alloc().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 515dcdcd48736576c6f5c197814da6f81c60a21e ]
The concern is that since nfsiod is sometimes required to kick off a
commit, it can get locked up waiting forever in mempool_alloc() instead
of failing gracefully and leaving the commit until later.
Try to allocate from the slab first, with GFP_KERNEL | __GFP_NORETRY,
then fall back to a non-blocking attempt to allocate from the memory
pool.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7496b59f588dd52886fdbac7633608097543a0a5 ]
The socket layer requires that we use the socket lock to protect changes
to the sock->sk_write_pending field and others.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a53046291020ec41e09181396c1e829287b48d47 ]
Add validation check for JFS_IP(ipimap)->i_imap to prevent a NULL deref
in diFree since diFree uses it without do any validations.
When function jfs_mount calls diMount to initialize fileset inode
allocation map, it can fail and JFS_IP(ipimap)->i_imap won't be
initialized. Then it calls diFreeSpecial to close fileset inode allocation
map inode and it will flow into jfs_evict_inode. Function jfs_evict_inode
just validates JFS_SBI(inode->i_sb)->ipimap, then calls diFree. diFree use
JFS_IP(ipimap)->i_imap directly, then it will cause a NULL deref.
Reported-by: TCS Robot <tcs_robot@tencent.com>
Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fefb8a2a941338d871e2d83fbd65fbfa068857bd ]
Eliminate anonymous module_init() and module_exit(), which can lead to
confusion or ambiguity when reading System.map, crashes/oops/bugs,
or an initcall_debug log.
Give each of these init and exit functions unique driver-specific
names to eliminate the anonymous names.
Example 1: (System.map)
ffffffff832fc78c t init
ffffffff832fc79e t init
ffffffff832fc8f8 t init
Example 2: (initcall_debug log)
calling init+0x0/0x12 @ 1
initcall init+0x0/0x12 returned 0 after 15 usecs
calling init+0x0/0x60 @ 1
initcall init+0x0/0x60 returned 0 after 2 usecs
calling init+0x0/0x9a @ 1
initcall init+0x0/0x9a returned 0 after 74 usecs
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Amit Shah <amit@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20220316192010.19001-3-rdunlap@infradead.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 988c7c00691008ea1daaa1235680a0da49dab4e8 ]
The commit c15c3747ee32 (serial: samsung: fix potential soft lockup
during uart write) added an unlock of port->lock before
uart_write_wakeup() and a lock after it. It was always problematic to
write data from tty_ldisc_ops::write_wakeup and it was even documented
that way. We fixed the line disciplines to conform to this recently.
So if there is still a missed one, we should fix them instead of this
workaround.
On the top of that, s3c24xx_serial_tx_dma_complete() in this driver
still holds the port->lock while calling uart_write_wakeup().
So revert the wrap added by the commit above.
Cc: Thomas Abraham <thomas.abraham@linaro.org>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Hyeonkook Kim <hk619.kim@samsung.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Link: https://lore.kernel.org/r/20220308115153.4225-1-jslaby@suse.cz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit aaeed6ecc1253ce1463fa1aca0b70a4ccbc9fa75 ]
There are two outstanding issues with CONFIG_X86_X32_ABI and
llvm-objcopy, with similar root causes:
1. llvm-objcopy does not properly convert .note.gnu.property when going
from x86_64 to x86_x32, resulting in a corrupted section when
linking:
https://github.com/ClangBuiltLinux/linux/issues/1141
2. llvm-objcopy produces corrupted compressed debug sections when going
from x86_64 to x86_x32, also resulting in an error when linking:
https://github.com/ClangBuiltLinux/linux/issues/514
After commit 41c5ef31ad71 ("x86/ibt: Base IBT bits"), the
.note.gnu.property section is always generated when
CONFIG_X86_KERNEL_IBT is enabled, which causes the first issue to become
visible with an allmodconfig build:
ld.lld: error: arch/x86/entry/vdso/vclock_gettime-x32.o:(.note.gnu.property+0x1c): program property is too short
To avoid this error, do not allow CONFIG_X86_X32_ABI to be selected when
using llvm-objcopy. If the two issues ever get fixed in llvm-objcopy,
this can be turned into a feature check.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220314194842.3452-3-nathan@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c265de257f558a05c1859ee9e3fed04883b9ec0e ]
The commit handling code is not safe against memory-pressure deadlocks
when writing to swap. In particular, nfs_commitdata_alloc() blocks
indefinitely waiting for memory, and this can consume all available
workqueue threads.
swap-out most likely uses STABLE writes anyway as COND_STABLE indicates
that a stable write should be used if the write fits in a single
request, and it normally does. However if we ever swap with a small
wsize, or gather unusually large numbers of pages for a single write,
this might change.
For safety, make it explicit in the code that direct writes used for swap
must always use FLUSH_STABLE.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 64158668ac8b31626a8ce48db4cad08496eb8340 ]
1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
possible deadlocks with "fs_reclaim". These deadlocks could, I believe,
eventuate if a buffered read on the swapfile was attempted.
We don't need coherence with the page cache for a swap file, and
buffered writes are forbidden anyway. There is no other need for
i_rwsem during direct IO. So never take it for swap_rw()
2/ generic_write_checks() explicitly forbids writes to swap, and
performs checks that are not needed for swap. So bypass it
for swap_rw().
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a80a8461868905823609be97f91776a26befe839 ]
Currently, tasks marked as "swapper" tasks get put to the front of
non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
the transport's ->xmit_queue.
This is pointless as currently *all* tasks for a mount that has swap
enabled on *any* file are marked as "swapper" tasks. So the net result
is that the non-priority rpc_queues are reverse-ordered (LIFO).
This scheduling boost is not necessary to avoid deadlocks, and hurts
fairness, so remove it. If there were a need to expedite some requests,
the tk_priority mechanism is a more appropriate tool.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a721035477fb5fb8abc738fbe410b07c12af3dc5 ]
When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything. So it
must not block waiting for memory.
xprt_dynamic_alloc_slot can block indefinitely. This can tie up all
workqueue threads and NFS can deadlock. So when called from a
workqueue, set __GFP_NORETRY.
The rdma alloc_slot already does not block. However it sets the error
to -EAGAIN suggesting this will trigger a sleep. It does not. As we
can see in call_reserveresult(), only -ENOMEM causes a sleep. -EAGAIN
causes immediate retry.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c487216bec83b0c5a8803e5c61433d33ad7b104d ]
When memory is short, new worker threads cannot be created and we depend
on the minimum one rpciod thread to be able to handle everything.
So it must not block waiting for memory.
mempools are particularly a problem as memory can only be released back
to the mempool by an async rpc task running. If all available
workqueue threads are waiting on the mempool, no thread is available to
return anything.
rpc_malloc() can block, and this might cause deadlocks.
So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
blocking is acceptable.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 10c46f2ea914202482d19cf80dcc9c321c9ff59b ]
If we were to have two users of the same clock, doing something like:
clk_set_rate_range(user1, 1000, 2000);
clk_set_rate_range(user2, 3000, 4000);
The second call would fail with -EINVAL, preventing from getting in a
situation where we end up with impossible limits.
However, this is never explicitly checked against and enforced, and
works by relying on an undocumented behaviour of clk_set_rate().
Indeed, on the first clk_set_rate_range will make sure the current clock
rate is within the new range, so it will be between 1000 and 2000Hz. On
the second clk_set_rate_range(), it will consider (rightfully), that our
current clock is outside of the 3000-4000Hz range, and will call
clk_core_set_rate_nolock() to set it to 3000Hz.
clk_core_set_rate_nolock() will then call clk_calc_new_rates() that will
eventually check that our rate 3000Hz rate is outside the min 3000Hz max
2000Hz range, will bail out, the error will propagate and we'll
eventually return -EINVAL.
This solely relies on the fact that clk_calc_new_rates(), and in
particular clk_core_determine_round_nolock(), won't modify the new rate
allowing the error to be reported. That assumption won't be true for all
drivers, and most importantly we'll break that assumption in a later
patch.
It can also be argued that we shouldn't even reach the point where we're
calling clk_core_set_rate_nolock().
Let's make an explicit check for disjoints range before we're doing
anything.
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Link: https://lore.kernel.org/r/20220225143534.405820-4-maxime@cerno.tech
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 80864594ff2ad002e2755daf97d46ff0c86faf1f ]
In preparation for making use of the clock-output-names, we want to
keep node around in ti_dt_clocks_register().
This change should not needed as a fix currently.
Signed-off-by: Tony Lindgren <tony@atomide.com>
Link: https://lore.kernel.org/r/20220204071449.16762-3-tony@atomide.com
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit eed05744322da07dd7e419432dcedf3c2e017179 ]
The sched_clock() can be used very early since commit 857baa87b642
("sched/clock: Enable sched clock early"). In addition, with commit
38669ba205d1 ("x86/xen/time: Output xen sched_clock time from 0"), kdump
kernel in Xen HVM guest may panic at very early stage when accessing
&__this_cpu_read(xen_vcpu)->time as in below:
setup_arch()
-> init_hypervisor_platform()
-> x86_init.hyper.init_platform = xen_hvm_guest_init()
-> xen_hvm_init_time_ops()
-> xen_clocksource_read()
-> src = &__this_cpu_read(xen_vcpu)->time;
This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.
However, when Xen HVM guest panic on vcpu >= 32, since
xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.
This patch calls xen_hvm_init_time_ops() again later in
xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
registered when the boot vcpu is >= 32.
This issue can be reproduced on purpose via below command at the guest
side when kdump/kexec is enabled:
"taskset -c 33 echo c > /proc/sysrq-trigger"
The bugfix for PVM is not implemented due to the lack of testing
environment.
[boris: xen_hvm_init_time_ops() returns on errors instead of jumping to end]
Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20220302164032.14569-3-dongli.zhang@oracle.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit eb85eec858c1a5c11d3a0bff403f6440b05b40dc ]
This patch fixes what seems to be copy paste error.
We will have a memory leak if the host-resident shadow is NULL (which
will likely happen as the DR and HR are not dependent).
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3e17898aca293a24dae757a440a50aa63ca29671 ]
If memory allocation triggers a direct reclaim from the state recovery
thread, then we can deadlock. Use memalloc_nofs_save/restore to ensure
that doesn't happen.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b7f114edd54326f730a754547e7cfb197b5bc132 ]
[You don't often get email from xiongx18@fudan.edu.cn. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.]
The reference counting issue happens in two error paths in the
function _nfs42_proc_copy_notify(). In both error paths, the function
simply returns the error code and forgets to balance the refcount of
object `ctx`, bumped by get_nfs_open_context() earlier, which may
cause refcount leaks.
Fix it by balancing refcount of the `ctx` object before the function
returns in both error paths.
Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 41a92a89eee819298f805c40187ad8b02bb53426 ]
w1_seq was failing due to several devices responding to the
CHAIN_DONE at the same time. Now properly selects the current
device in the chain with MATCH_ROM. Also acknowledgment was
read twice.
Signed-off-by: Lucas Denefle <lucas.denefle@converge.io>
Link: https://lore.kernel.org/r/20220223113558.232750-1-lucas.denefle@converge.io
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 60f1d3c92dc1ef1026e5b917a329a7fa947da036 ]
One error handler of wfx_init_common() return without calling
ieee80211_free_hw(hw), which may result in memory leak. And I add
one err label to unify the error handler, which is useful for the
subsequent changes.
Suggested-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com>
Link: https://lore.kernel.org/r/tencent_24A24A3EFF61206ECCC4B94B1C5C1454E108@qq.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 021dbecabc93b1610b5db989d52a94e0c6671136 ]
It is difficult to find which OPPs are active at the moment, specially
if there are multiple OPPs with same frequency available in the device
tree (controlled by supported hardware feature).
Expose name of the DT node to find out the exact OPP.
While at it, also expose level field.
Reported-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ec1c7ad47664f964c1101fe555b6fde0cb124b38 ]
CPUfreq governors request CPU frequencies using information
on current CPU usage. The CPPC driver converts them to
performance requests. Frequency targets are computed as:
target_freq = (util / cpu_capacity) * max_freq
target_freq is then clamped between [policy->min, policy->max].
The CPPC driver converts performance values to frequencies
(and vice-versa) using cppc_cpufreq_perf_to_khz() and
cppc_cpufreq_khz_to_perf(). These functions both use two different
factors depending on the range of the input value. For
cppc_cpufreq_khz_to_perf():
- (NOMINAL_PERF / NOMINAL_FREQ) or
- (LOWEST_PERF / LOWEST_FREQ)
and for cppc_cpufreq_perf_to_khz():
- (NOMINAL_FREQ / NOMINAL_PERF) or
- ((NOMINAL_PERF - LOWEST_FREQ) / (NOMINAL_PERF - LOWEST_PERF))
This means:
1- the functions are not inverse for some values:
(perf_to_khz(khz_to_perf(x)) != x)
2- cppc_cpufreq_perf_to_khz(LOWEST_PERF) can sometimes give
a different value from LOWEST_FREQ due to integer approximation
3- it is implied that performance and frequency are proportional
(NOMINAL_FREQ / NOMINAL_PERF) == (LOWEST_PERF / LOWEST_FREQ)
This patch changes the conversion functions to an affine function.
This fixes the 3 points above.
Suggested-by: Lukasz Luba <lukasz.luba@arm.com>
Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ff3187eabb5ce478d15b6ed62eb286756adefac3 ]
The pixel clocks dclk_vop[012] can be clocked from hpll, vpll, gpll or
cpll. gpll and cpll also drive many other clocks, so changing the
dclk_vop[012] clocks could change these other clocks as well. Drop
CLK_SET_RATE_PARENT to fix that. With this change the VOP2 driver can
only adjust the pixel clocks with the divider between the PLL and the
dclk_vop[012] which means the user may have to adjust the PLL clock to a
suitable rate using the assigned-clock-rate device tree property.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Link: https://lore.kernel.org/r/20220126145549.617165-25-s.hauer@pengutronix.de
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6f1dedf089ab1a4f03ea7aadc3c4a99885b4b4a0 ]
Use reset_control_rearm() call if an error occurs in case
phy_meson8b_usb2_power_on() fails after reset() has been called, or in
case phy_meson8b_usb2_power_off() is called i.e the resource is no longer
used and the reset line may be triggered again by other devices.
reset_control_rearm() keeps use of triggered_count sane in the reset
framework, use of reset_control_reset() on shared reset line should
be balanced with reset_control_rearm().
Signed-off-by: Amjad Ouled-Ameur <aouledameur@baylibre.com>
Reported-by: Jerome Brunet <jbrunet@baylibre.com>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Acked-by: Neil Armstrong <narmstrong@baylibre.com>
Link: https://lore.kernel.org/r/20220111095255.176141-4-aouledameur@baylibre.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2f87727130ce17ffefecd0895eeebf22d5a36f6f ]
Use reset_control_rearm() call if an error occurs in case
phy_meson_gxl_usb2_init() fails after reset() has been called ; or in case
phy_meson_gxl_usb2_exit() is called i.e the resource is no longer used
and the reset line may be triggered again by other devices.
reset_control_rearm() keeps use of triggered_count sane in the reset
framework. Therefore, use of reset_control_reset() on shared reset line
should be balanced with reset_control_rearm().
Signed-off-by: Amjad Ouled-Ameur <aouledameur@baylibre.com>
Reported-by: Jerome Brunet <jbrunet@baylibre.com>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
Acked-by: Neil Armstrong <narmstrong@baylibre.com>
Link: https://lore.kernel.org/r/20220111095255.176141-2-aouledameur@baylibre.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>