linux

iv/linux

Author	SHA1	Message	Date
Mauricio Faria de Oliveira	c9f50e06ca	mm: fix race between MADV_FREE reclaim and blkdev direct IO read commit 6c8e2a256915a223f6289f651d6b926cd7135c9e upstream. Problem: ======= Userspace might read the zero-page instead of actual data from a direct IO read on a block device if the buffers have been called madvise(MADV_FREE) on earlier (this is discussed below) due to a race between page reclaim on MADV_FREE and blkdev direct IO read. - Race condition: ============== During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks if the page is not dirty, then discards its rmap PTE(s) (vs. remap back if the page is dirty). However, after try_to_unmap_one() returns to shrink_page_list(), it might keep the page _anyway_ if page_ref_freeze() fails (it expects exactly _one_ page reference, from the isolation for page reclaim). Well, blkdev_direct_IO() gets references for all pages, and on READ operations it only sets them dirty _later_. So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct IO read from block devices, and page reclaim happens during __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() returns, but BEFORE the pages are set dirty, the situation happens. The direct IO read eventually completes. Now, when userspace reads the buffers, the PTE is no longer there and the page fault handler do_anonymous_page() services that with the zero-page, NOT the data! A synthetic reproducer is provided. - Page faults: =========== If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't happen, because that faults-in all pages as writeable, so do_anonymous_page() sets up a new page/rmap/PTE, and that is used by direct IO. The userspace reads don't fault as the PTE is there (thus zero-page is not used/setup). But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE is no longer there; the subsequent page faults can't help: The data-read from the block device probably won't generate faults due to DMA (no MMU) but even in the case it wouldn't use DMA, that happens on different virtual addresses (not user-mapped addresses) because `struct bio_vec` stores `struct page` to figure addresses out (which are different from user-mapped addresses) for the read. Thus userspace reads (to user-mapped addresses) still fault, then do_anonymous_page() gets another `struct page` that would address/ map to other memory than the `struct page` used by `struct bio_vec` for the read. (The original `struct page` is not available, since it wasn't freed, as page_ref_freeze() failed due to more page refs. And even if it were available, its data cannot be trusted anymore.) Solution: ======== One solution is to check for the expected page reference count in try_to_unmap_one(). There should be one reference from the isolation (that is also checked in shrink_page_list() with page_ref_freeze()) plus one or more references from page mapping(s) (put in discard: label). Further references mean that rmap/PTE cannot be unmapped/nuked. (Note: there might be more than one reference from mapping due to fork()/clone() without CLONE_VM, which use the same `struct page` for references, until the copy-on-write page gets copied.) So, additional page references (e.g., from direct IO read) now prevent the rmap/PTE from being unmapped/dropped; similarly to the page is not freed per shrink_page_list()/page_ref_freeze()). - Races and Barriers: ================== The new check in try_to_unmap_one() should be safe in races with bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's done under the PTE lock. The fast path doesn't take the lock, but it checks if the PTE has changed and if so, it drops the reference and leaves the page for the slow path (which does take that lock). The fast path requires synchronization w/ full memory barrier: it writes the page reference count first then it reads the PTE later, while try_to_unmap() writes PTE first then it reads page refcount. And a second barrier is needed, as the page dirty flag should not be read before the page reference count (as in __remove_mapping()). (This can be a load memory barrier only; no writes are involved.) Call stack/comments: - try_to_unmap_one() - page_vma_mapped_walk() - map_pte() # see pte_offset_map_lock(): pte_offset_map() spin_lock() - ptep_get_and_clear() # write PTE - smp_mb() # (new barrier) GUP fast path - page_ref_count() # (new check) read refcount - page_vma_mapped_walk_done() # see pte_unmap_unlock(): pte_unmap() spin_unlock() - bio_iov_iter_get_pages() - __bio_iov_iter_get_pages() - iov_iter_get_pages() - get_user_pages_fast() - internal_get_user_pages_fast() # fast path - lockless_pages_from_mm() - gup_{pgd,p4d,pud,pmd,pte}_range() ptep = pte_offset_map() # not _lock() pte = ptep_get_lockless(ptep) page = pte_page(pte) try_grab_compound_head(page) # inc refcount # (RMW/barrier # on success) if (pte_val(pte) != pte_val(ptep)) # read PTE put_compound_head(page) # dec refcount # go slow path # slow path - __gup_longterm_unlocked() - get_user_pages_unlocked() - __get_user_pages_locked() - __get_user_pages() - follow_{page,p4d,pud,pmd}_mask() - follow_page_pte() ptep = pte_offset_map_lock() pte = ptep page = vm_normal_page(pte) try_grab_page(page) # inc refcount pte_unmap_unlock() - Huge Pages: ========== Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) thus should reach shrink_page_list() -> split_huge_page_to_list() before try_to_unmap[_one](), so it deals with normal pages only. (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens, which should not or be rare, the page refcount should be greater than mapcount: the head page is referenced by tail pages. That also prevents checking the head `page` then incorrectly call page_remove_rmap(subpage) for a tail page, that isn't even in the shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), as it might happen today in this unlikely scenario.) MADV_FREE'd buffers: =================== So, back to the "if MADV_FREE pages are used as buffers" note. The case is arguable, and subject to multiple interpretations. The madvise(2) manual page on the MADV_FREE advice value says: 1) 'After a successful MADV_FREE ... data will be lost when the kernel frees the pages.' 2) 'the free operation will be canceled if the caller writes into the page' / 'subsequent writes ... will succeed and then [the] kernel cannot free those dirtied pages' 3) 'If there is no subsequent write, the kernel can free the pages at any time.' Thoughts, questions, considerations... respectively: 1) Since the kernel didn't actually free the page (page_ref_freeze() failed), should the data not have been lost? (on userspace read.) 2) Should writes performed by the direct IO read be able to cancel the free operation? - Should the direct IO read be considered as 'the caller' too, as it's been requested by 'the caller'? - Should the bio technique to dirty pages on return to userspace (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) be considered in another/special way here? 3) Should an upcoming write from a previously requested direct IO read be considered as a subsequent write, so the kernel should not free the pages? (as it's known at the time of page reclaim.) And lastly: Technically, the last point would seem a reasonable consideration and balance, as the madvise(2) manual page apparently (and fairly) seem to assume that 'writes' are memory access from the userspace process (not explicitly considering writes from the kernel or its corner cases; again, fairly).. plus the kernel fix implementation for the corner case of the largely 'non-atomic write' encompassed by a direct IO read operation, is relatively simple; and it helps. Reproducer: ========== @ test.c (simplified, but works) #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main() { int fd, i; char buf; fd = open(DEV, O_RDONLY \| O_DIRECT); buf = mmap(NULL, BUF_SIZE, PROT_READ \| PROT_WRITE, MAP_PRIVATE \| MAP_ANONYMOUS, -1, 0); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) buf[i] = 1; // init to non-zero madvise(buf, BUF_SIZE, MADV_FREE); read(fd, buf, BUF_SIZE); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) printf("%p: 0x%x\n", &buf[i], buf[i]); return 0; } @ block/fops.c (formerly fs/block_dev.c) +#include <linux/swap.h> ... ... __blkdev_direct_IO[_simple](...) { ... + if (!strcmp(current->comm, "good")) + shrink_all_memory(ULONG_MAX); + ret = bio_iov_iter_get_pages(...); + + if (!strcmp(current->comm, "bad")) + shrink_all_memory(ULONG_MAX); ... } @ shell # NUM_PAGES=4 # PAGE_SIZE=$(getconf PAGE_SIZE) # yes \| dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES} # DEV=$(losetup -f --show test.img) # gcc -DDEV=\"$DEV\" \ -DBUF_SIZE=$((PAGE_SIZE NUM_PAGES)) \ -DPAGE_SIZE=${PAGE_SIZE} \ test.c -o test # od -tx1 $DEV 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a * 0040000 # mv test good # ./good 0x7f7c10418000: 0x79 0x7f7c10419000: 0x79 0x7f7c1041a000: 0x79 0x7f7c1041b000: 0x79 # mv good bad # ./bad 0x7fa1b8050000: 0x0 0x7fa1b8051000: 0x0 0x7fa1b8052000: 0x0 0x7fa1b8053000: 0x0 Note: the issue is consistent on v5.17-rc3, but it's intermittent with the support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c]. - v5.17-rc3: # for i in {1..1000}; do ./good; done \ \| cut -d: -f2 \| sort \| uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ \| cut -d: -f2 \| sort \| uniq -c 4000 0x0 # free \| grep Swap Swap: 0 0 0 - v4.5: # for i in {1..1000}; do ./good; done \ \| cut -d: -f2 \| sort \| uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ \| cut -d: -f2 \| sort \| uniq -c 2702 0x0 1298 0x79 # swapoff -av swapoff /swap # for i in {1..1000}; do ./bad; done \ \| cut -d: -f2 \| sort \| uniq -c 4000 0x79 Ceph/TCMalloc: ============= For documentation purposes, the use case driving the analysis/fix is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to release unused memory to the system from the mmap'ed page heap (might be committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. Note: TCMalloc switched back to MADV_DONTNEED a few commits after the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue just 'disappeared' on Ceph on later Ubuntu releases but is still present in the kernel, and can be hit by other use cases. The observed issue seems to be the old Ceph bug #22464 [1], where checksum mismatches are observed (and instrumentation with buffer dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). The issue in Ceph was reasonably deemed a kernel bug (comment #50) and mostly worked around with a retry mechanism, but other parts of Ceph could still hit that (rocksdb). Anyway, it's less likely to be hit again as TCMalloc switched out of MADV_FREE by default. (Some kernel versions/reports from the Ceph bug, and relation with the MADV_FREE introduction/changes; TCMalloc versions not checked.) - 4.4 good - 4.5 (madv_free: introduction) - 4.9 bad - 4.10 good? maybe a swapless system - 4.12 (madv_free: no longer free instantly on swapless systems) - 4.13 bad [1] https://tracker.ceph.com/issues/22464 Thanks: ====== Several people contributed to analysis/discussions/tests/reproducers in the first stages when drilling down on ceph/tcmalloc/linux kernel: - Dan Hill - Dan Streetman - Dongdong Tao - Gavin Guo - Gerald Yang - Heitor Alves de Siqueira - Ioanna Alifieraki - Jay Vosburgh - Matthew Ruffell - Ponnuvel Palaniyappan Reviews, suggestions, corrections, comments: - Minchan Kim - Yu Zhao - Huang, Ying - John Hubbard - Christoph Hellwig [mfo@canonical.com: v4] Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages") Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Hill <daniel.hill@canonical.com> Cc: Dan Streetman <dan.streetman@canonical.com> Cc: Dongdong Tao <dongdong.tao@canonical.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Gerald Yang <gerald.yang@canonical.com> Cc: Heitor Alves de Siqueira <halves@canonical.com> Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> Cc: <stable@vger.kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [mfo: backport: replace folio/test_flag with page/flag equivalents; real Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)") in v4.] Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:14 +02:00
John David Anglin	93a8347f72	parisc: Fix patch code locking and flushing [ Upstream commit a9fe7fa7d874a536e0540469f314772c054a0323 ] This change fixes the following: 1) The flags variable is not initialized. Always use raw_spin_lock_irqsave and raw_spin_unlock_irqrestore to serialize patching. 2) flush_kernel_vmap_range is primarily intended for DMA flushes. Since __patch_text_multiple is often called with interrupts disabled, it is better to directly call flush_kernel_dcache_range_asm and flush_kernel_icache_range_asm. This avoids an extra call. 3) The final call to flush_icache_range is unnecessary. Signed-off-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:14 +02:00
Helge Deller	f77f482ec3	parisc: Fix CPU affinity for Lasi, WAX and Dino chips [ Upstream commit 939fc856676c266c3bc347c1c1661872a3725c0f ] Add the missing logic to allow Lasi, WAX and Dino to set the CPU affinity. This fixes IRQ migration to other CPUs when a CPU is shutdown which currently holds the IRQs for one of those chips. Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:14 +02:00
Naresh Kamboju	30dd4af48a	selftests: net: Add tls config dependency for tls selftests [ Upstream commit d9142e1cf3bbdaf21337767114ecab26fe702d47 ] selftest net tls test cases need TLS=m without this the test hangs. Enabling config TLS solves this problem and runs to complete. - CONFIG_TLS=m Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:13 +02:00
Trond Myklebust	ea029e4ce7	NFS: Avoid writeback threads getting stuck in mempool_alloc() [ Upstream commit 0bae835b63c53f86cdc524f5962e39409585b22c ] In a low memory situation, allow the NFS writeback code to fail without getting stuck in infinite loops in mempool_alloc(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:13 +02:00
Trond Myklebust	da747de685	NFS: nfsiod should not block forever in mempool_alloc() [ Upstream commit 515dcdcd48736576c6f5c197814da6f81c60a21e ] The concern is that since nfsiod is sometimes required to kick off a commit, it can get locked up waiting forever in mempool_alloc() instead of failing gracefully and leaving the commit until later. Try to allocate from the slab first, with GFP_KERNEL \| __GFP_NORETRY, then fall back to a non-blocking attempt to allocate from the memory pool. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:13 +02:00
Trond Myklebust	e04ef859d6	SUNRPC: Fix socket waits for write buffer space [ Upstream commit 7496b59f588dd52886fdbac7633608097543a0a5 ] The socket layer requires that we use the socket lock to protect changes to the sock->sk_write_pending field and others. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:13 +02:00
Haimin Zhang	d925b7e78b	jfs: prevent NULL deref in diFree [ Upstream commit a53046291020ec41e09181396c1e829287b48d47 ] Add validation check for JFS_IP(ipimap)->i_imap to prevent a NULL deref in diFree since diFree uses it without do any validations. When function jfs_mount calls diMount to initialize fileset inode allocation map, it can fail and JFS_IP(ipimap)->i_imap won't be initialized. Then it calls diFreeSpecial to close fileset inode allocation map inode and it will flow into jfs_evict_inode. Function jfs_evict_inode just validates JFS_SBI(inode->i_sb)->ipimap, then calls diFree. diFree use JFS_IP(ipimap)->i_imap directly, then it will cause a NULL deref. Reported-by: TCS Robot <tcs_robot@tencent.com> Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:13 +02:00
Randy Dunlap	44c2d5fbe7	virtio_console: eliminate anonymous module_init & module_exit [ Upstream commit fefb8a2a941338d871e2d83fbd65fbfa068857bd ] Eliminate anonymous module_init() and module_exit(), which can lead to confusion or ambiguity when reading System.map, crashes/oops/bugs, or an initcall_debug log. Give each of these init and exit functions unique driver-specific names to eliminate the anonymous names. Example 1: (System.map) ffffffff832fc78c t init ffffffff832fc79e t init ffffffff832fc8f8 t init Example 2: (initcall_debug log) calling init+0x0/0x12 @ 1 initcall init+0x0/0x12 returned 0 after 15 usecs calling init+0x0/0x60 @ 1 initcall init+0x0/0x60 returned 0 after 2 usecs calling init+0x0/0x9a @ 1 initcall init+0x0/0x9a returned 0 after 74 usecs Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Amit Shah <amit@kernel.org> Cc: virtualization@lists.linux-foundation.org Cc: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20220316192010.19001-3-rdunlap@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:13 +02:00
Jiri Slaby	053bbff873	serial: samsung_tty: do not unlock port->lock for uart_write_wakeup() [ Upstream commit 988c7c00691008ea1daaa1235680a0da49dab4e8 ] The commit c15c3747ee32 (serial: samsung: fix potential soft lockup during uart write) added an unlock of port->lock before uart_write_wakeup() and a lock after it. It was always problematic to write data from tty_ldisc_ops::write_wakeup and it was even documented that way. We fixed the line disciplines to conform to this recently. So if there is still a missed one, we should fix them instead of this workaround. On the top of that, s3c24xx_serial_tx_dma_complete() in this driver still holds the port->lock while calling uart_write_wakeup(). So revert the wrap added by the commit above. Cc: Thomas Abraham <thomas.abraham@linaro.org> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Hyeonkook Kim <hk619.kim@samsung.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Link: https://lore.kernel.org/r/20220308115153.4225-1-jslaby@suse.cz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:13 +02:00
Nathan Chancellor	c393a9f4cb	x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy [ Upstream commit aaeed6ecc1253ce1463fa1aca0b70a4ccbc9fa75 ] There are two outstanding issues with CONFIG_X86_X32_ABI and llvm-objcopy, with similar root causes: 1. llvm-objcopy does not properly convert .note.gnu.property when going from x86_64 to x86_x32, resulting in a corrupted section when linking: https://github.com/ClangBuiltLinux/linux/issues/1141 2. llvm-objcopy produces corrupted compressed debug sections when going from x86_64 to x86_x32, also resulting in an error when linking: https://github.com/ClangBuiltLinux/linux/issues/514 After commit 41c5ef31ad71 ("x86/ibt: Base IBT bits"), the .note.gnu.property section is always generated when CONFIG_X86_KERNEL_IBT is enabled, which causes the first issue to become visible with an allmodconfig build: ld.lld: error: arch/x86/entry/vdso/vclock_gettime-x32.o:(.note.gnu.property+0x1c): program property is too short To avoid this error, do not allow CONFIG_X86_X32_ABI to be selected when using llvm-objcopy. If the two issues ever get fixed in llvm-objcopy, this can be turned into a feature check. Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220314194842.3452-3-nathan@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:13 +02:00
Peter Zijlstra	e3c961c56a	x86: Annotate call_on_stack() [ Upstream commit be0075951fde739f14ee2b659e2fd6e2499c46c0 ] vmlinux.o: warning: objtool: page_fault_oops()+0x13c: unreachable instruction 0000 000000000005b460 <page_fault_oops>: ... 0128 5b588: 49 89 23 mov %rsp,(%r11) 012b 5b58b: 4c 89 dc mov %r11,%rsp 012e 5b58e: 4c 89 f2 mov %r14,%rdx 0131 5b591: 48 89 ee mov %rbp,%rsi 0134 5b594: 4c 89 e7 mov %r12,%rdi 0137 5b597: e8 00 00 00 00 call 5b59c <page_fault_oops+0x13c> 5b598: R_X86_64_PLT32 handle_stack_overflow-0x4 013c 5b59c: 5c pop %rsp vmlinux.o: warning: objtool: sysvec_reboot()+0x6d: unreachable instruction 0000 00000000000033f0 <sysvec_reboot>: ... 005d 344d: 4c 89 dc mov %r11,%rsp 0060 3450: e8 00 00 00 00 call 3455 <sysvec_reboot+0x65> 3451: R_X86_64_PLT32 irq_enter_rcu-0x4 0065 3455: 48 89 ef mov %rbp,%rdi 0068 3458: e8 00 00 00 00 call 345d <sysvec_reboot+0x6d> 3459: R_X86_64_PC32 .text+0x47d0c 006d 345d: e8 00 00 00 00 call 3462 <sysvec_reboot+0x72> 345e: R_X86_64_PLT32 irq_exit_rcu-0x4 0072 3462: 5c pop %rsp Both cases are due to a call_on_stack() calling a __noreturn function. Since that's an inline asm, GCC can't do anything about the instructions after the CALL. Therefore put in an explicit ASM_REACHABLE annotation to make sure objtool and gcc are consistently confused about control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20220308154319.468805622@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:13 +02:00
NeilBrown	6bb2270223	NFS: swap-out must always use STABLE writes. [ Upstream commit c265de257f558a05c1859ee9e3fed04883b9ec0e ] The commit handling code is not safe against memory-pressure deadlocks when writing to swap. In particular, nfs_commitdata_alloc() blocks indefinitely waiting for memory, and this can consume all available workqueue threads. swap-out most likely uses STABLE writes anyway as COND_STABLE indicates that a stable write should be used if the write fits in a single request, and it normally does. However if we ever swap with a small wsize, or gather unusually large numbers of pages for a single write, this might change. For safety, make it explicit in the code that direct writes used for swap must always use FLUSH_STABLE. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:13 +02:00
NeilBrown	24d28d9b0f	NFS: swap IO handling is slightly different for O_DIRECT IO [ Upstream commit 64158668ac8b31626a8ce48db4cad08496eb8340 ] 1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding possible deadlocks with "fs_reclaim". These deadlocks could, I believe, eventuate if a buffered read on the swapfile was attempted. We don't need coherence with the page cache for a swap file, and buffered writes are forbidden anyway. There is no other need for i_rwsem during direct IO. So never take it for swap_rw() 2/ generic_write_checks() explicitly forbids writes to swap, and performs checks that are not needed for swap. So bypass it for swap_rw(). Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:12 +02:00
NeilBrown	a553864050	SUNRPC: remove scheduling boost for "SWAPPER" tasks. [ Upstream commit a80a8461868905823609be97f91776a26befe839 ] Currently, tasks marked as "swapper" tasks get put to the front of non-priority rpc_queues, and are sorted earlier than non-swapper tasks on the transport's ->xmit_queue. This is pointless as currently all tasks for a mount that has swap enabled on any file are marked as "swapper" tasks. So the net result is that the non-priority rpc_queues are reverse-ordered (LIFO). This scheduling boost is not necessary to avoid deadlocks, and hurts fairness, so remove it. If there were a need to expedite some requests, the tk_priority mechanism is a more appropriate tool. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:12 +02:00
NeilBrown	20700aa01b	SUNRPC/xprt: async tasks mustn't block waiting for memory [ Upstream commit a721035477fb5fb8abc738fbe410b07c12af3dc5 ] When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. xprt_dynamic_alloc_slot can block indefinitely. This can tie up all workqueue threads and NFS can deadlock. So when called from a workqueue, set __GFP_NORETRY. The rdma alloc_slot already does not block. However it sets the error to -EAGAIN suggesting this will trigger a sleep. It does not. As we can see in call_reserveresult(), only -ENOMEM causes a sleep. -EAGAIN causes immediate retry. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:12 +02:00
NeilBrown	a19fd1d617	SUNRPC/call_alloc: async tasks mustn't block waiting for memory [ Upstream commit c487216bec83b0c5a8803e5c61433d33ad7b104d ] When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. mempools are particularly a problem as memory can only be released back to the mempool by an async rpc task running. If all available workqueue threads are waiting on the mempool, no thread is available to return anything. rpc_malloc() can block, and this might cause deadlocks. So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if blocking is acceptable. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:12 +02:00
Maxime Ripard	b07387c476	clk: Enforce that disjoints limits are invalid [ Upstream commit 10c46f2ea914202482d19cf80dcc9c321c9ff59b ] If we were to have two users of the same clock, doing something like: clk_set_rate_range(user1, 1000, 2000); clk_set_rate_range(user2, 3000, 4000); The second call would fail with -EINVAL, preventing from getting in a situation where we end up with impossible limits. However, this is never explicitly checked against and enforced, and works by relying on an undocumented behaviour of clk_set_rate(). Indeed, on the first clk_set_rate_range will make sure the current clock rate is within the new range, so it will be between 1000 and 2000Hz. On the second clk_set_rate_range(), it will consider (rightfully), that our current clock is outside of the 3000-4000Hz range, and will call clk_core_set_rate_nolock() to set it to 3000Hz. clk_core_set_rate_nolock() will then call clk_calc_new_rates() that will eventually check that our rate 3000Hz rate is outside the min 3000Hz max 2000Hz range, will bail out, the error will propagate and we'll eventually return -EINVAL. This solely relies on the fact that clk_calc_new_rates(), and in particular clk_core_determine_round_nolock(), won't modify the new rate allowing the error to be reported. That assumption won't be true for all drivers, and most importantly we'll break that assumption in a later patch. It can also be argued that we shouldn't even reach the point where we're calling clk_core_set_rate_nolock(). Let's make an explicit check for disjoints range before we're doing anything. Signed-off-by: Maxime Ripard <maxime@cerno.tech> Link: https://lore.kernel.org/r/20220225143534.405820-4-maxime@cerno.tech Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:12 +02:00
Tony Lindgren	15bfec9d80	clk: ti: Preserve node in ti_dt_clocks_register() [ Upstream commit 80864594ff2ad002e2755daf97d46ff0c86faf1f ] In preparation for making use of the clock-output-names, we want to keep node around in ti_dt_clocks_register(). This change should not needed as a fix currently. Signed-off-by: Tony Lindgren <tony@atomide.com> Link: https://lore.kernel.org/r/20220204071449.16762-3-tony@atomide.com Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:12 +02:00
Dongli Zhang	5c0750cad7	xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32 [ Upstream commit eed05744322da07dd7e419432dcedf3c2e017179 ] The sched_clock() can be used very early since commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition, with commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock time from 0"), kdump kernel in Xen HVM guest may panic at very early stage when accessing &__this_cpu_read(xen_vcpu)->time as in below: setup_arch() -> init_hypervisor_platform() -> x86_init.hyper.init_platform = xen_hvm_guest_init() -> xen_hvm_init_time_ops() -> xen_clocksource_read() -> src = &__this_cpu_read(xen_vcpu)->time; This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info' embedded inside 'shared_info' during early stage until xen_vcpu_setup() is used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address. However, when Xen HVM guest panic on vcpu >= 32, since xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic. This patch calls xen_hvm_init_time_ops() again later in xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is registered when the boot vcpu is >= 32. This issue can be reproduced on purpose via below command at the guest side when kdump/kexec is enabled: "taskset -c 33 echo c > /proc/sysrq-trigger" The bugfix for PVM is not implemented due to the lack of testing environment. [boris: xen_hvm_init_time_ops() returns on errors instead of jumping to end] Cc: Joe Jin <joe.jin@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20220302164032.14569-3-dongli.zhang@oracle.com Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:12 +02:00
Ohad Sharabi	12e49aefda	habanalabs: fix possible memory leak in MMU DR fini [ Upstream commit eb85eec858c1a5c11d3a0bff403f6440b05b40dc ] This patch fixes what seems to be copy paste error. We will have a memory leak if the host-resident shadow is NULL (which will likely happen as the DR and HR are not dependent). Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:12 +02:00
Trond Myklebust	a34752aa23	NFSv4: Protect the state recovery thread against direct reclaim [ Upstream commit 3e17898aca293a24dae757a440a50aa63ca29671 ] If memory allocation triggers a direct reclaim from the state recovery thread, then we can deadlock. Use memalloc_nofs_save/restore to ensure that doesn't happen. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:12 +02:00
Xin Xiong	b37f482ba9	NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify() [ Upstream commit b7f114edd54326f730a754547e7cfb197b5bc132 ] [You don't often get email from xiongx18@fudan.edu.cn. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.] The reference counting issue happens in two error paths in the function _nfs42_proc_copy_notify(). In both error paths, the function simply returns the error code and forgets to balance the refcount of object `ctx`, bumped by get_nfs_open_context() earlier, which may cause refcount leaks. Fix it by balancing refcount of the `ctx` object before the function returns in both error paths. Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:11 +02:00
Lucas Denefle	24acdd5f9c	w1: w1_therm: fixes w1_seq for ds28ea00 sensors [ Upstream commit 41a92a89eee819298f805c40187ad8b02bb53426 ] w1_seq was failing due to several devices responding to the CHAIN_DONE at the same time. Now properly selects the current device in the chain with MATCH_ROM. Also acknowledgment was read twice. Signed-off-by: Lucas Denefle <lucas.denefle@converge.io> Link: https://lore.kernel.org/r/20220223113558.232750-1-lucas.denefle@converge.io Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:11 +02:00
Xiaoke Wang	86efcb524a	staging: wfx: fix an error handling in wfx_init_common() [ Upstream commit 60f1d3c92dc1ef1026e5b917a329a7fa947da036 ] One error handler of wfx_init_common() return without calling ieee80211_free_hw(hw), which may result in memory leak. And I add one err label to unify the error handler, which is useful for the subsequent changes. Suggested-by: Jérôme Pouiller <jerome.pouiller@silabs.com> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jérôme Pouiller <jerome.pouiller@silabs.com> Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com> Link: https://lore.kernel.org/r/tencent_24A24A3EFF61206ECCC4B94B1C5C1454E108@qq.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:11 +02:00
Viresh Kumar	7295544bcf	opp: Expose of-node's name in debugfs [ Upstream commit 021dbecabc93b1610b5db989d52a94e0c6671136 ] It is difficult to find which OPPs are active at the moment, specially if there are multiple OPPs with same frequency available in the device tree (controlled by supported hardware feature). Expose name of the DT node to find out the exact OPP. While at it, also expose level field. Reported-by: Leo Yan <leo.yan@linaro.org> Tested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:11 +02:00
Pierre Gondois	ea1f295841	cpufreq: CPPC: Fix performance/frequency conversion [ Upstream commit ec1c7ad47664f964c1101fe555b6fde0cb124b38 ] CPUfreq governors request CPU frequencies using information on current CPU usage. The CPPC driver converts them to performance requests. Frequency targets are computed as: target_freq = (util / cpu_capacity) * max_freq target_freq is then clamped between [policy->min, policy->max]. The CPPC driver converts performance values to frequencies (and vice-versa) using cppc_cpufreq_perf_to_khz() and cppc_cpufreq_khz_to_perf(). These functions both use two different factors depending on the range of the input value. For cppc_cpufreq_khz_to_perf(): - (NOMINAL_PERF / NOMINAL_FREQ) or - (LOWEST_PERF / LOWEST_FREQ) and for cppc_cpufreq_perf_to_khz(): - (NOMINAL_FREQ / NOMINAL_PERF) or - ((NOMINAL_PERF - LOWEST_FREQ) / (NOMINAL_PERF - LOWEST_PERF)) This means: 1- the functions are not inverse for some values: (perf_to_khz(khz_to_perf(x)) != x) 2- cppc_cpufreq_perf_to_khz(LOWEST_PERF) can sometimes give a different value from LOWEST_FREQ due to integer approximation 3- it is implied that performance and frequency are proportional (NOMINAL_FREQ / NOMINAL_PERF) == (LOWEST_PERF / LOWEST_FREQ) This patch changes the conversion functions to an affine function. This fixes the 3 points above. Suggested-by: Lukasz Luba <lukasz.luba@arm.com> Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:11 +02:00
Sascha Hauer	26f0a9e3d0	clk: rockchip: drop CLK_SET_RATE_PARENT from dclk_vop* on rk3568 [ Upstream commit ff3187eabb5ce478d15b6ed62eb286756adefac3 ] The pixel clocks dclk_vop[012] can be clocked from hpll, vpll, gpll or cpll. gpll and cpll also drive many other clocks, so changing the dclk_vop[012] clocks could change these other clocks as well. Drop CLK_SET_RATE_PARENT to fix that. With this change the VOP2 driver can only adjust the pixel clocks with the divider between the PLL and the dclk_vop[012] which means the user may have to adjust the PLL clock to a suitable rate using the assigned-clock-rate device tree property. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Link: https://lore.kernel.org/r/20220126145549.617165-25-s.hauer@pengutronix.de Signed-off-by: Heiko Stuebner <heiko@sntech.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:11 +02:00
Amjad Ouled-Ameur	caffa76ded	phy: amlogic: meson8b-usb2: fix shared reset control use [ Upstream commit 6f1dedf089ab1a4f03ea7aadc3c4a99885b4b4a0 ] Use reset_control_rearm() call if an error occurs in case phy_meson8b_usb2_power_on() fails after reset() has been called, or in case phy_meson8b_usb2_power_off() is called i.e the resource is no longer used and the reset line may be triggered again by other devices. reset_control_rearm() keeps use of triggered_count sane in the reset framework, use of reset_control_reset() on shared reset line should be balanced with reset_control_rearm(). Signed-off-by: Amjad Ouled-Ameur <aouledameur@baylibre.com> Reported-by: Jerome Brunet <jbrunet@baylibre.com> Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Acked-by: Neil Armstrong <narmstrong@baylibre.com> Link: https://lore.kernel.org/r/20220111095255.176141-4-aouledameur@baylibre.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:11 +02:00
Amjad Ouled-Ameur	ab27675a0f	phy: amlogic: meson8b-usb2: Use dev_err_probe() [ Upstream commit 6466ba1898d415b527e1013bd8551a6fdfece94c ] Use the existing dev_err_probe() helper instead of open-coding the same operation. Signed-off-by: Amjad Ouled-Ameur <aouledameur@baylibre.com> Reported-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Acked-by: Neil Armstrong <narmstrong@baylibre.com> Link: https://lore.kernel.org/r/20220111095255.176141-3-aouledameur@baylibre.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:11 +02:00
Amjad Ouled-Ameur	35df38c4be	phy: amlogic: phy-meson-gxl-usb2: fix shared reset controller use [ Upstream commit 2f87727130ce17ffefecd0895eeebf22d5a36f6f ] Use reset_control_rearm() call if an error occurs in case phy_meson_gxl_usb2_init() fails after reset() has been called ; or in case phy_meson_gxl_usb2_exit() is called i.e the resource is no longer used and the reset line may be triggered again by other devices. reset_control_rearm() keeps use of triggered_count sane in the reset framework. Therefore, use of reset_control_reset() on shared reset line should be balanced with reset_control_rearm(). Signed-off-by: Amjad Ouled-Ameur <aouledameur@baylibre.com> Reported-by: Jerome Brunet <jbrunet@baylibre.com> Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de> Acked-by: Neil Armstrong <narmstrong@baylibre.com> Link: https://lore.kernel.org/r/20220111095255.176141-2-aouledameur@baylibre.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:11 +02:00
Stefan Wahren	42f2142a33	staging: vchiq_core: handle NULL result of find_service_by_handle [ Upstream commit ca225857faf237234d2fffe5d1919467dfadd822 ] In case of an invalid handle the function find_servive_by_handle returns NULL. So take care of this and avoid a NULL pointer dereference. Reviewed-by: Nicolas Saenz Julienne <nsaenz@kernel.org> Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com> Link: https://lore.kernel.org/r/1642968143-19281-18-git-send-email-stefan.wahren@i2se.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:11 +02:00
Stefan Wahren	176df12b38	staging: vchiq_arm: Avoid NULL ptr deref in vchiq_dump_platform_instances [ Upstream commit aa899e686d442c63d50f4d369cc02dbbf0941cb0 ] vchiq_get_state() can return a NULL pointer. So handle this cases and avoid a NULL pointer derefence in vchiq_dump_platform_instances. Reviewed-by: Nicolas Saenz Julienne <nsaenz@kernel.org> Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com> Link: https://lore.kernel.org/r/1642968143-19281-17-git-send-email-stefan.wahren@i2se.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:10 +02:00
Adam Wujek	161863aec0	clk: si5341: fix reported clk_rate when output divider is 2 [ Upstream commit 2a8b539433e111c4de364237627ef219d2f6350a ] SI5341_OUT_CFG_RDIV_FORCE2 shall be checked first to distinguish whether a divider for a given output is set to 2 (SI5341_OUT_CFG_RDIV_FORCE2 is set) or the output is disabled (SI5341_OUT_CFG_RDIV_FORCE2 not set, SI5341_OUT_R_REG is set 0). Before the change, divider set to 2 (SI5341_OUT_R_REG set to 0) was interpreted as output is disabled. Signed-off-by: Adam Wujek <dev_public@wujek.eu> Link: https://lore.kernel.org/r/20211203141125.2447520-1-dev_public@wujek.eu Reviewed-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:10 +02:00
Qinghua Jin	31e027259c	minix: fix bug when opening a file with O_DIRECT [ Upstream commit 9ce3c0d26c42d279b6c378a03cd6a61d828f19ca ] Testcase: 1. create a minix file system and mount it 2. open a file on the file system with O_RDWR\|O_CREAT\|O_TRUNC\|O_DIRECT 3. open fails with -EINVAL but leaves an empty file behind. All other open() failures don't leave the failed open files behind. It is hard to check the direct_IO op before creating the inode. Just as ext4 and btrfs do, this patch will resolve the issue by allowing to create the file with O_DIRECT but returning error when writing the file. Link: https://lkml.kernel.org/r/20220107133626.413379-1-qhjin.dev@gmail.com Signed-off-by: Qinghua Jin <qhjin.dev@gmail.com> Reported-by: Colin Ian King <colin.king@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:10 +02:00
Randy Dunlap	4602b7a8ee	init/main.c: return 1 from handled __setup() functions [ Upstream commit f9a40b0890658330c83c95511f9d6b396610defc ] initcall_blacklist() should return 1 to indicate that it handled its cmdline arguments. set_debug_rodata() should return 1 to indicate that it handled its cmdline arguments. Print a warning if the option string is invalid. This prevents these strings from being added to the 'init' program's environment as they are not init arguments/parameters. Link: https://lkml.kernel.org/r/20220221050901.23985-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru> Cc: Ingo Molnar <mingo@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:10 +02:00
Feng Tang	9d849449d2	lib/Kconfig.debug: add ARCH dependency for FUNCTION_ALIGN option [ Upstream commit 1bf18da62106225dbc47aab41efee2aeb99caccd ] 0Day robots reported there is compiling issue for 'csky' ARCH when CONFIG_DEBUG_FORCE_DATA_SECTION_ALIGNED is enabled [1]: All errors (new ones prefixed by >>): {standard input}: Assembler messages: >> {standard input}:2277: Error: pcrel offset for branch to .LS000B too far (0x3c) Which was discussed in [2]. And as there is no solution for csky yet, add some dependency for this config to limit it to several ARCHs which have no compiling issue so far. [1]. https://lore.kernel.org/lkml/202202271612.W32UJAj2-lkp@intel.com/ [2]. https://www.spinics.net/lists/linux-kbuild/msg30298.html Link: https://lkml.kernel.org/r/20220304021100.GN4548@shbuild999.sh.intel.com Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Guo Ren <guoren@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:10 +02:00
Xiubo Li	2fe82d3254	ceph: fix memory leak in ceph_readdir when note_last_dentry returns error [ Upstream commit f639d9867eea647005dc824e0e24f39ffc50d4e4 ] Reset the last_readdir at the same time, and add a comment explaining why we don't free last_readdir when dir_emit returns false. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:10 +02:00
Xiubo Li	3ae7163598	ceph: fix inode reference leakage in ceph_get_snapdir() [ Upstream commit 322794d3355c33adcc4feace0045d85a8e4ed813 ] The ceph_get_inode() will search for or insert a new inode into the hash for the given vino, and return a reference to it. If new is non-NULL, its reference is consumed. We should release the reference when in error handing cases. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:10 +02:00
Wang Yufen	eb0e7173d9	netlabel: fix out-of-bounds memory accesses [ Upstream commit f22881de730ebd472e15bcc2c0d1d46e36a87b9c ] In calipso_map_cat_ntoh(), in the for loop, if the return value of netlbl_bitmap_walk() is equal to (net_clen_bits - 1), when netlbl_bitmap_walk() is called next time, out-of-bounds memory accesses of bitmap[byte_offset] occurs. The bug was found during fuzzing. The following is the fuzzing report BUG: KASAN: slab-out-of-bounds in netlbl_bitmap_walk+0x3c/0xd0 Read of size 1 at addr ffffff8107bf6f70 by task err_OH/252 CPU: 7 PID: 252 Comm: err_OH Not tainted 5.17.0-rc7+ #17 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x21c/0x230 show_stack+0x1c/0x60 dump_stack_lvl+0x64/0x7c print_address_description.constprop.0+0x70/0x2d0 __kasan_report+0x158/0x16c kasan_report+0x74/0x120 __asan_load1+0x80/0xa0 netlbl_bitmap_walk+0x3c/0xd0 calipso_opt_getattr+0x1a8/0x230 calipso_sock_getattr+0x218/0x340 calipso_sock_getattr+0x44/0x60 netlbl_sock_getattr+0x44/0x80 selinux_netlbl_socket_setsockopt+0x138/0x170 selinux_socket_setsockopt+0x4c/0x60 security_socket_setsockopt+0x4c/0x90 __sys_setsockopt+0xbc/0x2b0 __arm64_sys_setsockopt+0x6c/0x84 invoke_syscall+0x64/0x190 el0_svc_common.constprop.0+0x88/0x200 do_el0_svc+0x88/0xa0 el0_svc+0x128/0x1b0 el0t_64_sync_handler+0x9c/0x120 el0t_64_sync+0x16c/0x170 Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Yufen <wangyufen@huawei.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:10 +02:00
Florian Westphal	58d52743ae	netfilter: conntrack: revisit gc autotuning [ Upstream commit 2cfadb761d3d0219412fd8150faea60c7e863833 ] as of commit 4608fdfc07e1 ("netfilter: conntrack: collect all entries in one cycle") conntrack gc was changed to run every 2 minutes. On systems where conntrack hash table is set to large value, most evictions happen from gc worker rather than the packet path due to hash table distribution. This causes netlink event overflows when events are collected. This change collects average expiry of scanned entries and reschedules to the average remaining value, within 1 to 60 second interval. To avoid event overflows, reschedule after each bucket and add a limit for both run time and number of evictions per run. If more entries have to be evicted, reschedule and restart 1 jiffy into the future. Reported-by: Karel Rericha <karel@maxtel.cz> Cc: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:10 +02:00
Luiz Augusto von Dentz	d404765dff	Bluetooth: Fix use after free in hci_send_acl [ Upstream commit f63d24baff787e13b723d86fe036f84bdbc35045 ] This fixes the following trace caused by receiving HCI_EV_DISCONN_PHY_LINK_COMPLETE which does call hci_conn_del without first checking if conn->type is in fact AMP_LINK and in case it is do properly cleanup upper layers with hci_disconn_cfm: ================================================================== BUG: KASAN: use-after-free in hci_send_acl+0xaba/0xc50 Read of size 8 at addr ffff88800e404818 by task bluetoothd/142 CPU: 0 PID: 142 Comm: bluetoothd Not tainted 5.17.0-rc5-00006-gda4022eeac1a #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x45/0x59 print_address_description.constprop.0+0x1f/0x150 kasan_report.cold+0x7f/0x11b hci_send_acl+0xaba/0xc50 l2cap_do_send+0x23f/0x3d0 l2cap_chan_send+0xc06/0x2cc0 l2cap_sock_sendmsg+0x201/0x2b0 sock_sendmsg+0xdc/0x110 sock_write_iter+0x20f/0x370 do_iter_readv_writev+0x343/0x690 do_iter_write+0x132/0x640 vfs_writev+0x198/0x570 do_writev+0x202/0x280 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RSP: 002b:00007ffce8a099b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014 Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10 RDX: 0000000000000001 RSI: 00007ffce8a099e0 RDI: 0000000000000015 RAX: ffffffffffffffda RBX: 00007ffce8a099e0 RCX: 00007f788fc3cf77 R10: 00007ffce8af7080 R11: 0000000000000246 R12: 000055e4ccf75580 RBP: 0000000000000015 R08: 0000000000000002 R09: 0000000000000001 </TASK> R13: 000055e4ccf754a0 R14: 000055e4ccf75cd0 R15: 000055e4ccf4a6b0 Allocated by task 45: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x81/0xa0 hci_chan_create+0x9a/0x2f0 l2cap_conn_add.part.0+0x1a/0xdc0 l2cap_connect_cfm+0x236/0x1000 le_conn_complete_evt+0x15a7/0x1db0 hci_le_conn_complete_evt+0x226/0x2c0 hci_le_meta_evt+0x247/0x450 hci_event_packet+0x61b/0xe90 hci_rx_work+0x4d5/0xc50 process_one_work+0x8fb/0x15a0 worker_thread+0x576/0x1240 kthread+0x29d/0x340 ret_from_fork+0x1f/0x30 Freed by task 45: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0xfb/0x130 kfree+0xac/0x350 hci_conn_cleanup+0x101/0x6a0 hci_conn_del+0x27e/0x6c0 hci_disconn_phylink_complete_evt+0xe0/0x120 hci_event_packet+0x812/0xe90 hci_rx_work+0x4d5/0xc50 process_one_work+0x8fb/0x15a0 worker_thread+0x576/0x1240 kthread+0x29d/0x340 ret_from_fork+0x1f/0x30 The buggy address belongs to the object at ffff88800c0f0500 The buggy address is located 24 bytes inside of which belongs to the cache kmalloc-128 of size 128 The buggy address belongs to the page: 128-byte region [ffff88800c0f0500, ffff88800c0f0580) flags: 0x100000000000200(slab\|node=0\|zone=1) page:00000000fe45cd86 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xc0f0 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 raw: 0100000000000200 ffffea00003a2c80 dead000000000004 ffff8880078418c0 page dumped because: kasan: bad access detected ffff88800c0f0400: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc Memory state around the buggy address: >ffff88800c0f0500: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88800c0f0480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88800c0f0580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ================================================================== ffff88800c0f0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Reported-by: Sönke Huster <soenke.huster@eknoes.de> Tested-by: Sönke Huster <soenke.huster@eknoes.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:09 +02:00
Krzysztof Kozlowski	f249bbf3cb	MIPS: ingenic: correct unit node address [ Upstream commit 8931ddd8d6a55fcefb20f44a38ba42bb746f0b62 ] Unit node addresses should not have leading 0x: Warning (unit_address_format): /nemc@13410000/efuse@d0/eth-mac-addr@0x22: unit name should not have leading "0x" Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Reviewed-by: Paul Cercueil <paul@crapouillou.net> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:09 +02:00
Max Filippov	11ba1aa212	xtensa: fix DTC warning unit_address_format [ Upstream commit e85d29ba4b24f68e7a78cb85c55e754362eeb2de ] DTC issues the following warnings when building xtfpga device trees: /soc/flash@00000000/partition@0x0: unit name should not have leading "0x" /soc/flash@00000000/partition@0x6000000: unit name should not have leading "0x" /soc/flash@00000000/partition@0x6800000: unit name should not have leading "0x" /soc/flash@00000000/partition@0x7fe0000: unit name should not have leading "0x" Drop leading 0x from flash partition unit names. Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:09 +02:00
Deren Wu	13946d5a68	mt76: fix monitor mode crash with sdio driver [ Upstream commit 123bc712b1de0805f9d683687e17b1ec2aba0b68 ] mt7921s driver may receive frames with fragment buffers. If there is a CTS packet received in monitor mode, the payload is 10 bytes only and need 6 bytes header padding after RXD buffer. However, only RXD in the first linear buffer, if we pull buffer size RXD-size+6 bytes with skb_pull(), that would trigger "BUG_ON(skb->len < skb->data_len)" in __skb_pull(). To avoid the nonlinear buffer issue, enlarge the RXD size from 128 to 256 to make sure all MCU operation in linear buffer. [ 52.007562] kernel BUG at include/linux/skbuff.h:2313! [ 52.007578] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 52.007987] pc : skb_pull+0x48/0x4c [ 52.008015] lr : mt7921_queue_rx_skb+0x494/0x890 [mt7921_common] [ 52.008361] Call trace: [ 52.008377] skb_pull+0x48/0x4c [ 52.008400] mt76s_net_worker+0x134/0x1b0 [mt76_sdio 35339a92c6eb7d4bbcc806a1d22f56365565135c] [ 52.008431] __mt76_worker_fn+0xe8/0x170 [mt76 ef716597d11a77150bc07e3fdd68eeb0f9b56917] [ 52.008449] kthread+0x148/0x3ac [ 52.008466] ret_from_fork+0x10/0x30 Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Sean Wang <sean.wang@mediatek.com> Signed-off-by: Deren Wu <deren.wu@mediatek.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:09 +02:00
H. Nikolaus Schaller	ac27808b82	usb: dwc3: omap: fix "unbalanced disables for smps10_out1" on omap5evm [ Upstream commit ac01df343e5a6c6bcead2ed421af1fde30f73e7e ] Usually, the vbus_regulator (smps10 on omap5evm) boots up disabled. Hence calling regulator_disable() indirectly through dwc3_omap_set_mailbox() during probe leads to: [ 10.332764] WARNING: CPU: 0 PID: 1628 at drivers/regulator/core.c:2853 _regulator_disable+0x40/0x164 [ 10.351919] unbalanced disables for smps10_out1 [ 10.361298] Modules linked in: dwc3_omap(+) clk_twl6040 at24 gpio_twl6040 palmas_gpadc palmas_pwrbutton industrialio snd_soc_omap_mcbsp(+) snd_soc_ti_sdma display_connector ti_tpd12s015 drm leds_gpio drm_panel_orientation_quirks ip_tables x_tables ipv6 autofs4 [ 10.387818] CPU: 0 PID: 1628 Comm: systemd-udevd Not tainted 5.17.0-rc1-letux-lpae+ #8139 [ 10.405129] Hardware name: Generic OMAP5 (Flattened Device Tree) [ 10.411455] unwind_backtrace from show_stack+0x10/0x14 [ 10.416970] show_stack from dump_stack_lvl+0x40/0x4c [ 10.422313] dump_stack_lvl from __warn+0xb8/0x170 [ 10.427377] __warn from warn_slowpath_fmt+0x70/0x9c [ 10.432595] warn_slowpath_fmt from _regulator_disable+0x40/0x164 [ 10.439037] _regulator_disable from regulator_disable+0x30/0x64 [ 10.445382] regulator_disable from dwc3_omap_set_mailbox+0x8c/0xf0 [dwc3_omap] [ 10.453116] dwc3_omap_set_mailbox [dwc3_omap] from dwc3_omap_probe+0x2b8/0x394 [dwc3_omap] [ 10.467021] dwc3_omap_probe [dwc3_omap] from platform_probe+0x58/0xa8 [ 10.481762] platform_probe from really_probe+0x168/0x2fc [ 10.481782] really_probe from __driver_probe_device+0xc4/0xd8 [ 10.481782] __driver_probe_device from driver_probe_device+0x24/0xa4 [ 10.503762] driver_probe_device from __driver_attach+0xc4/0xd8 [ 10.510018] __driver_attach from bus_for_each_dev+0x64/0xa0 [ 10.516001] bus_for_each_dev from bus_add_driver+0x148/0x1a4 [ 10.524880] bus_add_driver from driver_register+0xb4/0xf8 [ 10.530678] driver_register from do_one_initcall+0x90/0x1c4 [ 10.536661] do_one_initcall from do_init_module+0x4c/0x200 [ 10.536683] do_init_module from load_module+0x13dc/0x1910 [ 10.551159] load_module from sys_finit_module+0xc8/0xd8 [ 10.561319] sys_finit_module from __sys_trace_return+0x0/0x18 [ 10.561336] Exception stack(0xc344bfa8 to 0xc344bff0) [ 10.561341] bfa0: b6fb5778 b6fab8d8 00000007 b6ecfbb8 00000000 b6ed0398 [ 10.561341] bfc0: b6fb5778 b6fab8d8 855c0500 0000017b 00020000 b6f9a3cc 00000000 b6fb5778 [ 10.595500] bfe0: bede18f8 bede18e8 b6ec9aeb b6dda1c2 [ 10.601345] ---[ end trace 0000000000000000 ]--- Fix this unnecessary warning by checking if the regulator is enabled. Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com> Link: https://lore.kernel.org/r/af3b750dc2265d875deaabcf5f80098c9645da45.1646744616.git.hns@goldelico.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:09 +02:00
Michael Walle	0616792164	net: sfp: add 2500base-X quirk for Lantech SFP module [ Upstream commit 00eec9fe4f3b9588b4bfa8ef9dd0aae96407d5d7 ] The Lantech 8330-262D-E module is 2500base-X capable, but it reports the nominal bitrate as 2500MBd instead of 3125MBd. Add a quirk for the module. The following in an EEPROM dump of such a SFP with the serial number redacted: 00: 03 04 07 00 00 00 01 20 40 0c 05 01 19 00 00 00 ???...? @????... 10: 1e 0f 00 00 4c 61 6e 74 65 63 68 20 20 20 20 20 ??..Lantech 20: 20 20 20 20 00 00 00 00 38 33 33 30 2d 32 36 32 ....8330-262 30: 44 2d 45 20 20 20 20 20 56 31 2e 30 03 52 00 cb D-E V1.0?R.? 40: 00 1a 00 00 46 43 XX XX XX XX XX XX XX XX XX XX .?..FCXXXXXXXXXX 50: 20 20 20 20 32 32 30 32 31 34 20 20 68 b0 01 98 220214 h??? 60: 45 58 54 52 45 4d 45 4c 59 20 43 4f 4d 50 41 54 EXTREMELY COMPAT 70: 49 42 4c 45 20 20 20 20 20 20 20 20 20 20 20 20 IBLE Signed-off-by: Michael Walle <michael@walle.cc> Link: https://lore.kernel.org/r/20220312205014.4154907-1-michael@walle.cc Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:09 +02:00
Gal Pressman	459e56859f	net/mlx5e: Remove overzealous validations in netlink EEPROM query [ Upstream commit 970adfb76095fa719778d70a6b86030d2feb88dd ] Unlike the legacy EEPROM callbacks, when using the netlink EEPROM query (get_module_eeprom_by_page) the driver should not try to validate the query parameters, but just perform the read requested by the userspace. Recent discussion in the mailing list: https://lore.kernel.org/netdev/20220120093051.70845141@kicinski-fedora-PC1C0HJN.hsd1.ca.comcast.net/ Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:09 +02:00
Jakub Kicinski	1c4561d9b5	net: limit altnames to 64k total [ Upstream commit 155fb43b70b5fce341347a77d1af2765d1e8fbb8 ] Property list (altname is a link "property") is wrapped in a nlattr. nlattrs length is 16bit so practically speaking the list of properties can't be longer than that, otherwise user space would have to interpret broken netlink messages. Prevent the problem from occurring by checking the length of the property list before adding new entries. Reported-by: George Shuklin <george.shuklin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:09 +02:00
Jakub Kicinski	601f748029	net: account alternate interface name memory [ Upstream commit 5d26cff5bdbebdf98ba48217c078ff102536f134 ] George reports that altnames can eat up kernel memory. We should charge that memory appropriately. Reported-by: George Shuklin <george.shuklin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-04-13 20:59:09 +02:00

1 2 3 4 5 ...

1050567 Commits