linux

iv/linux

Author	SHA1	Message	Date
Matthew Wilcox (Oracle)	41a638a1b3	affs: convert affs_symlink_read_folio() to use the folio Remove use of the old page APIs. That includes use of setting PageError on error; simply not setting the uptodate flag is sufficient. Link: https://lkml.kernel.org/r/20230713035512.4139457-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Sterba <dsterba@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Tom Rix <trix@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:29 -07:00
Jiaqi Yan	38c1ddbde6	hugetlbfs: improve read HWPOISON hugepage When a hugepage contains HWPOISON pages, read() fails to read any byte of the hugepage and returns -EIO, although many bytes in the HWPOISON hugepage are readable. Improve this by allowing hugetlbfs_read_iter returns as many bytes as possible. For a requested range [offset, offset + len) that contains HWPOISON page, return [offset, first HWPOISON page addr); the next read attempt will fail and return -EIO. Link: https://lkml.kernel.org/r/20230713001833.3778937-4-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: James Houghton <jthoughton@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:26 -07:00
Axel Rasmussen	fc71884a5f	mm: userfaultfd: add new UFFDIO_POISON ioctl The basic idea here is to "simulate" memory poisoning for VMs. A VM running on some host might encounter a memory error, after which some page(s) are poisoned (i.e., future accesses SIGBUS). They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages. When live migrating, we try to get the guest running on its new host as quickly as possible. So, we start it running before all memory has been copied, and before we're certain which pages should be poisoned or not. So the basic way to use this new feature is: - On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose). - On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned. - If so, we can respond with a UFFDIO_POISON - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly. UFFDIO_POISON does not handle unmapping previously-present PTEs. This isn't needed, because during live migration we want to intercept all accesses with userfaultfd (not just writes, so WP mode isn't useful for this). So whether minor or missing mode is being used (or both), the PTE won't be present in any case, so handling that case isn't needed. Similarly, UFFDIO_POISON won't replace existing PTE markers. This might be okay to do, but it seems to be safer to just refuse to overwrite any existing entry (like a UFFD_WP PTE marker). Link: https://lkml.kernel.org/r/20230707215540.2324998-5-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gaosheng Cui <cuigaosheng1@huawei.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:16 -07:00
Axel Rasmussen	2ef5d7245d	mm: userfaultfd: check for start + len overflow in validate_range Most userfaultfd ioctls take a `start + len` range as an argument. We have the validate_range helper to check that such ranges are valid. However, some (but not all!) ioctls also check that `start + len` doesn't wrap around (overflow). Just check for this in validate_range. This saves some repetitive code, and adds the check to some ioctls which weren't bothering to check for it before. [axelrasmussen@google.com: call validate_range() on the src range too] Link: https://lkml.kernel.org/r/20230714182932.2608735-1-axelrasmussen@google.com [axelrasmussen@google.com: fix src/dst validation] Link: https://lkml.kernel.org/r/20230810192128.1855570-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20230707215540.2324998-3-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gaosheng Cui <cuigaosheng1@huawei.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:16 -07:00
David Howells	b4fa966f03	mm, netfs, fscache: stop read optimisation when folio removed from pagecache Fscache has an optimisation by which reads from the cache are skipped until we know that (a) there's data there to be read and (b) that data isn't entirely covered by pages resident in the netfs pagecache. This is done with two flags manipulated by fscache_note_page_release(): if (... test_bit(FSCACHE_COOKIE_HAVE_DATA, &cookie->flags) && test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags)) clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags); where the NO_DATA_TO_READ flag causes cachefiles_prepare_read() to indicate that netfslib should download from the server or clear the page instead. The fscache_note_page_release() function is intended to be called from ->releasepage() - but that only gets called if PG_private or PG_private_2 is set - and currently the former is at the discretion of the network filesystem and the latter is only set whilst a page is being written to the cache, so sometimes we miss clearing the optimisation. Fix this by following Willy's suggestion[1] and adding an address_space flag, AS_RELEASE_ALWAYS, that causes filemap_release_folio() to always call ->release_folio() if it's set, even if PG_private or PG_private_2 aren't set. Note that this would require folio_test_private() and page_has_private() to become more complicated. To avoid that, in the places[] where these are used to conditionalise calls to filemap_release_folio() and try_to_release_page(), the tests are removed the those functions just jumped to unconditionally and the test is performed there. [] There are some exceptions in vmscan.c where the check guards more than just a call to the releaser. I've added a function, folio_needs_release() to wrap all the checks for that. AS_RELEASE_ALWAYS should be set if a non-NULL cookie is obtained from fscache and cleared in ->evict_inode() before truncate_inode_pages_final() is called. Additionally, the FSCACHE_COOKIE_NO_DATA_TO_READ flag needs to be cleared and the optimisation cancelled if a cachefiles object already contains data when we open it. [dwysocha@redhat.com: call folio_mapping() inside folio_needs_release()] Link: `902c990e31` Link: https://lkml.kernel.org/r/20230628104852.3391651-3-dhowells@redhat.com Fixes: 1f67e6d0b188 ("fscache: Provide a function to note the release of a page") Fixes: 047487c947e8 ("cachefiles: Implement the I/O routines") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: SeongJae Park <sj@kernel.org> Cc: Daire Byrne <daire.byrne@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:13 -07:00
David Howells	0201ebf274	mm: merge folio_has_private()/filemap_release_folio() call pairs Patch series "mm, netfs, fscache: Stop read optimisation when folio removed from pagecache", v7. This fixes an optimisation in fscache whereby we don't read from the cache for a particular file until we know that there's data there that we don't have in the pagecache. The problem is that I'm no longer using PG_fscache (aka PG_private_2) to indicate that the page is cached and so I don't get a notification when a cached page is dropped from the pagecache. The first patch merges some folio_has_private() and filemap_release_folio() pairs and introduces a helper, folio_needs_release(), to indicate if a release is required. The second patch is the actual fix. Following Willy's suggestions[1], it adds an AS_RELEASE_ALWAYS flag to an address_space that will make filemap_release_folio() always call ->release_folio(), even if PG_private/PG_private_2 aren't set. folio_needs_release() is altered to add a check for this. This patch (of 2): Make filemap_release_folio() check folio_has_private(). Then, in most cases, where a call to folio_has_private() is immediately followed by a call to filemap_release_folio(), we can get rid of the test in the pair. There are a couple of sites in mm/vscan.c that this can't so easily be done. In shrink_folio_list(), there are actually three cases (something different is done for incompletely invalidated buffers), but filemap_release_folio() elides two of them. In shrink_active_list(), we don't have have the folio lock yet, so the check allows us to avoid locking the page unnecessarily. A wrapper function to check if a folio needs release is provided for those places that still need to do it in the mm/ directory. This will acquire additional parts to the condition in a future patch. After this, the only remaining caller of folio_has_private() outside of mm/ is a check in fuse. Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Xiubo Li <xiubli@redhat.com> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:12 -07:00
Andrew Yang	8a144612eb	fs: drop_caches: draining pages before dropping caches We expect a file page access after dropping caches should be a major fault, but sometimes it's still a minor fault. That's because a file page can't be dropped if it's in a per-cpu pagevec. Draining all pages from per-cpu pagevec to lru list before trying to drop caches. Link: https://lkml.kernel.org/r/20230630092203.16080-1-andrew.yang@mediatek.com Signed-off-by: Andrew Yang <andrew.yang@mediatek.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:11 -07:00
xu xin	6080d19f07	ksm: add ksm zero pages for each process As the number of ksm zero pages is not included in ksm_merging_pages per process when enabling use_zero_pages, it's unclear of how many actual pages are merged by KSM. To let users accurately estimate their memory demands when unsharing KSM zero-pages, it's necessary to show KSM zero- pages per process. In addition, it help users to know the actual KSM profit because KSM-placed zero pages are also benefit from KSM. since unsharing zero pages placed by KSM accurately is achieved, then tracking empty pages merging and unmerging is not a difficult thing any longer. Since we already have /proc/<pid>/ksm_stat, just add the information of 'ksm_zero_pages' in it. Link: https://lkml.kernel.org/r/20230613030938.185993-1-yang.yang29@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:10 -07:00
Bean Huo	a524fcfe19	fs: convert block_commit_write to return void block_commit_write() always returns 0, this patch changes it to return void. Link: https://lkml.kernel.org/r/20230626055518.842392-3-beanhuo@iokpp.de Signed-off-by: Bean Huo <beanhuo@micron.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Brauner <brauner@kernel.org> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Luís Henriques <ocfs2-devel@oss.oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:07 -07:00
Bean Huo	489b7e72a6	fs/buffer: clean up block_commit_write Originally inode is used to get blksize, after commit 45bce8f3e343 ("fs/buffer.c: make block-size be per-page and protected by the page lock"), __block_commit_write no longer uses this parameter inode. [akpm@linux-foundation.org: remove now-unused local `inode'] Link: https://lkml.kernel.org/r/20230626055518.842392-2-beanhuo@iokpp.de Signed-off-by: Bean Huo <beanhuo@micron.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Brauner <brauner@kernel.org> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Luís Henriques <ocfs2-devel@oss.oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:06 -07:00
Peter Xu	4849807114	mm/gup: retire follow_hugetlb_page() Now __get_user_pages() should be well prepared to handle thp completely, as long as hugetlb gup requests even without the hugetlb's special path. Time to retire follow_hugetlb_page(). Tweak misc comments to reflect reality of follow_hugetlb_page()'s removal. Link: https://lkml.kernel.org/r/20230628215310.73782-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:04 -07:00
Thomas Weißschuh	626e98cb03	mm: make MEMFD_CREATE into a selectable config option The memfd_create() syscall, enabled by CONFIG_MEMFD_CREATE, is useful on its own even when not required by CONFIG_TMPFS or CONFIG_HUGETLBFS. Split it into its own proper bool option that can be enabled by users. Move that option into mm/ where the code itself also lies. Also add "select" statements to CONFIG_TMPFS and CONFIG_HUGETLBFS so they automatically enable CONFIG_MEMFD_CREATE as before. Link: https://lkml.kernel.org/r/20230630-config-memfd-v1-1-9acc3ae38b5a@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Tested-by: Zhangjin Wu <falcon@tinylab.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:01 -07:00
Sidhartha Kumar	87b11f8622	mm: increase usage of folio_next_index() helper Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using the existing helper folio_next_index(). Link: https://lkml.kernel.org/r/20230627174349.491803-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:00 -07:00
Josef Bacik	c962098ca4	btrfs: fix incorrect splitting in btrfs_drop_extent_map_range In production we were seeing a variety of WARN_ON()'s in the extent_map code, specifically in btrfs_drop_extent_map_range() when we have to call add_extent_mapping() for our second split. Consider the following extent map layout PINNED [0 16K) [32K, 48K) and then we call btrfs_drop_extent_map_range for [0, 36K), with skip_pinned == true. The initial loop will have start = 0 end = 36K len = 36K we will find the [0, 16k) extent, but since we are pinned we will skip it, which has this code start = em_end; if (end != (u64)-1) len = start + len - em_end; em_end here is 16K, so now the values are start = 16K len = 16K + 36K - 16K = 36K len should instead be 20K. This is a problem when we find the next extent at [32K, 48K), we need to split this extent to leave [36K, 48k), however the code for the split looks like this split->start = start + len; split->len = em_end - (start + len); In this case we have em_end = 48K split->start = 16K + 36K // this should be 16K + 20K split->len = 48K - (16K + 36K) // this overflows as 16K + 36K is 52K and now we have an invalid extent_map in the tree that potentially overlaps other entries in the extent map. Even in the non-overlapping case we will have split->start set improperly, which will cause problems with any block related calculations. We don't actually need len in this loop, we can simply use end as our end point, and only adjust start up when we find a pinned extent we need to skip. Adjust the logic to do this, which keeps us from inserting an invalid extent map. We only skip_pinned in the relocation case, so this is relatively rare, except in the case where you are running relocation a lot, which can happen with auto relocation on. Fixes: 55ef68990029 ("Btrfs: Fix btrfs_drop_extent_cache for skip pinned case") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-18 14:38:10 +02:00
Georg Ottinger	2ebc736c84	ext2: improve consistency of ext2_fsblk_t datatype usage The ext2 block allocation/deallocation functions and their respective calls use a mixture of unsigned long and ext2_fsblk_t datatypes to index the desired ext2 block. This commit replaces occurrences of unsigned long with ext2_fsblk_t, covering the functions ext2_new_block(), ext2_new_blocks(), ext2_free_blocks(), ext2_free_data() and ext2_free_branches(). This commit is rather conservative, and only replaces unsigned long with ext2_fsblk_t if the variable is used to index a specific ext2 block. Signed-off-by: Georg Ottinger <g.ottinger@gmx.at> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230817195925.10268-1-g.ottinger@gmx.at>	2023-08-18 12:54:54 +02:00
Zizhen Pang	c1950a111d	fs/xfs: Fix typos in comments Delete duplicate word "the" [chandan: Fix mangled patch] Signed-off-by: Zizhen Pang <pangzizhen001@208suo.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>	2023-08-18 13:42:43 +05:30
Darrick J. Wong	2c234a2286	xfs: fix dqiterate thinko For some unknown reason, when I converted the incore dquot objects to store the dquot id in host endian order, I removed the increment here. This causes the scan to stop after retrieving the root dquot, which severely limits the usefulness of the quota scrubber. Fix the lost increment, though it won't fix the problem that the quota iterator code filters out zeroed dquot records. Fixes: c51df7334167e ("xfs: stop using q_core.d_id in the quota code") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>	2023-08-18 13:42:36 +05:30
Chandan Babu R	220c8d57f5	xfs: fixes for the block mapping checker [v26.1] This series amends the file extent map checking code so that nonexistent cow/attr forks get the ENOENT return they're supposed to; and fixes some incorrect logic about the presence of a cow fork vs. reflink iflag. This has been running on the djcloud for years with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZNT+WQAKCRBKO3ySh0YR pl9yAQCuAluDN+6ToSqmGrubdnnjlUhQZPJMbo8orE5/M4F0wgEAl8fI+JBs3psd L3F/qsJEVy02bbVpONUCvDTFU/4wNwg= =tL3U -----END PGP SIGNATURE----- Merge tag 'scrub-bmap-fixes-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: fixes for the block mapping checker This series amends the file extent map checking code so that nonexistent cow/attr forks get the ENOENT return they're supposed to; and fixes some incorrect logic about the presence of a cow fork vs. reflink iflag. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'scrub-bmap-fixes-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: don't check reflink iflag state when checking cow fork xfs: simplify returns in xchk_bmap xfs: rewrite xchk_inode_is_allocated to work properly xfs: hide xfs_inode_is_allocated in scrub common code	2023-08-18 13:39:32 +05:30
Chandan Babu R	939c9de87f	xfs: fixes to the AGFL repair code [v26.1] This series contains a couple of bug fixes to the AGFL repair code that came up during QA. This has been running on the djcloud for years with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZNT+WAAKCRBKO3ySh0YR ptuaAP9QE9S4/7293ZidQHP9BDgPLdIpJyc95/g55BY8Ppb6ywD+NmtDIlYb2BPF WcGfY6jLypDy/F+YCXPaufsjq+HcfgI= =Ee53 -----END PGP SIGNATURE----- Merge tag 'repair-agfl-fixes-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: fixes to the AGFL repair code This series contains a couple of bug fixes to the AGFL repair code that came up during QA. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'repair-agfl-fixes-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: fix agf_fllast when repairing an empty AGFL xfs: clear pagf_agflreset when repairing the AGFL	2023-08-18 13:36:52 +05:30
Chandan Babu R	5221002c05	xfs: force rebuilding of metadata [v26.1] This patchset adds a new IFLAG to the scrub ioctl so that userspace can force a rebuild of an otherwise consistent piece of metadata. This will eventually enable the use of online repair to relocate metadata during a filesystem reorganization (e.g. shrink). For now, it facilitates stress testing of online repair without needing the debugging knobs to be enabled. This has been running on the djcloud for years with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZNT+WAAKCRBKO3ySh0YR pq3EAQC6+TQORGg6EnB6997Ftl07v0z7lLWJI5iS4JrGQPAnqwEA2W4cG8daGTAo hJyIrT4wBM92rSnWK6S8uVU2cCarbAk= =ZNnN -----END PGP SIGNATURE----- Merge tag 'repair-force-rebuild-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: force rebuilding of metadata This patchset adds a new IFLAG to the scrub ioctl so that userspace can force a rebuild of an otherwise consistent piece of metadata. This will eventually enable the use of online repair to relocate metadata during a filesystem reorganization (e.g. shrink). For now, it facilitates stress testing of online repair without needing the debugging knobs to be enabled. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'repair-force-rebuild-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: allow userspace to rebuild metadata structures xfs: don't complain about unfixed metadata when repairs were injected	2023-08-18 13:32:43 +05:30
Chandan Babu R	7857acd877	xfs: miscellaneous repair tweaks [v26.1] Before we start adding online repair functionality, there's a few tweaks that I'd like to make to the common repair code. First is a fix to the integration between repair and the health status code that was interfering with repair re-evaluations. Second is a minor tweak to the sole existing repair functions to make one last check that the user hasn't terminated the calling process before we start writing to the filesystem. This is a pattern that will repeat throughout the rest of the repair functions. This has been running on the djcloud for years with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZNT+WAAKCRBKO3ySh0YR poolAP9yVrzlW+ni3nKOHQYsbU3Id+uNceytkBh1kAqPayggegEA0FKLbzrVb6IP jXN6YL2QUyNbDQX7WDtVeIkFOe+S2Qo= =UeKu -----END PGP SIGNATURE----- Merge tag 'repair-tweaks-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: miscellaneous repair tweaks Before we start adding online repair functionality, there's a few tweaks that I'd like to make to the common repair code. First is a fix to the integration between repair and the health status code that was interfering with repair re-evaluations. Second is a minor tweak to the sole existing repair functions to make one last check that the user hasn't terminated the calling process before we start writing to the filesystem. This is a pattern that will repeat throughout the rest of the repair functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'repair-tweaks-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: allow the user to cancel repairs before we start writing xfs: always rescan allegedly healthy per-ag metadata after repair	2023-08-18 13:29:43 +05:30
Chandan Babu R	df7833234b	xfs: online scrubbing of realtime summary files [v26.1] This patchset implements an online checker for the realtime summary file. The first few changes are some general cleanups -- scrub should get its own references to all inodes, and we also wrap the inode lock functions so that we can standardize unlocking and releasing inodes that are the focus of a scrub. With that out of the way, we move on to constructing a shadow copy of the rtsummary information from the rtbitmap, and compare the new copy against the ondisk copy. This has been running on the djcloud for years with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZNT+WAAKCRBKO3ySh0YR pgDVAPsEWPE0rPh6NtW1uJZkE7pXMQM1hm+5SWx11nD2DQl7cwD/QdIB1Nha2FUo 1M4ib1ERq9Fut8EV919BSRrn9AacHAk= =jK1O -----END PGP SIGNATURE----- Merge tag 'scrub-rtsummary-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: online scrubbing of realtime summary files This patchset implements an online checker for the realtime summary file. The first few changes are some general cleanups -- scrub should get its own references to all inodes, and we also wrap the inode lock functions so that we can standardize unlocking and releasing inodes that are the focus of a scrub. With that out of the way, we move on to constructing a shadow copy of the rtsummary information from the rtbitmap, and compare the new copy against the ondisk copy. This has been running on the djcloud for years with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'scrub-rtsummary-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: implement online scrubbing of rtsummary info xfs: move the realtime summary file scrubber to a separate source file xfs: wrap ilock/iunlock operations on sc->ip xfs: get our own reference to inodes that we want to scrub	2023-08-18 13:25:53 +05:30
Chandan Babu R	889b09b3d0	xfs: add usage counters for scrub [v26.1] This series introduces simple usage and performance counters for the online fsck subsystem. The goal here is to enable developers and sysadmins to look at summary counts of how many objects were checked and repaired; what the outcomes were; and how much time the kernel has spent on these operations. The counter file is exposed in debugfs because that's easier than cramming it into the device model, and debugfs doesn't have rules against complex file contents, unlike sysfs. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZNT+WAAKCRBKO3ySh0YR pn2QAQDNpzFeWpqOy9hgNIEVzhu8fVwWjhW0Qh+so3Ti7Wc6OgEA2rcieO9YX/7j luOoX6C2w7n7tW6n5uw7YxSxncaPvQY= =D6oP -----END PGP SIGNATURE----- Merge tag 'scrub-usage-stats-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: add usage counters for scrub This series introduces simple usage and performance counters for the online fsck subsystem. The goal here is to enable developers and sysadmins to look at summary counts of how many objects were checked and repaired; what the outcomes were; and how much time the kernel has spent on these operations. The counter file is exposed in debugfs because that's easier than cramming it into the device model, and debugfs doesn't have rules against complex file contents, unlike sysfs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'scrub-usage-stats-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: track usage statistics of online fsck xfs: create scaffolding for creating debugfs entries	2023-08-18 12:52:16 +05:30
Chandan Babu R	d668fc1fda	xfs: stage repair information in pageable memory [v26.1] In general, online repair of an indexed record set walks the filesystem looking for records. These records are sorted and bulk-loaded into a new btree. To make this happen without pinning gigabytes of metadata in memory, first create an abstraction ('xfile') of memfd files so that kernel code can access paged memory, and then an array abstraction ('xfarray') based on xfiles so that online repair can create an array of new records without pinning memory. These two data storage abstractions are critical for repair of space metadata -- the memory used is pageable, which helps us avoid pinning kernel memory and driving OOM problems; and they are byte-accessible enough that we can use them like (very slow and programmatic) memory buffers. Later patchsets will build on this functionality to provide blob storage and btrees. This has been running on the djcloud for years with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZNT+WAAKCRBKO3ySh0YR pouiAQDRoD3jYIJFCSnvEtr8FHNxBnBKFmNxP5O3xXJDHSXp6wEAwK8c8OyO2PCE JIYa50W/56tdgVp8AECfaQ5dilCkAwA= =VgYl -----END PGP SIGNATURE----- Merge tag 'big-array-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: stage repair information in pageable memory In general, online repair of an indexed record set walks the filesystem looking for records. These records are sorted and bulk-loaded into a new btree. To make this happen without pinning gigabytes of metadata in memory, first create an abstraction ('xfile') of memfd files so that kernel code can access paged memory, and then an array abstraction ('xfarray') based on xfiles so that online repair can create an array of new records without pinning memory. These two data storage abstractions are critical for repair of space metadata -- the memory used is pageable, which helps us avoid pinning kernel memory and driving OOM problems; and they are byte-accessible enough that we can use them like (very slow and programmatic) memory buffers. Later patchsets will build on this functionality to provide blob storage and btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'big-array-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: improve xfarray quicksort pivot xfs: cache pages used for xfarray quicksort convergence xfs: speed up xfarray sort by sorting xfile page contents directly xfs: teach xfile to pass back direct-map pages to caller xfs: convert xfarray insertion sort to heapsort using scratchpad memory xfs: enable sorting of xfile-backed arrays xfs: create a big array data structure	2023-08-18 12:49:06 +05:30
Chandan Babu R	81fbc5f930	xfs: fix online repair block reaping [v26.1] These patches fix a few problems that I noticed in the code that deals with old btree blocks after a successful repair. First, I observed that it is possible for repair to incorrectly invalidate and delete old btree blocks if they were crosslinked. The solution here is to consult the reverse mappings for each block in the extent -- singly owned blocks are invalidated and freed, whereas for crosslinked blocks, we merely drop the incorrect reverse mapping. A largeish change in this patchset is moving the reaping code to a separate file, because the code are mostly interrelated static functions. For now this also drops the ability to reap file blocks, which will return when we add the bmbt repair functions. Second, we convert the reap function to use EFIs so that we can commit to freeing as many blocks in as few transactions as we dare. We would like to free as many old blocks as we can in the same transaction that commits the new structure to the ondisk filesystem to minimize the number of blocks that leak if the system crashes before the repair fully completes. The third change made in this series is to avoid tripping buffer cache assertions if we're merely scanning the buffer cache for buffers to invalidate, and find a non-stale buffer of the wrong length. This is primarily cosmetic, but makes my life easier. The fourth change restructures the reaping code to try to process as many blocks in one go as possible, to reduce logging traffic. The last change switches the reaping mechanism to use per-AG bitmaps defined in a previous patchset. This should reduce type confusion when reading the source code. This has been running on the djcloud for years with no problems. Enjoy! Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZNT+WAAKCRBKO3ySh0YR pt1nAQDn4pjj8ugmH3mcbwE6pqOde52xvApWD6HM2gz7CJa17gEAyGNQu2tw8I15 MJ15sGUq6LaNWBF5kZjLt3qpWykoLgk= =sVzo -----END PGP SIGNATURE----- Merge tag 'repair-reap-fixes-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-mergeA xfs: fix online repair block reaping These patches fix a few problems that I noticed in the code that deals with old btree blocks after a successful repair. First, I observed that it is possible for repair to incorrectly invalidate and delete old btree blocks if they were crosslinked. The solution here is to consult the reverse mappings for each block in the extent -- singly owned blocks are invalidated and freed, whereas for crosslinked blocks, we merely drop the incorrect reverse mapping. A largeish change in this patchset is moving the reaping code to a separate file, because the code are mostly interrelated static functions. For now this also drops the ability to reap file blocks, which will return when we add the bmbt repair functions. Second, we convert the reap function to use EFIs so that we can commit to freeing as many blocks in as few transactions as we dare. We would like to free as many old blocks as we can in the same transaction that commits the new structure to the ondisk filesystem to minimize the number of blocks that leak if the system crashes before the repair fully completes. The third change made in this series is to avoid tripping buffer cache assertions if we're merely scanning the buffer cache for buffers to invalidate, and find a non-stale buffer of the wrong length. This is primarily cosmetic, but makes my life easier. The fourth change restructures the reaping code to try to process as many blocks in one go as possible, to reduce logging traffic. The last change switches the reaping mechanism to use per-AG bitmaps defined in a previous patchset. This should reduce type confusion when reading the source code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> * tag 'repair-reap-fixes-6.6_2023-08-10' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: use per-AG bitmaps to reap unused AG metadata blocks during repair xfs: reap large AG metadata extents when possible xfs: allow scanning ranges of the buffer cache for live buffers xfs: rearrange xrep_reap_block to make future code flow easier xfs: use deferred frees to reap old btree blocks xfs: only allow reaping of per-AG blocks in xrep_reap_extents xfs: only invalidate blocks if we're going to free them xfs: move the post-repair block reaping code to a separate file xfs: cull repair code that will never get used	2023-08-18 12:45:06 +05:30
Yuxiao Zhang	104fd0b5e9	pstore: Support record sizes larger than kmalloc() limit Currently pstore record buffers are allocated using kmalloc() which has a maximum size based on page size. If a large "pmsg-size" module parameter is specified, pmsg will fail to copy the contents since memdup_user() is limited to kmalloc() allocation sizes. Since we don't need physically contiguous memory for any of the pstore record buffers, use kvzalloc() to avoid such limitations in the core of pstore and in the ram backend, and explicitly read from userspace using vmemdup_user(). This also means that any other backends that want to (or do already) support larger record sizes will Just Work now. Signed-off-by: Yuxiao Zhang <yuxiaozhang@google.com> Link: https://lore.kernel.org/r/20230627202540.881909-2-yuxiaozhang@google.com Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>	2023-08-17 15:18:24 -07:00
Trond Myklebust	be2fd1560e	NFS: Fix a use after free in nfs_direct_join_group() Be more careful when tearing down the subrequests of an O_DIRECT write as part of a retransmission. Reported-by: Chris Mason <clm@fb.com> Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2023-08-17 09:30:38 -04:00
xiaoshoukui	29eefa6d0d	btrfs: fix BUG_ON condition in btrfs_cancel_balance Pausing and canceling balance can race to interrupt balance lead to BUG_ON panic in btrfs_cancel_balance. The BUG_ON condition in btrfs_cancel_balance does not take this race scenario into account. However, the race condition has no other side effects. We can fix that. Reproducing it with panic trace like this: kernel BUG at fs/btrfs/volumes.c:4618! RIP: 0010:btrfs_cancel_balance+0x5cf/0x6a0 Call Trace: <TASK> ? do_nanosleep+0x60/0x120 ? hrtimer_nanosleep+0xb7/0x1a0 ? sched_core_clone_cookie+0x70/0x70 btrfs_ioctl_balance_ctl+0x55/0x70 btrfs_ioctl+0xa46/0xd20 __x64_sys_ioctl+0x7d/0xa0 do_syscall_64+0x38/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Race scenario as follows: > mutex_unlock(&fs_info->balance_mutex); > -------------------- > .......issue pause and cancel req in another thread > -------------------- > ret = __btrfs_balance(fs_info); > > mutex_lock(&fs_info->balance_mutex); > if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req)) { > btrfs_info(fs_info, "balance: paused"); > btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED); > } CC: stable@vger.kernel.org # 4.19+ Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-17 15:27:45 +02:00
Chris Mason	09c3717c3a	btrfs: only subtract from len_to_oe_boundary when it is tracking an extent bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone as we submit bios for writes. Every time we add a page to the bio, we decrement those bytes from len_to_oe_boundary, and then we submit the bio if we happen to hit zero. Most of the time, len_to_oe_boundary gets set to U32_MAX. submit_extent_page() adds pages into our bio, and the size of the bio ends up limited by: - Are we contiguous on disk? - Does bio_add_page() allow us to stuff more in? - is len_to_oe_boundary > 0? The len_to_oe_boundary math starts with U32_MAX, which isn't page or sector aligned, and subtracts from it until it hits zero. In the non-zoned case, the last IO we submit before we hit zero is going to be unaligned, triggering BUGs. This is hard to trigger because bio_add_page() isn't going to make a bio of U32_MAX size unless you give it a perfect set of pages and fully contiguous extents on disk. We can hit it pretty reliably while making large swapfiles during provisioning because the machine is freshly booted, mostly idle, and the disk is freshly formatted. It's also possible to trigger with reads when read_ahead_kb is set to 4GB. The code has been clean up and shifted around a few times, but this flaw has been lurking since the counter was added. I think the commit 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") ended up exposing the bug. The fix used here is to skip doing math on len_to_oe_boundary unless we've changed it from the default U32_MAX value. bio_add_page() is the real limit we want, and there's no reason to do extra math when block layer is doing it for us. Sample reproducer, note you'll need to change the path to the bdi and device: SUBVOL=/btrfs/swapvol SWAPFILE=$SUBVOL/swapfile SZMB=8192 mkfs.btrfs -f /dev/vdb mount /dev/vdb /btrfs btrfs subvol create $SUBVOL chattr +C $SUBVOL dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB sync echo 4 > /proc/sys/vm/drop_caches echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb while true; do echo 1 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/drop_caches dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock done Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") CC: stable@vger.kernel.org # 6.4+ Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-17 15:27:35 +02:00
Anand Jain	b471965fdb	btrfs: fix replace/scrub failure with metadata_uuid Fstests with POST_MKFS_CMD="btrfstune -m" (as in the mailing list) reported a few of the test cases failing. The failure scenario can be summarized and simplified as follows: $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sdb1 /dev/sdb2 :0 $ btrfstune -m /dev/sdb1 :0 $ wipefs -a /dev/sdb1 :0 $ mount -o degraded /dev/sdb2 /btrfs :0 $ btrfs replace start -B -f -r 1 /dev/sdb1 /btrfs :1 STDERR: ERROR: ioctl(DEV_REPLACE_START) failed on "/btrfs": Input/output error [11290.583502] BTRFS warning (device sdb2): tree block 22036480 mirror 2 has bad fsid, has 99835c32-49f0-4668-9e66-dc277a96b4a6 want da40350c-33ac-4872-92a8-4948ed8c04d0 [11290.586580] BTRFS error (device sdb2): unable to fix up (regular) error at logical 22020096 on dev /dev/sdb8 physical 1048576 As above, the replace is failing because we are verifying the header with fs_devices::fsid instead of fs_devices::metadata_uuid, despite the metadata_uuid actually being present. To fix this, use fs_devices::metadata_uuid. We copy fsid into fs_devices::metadata_uuid if there is no metadata_uuid, so its fine. Fixes: a3ddbaebc7c9 ("btrfs: scrub: introduce a helper to verify one metadata block") CC: stable@vger.kernel.org # 6.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-17 15:26:39 +02:00
Ye Bin	9bc6fc3304	ext2: dump current reservation window info There's report BUG in 'ext2_try_to_allocate_with_rsv()', although there's now dump of all reservation windows information. But there's unknown which window is being processed.So this is not helpful for locating the issue. To better analyze the problem, dump the information about reservation window that is being processed. And just bail with error instead of BUG here. Signed-off-by: Ye Bin <yebin10@huawei.com> Message-Id: <20230815112612.221145-5-yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>	2023-08-16 17:49:47 +02:00
Ye Bin	83f99de1b7	ext2: fix race between setxattr and write back There's an issue when allocating xattrs as follows: Block Allocation Reservation Windows Map (ext2_try_to_allocate_with_rsv): reservation window 0x000000006f105382 start: 0, end: 0 reservation window 0x000000008fd1a555 start: 1044, end: 1059 Window map complete. kernel BUG at fs/ext2/balloc.c:1158! invalid opcode: 0000 [#1] PREEMPT SMP KASAN RIP: 0010:ext2_try_to_allocate_with_rsv.isra.0+0x15c4/0x1800 Call Trace: <TASK> ext2_new_blocks+0x935/0x1690 ext2_new_block+0x73/0xa0 ext2_xattr_set2+0x74f/0x1730 ext2_xattr_set+0x12b6/0x2260 ext2_xattr_user_set+0x9c/0x110 __vfs_setxattr+0x139/0x1d0 __vfs_setxattr_noperm+0xfc/0x370 __vfs_setxattr_locked+0x205/0x2c0 vfs_setxattr+0x19d/0x3b0 do_setxattr+0xff/0x220 setxattr+0x123/0x150 path_setxattr+0x193/0x1e0 __x64_sys_setxattr+0xc8/0x170 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Above issue may happens as follows: setxattr write back ext2_xattr_set ext2_xattr_set2 ext2_new_block ext2_new_blocks ext2_try_to_allocate_with_rsv alloc_new_reservation --> group=0 [0, 1023] rsv [1016, 1023] do_writepages mpage_writepages write_cache_pages __mpage_writepage ext2_get_block ext2_get_blocks ext2_alloc_branch ext2_new_blocks ext2_try_to_allocate_with_rsv alloc_new_reservation -->group=1 [1024, 2047] rsv [1044, 1059] if ((my_rsv->rsv_start > group_last_block) \|\| (my_rsv->rsv_end < group_first_block) rsv_window_dump BUG(); Now ext2 mkwrite doesn't allocate new blocks so for these cases we may be allocating blocks during writeback. However, there is no protection between ext2_xattr_set() and do_writepages() so these two functions can conflict on handling the reservation window. To solve about issue don't use the reservation window when allocating block for xattr. Signed-off-by: Ye Bin <yebin10@huawei.com> Message-Id: <20230815112612.221145-4-yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>	2023-08-16 17:47:37 +02:00
Ye Bin	b450159d09	ext2: introduce new flags argument for ext2_new_blocks() This patch introduces a new flags argument for ext2_new_blocks() and also a new EXT2_ALLOC_NORESERVE flag. Signed-off-by: Ye Bin <yebin10@huawei.com> Message-Id: <20230815112612.221145-3-yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>	2023-08-16 17:42:42 +02:00
Ye Bin	2445a8a192	ext2: remove ext2_new_block() Now, only xattr allocate block use ext2_new_block(), so just opencode it in the xattr code. Signed-off-by: Ye Bin <yebin10@huawei.com> Message-Id: <20230815112612.221145-2-yebin10@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>	2023-08-16 17:35:08 +02:00
Georg Ottinger	e880763484	ext2: fix datatype of block number in ext2_xattr_set2() I run a small server that uses external hard drives for backups. The backup software I use uses ext2 filesystems with 4KiB block size and the server is running SELinux and therefore relies on xattr. I recently upgraded the hard drives from 4TB to 12TB models. I noticed that after transferring some TBs I got a filesystem error "Freeing blocks not in datazone - block = 18446744071529317386, count = 1" and the backup process stopped. Trying to fix the fs with e2fsck resulted in a completely corrupted fs. The error probably came from ext2_free_blocks(), and because of the large number 18e19 this problem immediately looked like some kind of integer overflow. Whereas the 4TB fs was about 1e9 blocks, the new 12TB is about 3e9 blocks. So, searching the ext2 code, I came across the line in fs/ext2/xattr.c:745 where ext2_new_block() is called and the resulting block number is stored in the variable block as an int datatype. If a block with a block number greater than INT32_MAX is returned, this variable overflows and the call to sb_getblk() at line fs/ext2/xattr.c:750 fails, then the call to ext2_free_blocks() produces the error. Signed-off-by: Georg Ottinger <g.ottinger@gmx.at> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230815100340.22121-1-g.ottinger@gmx.at>	2023-08-16 16:09:45 +02:00
Miklos Szeredi	9dc10a54ab	fuse: add ATTR_TIMEOUT macro Next patch will introduce yet another type attribute reply. Add a macro that can handle attribute timeouts for all of the structs. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2023-08-16 12:39:41 +02:00
Miklos Szeredi	8d8f9c4b8d	fuse: handle empty request_mask in statx If no attribute is requested, then don't send request to userspace. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2023-08-16 12:39:24 +02:00
Hao Xu	b5a2a3a0b7	fuse: write back dirty pages before direct write in direct_io_relax mode In direct_io_relax mode, there can be shared mmaped files and thus dirty pages in its page cache. Therefore those dirty pages should be written back to backend before direct io to avoid data loss. Signed-off-by: Hao Xu <howeyxu@tencent.com> Reviewed-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2023-08-16 12:39:24 +02:00
Hao Xu	e78662e818	fuse: add a new fuse init flag to relax restrictions in no cache mode FOPEN_DIRECT_IO is usually set by fuse daemon to indicate need of strong coherency, e.g. network filesystems. Thus shared mmap is disabled since it leverages page cache and may write to it, which may cause inconsistence. But FOPEN_DIRECT_IO can be used not for coherency but to reduce memory footprint as well, e.g. reduce guest memory usage with virtiofs. Therefore, add a new fuse init flag FUSE_DIRECT_IO_RELAX to relax restrictions in that mode, currently, it allows shared mmap. One thing to note is to make sure it doesn't break coherency in your use case. Signed-off-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2023-08-16 12:39:16 +02:00
Hao Xu	80e4f25262	fuse: invalidate page cache pages before direct write In FOPEN_DIRECT_IO, page cache may still be there for a file since private mmap is allowed. Direct write should respect that and invalidate the corresponding pages so that page cache readers don't get stale data. Signed-off-by: Hao Xu <howeyxu@tencent.com> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2023-08-16 12:20:37 +02:00
ruanmeisi	b8bd342d50	fuse: nlookup missing decrement in fuse_direntplus_link During our debugging of glusterfs, we found an Assertion failed error: inode_lookup >= nlookup, which was caused by the nlookup value in the kernel being greater than that in the FUSE file system. The issue was introduced by fuse_direntplus_link, where in the function, fuse_iget increments nlookup, and if d_splice_alias returns failure, fuse_direntplus_link returns failure without decrementing nlookup https://github.com/gluster/glusterfs/pull/4081 Signed-off-by: ruanmeisi <ruan.meisi@zte.com.cn> Fixes: 0b05b18381ee ("fuse: implement NFS-like readdirplus support") Cc: <stable@vger.kernel.org> # v3.9 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2023-08-16 09:40:48 +02:00
Scott Mayhew	270d73e650	smb: client: fix null auth Commit abdb1742a312 removed code that clears ctx->username when sec=none, so attempting to mount with '-o sec=none' now fails with -EACCES. Fix it by adding that logic to the parsing of the 'sec' option, as well as checking if the mount is using null auth before setting the username when parsing the 'user' option. Fixes: abdb1742a312 ("cifs: get rid of mount options string parsing") Cc: stable@vger.kernel.org Signed-off-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-16 00:26:07 -05:00
Joel Granados	53f3811dfd	sysctl: Use ctl_table_size as stopping criteria for list macro This is a preparation commit to make it easy to remove the sentinel elements (empty end markers) from the ctl_table arrays. It both allows the systematic removal of the sentinels and adds the ctl_table_size variable to the stopping criteria of the list_for_each_table_entry macro that traverses all ctl_table arrays. Once all the sentinels are removed by subsequent commits, ctl_table_size will become the only stopping criteria in the macro. We don't actually remove any elements in this commit, but it sets things up to for the removal process to take place. By adding header->ctl_table_size as an additional stopping criteria for the list_for_each_table_entry macro, it will execute until it finds an "empty" ->procname or until the size runs out. Therefore if a ctl_table array with a sentinel is passed its size will be too big (by one element) but it will stop on the sentinel. On the other hand, if the ctl_table array without a sentinel is passed its size will be just write and there will be no need for a sentinel. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Jani Nikula <jani.nikula@linux.intel.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:18 -07:00
Joel Granados	3bc269cfd3	sysctl: Add size arg to __register_sysctl_init This commit adds table_size to __register_sysctl_init in preparation for the removal of the sentinel elements in the ctl_table arrays (last empty markers). And though we do not remove any sentinels in this commit, we set things up by calculating the ctl_table array size with ARRAY_SIZE. We add a table_size argument to __register_sysctl_init and modify the register_sysctl_init macro to calculate the array size with ARRAY_SIZE. The original callers do not need to be updated as they will go through the new macro. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Joel Granados	9edbfe92a0	sysctl: Add size to register_sysctl This commit adds table_size to register_sysctl in preparation for the removal of the sentinel elements in the ctl_table arrays (last empty markers). And though we do not remove any sentinels in this commit, we set things up by either passing the table_size explicitly or using ARRAY_SIZE on the ctl_table arrays. We replace the register_syctl function with a macro that will add the ARRAY_SIZE to the new register_sysctl_sz function. In this way the callers that are already using an array of ctl_table structs do not change. For the callers that pass a ctl_table array pointer, we pass the table_size to register_sysctl_sz instead of the macro. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Joel Granados	bff97cf11b	sysctl: Add a size arg to __register_sysctl_table We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we do make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size does however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Joel Granados	b1f01e2bae	sysctl: Add size argument to init_header In this commit, we add a table_size argument to the init_header function in order to initialize the ctl_table_size variable in ctl_table_header. Even though the size is not yet used, it is now initialized within the sysctl subsys. We need this commit for when we start adding the table_size arguments to the sysctl functions (e.g. register_sysctl, __register_sysctl_table and __register_sysctl_init). Note that in __register_sysctl_table we temporarily use a calculated size until we add the size argument to that function in subsequent commits. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Joel Granados	18d4b42e9d	sysctl: Use ctl_table_header in list_for_each_table_entry We replace the ctl_table with the ctl_table_header pointer in list_for_each_table_entry which is the macro responsible for traversing the ctl_table arrays. This is a preparation commit that will make it easier to add the ctl_table array size (that will be added to ctl_table_header in subsequent commits) to the already existing loop logic based on empty ctl_table elements (so called sentinels). Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Joel Granados	cc9f7ee01e	sysctl: Prefer ctl_table_header in proc_sysctl This is a preparation commit that replaces ctl_table with ctl_table_header as the pointer that is passed around in proc_sysctl.c. This will become necessary in subsequent commits when the size of the ctl_table array can no longer be calculated by searching for an empty sentinel (last empty ctl_table element) but will be carried along inside the ctl_table_header struct. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Linus Torvalds	2d7b8c6b90	three smb client fixes, all for stable -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmTaMT8ACgkQiiy9cAdy T1F7Ygv/ed2tYvcdwvrakFOvLWEvgTpAh/tb7dh+l58mG1WV7ZDRDWamSAyzy8OT s8IeBYmIaOdOv5opYakuQrY0lhTcUSRCoSsvH6k2vZAMLG0SX9nXVyHv0JgPDTuz gNvxDRrOusOhNfVlGya0dhH90hDyvW1wzU66HlCMbzrfmeQKKG6A6shOztGfw1y6 cXVKr4k315dcH9sAHzMDcg5bv3ucyKWztdAaF68dK71oEUwceMTmKpKc7OYPxThn DOY4blVefIUAPTZYh7RD1Ota1VYfQafZFu01ttqh3XvG9PtOlDTuEbRlANpYv2d/ Awn6ZIdx2tV8MERJ7R0p/vKdVj5m5sDaTls0q4PWc/OMFrOFGfvuMhoZ/uAFPhFc e9EKjg7u0B7q3F8aT4E34Hqwl6UNhyDvRqn5BhztcDgMdIke7OVuvHQSiLGXfMQT XJN0bTynTB6RnHDsFxG8i7YBlsHDk6Ic/xOAcSG42U/5hNBTrovY9HyhYdMzZ/TD 9sETDezn =Woor -----END PGP SIGNATURE----- Merge tag '6.5-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: "Three smb client fixes, all for stable: - fix for oops in unmount race with lease break of deferred close - debugging improvement for reconnect - fix for fscache deadlock (folio_wait_bit_common hang)" * tag '6.5-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: display network namespace in debug information cifs: Release folio lock on fscache read hit. cifs: fix potential oops in cifs_oplock_break	2023-08-15 20:00:40 +00:00

... 15 16 17 18 19 ...

84057 Commits