Commit Graph

79939 Commits

Author SHA1 Message Date
Christoph Hellwig
7d28631786 mpage: stop using bdev_{read,write}_page
Patch series "remove ->rw_page".

This series removes the ->rw_page block_device_operation, which is an old
and clumsy attempt at a simple read/write fast path for the block layer. 
It isn't actually used by the fastest block layer operations that we
support (polled I/O through io_uring), but only used by the mpage buffered
I/O helpers which are some of the slowest I/O we have and do not make any
difference there at all, and zram which is a block device abused to
duplicate the zram functionality.

Given that zram is heavily used we need to make sure there is a good
replacement for synchronous I/O, so this series adds a new flag for
drivers that complete I/O synchronously and uses that flag to use on-stack
bios and synchronous submission for them in the swap code.


This patch (of 7):

These are micro-optimizations for synchronous I/O, which do not matter
compared to all the other inefficiencies in the legacy buffer_head based
mpage code.

Link: https://lkml.kernel.org/r/20230125133436.447864-1-hch@lst.de
Link: https://lkml.kernel.org/r/20230125133436.447864-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:32 -08:00
Matthew Wilcox (Oracle)
75376c6fb9 mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio()
Only one caller doesn't have a folio, so move the page_folio() call to
that one caller from mem_cgroup_css_from_folio().

Link: https://lkml.kernel.org/r/20230116192507.2146150-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:19 -08:00
Matthew Wilcox (Oracle)
9cfb816b1c mm/fs: convert inode_attach_wb() to take a folio
Patch series "Writeback folio conversions".

Remove more calls to compound_head() by passing folios around instead of
pages.


This patch (of 2):

The only caller of inode_attach_wb() which doesn't pass NULL already has a
folio, so convert the whole call-chain to take folios.

Link: https://lkml.kernel.org/r/20230116192507.2146150-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230116192507.2146150-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:19 -08:00
Vishal Moola (Oracle)
243c5ea4f7 nilfs2: convert nilfs_clear_dirty_pages() to use filemap_get_folios_tag()
Convert function to use folios throughout.  This is in preparation for the
removal of find_get_pages_range_tag().  This change removes 2 calls to
compound_head().

Link: https://lkml.kernel.org/r/20230104211448.4804-23-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:18 -08:00
Vishal Moola (Oracle)
d4a16d3133 nilfs2: convert nilfs_copy_dirty_pages() to use filemap_get_folios_tag()
Convert function to use folios throughout.  This is in preparation for the
removal of find_get_pages_range_tag().  This change removes 8 calls to
compound_head().

Link: https://lkml.kernel.org/r/20230104211448.4804-22-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:17 -08:00
Vishal Moola (Oracle)
41f3f3b537 nilfs2: convert nilfs_btree_lookup_dirty_buffers() to use filemap_get_folios_tag()
Convert function to use folios throughout.  This is in preparation for the
removal of find_get_pages_range_tag().  This change removes 1 call to
compound_head().

Link: https://lkml.kernel.org/r/20230104211448.4804-21-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:17 -08:00
Vishal Moola (Oracle)
a245865831 nilfs2: convert nilfs_lookup_dirty_node_buffers() to use filemap_get_folios_tag()
Convert function to use folios throughout.  This is in preparation for the
removal of find_get_pages_range_tag().  This change removes 1 call to
compound_head().

Link: https://lkml.kernel.org/r/20230104211448.4804-20-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:17 -08:00
Vishal Moola (Oracle)
5ee4b25cb7 nilfs2: convert nilfs_lookup_dirty_data_buffers() to use filemap_get_folios_tag()
Convert function to use folios throughout. This is in preparation for
the removal of find_get_pages_range_tag(). This change removes 4 calls
to compound_head().

Link: https://lkml.kernel.org/r/20230104211448.4804-19-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:17 -08:00
Vishal Moola (Oracle)
87ed37e66d gfs2: convert gfs2_write_cache_jdata() to use filemap_get_folios_tag()
Convert function to use folios throughout.  This is in preparation for the
removal of find_get_pgaes_range_tag().  This change removes 8 calls to
compound_head().

Also had to modify and rename gfs2_write_jdata_pagevec() to take in and
utilize folio_batch rather than pagevec and use folios rather than pages. 
gfs2_write_jdata_batch() now supports large folios.

Link: https://lkml.kernel.org/r/20230104211448.4804-18-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:17 -08:00
Vishal Moola (Oracle)
580e7a4926 f2fs: convert f2fs_sync_meta_pages() to use filemap_get_folios_tag()
Convert function to use folios throughout.  This is in preparation for the
removal of find_get_pages_range_tag().  This change removes 5 calls to
compound_head().

Initially the function was checking if the previous page index is truly
the previous page i.e.  1 index behind the current page.  To convert to
folios and maintain this check we need to make the check folio->index !=
prev + folio_nr_pages(previous folio) since we don't know how many pages
are in a folio.

At index i == 0 the check is guaranteed to succeed, so to workaround
indexing bounds we can simply ignore the check for that specific index. 
This makes the initial assignment of prev trivial, so I removed that as
well.

Also modify a comment in commit_checkpoint for consistency.

Link: https://lkml.kernel.org/r/20230104211448.4804-17-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:16 -08:00
Vishal Moola (Oracle)
4f4a4f0feb f2fs: convert last_fsync_dnode() to use filemap_get_folios_tag()
Convert to use a folio_batch instead of pagevec.  This is in preparation
for the removal of find_get_pages_range_tag().

Link: https://lkml.kernel.org/r/20230104211448.4804-16-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:16 -08:00
Vishal Moola (Oracle)
1cd98ee747 f2fs: convert f2fs_write_cache_pages() to use filemap_get_folios_tag()
Convert the function to use a folio_batch instead of pagevec.  This is in
preparation for the removal of find_get_pages_range_tag().

Also modified f2fs_all_cluster_page_ready to take in a folio_batch instead
of pagevec.  This does NOT support large folios.  The function currently
only utilizes folios of size 1 so this shouldn't cause any issues right
now.

This version of the patch limits the number of pages fetched to
F2FS_ONSTACK_PAGES.  If that ever happens, update the start index here
since filemap_get_folios_tag() updates the index to be after the last
found folio, not necessarily the last used page.

Link: https://lkml.kernel.org/r/20230104211448.4804-15-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:16 -08:00
Vishal Moola (Oracle)
7525486aff f2fs: convert f2fs_sync_node_pages() to use filemap_get_folios_tag()
Convert function to use a folio_batch instead of pagevec.  This is in
preparation for the removal of find_get_pages_range_tag().

Link: https://lkml.kernel.org/r/20230104211448.4804-14-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:16 -08:00
Vishal Moola (Oracle)
a40a4ad118 f2fs: convert f2fs_flush_inline_data() to use filemap_get_folios_tag()
Convert function to use a folio_batch instead of pagevec. This is in
preparation for the removal of find_get_pages_tag().

Link: https://lkml.kernel.org/r/20230104211448.4804-13-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:15 -08:00
Vishal Moola (Oracle)
e6e46e1eb7 f2fs: convert f2fs_fsync_node_pages() to use filemap_get_folios_tag()
Convert function to use a folio_batch instead of pagevec.  This is in
preparation for the removal of find_get_pages_range_tag().

Link: https://lkml.kernel.org/r/20230104211448.4804-12-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:15 -08:00
Vishal Moola (Oracle)
50ead25374 ext4: convert mpage_prepare_extent_to_map() to use filemap_get_folios_tag()
Convert the function to use folios throughout.  This is in preparation for
the removal of find_get_pages_range_tag().  Now supports large folios. 
This change removes 11 calls to compound_head().

Link: https://lkml.kernel.org/r/20230104211448.4804-11-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:15 -08:00
Vishal Moola (Oracle)
4cda80f3a7 cifs: convert wdata_alloc_and_fillpages() to use filemap_get_folios_tag()
This is in preparation for the removal of find_get_pages_range_tag().  Now
also supports the use of large folios.

Since tofind might be larger than the max number of folios in a
folio_batch (15), we loop through filling in wdata->pages pulling more
batches until we either reach tofind pages or run out of folios.

This function may not return all pages in the last found folio before
tofind pages are reached.

Link: https://lkml.kernel.org/r/20230104211448.4804-10-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: Tom Talpey <tom@talpey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:15 -08:00
Vishal Moola (Oracle)
590a2b5f0a ceph: convert ceph_writepages_start() to use filemap_get_folios_tag()
Convert function to use a folio_batch instead of pagevec.  This is in
preparation for the removal of find_get_pages_range_tag().

Also some minor renaming for consistency.

Link: https://lkml.kernel.org/r/20230104211448.4804-9-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:15 -08:00
Vishal Moola (Oracle)
9f50fd2e92 btrfs: convert extent_write_cache_pages() to use filemap_get_folios_tag()
Convert function to use folios throughout.  This is in preparation for the
removal of find_get_pages_range_tag().  Now also supports large folios.

Link: https://lkml.kernel.org/r/20230104211448.4804-8-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:14 -08:00
Vishal Moola (Oracle)
51c5cd3baf btrfs: convert btree_write_cache_pages() to use filemap_get_folio_tag()
Convert function to use folios throughout.  This is in preparation for the
removal of find_get_pages_range_tag().

Link: https://lkml.kernel.org/r/20230104211448.4804-7-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:14 -08:00
Vishal Moola (Oracle)
acc8d8588c afs: convert afs_writepages_region() to use filemap_get_folios_tag()
Convert to use folios throughout.  This function is in preparation to
remove find_get_pages_range_tag().

Also modify this function to write the whole batch one at a time, rather
than calling for a new set every single write.

Link: https://lkml.kernel.org/r/20230104211448.4804-6-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:14 -08:00
Alistair Popple
7d4a8be0c4 mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export
mmu_notifier_range_update_to_read_only() was originally introduced in
commit c6d23413f8 ("mm/mmu_notifier:
mmu_notifier_range_update_to_read_only() helper") as an optimisation for
device drivers that know a range has only been mapped read-only.  However
there are no users of this feature so remove it.  As it is the only user
of the struct mmu_notifier_range.vma field remove that also.

Link: https://lkml.kernel.org/r/20230110025722.600912-1-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:32:54 -08:00
Andrew Morton
5ab0fc155d Sync mm-stable with mm-hotfixes-stable to pick up dependent patches
Merge branch 'mm-hotfixes-stable' into mm-stable
2023-01-31 17:25:17 -08:00
Phillip Lougher
f65c4bbbd6 Squashfs: fix handling and sanity checking of xattr_ids count
A Sysbot [1] corrupted filesystem exposes two flaws in the handling and
sanity checking of the xattr_ids count in the filesystem.  Both of these
flaws cause computation overflow due to incorrect typing.

In the corrupted filesystem the xattr_ids value is 4294967071, which
stored in a signed variable becomes the negative number -225.

Flaw 1 (64-bit systems only):

The signed integer xattr_ids variable causes sign extension.

This causes variable overflow in the SQUASHFS_XATTR_*(A) macros.  The
variable is first multiplied by sizeof(struct squashfs_xattr_id) where the
type of the sizeof operator is "unsigned long".

On a 64-bit system this is 64-bits in size, and causes the negative number
to be sign extended and widened to 64-bits and then become unsigned.  This
produces the very large number 18446744073709548016 or 2^64 - 3600.  This
number when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and
divided by SQUASHFS_METADATA_SIZE overflows and produces a length of 0
(stored in len).

Flaw 2 (32-bit systems only):

On a 32-bit system the integer variable is not widened by the unsigned
long type of the sizeof operator (32-bits), and the signedness of the
variable has no effect due it always being treated as unsigned.

The above corrupted xattr_ids value of 4294967071, when multiplied
overflows and produces the number 4294963696 or 2^32 - 3400.  This number
when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and divided by
SQUASHFS_METADATA_SIZE overflows again and produces a length of 0.

The effect of the 0 length computation:

In conjunction with the corrupted xattr_ids field, the filesystem also has
a corrupted xattr_table_start value, where it matches the end of
filesystem value of 850.

This causes the following sanity check code to fail because the
incorrectly computed len of 0 matches the incorrect size of the table
reported by the superblock (0 bytes).

    len = SQUASHFS_XATTR_BLOCK_BYTES(*xattr_ids);
    indexes = SQUASHFS_XATTR_BLOCKS(*xattr_ids);

    /*
     * The computed size of the index table (len bytes) should exactly
     * match the table start and end points
    */
    start = table_start + sizeof(*id_table);
    end = msblk->bytes_used;

    if (len != (end - start))
            return ERR_PTR(-EINVAL);

Changing the xattr_ids variable to be "usigned int" fixes the flaw on a
64-bit system.  This relies on the fact the computation is widened by the
unsigned long type of the sizeof operator.

Casting the variable to u64 in the above macro fixes this flaw on a 32-bit
system.

It also means 64-bit systems do not implicitly rely on the type of the
sizeof operator to widen the computation.

[1] https://lore.kernel.org/lkml/000000000000cd44f005f1a0f17f@google.com/

Link: https://lkml.kernel.org/r/20230127061842.10965-1-phillip@squashfs.org.uk
Fixes: 506220d2ba ("squashfs: add more sanity checks in xattr id lookup")
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: <syzbot+082fa4af80a5bb1a9843@syzkaller.appspotmail.com>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Fedor Pchelkin <pchelkin@ispras.ru>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31 16:44:10 -08:00
Mike Kravetz
3489dbb696 mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
Patch series "Fixes for hugetlb mapcount at most 1 for shared PMDs".

This issue of mapcount in hugetlb pages referenced by shared PMDs was
discussed in [1].  The following two patches address user visible behavior
caused by this issue.

[1] https://lore.kernel.org/linux-mm/Y9BF+OCdWnCSilEu@monkey/


This patch (of 2):

A hugetlb page will have a mapcount of 1 if mapped by multiple processes
via a shared PMD.  This is because only the first process increases the
map count, and subsequent processes just add the shared PMD page to their
page table.

page_mapcount is being used to decide if a hugetlb page is shared or
private in /proc/PID/smaps.  Pages referenced via a shared PMD were
incorrectly being counted as private.

To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is found
count the hugetlb page as shared.  A new helper to check for a shared PMD
is added.

[akpm@linux-foundation.org: simplification, per David]
[akpm@linux-foundation.org: hugetlb.h: include page_ref.h for page_count()]
Link: https://lkml.kernel.org/r/20230126222721.222195-2-mike.kravetz@oracle.com
Fixes: 25ee01a2fc ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31 16:44:09 -08:00
Randy Dunlap
0d7866eace freevxfs: Kconfig: fix spelling
Fix a spello in freevxfs Kconfig.
(reported by codespell)

Link: https://lkml.kernel.org/r/20230124181638.15604-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31 16:44:08 -08:00
Fedor Pchelkin
72e544b1b2 squashfs: harden sanity check in squashfs_read_xattr_id_table
While mounting a corrupted filesystem, a signed integer '*xattr_ids' can
become less than zero.  This leads to the incorrect computation of 'len'
and 'indexes' values which can cause null-ptr-deref in copy_bio_to_actor()
or out-of-bounds accesses in the next sanity checks inside
squashfs_read_xattr_id_table().

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Link: https://lkml.kernel.org/r/20230117105226.329303-2-pchelkin@ispras.ru
Fixes: 506220d2ba ("squashfs: add more sanity checks in xattr id lookup")
Reported-by: <syzbot+082fa4af80a5bb1a9843@syzkaller.appspotmail.com>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31 16:44:08 -08:00
Vishal Moola (Oracle)
e8dfc854ee ext4: convert mext_page_double_lock() to mext_folio_double_lock()
Convert mext_page_double_lock() to use folios.  This change saves 146
bytes of kernel text.  It also removes 6 calls to compound_head() and 2
calls to folio_file_page().

Link: https://lkml.kernel.org/r/20221207181009.4016-1-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:59 -08:00
David Hildenbrand
fc4f4be9b5 mm/nommu: factor out check for NOMMU shared mappings into is_nommu_shared_mapping()
Patch series "mm/nommu: don't use VM_MAYSHARE for MAP_PRIVATE mappings".

Trying to reduce the confusion around VM_SHARED and VM_MAYSHARE first
requires !CONFIG_MMU to stop using VM_MAYSHARE for MAP_PRIVATE mappings. 
CONFIG_MMU only sets VM_MAYSHARE for MAP_SHARED mappings.

This paves the way for further VM_MAYSHARE and VM_SHARED cleanups: for
example, renaming VM_MAYSHARED to VM_MAP_SHARED to make it cleaner what is
actually means.

Let's first get the weird case out of the way and not use VM_MAYSHARE in
MAP_PRIVATE mappings, using a new VM_MAYOVERLAY flag instead.


This patch (of 3):

We want to stop using VM_MAYSHARE in private mappings to pave the way for
clarifying the semantics of VM_MAYSHARE vs.  VM_SHARED and reduce the
confusion.  While CONFIG_MMU uses VM_MAYSHARE to represent MAP_SHARED,
!CONFIG_MMU also sets VM_MAYSHARE for selected R/O private file mappings
that are an effective overlay of a file mapping.

Let's factor out all relevant VM_MAYSHARE checks in !CONFIG_MMU code into
is_nommu_shared_mapping() first.

Note that whenever VM_SHARED is set, VM_MAYSHARE must be set as well
(unless there is a serious BUG).  So there is not need to test for
VM_SHARED manually.

No functional change intended.

Link: https://lkml.kernel.org/r/20230102160856.500584-1-david@redhat.com
Link: https://lkml.kernel.org/r/20230102160856.500584-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:56 -08:00
Jan Kara
4b89a37d54 fs: don't allocate blocks beyond EOF from __mpage_writepage
When __mpage_writepage() is called for a page beyond EOF, it will go and
allocate all blocks underlying the page.  This is not only unnecessary but
this way blocks can get leaked (e.g.  if a page beyond EOF is marked dirty
but in the end write fails and i_size is not extended).

Link: https://lkml.kernel.org/r/20230103104430.27749-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:56 -08:00
Christoph Hellwig
17c30ee6f2 ocfs2: use filemap_fdatawrite_wbc instead of generic_writepages
filemap_fdatawrite_wbc is a fairly thing wrapper around do_writepages, and
the big difference there is support for cgroup writeback, which is not
supported by ocfs2, and the potential to use ->writepages instead of
->writepage, which ocfs2 does not currently implement but eventually
should.

Link: https://lkml.kernel.org/r/20221229161031.391878-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:51 -08:00
Christoph Hellwig
cff61bbc71 jbd2,ocfs2: move jbd2_journal_submit_inode_data_buffers to ocfs2
jbd2_journal_submit_inode_data_buffers is only used by ocfs2, so move it
there to prepare for removing generic_writepages.

Link: https://lkml.kernel.org/r/20221229161031.391878-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:51 -08:00
Christoph Hellwig
25a89826f2 ntfs3: remove ->writepage
->writepage is a very inefficient method to write back data, and only used
through write_cache_pages or a a fallback when no ->migrate_folio method
is present.

Set ->migrate_folio to the generic buffer_head based helper, and remove
the ->writepage implementation.

Link: https://lkml.kernel.org/r/20221229161031.391878-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:51 -08:00
Christoph Hellwig
d4428bad14 ntfs3: stop using generic_writepages
Open code the resident inode handling in ntfs_writepages by directly using
write_cache_pages to prepare removing the ->writepage handler in ntfs3.

Link: https://lkml.kernel.org/r/20221229161031.391878-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:51 -08:00
Christoph Hellwig
5b68de6703 fs: remove an outdated comment on mpage_writepages
Patch series "remove generic_writepages"

This series removes generic_writepages by open coding the current
functionality in the three remaining callers.  Besides removing some
code the main benefit is that one of the few remaining ->writepage
callers from outside the core page cache code go away.

This patch (of 6):


mpage_writepages doesn't do any of the page locking itself, so remove and
outdated comment on the locking pattern there.

Link: https://lkml.kernel.org/r/20221229161031.391878-1-hch@lst.de
Link: https://lkml.kernel.org/r/20221229161031.391878-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:50 -08:00
Matthew Wilcox (Oracle)
a5fd8390d2 mpage: use b_folio in do_mpage_readpage()
Remove this conversion of a folio back to a page.

Link: https://lkml.kernel.org/r/20221215214402.3522366-13-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:41 -08:00
Matthew Wilcox (Oracle)
ac55e78d9e reiserfs: replace obvious uses of b_page with b_folio
These places just use b_page to get to the buffer's address_space or call
page_folio() on b_page to get a folio.

Link: https://lkml.kernel.org/r/20221215214402.3522366-12-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:41 -08:00
Matthew Wilcox (Oracle)
6ad4cd7f36 nilfs2: replace obvious uses of b_page with b_folio
These places just use b_page to get to the buffer's address_space or the
index of the page the buffer is in.

Link: https://lkml.kernel.org/r/20221215214402.3522366-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:41 -08:00
Matthew Wilcox (Oracle)
0d22fe2f03 jbd2: replace obvious uses of b_page with b_folio
These places just use b_page to get to the buffer's address_space or have
already been converted to folio.

Link: https://lkml.kernel.org/r/20221215214402.3522366-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:41 -08:00
Matthew Wilcox (Oracle)
11551cf15e gfs2: replace obvious uses of b_page with b_folio
These places just use b_page to get to the buffer's address_space.

Link: https://lkml.kernel.org/r/20221215214402.3522366-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:40 -08:00
Matthew Wilcox (Oracle)
cf1d3417e6 buffer: use b_folio in mark_buffer_dirty()
Removes about four calls to compound_head().  Two of them are inline which
removes 132 bytes from the kernel text.

Link: https://lkml.kernel.org/r/20221215214402.3522366-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:40 -08:00
Matthew Wilcox (Oracle)
743ed81ec1 buffer: use b_folio in end_buffer_async_write()
Save 76 bytes from avoiding the call to compound_head() in SetPageError().
Also avoid the call to compound_head() in end_page_writeback().

Link: https://lkml.kernel.org/r/20221215214402.3522366-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:40 -08:00
Matthew Wilcox (Oracle)
2e2dba15d1 buffer: use b_folio in end_buffer_async_read()
Removes a call to compound_head() in SetPageError(), saving 76 bytes of
text.

Link: https://lkml.kernel.org/r/20221215214402.3522366-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:40 -08:00
Matthew Wilcox (Oracle)
03c5f33123 buffer: use b_folio in touch_buffer()
Removes a call to compound_head() in this path.

Link: https://lkml.kernel.org/r/20221215214402.3522366-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:40 -08:00
Matthew Wilcox (Oracle)
abc8a8a2c7 buffer: replace obvious uses of b_page with b_folio
These cases just check if it's NULL, or use b_page to get to the page's
address space.  They are assumptions that b_page never points to a tail
page.

Link: https://lkml.kernel.org/r/20221215214402.3522366-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:39 -08:00
Peter Xu
9c67a20704 mm/hugetlb: introduce hugetlb_walk()
huge_pte_offset() is the main walker function for hugetlb pgtables.  The
name is not really representing what it does, though.

Instead of renaming it, introduce a wrapper function called hugetlb_walk()
which will use huge_pte_offset() inside.  Assert on the locks when walking
the pgtable.

Note, the vma lock assertion will be a no-op for private mappings.

Document the last special case in the page_vma_mapped_walk() path where we
don't need any more lock to call hugetlb_walk().

Taking vma lock there is not needed because either: (1) potential callers
of hugetlb pvmw holds i_mmap_rwsem already (from one rmap_walk()), or (2)
the caller will not walk a hugetlb vma at all so the hugetlb code path not
reachable (e.g.  in ksm or uprobe paths).

It's slightly implicit for future page_vma_mapped_walk() callers on that
lock requirement.  But anyway, when one day this rule breaks, one will get
a straightforward warning in hugetlb_walk() with lockdep, then there'll be
a way out.

[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20221216155229.2043750-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:39 -08:00
Peter Xu
b8da2e4660 mm/hugetlb: make userfaultfd_huge_must_wait() safe to pmd unshare
We can take the hugetlb walker lock, here taking vma lock directly.

Link: https://lkml.kernel.org/r/20221216155217.2043700-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:38 -08:00
Peter Xu
243b1f2d3b mm/hugetlb: let vma_offset_start() to return start
Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd
unshare", v4.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable.  It's used mostly everywhere since that's needed even
before taking the pgtable lock.

huge_pte_offset() is always called with mmap lock held with either read or
write.  It was assumed to be safe but it's actually not.  One race
condition can easily trigger by: (1) firstly trigger pmd share on a memory
range, (2) do huge_pte_offset() on the range, then at the meantime, (3)
another thread unshare the pmd range, and the pgtable page is prone to lost
if the other shared process wants to free it completely (by either munmap
or exit mm).

The recent work from Mike on vma lock can resolve most of this already.
It's achieved by forbidden pmd unsharing during the lock being taken, so no
further risk of the pgtable page being freed.  It means if we can take the
vma lock around all huge_pte_offset() callers it'll be safe.

There're already a bunch of them that we did as per the latest mm-unstable,
but also quite a few others that we didn't for various reasons especially
on huge_pte_offset() usage.

One more thing to mention is that besides the vma lock, i_mmap_rwsem can
also be used to protect the pgtable page (along with its pgtable lock) from
being freed from under us.  IOW, huge_pte_offset() callers need to either
hold the vma lock or i_mmap_rwsem to safely walk the pgtables.

A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is
very hard to trigger, one needs to apply another kernel delay patch too,
see below):

======8<=======
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <errno.h>
  #include <unistd.h>
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <linux/memfd.h>
  #include <assert.h>
  #include <pthread.h>

  #define  MSIZE  (1UL << 30)     /* 1GB */
  #define  PSIZE  (2UL << 20)     /* 2MB */

  #define  HOLD_SEC  (1)

  int pipefd[2];
  void *buf;

  void *do_map(int fd)
  {
      unsigned char *tmpbuf, *p;
      int ret;

      ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE);
      if (ret) {
          perror("posix_memalign() failed");
          return NULL;
      }

      tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
      if (tmpbuf == MAP_FAILED) {
          perror("mmap() failed");
          return NULL;
      }
      printf("mmap() -> %p\n", tmpbuf);

      for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) {
          *p = 1;
      }

      return tmpbuf;
  }

  void do_unmap(void *buf)
  {
      munmap(buf, MSIZE);
  }

  void proc2(int fd)
  {
      unsigned char c;

      buf = do_map(fd);
      if (!buf)
          return;

      read(pipefd[0], &c, 1);
      /*
       * This frees the shared pgtable page, causing use-after-free in
       * proc1_thread1 when soft walking hugetlb pgtable.
       */
      do_unmap(buf);

      printf("Proc2 quitting\n");
  }

  void *proc1_thread1(void *data)
  {
      /*
       * Trigger follow-page on 1st 2m page.  Kernel hack patch needed to
       * withhold this procedure for easier reproduce.
       */
      madvise(buf, PSIZE, MADV_POPULATE_WRITE);
      printf("Proc1-thread1 quitting\n");
      return NULL;
  }

  void *proc1_thread2(void *data)
  {
      unsigned char c;

      /* Wait a while until proc1_thread1() start to wait */
      sleep(0.5);
      /* Trigger pmd unshare */
      madvise(buf, PSIZE, MADV_DONTNEED);
      /* Kick off proc2 to release the pgtable */
      write(pipefd[1], &c, 1);

      printf("Proc1-thread2 quitting\n");
      return NULL;
  }

  void proc1(int fd)
  {
      pthread_t tid1, tid2;
      int ret;

      buf = do_map(fd);
      if (!buf)
          return;

      ret = pthread_create(&tid1, NULL, proc1_thread1, NULL);
      assert(ret == 0);
      ret = pthread_create(&tid2, NULL, proc1_thread2, NULL);
      assert(ret == 0);

      /* Kick the child to share the PUD entry */
      pthread_join(tid1, NULL);
      pthread_join(tid2, NULL);

      do_unmap(buf);
  }

  int main(void)
  {
      int fd, ret;

      fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB);
      if (fd < 0) {
          perror("open failed");
          return -1;
      }

      ret = ftruncate(fd, MSIZE);
      if (ret) {
          perror("ftruncate() failed");
          return -1;
      }

      ret = pipe(pipefd);
      if (ret) {
          perror("pipe() failed");
          return -1;
      }

      if (fork()) {
          proc1(fd);
      } else {
          proc2(fd);
      }

      close(pipefd[0]);
      close(pipefd[1]);
      close(fd);

      return 0;
  }
======8<=======

The kernel patch needed to present such a race so it'll trigger 100%:

======8<=======
: diff --git a/mm/hugetlb.c b/mm/hugetlb.c
: index 9d97c9a2a15d..f8d99dad5004 100644
: --- a/mm/hugetlb.c
: +++ b/mm/hugetlb.c
: @@ -38,6 +38,7 @@
:  #include <asm/page.h>
:  #include <asm/pgalloc.h>
:  #include <asm/tlb.h>
: +#include <asm/delay.h>
: 
:  #include <linux/io.h>
:  #include <linux/hugetlb.h>
: @@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
:                 bool unshare = false;
:                 int absent;
:                 struct page *page;
: +               unsigned long c = 0;
: 
:                 /*
:                  * If we have a pending SIGKILL, don't keep faulting pages and
: @@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
:                  */
:                 pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
:                                       huge_page_size(h));
: +
: +               pr_info("%s: withhold 1 sec...\n", __func__);
: +               for (c = 0; c < 100; c++) {
: +                       udelay(10000);
: +               }
: +               pr_info("%s: withhold 1 sec...done\n", __func__);
: +
:                 if (pte)
:                         ptl = huge_pte_lock(h, mm, pte);
:                 absent = !pte || huge_pte_none(huge_ptep_get(pte));
: ======8<=======

It'll trigger use-after-free of the pgtable spinlock:

======8<=======
[   16.959907] follow_hugetlb_page: withhold 1 sec...
[   17.960315] follow_hugetlb_page: withhold 1 sec...done
[   17.960550] ------------[ cut here ]------------
[   17.960742] DEBUG_LOCKS_WARN_ON(1)
[   17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0
[   17.961264] Modules linked in:
[   17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46
[   17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0
[   17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9
[   17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096
[   17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000
[   17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[   17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58
[   17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000
[   17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[   17.965123] FS:  00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[   17.965443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[   17.965956] PKRU: 55555554
[   17.966068] Call Trace:
[   17.966172]  <TASK>
[   17.966268]  ? tick_nohz_tick_stopped+0x12/0x30
[   17.966455]  lock_acquire+0xbf/0x2b0
[   17.966603]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.966799]  ? _printk+0x48/0x4e
[   17.966934]  _raw_spin_lock+0x2f/0x40
[   17.967087]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.967285]  follow_hugetlb_page.cold+0x75/0x5c4
[   17.967473]  __get_user_pages+0xbb/0x620
[   17.967635]  faultin_vma_page_range+0x9a/0x100
[   17.967817]  madvise_vma_behavior+0x3c0/0xbd0
[   17.967998]  ? mas_prev+0x11/0x290
[   17.968141]  ? find_vma_prev+0x5e/0xa0
[   17.968304]  ? madvise_vma_anon_name+0x70/0x70
[   17.968486]  madvise_walk_vmas+0xa9/0x120
[   17.968650]  do_madvise.part.0+0xfa/0x270
[   17.968813]  __x64_sys_madvise+0x5a/0x70
[   17.968974]  do_syscall_64+0x37/0x90
[   17.969123]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   17.969329] RIP: 0033:0x7f1840f0efdb
[   17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68
[   17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[   17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb
[   17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000
[   17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f
[   17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80
[   17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000
[   17.972083]  </TASK>
[   17.972199] irq event stamp: 2353
[   17.972372] hardirqs last  enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70
[   17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70
[   17.973365] softirqs last  enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[   17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160
[   17.974341] ---[ end trace 0000000000000000 ]---
[   17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8
[   17.975012] #PF: supervisor read access in kernel mode
[   17.975314] #PF: error_code(0x0000) - not-present page
[   17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0
[   17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G        W          6.1.0-rc4-peterx+ #46
[   17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0
[   17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f
[   17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046
[   17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000
[   17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff
[   17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58
[   17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000
[   17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001
[   17.984815] FS:  00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000
[   17.985924] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0
[   17.986674] PKRU: 55555554
[   17.986832] Call Trace:
[   17.987012]  <TASK>
[   17.987266]  ? tick_nohz_tick_stopped+0x12/0x30
[   17.987770]  lock_acquire+0xbf/0x2b0
[   17.988118]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.988575]  ? _printk+0x48/0x4e
[   17.988889]  _raw_spin_lock+0x2f/0x40
[   17.989243]  ? follow_hugetlb_page.cold+0x75/0x5c4
[   17.989687]  follow_hugetlb_page.cold+0x75/0x5c4
[   17.990119]  __get_user_pages+0xbb/0x620
[   17.990500]  faultin_vma_page_range+0x9a/0x100
[   17.990928]  madvise_vma_behavior+0x3c0/0xbd0
[   17.991354]  ? mas_prev+0x11/0x290
[   17.991678]  ? find_vma_prev+0x5e/0xa0
[   17.992024]  ? madvise_vma_anon_name+0x70/0x70
[   17.992421]  madvise_walk_vmas+0xa9/0x120
[   17.992793]  do_madvise.part.0+0xfa/0x270
[   17.993166]  __x64_sys_madvise+0x5a/0x70
[   17.993539]  do_syscall_64+0x37/0x90
[   17.993879]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
======8<=======

Resolution
==========

This patchset protects all the huge_pte_offset() callers to also take the
vma lock properly.

Patch Layout
============

Patch 1-2:         cleanup, or dependency of the follow up patches
Patch 3:           before fixing, document huge_pte_offset() on lock required
Patch 4-8:         each patch resolves one possible race condition
Patch 9:           introduce hugetlb_walk() to replace huge_pte_offset()

Tests
=====

The series is verified with the above reproducer so the race cannot
trigger anymore.  It also passes all hugetlb kselftests.


This patch (of 9):

Even though vma_offset_start() is named like that, it's not returning "the
start address of the range" but rather the offset we should use to offset
the vma->vm_start address.

Make it return the real value of the start vaddr, and it also helps for
all the callers because whenever the retval is used, it'll be ultimately
added into the vma->vm_start anyway, so it's better.

Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:38 -08:00
Andrew Morton
bd86d2ea36 Sync with v6.2-rc4
Merge branch 'master' into mm-hotfixes-stable
2023-01-18 16:52:20 -08:00
Linus Torvalds
b35ad63eec 7 smb3 client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmPB/t0ACgkQiiy9cAdy
 T1Gvbwv+LIXF5dHNGHDuezecbD8T9sRF2v15Mh7i6SSg7BWeXXebYY4dyrSQ5SHu
 KqsN2Y3P3A0ZQ3ApzFryImM6BwSpOeyCsVRhl7VCWnMgXcroqc/O6F6/YRFVkUAi
 iWWZLXM7WFBQGXUbPXaiPc+wCRARrnul9p+48Teyy0CJWiWormQmkznVxeihErDX
 /pWdQdvJeFcUrIj1H3e4cyJF2hVzRiUGI/eZmBGlDyaK192vYgGYO2AhHnTfd7fU
 dUJ+/trVw0koyC5/86veHRqCcXzFD44ORkAB46NaCic1K8t+RPhmxgtriiNLcCrQ
 kEmeub6ayPkuniV88NBEPXaDy0S/cEYHr7GEuZTn4sq+hw//y5KbKN3sa6aLHsMH
 46BIHcyTXU59eNJ4lWOjoqD2NiqP2GYFY4PZftB85H1EW8Fchwcsw4WzW0ENpzmi
 qcWslXDKYjJIZ6NHnBiR/FYI1VdsmoGbDzMhrHreCrWCuIuVaqfrFPMnvE8f5dqY
 fkfnkqT4
 =SdZA
 -----END PGP SIGNATURE-----

Merge tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:

 - memory leak and double free fix

 - two symlink fixes

 - minor cleanup fix

 - two smb1 fixes

* tag '6.2-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix uninitialized memory read for smb311 posix symlink create
  cifs: fix potential memory leaks in session setup
  cifs: do not query ifaces on smb1 mounts
  cifs: fix double free on failed kerberos auth
  cifs: remove redundant assignment to the variable match
  cifs: fix file info setting in cifs_open_file()
  cifs: fix file info setting in cifs_query_path_info()
2023-01-14 08:08:25 -06:00