IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Using tried_regions/total_bytes file, users can efficiently retrieve the
total size of memory regions having specific access pattern. However,
DAMON sysfs interface in kernel still populates all the infomration on the
tried_regions subdirectories. That means the kernel part overhead for the
construction of tried regions directories still exists. To remove the
overhead, implement yet another command input for 'state' DAMON sysfs
file. Writing the input to the file makes DAMON sysfs interface to update
only the total_bytes file.
Link: https://lkml.kernel.org/r/20230802213222.109841-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon/sysfs-schemes: implement DAMOS tried total bytes
file".
The tried_regions directory of DAMON sysfs interface is useful for
retrieving monitoring results snapshot or DAMOS debugging. However, for
common use case that need to monitor only the total size of the scheme
tried regions (e.g., monitoring working set size), the kernel overhead for
directory construction and user overhead for reading the content could be
high if the number of monitoring region is not small. This patchset
implements DAMON sysfs files for efficient support of the use case.
The first patch implements the sysfs file to reduce the user space
overhead, and the second patch implements a command for reducing the
kernel space overhead.
The third patch adds a selftest for the new file, and following two
patches update documents.
[1] https://lore.kernel.org/damon/20230728201817.70602-1-sj@kernel.org/
This patch (of 5):
The tried_regions directory can be used for retrieving the monitoring
results snapshot for regions of specific access pattern, by setting the
scheme's action as 'stat' and the access pattern as required. While the
interface provides every detail of the monitoring results, some use cases
including working set size monitoring requires only the total size of the
regions. For such cases, users should read all the information and
calculate the total size of the regions. However, it could incur high
overhead if the number of regions is high. Add a file for retrieving only
the information, namely 'total_bytes' file. It allows users to get the
total size by reading only the file.
Link: https://lkml.kernel.org/r/20230802213222.109841-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20230802213222.109841-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
walk->can_swap might be invalid since it's not guaranteed to be
initialized for the particular lruvec. Instead deduce it from the folio
type (anon/file).
Link: https://lkml.kernel.org/r/20230802025606.346758-3-kaleshsingh@google.com
Fixes: 018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Barrett <steven@liquorix.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
inc_max_seq() will try to inc_min_seq() if nr_gens == MAX_NR_GENS. This
is because the generations are reused (the last oldest now empty
generation will become the next youngest generation).
inc_min_seq() is retried until successful, dropping the lru_lock
and yielding the CPU on each failure, and retaking the lock before
trying again:
while (!inc_min_seq(lruvec, type, can_swap)) {
spin_unlock_irq(&lruvec->lru_lock);
cond_resched();
spin_lock_irq(&lruvec->lru_lock);
}
However, the initial condition that required incrementing the min_seq
(nr_gens == MAX_NR_GENS) is not retested. This can change by another
call to inc_max_seq() from run_aging() with force_scan=true from the
debugfs interface.
Since the eviction stalls when the nr_gens == MIN_NR_GENS, avoid
unnecessarily incrementing the min_seq by rechecking the number of
generations before each attempt.
This issue was uncovered in previous discussion on the list by Yu Zhao
and Aneesh Kumar [1].
[1] https://lore.kernel.org/linux-mm/CAOUHufbO7CaVm=xjEb1avDhHVvnC8pJmGyKcFf2iY_dpf+zR3w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20230802025606.346758-2-kaleshsingh@google.com
Fixes: d6c3af7d8a2b ("mm: multi-gen LRU: debugfs interface")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Barrett <steven@liquorix.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MGLRU has a LRU list for each zone for each type (anon/file) in each
generation:
long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
The min_seq (oldest generation) can progress independently for each
type but the max_seq (youngest generation) is shared for both anon and
file. This is to maintain a common frame of reference.
In order for eviction to advance the min_seq of a type, all the per-zone
lists in the oldest generation of that type must be empty.
The eviction logic only considers pages from eligible zones for
eviction or promotion.
scan_folios() {
...
for (zone = sc->reclaim_idx; zone >= 0; zone--) {
...
sort_folio(); // Promote
...
isolate_folio(); // Evict
}
...
}
Consider the system has the movable zone configured and default 4
generations. The current state of the system is as shown below
(only illustrating one type for simplicity):
Type: ANON
Zone DMA32 Normal Movable Device
Gen 0 0 0 4GB 0
Gen 1 0 1GB 1MB 0
Gen 2 1MB 4GB 1MB 0
Gen 3 1MB 1MB 1MB 0
Now consider there is a GFP_KERNEL allocation request (eligible zone
index <= Normal), evict_folios() will return without doing any work
since there are no pages to scan in the eligible zones of the oldest
generation. Reclaim won't make progress until triggered from a ZONE_MOVABLE
allocation request; which may not happen soon if there is a lot of free
memory in the movable zone. This can lead to OOM kills, although there
is 1GB pages in the Normal zone of Gen 1 that we have not yet tried to
reclaim.
This issue is not seen in the conventional active/inactive LRU since
there are no per-zone lists.
If there are no (not enough) folios to scan in the eligible zones, move
folios from ineligible zone (zone_index > reclaim_index) to the next
generation. This allows for the progression of min_seq and reclaiming
from the next generation (Gen 1).
Qualcomm, Mediatek and raspberrypi [1] discovered this issue independently.
[1] https://github.com/raspberrypi/linux/issues/5395
Link: https://lkml.kernel.org/r/20230802025606.346758-1-kaleshsingh@google.com
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Barrett <steven@liquorix.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Before commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim and swap
floods"), proactive reclaim will extreme overreclaim sometimes. But
proactive reclaim still inaccurate and some extent overreclaim.
Problematic case is easy to construct. Allocate lots of anonymous memory
(e.g., 20G) in a memcg, then swapping by writing memory.recalim and there
is a certain probability of overreclaim. For example, request 1G by
writing memory.reclaim will eventually reclaim 1.7G or other values more
than 1G.
The reason is that reclaimer may have already reclaimed part of requested
memory in one loop, but before adjust sc->nr_to_reclaim in outer loop,
call shrink_lruvec() again will still follow the current sc->nr_to_reclaim
to work. It will eventually lead to overreclaim. In theory, the amount
of reclaimed would be in [request, 2 * request).
Reclaimer usually tends to reclaim more than request. But either direct
or kswapd reclaim have much smaller nr_to_reclaim targets, so it is less
noticeable and not have much impact.
Proactive reclaim can usually come in with a larger value, so the error is
difficult to ignore. Considering proactive reclaim is usually low
frequency, handle the batching into smaller chunks is a better approach.
Link: https://lkml.kernel.org/r/20230721014116.3388-1-yangyifei03@kuaishou.com
Signed-off-by: Efly Young <yangyifei03@kuaishou.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damos_new_filter() was having a bug that not initializing ->list field of
the returning damos_filter struct, which results in access to
uninitialized memory. Add a unit test for the function.
Link: https://lkml.kernel.org/r/20230729203733.38949-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When free_pages is 0, alike_pages is not used. So alike_pages calculation
can be avoided by checking free_pages early to save cpu cycles. Also fix
typo 'comparable'. It should be 'compatible' here.
Link: https://lkml.kernel.org/r/20230801123723.2225543-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As ptep_get, Use the pmdp_get wrapper when we accessing pmdval instead of
directly dereferencing pmd.
Link: https://lkml.kernel.org/r/20230727212157.2985025-1-ppbuk5246@gmail.com
Signed-off-by: Levi Yun <ppbuk5246@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A recent patch shows that not everybody understands that "stabilise the
mapping" really means "prevent the mapping from being freed", so change
the wording to hopefully make that more clear.
Link: https://lkml.kernel.org/r/ZMLWEB4m3zvX6SBN@casper.infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use helper macros PAGE_ALIGN and PAGE_ALIGN_DOWN to improve code
readability. No functional modification involved.
Link: https://lkml.kernel.org/r/20230727011612.2721843-4-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marco Elver <elver@google.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "minor cleanups for kmsan".
Use helper function and macros to improve code readability. No functional
modification involved.
This patch (of 3):
Use function page_size() to improve code readability. No functional
modification involved.
Link: https://lkml.kernel.org/r/20230727011612.2721843-1-zhangpeng362@huawei.com
Link: https://lkml.kernel.org/r/20230727011612.2721843-2-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marco Elver <elver@google.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add description of @mas and @tree_end, remove @mt in unmap_vmas(). to
silence the warnings:
mm/memory.c:1837: warning: Function parameter or member 'mas' not described in 'unmap_vmas'
mm/memory.c:1837: warning: Function parameter or member 'tree_end' not described in 'unmap_vmas'
mm/memory.c:1837: warning: Excess function parameter 'mt' description in 'unmap_vmas'
Link: https://lkml.kernel.org/r/20230727015558.69554-1-yang.lee@linux.alibaba.com
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5996
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The __read_swap_cache_async() interface isn't more difficult to understand
than what the helper abstracts. Save the indirection and a level of
indentation for the primary work of the writeback func.
Link: https://lkml.kernel.org/r/20230727162343.1415598-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Removing a zswap entry from the tree is tied to an explicit operation
that's supposed to drop the base reference: swap invalidation, exclusive
load, duplicate store. Don't silently remove the entry on final put, but
instead warn if an entry is in tree without reference.
While in that diff context, convert a BUG_ON to a WARN_ON_ONCE. No need
to crash on a refcount underflow.
Link: https://lkml.kernel.org/r/20230727162343.1415598-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: zswap: three cleanups".
Three small cleanups to zswap, the first one suggested by Yosry during the
frontswap removal.
This patch (of 3):
Minor cleanup. Instead of open-coding the tree deletion and the put, use
the zswap_invalidate_entry() convenience helper.
Link: https://lkml.kernel.org/r/20230727162343.1415598-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20230727162343.1415598-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use page_ext_data helper in page_owner to avoid access offset directly.
Link: https://lkml.kernel.org/r/20230718145812.1991717-4-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Andrew Morton <akpm@linux-foudation.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use page_ext_data helper in page_table_check to avoid access offset
directly.
Link: https://lkml.kernel.org/r/20230718145812.1991717-3-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Andrew Morton <akpm@linux-foudation.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Only convert a few easy parts of this function to use the folio passed in;
convert back to struct page for the majority of it. Removes three hidden
calls to compound_head().
Link: https://lkml.kernel.org/r/20230715042343.434588-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace six implicit calls to compound_head() with one call to
page_folio().
Link: https://lkml.kernel.org/r/20230715042343.434588-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As the one caller now has a folio, pass it in and use it. Removes three
calls to compound_head().
Link: https://lkml.kernel.org/r/20230715042343.434588-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Followup folio conversions for zswap".
With frontswap killed, it's worth converting the zswap_load() and
zswap_store() functions to take a folio instead of a page pointer. They
aren't converted to support large folios, but there are a lot of
unnecessary calls to compound_head() that are removed by these patches.
This patch (of 4):
Only convert a few easy parts of this function to use the folio passed in;
convert back to struct page for the majority of it. This does remove a
few hidden calls to compound_head().
Link: https://lkml.kernel.org/r/20230715042343.434588-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230715042343.434588-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The only user of frontswap is zswap, and has been for a long time. Have
swap call into zswap directly and remove the indirection.
[hannes@cmpxchg.org: remove obsolete comment, per Yosry]
Link: https://lkml.kernel.org/r/20230719142832.GA932528@cmpxchg.org
[fengwei.yin@intel.com: don't warn if none swapcache folio is passed to zswap_load]
Link: https://lkml.kernel.org/r/20230810095652.3905184-1-fengwei.yin@intel.com
Link: https://lkml.kernel.org/r/20230717160227.GA867137@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Support using multiple zpools of the same type in zswap, for concurrency
purposes. A fixed number of 32 zpools is suggested by this commit, which
was determined empirically. It can be later changed or made into a config
option if needed.
On a setup with zswap and zsmalloc, comparing a single zpool to 32 zpools
shows improvements in the zsmalloc lock contention, especially on the swap
out path.
The following shows the perf analysis of the swapout path when 10
workloads are simultaneously reclaiming and refaulting tmpfs pages. There
are some improvements on the swap in path as well, but less significant.
1 zpool:
|--28.99%--zswap_frontswap_store
|
<snip>
|
|--8.98%--zpool_map_handle
| |
| --8.98%--zs_zpool_map
| |
| --8.95%--zs_map_object
| |
| --8.38%--_raw_spin_lock
| |
| --7.39%--queued_spin_lock_slowpath
|
|--8.82%--zpool_malloc
| |
| --8.82%--zs_zpool_malloc
| |
| --8.80%--zs_malloc
| |
| |--7.21%--_raw_spin_lock
| | |
| | --6.81%--queued_spin_lock_slowpath
<snip>
32 zpools:
|--16.73%--zswap_frontswap_store
|
<snip>
|
|--1.81%--zpool_malloc
| |
| --1.81%--zs_zpool_malloc
| |
| --1.79%--zs_malloc
| |
| --0.73%--obj_malloc
|
|--1.06%--zswap_update_total_size
|
|--0.59%--zpool_map_handle
| |
| --0.59%--zs_zpool_map
| |
| --0.57%--zs_map_object
| |
| --0.51%--_raw_spin_lock
<snip>
Link: https://lkml.kernel.org/r/20230620194644.3142384-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Yu Zhao <yuzhao@google.com>
Acked-by: Chris Li (Google) <chrisl@kernel.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Nhat Pham <nphamcs@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When a memcg is in the process of being released mem_cgroup_tryget will
fail because its reference count has already reached 0. This can happen
during reclaim if the memcg has already been offlined, and we reclaim all
remaining pages attributed to the offlined memcg. shrink_many attempts to
skip the empty memcg in this case, and continue reclaiming from the
remaining memcgs in the old generation. If there is only one memcg
remaining, or if all remaining memcgs are in the process of being released
then shrink_many will spin until all memcgs have finished being released.
The release occurs through a workqueue, so it can take a while before
kswapd is able to make any further progress.
This fix results in reductions in kswapd activity and direct reclaim in
a test where 28 apps (working set size > total memory) are repeatedly
launched in a random sequence:
A B delta ratio(%)
allocstall_movable 5962 3539 -2423 -40.64
allocstall_normal 2661 2417 -244 -9.17
kswapd_high_wmark_hit_quickly 53152 7594 -45558 -85.71
pageoutrun 57365 11750 -45615 -79.52
Link: https://lkml.kernel.org/r/20230814151636.1639123-1-tjmercier@google.com
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When page_handle_poison() fails to handle the hugepage or free page in
retry path, soft_offline_page() will return 0 while -EBUSY is expected in
this case.
Consequently the user will think soft_offline_page succeeds while it in
fact failed. So the user will not try again later in this case.
Link: https://lkml.kernel.org/r/20230627112808.1275241-1-linmiaohe@huawei.com
Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
flush_cache_vmap() must be called after new vmalloc mappings are installed
in the page table in order to allow architectures to make sure the new
mapping is visible.
It could lead to a panic since on some architectures (like powerpc),
the page table walker could see the wrong pte value and trigger a
spurious page fault that can not be resolved (see commit f1cb8f9beba8
("powerpc/64s/radix: avoid ptesync after set_pte and
ptep_set_access_flags")).
But actually the patch is aiming at riscv: the riscv specification
allows the caching of invalid entries in the TLB, and since we recently
removed the vmalloc page fault handling, we now need to emit a tlb
shootdown whenever a new vmalloc mapping is emitted
(https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/).
That's a temporary solution, there are ways to avoid that :)
Link: https://lkml.kernel.org/r/20230809164633.1556126-1-alexghiti@rivosinc.com
Fixes: 3e9a9e256b1e ("mm: add a vmap_pfn function")
Reported-by: Dylan Jhong <dylan@andestech.com>
Closes: https://lore.kernel.org/linux-riscv/ZMytNY2J8iyjbPPy@atctrx.andestech.com/
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Dylan Jhong <dylan@andestech.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In contrast to most other GUP code, GUP-fast common page table walking
code like gup_pte_range() also handles hugetlb pages. But in contrast to
other hugetlb page table walking code, it does not look at the hugetlb PTE
abstraction whereby we have only a single logical hugetlb PTE per hugetlb
page, even when using multiple cont-PTEs underneath -- which is for
example what huge_ptep_get() abstracts.
So when we have a hugetlb page that is mapped via cont-PTEs, GUP-fast
might stumble over a PTE that does not map the head page of a hugetlb page
-- not the first "head" PTE of such a cont mapping.
Logically, the whole hugetlb page is mapped (entire_mapcount == 1), but we
might end up calling gup_must_unshare() with a tail page of a hugetlb
page.
We only maintain a single PageAnonExclusive flag per hugetlb page (as
hugetlb pages cannot get partially COW-shared), stored for the head page.
That flag is clear for all tail pages.
So when gup_must_unshare() ends up calling PageAnonExclusive() with a tail
page of a hugetlb page:
1) With CONFIG_DEBUG_VM_PGFLAGS
Stumbles over the:
VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);
For example, when executing the COW selftests with 64k hugetlb pages on
arm64:
[ 61.082187] page:00000000829819ff refcount:3 mapcount:1 mapping:0000000000000000 index:0x1 pfn:0x11ee11
[ 61.082842] head:0000000080f79bf7 order:4 entire_mapcount:1 nr_pages_mapped:0 pincount:2
[ 61.083384] anon flags: 0x17ffff80003000e(referenced|uptodate|dirty|head|mappedtodisk|node=0|zone=2|lastcpupid=0xfffff)
[ 61.084101] page_type: 0xffffffff()
[ 61.084332] raw: 017ffff800000000 fffffc00037b8401 0000000000000402 0000000200000000
[ 61.084840] raw: 0000000000000010 0000000000000000 00000000ffffffff 0000000000000000
[ 61.085359] head: 017ffff80003000e ffffd9e95b09b788 ffffd9e95b09b788 ffff0007ff63cf71
[ 61.085885] head: 0000000000000000 0000000000000002 00000003ffffffff 0000000000000000
[ 61.086415] page dumped because: VM_BUG_ON_PAGE(PageHuge(page) && !PageHead(page))
[ 61.086914] ------------[ cut here ]------------
[ 61.087220] kernel BUG at include/linux/page-flags.h:990!
[ 61.087591] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 61.087999] Modules linked in: ...
[ 61.089404] CPU: 0 PID: 4612 Comm: cow Kdump: loaded Not tainted 6.5.0-rc4+ #3
[ 61.089917] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 61.090409] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 61.090897] pc : gup_must_unshare.part.0+0x64/0x98
[ 61.091242] lr : gup_must_unshare.part.0+0x64/0x98
[ 61.091592] sp : ffff8000825eb940
[ 61.091826] x29: ffff8000825eb940 x28: 0000000000000000 x27: fffffc00037b8440
[ 61.092329] x26: 0400000000000001 x25: 0000000000080101 x24: 0000000000080000
[ 61.092835] x23: 0000000000080100 x22: ffff0000cffb9588 x21: ffff0000c8ec6b58
[ 61.093341] x20: 0000ffffad6b1000 x19: fffffc00037b8440 x18: ffffffffffffffff
[ 61.093850] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548
[ 61.094358] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021
[ 61.094858] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffffd9e958b7a1c0
[ 61.095359] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 00000000002bffa8
[ 61.095873] x5 : ffff0008bb19e708 x4 : 0000000000000000 x3 : 0000000000000000
[ 61.096380] x2 : 0000000000000000 x1 : ffff0000cf6636c0 x0 : 0000000000000046
[ 61.096894] Call trace:
[ 61.097080] gup_must_unshare.part.0+0x64/0x98
[ 61.097392] gup_pte_range+0x3a8/0x3f0
[ 61.097662] gup_pgd_range+0x1ec/0x280
[ 61.097942] lockless_pages_from_mm+0x64/0x1a0
[ 61.098258] internal_get_user_pages_fast+0xe4/0x1d0
[ 61.098612] pin_user_pages_fast+0x58/0x78
[ 61.098917] pin_longterm_test_start+0xf4/0x2b8
[ 61.099243] gup_test_ioctl+0x170/0x3b0
[ 61.099528] __arm64_sys_ioctl+0xa8/0xf0
[ 61.099822] invoke_syscall.constprop.0+0x7c/0xd0
[ 61.100160] el0_svc_common.constprop.0+0xe8/0x100
[ 61.100500] do_el0_svc+0x38/0xa0
[ 61.100736] el0_svc+0x3c/0x198
[ 61.100971] el0t_64_sync_handler+0x134/0x150
[ 61.101280] el0t_64_sync+0x17c/0x180
[ 61.101543] Code: aa1303e0 f00074c1 912b0021 97fffeb2 (d4210000)
2) Without CONFIG_DEBUG_VM_PGFLAGS
Always detects "not exclusive" for passed tail pages and refuses to PIN
the tail pages R/O, as gup_must_unshare() == true. GUP-fast will fallback
to ordinary GUP. As ordinary GUP properly considers the logical hugetlb
PTE abstraction in hugetlb_follow_page_mask(), pinning the page will
succeed when looking at the PageAnonExclusive on the head page only.
So the only real effect of this is that with cont-PTE hugetlb pages, we'll
always fallback from GUP-fast to ordinary GUP when not working on the head
page, which ends up checking the head page and do the right thing.
Consequently, the cow selftests pass with cont-PTE hugetlb pages as well
without CONFIG_DEBUG_VM_PGFLAGS.
Note that this only applies to anon hugetlb pages that are mapped using
cont-PTEs: for example 64k hugetlb pages on a 4k arm64 kernel.
... and only when R/O-pinning (FOLL_PIN) such pages that are mapped into
the page table R/O using GUP-fast.
On production kernels (and even most debug kernels, that don't set
CONFIG_DEBUG_VM_PGFLAGS) this patch should theoretically not be required
to be backported. But of course, it does not hurt.
Link: https://lkml.kernel.org/r/20230805101256.87306-1-david@redhat.com
Fixes: a7f226604170 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
walk_page_range() and friends often operate under write-locked mmap_lock.
With introduction of vma locks, the vmas have to be locked as well during
such walks to prevent concurrent page faults in these areas. Add an
additional member to mm_walk_ops to indicate locking requirements for the
walk.
The change ensures that page walks which prevent concurrent page faults
by write-locking mmap_lock, operate correctly after introduction of
per-vma locks. With per-vma locks page faults can be handled under vma
lock without taking mmap_lock at all, so write locking mmap_lock would
not stop them. The change ensures vmas are properly locked during such
walks.
A sample issue this solves is do_mbind() performing queue_pages_range()
to queue pages for migration. Without this change a concurrent page
can be faulted into the area and be left out of migration.
Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We shouldn't be using a GUP-internal helper if it can be avoided.
Similar to smaps_pte_entry() that uses vm_normal_page(), let's use
vm_normal_page_pmd() that similarly refuses to return the huge zeropage.
In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd():
(1) Will always return the head page, not a tail page of a THP.
If we'd ever call smaps_account with a tail page while setting "compound
= true", we could be in trouble, because smaps_account() would look at
the memmap of unrelated pages.
If we're unlucky, that memmap does not exist at all. Before we removed
PG_doublemap, we could have triggered something similar as in
commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for
migration entry").
This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc:
smaps_rollup: do not stall write attempts on mmap_lock"):
(a) We're in show_smaps_rollup() and processed a VMA
(b) We release the mmap lock in show_smaps_rollup() because it is
contended
(c) We merged that VMA with another VMA
(d) We collapsed a THP in that merged VMA at that position
If the end address of the original VMA falls into the middle of a THP
area, we would call smap_gather_stats() with a start address that falls
into a PMD-mapped THP. It's probably very rare to trigger when not
really forced.
(2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page()
Treat such PMDs here just like smaps_pte_entry() would treat such PTEs.
If such pages would be anonymous, we most certainly would want to
account them.
(3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap()
As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE
pages. So just like smaps_pte_entry(), we'll now also ignore such PMD
entries.
Especially, follow_pmd_mask() never ends up calling
follow_trans_huge_pmd() on pmd_devmap(). Instead it calls
follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN
is set.
So skipping pmd_devmap() pages seems to be the right thing to do.
(4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page()
We won't be returning a memmap that should be ignored by core-mm, or
worse, a memmap that does not even exist. Note that while
walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't.
Most probably this case doesn't currently really happen on the PMD level,
otherwise we'd already be able to trigger kernel crashes when reading
smaps / smaps_rollup.
So most probably only (1) is relevant in practice as of now, but could only
cause trouble in extreme corner cases.
Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future
reuse in wrong context.
Link: https://lkml.kernel.org/r/20230803143208.383663-3-david@redhat.com
Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: liubo <liubo254@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by
gup_can_follow_protnone()") missed that follow_page() and
follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really
don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting
or due to inaccessible (PROT_NONE) VMAs.
As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
page faults from gup/gup_fast"): "Other follow_page callers like KSM
should not use FOLL_NUMA, or they would fail to get the pages if they use
follow_page instead of get_user_pages."
liubo reported [1] that smaps_rollup results are imprecise, because they
miss accounting of pages that are mapped PROT_NONE. Further, it's easy to
reproduce that KSM no longer works on inaccessible VMAs on x86-64, because
pte_protnone()/pmd_protnone() also indictaes "true" in inaccessible VMAs,
and follow_page() refuses to return such pages right now.
As KVM really depends on these NUMA hinting faults, removing the
pte_protnone()/pmd_protnone() handling in GUP code completely is not
really an option.
To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
to restore the original behavior for now and add better comments.
Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in
is_valid_gup_args(), to add that flag for all external GUP users.
Note that there are three GUP-internal __get_user_pages() users that don't
end up calling is_valid_gup_args() and consequently won't get
FOLL_HONOR_NUMA_FAULT set.
1) get_dump_page(): we really don't want to handle NUMA hinting
faults. It specifies FOLL_FORCE and wouldn't have honored NUMA
hinting faults already.
2) populate_vma_page_range(): we really don't want to handle NUMA hinting
faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have
honored NUMA hinting faults already.
3) faultin_vma_page_range(): we similarly don't want to handle NUMA
hinting faults.
To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in
inaccessible VMAs properly, we have to perform VMA accessibility checks in
gup_can_follow_protnone().
As GUP-fast should reject such pages either way in
pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and
arm64 that both implement pte_protnone() -- let's just always fallback to
ordinary GUP when stumbling over pte_protnone()/pmd_protnone().
As Linus notes [2], honoring NUMA faults might only make sense for
selected GUP users.
So we should really see if we can instead let relevant GUP callers specify
it manually, and not trigger NUMA hinting faults from GUP as default.
Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and
adding appropriate documenation.
While at it, remove a stale comment from follow_trans_huge_pmd(): That
comment for pmd_protnone() was added in commit 2b4847e73004 ("mm: numa:
serialise parallel get_user_page against THP migration"), which noted:
THP does not unmap pages due to a lack of support for migration
entries at a PMD level. This allows races with get_user_pages
Nowadays, we do have PMD migration entries, so the comment no longer
applies. Let's drop it.
[1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
[2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com
Link: https://lkml.kernel.org/r/20230803143208.383663-2-david@redhat.com
Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: liubo <liubo254@huawei.com>
Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com
Reported-by: Peter Xu <peterx@redhat.com>
Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fold folio_account_redirty into folio_redirty_for_writepage now
that all other users except for the also unused account_page_redirty
wrapper are gone.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This might_sleep() goes back a long time: it was originally introduced
way back when by commit 010060741ad3 ("x86: add might_sleep() to
do_page_fault()"), and made it into the generic VM code when the x86
fault path got re-organized and generalized in commit c2508ec5a58d ("mm:
introduce new 'lock_mm_and_find_vma()' page fault helper").
However, it turns out that the placement of that might_sleep() has
always been rather questionable simply because it's not only a debug
statement to warn about sleeping in contexts that shouldn't sleep (which
was the original reason for adding it), but it also implies a voluntary
scheduling point.
That, in turn, is less than desirable for two reasons:
(a) it ends up being done after we successfully got the mmap_lock, so
just as we got the lock we will now eagerly schedule away and
increase lock contention
and
(b) this is all very possibly part of the "oops, things went horribly
wrong" path and we just haven't figured that out yet
After all, the whole _reason_ for having that get_mmap_lock_carefully()
rather than just doing the obvious mmap_read_lock() is because this code
wants to deal somewhat gracefully with potential kernel wild pointer
bugs.
So then a voluntary scheduling point here is simply not a good idea.
We could certainly turn the 'might_sleep()' into a '__might_sleep()' and
make it be just the debug check that it was originally intended to be.
But even that seems questionable in the wild kernel pointer case - which
again is part of the whole point of this code. The problem wouldn't be
about the _sleeping_ part of the page fault, but about a bad kernel
access. The fact that that bad kernel access might happen in a section
that you shouldn't sleep in is secondary.
So it really ends up being the case that this is simply entirely the
wrong place to do this debug check and related scheduling point at all.
So let's just remove the check entirely. It's been around for over a
decade, it has served its purpose.
The re-schedule will happen at return to user space anyway for the
normal case, and the warning - if we even need it - might be better off
done as a special case for "page fault from kernel mode" once we've
dealt with any potential kernel oopses where the oops is the relevant
thing, not some artificial "scheduling while atomic" test.
Reported-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/lkml/20230820104303.2083444-1-mjguzik@gmail.com/
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/sfc/tc.c
fa165e194997 ("sfc: don't unregister flow_indr if it was never registered")
3bf969e88ada ("sfc: add MAE table machinery for conntrack table")
https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap
optimization includes an update of both the permissions (writeable to
read-only) and the output address (pfn) of the vmemmap ptes. That is not
supported without unmapping of pte(marking it invalid) by some
architectures.
With DAX vmemmap optimization we don't require such pte updates and
architectures can enable DAX vmemmap optimization while having hugetlb
vmemmap optimization disabled. Hence split DAX optimization support into
a different config.
s390, loongarch and riscv don't have devdax support. So the DAX config is
not enabled for them. With this change, arm64 should be able to select
DAX optimization
[1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP")
Link: https://lkml.kernel.org/r/20230724190759.483013-8-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
pudp_set_wrprotect and move_huge_pud helpers are only used when
CONFIG_TRANSPARENT_HUGEPAGE is enabled. Similar to pmdp_set_wrprotect and
move_huge_pmd_helpers use architecture override only if
CONFIG_TRANSPARENT_HUGEPAGE is set
Link: https://lkml.kernel.org/r/20230724190759.483013-7-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Architectures like powerpc will like to use different page table
allocators and mapping mechanisms to implement vmemmap optimization.
Similar to vmemmap_populate allow architectures to implement
vmemap_populate_compound_pages
Link: https://lkml.kernel.org/r/20230724190759.483013-5-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
dax vmemmap optimization requires a minimum of 2 PAGE_SIZE area within
vmemmap such that tail page mapping can point to the second PAGE_SIZE
area. Enforce that in vmemmap_can_optimize() function.
Architectures like powerpc also want to enable vmemmap optimization
conditionally (only with radix MMU translation). Hence allow architecture
override.
Link: https://lkml.kernel.org/r/20230724190759.483013-4-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We will use this in a later patch to do tlb flush when clearing pud
entries on powerpc. This is similar to commit 93a98695f2f9 ("mm: change
pmdp_huge_get_and_clear_full take vm_area_struct as arg")
Link: https://lkml.kernel.org/r/20230724190759.483013-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Add support for DAX vmemmap optimization for ppc64", v6.
This patch series implements changes required to support DAX vmemmap
optimization for ppc64. The vmemmap optimization is only enabled with
radix MMU translation and 1GB PUD mapping with 64K page size.
The patch series also splits the hugetlb vmemmap optimization as a
separate Kconfig variable so that architectures can enable DAX vmemmap
optimization without enabling hugetlb vmemmap optimization. This should
enable architectures like arm64 to enable DAX vmemmap optimization while
they can't enable hugetlb vmemmap optimization. More details of the same
are in patch "mm/vmemmap optimization: Split hugetlb and devdax vmemmap
optimization".
With 64K page size for 16384 pages added (1G) we save 14 pages
With 4K page size for 262144 pages added (1G) we save 4094 pages
With 4K page size for 512 pages added (2M) we save 6 pages
This patch (of 13):
Architectures like powerpc would like to enable transparent huge page pud
support only with radix translation. To support that add
has_transparent_pud_hugepage() helper that architectures can override.
[aneesh.kumar@linux.ibm.com: use the new has_transparent_pud_hugepage()]
Link: https://lkml.kernel.org/r/87tttrvtaj.fsf@linux.ibm.com
Link: https://lkml.kernel.org/r/20230724190759.483013-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20230724190759.483013-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Move FAULT_FLAG_VMA_LOCK check out of handle_pte_fault(). This should
have a significant performance improvement for mmaped files. Write faults
(on read-only shared pages) still take the mmap lock as we do not want to
audit all the implementations of ->pfn_mkwrite() and ->page_mkwrite().
However write-faults on private mappings are handled under the VMA lock.
[willy@infradead.org: address "suspicious RCU usage" warning]
Link: https://lkml.kernel.org/r/ZMK7jwpI4uD6tKrF@casper.infradead.org
Link: https://lkml.kernel.org/r/20230724185410.1124082-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Move the FAULT_FLAG_VMA_LOCK check down in handle_pte_fault(). This is
probably not a huge win in its own right, but is a nicely separable bit
from the next patch.
Link: https://lkml.kernel.org/r/20230724185410.1124082-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The map_pages fs method should be safe to run under the VMA lock instead
of the mmap lock. This should have a measurable reduction in contention
on the mmap lock.
Link: https://lkml.kernel.org/r/20230724185410.1124082-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Perform the check at the start of do_read_fault(), do_cow_fault() and
do_shared_fault() instead. Should be no performance change from the last
commit.
Link: https://lkml.kernel.org/r/20230724185410.1124082-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Call do_pte_missing() under the VMA lock ... then immediately retry in
do_fault().
Link: https://lkml.kernel.org/r/20230724185410.1124082-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Push the VMA_LOCK check down from __handle_mm_fault() to
handle_pte_fault(). Once again, we refuse to call ->huge_fault() with the
VMA lock held, but we will wait for a PMD migration entry with the VMA
lock held, handle NUMA migration and set the accessed bit. We were
already doing this for anonymous VMAs, so it should be safe.
Link: https://lkml.kernel.org/r/20230724185410.1124082-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Postpone checking the VMA_LOCK flag until we've attempted to handle faults
on PUDs. There's a mild upside to this patch in that we'll allocate the
page tables while under the VMA lock rather than the mmap lock, reducing
the hold time on the mmap lock, since the retry will find the page tables
already populated. The real purpose here is to make a commit that shows
we don't call ->huge_fault under the VMA lock. We do now handle setting
the accessed bit on a PUD fault under the VMA lock, but that doesn't seem
likely to be a measurable difference.
Link: https://lkml.kernel.org/r/20230724185410.1124082-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>