IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Address the following coccicheck warning:
tools/testing/selftests/vm/userfaultfd.c:1536:21-22: WARNING opportunity
for swap().
tools/testing/selftests/vm/userfaultfd.c:1540:33-34: WARNING opportunity
for swap().
by using swap() for the swapping of variable values and drop
`tmp_area` that is not needed any more.
`swap()` macro in userfaultfd.c is introduced in commit 681696862b
("selftests: vm: remove dependecy from internal kernel macros")
It has been tested with gcc (Debian 8.3.0-6) 8.3.0.
Link: https://lkml.kernel.org/r/20220407123141.4998-1-guozhengkui@vivo.com
Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When we're trying to collapse a 2M huge shmem page, don't retract pgtable
pmd page if it's registered with uffd-wp, because that pgtable could have
pte markers installed. Recycling of that pgtable means we'll lose the pte
markers. That could cause data loss for an uffd-wp enabled application on
shmem.
Instead of disabling khugepaged on these files, simply skip retracting
these special VMAs, then the page cache can still be merged into a huge
thp, and other mm/vma can still map the range of file with a huge thp when
proper.
Note that checking VM_UFFD_WP needs to be done with mmap_sem held for
write, that avoids race like:
khugepaged user thread
========== ===========
check VM_UFFD_WP, not set
UFFDIO_REGISTER with uffd-wp on shmem
wr-protect some pages (install markers)
take mmap_sem write lock
erase pmd and free pmd page
--> pte markers are dropped unnoticed!
Link: https://lkml.kernel.org/r/20220405014921.14994-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
if unmapping an entire vma or synchronized such that faults can not race
with the unmap operation. This requires passing zap_flags all the way to
the lowest level hugetlb unmap routine: __unmap_hugepage_range.
In general, unmap calls originated in hugetlbfs code will pass the
ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent
faults. The exception is hole punch which will first unmap without any
synchronization. Later when hole punch actually removes the page from the
file, it will check to see if there was a subsequent fault and if so take
the hugetlb fault mutex while unmapping again. This second unmap will
pass in ZAP_FLAG_DROP_MARKER.
The justification of "whether to apply ZAP_FLAG_DROP_MARKER flag when
unmap a hugetlb range" is (IMHO): we should never reach a state when a
page fault could errornously fault in a page-cache page that was
wr-protected to be writable, even in an extremely short period. That
could happen if e.g. we pass ZAP_FLAG_DROP_MARKER when
hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(), because if a page
faults after that call and before remove_inode_hugepages() is executed,
the page cache can be mapped writable again in the small racy window, that
can cause unexpected data overwritten.
[peterx@redhat.com: fix sparse warning]
Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
[akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We don't have "huge" version of pte markers, instead when necessary we
split the thp.
However split the thp is not enough, because file-backed thp is handled
totally differently comparing to anonymous thps: rather than doing a real
split, the thp pmd will simply got cleared in __split_huge_pmd_locked().
That is not enough if e.g. when there is a thp covers range [0, 2M) but
we want to wr-protect small page resides in [4K, 8K) range, because after
__split_huge_pmd() returns, there will be a none pmd, and
change_pmd_range() will just skip it right after the split.
Here we leverage the previously introduced change_pmd_prepare() macro so
that we'll populate the pmd with a pgtable page after the pmd split (in
which process the pmd will be cleared for cases like shmem). Then
change_pte_range() will do all the rest for us by installing the uffd-wp
pte marker at any none pte that we'd like to wr-protect.
Link: https://lkml.kernel.org/r/20220405014852.14413-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
File-backed memory differs from anonymous memory in that even if the pte
is missing, the data could still resides either in the file or in
page/swap cache. So when wr-protect a pte, we need to consider none ptes
too.
We do that by installing the uffd-wp pte markers when necessary. So when
there's a future write to the pte, the fault handler will go the special
path to first fault-in the page as read-only, then report to userfaultfd
server with the wr-protect message.
On the other hand, when unprotecting a page, it's also possible that the
pte got unmapped but replaced by the special uffd-wp marker. Then we'll
need to be able to recover from a uffd-wp pte marker into a none pte, so
that the next access to the page will fault in correctly as usual when
accessed the next time.
Special care needs to be taken throughout the change_protection_range()
process. Since now we allow user to wr-protect a none pte, we need to be
able to pre-populate the page table entries if we see (!anonymous &&
MM_CP_UFFD_WP) requests, otherwise change_protection_range() will always
skip when the pgtable entry does not exist.
For example, the pgtable can be missing for a whole chunk of 2M pmd, but
the page cache can exist for the 2M range. When we want to wr-protect one
4K page within the 2M pmd range, we need to pre-populate the pgtable and
install the pte marker showing that we want to get a message and block the
thread when the page cache of that 4K page is written. Without
pre-populating the pmd, change_protection() will simply skip that whole
pmd.
Note that this patch only covers the small pages (pte level) but not
covering any of the transparent huge pages yet. That will be done later,
and this patch will be a preparation for it too.
Link: https://lkml.kernel.org/r/20220405014850.14352-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
File-backed memory is prone to being unmapped at any time. It means all
information in the pte will be dropped, including the uffd-wp flag.
To persist the uffd-wp flag, we'll use the pte markers. This patch
teaches the zap code to understand uffd-wp and know when to keep or drop
the uffd-wp bit.
Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we
don't want to persist such an information, for example, when destroying
the whole vma, or punching a hole in a shmem file. For the rest cases we
should never drop the uffd-wp bit, or the wr-protect information will get
lost.
The new ZAP_FLAG_DROP_MARKER needs to be put into mm.h rather than
memory.c because it'll be further referenced in hugetlb files later.
Link: https://lkml.kernel.org/r/20220405014847.14295-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
File-backed memories are prone to unmap/swap so the ptes are always
unstable, because they can be easily faulted back later using the page
cache. This could lead to uffd-wp getting lost when unmapping or swapping
out such memory. One example is shmem. PTE markers are needed to store
those information.
This patch prepares it by handling uffd-wp pte markers first it is applied
elsewhere, so that the page fault handler can recognize uffd-wp pte
markers.
The handling of uffd-wp pte markers is similar to missing fault, it's just
that we'll handle this "missing fault" when we see the pte markers,
meanwhile we need to make sure the marker information is kept during
processing the fault.
This is a slow path of uffd-wp handling, because zapping of wr-protected
shmem ptes should be rare. So far it should only trigger in two
conditions:
(1) When trying to punch holes in shmem_fallocate(), there is an
optimization to zap the pgtables before evicting the page.
(2) When swapping out shmem pages.
Because of this, the page fault handling is simplifed too by not sending
the wr-protect message in the 1st page fault, instead the page will be
installed read-only, so the uffd-wp message will be generated in the next
fault, which will trigger the do_wp_page() path of general uffd-wp
handling.
Disable fault-around for all uffd-wp registered ranges for extra safety
just like uffd-minor fault, and clean the code up.
Link: https://lkml.kernel.org/r/20220405014844.14239-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch allows do_fault() to trigger on !pte_none() cases too. This
prepares for the pte markers to be handled by do_fault() just like none
pte.
To achieve this, instead of unconditionally check against pte_none() in
finish_fault(), we may hit the case that the orig_pte was some pte marker
so what we want to do is to replace the pte marker with some valid pte
entry. Then if orig_pte was set we'd want to check the current *pte
(under pgtable lock) against orig_pte rather than none pte.
Right now there's no solid way to safely reference orig_pte because when
pmd is not allocated handle_pte_fault() will not initialize orig_pte, so
it's not safe to reference it.
There's another solution proposed before this patch to do pte_clear() for
vmf->orig_pte for pmd==NULL case, however it turns out it'll break arm32
because arm32 could have assumption that pte_t* pointer will always reside
on a real ram32 pgtable, not any kernel stack variable.
To solve this, we add a new flag FAULT_FLAG_ORIG_PTE_VALID, and it'll be
set along with orig_pte when there is valid orig_pte, or it'll be cleared
when orig_pte was not initialized.
It'll be updated every time we call handle_pte_fault(), so e.g. if a page
fault retry happened it'll be properly updated along with orig_pte.
[1] https://lore.kernel.org/lkml/710c48c9-406d-e4c5-a394-10501b951316@samsung.com/
[akpm@linux-foundation.org: coding-style cleanups]
[peterx@redhat.com: fix crash reported by Marek]
Link: https://lkml.kernel.org/r/Ylb9rXJyPm8/ao8f@xz-m1.local
Link: https://lkml.kernel.org/r/20220405014836.14077-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "userfaultfd-wp: Support shmem and hugetlbfs", v8.
Overview
========
Userfaultfd-wp anonymous support was merged two years ago. There're quite
a few applications that started to leverage this capability either to take
snapshots for user-app memory, or use it for full user controled swapping.
This series tries to complete the feature for uffd-wp so as to cover all
the RAM-based memory types. So far uffd-wp is the only missing piece of
the rest features (uffd-missing & uffd-minor mode).
One major reason to do so is that anonymous pages are sometimes not
satisfying the need of applications, and there're growing users of either
shmem and hugetlbfs for either sharing purpose (e.g., sharing guest mem
between hypervisor process and device emulation process, shmem local live
migration for upgrades), or for performance on tlb hits.
All these mean that if a uffd-wp app wants to switch to any of the memory
types, it'll stop working. I think it's worthwhile to have the kernel to
cover all these aspects.
This series chose to protect pages in pte level not page level.
One major reason is safety. I have no idea how we could make it safe if
any of the uffd-privileged app can wr-protect a page that any other
application can use. It means this app can block any process potentially
for any time it wants.
The other reason is that it aligns very well with not only the anonymous
uffd-wp solution, but also uffd as a whole. For example, userfaultfd is
implemented fundamentally based on VMAs. We set flags to VMAs showing the
status of uffd tracking. For another per-page based protection solution,
it'll be crossing the fundation line on VMA-based, and it could simply be
too far away already from what's called userfaultfd.
PTE markers
===========
The patchset is based on the idea called PTE markers. It was discussed in
one of the mm alignment sessions, proposed starting from v6, and this is
the 2nd version of it using PTE marker idea.
PTE marker is a new type of swap entry that is ony applicable to file
backed memories like shmem and hugetlbfs. It's used to persist some
pte-level information even if the original present ptes in pgtable are
zapped.
Logically pte markers can store more than uffd-wp information, but so far
only one bit is used for uffd-wp purpose. When the pte marker is
installed with uffd-wp bit set, it means this pte is wr-protected by uffd.
It solves the problem on e.g. file-backed memory mapped ptes got zapped
due to any reason (e.g. thp split, or swapped out), we can still keep the
wr-protect information in the ptes. Then when the page fault triggers
again, we'll know this pte is wr-protected so we can treat the pte the
same as a normal uffd wr-protected pte.
The extra information is encoded into the swap entry, or swp_offset to be
explicit, with the swp_type being PTE_MARKER. So far uffd-wp only uses
one bit out of the swap entry, the rest bits of swp_offset are still
reserved for other purposes.
There're two configs to enable/disable PTE markers:
CONFIG_PTE_MARKER
CONFIG_PTE_MARKER_UFFD_WP
We can set !PTE_MARKER to completely disable all the PTE markers, along
with uffd-wp support. I made two config so we can also enable PTE marker
but disable uffd-wp file-backed for other purposes. At the end of current
series, I'll enable CONFIG_PTE_MARKER by default, but that patch is
standalone and if anyone worries about having it by default, we can also
consider turn it off by dropping that oneliner patch. So far I don't see
a huge risk of doing so, so I kept that patch.
In most cases, PTE markers should be treated as none ptes. It is because
that unlike most of the other swap entry types, there's no PFN or block
offset information encoded into PTE markers but some extra well-defined
bits showing the status of the pte. These bits should only be used as
extra data when servicing an upcoming page fault, and then we behave as if
it's a none pte.
I did spend a lot of time observing all the pte_none() users this time.
It is indeed a challenge because there're a lot, and I hope I didn't miss
a single of them when we should take care of pte markers. Luckily, I
don't think it'll need to be considered in many cases, for example: boot
code, arch code (especially non-x86), kernel-only page handlings (e.g.
CPA), or device driver codes when we're tackling with pure PFN mappings.
I introduced pte_none_mostly() in this series when we need to handle pte
markers the same as none pte, the "mostly" is the other way to write
"either none pte or a pte marker".
I didn't replace pte_none() to cover pte markers for below reasons:
- Very rare case of pte_none() callers will handle pte markers. E.g., all
the kernel pages do not require knowledge of pte markers. So we don't
pollute the major use cases.
- Unconditionally change pte_none() semantics could confuse people, because
pte_none() existed for so long a time.
- Unconditionally change pte_none() semantics could make pte_none() slower
even if in many cases pte markers do not exist.
- There're cases where we'd like to handle pte markers differntly from
pte_none(), so a full replace is also impossible. E.g. khugepaged should
still treat pte markers as normal swap ptes rather than none ptes, because
pte markers will always need a fault-in to merge the marker with a valid
pte. Or the smap code will need to parse PTE markers not none ptes.
Patch Layout
============
Introducing PTE marker and uffd-wp bit in PTE marker:
mm: Introduce PTE_MARKER swap entry
mm: Teach core mm about pte markers
mm: Check against orig_pte for finish_fault()
mm/uffd: PTE_MARKER_UFFD_WP
Adding support for shmem uffd-wp:
mm/shmem: Take care of UFFDIO_COPY_MODE_WP
mm/shmem: Handle uffd-wp special pte in page fault handler
mm/shmem: Persist uffd-wp bit across zapping for file-backed
mm/shmem: Allow uffd wr-protect none pte for file-backed mem
mm/shmem: Allows file-back mem to be uffd wr-protected on thps
mm/shmem: Handle uffd-wp during fork()
Adding support for hugetlbfs uffd-wp:
mm/hugetlb: Introduce huge pte version of uffd-wp helpers
mm/hugetlb: Hook page faults for uffd write protection
mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
mm/hugetlb: Handle UFFDIO_WRITEPROTECT
mm/hugetlb: Handle pte markers in page faults
mm/hugetlb: Allow uffd wr-protect none ptes
mm/hugetlb: Only drop uffd-wp special pte if required
mm/hugetlb: Handle uffd-wp during fork()
Misc handling on the rest mm for uffd-wp file-backed:
mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
Enabling of uffd-wp on file-backed memory:
mm/uffd: Enable write protection for shmem & hugetlbfs
mm: Enable PTE markers by default
selftests/uffd: Enable uffd-wp for shmem/hugetlbfs
Tests
=====
- Compile test on x86_64 and aarch64 on different configs
- Kernel selftests
- uffd-test [0]
- Umapsort [1,2] test for shmem/hugetlb, with swap on/off
[0] https://github.com/xzpeter/clibs/tree/master/uffd-test
[1] https://github.com/xzpeter/umap-apps/tree/peter
[2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs
This patch (of 23):
Introduces a new swap entry type called PTE_MARKER. It can be installed
for any pte that maps a file-backed memory when the pte is temporarily
zapped, so as to maintain per-pte information.
The information that kept in the pte is called a "marker". Here we define
the marker as "unsigned long" just to match pgoff_t, however it will only
work if it still fits in swp_offset(), which is e.g. currently 58 bits on
x86_64.
A new config CONFIG_PTE_MARKER is introduced too; it's by default off. A
bunch of helpers are defined altogether to service the rest of the pte
marker code.
[peterx@redhat.com: fixup]
Link: https://lkml.kernel.org/r/Yk2rdB7SXZf+2BDF@xz-m1.local
Link: https://lkml.kernel.org/r/20220405014646.13522-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220405014646.13522-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_RECLAIM reads the user input parameters only when it starts. To
allow more efficient online tuning, this commit implements a new input
parameter called 'commit_inputs'. Writing true to the parameter makes
DAMON_RECLAIM reads the input parameters again.
Link: https://lkml.kernel.org/r/20220429160606.127307-14-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, DAMON sysfs interface doesn't provide a way for adjusting DAMON
input parameters while it is turned on. Therefore, users who want to
reconfigure DAMON need to stop DAMON and restart. This means all the
monitoring results that accumulated so far, which could be useful, should
be flushed. This would be inefficient for many cases.
For an example, let's suppose a sysadmin was running a DAMON-based
Operation Scheme to find memory regions not accessed for more than 5 mins
and page out the regions. If it turns out the 5 mins threshold was too
long and therefore the sysadmin wants to reduce it to 4 mins, the sysadmin
should turn off DAMON, restart it, and wait for at least 4 more minutes so
that DAMON can find the cold memory regions, even though DAMON was knowing
there are regions that not accessed for 4 mins at the time of shutdown.
This commit makes DAMON sysfs interface to support online DAMON input
parameters updates by adding a new input keyword for the 'state' DAMON
sysfs file, 'commit'. Writing the keyword to the 'state' file while the
corresponding kdamond is running makes the kdamond to read the sysfs file
values again and update the DAMON context.
Link: https://lkml.kernel.org/r/20220429160606.127307-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Only '->kdamond' and '->kdamond_stop' are protected by 'kdamond_lock' of
'struct damon_ctx'. All other DAMON context internal data items are
recommended to be accessed in DAMON callbacks, or under some additional
synchronizations. But, DAMON sysfs is accessing the schemes stat under
'kdamond_lock'.
It makes no big issue as the read values are not used anywhere inside
kernel, but would better to be fixed. This commit moves the reads to
DAMON callback context, as supposed to be used for the purpose.
Link: https://lkml.kernel.org/r/20220429160606.127307-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON sysfs 'state' file handling code is using string literals in both
'state_show()' and 'state_store()'. This makes the code error prone and
inflexible for future extensions.
To improve the situation, this commit defines possible input strings and
'enum' for identifying each input keyword only once, and refactors the
code to reuse those.
Link: https://lkml.kernel.org/r/20220429160606.127307-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
'damon_va_apply_three_regions()' is for adjusting address ranges to fit in
three discontiguous ranges. The function can be generalized for arbitrary
number of discontiguous ranges and reused for future usage, such as
arbitrary online regions update. For such future usage, this commit
introduces a generalized version of the function called
'damon_set_regions()'.
Link: https://lkml.kernel.org/r/20220429160606.127307-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When 'after_sampling()' or 'after_aggregation()' DAMON callbacks return an
error, kdamond continues the remaining loop once. It makes no much sense
to run the remaining part while something wrong already happened. The
context might be corrupted or having invalid data. This commit therefore
makes kdamond skips the remaining works and immediately finish in the
cases.
Link: https://lkml.kernel.org/r/20220429160606.127307-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: Support online tuning".
Effects of DAMON and DAMON-based Operation Schemes highly depends on the
configurations. Wrong configurations could even result in unexpected
efficiency degradations. For finding a best configuration, repeating
incremental configuration changes and results measurements, in other
words, online tuning, could be helpful.
Nevertheless, DAMON kernel API supports only restrictive online tuning.
Worse yet, the sysfs-based DAMON user interface doesn't support online
tuning at all. DAMON_RECLAIM also doesn't support online tuning.
This patchset makes the DAMON kernel API, DAMON sysfs interface, and
DAMON_RECLAIM supports online tuning.
Sequence of patches
-------------------
First two patches enhance DAMON online tuning for kernel API users.
Specifically, patch 1 let kernel API users to be able to do DAMON online
tuning without a restriction, and patch 2 makes error handling easier.
Following seven patches (patches 3-9) refactor code for better readability
and easier reuse of code fragments that will be useful for online tuning
support.
Patch 10 introduces DAMON callback based user request handling structure
for DAMON sysfs interface, and patch 11 enables DAMON online tuning via
DAMON sysfs interface. Documentation patch (patch 12) for usage of it
follows.
Patch 13 enables online tuning of DAMON_RECLAIM and finally patch 14
documents the DAMON_RECLAIM online tuning usage.
This patch (of 14):
For updating input parameters for running DAMON contexts, DAMON kernel API
users can use the contexts' callbacks, as it is the safe place for context
internal data accesses. When the context has DAMON-based operation
schemes and all schemes are deactivated due to their watermarks, however,
DAMON does nothing but only watermarks checks. As a result, no callbacks
will be called back, and therefore the kernel API users cannot update the
input parameters including monitoring attributes, DAMON-based operation
schemes, and watermarks.
To let users easily update such DAMON input parameters in such a case,
this commit adds a new callback, 'after_wmarks_check()'. It will be
called after each watermarks check. Users can do the online input
parameters update in the callback even under the schemes deactivated case.
Link: https://lkml.kernel.org/r/20220429160606.127307-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>