Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - More userfaultfd work from Peter Xu

 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying

 - Some filemap cleanups from Vishal Moola

 - David Hildenbrand added the ability to selftest anon memory COW handling

 - Some cpuset simplifications from Liu Shixin

 - Addition of vmalloc tracing support by Uladzislau Rezki

 - Some pagecache folioifications and simplifications from Matthew Wilcox

 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it

 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword. This series should have been in
   the non-MM tree, my bad

 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages

 - DAMON cleanups and tuneups from SeongJae Park

 - Tony Luck fixed the handling of COW faults against poisoned pages

 - Peter Xu utilized the PTE marker code for handling swapin errors

 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient

 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand

 - zram support for multiple compression streams from Sergey Senozhatsky

 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround, which
   didn't work very well anyway (a hedged sketch of the new calling
   convention follows the commit list below)

 - Mel Gorman altered the page allocator so that local IRQs can remain
   enabled during per-cpu page allocations

 - Vishal Moola removed the try_to_release_page() wrapper

 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache

 - David Hildenbrand did some cleanup and repair work on KSM COW breaking

 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend

 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range()

 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen

 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests. Better, but still not perfect

 - Christoph Hellwig has removed the .writepage() method from several
   filesystems. They only need .writepages()

 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting

 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines

 - Many singleton patches, as usual

* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
  mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
  mm: mmu_gather: allow more than one batch of delayed rmaps
  mm: fix typo in struct pglist_data code comment
  kmsan: fix memcpy tests
  mm: add cond_resched() in swapin_walk_pmd_entry()
  mm: do not show fs mm pc for VM_LOCKONFAULT pages
  selftests/vm: ksm_functional_tests: fixes for 32bit
  selftests/vm: cow: fix compile warning on 32bit
  selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
  mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
  mm,thp,rmap: fix races between updates of subpages_mapcount
  mm: memcg: fix swapcached stat accounting
  mm: add nodes= arg to memory.reclaim
  mm: disable top-tier fallback to reclaim on proactive reclaim
  selftests: cgroup: make sure reclaim target memcg is unprotected
  selftests: cgroup: refactor proactive reclaim code to reclaim_until()
  mm: memcg: fix stale protection of reclaim target memcg
  mm/mmap: properly unaccount memory on mas_preallocate() failure
  omfs: remove ->writepage
  jfs: remove ->writepage
  ...
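On the GUP rework above, here is a minimal, hedged sketch of what a driver's long-term read-only pin can look like after this series. The helper name pin_ro_user_buffer() and its error policy are invented for illustration; pin_user_pages_fast(), FOLL_LONGTERM and unpin_user_pages() are the existing GUP entry points. The point of the rework is that a plain R/O pin (no FOLL_WRITE, no FOLL_FORCE) is now reliable against COW.

#include <linux/mm.h>

/* Illustrative sketch: pin nr_pages of user memory read-only, long-term. */
static int pin_ro_user_buffer(unsigned long uaddr, int nr_pages,
			      struct page **pages)
{
	int pinned;

	/* No FOLL_WRITE and no FOLL_FORCE: a plain read-only long-term pin. */
	pinned = pin_user_pages_fast(uaddr, nr_pages, FOLL_LONGTERM, pages);
	if (pinned < 0)
		return pinned;			/* -errno from GUP */

	if (pinned != nr_pages) {
		/* Partial pin: release what we got and report failure. */
		unpin_user_pages(pages, pinned);
		return -EFAULT;
	}

	return 0;
}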
// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (c) 2014, The Linux Foundation. All rights reserved.
 */
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/vmalloc.h>

#include <asm/cacheflush.h>
#include <asm/set_memory.h>
#include <asm/tlbflush.h>
struct page_change_data {
	pgprot_t set_mask;
	pgprot_t clear_mask;
};

bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
bool can_set_direct_map(void)
{
	/*
	 * rodata_full, DEBUG_PAGEALLOC and KFENCE require linear map to be
	 * mapped at page granularity, so that it is possible to
	 * protect/unprotect single pages.
	 */
	return (rodata_enabled && rodata_full) || debug_pagealloc_enabled() ||
		IS_ENABLED(CONFIG_KFENCE);
}
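/* apply_to_page_range() callback: apply the set/clear masks to one PTE. */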
static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
{
	struct page_change_data *cdata = data;
	pte_t pte = READ_ONCE(*ptep);

	pte = clear_pte_bit(pte, cdata->clear_mask);
	pte = set_pte_bit(pte, cdata->set_mask);

	set_pte(ptep, pte);
	return 0;
}
/*
 * This function assumes that the range is mapped with PAGE_SIZE pages.
 */
static int __change_memory_common(unsigned long start, unsigned long size,
				pgprot_t set_mask, pgprot_t clear_mask)
{
	struct page_change_data data;
	int ret;

	data.set_mask = set_mask;
	data.clear_mask = clear_mask;

	ret = apply_to_page_range(&init_mm, start, size, change_page_range,
					&data);

	flush_tlb_kernel_range(start, start + size);
	return ret;
}
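/*
 * Change protection bits for [addr, addr + numpages * PAGE_SIZE). The range
 * must be covered by exactly one vmalloc/vmap area (see the comment below);
 * for read-only changes, the linear-map alias of the backing pages is
 * updated as well.
 */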
static int change_memory_common(unsigned long addr, int numpages,
				pgprot_t set_mask, pgprot_t clear_mask)
{
	unsigned long start = addr;
	unsigned long size = PAGE_SIZE * numpages;
	unsigned long end = start + size;
	struct vm_struct *area;
	int i;

	if (!PAGE_ALIGNED(addr)) {
		start &= PAGE_MASK;
		end = start + size;
		WARN_ON_ONCE(1);
	}

	/*
	 * Kernel VA mappings are always live, and splitting live section
	 * mappings into page mappings may cause TLB conflicts. This means
	 * we have to ensure that changing the permission bits of the range
	 * we are operating on does not result in such splitting.
	 *
	 * Let's restrict ourselves to mappings created by vmalloc (or vmap).
	 * Those are guaranteed to consist entirely of page mappings, and
	 * splitting is never needed.
	 *
	 * So check whether the [addr, addr + size) interval is entirely
	 * covered by precisely one VM area that has the VM_ALLOC flag set.
	 */
	area = find_vm_area((void *)addr);
	if (!area ||
	    end > (unsigned long)kasan_reset_tag(area->addr) + area->size ||
	    !(area->flags & VM_ALLOC))
		return -EINVAL;

	if (!numpages)
		return 0;

	/*
	 * If we are manipulating read-only permissions, apply the same
	 * change to the linear mapping of the pages that back this VM area.
	 */
	if (rodata_enabled &&
	    rodata_full && (pgprot_val(set_mask) == PTE_RDONLY ||
			    pgprot_val(clear_mask) == PTE_RDONLY)) {
		for (i = 0; i < area->nr_pages; i++) {
			__change_memory_common((u64)page_address(area->pages[i]),
					       PAGE_SIZE, set_mask, clear_mask);
		}
	}

	/*
	 * Get rid of potentially aliasing lazily unmapped vm areas that may
	 * have permissions set that deviate from the ones we are setting here.
	 */
	vm_unmap_aliases();

	return __change_memory_common(start, size, set_mask, clear_mask);
}
int set_memory_ro(unsigned long addr, int numpages)
{
	return change_memory_common(addr, numpages,
					__pgprot(PTE_RDONLY),
					__pgprot(PTE_WRITE));
}

int set_memory_rw(unsigned long addr, int numpages)
{
	return change_memory_common(addr, numpages,
					__pgprot(PTE_WRITE),
					__pgprot(PTE_RDONLY));
}
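/*
 * Usage sketch (illustrative only, not part of this file): a module that
 * write-protects a vmalloc'ed table once it has been initialized. vmalloc()
 * memory is page-aligned and lives in a VM_ALLOC area, so it satisfies the
 * change_memory_common() checks above; passing a non-vmalloc address would
 * return -EINVAL instead.
 *
 *	static u32 *table;
 *
 *	static int __init protect_table(void)
 *	{
 *		table = vmalloc(PAGE_SIZE);
 *		if (!table)
 *			return -ENOMEM;
 *		table[0] = 0xdeadbeef;	// populate while still writable
 *		return set_memory_ro((unsigned long)table, 1);
 *	}
 */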
int set_memory_nx(unsigned long addr, int numpages)
{
	return change_memory_common(addr, numpages,
					__pgprot(PTE_PXN),
					__pgprot(PTE_MAYBE_GP));
}

int set_memory_x(unsigned long addr, int numpages)
{
	return change_memory_common(addr, numpages,
					__pgprot(PTE_MAYBE_GP),
					__pgprot(PTE_PXN));
}
int set_memory_valid(unsigned long addr, int numpages, int enable)
{
	if (enable)
		return __change_memory_common(addr, PAGE_SIZE * numpages,
					__pgprot(PTE_VALID),
					__pgprot(0));
	else
		return __change_memory_common(addr, PAGE_SIZE * numpages,
					__pgprot(0),
					__pgprot(PTE_VALID));
}
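/*
 * The two helpers below remove a page from, or restore it to, the kernel
 * linear (direct) map without flushing the TLB; as the "noflush" names
 * suggest, TLB maintenance is left to the caller. They are no-ops unless
 * can_set_direct_map() says the linear map is page-granular.
 */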
int set_direct_map_invalid_noflush(struct page *page)
{
	struct page_change_data data = {
		.set_mask = __pgprot(0),
		.clear_mask = __pgprot(PTE_VALID),
	};

	if (!can_set_direct_map())
		return 0;

	return apply_to_page_range(&init_mm,
				   (unsigned long)page_address(page),
				   PAGE_SIZE, change_page_range, &data);
}

int set_direct_map_default_noflush(struct page *page)
{
	struct page_change_data data = {
		.set_mask = __pgprot(PTE_VALID | PTE_WRITE),
		.clear_mask = __pgprot(PTE_RDONLY),
	};

	if (!can_set_direct_map())
		return 0;

	return apply_to_page_range(&init_mm,
				   (unsigned long)page_address(page),
				   PAGE_SIZE, change_page_range, &data);
}
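/*
 * DEBUG_PAGEALLOC hook: the page allocator invalidates pages in the linear
 * map as they are freed and restores them on allocation, so that stray
 * accesses to freed memory fault immediately.
 */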
#ifdef CONFIG_DEBUG_PAGEALLOC
void __kernel_map_pages(struct page *page, int numpages, int enable)
{
	if (!can_set_direct_map())
		return;

	set_memory_valid((unsigned long)page_address(page), numpages, enable);
}
#endif /* CONFIG_DEBUG_PAGEALLOC */
/*
 * This function is used to determine if a linear map page has been marked as
 * not-valid. Walk the page table and check the PTE_VALID bit.
 *
 * Because this is only called on the kernel linear map, p?d_sect() implies
 * p?d_present(). When debug_pagealloc is enabled, section mappings are
 * disabled.
 */
bool kernel_page_present(struct page *page)
{
	pgd_t *pgdp;
	p4d_t *p4dp;
	pud_t *pudp, pud;
	pmd_t *pmdp, pmd;
	pte_t *ptep;
	unsigned long addr = (unsigned long)page_address(page);

	if (!can_set_direct_map())
		return true;

	pgdp = pgd_offset_k(addr);
	if (pgd_none(READ_ONCE(*pgdp)))
		return false;

	p4dp = p4d_offset(pgdp, addr);
	if (p4d_none(READ_ONCE(*p4dp)))
		return false;

	pudp = pud_offset(p4dp, addr);
	pud = READ_ONCE(*pudp);
	if (pud_none(pud))
		return false;
	if (pud_sect(pud))
		return true;

	pmdp = pmd_offset(pudp, addr);
	pmd = READ_ONCE(*pmdp);
	if (pmd_none(pmd))
		return false;
	if (pmd_sect(pmd))
		return true;

	ptep = pte_offset_kernel(pmdp, addr);
	return pte_valid(READ_ONCE(*ptep));
}
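/*
 * Caller sketch (illustrative; loosely modelled on how hibernation copies
 * pages): consult kernel_page_present() before touching a page through the
 * linear map, since DEBUG_PAGEALLOC or rodata_full may have unmapped it.
 *
 *	static int copy_page_if_present(void *dst, struct page *page)
 *	{
 *		if (!kernel_page_present(page))
 *			return -EFAULT;
 *		copy_page(dst, page_address(page));
 *		return 0;
 *	}
 */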