IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Switching mm and tinkering with current->active_mm should be done with
irqs disabled. There is a path where exit_lazy_flush_tlb can be called
with irqs enabled:
exit_lazy_flush_tlb
flush_type_needed
__flush_all_mm
tlb_finish_mmu
exit_mmap
Which results in the switching being done with irqs enabled, which is
incorrect.
Fixes: a665eec0a22e ("powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm")
Cc: stable@vger.kernel.org # v5.10+
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lore.kernel.org/linuxppc-dev/A9A5D83D-BA70-47A4-BCB4-30C1AE19BC22@linux.ibm.com/
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230607005601.583293-1-npiggin@gmail.com
It was reported that soft dirty tracking doesn't work when using the
Radix MMU.
The tracking is supposed to work by clearing the soft dirty bit for a
mapping and then write protecting the PTE. If/when the page is written
to, a page fault occurs and the soft dirty bit is added back via
pte_mkdirty(). For example in wp_page_reuse():
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
Unfortunately on radix _PAGE_SOFTDIRTY is being dropped by
radix__ptep_set_access_flags(), called from ptep_set_access_flags(),
meaning the soft dirty bit is not set even though the page has been
written to.
Fix it by adding _PAGE_SOFTDIRTY to the set of bits that are able to be
changed in radix__ptep_set_access_flags().
Fixes: b0b5e9b13047 ("powerpc/mm/radix: Add radix pte #defines")
Cc: stable@vger.kernel.org # v4.7+
Reported-by: Dan Horák <dan@danny.cz>
Link: https://lore.kernel.org/r/20230511095558.56663a50f86bdc4cd97700b7@danny.cz
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230511114224.977423-1-mpe@ellerman.id.au
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting.
This makes the lazy tlb mm references more obvious, and allows the
refcounting scheme to be modified in later changes. There is no
functional change with this patch.
Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Support for configuring secure boot with user-defined keys on PowerVM LPARs.
- Simplify the replay of soft-masked IRQs by making it non-recursive.
- Add support for KCSAN on 64-bit Book3S.
- Improvements to the API & code which interacts with RTAS (pseries firmware).
- Change 32-bit powermac to assign PCI bus numbers per domain by default.
- Some improvements to the 32-bit BPF JIT.
- Various other small features and fixes.
Thanks to: Anders Roxell, Andrew Donnellan, Andrew Jeffery, Benjamin Gray, Christophe
Leroy, Frederic Barrat, Ganesh Goudar, Geoff Levand, Greg Kroah-Hartman, Jan-Benedict
Glaw, Josh Poimboeuf, Kajol Jain, Laurent Dufour, Mahesh Salgaonkar, Mathieu Desnoyers,
Mimi Zohar, Murphy Zhou, Nathan Chancellor, Nathan Lynch, Nayna Jain, Nicholas Piggin,
Pali Rohár, Petr Mladek, Rohan McLure, Russell Currey, Sachin Sant, Sathvika Vasireddy,
Sourabh Jain, Stefan Berger, Stephen Rothwell, Sudhakar Kuppusamy.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmP4GnkTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgEnlEAC9UoE9JM853o9ZzpOJDrbYknHsRQad
ztQJ9xu5qjkFHHryTmWKYdiAtNDFbcfn7+1aoc5FXrIb6BOfvBo/uRFw6P501Qwv
Fg0MQyWUnT5WrI7+rBE2q+1+FaHBNKLycLNRSh5JpXtuKe2ubQfiFD80tarBnEnU
6I4bqXd+xjDtnqtpfiYnil/kdZTu/MzntdkmCne6fMkflgEQFU9EVQEnnE+imqFa
6BuCwITvZ+NyaaU+cYMeGZT7aoz9PAwkksgTxXW2gQbTIApX9WX4kYU/vbW4aHts
0bpzMmIbSbAklYIu2PQQhSU0bLfKJ+xly8E8tozHgRX6hrFlqvtmD/T5LHTBD11f
FFzKb0NUCD8qTIy6Hn0M1tj5egLpxxzATPe/kVTkxxqTlZrzdSEaqzft6syyJHJd
ueo0QN53AUyBaVMtxLbnB/U/8Vnz6rLqY+8dLKzXhjYjoPJqOZh/Qlc1Tk3syPwf
E2j4H6wFqGMTOGi453Pijkpj3qpNkNT79FG5DmClcQLJxD/EXDyffLZITrkzQa0S
FEkcMzz/Hn9Hkf7ZuNo4DN6ss6IF0vlxoi7GNr+MRR53/aVQJUDc8z24c4ICl/3w
20ETk57XMVJzP++Hb+yn16JyAawfQOOlckBRZ2O8W5YYVoes45hxDQxVoh8EII69
hb3KOGYEqF5wyA==
=ECNb
-----END PGP SIGNATURE-----
Merge tag 'powerpc-6.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Support for configuring secure boot with user-defined keys on PowerVM
LPARs
- Simplify the replay of soft-masked IRQs by making it non-recursive
- Add support for KCSAN on 64-bit Book3S
- Improvements to the API & code which interacts with RTAS (pseries
firmware)
- Change 32-bit powermac to assign PCI bus numbers per domain by
default
- Some improvements to the 32-bit BPF JIT
- Various other small features and fixes
Thanks to Anders Roxell, Andrew Donnellan, Andrew Jeffery, Benjamin
Gray, Christophe Leroy, Frederic Barrat, Ganesh Goudar, Geoff Levand,
Greg Kroah-Hartman, Jan-Benedict Glaw, Josh Poimboeuf, Kajol Jain,
Laurent Dufour, Mahesh Salgaonkar, Mathieu Desnoyers, Mimi Zohar, Murphy
Zhou, Nathan Chancellor, Nathan Lynch, Nayna Jain, Nicholas Piggin, Pali
Rohár, Petr Mladek, Rohan McLure, Russell Currey, Sachin Sant, Sathvika
Vasireddy, Sourabh Jain, Stefan Berger, Stephen Rothwell, and Sudhakar
Kuppusamy.
* tag 'powerpc-6.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (114 commits)
powerpc/pseries: Avoid hcall in plpks_is_available() on non-pseries
powerpc: dts: turris1x.dts: Set lower priority for CPLD syscon-reboot
powerpc/e500: Add missing prototype for 'relocate_init'
powerpc/64: Fix unannotated intra-function call warning
powerpc/epapr: Don't use wrteei on non booke
powerpc: Pass correct CPU reference to assembler
powerpc/mm: Rearrange if-else block to avoid clang warning
powerpc/nohash: Fix build with llvm-as
powerpc/nohash: Fix build error with binutils >= 2.38
powerpc/pseries: Fix endianness issue when parsing PLPKS secvar flags
macintosh: windfarm: Use unsigned type for 1-bit bitfields
powerpc/kexec_file: print error string on usable memory property update failure
powerpc/machdep: warn when machine_is() used too early
powerpc/64: Replace -mcpu=e500mc64 by -mcpu=e5500
powerpc/eeh: Set channel state after notifying the drivers
selftests/powerpc: Fix incorrect kernel headers search path
powerpc/rtas: arch-wide function token lookup conversions
powerpc/rtas: introduce rtas_function_token() API
powerpc/pseries/lpar: convert to papr_sysparm API
powerpc/pseries/hv-24x7: convert to papr_sysparm API
...
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter". These filters provide users
with finer-grained control over DAMOS's actions. SeongJae has also done
some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series "mm:
support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with his
series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings. The previous BPF-based approach had
shortcomings. See "mm: In-kernel support for memory-deny-write-execute
(MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a per-node
basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage during
compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in ths
series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's series
"mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
"fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest of
the kernel in the series "Simplify the external interface for GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the series
"mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
=MlGs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
Clang warns:
arch/powerpc/mm/book3s64/radix_tlb.c:1191:23: error: variable 'hstart' is uninitialized when used here
__tlbiel_va_range(hstart, hend, pid,
^~~~~~
arch/powerpc/mm/book3s64/radix_tlb.c:1191:31: error: variable 'hend' is uninitialized when used here
__tlbiel_va_range(hstart, hend, pid,
^~~~
Rework the 'if (IS_ENABLE(CONFIG_TRANSPARENT_HUGEPAGE))' so hstart/hend
is always initialized to silence the warnings. That will also simplify
the 'else' path. Clang is getting confused with these warnings, but the
warnings is a false-positive.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220810114318.3220630-1-anders.roxell@linaro.org
The stress_hpt memblock allocation did not pass in an alignment,
which causes a stack dump in early boot (that I missed, oops).
Fixes: 6b34a099faa1 ("powerpc/64s/hash: add stress_hpt kernel boot option to increase hash faults")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221216115930.2667772-2-npiggin@gmail.com
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This looks like it came across from x86, but x86 uses TLB_FLUSH_ALL as
a parameter to internal functions. Powerpc never sets it anywhere.
Remove the associated logic and leave a warning for now.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230203111718.1149852-4-npiggin@gmail.com
The MMU_NO_CONTEXT checks are an unnecessary complication. Make
these warn to prepare to remove them in future.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230203111718.1149852-3-npiggin@gmail.com
need_flush_all is only set by arch code to instruct generic tlb_flush
to flush all. It is never set by powerpc, so it can be removed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230203111718.1149852-2-npiggin@gmail.com
If a relocatable kernel is loaded at a non-zero address and told not to
relocate to zero (kdump or RELOCATABLE_TEST), the mapping of the
interrupt code at zero is left with RWX permissions.
That is a security weakness, and leads to a warning at boot if
CONFIG_DEBUG_WX is enabled:
powerpc/mm: Found insecure W+X mapping at address 00000000056435bc/0xc000000000000000
WARNING: CPU: 1 PID: 1 at arch/powerpc/mm/ptdump/ptdump.c:193 note_page+0x484/0x4c0
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc1-00001-g8ae8e98aea82-dirty #175
Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,git-dd0dca hv:linux,kvm pSeries
NIP: c0000000004a1c34 LR: c0000000004a1c30 CTR: 0000000000000000
REGS: c000000003503770 TRAP: 0700 Not tainted (6.2.0-rc1-00001-g8ae8e98aea82-dirty)
MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 24000220 XER: 00000000
CFAR: c000000000545a58 IRQMASK: 0
...
NIP note_page+0x484/0x4c0
LR note_page+0x480/0x4c0
Call Trace:
note_page+0x480/0x4c0 (unreliable)
ptdump_pmd_entry+0xc8/0x100
walk_pgd_range+0x618/0xab0
walk_page_range_novma+0x74/0xc0
ptdump_walk_pgd+0x98/0x170
ptdump_check_wx+0x94/0x100
mark_rodata_ro+0x30/0x70
kernel_init+0x78/0x1a0
ret_from_kernel_thread+0x5c/0x64
The fix has two parts. Firstly the pages from zero up to the end of
interrupts need to be marked read-only, so that they are left with R-X
permissions. Secondly the mapping logic needs to be taught to ensure
there is a page boundary at the end of the interrupt region, so that the
permission change only applies to the interrupt text, and not the region
following it.
Fixes: c55d7b5e6426 ("powerpc: Remove STRICT_KERNEL_RWX incompatibility with RELOCATABLE")
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230110124753.1325426-2-mpe@ellerman.id.au
If a relocatable kernel is loaded at an address that is not 2MB aligned
and told not to relocate to zero, the kernel can crash due to
mark_rodata_ro() incorrectly changing some read-write data to read-only.
Scenarios where the misalignment can occur are when the kernel is
loaded by kdump or using the RELOCATABLE_TEST config option.
Example crash with the kernel loaded at 5MB:
Run /sbin/init as init process
BUG: Unable to handle kernel data access on write at 0xc000000000452000
Faulting instruction address: 0xc0000000005b6730
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
CPU: 1 PID: 1 Comm: init Not tainted 6.2.0-rc1-00011-g349188be4841 #166
Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,git-5b4c5a hv:linux,kvm pSeries
NIP: c0000000005b6730 LR: c000000000ae9ab8 CTR: 0000000000000380
REGS: c000000004503250 TRAP: 0300 Not tainted (6.2.0-rc1-00011-g349188be4841)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44288480 XER: 00000000
CFAR: c0000000005b66ec DAR: c000000000452000 DSISR: 0a000000 IRQMASK: 0
...
NIP memset+0x68/0x104
LR zero_user_segments.constprop.0+0xa8/0xf0
Call Trace:
ext4_mpage_readpages+0x7f8/0x830
ext4_readahead+0x48/0x60
read_pages+0xb8/0x380
page_cache_ra_unbounded+0x19c/0x250
filemap_fault+0x58c/0xae0
__do_fault+0x60/0x100
__handle_mm_fault+0x1230/0x1a40
handle_mm_fault+0x120/0x300
___do_page_fault+0x20c/0xa80
do_page_fault+0x30/0xc0
data_access_common_virt+0x210/0x220
This happens because mark_rodata_ro() tries to change permissions on the
range _stext..__end_rodata, but _stext sits in the middle of the 2MB
page from 4MB to 6MB:
radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec)
radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages
radix-mmu: Mapped 0x0000000000400000-0x0000000002400000 with 2.00 MiB pages (exec)
The logic that changes the permissions assumes the linear mapping was
split correctly at boot, so it marks the entire 2MB page read-only. That
leads to the write fault above.
To fix it, the boot time mapping logic needs to consider that if the
kernel is running at a non-zero address then _stext is a boundary where
it must split the mapping.
That leads to the mapping being split correctly, allowing the rodata
permission change to take happen correctly, with no spillover:
radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec)
radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages
radix-mmu: Mapped 0x0000000000400000-0x0000000000500000 with 64.0 KiB pages
radix-mmu: Mapped 0x0000000000500000-0x0000000000600000 with 64.0 KiB pages (exec)
radix-mmu: Mapped 0x0000000000600000-0x0000000002400000 with 2.00 MiB pages (exec)
If the kernel is loaded at a 2MB aligned address, the mapping continues
to use 2MB pages as before:
radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec)
radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages
radix-mmu: Mapped 0x0000000000400000-0x0000000002c00000 with 2.00 MiB pages (exec)
radix-mmu: Mapped 0x0000000002c00000-0x0000000100000000 with 2.00 MiB pages
Fixes: c55d7b5e6426 ("powerpc: Remove STRICT_KERNEL_RWX incompatibility with RELOCATABLE")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230110124753.1325426-1-mpe@ellerman.id.au
stress_hpt_timer_fn() is only used in hash_utils.c, make it static.
Fixes: 6b34a099faa1 ("powerpc/64s/hash: add stress_hpt kernel boot option to increase hash faults")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221228093603.3166599-1-yangyingliang@huawei.com
This option increases the number of hash misses by limiting the number
of kernel HPT entries, by keeping a per-CPU record of the last kernel
HPTEs installed, and removing that from the hash table on the next hash
insertion. A timer round-robins CPUs removing remaining kernel HPTEs and
clearing the TLB (in the case of bare metal) to increase and slightly
randomise kernel fault activity.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add comment about NR_CPUS usage, fixup whitespace]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221024030150.852517-1-npiggin@gmail.com
stop_machine_cpuslocked takes a mutex so it must be called in a
preemptible context, so it can't simply be fixed by disabling
preemption.
This is not a bug, because CPU hotplug is locked, so this processor will
call in to the stop machine function. So raw_smp_processor_id() could be
used. This leaves a small chance that this thread will be migrated to
another CPU, so the master work would be done by a CPU from a different
context. Better for test coverage to make that a common case by just
having the first CPU to call in become the master.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221013151647.1857994-2-npiggin@gmail.com
This lock is taken while the raw kfence_freelist_lock is held, so it
must also be a raw spinlock, as reported by lockdep when raw lock
nesting checking is enabled.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221013230710.1987253-3-npiggin@gmail.com
With kfence enabled, there are several cases where HPTE and TLBIE locks
are called from softirq context, for example:
WARNING: inconsistent lock state
6.0.0-11845-g0cbbc95b12ac #1 Tainted: G N
--------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
swapper/0/1 [HC0[0]:SC0[0]:HE1:SE1] takes:
c000000002734de8 (native_tlbie_lock){+.?.}-{2:2}, at: .native_hpte_updateboltedpp+0x1a4/0x600
{IN-SOFTIRQ-W} state was registered at:
.lock_acquire+0x20c/0x520
._raw_spin_lock+0x4c/0x70
.native_hpte_invalidate+0x62c/0x840
.hash__kernel_map_pages+0x450/0x640
.kfence_protect+0x58/0xc0
.kfence_guarded_free+0x374/0x5a0
.__slab_free+0x3d0/0x630
.put_cred_rcu+0xcc/0x120
.rcu_core+0x3c4/0x14e0
.__do_softirq+0x1dc/0x7dc
.do_softirq_own_stack+0x40/0x60
Fix this by consistently disabling irqs while taking either of these
locks. Don't just disable bh because several of the more common cases
already disable irqs, so this just makes the locks always irq-safe.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221013230710.1987253-2-npiggin@gmail.com
Add lockdep annotation for the HPTE bit-spinlock. Modern systems don't
take the tlbie lock, so this shows up some of the same lockdep warnings
that were being reported by the ppc970. And they're not taken in exactly
the same places so this is nice to have in its own right.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20221013230710.1987253-1-npiggin@gmail.com
linux-next for a couple of months without, to my knowledge, any negative
reports (or any positive ones, come to that).
- Also the Maple Tree from Liam R. Howlett. An overlapping range-based
tree for vmas. It it apparently slight more efficient in its own right,
but is mainly targeted at enabling work to reduce mmap_lock contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
(https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
This has yet to be addressed due to Liam's unfortunately timed
vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down to
the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
=xfWx
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
- Remove our now never-true definitions for pgd_huge() and p4d_leaf().
- Add pte_needs_flush() and huge_pmd_needs_flush() for 64-bit.
- Add support for syscall wrappers.
- Add support for KFENCE on 64-bit.
- Update 64-bit HV KVM to use the new guest state entry/exit accounting API.
- Support execute-only memory when using the Radix MMU (P9 or later).
- Implement CONFIG_PARAVIRT_TIME_ACCOUNTING for pseries guests.
- Updates to our linker script to move more data into read-only sections.
- Allow the VDSO to be randomised on 32-bit.
- Many other small features and fixes.
Thanks to: Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira Rajeev, Christophe
Leroy, David Hildenbrand, Disha Goel, Fabiano Rosas, Gaosheng Cui, Gustavo A. R. Silva,
Haren Myneni, Hari Bathini, Jilin Yuan, Joel Stanley, Kajol Jain, Kees Cook, Krzysztof
Kozlowski, Laurent Dufour, Liang He, Li Huafei, Lukas Bulwahn, Madhavan Srinivasan, Nathan
Chancellor, Nathan Lynch, Nicholas Miehlbradt, Nicholas Piggin, Pali Rohár, Rohan McLure,
Russell Currey, Sachin Sant, Segher Boessenkool, Shrikanth Hegde, Tyrel Datwyler, Wolfram
Sang, ye xingchen, Zheng Yongjun.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmNCpBMTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgDx3EACCf86iumFF3RyvENtDwoTRgH3H0z2E
/ZC4LKrtxgaPFJzKUT4F0kLK85Hw5GzMEKK42NIhAB0o5vFwmEzxOtnlHOyEufAm
EDIZDIfxV2J9Qx/cW2DSojPj/o9O6noXwhw9SBqMwiDWd8gXmNgOUEklAO7aR7Vq
Ne2N2FLMNthZydCoHR6dAEjfe2ceFXP5cALwzQO+ILDdZQ0UcF2Yq4yw/gEDoCrB
FH7mmE7UaQQHvYzo85VTZu7XfUys1P7kUcnhVurOg7/07ITnvnQR+itKZXC+bSft
1K7ULtjd2QiCgxZA/apFc3lO46kqHVFsB3onRQw12/Ku5vfGFfY0L0iK97OgM4s0
0u4r+J7A+MM5YBJVVjwZ6woYO5CWMHYKBZepxOpcvftPxj1LNkiHsryqKILGISEC
aIY/lI0hpeNU4QshDMXzSTgeb/VF9O5cGPncTPkOFbXxD4RpVyz8tSngsG1+D8lj
S6B2h3k4A14rnblLOxP22jcedBlTYQcRQS4vwr0a7+63QTjfSJ12xT3ucIAKU9f7
65rVSS/igbrfxqHDmrd60WWZBMXeK0Zy7YIG6iYPTxpP31eFpSp9wtDlV7V2+EH2
F2p+TJY8aTA8UW+2L5gigN3RsBeeEB8zxJkB14ivICM7+XzVu11PxPDqjDZYkfzC
ueKKvCcHhHAYqQ==
=TFBA
-----END PGP SIGNATURE-----
Merge tag 'powerpc-6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Remove our now never-true definitions for pgd_huge() and p4d_leaf().
- Add pte_needs_flush() and huge_pmd_needs_flush() for 64-bit.
- Add support for syscall wrappers.
- Add support for KFENCE on 64-bit.
- Update 64-bit HV KVM to use the new guest state entry/exit accounting
API.
- Support execute-only memory when using the Radix MMU (P9 or later).
- Implement CONFIG_PARAVIRT_TIME_ACCOUNTING for pseries guests.
- Updates to our linker script to move more data into read-only
sections.
- Allow the VDSO to be randomised on 32-bit.
- Many other small features and fixes.
Thanks to Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira
Rajeev, Christophe Leroy, David Hildenbrand, Disha Goel, Fabiano Rosas,
Gaosheng Cui, Gustavo A. R. Silva, Haren Myneni, Hari Bathini, Jilin
Yuan, Joel Stanley, Kajol Jain, Kees Cook, Krzysztof Kozlowski, Laurent
Dufour, Liang He, Li Huafei, Lukas Bulwahn, Madhavan Srinivasan, Nathan
Chancellor, Nathan Lynch, Nicholas Miehlbradt, Nicholas Piggin, Pali
Rohár, Rohan McLure, Russell Currey, Sachin Sant, Segher Boessenkool,
Shrikanth Hegde, Tyrel Datwyler, Wolfram Sang, ye xingchen, and Zheng
Yongjun.
* tag 'powerpc-6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
KVM: PPC: Book3S HV: Fix stack frame regs marker
powerpc: Don't add __powerpc_ prefix to syscall entry points
powerpc/64s/interrupt: Fix stack frame regs marker
powerpc/64: Fix msr_check_and_set/clear MSR[EE] race
powerpc/64s/interrupt: Change must-hard-mask interrupt check from BUG to WARN
powerpc/pseries: Add firmware details to the hardware description
powerpc/powernv: Add opal details to the hardware description
powerpc: Add device-tree model to the hardware description
powerpc/64: Add logical PVR to the hardware description
powerpc: Add PVR & CPU name to hardware description
powerpc: Add hardware description string
powerpc/configs: Enable PPC_UV in powernv_defconfig
powerpc/configs: Update config files for removed/renamed symbols
powerpc/mm: Fix UBSAN warning reported on hugetlb
powerpc/mm: Always update max/min_low_pfn in mem_topology_setup()
powerpc/mm/book3s/hash: Rename flush_tlb_pmd_range
powerpc: Drops STABS_DEBUG from linker scripts
powerpc/64s: Remove lost/old comment
powerpc/64s: Remove old STAB comment
powerpc: remove orphan systbl_chk.sh
...
This function does the hash page table update. Hence rename it to
indicate this better to avoid confusion with flush_pmd_tlb_range()
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Drop unnecessary extern]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220907081941.209501-1-aneesh.kumar@linux.ibm.com
KFENCE support was added for ppc32 in commit 90cbac0e995d
("powerpc: Enable KFENCE for PPC32").
Enable KFENCE on ppc64 architecture with hash and radix MMUs.
It uses the same mechanism as debug pagealloc to
protect/unprotect pages. All KFENCE kunit tests pass on both
MMUs.
KFENCE memory is initially allocated using memblock but is
later marked as SLAB allocated. This necessitates the change
to __pud_free to ensure that the KFENCE pages are freed
appropriately.
Based on previous work by Christophe Leroy and Jordan Niethe.
Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Reviewed-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926075726.2846-4-nicholas@linux.ibm.com
If the page is already mapped resp. already unmapped, bail out.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926075726.2846-3-nicholas@linux.ibm.com
debug_pagealloc_enabled() is always defined and constant folds to
'false' when CONFIG_DEBUG_PAGEALLOC is not enabled.
Remove the #ifdefs, the code and associated static variables will
be optimised out by the compiler when CONFIG_DEBUG_PAGEALLOC is
not defined.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926075726.2846-2-nicholas@linux.ibm.com
The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will
move to use RCU instead of disabling local interrupts in fast-GUP. Using
an IPI is the old-styled way of serializing against fast-GUP although it
still works as expected now.
And fast-GUP now fixed the potential race with THP collapse by checking
whether PMD is changed or not. So IPI broadcast in radix pmd collapse
flush is not necessary anymore. But it is still needed for hash TLB.
Link: https://lkml.kernel.org/r/20220907180144.555485-2-shy828301@gmail.com
Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
update_mmu_cache() voids when hash page tables are not used.
On PPC32 that means when MMU_FTR_HPTE_TABLE is not defined.
On PPC64 that means when RADIX is enabled.
Rename core part of update_mmu_cache() as __update_mmu_cache()
and include the initial verification in an inlined caller.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bea5ad0de7f83eff256116816d46c84fa0a444de.1662370698.git.christophe.leroy@csgroup.eu
powerpc has a number of read-only sections and tables that are put after
RO_DATA(). Move the __end_rodata symbol to cover these as well.
Setting memory to read-only at boot is done using __init_begin, change
that to use __end_rodata.
This makes is_kernel_rodata() exactly cover the read-only region, as
well as other things using __end_rodata (e.g., kernel/dma/debug.c).
Boot dmesg also prints the rodata size more accurately.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220916040755.2398112-2-npiggin@gmail.com
Currently __init_begin is used as the boundary for strict RWX between
executable/read-only text and data, and non-executable (after boot) code
and data.
But that's a little subtle, so add an explicit symbol to document that
the SRWX boundary lies there, and add a comment making it clear that
__init_begin must also begin there.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220916131422.318752-2-mpe@ellerman.id.au
Add support for execute-only memory (XOM) for the Radix MMU by using an
execute-only mapping, as opposed to the RX mapping used by powerpc's
other MMUs.
The Hash MMU already supports XOM through the execute-only pkey,
which is a separate mechanism shared with x86. A PROT_EXEC-only mapping
will map to RX, and then the pkey will be applied on top of it.
mmap() and mprotect() consumers in userspace should observe the same
behaviour on Hash and Radix despite the differences in implementation.
Replacing the vma_is_accessible() check in access_error() with a read
check should be functionally equivalent for non-Radix MMUs, since it
follows write and execute checks. For Radix, the change enables
detecting faults on execute-only mappings where vma_is_accessible() would
return true.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220817050640.406017-1-ruscur@russell.cc
The nest MMU in POWER9 does not re-fetch the PTE in response to
permission mismatch, contrary to the architecture[*] and unlike the core
MMU. This requires a TLB flush before upgrading permissions of valid
PTEs, for any address space with a coprocessor attached.
Per (non-public) Nest MMU Workbook, POWER10 nest MMU conforms to the
architecture in this regard, so skip the workaround.
[*] See: Power ISA Version 3.1B, 6.10.1.2 Modifying a Translation Table
Entry, Setting a Reference or Change Bit or Upgrading Access
Authority (PTE Subject to Atomic Hardware Updates):
"If the only change being made to a valid PTE that is subject to
atomic hardware updates is to set the Reference or Change bit to
1 or to upgrade access authority, a simpler sequence suffices
because the translation hardware will refetch the PTE if an
access is attempted for which the only problems were reference
and/or change bits needing to be set or insufficient access
authority."
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220525022358.780745-3-npiggin@gmail.com
Per (non-public) Nest MMU Workbook, POWER10 and POWER9P NMMU does not
cache PTEs in PWC, so does not require PWC flush to invalidate these
translations.
Skip the workaround on POWER10 and later.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220525022358.780745-2-npiggin@gmail.com
Since commit 87c78b612f4f ("powerpc: Fix all occurences of "the the"")
fixed "the the", there's now a steady stream of patches fixing other
duplicate words.
Just fix them all at once, to save the overhead of dealing with
individual patches for each case.
This leaves a few cases of "that that", which in some contexts is
correct.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220718095158.326606-1-mpe@ellerman.id.au
Trying to remove asm/ppc_asm.h from all places that don't need it
leads to several failures linked to firmware_has_feature().
To fix it, include asm/firmware.h in all files using
firmware_has_feature()
All users found with:
git grep -L "firmware\.h" ` git grep -l "firmware_has_feature("`
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/11956ec181a034b51a881ac9c059eea72c679a73.1651828453.git.christophe.leroy@csgroup.eu
- Convert to the generic mmap support (ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT).
- Add support for outline-only KASAN with 64-bit Radix MMU (P9 or later).
- Increase SIGSTKSZ and MINSIGSTKSZ and add support for AT_MINSIGSTKSZ.
- Enable the DAWR (Data Address Watchpoint) on POWER9 DD2.3 or later.
- Drop support for system call instruction emulation.
- Many other small features and fixes.
Thanks to: Alexey Kardashevskiy, Alistair Popple, Andy Shevchenko, Bagas Sanjaya, Bjorn
Helgaas, Bo Liu, Chen Huang, Christophe Leroy, Colin Ian King, Daniel Axtens, Dwaipayan
Ray, Fabiano Rosas, Finn Thain, Frank Rowand, Fuqian Huang, Guilherme G. Piccoli, Hangyu
Hua, Haowen Bai, Haren Myneni, Hari Bathini, He Ying, Jason Wang, Jiapeng Chong, Jing
Yangyang, Joel Stanley, Julia Lawall, Kajol Jain, Kevin Hao, Krzysztof Kozlowski, Laurent
Dufour, Lv Ruyi, Madhavan Srinivasan, Magali Lemes, Miaoqian Lin, Minghao Chi, Nathan
Chancellor, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Oscar Salvador, Pali Rohár,
Paul Mackerras, Peng Wu, Qing Wang, Randy Dunlap, Reza Arbab, Russell Currey, Sohaib
Mohamed, Vaibhav Jain, Vasant Hegde, Wang Qing, Wang Wensheng, Xiang wangx, Xiaomeng Tong,
Xu Wang, Yang Guang, Yang Li, Ye Bin, YueHaibing, Yu Kuai, Zheng Bin, Zou Wei, Zucheng
Zheng.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmKSEgETHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgJpLEACee7mu2I00Z7VWtW5ckT4RFbAXYZcM
Hv5DbTnVB2ItoQMRHvG52DNbR73j9HnYrz8kpwfTBVk90udxVP14L/swXDs3xbT4
riXEYtJ1DRVc/bLiOK637RLPWNrmmZStWZme7k0Y9Ki5Aif8i1Erjjq7EIy47m9j
j1MTcwp3ND7IsBON2nZ3PkttEHhevKvOwCPb/BWtPMDV0OhyQUFKB2SNegrlCrkT
wshDgdQcYqbIix98PoGa2ZfUVgFQD3JVLzXa4sLpqouzGD+HvEFStOFa2Gq/ZEvV
zunaeXDdZUCjlib6KvA8+aumBbIQ1s/urrDbxd+3BuYxZ094vNP1B428NT1AWVtl
3bEZQIN8GSx0v9aHxZ8HePsAMXgG9d2o0xC9EMQ430+cqroN+6UHP7lkekwkprb7
U9EpZCG9U8jV6SDcaMigW3tooEjn657we0R8nZG2NgUNssdSHVh/JYxGDALPXIAk
awL3NQrR0tYF3Y3LJm5AxdQrK1hJH8E+hZFCZvIpUXGsr/uf9Gemy/62pD1rhrr/
niULpxIneRGkJiXB5qdGy8pRu27ED53k7Ky6+8MWSEFQl1mUsHSryYACWz939D8c
DydhBwQqDTl6Ozs41a5TkVjIRLOCrZADUd/VZM6A4kEOqPJ5t2Gz22Bn8ya1z6Ks
5Sx6vrGH7GnDjA==
=15oQ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Convert to the generic mmap support (ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT)
- Add support for outline-only KASAN with 64-bit Radix MMU (P9 or later)
- Increase SIGSTKSZ and MINSIGSTKSZ and add support for AT_MINSIGSTKSZ
- Enable the DAWR (Data Address Watchpoint) on POWER9 DD2.3 or later
- Drop support for system call instruction emulation
- Many other small features and fixes
Thanks to Alexey Kardashevskiy, Alistair Popple, Andy Shevchenko, Bagas
Sanjaya, Bjorn Helgaas, Bo Liu, Chen Huang, Christophe Leroy, Colin Ian
King, Daniel Axtens, Dwaipayan Ray, Fabiano Rosas, Finn Thain, Frank
Rowand, Fuqian Huang, Guilherme G. Piccoli, Hangyu Hua, Haowen Bai,
Haren Myneni, Hari Bathini, He Ying, Jason Wang, Jiapeng Chong, Jing
Yangyang, Joel Stanley, Julia Lawall, Kajol Jain, Kevin Hao, Krzysztof
Kozlowski, Laurent Dufour, Lv Ruyi, Madhavan Srinivasan, Magali Lemes,
Miaoqian Lin, Minghao Chi, Nathan Chancellor, Naveen N. Rao, Nicholas
Piggin, Oliver O'Halloran, Oscar Salvador, Pali Rohár, Paul Mackerras,
Peng Wu, Qing Wang, Randy Dunlap, Reza Arbab, Russell Currey, Sohaib
Mohamed, Vaibhav Jain, Vasant Hegde, Wang Qing, Wang Wensheng, Xiang
wangx, Xiaomeng Tong, Xu Wang, Yang Guang, Yang Li, Ye Bin, YueHaibing,
Yu Kuai, Zheng Bin, Zou Wei, and Zucheng Zheng.
* tag 'powerpc-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (200 commits)
powerpc/64: Include cache.h directly in paca.h
powerpc/64s: Only set HAVE_ARCH_UNMAPPED_AREA when CONFIG_PPC_64S_HASH_MMU is set
powerpc/xics: Include missing header
powerpc/powernv/pci: Drop VF MPS fixup
powerpc/fsl_book3e: Don't set rodata RO too early
powerpc/microwatt: Add mmu bits to device tree
powerpc/powernv/flash: Check OPAL flash calls exist before using
powerpc/powermac: constify device_node in of_irq_parse_oldworld()
powerpc/powermac: add missing g5_phy_disable_cpu1() declaration
selftests/powerpc/pmu: fix spelling mistake "mis-match" -> "mismatch"
powerpc: Enable the DAWR on POWER9 DD2.3 and above
powerpc/64s: Add CPU_FTRS_POWER10 to ALWAYS mask
powerpc/64s: Add CPU_FTRS_POWER9_DD2_2 to CPU_FTRS_ALWAYS mask
powerpc: Fix all occurences of "the the"
selftests/powerpc/pmu/ebb: remove fixed_instruction.S
powerpc/platforms/83xx: Use of_device_get_match_data()
powerpc/eeh: Drop redundant spinlock initialization
powerpc/iommu: Add missing of_node_put in iommu_init_early_dart
powerpc/pseries/vas: Call misc_deregister if sysfs init fails
powerpc/papr_scm: Fix leaking nvdimm_events_map elements
...
Implement a limited form of KASAN for Book3S 64-bit machines running under
the Radix MMU, supporting only outline mode.
- Enable the compiler instrumentation to check addresses and maintain the
shadow region. (This is the guts of KASAN which we can easily reuse.)
- Require kasan-vmalloc support to handle modules and anything else in
vmalloc space.
- KASAN needs to be able to validate all pointer accesses, but we can't
instrument all kernel addresses - only linear map and vmalloc. On boot,
set up a single page of read-only shadow that marks all iomap and
vmemmap accesses as valid.
- Document KASAN in powerpc docs.
Background
----------
KASAN support on Book3S is a bit tricky to get right:
- It would be good to support inline instrumentation so as to be able to
catch stack issues that cannot be caught with outline mode.
- Inline instrumentation requires a fixed offset.
- Book3S runs code with translations off ("real mode") during boot,
including a lot of generic device-tree parsing code which is used to
determine MMU features.
[ppc64 mm note: The kernel installs a linear mapping at effective
address c000...-c008.... This is a one-to-one mapping with physical
memory from 0000... onward. Because of how memory accesses work on
powerpc 64-bit Book3S, a kernel pointer in the linear map accesses the
same memory both with translations on (accessing as an 'effective
address'), and with translations off (accessing as a 'real
address'). This works in both guests and the hypervisor. For more
details, see s5.7 of Book III of version 3 of the ISA, in particular
the Storage Control Overview, s5.7.3, and s5.7.5 - noting that this
KASAN implementation currently only supports Radix.]
- Some code - most notably a lot of KVM code - also runs with translations
off after boot.
- Therefore any offset has to point to memory that is valid with
translations on or off.
One approach is just to give up on inline instrumentation. This way
boot-time checks can be delayed until after the MMU is set is up, and we
can just not instrument any code that runs with translations off after
booting. Take this approach for now and require outline instrumentation.
Previous attempts allowed inline instrumentation. However, they came with
some unfortunate restrictions: only physically contiguous memory could be
used and it had to be specified at compile time. Maybe we can do better in
the future.
[paulus@ozlabs.org - Rebased onto 5.17. Note that a kernel with
CONFIG_KASAN=y will crash during boot on a machine using HPT
translation because not all the entry points to the generic
KASAN code are protected with a call to kasan_arch_is_ready().]
Originally-by: Balbir Singh <bsingharora@gmail.com> # ppc64 out-of-line radix version
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Update copyright year and comment formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/YoTE69OQwiG7z+Gu@cleo
LoPAPR defines guest visible IOMMU with hypercalls to use it -
H_PUT_TCE/etc. Implemented first on POWER7 where hypercalls would trap
in the KVM in the real mode (with MMU off). The problem with the real mode
is some memory is not available and some API usage crashed the host but
enabling MMU was an expensive operation.
The problems with the real mode handlers are:
1. Occasionally these cannot complete the request so the code is
copied+modified to work in the virtual mode, very little is shared;
2. The real mode handlers have to be linked into vmlinux to work;
3. An exception in real mode immediately reboots the machine.
If the small DMA window is used, the real mode handlers bring better
performance. However since POWER8, there has always been a bigger DMA
window which VMs use to map the entire VM memory to avoid calling
H_PUT_TCE. Such 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT
(a bulk version of H_PUT_TCE) which virtual mode handler is even closer
to its real mode version.
On POWER9 hypercalls trap straight to the virtual mode so the real mode
handlers never execute on POWER9 and later CPUs.
So with the current use of the DMA windows and MMU improvements in
POWER9 and later, there is no point in duplicating the code.
The 32bit passed through devices may slow down but we do not have many
of these in practice. For example, with this applied, a 1Gbit ethernet
adapter still demostrates above 800Mbit/s of actual throughput.
This removes the real mode handlers from KVM and related code from
the powernv platform.
This updates the list of implemented hcalls in KVM-HV as the realmode
handlers are removed.
This changes ABI - kvmppc_h_get_tce() moves to the KVM module and
kvmppc_find_table() is static now.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220506053755.3820702-1-aik@ozlabs.ru
Various spelling mistakes in comments.
Detected with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220430185654.5855-1-Julia.Lawall@inria.fr
arch_randomize_brk() is only needed for hash on book3s/64, for other
platforms the one provided by the default mmap layout is good enough.
Move it to hash_utils.c and use randomize_page() like the generic one.
And properly opt out the radix case instead of making an assumption
on mmu_highuser_ssize.
Also change to a 32M range like most other architectures instead of 8M.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/eafa4d18ec8ac7b98dd02b40181e61643707cc7c.1649523076.git.christophe.leroy@csgroup.eu
Do like most other architectures and provide randomisation also to
"legacy" memory mappings, by adding the random factor to
mm->mmap_base in arch_pick_mmap_layout().
See commit 8b8addf891de ("x86/mm/32: Enable full randomization on
i386 and X86_32") for all explanations and benefits of that mmap
randomisation.
At the moment, slice_find_area_bottomup() doesn't use mm->mmap_base
but uses the fixed TASK_UNMAPPED_BASE instead.
slice_find_area_bottomup() being used as a fallback to
slice_find_area_topdown(), it can't use mm->mmap_base
directly.
Instead of always using TASK_UNMAPPED_BASE as base address, leave
it to the caller. When called from slice_find_area_topdown()
TASK_UNMAPPED_BASE is used. Otherwise mm->mmap_base is used.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/417fb10dde828534c73a03138b49621d74f4e5be.1649523076.git.christophe.leroy@csgroup.eu
hugetlb_get_unmapped_area() is now identical to the
generic version if only RADIX is enabled, so move it
to slice.c and let it fallback on the generic one
when HASH MMU is not compiled in.
Do the same with arch_get_unmapped_area() and
arch_get_unmapped_area_topdown().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b5d9c124e82889e0cb115c150915a0c0d84eb960.1649523076.git.christophe.leroy@csgroup.eu