linux/Documentation/translations/zh_CN/mm/vmalloced-kernel-stacks.rst
Linus Torvalds 6614a3c316 - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
 
 - Some kmemleak fixes from Patrick Wang and Waiman Long
 
 - DAMON updates from SeongJae Park
 
 - memcg debug/visibility work from Roman Gushchin
 
 - vmalloc speedup from Uladzislau Rezki
 
 - more folio conversion work from Matthew Wilcox
 
 - enhancements for coherent device memory mapping from Alex Sierra
 
 - addition of shared pages tracking and CoW support for fsdax, from
   Shiyang Ruan
 
 - hugetlb optimizations from Mike Kravetz
 
 - Mel Gorman has contributed some pagealloc changes to improve latency
   and realtime behaviour.
 
 - mprotect soft-dirty checking has been improved by Peter Xu
 
 - Many other singleton patches all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
 jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
 SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
 =w/UH
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Most of the MM queue. A few things are still pending.

  Liam's maple tree rework didn't make it. This has resulted in a few
  other minor patch series being held over for next time.

  Multi-gen LRU still isn't merged as we were waiting for mapletree to
  stabilize. The current plan is to merge MGLRU into -mm soon and to
  later reintroduce mapletree, with a view to hopefully getting both
  into 6.1-rc1.

  Summary:

   - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
     Lin, Yang Shi, Anshuman Khandual and Mike Rapoport

   - Some kmemleak fixes from Patrick Wang and Waiman Long

   - DAMON updates from SeongJae Park

   - memcg debug/visibility work from Roman Gushchin

   - vmalloc speedup from Uladzislau Rezki

   - more folio conversion work from Matthew Wilcox

   - enhancements for coherent device memory mapping from Alex Sierra

   - addition of shared pages tracking and CoW support for fsdax, from
     Shiyang Ruan

   - hugetlb optimizations from Mike Kravetz

   - Mel Gorman has contributed some pagealloc changes to improve
     latency and realtime behaviour.

   - mprotect soft-dirty checking has been improved by Peter Xu

   - Many other singleton patches all over the place"

 [ XFS merge from hell as per Darrick Wong in

   https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]

* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
  tools/testing/selftests/vm/hmm-tests.c: fix build
  mm: Kconfig: fix typo
  mm: memory-failure: convert to pr_fmt()
  mm: use is_zone_movable_page() helper
  hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
  hugetlbfs: cleanup some comments in inode.c
  hugetlbfs: remove unneeded header file
  hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
  hugetlbfs: use helper macro SZ_1{K,M}
  mm: cleanup is_highmem()
  mm/hmm: add a test for cross device private faults
  selftests: add soft-dirty into run_vmtests.sh
  selftests: soft-dirty: add test for mprotect
  mm/mprotect: fix soft-dirty check in can_change_pte_writable()
  mm: memcontrol: fix potential oom_lock recursion deadlock
  mm/gup.c: fix formatting in check_and_migrate_movable_page()
  xfs: fail dax mount if reflink is enabled on a partition
  mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
  userfaultfd: don't fail on unrecognized features
  hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
  ...
2022-08-05 16:32:45 -07:00

134 lines
5.1 KiB
ReStructuredText
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/mm/vmalloced-kernel-stacks.rst
:翻译:
司延腾 Yanteng Si <siyanteng@loongson.cn>
:校译:
====================
支持虚拟映射的内核栈
====================
:作者: Shuah Khan <skhan@linuxfoundation.org>
.. contents:: :local:
概览
----
这是介绍 `虚拟映射内核栈功能 <https://lwn.net/Articles/694348/>` 的代码
和原始补丁系列的信息汇总。
简介
----
内核堆栈溢出通常难以调试,并使内核容易被(恶意)利用。问题可能在稍后的时间出现,使其难以
隔离和究其根本原因。
带有保护页的虚拟映射内核堆栈如果溢出,会被立即捕获,而不会放任其导致难以诊断的损
坏。
HAVE_ARCH_VMAP_STACK和VMAP_STACK配置选项能够支持带有保护页的虚拟映射堆栈。
当堆栈溢出时,这个特性会引发可靠的异常。溢出后堆栈跟踪的可用性以及对溢出本身的
响应取决于架构。
.. note::
截至本文撰写时, arm64, powerpc, riscv, s390, um, 和 x86 支持VMAP_STACK。
HAVE_ARCH_VMAP_STACK
--------------------
能够支持虚拟映射内核栈的架构应该启用这个bool配置选项。要求是:
- vmalloc空间必须大到足以容纳许多内核堆栈。这可能排除了许多32位架构。
- vmalloc空间的堆栈需要可靠地工作。例如如果vmap页表是按需创建的当堆栈指向
具有未填充页表的虚拟地址时这种机制需要工作或者架构代码switch_to()和
switch_mm(),很可能)需要确保堆栈的页表项在可能未填充的堆栈上运行之前已经填
充。
- 如果堆栈溢出到一个保护页,就应该发生一些合理的事情。“合理”的定义是灵活的,但
在没有记录任何东西的情况下立即重启是不友好的。
VMAP_STACK
----------
VMAP_STACK bool配置选项在启用时分配虚拟映射的任务栈。这个选项依赖于
HAVE_ARCH_VMAP_STACK。
- 如果你想使用带有保护页的虚拟映射的内核堆栈,请启用该选项。这将导致内核栈溢出
被立即捕获,而不是难以诊断的损坏。
.. note::
使用KASAN的这个功能需要架构支持用真实的影子内存来支持虚拟映射并且
必须启用KASAN_VMALLOC。
.. note::
启用VMAP_STACK时无法在堆栈分配的数据上运行DMA。
内核配置选项和依赖性不断变化。请参考最新的代码库:
`Kconfig <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/Kconfig>`
分配方法
--------
当一个新的内核线程被创建时,线程堆栈是由页级分配器分配的虚拟连续的内存页组成。这
些页面被映射到有PAGE_KERNEL保护的连续的内核虚拟空间。
alloc_thread_stack_node()调用__vmalloc_node_range()来分配带有PAGE_KERNEL
保护的栈。
- 分配的堆栈被缓存起来,以后会被新的线程重用,所以在分配/释放堆栈给任务时,要手动
进行memcg核算。因此__vmalloc_node_range被调用时没有__GFP_ACCOUNT。
- vm_struct被缓存起来以便能够找到在中断上下文中启动的空闲线程。 free_thread_stack()
可以在中断上下文中调用。
- 在arm64上所有VMAP的堆栈都需要有相同的对齐方式以确保VMAP的堆栈溢出检测正常
工作。架构特定的vmap堆栈分配器照顾到了这个细节。
- 这并不涉及中断堆栈--参考原始补丁
线程栈分配是由clone()、fork()、vfork()、kernel_thread()通过kernel_clone()
启动的。留点提示在这,以便搜索代码库,了解线程栈何时以及如何分配。
大量的代码是在:
`kernel/fork.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c>`.
task_struct中的stack_vm_area指针可以跟踪虚拟分配的堆栈一个非空的stack_vm_area
指针可以表明虚拟映射的内核堆栈已经启用。
::
struct vm_struct *stack_vm_area;
堆栈溢出处理
------------
前守护页和后守护页有助于检测堆栈溢出。当堆栈溢出到守护页时,处理程序必须小心不要再
次溢出堆栈。当处理程序被调用时,很可能只留下很少的堆栈空间。
在x86上这是通过处理表明内核堆栈溢出的双异常堆栈的缺页异常来实现的。
用守护页测试VMAP分配
--------------------
我们如何确保VMAP_STACK在分配时确实有前守护页和后守护页的保护下面的 lkdtm 测试
可以帮助检测任何回归。
::
void lkdtm_STACK_GUARD_PAGE_LEADING()
void lkdtm_STACK_GUARD_PAGE_TRAILING()
结论
----
- vmalloced堆栈的percpu缓存似乎比高阶堆栈分配要快一些至少在缓存命中时是这样。
- THREAD_INFO_IN_TASK完全摆脱了arch-specific thread_info并简单地将
thread_info仅包含标志和'int cpu'嵌入task_struct中。
- 一旦任务死亡线程栈就可以被释放无需等待RCU然后如果使用vmapped栈
可以将整个栈缓存起来以便在同一cpu上重复使用。