Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769.
Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...

Documentation/admin-guide/mm/damon/index.rst (new file)
@@ -0,0 +1,15 @@
.. SPDX-License-Identifier: GPL-2.0

========================
Monitoring Data Accesses
========================

:doc:`DAMON </vm/damon/index>` allows light-weight data access monitoring.
Using DAMON, users can analyze the memory access patterns of their systems and
optimize those.

.. toctree::
   :maxdepth: 2

   start
   usage

Documentation/admin-guide/mm/damon/start.rst (new file)
@@ -0,0 +1,114 @@
.. SPDX-License-Identifier: GPL-2.0

===============
Getting Started
===============

This document briefly describes how you can use DAMON by demonstrating its
default user space tool. Please note that this document describes only a part
of its features for brevity. Please refer to :doc:`usage` for more details.


TL; DR
======

Follow the commands below to monitor and visualize the memory access pattern of
your workload. ::

    # # build the kernel with CONFIG_DAMON_*=y, install it, and reboot
    # mount -t debugfs none /sys/kernel/debug/
    # git clone https://github.com/awslabs/damo
    # ./damo/damo record $(pidof <your workload>)
    # ./damo/damo report heat --plot_ascii

The final command draws the access heatmap of ``<your workload>``. The heatmap
shows which memory region (x-axis) is accessed when (y-axis) and how frequently
(number; the higher the more accesses have been observed). ::

    111111111111111111111111111111111111111111111111111111110000
    111121111111111111111111111111211111111111111111111111110000
    000000000000000000000000000000000000000000000000001555552000
    000000000000000000000000000000000000000000000222223555552000
    000000000000000000000000000000000000000011111677775000000000
    000000000000000000000000000000000000000488888000000000000000
    000000000000000000000000000000000177888400000000000000000000
    000000000000000000000046666522222100000000000000000000000000
    000000000000000000000014444344444300000000000000000000000000
    000000000000000002222245555510000000000000000000000000000000
    # access_frequency: 0 1 2 3 4 5 6 7 8 9
    # x-axis: space (140286319947776-140286426374096: 101.496 MiB)
    # y-axis: time (605442256436361-605479951866441: 37.695430s)
    # resolution: 60x10 (1.692 MiB and 3.770s for each character)


Prerequisites
=============

Kernel
------

You should first ensure your system is running on a kernel built with
``CONFIG_DAMON_*=y``.
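
One quick way to verify this, assuming your distribution installs the kernel
configuration under ``/boot`` (the file name, location, and the exact set of
``CONFIG_DAMON_*`` options shown below are illustrative only)::

    $ grep CONFIG_DAMON /boot/config-$(uname -r)
    CONFIG_DAMON=y
    CONFIG_DAMON_VADDR=y
    CONFIG_DAMON_DBGFS=y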


User Space Tool
---------------

For the demonstration, we will use the default user space tool for DAMON,
called DAMON Operator (DAMO). It is available at
https://github.com/awslabs/damo. The examples below assume that ``damo`` is on
your ``$PATH``. It's not mandatory, though.

Because DAMO is using the debugfs interface (refer to :doc:`usage` for the
detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as
below::

    # mount -t debugfs none /sys/kernel/debug/

or append the following line to your ``/etc/fstab`` file so that your system
can automatically mount debugfs upon booting::

    debugfs /sys/kernel/debug debugfs defaults 0 0


Recording Data Access Patterns
==============================

The commands below record the memory access patterns of a program and save the
monitoring results to a file. ::

    $ git clone https://github.com/sjp38/masim
    $ cd masim; make; ./masim ./configs/zigzag.cfg &
    $ sudo damo record -o damon.data $(pidof masim)

The first two lines of the commands download an artificial memory access
generator program and run it in the background. The generator will repeatedly
access two 100 MiB sized memory regions one by one. You can substitute this
with your real workload. The last line asks ``damo`` to record the access
pattern in the ``damon.data`` file.


Visualizing Recorded Patterns
=============================

The following three commands visualize the recorded access patterns and save
the results as separate image files. ::

    $ damo report heats --heatmap access_pattern_heatmap.png
    $ damo report wss --range 0 101 1 --plot wss_dist.png
    $ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png

- ``access_pattern_heatmap.png`` will visualize the data access pattern in a
  heatmap, showing which memory region (y-axis) got accessed when (x-axis)
  and how frequently (color).
- ``wss_dist.png`` will show the distribution of the working set size.
- ``wss_chron_change.png`` will show how the working set size has
  chronologically changed.

You can view the visualizations of this example workload at [1]_.
Visualizations of other realistic workloads are available at [2]_ [3]_ [4]_.

.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
.. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
.. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html

Documentation/admin-guide/mm/damon/usage.rst (new file)
@@ -0,0 +1,112 @@
.. SPDX-License-Identifier: GPL-2.0

===============
Detailed Usages
===============

DAMON provides the three interfaces below for different users.

- *DAMON user space tool.*
  This is for privileged people such as system administrators who want a
  just-working human-friendly interface. Using this, users can use DAMON's
  major features in a human-friendly way. It may not be highly tuned for
  special cases, though. It supports only virtual address spaces monitoring.
- *debugfs interface.*
  This is for privileged user space programmers who want more optimized use of
  DAMON. Using this, users can use DAMON's major features by reading
  from and writing to special debugfs files. Therefore, you can write and use
  your personalized DAMON debugfs wrapper programs that read/write the
  debugfs files instead of you. The DAMON user space tool is also a reference
  implementation of such programs. It supports only virtual address spaces
  monitoring.
- *Kernel Space Programming Interface.*
  This is for kernel space programmers. Using this, users can utilize every
  feature of DAMON most flexibly and efficiently by writing kernel space
  DAMON application programs. You can even extend DAMON for various
  address spaces.

Nevertheless, you could write your own user space tool using the debugfs
interface. A reference implementation is available at
https://github.com/awslabs/damo. If you are a kernel programmer, you could
refer to :doc:`/vm/damon/api` for the kernel space programming interface. For
that reason, this document describes only the debugfs interface.

debugfs Interface
=================

DAMON exports three files, ``attrs``, ``target_ids``, and ``monitor_on``, under
its debugfs directory, ``<debugfs>/damon/``.


Attributes
----------

Users can get and set the ``sampling interval``, ``aggregation interval``,
``regions update interval``, and min/max number of monitoring target regions by
reading from and writing to the ``attrs`` file. To know about the monitoring
attributes in detail, please refer to :doc:`/vm/damon/design`. For
example, the commands below set those values to 5 ms, 100 ms, 1,000 ms, 10 and
1000, and then check them again::

    # cd <debugfs>/damon
    # echo 5000 100000 1000000 10 1000 > attrs
    # cat attrs
    5000 100000 1000000 10 1000


Target IDs
----------

Some types of address spaces support multiple monitoring targets. For example,
the virtual memory address spaces monitoring can have multiple processes as the
monitoring targets. Users can set the targets by writing relevant id values of
the targets to, and get the ids of the current targets by reading from, the
``target_ids`` file. In case of the virtual address spaces monitoring, the
values should be pids of the monitoring target processes. For example, the
commands below set processes having pids 42 and 4242 as the monitoring targets
and check it again::

    # cd <debugfs>/damon
    # echo 42 4242 > target_ids
    # cat target_ids
    42 4242

Note that setting the target ids doesn't start the monitoring.


Turning On/Off
--------------

Setting the files as described above has no effect unless you explicitly
start the monitoring. You can start, stop, and check the current status of the
monitoring by writing to and reading from the ``monitor_on`` file. Writing
``on`` to the file starts the monitoring of the targets with the attributes.
Writing ``off`` to the file stops the monitoring. DAMON also stops if every
target process is terminated. The example commands below turn on, off, and
check the status of DAMON::

    # cd <debugfs>/damon
    # echo on > monitor_on
    # echo off > monitor_on
    # cat monitor_on
    off

Please note that you cannot write to the above-mentioned debugfs files while
the monitoring is turned on. If you write to the files while DAMON is running,
an error code such as ``-EBUSY`` will be returned.


Tracepoint for Monitoring Results
=================================

DAMON provides the monitoring results via a tracepoint,
``damon:damon_aggregated``. While the monitoring is turned on, you could
record the tracepoint events and show results using tracepoint supporting tools
like ``perf``. For example::

    # echo on > monitor_on
    # perf record -e damon:damon_aggregated &
    # sleep 5
    # kill -9 $(pidof perf)
    # echo off > monitor_on
    # perf script

@@ -27,6 +27,7 @@ the Linux memory management.

   concepts
   cma_debugfs
   damon/index
   hugetlbpage
   idle_page_tracking
   ksm

@@ -1,466 +1,576 @@

.. _admin_guide_memory_hotplug:

==============
Memory Hotplug
==============
==================
Memory Hot(Un)Plug
==================

:Created: Jul 28 2007
:Updated: Add some details about locking internals: Aug 20 2018

This document is about memory hotplug including how-to-use and current status.
Because Memory Hotplug is still under development, contents of this text will
be changed often.
This document describes generic Linux support for memory hot(un)plug with
a focus on System RAM, including ZONE_MOVABLE support.

.. contents:: :local:

.. note::

    (1) x86_64 has a special implementation for memory hotplug.
        This text does not describe it.
    (2) This text assumes that sysfs is mounted at ``/sys``.


Introduction
============

Purpose of memory hotplug
-------------------------
Memory hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime. In the simplest case, it consists of
physically plugging or unplugging a DIMM at runtime, coordinated with the
operating system.

Memory Hotplug allows users to increase/decrease the amount of memory.
Generally, there are two purposes.
Memory hot(un)plug is used for various purposes:

(A) For changing the amount of memory.
    This is to allow a feature like capacity on demand.
(B) For installing/removing DIMMs or NUMA-nodes physically.
    This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.
- The physical memory available to a machine can be adjusted at runtime, up- or
  downgrading the memory capacity. This dynamic memory resizing, sometimes
  referred to as "capacity on demand", is frequently used with virtual machines
  and logical partitions.

(A) is required by highly virtualized environments and (B) is required by
hardware which supports memory power management.
- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
  example is replacing failing memory modules.

Linux memory hotplug is designed for both purposes.
- Reducing energy consumption either by physically unplugging memory modules or
  by logically unplugging (parts of) memory modules from Linux.

Phases of memory hotplug
------------------------
Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
used to expose persistent memory, other performance-differentiated memory and
reserved memory regions as ordinary system RAM to Linux.

There are 2 phases in Memory Hotplug:
Linux only supports memory hot(un)plug on selected 64 bit architectures, such as
x86_64, arm64, ppc64, s390x and ia64.

1) Physical Memory Hotplug phase
2) Logical Memory Hotplug phase.
Memory Hot(Un)Plug Granularity
------------------------------

The first phase is to communicate with hardware/firmware and make/erase the
environment for hotplugged memory. Basically, this phase is necessary
for purpose (B), but it is a good phase for communication between
highly virtualized environments too.

When memory is hotplugged, the kernel recognizes new memory, makes new memory
management tables, and makes sysfs files for new memory's operation.

If firmware supports notification of connection of new memory to OS,
this phase is triggered automatically. ACPI can notify this event. If not,
the "probe" operation by system administration is used instead.
(see :ref:`memory_hotplug_physical_mem`).

Logical Memory Hotplug phase is to change memory state into
available/unavailable for users. Amount of memory from the user's view is
changed by this phase. The kernel makes all memory in it available as free
pages when a memory range is available.

In this document, this phase is described as online/offline.

Logical Memory Hotplug phase is triggered by a write of a sysfs file by the
system administrator. For the hot-add case, it must be executed after Physical
Hotplug phase by hand.
(However, if you write udev's hotplug scripts for memory hotplug, these
phases can be executed in a seamless way.)

Unit of Memory online/offline operation
---------------------------------------

Memory hotplug uses the SPARSEMEM memory model which allows memory to be divided
into chunks of the same size. These chunks are called "sections". The size of
a memory section is architecture dependent. For example, power uses 16MiB, ia64
uses 1GiB.
Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
physical memory address space into chunks of the same size: memory sections. The
size of a memory section is architecture dependent. For example, x86_64 uses
128 MiB and ppc64 uses 16 MiB.

Memory sections are combined into chunks referred to as "memory blocks". The
size of a memory block is architecture dependent and represents the logical
unit upon which memory online/offline operations are to be performed. The
default size of a memory block is the same as memory section size unless an
architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.)
size of a memory block is architecture dependent and corresponds to the smallest
granularity that can be hot(un)plugged. The default size of a memory block is
the same as memory section size, unless an architecture specifies otherwise.

To determine the size (in bytes) of a memory block please read this file::
All memory blocks have the same size.

    /sys/devices/system/memory/block_size_bytes
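
For example (the value is printed in hexadecimal; ``8000000``, i.e. 128 MiB, is
only an illustrative value, as the block size differs between systems)::

    % cat /sys/devices/system/memory/block_size_bytes
    8000000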

Phases of Memory Hotplug
------------------------

Kernel Configuration
====================
Memory hotplug consists of two phases:

To use the memory hotplug feature, the kernel must be compiled with the
following config options.
(1) Adding the memory to Linux
(2) Onlining memory blocks

- For all memory hotplug:
    - Memory model -> Sparse Memory (``CONFIG_SPARSEMEM``)
    - Allow for memory hot-add (``CONFIG_MEMORY_HOTPLUG``)
In the first phase, metadata, such as the memory map ("memmap") and page tables
for the direct mapping, is allocated and initialized, and memory blocks are
created; the latter also creates sysfs files for managing newly created memory
blocks.

- To enable memory removal, the following are also necessary:
    - Allow for memory hot remove (``CONFIG_MEMORY_HOTREMOVE``)
    - Page Migration (``CONFIG_MIGRATION``)
In the second phase, added memory is exposed to the page allocator. After this
phase, the memory is visible in memory statistics, such as free and total
memory, of the system.

- For ACPI memory hotplug, the following are also necessary:
    - Memory hotplug (under ACPI Support menu) (``CONFIG_ACPI_HOTPLUG_MEMORY``)
    - This option can be a kernel module.
Phases of Memory Hotunplug
--------------------------

- As a related configuration, if your box has a feature of NUMA-node hotplug
  via ACPI, then this option is necessary too.
Memory hotunplug consists of two phases:

    - ACPI0004, PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
      (``CONFIG_ACPI_CONTAINER``).
(1) Offlining memory blocks
(2) Removing the memory from Linux

    This option can be a kernel module too.
In the first phase, memory is "hidden" from the page allocator again, for
example, by migrating busy memory to other memory locations and removing all
relevant free pages from the page allocator. After this phase, the memory is no
longer visible in memory statistics of the system.

In the second phase, the memory blocks are removed and metadata is freed.

.. _memory_hotplug_sysfs_files:
Memory Hotplug Notifications
============================

sysfs files for memory hotplug
There are various ways how Linux is notified about memory hotplug events such
that it can start adding hotplugged memory. This description is limited to
systems that support ACPI; mechanisms specific to other firmware interfaces or
virtual machines are not described.

ACPI Notifications
------------------

Platforms that support ACPI, such as x86_64, can support memory hotplug
notifications via ACPI.

In general, a firmware supporting memory hotplug defines a memory class object
HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI
driver will hotplug the memory to Linux.

If the firmware supports hotplug of NUMA nodes, it defines an object _HID
"ACPI0004", "PNP0A05", or "PNP0A06". When notified about a hotplug event, all
assigned memory devices are added to Linux by the ACPI driver.

Similarly, Linux can be notified about requests to hotunplug a memory device or
a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
blocks, and, if successful, hotunplug the memory from Linux.

Manual Probing
--------------

On some architectures, the firmware may not be able to notify the operating
system about a memory hotplug event. Instead, the memory has to be manually
probed from user space.

The probe interface is located at::

    /sys/devices/system/memory/probe

Only complete memory blocks can be probed. Individual memory blocks are probed
by providing the physical start address of the memory block::

    % echo addr > /sys/devices/system/memory/probe

This results in a memory block for the range [addr, addr + memory_block_size)
being created.
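
As a purely hypothetical example, assuming a 1 GiB memory block size and new
memory starting at physical address 0x140000000::

    % echo 0x140000000 > /sys/devices/system/memory/probe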

.. note::

  Using the probe interface is discouraged as it is easy to crash the kernel,
  because Linux cannot validate user input; this interface might be removed in
  the future.

Onlining and Offlining Memory Blocks
====================================

After a memory block has been created, Linux has to be instructed to actually
make use of that memory: the memory block has to be "online".

Before a memory block can be removed, Linux has to stop using any memory part of
the memory block: the memory block has to be "offlined".

The Linux kernel can be configured to automatically online added memory blocks
and drivers automatically trigger offlining of memory blocks when trying
to hotunplug memory. Memory blocks can only be removed once offlining succeeded
and drivers may trigger offlining of memory blocks when attempting hotunplug of
memory.

Onlining Memory Blocks Manually
-------------------------------

If auto-onlining of memory blocks isn't enabled, user space has to manually
trigger onlining of memory blocks. Often, udev rules are used to automate this
task in user space, as sketched below.
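
A minimal sketch of such a udev rule (the file name is arbitrary, and your
distribution may already ship an equivalent rule)::

    # /etc/udev/rules.d/99-memory-onlining.rules
    SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"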

Onlining of a memory block can be triggered via::

    % echo online > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

    % echo 1 > /sys/devices/system/memory/memoryXXX/online

The kernel will select the target zone automatically, usually defaulting to
``ZONE_NORMAL`` unless ``movablecore=1`` has been specified on the kernel
command line or if the memory block would intersect the ZONE_MOVABLE already.

One can explicitly request to associate an offline memory block with
ZONE_MOVABLE by::

    % echo online_movable > /sys/devices/system/memory/memoryXXX/state

Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::

    % echo online_kernel > /sys/devices/system/memory/memoryXXX/state

In any case, if onlining succeeds, the state of the memory block is changed to
be "online". If it fails, the state of the memory block will remain unchanged
and the above commands will fail.

Onlining Memory Blocks Automatically
------------------------------------

The kernel can be configured to try auto-onlining of newly added memory blocks.
If this feature is disabled, the memory blocks will stay offline until
explicitly onlined from user space.

The configured auto-online behavior can be observed via::

    % cat /sys/devices/system/memory/auto_online_blocks

Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
``online_movable`` to that file, like::

    % echo online > /sys/devices/system/memory/auto_online_blocks

Modifying the auto-online behavior will only affect subsequently added
memory blocks.

.. note::

  In corner cases, auto-onlining can fail. The kernel won't retry. Note that
  auto-onlining is not expected to fail in default configurations.

.. note::

  DLPAR on ppc64 ignores the ``offline`` setting and will still online added
  memory blocks; if onlining fails, memory blocks are removed again.

Offlining Memory Blocks
-----------------------

In the current implementation, Linux's memory offlining will try migrating all
movable pages off the affected memory block. As most kernel allocations, such as
page tables, are unmovable, page migration can fail and, therefore, inhibit
memory offlining from succeeding.

Having the memory provided by a memory block managed by ZONE_MOVABLE significantly
increases memory offlining reliability; still, memory offlining can fail in
some corner cases.

Further, memory offlining might retry for a long time (or even forever), until
aborted by the user.

Offlining of a memory block can be triggered via::

    % echo offline > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

    % echo 0 > /sys/devices/system/memory/memoryXXX/online

If offlining succeeds, the state of the memory block is changed to be "offline".
If it fails, the state of the memory block will remain unchanged and the above
commands will fail, for example, via::

    bash: echo: write error: Device or resource busy

or via::

    bash: echo: write error: Invalid argument

Observing the State of Memory Blocks
------------------------------------

The state (online/offline/going-offline) of a memory block can be observed
either via::

    % cat /sys/devices/system/memory/memoryXXX/state

Or alternatively (1/0) via::

    % cat /sys/devices/system/memory/memoryXXX/online

For an online memory block, the managing zone can be observed via::

    % cat /sys/devices/system/memory/memoryXXX/valid_zones

Configuring Memory Hot(Un)Plug
==============================

All memory blocks have their device information in sysfs. Each memory block
is described under ``/sys/devices/system/memory`` as::
There are various ways how system administrators can configure memory
hot(un)plug and interact with memory blocks, especially, to online them.

Memory Hot(Un)Plug Configuration via Sysfs
------------------------------------------

Some memory hot(un)plug properties can be configured or inspected via sysfs in::

    /sys/devices/system/memory/

The following files are currently defined:

====================== =========================================================
``auto_online_blocks`` read-write: set or get the default state of new memory
                       blocks; configure auto-onlining.

                       The default value depends on the
                       CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
                       option.

                       See the ``state`` property of memory blocks for details.
``block_size_bytes``   read-only: the size in bytes of a memory block.
``probe``              write-only: add (probe) selected memory blocks manually
                       from user space by supplying the physical start address.

                       Availability depends on the CONFIG_ARCH_MEMORY_PROBE
                       kernel configuration option.
``uevent``             read-write: generic udev file for device subsystems.
====================== =========================================================

.. note::

  When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
  additional files ``hard_offline_page`` and ``soft_offline_page`` are available
  to trigger hwpoisoning of pages, for example, for testing purposes. Note that
  this functionality is not really related to memory hot(un)plug or actual
  offlining of memory blocks.

Memory Block Configuration via Sysfs
------------------------------------

Each memory block is represented as a memory block device that can be
onlined or offlined. All memory blocks have their device information located in
sysfs. Each present memory block is listed under
``/sys/devices/system/memory`` as::

    /sys/devices/system/memory/memoryXXX

where XXX is the memory block id.
where XXX is the memory block id; the number of digits is variable.

For the memory block covered by the sysfs directory, it is expected that all
memory sections in this range are present and no memory holes exist in the
range. Currently there is no way to determine if there is a memory hole, but
the existence of one should not affect the hotplug capabilities of the memory
block.
A present memory block indicates that some memory in the range is present;
however, a memory block might span memory holes. A memory block spanning memory
holes cannot be offlined.

For example, assume a 1 GiB memory block size. A device for a memory starting at
For example, assume a 1 GiB memory block size. A device for a memory starting at
0x100000000 is ``/sys/devices/system/memory/memory4``::

    (0x100000000 / 1GiB = 4)

This device covers address range [0x100000000 ... 0x140000000)
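
The block id can be computed from the physical address in the same way, for
example in a shell (again assuming a 1 GiB, i.e. 0x40000000 byte, block size)::

    % printf '%d\n' $(( 0x100000000 / 0x40000000 ))
    4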

Under each memory block, you can see 5 files:

- ``/sys/devices/system/memory/memoryXXX/phys_index``
- ``/sys/devices/system/memory/memoryXXX/phys_device``
- ``/sys/devices/system/memory/memoryXXX/state``
- ``/sys/devices/system/memory/memoryXXX/removable``
- ``/sys/devices/system/memory/memoryXXX/valid_zones``
The following files are currently defined:

=================== ============================================================
``phys_index``      read-only and contains the memory block id, same as XXX.
``state``           read-write

                    - at read: contains online/offline state of memory.
                    - at write: user can specify "online_kernel",

                      "online_movable", "online", "offline" command
                      which will be performed on all sections in the block.
``online``          read-write: simplified interface to trigger onlining /
                    offlining and to observe the state of a memory block.
                    When onlining, the zone is selected automatically.
``phys_device``     read-only: legacy interface only ever used on s390x to
                    expose the covered storage increment.
``phys_index``      read-only: the memory block id (XXX).
``removable``       read-only: legacy interface that indicated whether a memory
                    block was likely to be offlineable or not. Newer kernel
                    versions return "1" if and only if the kernel supports
                    memory offlining.
``valid_zones``     read-only: designed to show by which zone memory provided by
                    a memory block is managed, and to show by which zone memory
                    provided by an offline memory block could be managed when
                    onlining.

                    block was likely to be offlineable or not. Nowadays, the
                    kernel returns ``1`` if and only if it supports memory
                    offlining.
``state``           read-write: advanced interface to trigger onlining /
                    offlining and to observe the state of a memory block.

                    The first column shows its default zone.
                    When writing, ``online``, ``offline``, ``online_kernel`` and
                    ``online_movable`` are supported.

                    "memory6/valid_zones: Normal Movable" shows this memory block
                    can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE
                    by online_movable.
                    ``online_movable`` specifies onlining to ZONE_MOVABLE.
                    ``online_kernel`` specifies onlining to the default kernel
                    zone for the memory block, such as ZONE_NORMAL.
                    ``online`` lets the kernel select the zone automatically.

                    "memory7/valid_zones: Movable Normal" shows this memory block
                    can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL
                    by online_kernel.
                    When reading, ``online``, ``offline`` and ``going-offline``
                    may be returned.
``uevent``          read-write: generic uevent file for devices.
``valid_zones``     read-only: when a block is online, shows the zone it
                    belongs to; when a block is offline, shows what zone will
                    manage it when the block will be onlined.

                    For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
                    ``Movable`` and ``none`` may be returned. ``none`` indicates
                    that memory provided by a memory block is managed by
                    multiple zones or spans multiple nodes; such memory blocks
                    cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
                    Other values indicate a kernel zone.

                    For offline memory blocks, the first column shows the
                    zone the kernel would select when onlining the memory block
                    right now without further specifying a zone.

                    Availability depends on the CONFIG_MEMORY_HOTREMOVE
                    kernel configuration option.
=================== ============================================================

.. note::

  These directories/files appear after the physical memory hotplug phase.
  If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
  directories can also be accessed via symbolic links located in the
  ``/sys/devices/system/node/node*`` directories.

If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed
via symbolic links located in the ``/sys/devices/system/node/node*`` directories.

For example::

    /sys/devices/system/node/node0/memory9 -> ../../memory/memory9

A backlink will also be created::

    /sys/devices/system/memory/memory9/node0 -> ../../node/node0

.. _memory_hotplug_physical_mem:
Command Line Parameters
-----------------------

Physical memory hot-add phase
=============================
Some command line parameters affect memory hot(un)plug handling. The following
command line parameters are relevant:

Hardware(Firmware) Support
--------------------------
======================== =======================================================
``memhp_default_state``  configure auto-onlining by essentially setting
                         ``/sys/devices/system/memory/auto_online_blocks``.
``movablecore``          configure automatic zone selection of the kernel. When
                         set, the kernel will default to ZONE_MOVABLE, unless
                         other zones can be kept contiguous.
======================== =======================================================
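
As an illustration only (the parameter values here are made up), both could be
specified together on the kernel command line like::

    memhp_default_state=online movablecore=512M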

On the x86_64/ia64 platform, memory hotplug by ACPI is supported.
Module Parameters
------------------

In general, the firmware (ACPI) which supports memory hotplug defines a
memory class object of _HID "PNP0C80". When a notify is asserted to PNP0C80,
Linux's ACPI handler does hot-add memory to the system and calls a hotplug udev
script. This will be done automatically.
Instead of additional command line parameters or sysfs files, the
``memory_hotplug`` subsystem now provides a dedicated namespace for module
parameters. Module parameters can be set via the command line by predicating
them with ``memory_hotplug.`` such as::

But scripts for memory hotplug are not contained in the generic udev package (now).
You may have to write them yourself or online/offline memory by hand.
Please see :ref:`memory_hotplug_how_to_online_memory` and
:ref:`memory_hotplug_how_to_offline_memory`.
    memory_hotplug.memmap_on_memory=1

If firmware supports NUMA-node hotplug, and defines an object _HID "ACPI0004",
"PNP0A05", or "PNP0A06", notification is asserted to it, and the ACPI handler
calls hotplug code for all of the objects which are defined in it.
If a memory device is found, memory hotplug code will be called.
and they can be observed (and some even modified at runtime) via::

Notify memory hot-add event by hand
-----------------------------------
    /sys/module/memory_hotplug/parameters/

On some architectures, the firmware may not notify the kernel of a memory
hotplug event. Therefore, the memory "probe" interface is supported to
explicitly notify the kernel. This interface depends on
CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86
if hotplug is supported, although for x86 this should be handled by ACPI
notification.
The following module parameters are currently defined:

The probe interface is located at::
======================== =======================================================
``memmap_on_memory``     read-write: Allocate memory for the memmap from the
                         added memory block itself. Even if enabled, actual
                         support depends on various other system properties and
                         should only be regarded as a hint whether the behavior
                         would be desired.

    /sys/devices/system/memory/probe
                         While allocating the memmap from the memory block
                         itself makes memory hotplug less likely to fail and
                         keeps the memmap on the same NUMA node in any case, it
                         can fragment physical memory in a way that huge pages
                         in bigger granularity cannot be formed on hotplugged
                         memory.
======================== =======================================================
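
For example, the current value of ``memmap_on_memory`` could be checked at
runtime via (the ``N`` shown here is merely the usual default)::

    % cat /sys/module/memory_hotplug/parameters/memmap_on_memory
    N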

You can tell the physical address of new memory to the kernel by::
ZONE_MOVABLE
============

    % echo start_address_of_new_memory > /sys/devices/system/memory/probe
ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
Further, having system RAM managed by ZONE_MOVABLE instead of one of the
kernel zones can increase the number of possible transparent huge pages and
dynamically allocated huge pages.

Then, the [start_address_of_new_memory, start_address_of_new_memory +
memory_block_size] memory range is hot-added. In this case, the hotplug script
is not called (in the current implementation). You'll have to online the memory
yourself. Please see :ref:`memory_hotplug_how_to_online_memory`.
Most kernel allocations are unmovable. Important examples include the memory
map (usually 1/64ths of memory), page tables, and kmalloc(). Such allocations
can only be served from the kernel zones.

Logical Memory hot-add phase
============================
Most user space pages, such as anonymous memory, and page cache pages are
movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.

State of memory
Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
absolutely no guarantee whether a memory block can be offlined successfully.

Zone Imbalances
---------------

To see the (online/offline) state of a memory block, read the 'state' file::
Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
which can harm the system or degrade performance. As one example, the kernel
might crash because it runs out of free memory for unmovable allocations,
although there is still plenty of free memory left in ZONE_MOVABLE.

    % cat /sys/devices/system/memory/memoryXXX/state
Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
are definitely impossible due to the overhead for the memory map.
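
As a quick, purely illustrative calculation: on a machine with 64 GiB of system
RAM, a 3:1 MOVABLE:KERNEL ratio corresponds to roughly::

    ZONE_MOVABLE: 48 GiB
    kernel zones: 16 GiB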

- If the memory block is online, you'll read "online".
- If the memory block is offline, you'll read "offline".


.. _memory_hotplug_how_to_online_memory:

How to online memory
--------------------

When the memory is hot-added, the kernel decides whether or not to "online"
it according to the policy which can be read from the "auto_online_blocks" file::

    % cat /sys/devices/system/memory/auto_online_blocks

The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
option. If it is disabled the default is "offline" which means the newly added
memory is not in a ready-to-use state and you have to "online" the newly added
memory blocks manually. Automatic onlining can be requested by writing "online"
to the "auto_online_blocks" file::

    % echo online > /sys/devices/system/memory/auto_online_blocks

This sets a global policy and impacts all memory blocks that will subsequently
be hotplugged. Currently offline blocks keep their state. It is possible, under
certain circumstances, that some memory blocks will be added but will fail to
online. User space tools can check their "state" files
(``/sys/devices/system/memory/memoryXXX/state``) and try to online them manually.

If the automatic onlining wasn't requested, failed, or some memory block was
offlined it is possible to change the individual block's state by writing to the
"state" file::

    % echo online > /sys/devices/system/memory/memoryXXX/state

This onlining will not change the ZONE type of the target memory block.
If the memory block doesn't belong to any zone an appropriate kernel zone
(usually ZONE_NORMAL) will be used unless the movable_node kernel command line
option is specified, in which case ZONE_MOVABLE will be used.

You can explicitly request to associate it with ZONE_MOVABLE by::

    % echo online_movable > /sys/devices/system/memory/memoryXXX/state

.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE

Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by::

    % echo online_kernel > /sys/devices/system/memory/memoryXXX/state

.. note:: current limit: this memory block must be adjacent to ZONE_NORMAL

An explicit zone onlining can fail (e.g. when the range is already within
an existing and incompatible zone).

After this, memory block XXX's state will be 'online' and the amount of
available memory will be increased.

This may be changed in future.

Logical memory remove
=====================

Memory offline and ZONE_MOVABLE
-------------------------------

Memory offlining is more complicated than memory online. Because memory offline
has to make the whole memory block be unused, memory offline can fail if
the memory block includes memory which cannot be freed.

In general, memory offline can use 2 techniques.

(1) reclaim and free all memory in the memory block.
(2) migrate all pages in the memory block.

In the current implementation, Linux's memory offline uses method (2), freeing
all pages in the memory block by page migration. But not all pages are
migratable. Under current Linux, migratable pages are anonymous pages and
page caches. For offlining a memory block by migration, the kernel has to
guarantee that the memory block contains only migratable pages.

Now, a boot option for making a memory block which consists of migratable pages
is supported. By specifying the "kernelcore=" or "movablecore=" boot option, you
can create ZONE_MOVABLE...a zone which is just used for movable pages.
(See also Documentation/admin-guide/kernel-parameters.rst)

Assume the system has "TOTAL" amount of memory at boot time; this boot option
creates ZONE_MOVABLE as follows.

1) When the kernelcore=YYYY boot option is used,
   Size of memory not for movable pages (not for offline) is YYYY.
   Size of memory for movable pages (for offline) is TOTAL-YYYY.

2) When the movablecore=ZZZZ boot option is used,
   Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
   Size of memory for movable pages (for offline) is ZZZZ.
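
A quick worked example with made-up sizes: on a system where TOTAL is 16 GiB,
booting with ``movablecore=4G`` results in::

    kernel zones (not offlinable): 16 GiB - 4 GiB = 12 GiB
    ZONE_MOVABLE (offlinable):     4 GiB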

Actual safe zone ratios depend on the workload. Extreme cases, like excessive
long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.

.. note::

  Unfortunately, there is no information to show which memory block belongs
  to ZONE_MOVABLE. This is TBD.
  CMA memory part of a kernel zone essentially behaves like memory in
  ZONE_MOVABLE and similar considerations apply, especially when combining
  CMA with ZONE_MOVABLE.

Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE
and the feature of freeing unused vmemmap pages associated with each hugetlb
page is enabled.
ZONE_MOVABLE Sizing Considerations
----------------------------------

This can happen when we have plenty of ZONE_MOVABLE memory, but not enough
kernel memory to allocate vmemmap pages. We may even be able to migrate
huge page contents, but will not be able to dissolve the source huge page.
This will prevent an offline operation and is unfortunate as memory offlining
is expected to succeed on movable zones. Users that depend on memory hotplug
to succeed for movable zones should carefully consider whether the memory
savings gained from this feature are worth the risk of possibly not being
able to offline memory in certain situations.
We usually expect that a large portion of available system RAM will actually
be consumed by user space, either directly or indirectly via the page cache. In
the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.

.. note::
  Techniques that rely on long-term pinnings of memory (especially, RDMA and
  vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
  hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
  memory can still get hot removed - be aware that pinning can fail even if
  there is plenty of free memory in ZONE_MOVABLE. In addition, using
  ZONE_MOVABLE might make page pinning more expensive, because pages have to be
  migrated off that zone first.
With that in mind, it makes sense that we can have a big portion of system RAM
managed by ZONE_MOVABLE. However, there are some things to consider when using
ZONE_MOVABLE, especially when fine-tuning zone ratios:

.. _memory_hotplug_how_to_offline_memory:

- Having a lot of offline memory blocks. Even offline memory blocks consume
  memory for metadata and page tables in the direct map; having a lot of offline
  memory blocks is not a typical case, though.

How to offline memory
---------------------
- Memory ballooning without balloon compaction is incompatible with
  ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
  pseries CMM, fully support balloon compaction.

You can offline a memory block by using the same sysfs interface that was used
in memory onlining::
  Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
  disabled. In that case, balloon inflation will only perform unmovable
  allocations and silently create a zone imbalance, usually triggered by
  inflation requests from the hypervisor.

    % echo offline > /sys/devices/system/memory/memoryXXX/state
- Gigantic pages are unmovable, resulting in user space consuming a
  lot of unmovable memory.

If offlining succeeds, the state of the memory block is changed to be "offline".
If it fails, some error code (like -EBUSY) will be returned by the kernel.
Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
it. If it doesn't contain 'unmovable' memory, you'll get success.
- Huge pages are unmovable when an architecture does not support huge
  page migration, resulting in a similar issue as with gigantic pages.

A memory block under ZONE_MOVABLE is considered to be able to be offlined
easily. But under some busy state, it may return -EBUSY. Even if a memory
block cannot be offlined due to -EBUSY, you can retry offlining it and may be
able to offline it (or not). (For example, a page is referred to by some kernel
internal call and released soon.)
- Page tables are unmovable. Excessive swapping, mapping extremely large
  files or ZONE_DEVICE memory can be problematic, although only really relevant
  in corner cases. When we manage a lot of user space memory that has been
  swapped out or is served from a file/persistent memory/... we still need a lot
  of page tables to manage that memory once user space accessed that memory.

Consideration:
Memory hotplug's design direction is to make the possibility of memory
offlining higher and to guarantee unplugging memory under any situation. But
it needs more work. Returning -EBUSY under some situation may be good because
the user can decide to retry more or not by himself. Currently, memory
offlining code does some amount of retry with a 120 seconds timeout.
- In certain DAX configurations the memory map for the device memory will be
  allocated from the kernel zones.

Physical memory remove
======================
- KASAN can have a significant memory overhead, for example, consuming 1/8th of
  the total system memory size as (unmovable) tracking metadata.

Need more implementation yet....
 - Notification completion of remove works by OS to firmware.
 - Guard from remove if not yet.
- Long-term pinning of pages. Techniques that rely on long-term pinnings
  (especially, RDMA and vfio/mdev) are fundamentally problematic with
  ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
  on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
  have to be migrated off that zone while pinning. Pinning a page can fail
  even if there is plenty of free memory in ZONE_MOVABLE.

  In addition, using ZONE_MOVABLE might make page pinning more expensive,
  because of the page migration overhead.

Locking Internals
=================
By default, all the memory configured at boot time is managed by the kernel
zones and ZONE_MOVABLE is not used.

When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
the device_hotplug_lock should be held to:
To enable ZONE_MOVABLE to include the memory present at boot and to control the
ratio between movable and kernel zones there are two command line options:
``kernelcore=`` and ``movablecore=``. See
Documentation/admin-guide/kernel-parameters.rst for their description.

- synchronize against online/offline requests (e.g. via sysfs). This way, memory
  block devices can only be accessed (.online/.state attributes) by user
  space once memory has been fully added. And when removing memory, we
  know nobody is in critical sections.
- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
Memory Offlining and ZONE_MOVABLE
---------------------------------

Especially, there is a possible lock inversion that is avoided using
device_hotplug_lock when adding memory and user space tries to online that
memory faster than expected:
Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
block might fail:

- device_online() will first take the device_lock(), followed by
  mem_hotplug_lock
- add_memory_resource() will first take the mem_hotplug_lock, followed by
  the device_lock() (while creating the devices, during bus_add_device()).
- Memory blocks with memory holes; this applies to memory blocks present during
  boot and can apply to memory blocks hotplugged via the XEN balloon and the
  Hyper-V balloon.

As the device is visible to user space before taking the device_lock(), this
can result in a lock inversion.
- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
  offlining; this applies to memory blocks present during boot only.

onlining/offlining of memory should be done via device_online()/
device_offline() - to make sure it is properly synchronized to actions
via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type).
- Special memory blocks prevented by the system from getting offlined. Examples
  include any memory available during boot on arm64 or memory blocks spanning
  the crashkernel area on s390x; this usually applies to memory blocks present
  during boot only.

When adding/removing/onlining/offlining memory or adding/removing
heterogeneous/device memory, we should always hold the mem_hotplug_lock in
write mode to serialise memory hotplug (e.g. access to global/zone
variables).
- Memory blocks overlapping with CMA areas cannot be offlined; this applies to
  memory blocks present during boot only.

In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
mode allows for a quite efficient get_online_mems/put_online_mems
implementation, so code accessing memory can protect from that memory
vanishing.
- Concurrent activity that operates on the same physical memory area, such as
  allocating gigantic pages, can result in temporary offlining failures.

- Out of memory when dissolving huge pages, especially when freeing unused
  vmemmap pages associated with each hugetlb page is enabled.

Future Work
===========
Offlining code may be able to migrate huge page contents, but may not be able
to dissolve the source huge page because it fails allocating (unmovable) pages
for the vmemmap, because the system might not have free memory in the kernel
zones left.

- allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
  sysctl or new control file.
- showing memory block and physical device relationship.
- test and make it better memory offlining.
- support HugeTLB page migration and offlining.
- memmap removing at memory offline.
- physical remove memory.
Users that depend on memory offlining to succeed for movable zones should
carefully consider whether the memory savings gained from this feature are
worth the risk of possibly not being able to offline memory in certain
situations.

Further, when running into out of memory situations while migrating pages, or
when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.

When offlining is triggered from user space, the offlining context can be
terminated by sending a fatal signal. A timeout based offlining can easily be
implemented via::

    % timeout $TIMEOUT offline_block | failure_handling
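
where ``offline_block`` and ``failure_handling`` are user-provided helpers, not
shipped commands. A minimal sketch of what such an ``offline_block`` helper
could look like (the block id is hard-coded purely for illustration)::

    #!/bin/sh
    # offline_block: try to offline one memory block and report the result
    echo offline > /sys/devices/system/memory/memory42/state && echo offlined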