linux

iv/linux

Aneesh Kumar K.V 992bf77591 mm/demotion: add support for explicit memory tiers

Patch series "mm/demotion: Memory tiers and demotion", v15.

The current kernel has the basic memory tiering support: Inactive pages on a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA node to make room for new allocations on the higher tier NUMA node. Frequently accessed pages on a lower tier NUMA node can be migrated (promoted) to a higher tier NUMA node to improve the performance.

In the current kernel, memory tiers are defined implicitly via a demotion path relationship between NUMA nodes, which is created during the kernel initialization and updated when a NUMA node is hot-added or hot-removed. The current implementation puts all nodes with CPU into the highest tier, and builds the tier hierarchy tier-by-tier by establishing the per-node demotion targets based on the distances between nodes.

This current memory tier kernel implementation needs to be improved for several important use cases:

* The current tier initialization code always initializes each memory-only NUMA node into a lower tier. But a memory-only NUMA node may have a high performance memory device (e.g. a DRAM-backed memory-only node on a virtual machine) and that should be put into a higher tier.
* The current tier hierarchy always puts CPU nodes into the top tier. But on a system with HBM (e.g. GPU memory) devices, these memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are better to be placed into the next lower tier.
* Also because the current tier hierarchy always puts CPU nodes into the top tier, when a CPU is hot-added (or hot-removed) and triggers a memory node from CPU-less into a CPU node (or vice versa), the memory tier hierarchy gets changed, even though no memory node is added or removed. This can make the tier hierarchy unstable and make it difficult to support tier-based memory accounting.
* A higher tier node can only be demoted to nodes with shortest distance on the next lower tier as defined by the demotion path, not any other node from any lower tier. This strict demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space), and has resulted in the feature request for an interface to override the system-wide, per-node demotion order from the userspace. This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that.

This patch series makes the creation of memory tiers explicit under the control of device drivers.

Memory Tier Initialization
==========================

The Linux kernel presents memory devices as NUMA nodes and each memory device is of a specific type. The memory type of a device is represented by its abstract distance. A memory tier corresponds to a range of abstract distance. This allows for classifying memory devices with a specific performance range into a memory tier.

By default, all memory nodes are assigned to the default tier with abstract distance 512.

A device driver can move its memory nodes from the default tier. For example, PMEM can move its memory nodes below the default tier, whereas GPU can move its memory nodes above the default tier.

The kernel initialization code makes the decision on which exact tier a memory node should be assigned to based on the requests from the device drivers as well as the memory device hardware information provided by the firmware.

Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
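
The fallback behaviour asked for above, where any node in any lower tier is an acceptable target, can be sketched in C. This is only an illustration under stated assumptions, not code from the patch series: `struct node_info`, its `tier` field, and `lower_tier_mask()` are hypothetical names, tiers are assumed to be numbered so that 0 is the fastest, and node ids are assumed to be smaller than 32 so that they fit in one bitmask.

```c
#include <stddef.h>

/* Hypothetical per-node record, used for this sketch only. */
struct node_info {
	int id;    /* NUMA node id, assumed to be < 32 */
	int tier;  /* assumed numbering: 0 is the fastest tier, larger is slower */
};

/*
 * Return a bitmask of the ids of all nodes that sit in any tier slower
 * than the tier of nodes[src]. Nodes in the same tier are not included.
 */
static unsigned int lower_tier_mask(const struct node_info *nodes,
				    size_t n, size_t src)
{
	unsigned int mask = 0;

	for (size_t i = 0; i < n; i++) {
		if (nodes[i].tier > nodes[src].tier)
			mask |= 1u << nodes[i].id;
	}
	return mask;
}
```

With one node in tier 0, two nodes in tier 1, and one node in tier 2, the tier 0 node gets all three other nodes as candidates, rather than only the nearest node of the next tier.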

This patch (of 10):

In the current kernel, memory tiers are defined implicitly via a demotion path relationship between NUMA nodes, which is created during the kernel initialization and updated when a NUMA node is hot-added or hot-removed. The current implementation puts all nodes with CPU into the highest tier, and builds the tier hierarchy by establishing the per-node demotion targets based on the distances between nodes.

This current memory tier kernel implementation needs to be improved for several important use cases:

The current tier initialization code always initializes each memory-only NUMA node into a lower tier. But a memory-only NUMA node may have a high performance memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier. But on a system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices should be in the top tier, and DRAM nodes with CPUs are better to be placed into the next lower tier.

With the current kernel, a higher tier node can only be demoted to nodes with shortest distance on the next lower tier as defined by the demotion path, not any other node from any lower tier. This strict demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space). This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that.

This patch series addresses the above by defining memory tiers explicitly.

The Linux kernel presents memory devices as NUMA nodes and each memory device is of a specific type. The memory type of a device is represented by its abstract distance.
A memory tier corresponds to a range of abstract distance. This allows for classifying memory devices with a specific performance range into a memory tier.

This patch configures the range/chunk size to be 128. The default DRAM abstract distance is 512. We can have 4 memory tiers below the default DRAM with abstract distance range 0 - 127, 128 - 255, 256 - 383, 384 - 511. Faster memory devices can be placed in these faster (higher) memory tiers. Slower memory devices like persistent memory will have abstract distance higher than the default DRAM level.

[akpm@linux-foundation.org: fix comment, per Aneesh]
Link: https://lkml.kernel.org/r/20220818131042.113280-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-09-26 19:46:11 -07:00
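
The range arithmetic in the commit message above can be checked with a small C sketch. Only the chunk size of 128 and the default DRAM abstract distance of 512 come from the commit message; the identifiers `CHUNK_SIZE`, `ADISTANCE_DRAM`, `tier_start()`, `tier_end()`, and `faster_than_default_dram()` are hypothetical and are not claimed to match the names used in the patch. Abstract distances are assumed to be non-negative.

```c
/* Values from the commit message; the identifiers are hypothetical. */
#define CHUNK_SIZE      128  /* width of one tier's abstract-distance range */
#define ADISTANCE_DRAM  512  /* default abstract distance of DRAM */

/* First abstract distance of the tier range containing adistance (>= 0). */
static int tier_start(int adistance)
{
	return adistance - (adistance % CHUNK_SIZE);
}

/* Last abstract distance of the tier range containing adistance (>= 0). */
static int tier_end(int adistance)
{
	return tier_start(adistance) + CHUNK_SIZE - 1;
}

/* Non-zero if adistance falls in a tier range below the default DRAM range. */
static int faster_than_default_dram(int adistance)
{
	return tier_start(adistance) < tier_start(ADISTANCE_DRAM);
}
```

For an abstract distance of 300, `tier_start()` gives 256 and `tier_end()` gives 383, which is the third of the four ranges listed for tiers faster than the default DRAM tier.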
arch	mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG	2022-09-26 19:46:08 -07:00
block	block-6.0-2022-08-26	2022-08-26 11:05:54 -07:00
certs	Kbuild updates for v5.20	2022-08-10 10:40:41 -07:00
crypto	crypto: blake2b: effectively disable frame size warning	2022-08-10 17:59:11 -07:00
Documentation	mm: multi-gen LRU: design doc	2022-09-26 19:46:11 -07:00
drivers	mm: kill is_memblock_offlined()	2022-09-11 20:26:04 -07:00
fs	mm: multi-gen LRU: support page table walks	2022-09-26 19:46:09 -07:00
include	mm/demotion: add support for explicit memory tiers	2022-09-26 19:46:11 -07:00
init	page_ext: introduce boot parameter 'early_page_ext'	2022-09-11 20:26:02 -07:00
io_uring	io_uring/net: save address for sendzc async execution	2022-08-25 07:52:30 -06:00
ipc	Updates to various subsystems which I help look after. lib, ocfs2,	2022-08-07 10:03:24 -07:00
kernel	mm: multi-gen LRU: kill switch	2022-09-26 19:46:10 -07:00
lib	bitmap fixes for v6.0-rc3	2022-08-28 14:36:27 -07:00
LICENSES	LICENSES/LGPL-2.1: Add LGPL-2.1-or-later as valid identifiers	2021-12-16 14:33:10 +01:00
mm	mm/demotion: add support for explicit memory tiers	2022-09-26 19:46:11 -07:00
net	Including fixes from ipsec and netfilter (with one broken Fixes tag).	2022-08-25 14:03:58 -07:00
samples	Tracing updates for 5.20 / 6.0	2022-08-05 09:41:12 -07:00
scripts	asm goto: eradicate CC_HAS_ASM_GOTO	2022-08-21 10:06:28 -07:00
security	hardening fixes for v6.0-rc2	2022-08-19 13:56:14 -07:00
sound	sound fixes for 6.0-rc2	2022-08-19 09:46:11 -07:00
tools	Merge branch 'mm-hotfixes-stable' into mm-stable	2022-09-26 13:13:15 -07:00
usr	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
virt	KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device()	2022-08-19 04:05:43 -04:00
.clang-format	PCI/DOE: Add DOE mailbox support functions	2022-07-19 15:38:04 -07:00
.cocciconfig
.get_maintainer.ignore	get_maintainer: add Alan to .get_maintainer.ignore	2022-08-20 15:17:44 -07:00
.gitattributes	.gitattributes: use 'dts' diff driver for dts files	2019-12-04 19:44:11 -08:00
.gitignore	kbuild: split the second line of .mod into .usyms	2022-05-08 03:16:59 +09:00
.mailmap	.mailmap: update Luca Ceresoli's e-mail address	2022-08-28 14:02:46 -07:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	drm for 5.20/6.0	2022-08-03 19:52:08 -07:00
Kbuild	kbuild: rename hostprogs-y/always to hostprogs/always-y	2020-02-04 01:53:07 +09:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	bitmap fixes for v6.0-rc3	2022-08-28 14:36:27 -07:00
Makefile	Linux 6.0-rc3	2022-08-28 15:05:29 -07:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.