3fcebf9020
Currently, the "auto-movable" online policy does not allow for hotplugged KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we can have, primarily, because there is no coordiantion across memory devices and we don't want to create zone-imbalances accidentially when unplugging memory. However, within a single memory device it's different. Let's allow for KERNEL memory within a dynamic memory group to allow for more MOVABLE within the same memory group. The only thing we have to take care of is that the managing driver avoids zone imbalances by unplugging MOVABLE memory first, otherwise there can be corner cases where unplug of memory could result in (accidential) zone imbalances. virtio-mem is the only user of dynamic memory groups and recently added support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we don't need a new toggle to enable it for dynamic memory groups. We limit this handling to dynamic memory groups, because: * We want to keep the runtime overhead for collecting stats when onlining a single memory block small. We tend to have only a handful of dynamic memory groups, but we can have quite some static memory groups (e.g., 256 DIMMs). * It doesn't make too much sense for static memory groups, as we try onlining all applicable memory blocks either completely to ZONE_MOVABLE or not. In ordinary operation, we won't have a mixture of zones within a static memory group. When adding memory to a dynamic memory group, we'll first online memory to ZONE_MOVABLE as long as early KERNEL memory allows for it. Then, we'll online the next unit(s) to ZONE_NORMAL, until we can online the next unit(s) to ZONE_MOVABLE. For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will result in a layout like: [M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]... ^ movable memory due to early kernel memory ^ allows for more movable memory ... ^-----^ ... here ^ allows for more movable memory ... ^-----^ ... here While the created layout is sub-optimal when it comes to contiguous zones, it gives us the maximum flexibility when dynamically growing/shrinking a device; we can grow small VMs really big in small steps, and still shrink reliably to e.g., 1/4 of the maximum VM size in this example, removing full memory blocks along with meta data more reliably. Mark dynamic memory groups in the xarray such that we can efficiently iterate over them when collecting stats. In usual setups, we have one virtio-mem device per NUMA node, and usually only a small number of NUMA nodes. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/20210806124715.17090-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
---|---|---|
.. | ||
accessibility | ||
acpi | ||
amba | ||
android | ||
ata | ||
atm | ||
auxdisplay | ||
base | ||
bcma | ||
block | ||
bluetooth | ||
bus | ||
cdrom | ||
char | ||
clk | ||
clocksource | ||
comedi | ||
connector | ||
counter | ||
cpufreq | ||
cpuidle | ||
crypto | ||
cxl | ||
dax | ||
dca | ||
devfreq | ||
dio | ||
dma | ||
dma-buf | ||
edac | ||
eisa | ||
extcon | ||
firewire | ||
firmware | ||
fpga | ||
fsi | ||
gnss | ||
gpio | ||
gpu | ||
greybus | ||
hid | ||
hsi | ||
hv | ||
hwmon | ||
hwspinlock | ||
hwtracing | ||
i2c | ||
i3c | ||
idle | ||
iio | ||
infiniband | ||
input | ||
interconnect | ||
iommu | ||
ipack | ||
irqchip | ||
isdn | ||
leds | ||
lightnvm | ||
macintosh | ||
mailbox | ||
mcb | ||
md | ||
media | ||
memory | ||
memstick | ||
message | ||
mfd | ||
misc | ||
mmc | ||
most | ||
mtd | ||
mux | ||
net | ||
nfc | ||
ntb | ||
nubus | ||
nvdimm | ||
nvme | ||
nvmem | ||
of | ||
opp | ||
parisc | ||
parport | ||
pci | ||
pcmcia | ||
perf | ||
phy | ||
pinctrl | ||
platform | ||
pnp | ||
power | ||
powercap | ||
pps | ||
ps3 | ||
ptp | ||
pwm | ||
rapidio | ||
ras | ||
regulator | ||
remoteproc | ||
reset | ||
rpmsg | ||
rtc | ||
s390 | ||
sbus | ||
scsi | ||
sh | ||
siox | ||
slimbus | ||
soc | ||
soundwire | ||
spi | ||
spmi | ||
ssb | ||
staging | ||
target | ||
tc | ||
tee | ||
thermal | ||
thunderbolt | ||
tty | ||
uio | ||
usb | ||
vdpa | ||
vfio | ||
vhost | ||
video | ||
virt | ||
virtio | ||
visorbus | ||
vlynq | ||
vme | ||
w1 | ||
watchdog | ||
xen | ||
zorro | ||
Kconfig | ||
Makefile |