f477619990
For hugetlb backed jobs/VMs it's critical to understand the numa information for the memory backing these jobs to deliver optimal performance. Currently this technically can be queried from /proc/self/numa_maps, but there are significant issues with that. Namely: 1. Memory can be mapped or unmapped. 2. numa_maps are per process and need to be aggregated across all processes in the cgroup. For shared memory this is more involved as the userspace needs to make sure it doesn't double count shared mappings. 3. I believe querying numa_maps needs to hold the mmap_lock which adds to the contention on this lock. For these reasons I propose simply adding hugetlb.*.numa_stat file, which shows the numa information of the cgroup similarly to memory.numa_stat. On cgroup-v2: cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat total=2097152 N0=2097152 N1=0 On cgroup-v1: cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat total=2097152 N0=2097152 N1=0 hierarichal_total=2097152 N0=2097152 N1=0 This patch was tested manually by allocating hugetlb memory and querying the hugetlb.*.numa_stat file of the cgroup and its parents. [colin.i.king@googlemail.com: fix spelling mistake "hierarichal" -> "hierarchical"] Link: https://lkml.kernel.org/r/20211125090635.23508-1-colin.i.king@gmail.com [keescook@chromium.org: fix copy/paste array assignment] Link: https://lkml.kernel.org/r/20211203065647.2819707-1-keescook@chromium.org Link: https://lkml.kernel.org/r/20211123001020.4083653-1-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Jue Wang <juew@google.com> Cc: Yang Yao <ygyao@google.com> Cc: Joanna Li <joannali@google.com> Cc: Cannon Matthews <cannonmatthews@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
136 lines
6.0 KiB
ReStructuredText
136 lines
6.0 KiB
ReStructuredText
==================
|
|
HugeTLB Controller
|
|
==================
|
|
|
|
HugeTLB controller can be created by first mounting the cgroup filesystem.
|
|
|
|
# mount -t cgroup -o hugetlb none /sys/fs/cgroup
|
|
|
|
With the above step, the initial or the parent HugeTLB group becomes
|
|
visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
|
|
the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
|
|
|
|
New groups can be created under the parent group /sys/fs/cgroup::
|
|
|
|
# cd /sys/fs/cgroup
|
|
# mkdir g1
|
|
# echo $$ > g1/tasks
|
|
|
|
The above steps create a new group g1 and move the current shell
|
|
process (bash) into it.
|
|
|
|
Brief summary of control files::
|
|
|
|
hugetlb.<hugepagesize>.rsvd.limit_in_bytes # set/show limit of "hugepagesize" hugetlb reservations
|
|
hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes # show max "hugepagesize" hugetlb reservations and no-reserve faults
|
|
hugetlb.<hugepagesize>.rsvd.usage_in_bytes # show current reservations and no-reserve faults for "hugepagesize" hugetlb
|
|
hugetlb.<hugepagesize>.rsvd.failcnt # show the number of allocation failure due to HugeTLB reservation limit
|
|
hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb faults
|
|
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
|
|
hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
|
|
hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB usage limit
|
|
hugetlb.<hugepagesize>.numa_stat # show the numa information of the hugetlb memory charged to this cgroup
|
|
|
|
For a system supporting three hugepage sizes (64k, 32M and 1G), the control
|
|
files include::
|
|
|
|
hugetlb.1GB.limit_in_bytes
|
|
hugetlb.1GB.max_usage_in_bytes
|
|
hugetlb.1GB.numa_stat
|
|
hugetlb.1GB.usage_in_bytes
|
|
hugetlb.1GB.failcnt
|
|
hugetlb.1GB.rsvd.limit_in_bytes
|
|
hugetlb.1GB.rsvd.max_usage_in_bytes
|
|
hugetlb.1GB.rsvd.usage_in_bytes
|
|
hugetlb.1GB.rsvd.failcnt
|
|
hugetlb.64KB.limit_in_bytes
|
|
hugetlb.64KB.max_usage_in_bytes
|
|
hugetlb.64KB.numa_stat
|
|
hugetlb.64KB.usage_in_bytes
|
|
hugetlb.64KB.failcnt
|
|
hugetlb.64KB.rsvd.limit_in_bytes
|
|
hugetlb.64KB.rsvd.max_usage_in_bytes
|
|
hugetlb.64KB.rsvd.usage_in_bytes
|
|
hugetlb.64KB.rsvd.failcnt
|
|
hugetlb.32MB.limit_in_bytes
|
|
hugetlb.32MB.max_usage_in_bytes
|
|
hugetlb.32MB.numa_stat
|
|
hugetlb.32MB.usage_in_bytes
|
|
hugetlb.32MB.failcnt
|
|
hugetlb.32MB.rsvd.limit_in_bytes
|
|
hugetlb.32MB.rsvd.max_usage_in_bytes
|
|
hugetlb.32MB.rsvd.usage_in_bytes
|
|
hugetlb.32MB.rsvd.failcnt
|
|
|
|
|
|
1. Page fault accounting
|
|
|
|
hugetlb.<hugepagesize>.limit_in_bytes
|
|
hugetlb.<hugepagesize>.max_usage_in_bytes
|
|
hugetlb.<hugepagesize>.usage_in_bytes
|
|
hugetlb.<hugepagesize>.failcnt
|
|
|
|
The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per
|
|
control group and enforces the limit during page fault. Since HugeTLB
|
|
doesn't support page reclaim, enforcing the limit at page fault time implies
|
|
that, the application will get SIGBUS signal if it tries to fault in HugeTLB
|
|
pages beyond its limit. Therefore the application needs to know exactly how many
|
|
HugeTLB pages it uses before hand, and the sysadmin needs to make sure that
|
|
there are enough available on the machine for all the users to avoid processes
|
|
getting SIGBUS.
|
|
|
|
|
|
2. Reservation accounting
|
|
|
|
hugetlb.<hugepagesize>.rsvd.limit_in_bytes
|
|
hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
|
|
hugetlb.<hugepagesize>.rsvd.usage_in_bytes
|
|
hugetlb.<hugepagesize>.rsvd.failcnt
|
|
|
|
The HugeTLB controller allows to limit the HugeTLB reservations per control
|
|
group and enforces the controller limit at reservation time and at the fault of
|
|
HugeTLB memory for which no reservation exists. Since reservation limits are
|
|
enforced at reservation time (on mmap or shget), reservation limits never causes
|
|
the application to get SIGBUS signal if the memory was reserved before hand. For
|
|
MAP_NORESERVE allocations, the reservation limit behaves the same as the fault
|
|
limit, enforcing memory usage at fault time and causing the application to
|
|
receive a SIGBUS if it's crossing its limit.
|
|
|
|
Reservation limits are superior to page fault limits described above, since
|
|
reservation limits are enforced at reservation time (on mmap or shget), and
|
|
never causes the application to get SIGBUS signal if the memory was reserved
|
|
before hand. This allows for easier fallback to alternatives such as
|
|
non-HugeTLB memory for example. In the case of page fault accounting, it's very
|
|
hard to avoid processes getting SIGBUS since the sysadmin needs precisely know
|
|
the HugeTLB usage of all the tasks in the system and make sure there is enough
|
|
pages to satisfy all requests. Avoiding tasks getting SIGBUS on overcommited
|
|
systems is practically impossible with page fault accounting.
|
|
|
|
|
|
3. Caveats with shared memory
|
|
|
|
For shared HugeTLB memory, both HugeTLB reservation and page faults are charged
|
|
to the first task that causes the memory to be reserved or faulted, and all
|
|
subsequent uses of this reserved or faulted memory is done without charging.
|
|
|
|
Shared HugeTLB memory is only uncharged when it is unreserved or deallocated.
|
|
This is usually when the HugeTLB file is deleted, and not when the task that
|
|
caused the reservation or fault has exited.
|
|
|
|
|
|
4. Caveats with HugeTLB cgroup offline.
|
|
|
|
When a HugeTLB cgroup goes offline with some reservations or faults still
|
|
charged to it, the behavior is as follows:
|
|
|
|
- The fault charges are charged to the parent HugeTLB cgroup (reparented),
|
|
- the reservation charges remain on the offline HugeTLB cgroup.
|
|
|
|
This means that if a HugeTLB cgroup gets offlined while there is still HugeTLB
|
|
reservations charged to it, that cgroup persists as a zombie until all HugeTLB
|
|
reservations are uncharged. HugeTLB reservations behave in this manner to match
|
|
the memory controller whose cgroups also persist as zombie until all charged
|
|
memory is uncharged. Also, the tracking of HugeTLB reservations is a bit more
|
|
complex compared to the tracking of HugeTLB faults, so it is significantly
|
|
harder to reparent reservations at offline time.
|