IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Objcg vectors attached to slab pages to store slab object ownership
information are allocated using gfp flags for the original slab
allocation. Depending on slab page order and the size of slab objects,
objcg vector can take several pages.
If the original allocation was done with the __GFP_NOFAIL flag, it
triggered a warning in the page allocation code. Indeed, order > 1 pages
should not been allocated with the __GFP_NOFAIL flag.
Fix this by simply dropping the __GFP_NOFAIL flag when allocating the
objcg vector. It effectively allows to skip the accounting of a single
slab object under a heavy memory pressure.
An alternative would be to implement the mechanism to fallback to order-0
allocations for accounting metadata, which is also not perfect because it
will increase performance penalty and memory footprint of the kernel
memory accounting under memory pressure.
Link: https://lkml.kernel.org/r/ZUp8ZFGxwmCx4ZFr@P9FQF9L96D.corp.robot.car
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reported-by: Christoph Lameter <cl@linux.com>
Closes: https://lkml.kernel.org/r/6b42243e-f197-600a-5d22-56bd728a5ad8@gentwo.org
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
To charge a freshly allocated kernel object to a memory cgroup, the kernel
needs to obtain an objcg pointer. Currently it does it indirectly by
obtaining the memcg pointer first and then calling to
__get_obj_cgroup_from_memcg().
Usually tasks spend their entire life belonging to the same object cgroup.
So it makes sense to save the objcg pointer on task_struct directly, so
it can be obtained faster. It requires some work on fork, exit and cgroup
migrate paths, but these paths are way colder.
To avoid any costly synchronization the following rules are applied:
1) A task sets it's objcg pointer itself.
2) If a task is being migrated to another cgroup, the least
significant bit of the objcg pointer is set atomically.
3) On the allocation path the objcg pointer is obtained locklessly
using the READ_ONCE() macro and the least significant bit is
checked. If it's set, the following procedure is used to update
it locklessly:
- task->objcg is zeroed using cmpxcg
- new objcg pointer is obtained
- task->objcg is updated using try_cmpxchg
- operation is repeated if try_cmpxcg fails
It guarantees that no updates will be lost if task migration
is racing against objcg pointer update. It also allows to keep
both read and write paths fully lockless.
Because the task is keeping a reference to the objcg, it can't go away
while the task is alive.
This commit doesn't change the way the remote memcg charging works.
Link: https://lkml.kernel.org/r/20231019225346.1822282-3-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: improve performance of accounted kernel memory
allocations", v5.
This patchset improves the performance of accounted kernel memory
allocations by ~30% as measured by a micro-benchmark [1]. The benchmark
is very straightforward: 1M of 64 bytes-large kmalloc() allocations.
Below are results with the disabled kernel memory accounting, the original state
and with this patchset applied.
| | Kmem disabled | Original | Patched | Delta |
|-------------+---------------+----------+---------+--------|
| User cgroup | 29764 | 84548 | 59078 | -30.0% |
| Root cgroup | 29742 | 48342 | 31501 | -34.8% |
As we can see, the patchset removes the majority of the overhead when
there is no actual accounting (a task belongs to the root memory cgroup)
and almost halves the accounting overhead otherwise.
The main idea is to get rid of unnecessary memcg to objcg conversions and
switch to a scope-based protection of objcgs, which eliminates extra
operations with objcg reference counters under a rcu read lock. More
details are provided in individual commit descriptions.
This patch (of 5):
Manually inline memcg_kmem_bypass() and active_memcg() to speed up
get_obj_cgroup_from_current() by avoiding duplicate in_task() checks and
active_memcg() readings.
Also add a likely() macro to __get_obj_cgroup_from_memcg():
obj_cgroup_tryget() should succeed at almost all times except a very
unlikely race with the memcg deletion path.
Link: https://lkml.kernel.org/r/20231019225346.1822282-1-roman.gushchin@linux.dev
Link: https://lkml.kernel.org/r/20231019225346.1822282-2-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when it is used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "hugetlb memcg accounting", v4.
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etcetera fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch series rectifies this issue by charging the memcg when the
hugetlb folio is allocated, and uncharging when the folio is freed. In
addition, a new selftest is added to demonstrate and verify this new
behavior.
This patch (of 4):
This patch exposes charge committing and cancelling as parts of the memory
controller interface. These functionalities are useful when the
try_charge() and commit_charge() stages have to be separated by other
actions in between (which can fail). One such example is the new hugetlb
accounting behavior in the following patch.
The patch also adds a helper function to obtain a reference to the
current task's memcg.
Link: https://lkml.kernel.org/r/20231006184629.155543-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231006184629.155543-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
memcg_rstat_updated() uses the value of the state update to keep track of
the magnitude of pending updates, so that we only do a stats flush when
it's worth the work. Most values passed into memcg_rstat_updated() are in
pages, however, a few of them are actually in bytes or KBs.
To put this into perspective, a 512 byte slab allocation today would look
the same as allocating 512 pages. This may result in premature flushes,
which means unnecessary work and latency.
Normalize all the state values passed into memcg_rstat_updated() to pages.
Round up non-zero sub-page to 1 page, because memcg_rstat_updated()
ignores 0 page updates.
Link: https://lkml.kernel.org/r/20230922175741.635002-3-yosryahmed@google.com
Fixes: 5b3be698a8 ("memcg: better bounds on the memcg stats updates")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: memcg: fix tracking of pending stats updates values", v2.
While working on adjacent code [1], I realized that the values passed into
memcg_rstat_updated() to keep track of the magnitude of pending updates is
consistent. It is mostly in pages, but sometimes it can be in bytes or
KBs. Fix that.
Patch 1 reworks memcg_page_state_unit() so that we can reuse it in patch 2
to check and normalize the units of state updates.
[1]https://lore.kernel.org/lkml/20230921081057.3440885-1-yosryahmed@google.com/
This patch (of 2):
memcg_page_state_unit() is currently used to identify the unit of a memcg
state item so that all stats in memory.stat are in bytes. However, it
lies about the units of WORKINGSET_* stats. These stats actually
represent pages, but we present them to userspace as a scalar number of
events. In retrospect, maybe those stats should have been memcg "events"
rather than memcg "state".
In preparation for using memcg_page_state_unit() for other purposes that
need to know the truthful units of different stat items, break it down
into two helpers:
- memcg_page_state_unit() retuns the actual unit of the item.
- memcg_page_state_output_unit() returns the unit used for output.
Use the latter instead of the former in memcg_page_state_output() and
lruvec_page_state_output(). While we are at it, let's show cgroup v1 some
love and add memcg_page_state_local_output() for consistency.
No functional change intended.
Link: https://lkml.kernel.org/r/20230922175741.635002-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230922175741.635002-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commits 86327e8eb9 ("memcg: drop kmem.limit_in_bytes") and
partially reverts 58056f7750 ("memcg, kmem: further deprecate
kmem.limit_in_bytes") which have incrementally removed support for the
kernel memory accounting hard limit. Unfortunately it has turned out that
there is still userspace depending on the existence of
memory.kmem.limit_in_bytes [1]. The underlying functionality is not
really required but the non-existent file just confuses the userspace
which fails in the result. The patch to fix this on the userspace side
has been submitted but it is hard to predict how it will propagate through
the maze of 3rd party consumers of the software.
Now, reverting alone 86327e8eb9 is not an option because there is
another set of userspace which cannot cope with ENOTSUPP returned when
writing to the file. Therefore we have to go and revisit 58056f7750 as
well. There are two ways to go ahead. Either we give up on the
deprecation and fully revert 58056f7750 as well or we can keep
kmem.limit_in_bytes but make the write a noop and warn about the fact.
This should work for both known breaking workloads which depend on the
existence but do not depend on the hard limit enforcement.
Note to backporters to stable trees. a8c49af3be ("memcg: add per-memcg
total kernel memory stat") introduced in 4.18 has added memcg_account_kmem
so the accounting is not done by obj_cgroup_charge_pages directly for v1
anymore. Prior kernels need to add it explicitly (thanks to Johannes for
pointing this out).
[akpm@linux-foundation.org: fix build - remove unused local]
Link: http://lkml.kernel.org/r/20230920081101.GA12096@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net [1]
Link: https://lkml.kernel.org/r/ZRE5VJozPZt9bRPy@dhcp22.suse.cz
Fixes: 86327e8eb9 ("memcg: drop kmem.limit_in_bytes")
Fixes: 58056f7750 ("memcg, kmem: further deprecate kmem.limit_in_bytes")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, memcg uses rstat to maintain aggregated hierarchical stats.
Counters are maintained for hierarchical stats at each memcg. Rstat
tracks which cgroups have updates on which cpus to keep those counters
fresh on the read-side.
Non-hierarchical stats are currently not covered by rstat. Their per-cpu
counters are summed up on every read, which is expensive. The original
implementation did the same. At some point before rstat, non-hierarchical
aggregated counters were introduced by commit a983b5ebee ("mm:
memcontrol: fix excessive complexity in memory.stat reporting"). However,
those counters were updated on the performance critical write-side, which
caused regressions, so they were later removed by commit 815744d751
("mm: memcontrol: don't batch updates of local VM stats and events"). See
[1] for more detailed history.
Kernel versions in between a983b5ebee & 815744d751 (a year and a half)
enjoyed cheap reads of non-hierarchical stats, specifically on cgroup v1.
When moving to more recent kernels, a performance regression for reading
non-hierarchical stats is observed.
Now that we have rstat, we know exactly which percpu counters have updates
for each stat. We can maintain non-hierarchical counters again, making
reads much more efficient, without affecting the performance critical
write-side. Hence, add non-hierarchical (i.e local) counters for the
stats, and extend rstat flushing to keep those up-to-date.
A caveat is that we now need a stats flush before reading
local/non-hierarchical stats through {memcg/lruvec}_page_state_local() or
memcg_events_local(), where we previously only needed a flush to read
hierarchical stats. Most contexts reading non-hierarchical stats are
already doing a flush, add a flush to the only missing context in
count_shadow_nodes().
With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
machine with 256 cpus on cgroup v1:
# for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
# time cat /sys/fs/cgroup/memory/cg*/memory.stat > /dev/null
real 0m0.125s
user 0m0.005s
sys 0m0.120s
After:
real 0m0.032s
user 0m0.005s
sys 0m0.027s
To make sure there are no regressions on cgroup v2, I ran an artificial
reclaim/refault stress test [2] that creates (NR_CPUS * 2) cgroups,
assigns them limits, runs a worker process in each cgroup that allocates
tmpfs memory equal to quadruple the limit (to invoke reclaim
continuously), and then reads back the entire file (to invoke refaults).
All workers are run in parallel, and zram is used as a swapping backend.
Both reclaim and refault have conditional stats flushing. I ran this on a
machine with 112 cpus, once on mm-unstable, and once on mm-unstable with
this patch reverted.
(1) A few runs without this patch:
# time ./stress_reclaim_refault.sh
real 0m9.949s
user 0m0.496s
sys 14m44.974s
# time ./stress_reclaim_refault.sh
real 0m10.049s
user 0m0.486s
sys 14m55.791s
# time ./stress_reclaim_refault.sh
real 0m9.984s
user 0m0.481s
sys 14m53.841s
(2) A few runs with this patch:
# time ./stress_reclaim_refault.sh
real 0m9.885s
user 0m0.486s
sys 14m48.753s
# time ./stress_reclaim_refault.sh
real 0m9.903s
user 0m0.495s
sys 14m48.339s
# time ./stress_reclaim_refault.sh
real 0m9.861s
user 0m0.507s
sys 14m49.317s
No regressions are observed with this patch. There is actually a very
slight improvement. If I have to guess, maybe it's because we avoid
the percpu loop in count_shadow_nodes() when calling
lruvec_page_state_local(), but I could not prove this using perf, it's
probably in the noise.
[1] https://lore.kernel.org/lkml/20230725201811.GA1231514@cmpxchg.org/
[2] https://lore.kernel.org/lkml/CAJD7tkb17x=qwoO37uxyYXLEUVp15BQKR+Xfh7Sg9Hx-wTQ_=w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20230803185046.1385770-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230726153223.821757-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Before commit f53af4285d ("mm: vmscan: fix extreme overreclaim and swap
floods"), proactive reclaim will extreme overreclaim sometimes. But
proactive reclaim still inaccurate and some extent overreclaim.
Problematic case is easy to construct. Allocate lots of anonymous memory
(e.g., 20G) in a memcg, then swapping by writing memory.recalim and there
is a certain probability of overreclaim. For example, request 1G by
writing memory.reclaim will eventually reclaim 1.7G or other values more
than 1G.
The reason is that reclaimer may have already reclaimed part of requested
memory in one loop, but before adjust sc->nr_to_reclaim in outer loop,
call shrink_lruvec() again will still follow the current sc->nr_to_reclaim
to work. It will eventually lead to overreclaim. In theory, the amount
of reclaimed would be in [request, 2 * request).
Reclaimer usually tends to reclaim more than request. But either direct
or kswapd reclaim have much smaller nr_to_reclaim targets, so it is less
noticeable and not have much impact.
Proactive reclaim can usually come in with a larger value, so the error is
difficult to ignore. Considering proactive reclaim is usually low
frequency, handle the batching into smaller chunks is a better approach.
Link: https://lkml.kernel.org/r/20230721014116.3388-1-yangyifei03@kuaishou.com
Signed-off-by: Efly Young <yangyifei03@kuaishou.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
walk_page_range() and friends often operate under write-locked mmap_lock.
With introduction of vma locks, the vmas have to be locked as well during
such walks to prevent concurrent page faults in these areas. Add an
additional member to mm_walk_ops to indicate locking requirements for the
walk.
The change ensures that page walks which prevent concurrent page faults
by write-locking mmap_lock, operate correctly after introduction of
per-vma locks. With per-vma locks page faults can be handled under vma
lock without taking mmap_lock at all, so write locking mmap_lock would
not stop them. The change ensures vmas are properly locked during such
walks.
A sample issue this solves is do_mbind() performing queue_pages_range()
to queue pages for migration. Without this change a concurrent page
can be faulted into the area and be left out of migration.
Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kmem.limit_in_bytes (v1 way to limit kernel memory usage) has been
deprecated since 58056f7750 ("memcg, kmem: further deprecate
kmem.limit_in_bytes") merged in 5.16. We haven't heard about any serious
users since then but it seems that the mere presence of the file is
causing more harm thatn good. We (SUSE) have had several bug reports from
customers where Docker based containers started to fail because a write to
kmem.limit_in_bytes has failed.
This was unexpected because runc code only expects ENOENT (kmem disabled)
or EBUSY (tasks already running within cgroup). So a new error code was
unexpected and the whole container startup failed. This has been later
addressed by
52390d6804
so current Docker runtimes do not suffer from the problem anymore. There
are still older version of Docker in use and likely hard to get rid of
completely.
Address this by wiping out the file completely and effectively get back to
pre 4.5 era and CONFIG_MEMCG_KMEM=n configuration.
I would recommend backporting to stable trees which have picked up
58056f7750 ("memcg, kmem: further deprecate kmem.limit_in_bytes").
[mhocko@suse.com: restore _KMEM switch case]
Link: https://lkml.kernel.org/r/ZKe5wxdbvPi5Cwd7@dhcp22.suse.cz
Link: https://lkml.kernel.org/r/20230704115240.14672-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch is similar to commit 8e20d4b332 ("mm/memcontrol: export
memcg->watermark via sysfs for v2 memcg"), but exports the swap counter's
watermark.
We allocate jobs to our compute farm using heuristics determined by memory
and swap usage from previous jobs. Tracking the peak swap usage for new
jobs is important for determining when jobs are exceeding their expected
bounds, or when our baseline metrics are getting outdated.
Our toolset was written to use the "memory.memsw.max_usage_in_bytes" file
in cgroups v1, and altering it to poll cgroups v2's "memory.swap.current"
would give less accurate results as well as add complication to the code.
Having this watermark exposed in sysfs is much preferred.
Link: https://lkml.kernel.org/r/20230524181734.125696-1-lars@pixar.com
Signed-off-by: Lars R. Damerow <lars@pixar.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Before commit 29ef680ae7 ("memcg, oom: move out_of_memory back to the
charge path"), all memcg oom killers were delayed to page fault path. And
the explicit wakeup is used in this case:
thread A:
...
if (locked) { // complete oom-kill, hold the lock
mem_cgroup_oom_unlock(memcg);
...
}
...
thread B:
...
if (locked && !memcg->oom_kill_disable) {
...
} else {
schedule(); // can't acquire the lock
...
}
...
The reason is that thread A kicks off the OOM-killer, which leads to
wakeups from the uncharges of the exiting task. But thread B is not
guaranteed to see them if it enters the OOM path after the OOM kills but
before thread A releases the lock.
Now only oom_kill_disable case is handled from the #PF path. In that case
it is userspace to trigger the wake up not the #PF path itself. All
potential paths to free some charges are responsible to call
memcg_oom_recover() , so the explicit wakeup is not needed in the
mem_cgroup_oom_synchronize() path which doesn't release any memory itself.
Link: https://lkml.kernel.org/r/20230419030739.115845-2-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "memcg: OOM log improvements", v2.
This short patch series brings back some cgroup v1 stats in OOM logs
that were unnecessarily changed before. It also makes memcg OOM logs
less reliant on printk() internals.
This patch (of 2):
Commit c8713d0b23 ("mm: memcontrol: dump memory.stat during cgroup OOM")
made sure we dump all the stats in memory.stat during a cgroup OOM, but it
also introduced a slight behavioral change. The code used to print the
non-hierarchical v1 cgroup stats for the entire cgroup subtree, now it
only prints the v2 cgroup stats for the cgroup under OOM.
For cgroup v1 users, this introduces a few problems:
(a) The non-hierarchical stats of the memcg under OOM are no longer
shown.
(b) A couple of v1-only stats (e.g. pgpgin, pgpgout) are no longer
shown.
(c) We show the list of cgroup v2 stats, even in cgroup v1. This list
of stats is not tracked with v1 in mind. While most of the stats seem
to be working on v1, there may be some stats that are not fully or
correctly tracked.
Although OOM log is not set in stone, we should not change it for no
reason. When upgrading the kernel version to a version including commit
c8713d0b23 ("mm: memcontrol: dump memory.stat during cgroup OOM"), these
behavioral changes are noticed in cgroup v1.
The fix is simple. Commit c8713d0b23 ("mm: memcontrol: dump memory.stat
during cgroup OOM") separated stats formatting from stats display for v2,
to reuse the stats formatting in the OOM logs. Do the same for v1.
Move the v2 specific formatting from memory_stat_format() to
memcg_stat_format(), add memcg1_stat_format() for v1, and make
memory_stat_format() select between them based on cgroup version. Since
memory_stat_show() now works for both v1 & v2, drop memcg_stat_show().
Link: https://lkml.kernel.org/r/20230428132406.2540811-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230428132406.2540811-3-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, we format all the memcg stats into a buffer in
mem_cgroup_print_oom_meminfo() and use pr_info() to dump it to the logs.
However, this buffer is large in size. Although it is currently working
as intended, ther is a dependency between the memcg stats buffer and the
printk record size limit.
If we add more stats in the future and the buffer becomes larger than the
printk record size limit, or if the prink record size limit is reduced,
the logs may be truncated.
It is safer to use seq_buf_do_printk(), which will automatically break up
the buffer at line breaks and issue small printk() calls.
Refactor the code to move the seq_buf from memory_stat_format() to its
callers, and use seq_buf_do_printk() to print the seq_buf in
mem_cgroup_print_oom_meminfo().
Link: https://lkml.kernel.org/r/20230428132406.2540811-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A memcg pointer in the percpu stock can be accessed by drain_all_stock()
from another cpu in a lockless way. In theory it might lead to an issue,
similar to the one which has been discovered with stock->cached_objcg,
where the pointer was zeroed between the check for being NULL and
dereferencing. In this case the issue is unlikely a real problem, but to
make it bulletproof and similar to stock->cached_objcg, let's annotate all
accesses to stock->cached with READ_ONCE()/WTRITE_ONCE().
Link: https://lkml.kernel.org/r/20230502160839.361544-2-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>