.. SPDX-License-Identifier: GPL-2.0

=====
Tmpfs
=====

Tmpfs is a file system which keeps all of its files in virtual memory.

Everything in tmpfs is temporary in the sense that no files will be
created on your hard drive. If you unmount a tmpfs instance,
everything stored therein is lost.

tmpfs puts everything into the kernel internal caches and grows and
shrinks to accommodate the files it contains and is able to swap
unneeded pages out to swap space, if swap was enabled for the tmpfs
mount. tmpfs also supports THP.

tmpfs extends ramfs with a few userspace configurable options listed and
explained further below, some of which can be reconfigured dynamically on the
fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
filesystem can be resized but it cannot be resized to a size below its current
usage. tmpfs also supports POSIX ACLs, and extended attributes for the
trusted.* and security.* namespaces. ramfs does not use swap and you cannot
modify any parameter for a ramfs filesystem. The size limit of a ramfs
filesystem is how much memory you have available, so care must be taken
not to run out of memory when using it.

An alternative to tmpfs and ramfs is to use brd to create RAM disks
(/dev/ram*), which allows you to simulate a block device disk in physical RAM.
To write data you would then just need to create a regular filesystem on top
of this ramdisk. As with ramfs, brd ramdisks cannot swap. brd ramdisks are also
configured in size at initialization and you cannot dynamically resize them.
Contrary to brd ramdisks, tmpfs has its own filesystem and does not rely on the
block layer at all.

Since tmpfs lives completely in the page cache and optionally on swap,
all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
free(1). Notice that these counters also include shared memory
(shmem, see ipcs(1)). The most reliable way to get the count is
using df(1) and du(1).

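As a rough illustration of reading the "Shmem" counter mentioned above, the
following Python sketch parses /proc/meminfo-style text. The helper name
``parse_meminfo`` and the sample values are made up for this example; on a
live system you would pass in the contents of /proc/meminfo instead.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Values are returned as printed (kB for fields that carry a unit).
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


# Made-up sample; on a real system use open("/proc/meminfo").read().
SAMPLE = """\
MemTotal:       16303428 kB
MemFree:         8123456 kB
Shmem:            204800 kB
"""

info = parse_meminfo(SAMPLE)
shmem_kib = info["Shmem"]
print(f"Shmem: {shmem_kib} kB ({shmem_kib // 1024} MiB)")
```

Keep in mind that this counter also includes other shared memory, so it is
only an upper bound for what tmpfs mounts use.
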
tmpfs has the following uses:

1) There is always a kernel internal mount which you will not see at
   all. This is used for shared anonymous mappings and SYSV shared
   memory.

   This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
   set, the user visible part of tmpfs is not built. But the internal
   mechanisms are always present.

2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
   POSIX shared memory (shm_open, shm_unlink). Adding the following
   line to /etc/fstab should take care of this::

      tmpfs   /dev/shm        tmpfs   defaults        0 0

   Remember to create the directory that you intend to mount tmpfs on
   if necessary.

   This mount is _not_ needed for SYSV shared memory. The internal
   mount is used for that. (In the 2.3 kernel versions it was
   necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
   shared memory.)

3) Some people (including me) find it very convenient to mount it
   e.g. on /tmp and /var/tmp and have a big swap partition. And now
   loop mounts of tmpfs files do work, so mkinitrd shipped by most
   distributions should succeed with a tmpfs /tmp.

4) And probably a lot more I do not know about :-)


tmpfs has the following mount options for sizing and for controlling swap:

=========  ============================================================
size       The limit of allocated bytes for this tmpfs instance. The
           default is half of your physical RAM without swap. If you
           oversize your tmpfs instances the machine will deadlock
           since the OOM handler will not be able to free that memory.
nr_blocks  The same as size, but in blocks of PAGE_SIZE.
nr_inodes  The maximum number of inodes for this instance. The default
           is half of the number of your physical RAM pages, or (on a
           machine with highmem) the number of lowmem RAM pages,
           whichever is the lower.
noswap     Disables swap. Remounts must respect the original settings.
           By default swap is enabled.
=========  ============================================================

These parameters accept a suffix k, m or g for kilo, mega and giga and
can be changed on remount. The size parameter also accepts a suffix %
to limit this tmpfs instance to that percentage of your physical RAM:
the default, when neither size nor nr_blocks is specified, is size=50%.

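To make the suffix handling concrete, here is a minimal Python sketch of how
such a value could be turned into a byte count. It is not the kernel's
parser: the function name ``parse_size`` is hypothetical, it treats the
suffixes as powers of 1024 (consistent with nr_inodes=10k meaning 10240
further below), and it ignores any rounding to whole pages.

```python
# Multipliers assumed for the k, m and g suffixes (powers of 1024).
SUFFIXES = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(value, ram_bytes=None):
    """Return a byte count for a tmpfs-style size value.

    ram_bytes is only needed for percentage values such as "50%".
    """
    value = value.strip()
    if value.endswith("%"):
        if ram_bytes is None:
            raise ValueError("a percentage needs the amount of physical RAM")
        return ram_bytes * int(value[:-1]) // 100
    suffix = value[-1:].lower()
    if suffix in SUFFIXES:
        return int(value[:-1]) * SUFFIXES[suffix]
    return int(value)  # plain number without suffix


print(parse_size("10k"))               # a count with the k suffix
print(parse_size("50%", 8 * 1024**3))  # half of an assumed 8 GiB of RAM
```
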
If nr_blocks=0 (or size=0), blocks will not be limited in that instance;
if nr_inodes=0, inodes will not be limited. It is generally unwise to
mount with such options, since it allows any user with write access to
use up all the memory on the machine; but it enhances the scalability of
that instance in a system with many CPUs making intensive use of it.

tmpfs also supports Transparent Huge Pages, which requires a kernel
configured with CONFIG_TRANSPARENT_HUGEPAGE and with huge page support for
your system (has_transparent_hugepage(), which is architecture specific).
The mount options for this are:

======  ============================================================
huge=0  never: disables huge pages for the mount
huge=1  always: enables huge pages for the mount
huge=2  within_size: only allocate huge pages if the page will be
        fully within i_size, also respect fadvise()/madvise() hints.
huge=3  advise: only allocate huge pages if requested with
        fadvise()/madvise()
======  ============================================================

There is a sysfs file which you can also use to control system wide THP
configuration for all tmpfs mounts. The file is:

/sys/kernel/mm/transparent_hugepage/shmem_enabled

This sysfs file is placed on top of the THP sysfs directory and so is registered
by THP code. It is however only used to control all tmpfs mounts with one
single knob. Since it controls all tmpfs mounts it should only be used either
for emergency or testing purposes. The values you can set for shmem_enabled are:

==  ============================================================
-1  deny: disables huge on shm_mnt and all mounts, for
    emergency use
-2  force: enables huge on shm_mnt and all mounts, w/o needing
    option, for testing
==  ============================================================

tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be
adjusted on the fly via 'mount -o remount ...'

======================== ==============================================
mpol=default             use the process allocation policy
                         (see set_mempolicy(2))
mpol=prefer:Node         prefers to allocate memory from the given Node
mpol=bind:NodeList       allocates memory only from nodes in NodeList
mpol=interleave          prefers to allocate from each node in turn
mpol=interleave:NodeList allocates from each node of NodeList in turn
mpol=local               prefers to allocate memory from the local node
======================== ==============================================

NodeList format is a comma-separated list of decimal numbers and ranges,
a range being two hyphen-separated decimal numbers, the smallest and
largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15

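The NodeList format described above can be expanded with a few lines of
Python. This is only an illustrative sketch (``parse_nodelist`` is a
hypothetical helper, not a kernel or library function), and it does not
check whether the nodes are actually online.

```python
def parse_nodelist(nodelist):
    """Expand a NodeList such as "0-3,5,7,9-15" into a sorted list of nodes."""
    nodes = set()
    for item in nodelist.split(","):
        first, sep, last = item.partition("-")
        low = int(first)
        high = int(last) if sep else low  # single number: range of one
        if high < low:
            raise ValueError(f"bad range: {item!r}")
        nodes.update(range(low, high + 1))
    return sorted(nodes)


print(parse_nodelist("0-3,5,7,9-15"))
# [0, 1, 2, 3, 5, 7, 9, 10, 11, 12, 13, 14, 15]
```
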
A memory policy with a valid NodeList will be saved, as specified, for
use at file creation time. When a task allocates a file in the file
system, the mount option memory policy will be applied with a NodeList,
if any, modified by the calling task's cpuset constraints
[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags,
listed below. If the resulting NodeList is the empty set, the effective
memory policy for the file will revert to "default" policy.

NUMA memory allocation policies have optional flags that can be used in
conjunction with their modes. These optional flags can be specified
when tmpfs is mounted by appending them to the mode before the NodeList.
See Documentation/admin-guide/mm/numa_memory_policy.rst for a list of
all available memory allocation policy mode flags and their effect on
memory policy.

::

    =static         is equivalent to        MPOL_F_STATIC_NODES
    =relative       is equivalent to        MPOL_F_RELATIVE_NODES

For example, mpol=bind=static:NodeList is the equivalent of an
allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES.

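The layout of an mpol value (mode, optional flag, optional NodeList) can be
shown with a small Python sketch. ``split_mpol`` is a hypothetical helper
written for this example; it only splits the string and maps the two flags
listed above, and it does not validate the mode or the NodeList.

```python
# The two flag spellings listed above and the constants they correspond to.
FLAG_NAMES = {
    "static": "MPOL_F_STATIC_NODES",
    "relative": "MPOL_F_RELATIVE_NODES",
}


def split_mpol(value):
    """Split "mode[=flag][:NodeList]" into (mode, flag constant, NodeList).

    Missing parts are returned as None.
    """
    policy, colon, nodelist = value.partition(":")
    mode, equals, flag = policy.partition("=")
    if equals and flag not in FLAG_NAMES:
        raise ValueError(f"unknown mode flag: {flag!r}")
    return (
        mode,
        FLAG_NAMES[flag] if equals else None,
        nodelist if colon else None,
    )


print(split_mpol("bind=static:0-3"))
# ('bind', 'MPOL_F_STATIC_NODES', '0-3')
```
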
Note that trying to mount a tmpfs with an mpol option will fail if the
running kernel does not support NUMA; and will fail if its nodelist
specifies a node which is not online. If your system relies on that
tmpfs being mounted, but from time to time runs a kernel built without
NUMA capability (perhaps a safe recovery kernel), or with fewer nodes
online, then it is advisable to omit the mpol option from automatic
mount options. It can be added later, when the tmpfs is already mounted
on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'.


To specify the initial root directory you can use the following mount
options:

====  ==================================
mode  The permissions as an octal number
uid   The user id
gid   The group id
====  ==================================

These options do not have any effect on remount. You can change these
parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.


tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
numbers:

=======   ========================
inode64   Use 64-bit inode numbers
inode32   Use 32-bit inode numbers
=======   ========================

On a 32-bit kernel, inode32 is implicit, and inode64 is refused at mount time.
On a 64-bit kernel, CONFIG_TMPFS_INODE64 sets the default. inode64 avoids the
possibility of multiple files with the same inode number on a single device;
but risks glibc failing with EOVERFLOW once 33-bit inode numbers are reached -
if a long-lived tmpfs is accessed by 32-bit applications so ancient that
opening a file larger than 2GiB fails with EINVAL.


So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you a tmpfs instance on /mytmpfs which can allocate 10GB of
RAM/swap in 10240 inodes and which is only accessible by root.


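If mounts like the one above are created from a script, the command line
can be assembled programmatically. The Python sketch below only builds the
argument list; ``tmpfs_mount_command`` is a hypothetical helper, it handles
only key=value options, and actually running the command (for example with
subprocess.run) would need root privileges.

```python
import shlex


def tmpfs_mount_command(target, **options):
    """Build the argv for mounting a tmpfs on target with key=value options."""
    opts = ",".join(f"{key}={value}" for key, value in options.items())
    argv = ["mount", "-t", "tmpfs"]
    if opts:
        argv += ["-o", opts]
    argv += ["tmpfs", target]
    return argv


argv = tmpfs_mount_command("/mytmpfs", size="10G", nr_inodes="10k", mode="700")
print(shlex.join(argv))
# mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs
```
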
:Author:
   Christoph Rohland <cr@sap.com>, 1.12.01
:Updated:
   Hugh Dickins, 4 June 2007
:Updated:
   KOSAKI Motohiro, 16 Mar 2010
:Updated:
   Chris Down, 13 July 2020