A fair amount of stuff this time around, dominated by yet another massive
set from Mauro toward the completion of the RST conversion. I *really* hope we are getting close to the end of this. Meanwhile, those patches reach pretty far afield to update document references around the tree; there should be no actual code changes there. There will be, alas, more of the usual trivial merge conflicts. Beyond that we have more translations, improvements to the sphinx scripting, a number of additions to the sysctl documentation, and lots of fixes. -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl7VId8PHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5Yq/gH/iaDgirQZV6UZ2v9sfwQNYolNpf2sKAuOZjd bPFB7WJoMQbKwQEvYrAUL2+5zPOcLYuIfzyOfo1BV1py+EyKbACcKjI4AedxfJF7 +NchmOBhlEqmEhzx2U08HRc4/8J223WG17fJRVsV3p+opJySexSFeQucfOciX5NR RUCxweWWyg/FgyqjkyMMTtsePqZPmcT5dWTlVXISlbWzcv5NFhuJXnSrw8Sfzcmm SJMzqItv3O+CabnKQ8kMLV2PozXTMfjeWH47ZUK0Y8/8PP9+cvqwFzZ0UDQJ1Xaz oyW/TqmunaXhfMsMFeFGSwtfgwRHvXdxkQdtwNHvo1dV4dzTvDw= =fDC/ -----END PGP SIGNATURE----- Merge tag 'docs-5.8' of git://git.lwn.net/linux Pull documentation updates from Jonathan Corbet: "A fair amount of stuff this time around, dominated by yet another massive set from Mauro toward the completion of the RST conversion. I *really* hope we are getting close to the end of this. Meanwhile, those patches reach pretty far afield to update document references around the tree; there should be no actual code changes there. There will be, alas, more of the usual trivial merge conflicts. Beyond that we have more translations, improvements to the sphinx scripting, a number of additions to the sysctl documentation, and lots of fixes" * tag 'docs-5.8' of git://git.lwn.net/linux: (130 commits) Documentation: fixes to the maintainer-entry-profile template zswap: docs/vm: Fix typo accept_threshold_percent in zswap.rst tracing: Fix events.rst section numbering docs: acpi: fix old http link and improve document format docs: filesystems: add info about efivars content Documentation: LSM: Correct the basic LSM description mailmap: change email for Ricardo Ribalda docs: sysctl/kernel: document unaligned controls Documentation: admin-guide: update bug-hunting.rst docs: sysctl/kernel: document ngroups_max nvdimm: fixes to maintainter-entry-profile Documentation/features: Correct RISC-V kprobes support entry Documentation/features: Refresh the arch support status files Revert "docs: sysctl/kernel: document ngroups_max" docs: move locking-specific documents to locking/ docs: move digsig docs to the security book docs: move the kref doc into the core-api book docs: add IRQ documentation at the core-api book docs: debugging-via-ohci1394.txt: add it to the core-api book docs: fix references for ipmi.rst file ...
This commit is contained in:
commit
b23c4771ff
.mailmapCREDITS
Documentation
ABI
MakefilePCI
admin-guide
acpi
bug-hunting.rstcpu-load.rsthw-vuln
init.rstkernel-parameters.txtkernel-per-CPU-kthreads.rstmm
nfs
numastat.rstras.rstsysctl
arm64
conf.pycore-api
debugging-via-ohci1394.rstdma-api-howto.rstdma-api.rstdma-attributes.rstdma-isa-lpc.rstindex.rst
irq
kobject.rstkref.rstprintk-basics.rstprintk-formats.rstrbtree.rstdoc-guide
driver-api
features
core/eBPF-JIT
debug
KASAN
gcov-profile-all
kprobes-on-ftrace
kprobes
kretprobes
stackprotector
uprobes
io/dma-contiguous
locking/lockdep
perf
seccomp/seccomp-filter
vm
filesystems
9p.rstautomount-support.rst
caching
cifs
coda.rstcoda.txtconfigfs.rstdax.txtdebugfs.rstdevpts.rstdevpts.txtdnotify.rstefivarfs.rstfiemap.rstfiles.rstfuse-io.rstindex.rstlocks.rstmandatory-locking.rstmount_api.rstorangefs.rstproc.rstquota.rstramfs-rootfs-initramfs.rstseq_file.rstsharedsubtree.rstspufs
sysfs-pci.rst
5
.mailmap
5
.mailmap
@ -152,6 +152,7 @@ Krzysztof Kozlowski <krzk@kernel.org> <k.kozlowski.k@gmail.com>
|
||||
Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
|
||||
Leon Romanovsky <leon@kernel.org> <leon@leon.nu>
|
||||
Leon Romanovsky <leon@kernel.org> <leonro@mellanox.com>
|
||||
Leonardo Bras <leobras.c@gmail.com> <leonardo@linux.ibm.com>
|
||||
Leonid I Ananiev <leonid.i.ananiev@intel.com>
|
||||
Linas Vepstas <linas@austin.ibm.com>
|
||||
Linus Lüssing <linus.luessing@c0d3.blue> <linus.luessing@web.de>
|
||||
@ -234,7 +235,9 @@ Ralf Baechle <ralf@linux-mips.org>
|
||||
Ralf Wildenhues <Ralf.Wildenhues@gmx.de>
|
||||
Randy Dunlap <rdunlap@infradead.org> <rdunlap@xenotime.net>
|
||||
Rémi Denis-Courmont <rdenis@simphalempin.com>
|
||||
Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
|
||||
Ricardo Ribalda <ribalda@kernel.org> <ricardo.ribalda@gmail.com>
|
||||
Ricardo Ribalda <ribalda@kernel.org> <ricardo@ribalda.com>
|
||||
Ricardo Ribalda <ribalda@kernel.org> Ricardo Ribalda Delgado <ribalda@kernel.org>
|
||||
Ross Zwisler <zwisler@kernel.org> <ross.zwisler@linux.intel.com>
|
||||
Rudolf Marek <R.Marek@sh.cvut.cz>
|
||||
Rui Saraiva <rmps@joel.ist.utl.pt>
|
||||
|
6
CREDITS
6
CREDITS
@ -3104,14 +3104,16 @@ W: http://www.qsl.net/dl1bke/
|
||||
D: Generic Z8530 driver, AX.25 DAMA slave implementation
|
||||
D: Several AX.25 hacks
|
||||
|
||||
N: Ricardo Ribalda Delgado
|
||||
E: ricardo.ribalda@gmail.com
|
||||
N: Ricardo Ribalda
|
||||
E: ribalda@kernel.org
|
||||
W: http://ribalda.com
|
||||
D: PLX USB338x driver
|
||||
D: PCA9634 driver
|
||||
D: Option GTM671WFS
|
||||
D: Fintek F81216A
|
||||
D: AD5761 iio driver
|
||||
D: TI DAC7612 driver
|
||||
D: Sony IMX214 driver
|
||||
D: Various kernel hacks
|
||||
S: Qtechnology A/S
|
||||
S: Valby Langgade 142
|
||||
|
@ -54,7 +54,7 @@ Date: October 2002
|
||||
Contact: Linux Memory Management list <linux-mm@kvack.org>
|
||||
Description:
|
||||
Provides information about the node's distribution and memory
|
||||
utilization. Similar to /proc/meminfo, see Documentation/filesystems/proc.txt
|
||||
utilization. Similar to /proc/meminfo, see Documentation/filesystems/proc.rst
|
||||
|
||||
What: /sys/devices/system/node/nodeX/numastat
|
||||
Date: October 2002
|
||||
|
@ -11,7 +11,7 @@ Description:
|
||||
Additionally, the fields Pss_Anon, Pss_File and Pss_Shmem
|
||||
are not present in /proc/pid/smaps. These fields represent
|
||||
the sum of the Pss field of each type (anon, file, shmem).
|
||||
For more details, see Documentation/filesystems/proc.txt
|
||||
For more details, see Documentation/filesystems/proc.rst
|
||||
and the procfs man page.
|
||||
|
||||
Typical output looks like this:
|
||||
|
@ -98,7 +98,11 @@ else # HAVE_PDFLATEX
|
||||
|
||||
pdfdocs: latexdocs
|
||||
@$(srctree)/scripts/sphinx-pre-install --version-check
|
||||
$(foreach var,$(SPHINXDIRS), $(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit;)
|
||||
$(foreach var,$(SPHINXDIRS), \
|
||||
$(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit; \
|
||||
mkdir -p $(BUILDDIR)/$(var)/pdf; \
|
||||
mv $(subst .tex,.pdf,$(wildcard $(BUILDDIR)/$(var)/latex/*.tex)) $(BUILDDIR)/$(var)/pdf/; \
|
||||
)
|
||||
|
||||
endif # HAVE_PDFLATEX
|
||||
|
||||
|
@ -32,12 +32,13 @@ interrupt goes unhandled over time, they are tracked by the Linux kernel as
|
||||
Spurious Interrupts. The IRQ will be disabled by the Linux kernel after it
|
||||
reaches a specific count with the error "nobody cared". This disabled IRQ
|
||||
now prevents valid usage by an existing interrupt which may happen to share
|
||||
the IRQ line.
|
||||
the IRQ line::
|
||||
|
||||
irq 19: nobody cared (try booting with the "irqpoll" option)
|
||||
CPU: 0 PID: 2988 Comm: irq/34-nipalk Tainted: 4.14.87-rt49-02410-g4a640ec-dirty #1
|
||||
Hardware name: National Instruments NI PXIe-8880/NI PXIe-8880, BIOS 2.1.5f1 01/09/2020
|
||||
Call Trace:
|
||||
|
||||
<IRQ>
|
||||
? dump_stack+0x46/0x5e
|
||||
? __report_bad_irq+0x2e/0xb0
|
||||
@ -85,15 +86,18 @@ Mitigations
|
||||
The mitigations take the form of PCI quirks. The preference has been to
|
||||
first identify and make use of a means to disable the routing to the PCH.
|
||||
In such a case a quirk to disable boot interrupt generation can be
|
||||
added.[1]
|
||||
added. [1]_
|
||||
|
||||
Intel® 6300ESB I/O Controller Hub
|
||||
Intel® 6300ESB I/O Controller Hub
|
||||
Alternate Base Address Register:
|
||||
BIE: Boot Interrupt Enable
|
||||
0 = Boot interrupt is enabled.
|
||||
1 = Boot interrupt is disabled.
|
||||
|
||||
Intel® Sandy Bridge through Sky Lake based Xeon servers:
|
||||
== ===========================
|
||||
0 Boot interrupt is enabled.
|
||||
1 Boot interrupt is disabled.
|
||||
== ===========================
|
||||
|
||||
Intel® Sandy Bridge through Sky Lake based Xeon servers:
|
||||
Coherent Interface Protocol Interrupt Control
|
||||
dis_intx_route2pch/dis_intx_route2ich/dis_intx_route2dmi2:
|
||||
When this bit is set. Local INTx messages received from the
|
||||
@ -109,12 +113,12 @@ line by default. Therefore, on chipsets where this INTx routing cannot be
|
||||
disabled, the Linux kernel will reroute the valid interrupt to its legacy
|
||||
interrupt. This redirection of the handler will prevent the occurrence of
|
||||
the spurious interrupt detection which would ordinarily disable the IRQ
|
||||
line due to excessive unhandled counts.[2]
|
||||
line due to excessive unhandled counts. [2]_
|
||||
|
||||
The config option X86_REROUTE_FOR_BROKEN_BOOT_IRQS exists to enable (or
|
||||
disable) the redirection of the interrupt handler to the PCH interrupt
|
||||
line. The option can be overridden by either pci=ioapicreroute or
|
||||
pci=noioapicreroute.[3]
|
||||
pci=noioapicreroute. [3]_
|
||||
|
||||
|
||||
More Documentation
|
||||
@ -127,19 +131,19 @@ into the evolution of its handling with chipsets.
|
||||
Example of disabling of the boot interrupt
|
||||
------------------------------------------
|
||||
|
||||
Intel® 6300ESB I/O Controller Hub (Document # 300641-004US)
|
||||
- Intel® 6300ESB I/O Controller Hub (Document # 300641-004US)
|
||||
5.7.3 Boot Interrupt
|
||||
https://www.intel.com/content/dam/doc/datasheet/6300esb-io-controller-hub-datasheet.pdf
|
||||
|
||||
Intel® Xeon® Processor E5-1600/2400/2600/4600 v3 Product Families
|
||||
Datasheet - Volume 2: Registers (Document # 330784-003)
|
||||
- Intel® Xeon® Processor E5-1600/2400/2600/4600 v3 Product Families
|
||||
Datasheet - Volume 2: Registers (Document # 330784-003)
|
||||
6.6.41 cipintrc Coherent Interface Protocol Interrupt Control
|
||||
https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf
|
||||
|
||||
Example of handler rerouting
|
||||
----------------------------
|
||||
|
||||
Intel® 6700PXH 64-bit PCI Hub (Document # 302628)
|
||||
- Intel® 6700PXH 64-bit PCI Hub (Document # 302628)
|
||||
2.15.2 PCI Express Legacy INTx Support and Boot Interrupt
|
||||
https://www.intel.com/content/dam/doc/datasheet/6700pxh-64-bit-pci-hub-datasheet.pdf
|
||||
|
||||
@ -150,6 +154,6 @@ Cheers,
|
||||
Sean V Kelley
|
||||
sean.v.kelley@linux.intel.com
|
||||
|
||||
[1] https://lore.kernel.org/r/12131949181903-git-send-email-sassmann@suse.de/
|
||||
[2] https://lore.kernel.org/r/12131949182094-git-send-email-sassmann@suse.de/
|
||||
[3] https://lore.kernel.org/r/487C8EA7.6020205@suse.de/
|
||||
.. [1] https://lore.kernel.org/r/12131949181903-git-send-email-sassmann@suse.de/
|
||||
.. [2] https://lore.kernel.org/r/12131949182094-git-send-email-sassmann@suse.de/
|
||||
.. [3] https://lore.kernel.org/r/487C8EA7.6020205@suse.de/
|
||||
|
@ -63,7 +63,7 @@ which can then be compiled to AML binary format::
|
||||
ASL Input: minnomax.asl - 30 lines, 614 bytes, 7 keywords
|
||||
AML Output: minnowmax.aml - 165 bytes, 6 named objects, 1 executable opcodes
|
||||
|
||||
[1] http://wiki.minnowboard.org/MinnowBoard_MAX#Low_Speed_Expansion_Connector_.28Top.29
|
||||
[1] https://www.elinux.org/Minnowboard:MinnowMax#Low_Speed_Expansion_.28Top.29
|
||||
|
||||
The resulting AML code can then be loaded by the kernel using one of the methods
|
||||
below.
|
||||
|
@ -49,15 +49,19 @@ the issue, it may also contain the word **Oops**, as on this one::
|
||||
|
||||
Despite being an **Oops** or some other sort of stack trace, the offended
|
||||
line is usually required to identify and handle the bug. Along this chapter,
|
||||
we'll refer to "Oops" for all kinds of stack traces that need to be analized.
|
||||
we'll refer to "Oops" for all kinds of stack traces that need to be analyzed.
|
||||
|
||||
.. note::
|
||||
If the kernel is compiled with ``CONFIG_DEBUG_INFO``, you can enhance the
|
||||
quality of the stack trace by using file:`scripts/decode_stacktrace.sh`.
|
||||
|
||||
Modules linked in
|
||||
-----------------
|
||||
|
||||
Modules that are tainted or are being loaded or unloaded are marked with
|
||||
"(...)", where the taint flags are described in
|
||||
file:`Documentation/admin-guide/tainted-kernels.rst`, "being loaded" is
|
||||
annotated with "+", and "being unloaded" is annotated with "-".
|
||||
|
||||
``ksymoops`` is useless on 2.6 or upper. Please use the Oops in its original
|
||||
format (from ``dmesg``, etc). Ignore any references in this or other docs to
|
||||
"decoding the Oops" or "running it through ksymoops".
|
||||
If you post an Oops from 2.6+ that has been run through ``ksymoops``,
|
||||
people will just tell you to repost it.
|
||||
|
||||
Where is the Oops message is located?
|
||||
-------------------------------------
|
||||
@ -71,7 +75,7 @@ by running ``journalctl`` command.
|
||||
Sometimes ``klogd`` dies, in which case you can run ``dmesg > file`` to
|
||||
read the data from the kernel buffers and save it. Or you can
|
||||
``cat /proc/kmsg > file``, however you have to break in to stop the transfer,
|
||||
``kmsg`` is a "never ending file".
|
||||
since ``kmsg`` is a "never ending file".
|
||||
|
||||
If the machine has crashed so badly that you cannot enter commands or
|
||||
the disk is not available then you have three options:
|
||||
@ -81,9 +85,9 @@ the disk is not available then you have three options:
|
||||
planned for a crash. Alternatively, you can take a picture of
|
||||
the screen with a digital camera - not nice, but better than
|
||||
nothing. If the messages scroll off the top of the console, you
|
||||
may find that booting with a higher resolution (eg, ``vga=791``)
|
||||
may find that booting with a higher resolution (e.g., ``vga=791``)
|
||||
will allow you to read more of the text. (Caveat: This needs ``vesafb``,
|
||||
so won't help for 'early' oopses)
|
||||
so won't help for 'early' oopses.)
|
||||
|
||||
(2) Boot with a serial console (see
|
||||
:ref:`Documentation/admin-guide/serial-console.rst <serial_console>`),
|
||||
@ -104,7 +108,7 @@ Kernel source file. There are two methods for doing that. Usually, using
|
||||
gdb
|
||||
^^^
|
||||
|
||||
The GNU debug (``gdb``) is the best way to figure out the exact file and line
|
||||
The GNU debugger (``gdb``) is the best way to figure out the exact file and line
|
||||
number of the OOPS from the ``vmlinux`` file.
|
||||
|
||||
The usage of gdb works best on a kernel compiled with ``CONFIG_DEBUG_INFO``.
|
||||
@ -165,7 +169,7 @@ If you have a call trace, such as::
|
||||
[<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
|
||||
...
|
||||
|
||||
this shows the problem likely in the :jbd: module. You can load that module
|
||||
this shows the problem likely is in the :jbd: module. You can load that module
|
||||
in gdb and list the relevant code::
|
||||
|
||||
$ gdb fs/jbd/jbd.ko
|
||||
@ -199,8 +203,9 @@ in the kernel hacking menu of the menu configuration.) For example::
|
||||
You need to be at the top level of the kernel tree for this to pick up
|
||||
your C files.
|
||||
|
||||
If you don't have access to the code you can also debug on some crash dumps
|
||||
e.g. crash dump output as shown by Dave Miller::
|
||||
If you don't have access to the source code you can still debug some crash
|
||||
dumps using the following method (example crash dump output as shown by
|
||||
Dave Miller)::
|
||||
|
||||
EIP is at +0x14/0x4c0
|
||||
...
|
||||
@ -230,6 +235,9 @@ e.g. crash dump output as shown by Dave Miller::
|
||||
mov 0x8(%ebp), %ebx ! %ebx = skb->sk
|
||||
mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt
|
||||
|
||||
file:`scripts/decodecode` can be used to automate most of this, depending
|
||||
on what CPU architecture is being debugged.
|
||||
|
||||
Reporting the bug
|
||||
-----------------
|
||||
|
||||
@ -241,7 +249,7 @@ used for the development of the affected code. This can be done by using
|
||||
the ``get_maintainer.pl`` script.
|
||||
|
||||
For example, if you find a bug at the gspca's sonixj.c file, you can get
|
||||
their maintainers with::
|
||||
its maintainers with::
|
||||
|
||||
$ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c
|
||||
Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%)
|
||||
@ -253,16 +261,17 @@ their maintainers with::
|
||||
|
||||
Please notice that it will point to:
|
||||
|
||||
- The last developers that touched on the source code. On the above example,
|
||||
Tejun and Bhaktipriya (in this specific case, none really envolved on the
|
||||
development of this file);
|
||||
- The last developers that touched the source code (if this is done inside
|
||||
a git tree). On the above example, Tejun and Bhaktipriya (in this
|
||||
specific case, none really envolved on the development of this file);
|
||||
- The driver maintainer (Hans Verkuil);
|
||||
- The subsystem maintainer (Mauro Carvalho Chehab);
|
||||
- The driver and/or subsystem mailing list (linux-media@vger.kernel.org);
|
||||
- the Linux Kernel mailing list (linux-kernel@vger.kernel.org).
|
||||
|
||||
Usually, the fastest way to have your bug fixed is to report it to mailing
|
||||
list used for the development of the code (linux-media ML) copying the driver maintainer (Hans).
|
||||
list used for the development of the code (linux-media ML) copying the
|
||||
driver maintainer (Hans).
|
||||
|
||||
If you are totally stumped as to whom to send the report, and
|
||||
``get_maintainer.pl`` didn't provide you anything useful, send it to
|
||||
@ -303,9 +312,9 @@ protection fault message can be simply cut out of the message files
|
||||
and forwarded to the kernel developers.
|
||||
|
||||
Two types of address resolution are performed by ``klogd``. The first is
|
||||
static translation and the second is dynamic translation. Static
|
||||
translation uses the System.map file in much the same manner that
|
||||
ksymoops does. In order to do static translation the ``klogd`` daemon
|
||||
static translation and the second is dynamic translation.
|
||||
Static translation uses the System.map file.
|
||||
In order to do static translation the ``klogd`` daemon
|
||||
must be able to find a system map file at daemon initialization time.
|
||||
See the klogd man page for information on how ``klogd`` searches for map
|
||||
files.
|
||||
|
@ -105,7 +105,7 @@ References
|
||||
----------
|
||||
|
||||
- http://lkml.org/lkml/2007/2/12/6
|
||||
- Documentation/filesystems/proc.txt (1.8)
|
||||
- Documentation/filesystems/proc.rst (1.8)
|
||||
|
||||
|
||||
Thanks
|
||||
|
@ -268,7 +268,7 @@ Guest mitigation mechanisms
|
||||
/proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
|
||||
available at:
|
||||
|
||||
https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
|
||||
https://www.kernel.org/doc/Documentation/core-api/irq/irq-affinity.rst
|
||||
|
||||
.. _smt_control:
|
||||
|
||||
|
@ -1,52 +1,48 @@
|
||||
Explaining the dreaded "No init found." boot hang message
|
||||
Explaining the "No working init found." boot hang message
|
||||
=========================================================
|
||||
:Authors: Andreas Mohr <andi at lisas period de>
|
||||
Cristian Souza <cristianmsbr at gmail period com>
|
||||
|
||||
OK, so you've got this pretty unintuitive message (currently located
|
||||
in init/main.c) and are wondering what the H*** went wrong.
|
||||
Some high-level reasons for failure (listed roughly in order of execution)
|
||||
to load the init binary are:
|
||||
This document provides some high-level reasons for failure
|
||||
(listed roughly in order of execution) to load the init binary.
|
||||
|
||||
A) Unable to mount root FS
|
||||
B) init binary doesn't exist on rootfs
|
||||
C) broken console device
|
||||
D) binary exists but dependencies not available
|
||||
E) binary cannot be loaded
|
||||
1) **Unable to mount root FS**: Set "debug" kernel parameter (in bootloader
|
||||
config file or CONFIG_CMDLINE) to get more detailed kernel messages.
|
||||
|
||||
Detailed explanations:
|
||||
2) **init binary doesn't exist on rootfs**: Make sure you have the correct
|
||||
root FS type (and ``root=`` kernel parameter points to the correct
|
||||
partition), required drivers such as storage hardware (such as SCSI or
|
||||
USB!) and filesystem (ext3, jffs2, etc.) are builtin (alternatively as
|
||||
modules, to be pre-loaded by an initrd).
|
||||
|
||||
A) Set "debug" kernel parameter (in bootloader config file or CONFIG_CMDLINE)
|
||||
to get more detailed kernel messages.
|
||||
B) make sure you have the correct root FS type
|
||||
(and ``root=`` kernel parameter points to the correct partition),
|
||||
required drivers such as storage hardware (such as SCSI or USB!)
|
||||
and filesystem (ext3, jffs2 etc.) are builtin (alternatively as modules,
|
||||
to be pre-loaded by an initrd)
|
||||
C) Possibly a conflict in ``console= setup`` --> initial console unavailable.
|
||||
E.g. some serial consoles are unreliable due to serial IRQ issues (e.g.
|
||||
missing interrupt-based configuration).
|
||||
3) **Broken console device**: Possibly a conflict in ``console= setup``
|
||||
--> initial console unavailable. E.g. some serial consoles are unreliable
|
||||
due to serial IRQ issues (e.g. missing interrupt-based configuration).
|
||||
Try using a different ``console= device`` or e.g. ``netconsole=``.
|
||||
D) e.g. required library dependencies of the init binary such as
|
||||
``/lib/ld-linux.so.2`` missing or broken. Use
|
||||
``readelf -d <INIT>|grep NEEDED`` to find out which libraries are required.
|
||||
E) make sure the binary's architecture matches your hardware.
|
||||
E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM hardware.
|
||||
In case you tried loading a non-binary file here (shell script?),
|
||||
you should make sure that the script specifies an interpreter in its shebang
|
||||
header line (``#!/...``) that is fully working (including its library
|
||||
dependencies). And before tackling scripts, better first test a simple
|
||||
non-script binary such as ``/bin/sh`` and confirm its successful execution.
|
||||
To find out more, add code ``to init/main.c`` to display kernel_execve()s
|
||||
return values.
|
||||
|
||||
4) **Binary exists but dependencies not available**: E.g. required library
|
||||
dependencies of the init binary such as ``/lib/ld-linux.so.2`` missing or
|
||||
broken. Use ``readelf -d <INIT>|grep NEEDED`` to find out which libraries
|
||||
are required.
|
||||
|
||||
5) **Binary cannot be loaded**: Make sure the binary's architecture matches
|
||||
your hardware. E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM
|
||||
hardware. In case you tried loading a non-binary file here (shell script?),
|
||||
you should make sure that the script specifies an interpreter in its
|
||||
shebang header line (``#!/...``) that is fully working (including its
|
||||
library dependencies). And before tackling scripts, better first test a
|
||||
simple non-script binary such as ``/bin/sh`` and confirm its successful
|
||||
execution. To find out more, add code ``to init/main.c`` to display
|
||||
kernel_execve()s return values.
|
||||
|
||||
Please extend this explanation whenever you find new failure causes
|
||||
(after all loading the init binary is a CRITICAL and hard transition step
|
||||
which needs to be made as painless as possible), then submit patch to LKML.
|
||||
which needs to be made as painless as possible), then submit a patch to LKML.
|
||||
Further TODOs:
|
||||
|
||||
- Implement the various ``run_init_process()`` invocations via a struct array
|
||||
which can then store the ``kernel_execve()`` result value and on failure
|
||||
log it all by iterating over **all** results (very important usability fix).
|
||||
- try to make the implementation itself more helpful in general,
|
||||
e.g. by providing additional error messages at affected places.
|
||||
- Try to make the implementation itself more helpful in general, e.g. by
|
||||
providing additional error messages at affected places.
|
||||
|
||||
Andreas Mohr <andi at lisas period de>
|
||||
|
@ -3336,7 +3336,7 @@
|
||||
See Documentation/admin-guide/sysctl/vm.rst for details.
|
||||
|
||||
ohci1394_dma=early [HW] enable debugging via the ohci1394 driver.
|
||||
See Documentation/debugging-via-ohci1394.txt for more
|
||||
See Documentation/core-api/debugging-via-ohci1394.rst for more
|
||||
info.
|
||||
|
||||
olpc_ec_timeout= [OLPC] ms delay when issuing EC commands
|
||||
|
@ -10,7 +10,7 @@ them to a "housekeeping" CPU dedicated to such work.
|
||||
References
|
||||
==========
|
||||
|
||||
- Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs.
|
||||
- Documentation/core-api/irq/irq-affinity.rst: Binding interrupts to sets of CPUs.
|
||||
|
||||
- Documentation/admin-guide/cgroup-v1: Using cgroups to bind tasks to sets of CPUs.
|
||||
|
||||
|
@ -12,107 +12,107 @@ and more generally they allow userland to take control of various
|
||||
memory page faults, something otherwise only the kernel code could do.
|
||||
|
||||
For example userfaults allows a proper and more optimal implementation
|
||||
of the PROT_NONE+SIGSEGV trick.
|
||||
of the ``PROT_NONE+SIGSEGV`` trick.
|
||||
|
||||
Design
|
||||
======
|
||||
|
||||
Userfaults are delivered and resolved through the userfaultfd syscall.
|
||||
Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
|
||||
|
||||
The userfaultfd (aside from registering and unregistering virtual
|
||||
The ``userfaultfd`` (aside from registering and unregistering virtual
|
||||
memory ranges) provides two primary functionalities:
|
||||
|
||||
1) read/POLLIN protocol to notify a userland thread of the faults
|
||||
1) ``read/POLLIN`` protocol to notify a userland thread of the faults
|
||||
happening
|
||||
|
||||
2) various UFFDIO_* ioctls that can manage the virtual memory regions
|
||||
registered in the userfaultfd that allows userland to efficiently
|
||||
2) various ``UFFDIO_*`` ioctls that can manage the virtual memory regions
|
||||
registered in the ``userfaultfd`` that allows userland to efficiently
|
||||
resolve the userfaults it receives via 1) or to manage the virtual
|
||||
memory in the background
|
||||
|
||||
The real advantage of userfaults if compared to regular virtual memory
|
||||
management of mremap/mprotect is that the userfaults in all their
|
||||
operations never involve heavyweight structures like vmas (in fact the
|
||||
userfaultfd runtime load never takes the mmap_sem for writing).
|
||||
``userfaultfd`` runtime load never takes the mmap_sem for writing).
|
||||
|
||||
Vmas are not suitable for page- (or hugepage) granular fault tracking
|
||||
when dealing with virtual address spaces that could span
|
||||
Terabytes. Too many vmas would be needed for that.
|
||||
|
||||
The userfaultfd once opened by invoking the syscall, can also be
|
||||
The ``userfaultfd`` once opened by invoking the syscall, can also be
|
||||
passed using unix domain sockets to a manager process, so the same
|
||||
manager process could handle the userfaults of a multitude of
|
||||
different processes without them being aware about what is going on
|
||||
(well of course unless they later try to use the userfaultfd
|
||||
(well of course unless they later try to use the ``userfaultfd``
|
||||
themselves on the same region the manager is already tracking, which
|
||||
is a corner case that would currently return -EBUSY).
|
||||
is a corner case that would currently return ``-EBUSY``).
|
||||
|
||||
API
|
||||
===
|
||||
|
||||
When first opened the userfaultfd must be enabled invoking the
|
||||
UFFDIO_API ioctl specifying a uffdio_api.api value set to UFFD_API (or
|
||||
a later API version) which will specify the read/POLLIN protocol
|
||||
userland intends to speak on the UFFD and the uffdio_api.features
|
||||
userland requires. The UFFDIO_API ioctl if successful (i.e. if the
|
||||
requested uffdio_api.api is spoken also by the running kernel and the
|
||||
When first opened the ``userfaultfd`` must be enabled invoking the
|
||||
``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
|
||||
a later API version) which will specify the ``read/POLLIN`` protocol
|
||||
userland intends to speak on the ``UFFD`` and the ``uffdio_api.features``
|
||||
userland requires. The ``UFFDIO_API`` ioctl if successful (i.e. if the
|
||||
requested ``uffdio_api.api`` is spoken also by the running kernel and the
|
||||
requested features are going to be enabled) will return into
|
||||
uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
|
||||
``uffdio_api.features`` and ``uffdio_api.ioctls`` two 64bit bitmasks of
|
||||
respectively all the available features of the read(2) protocol and
|
||||
the generic ioctl available.
|
||||
|
||||
The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
|
||||
defines what memory types are supported by the userfaultfd and what
|
||||
The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
|
||||
defines what memory types are supported by the ``userfaultfd`` and what
|
||||
events, except page fault notifications, may be generated.
|
||||
|
||||
If the kernel supports registering userfaultfd ranges on hugetlbfs
|
||||
virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
|
||||
uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
|
||||
set if the kernel supports registering userfaultfd ranges on shared
|
||||
memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
|
||||
MAP_SHARED, memfd_create, etc).
|
||||
If the kernel supports registering ``userfaultfd`` ranges on hugetlbfs
|
||||
virtual memory areas, ``UFFD_FEATURE_MISSING_HUGETLBFS`` will be set in
|
||||
``uffdio_api.features``. Similarly, ``UFFD_FEATURE_MISSING_SHMEM`` will be
|
||||
set if the kernel supports registering ``userfaultfd`` ranges on shared
|
||||
memory (covering all shmem APIs, i.e. tmpfs, ``IPCSHM``, ``/dev/zero``,
|
||||
``MAP_SHARED``, ``memfd_create``, etc).
|
||||
|
||||
The userland application that wants to use userfaultfd with hugetlbfs
|
||||
The userland application that wants to use ``userfaultfd`` with hugetlbfs
|
||||
or shared memory need to set the corresponding flag in
|
||||
uffdio_api.features to enable those features.
|
||||
``uffdio_api.features`` to enable those features.
|
||||
|
||||
If the userland desires to receive notifications for events other than
|
||||
page faults, it has to verify that uffdio_api.features has appropriate
|
||||
UFFD_FEATURE_EVENT_* bits set. These events are described in more
|
||||
detail below in "Non-cooperative userfaultfd" section.
|
||||
page faults, it has to verify that ``uffdio_api.features`` has appropriate
|
||||
``UFFD_FEATURE_EVENT_*`` bits set. These events are described in more
|
||||
detail below in `Non-cooperative userfaultfd`_ section.
|
||||
|
||||
Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
|
||||
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
|
||||
register a memory range in the userfaultfd by setting the
|
||||
uffdio_register structure accordingly. The uffdio_register.mode
|
||||
Once the ``userfaultfd`` has been enabled the ``UFFDIO_REGISTER`` ioctl should
|
||||
be invoked (if present in the returned ``uffdio_api.ioctls`` bitmask) to
|
||||
register a memory range in the ``userfaultfd`` by setting the
|
||||
uffdio_register structure accordingly. The ``uffdio_register.mode``
|
||||
bitmask will specify to the kernel which kind of faults to track for
|
||||
the range (UFFDIO_REGISTER_MODE_MISSING would track missing
|
||||
pages). The UFFDIO_REGISTER ioctl will return the
|
||||
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
|
||||
the range (``UFFDIO_REGISTER_MODE_MISSING`` would track missing
|
||||
pages). The ``UFFDIO_REGISTER`` ioctl will return the
|
||||
``uffdio_register.ioctls`` bitmask of ioctls that are suitable to resolve
|
||||
userfaults on the range registered. Not all ioctls will necessarily be
|
||||
supported for all memory types depending on the underlying virtual
|
||||
memory backend (anonymous memory vs tmpfs vs real filebacked
|
||||
mappings).
|
||||
|
||||
Userland can use the uffdio_register.ioctls to manage the virtual
|
||||
Userland can use the ``uffdio_register.ioctls`` to manage the virtual
|
||||
address space in the background (to add or potentially also remove
|
||||
memory from the userfaultfd registered range). This means a userfault
|
||||
memory from the ``userfaultfd`` registered range). This means a userfault
|
||||
could be triggering just before userland maps in the background the
|
||||
user-faulted page.
|
||||
|
||||
The primary ioctl to resolve userfaults is UFFDIO_COPY. That
|
||||
The primary ioctl to resolve userfaults is ``UFFDIO_COPY``. That
|
||||
atomically copies a page into the userfault registered range and wakes
|
||||
up the blocked userfaults (unless uffdio_copy.mode &
|
||||
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctl works similarly to
|
||||
UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
|
||||
half copied page since it'll keep userfaulting until the copy has
|
||||
finished.
|
||||
up the blocked userfaults
|
||||
(unless ``uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE`` is set).
|
||||
Other ioctl works similarly to ``UFFDIO_COPY``. They're atomic as in
|
||||
guaranteeing that nothing can see an half copied page since it'll
|
||||
keep userfaulting until the copy has finished.
|
||||
|
||||
Notes:
|
||||
|
||||
- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
|
||||
- If you requested ``UFFDIO_REGISTER_MODE_MISSING`` when registering then
|
||||
you must provide some kind of page in your thread after reading from
|
||||
the uffd. You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
|
||||
the uffd. You must provide either ``UFFDIO_COPY`` or ``UFFDIO_ZEROPAGE``.
|
||||
The normal behavior of the OS automatically providing a zero page on
|
||||
an annonymous mmaping is not in place.
|
||||
|
||||
@ -122,13 +122,13 @@ Notes:
|
||||
|
||||
- You get the address of the access that triggered the missing page
|
||||
event out of a struct uffd_msg that you read in the thread from the
|
||||
uffd. You can supply as many pages as you want with UFFDIO_COPY or
|
||||
UFFDIO_ZEROPAGE. Keep in mind that unless you used DONTWAKE then
|
||||
uffd. You can supply as many pages as you want with ``UFFDIO_COPY`` or
|
||||
``UFFDIO_ZEROPAGE``. Keep in mind that unless you used DONTWAKE then
|
||||
the first of any of those IOCTLs wakes up the faulting thread.
|
||||
|
||||
- Be sure to test for all errors including (pollfd[0].revents &
|
||||
POLLERR). This can happen, e.g. when ranges supplied were
|
||||
incorrect.
|
||||
- Be sure to test for all errors including
|
||||
(``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges
|
||||
supplied were incorrect.
|
||||
|
||||
Write Protect Notifications
|
||||
---------------------------
|
||||
@ -136,41 +136,42 @@ Write Protect Notifications
|
||||
This is equivalent to (but faster than) using mprotect and a SIGSEGV
|
||||
signal handler.
|
||||
|
||||
Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
|
||||
Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
|
||||
struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP
|
||||
Firstly you need to register a range with ``UFFDIO_REGISTER_MODE_WP``.
|
||||
Instead of using mprotect(2) you use
|
||||
``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
|
||||
while ``mode = UFFDIO_WRITEPROTECT_MODE_WP``
|
||||
in the struct passed in. The range does not default to and does not
|
||||
have to be identical to the range you registered with. You can write
|
||||
protect as many ranges as you like (inside the registered range).
|
||||
Then, in the thread reading from uffd the struct will have
|
||||
msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send
|
||||
ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again
|
||||
while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set.
|
||||
This wakes up the thread which will continue to run with writes. This
|
||||
``msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP`` set. Now you send
|
||||
``ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect)``
|
||||
again while ``pagefault.mode`` does not have ``UFFDIO_WRITEPROTECT_MODE_WP``
|
||||
set. This wakes up the thread which will continue to run with writes. This
|
||||
allows you to do the bookkeeping about the write in the uffd reading
|
||||
thread before the ioctl.
|
||||
|
||||
If you registered with both UFFDIO_REGISTER_MODE_MISSING and
|
||||
UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
|
||||
If you registered with both ``UFFDIO_REGISTER_MODE_MISSING`` and
|
||||
``UFFDIO_REGISTER_MODE_WP`` then you need to think about the sequence in
|
||||
which you supply a page and undo write protect. Note that there is a
|
||||
difference between writes into a WP area and into a !WP area. The
|
||||
former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
|
||||
UFFD_PAGEFAULT_FLAG_WRITE. The latter did not fail on protection but
|
||||
you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
|
||||
former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
|
||||
``UFFD_PAGEFAULT_FLAG_WRITE``. The latter did not fail on protection but
|
||||
you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
|
||||
used.
|
||||
|
||||
QEMU/KVM
|
||||
========
|
||||
|
||||
QEMU/KVM is using the userfaultfd syscall to implement postcopy live
|
||||
QEMU/KVM is using the ``userfaultfd`` syscall to implement postcopy live
|
||||
migration. Postcopy live migration is one form of memory
|
||||
externalization consisting of a virtual machine running with part or
|
||||
all of its memory residing on a different node in the cloud. The
|
||||
userfaultfd abstraction is generic enough that not a single line of
|
||||
``userfaultfd`` abstraction is generic enough that not a single line of
|
||||
KVM kernel code had to be modified in order to add postcopy live
|
||||
migration to QEMU.
|
||||
|
||||
Guest async page faults, FOLL_NOWAIT and all other GUP features work
|
||||
Guest async page faults, ``FOLL_NOWAIT`` and all other ``GUP*`` features work
|
||||
just fine in combination with userfaults. Userfaults trigger async
|
||||
page faults in the guest scheduler so those guest processes that
|
||||
aren't waiting for userfaults (i.e. network bound) can keep running in
|
||||
@ -183,19 +184,19 @@ generating userfaults for readonly guest regions.
|
||||
The implementation of postcopy live migration currently uses one
|
||||
single bidirectional socket but in the future two different sockets
|
||||
will be used (to reduce the latency of the userfaults to the minimum
|
||||
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).
|
||||
possible without having to decrease ``/proc/sys/net/ipv4/tcp_wmem``).
|
||||
|
||||
The QEMU in the source node writes all pages that it knows are missing
|
||||
in the destination node, into the socket, and the migration thread of
|
||||
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
|
||||
ioctls on the userfaultfd in order to map the received pages into the
|
||||
guest (UFFDIO_ZEROCOPY is used if the source page was a zero page).
|
||||
the QEMU running in the destination node runs ``UFFDIO_COPY|ZEROPAGE``
|
||||
ioctls on the ``userfaultfd`` in order to map the received pages into the
|
||||
guest (``UFFDIO_ZEROCOPY`` is used if the source page was a zero page).
|
||||
|
||||
A different postcopy thread in the destination node listens with
|
||||
poll() to the userfaultfd in parallel. When a POLLIN event is
|
||||
poll() to the ``userfaultfd`` in parallel. When a ``POLLIN`` event is
|
||||
generated after a userfault triggers, the postcopy thread read() from
|
||||
the userfaultfd and receives the fault address (or -EAGAIN in case the
|
||||
userfault was already resolved and waken by a UFFDIO_COPY|ZEROPAGE run
|
||||
the ``userfaultfd`` and receives the fault address (or ``-EAGAIN`` in case the
|
||||
userfault was already resolved and waken by a ``UFFDIO_COPY|ZEROPAGE`` run
|
||||
by the parallel QEMU migration thread).
|
||||
|
||||
After the QEMU postcopy thread (running in the destination node) gets
|
||||
@ -206,7 +207,7 @@ remaining missing pages from that new page offset. Soon after that
|
||||
(just the time to flush the tcp_wmem queue through the network) the
|
||||
migration thread in the QEMU running in the destination node will
|
||||
receive the page that triggered the userfault and it'll map it as
|
||||
usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
|
||||
usual with the ``UFFDIO_COPY|ZEROPAGE`` (without actually knowing if it
|
||||
was spontaneously sent by the source or if it was an urgent page
|
||||
requested through a userfault).
|
||||
|
||||
@ -219,74 +220,74 @@ checked to find which missing pages to send in round robin and we seek
|
||||
over it when receiving incoming userfaults. After sending each page of
|
||||
course the bitmap is updated accordingly. It's also useful to avoid
|
||||
sending the same page twice (in case the userfault is read by the
|
||||
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
|
||||
postcopy thread just before ``UFFDIO_COPY|ZEROPAGE`` runs in the migration
|
||||
thread).
|
||||
|
||||
Non-cooperative userfaultfd
|
||||
===========================
|
||||
|
||||
When the userfaultfd is monitored by an external manager, the manager
|
||||
When the ``userfaultfd`` is monitored by an external manager, the manager
|
||||
must be able to track changes in the process virtual memory
|
||||
layout. Userfaultfd can notify the manager about such changes using
|
||||
the same read(2) protocol as for the page fault notifications. The
|
||||
manager has to explicitly enable these events by setting appropriate
|
||||
bits in uffdio_api.features passed to UFFDIO_API ioctl:
|
||||
bits in ``uffdio_api.features`` passed to ``UFFDIO_API`` ioctl:
|
||||
|
||||
UFFD_FEATURE_EVENT_FORK
|
||||
enable userfaultfd hooks for fork(). When this feature is
|
||||
enabled, the userfaultfd context of the parent process is
|
||||
``UFFD_FEATURE_EVENT_FORK``
|
||||
enable ``userfaultfd`` hooks for fork(). When this feature is
|
||||
enabled, the ``userfaultfd`` context of the parent process is
|
||||
duplicated into the newly created process. The manager
|
||||
receives UFFD_EVENT_FORK with file descriptor of the new
|
||||
userfaultfd context in the uffd_msg.fork.
|
||||
receives ``UFFD_EVENT_FORK`` with file descriptor of the new
|
||||
``userfaultfd`` context in the ``uffd_msg.fork``.
|
||||
|
||||
UFFD_FEATURE_EVENT_REMAP
|
||||
``UFFD_FEATURE_EVENT_REMAP``
|
||||
enable notifications about mremap() calls. When the
|
||||
non-cooperative process moves a virtual memory area to a
|
||||
different location, the manager will receive
|
||||
UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
|
||||
``UFFD_EVENT_REMAP``. The ``uffd_msg.remap`` will contain the old and
|
||||
new addresses of the area and its original length.
|
||||
|
||||
UFFD_FEATURE_EVENT_REMOVE
|
||||
``UFFD_FEATURE_EVENT_REMOVE``
|
||||
enable notifications about madvise(MADV_REMOVE) and
|
||||
madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
|
||||
be generated upon these calls to madvise. The uffd_msg.remove
|
||||
madvise(MADV_DONTNEED) calls. The event ``UFFD_EVENT_REMOVE`` will
|
||||
be generated upon these calls to madvise(). The ``uffd_msg.remove``
|
||||
will contain start and end addresses of the removed area.
|
||||
|
||||
UFFD_FEATURE_EVENT_UNMAP
|
||||
``UFFD_FEATURE_EVENT_UNMAP``
|
||||
enable notifications about memory unmapping. The manager will
|
||||
get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and
|
||||
get ``UFFD_EVENT_UNMAP`` with ``uffd_msg.remove`` containing start and
|
||||
end addresses of the unmapped area.
|
||||
|
||||
Although the UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
|
||||
Although the ``UFFD_FEATURE_EVENT_REMOVE`` and ``UFFD_FEATURE_EVENT_UNMAP``
|
||||
are pretty similar, they quite differ in the action expected from the
|
||||
userfaultfd manager. In the former case, the virtual memory is
|
||||
``userfaultfd`` manager. In the former case, the virtual memory is
|
||||
removed, but the area is not, the area remains monitored by the
|
||||
userfaultfd, and if a page fault occurs in that area it will be
|
||||
``userfaultfd``, and if a page fault occurs in that area it will be
|
||||
delivered to the manager. The proper resolution for such page fault is
|
||||
to zeromap the faulting address. However, in the latter case, when an
|
||||
area is unmapped, either explicitly (with munmap() system call), or
|
||||
implicitly (e.g. during mremap()), the area is removed and in turn the
|
||||
userfaultfd context for such area disappears too and the manager will
|
||||
``userfaultfd`` context for such area disappears too and the manager will
|
||||
not get further userland page faults from the removed area. Still, the
|
||||
notification is required in order to prevent manager from using
|
||||
UFFDIO_COPY on the unmapped area.
|
||||
``UFFDIO_COPY`` on the unmapped area.
|
||||
|
||||
Unlike userland page faults which have to be synchronous and require
|
||||
explicit or implicit wakeup, all the events are delivered
|
||||
asynchronously and the non-cooperative process resumes execution as
|
||||
soon as manager executes read(). The userfaultfd manager should
|
||||
carefully synchronize calls to UFFDIO_COPY with the events
|
||||
processing. To aid the synchronization, the UFFDIO_COPY ioctl will
|
||||
return -ENOSPC when the monitored process exits at the time of
|
||||
UFFDIO_COPY, and -ENOENT, when the non-cooperative process has changed
|
||||
its virtual memory layout simultaneously with outstanding UFFDIO_COPY
|
||||
soon as manager executes read(). The ``userfaultfd`` manager should
|
||||
carefully synchronize calls to ``UFFDIO_COPY`` with the events
|
||||
processing. To aid the synchronization, the ``UFFDIO_COPY`` ioctl will
|
||||
return ``-ENOSPC`` when the monitored process exits at the time of
|
||||
``UFFDIO_COPY``, and ``-ENOENT``, when the non-cooperative process has changed
|
||||
its virtual memory layout simultaneously with outstanding ``UFFDIO_COPY``
|
||||
operation.
|
||||
|
||||
The current asynchronous model of the event delivery is optimal for
|
||||
single threaded non-cooperative userfaultfd manager implementations. A
|
||||
single threaded non-cooperative ``userfaultfd`` manager implementations. A
|
||||
synchronous event delivery model can be added later as a new
|
||||
userfaultfd feature to facilitate multithreading enhancements of the
|
||||
non cooperative manager, for example to allow UFFDIO_COPY ioctls to
|
||||
``userfaultfd`` feature to facilitate multithreading enhancements of the
|
||||
non cooperative manager, for example to allow ``UFFDIO_COPY`` ioctls to
|
||||
run in parallel to the event reception. Single threaded
|
||||
implementations should continue to use the current async event
|
||||
delivery model instead.
|
||||
|
@ -18,7 +18,7 @@ Mounting the root filesystem via NFS (nfsroot)
|
||||
In order to use a diskless system, such as an X-terminal or printer server for
|
||||
example, it is necessary for the root filesystem to be present on a non-disk
|
||||
device. This may be an initramfs (see
|
||||
Documentation/filesystems/ramfs-rootfs-initramfs.txt), a ramdisk (see
|
||||
Documentation/filesystems/ramfs-rootfs-initramfs.rst), a ramdisk (see
|
||||
Documentation/admin-guide/initrd.rst) or a filesystem mounted via NFS. The
|
||||
following text describes on how to use NFS for the root filesystem. For the rest
|
||||
of this text 'client' means the diskless system, and 'server' means the NFS
|
||||
|
@ -6,6 +6,21 @@ Numa policy hit/miss statistics
|
||||
|
||||
All units are pages. Hugepages have separate counters.
|
||||
|
||||
The numa_hit, numa_miss and numa_foreign counters reflect how well processes
|
||||
are able to allocate memory from nodes they prefer. If they succeed, numa_hit
|
||||
is incremented on the preferred node, otherwise numa_foreign is incremented on
|
||||
the preferred node and numa_miss on the node where allocation succeeded.
|
||||
|
||||
Usually preferred node is the one local to the CPU where the process executes,
|
||||
but restrictions such as mempolicies can change that, so there are also two
|
||||
counters based on CPU local node. local_node is similar to numa_hit and is
|
||||
incremented on allocation from a node by CPU on the same node. other_node is
|
||||
similar to numa_miss and is incremented on the node where allocation succeeds
|
||||
from a CPU from a different node. Note there is no counter analogical to
|
||||
numa_foreign.
|
||||
|
||||
In more detail:
|
||||
|
||||
=============== ============================================================
|
||||
numa_hit A process wanted to allocate memory from this node,
|
||||
and succeeded.
|
||||
@ -14,11 +29,13 @@ numa_miss A process wanted to allocate memory from another node,
|
||||
but ended up with memory from this node.
|
||||
|
||||
numa_foreign A process wanted to allocate on this node,
|
||||
but ended up with memory from another one.
|
||||
but ended up with memory from another node.
|
||||
|
||||
local_node A process ran on this node and got memory from it.
|
||||
local_node A process ran on this node's CPU,
|
||||
and got memory from this node.
|
||||
|
||||
other_node A process ran on this node and got memory from another node.
|
||||
other_node A process ran on a different node's CPU
|
||||
and got memory from this node.
|
||||
|
||||
interleave_hit Interleaving wanted to allocate from this node
|
||||
and succeeded.
|
||||
@ -28,3 +45,11 @@ For easier reading you can use the numastat utility from the numactl package
|
||||
(http://oss.sgi.com/projects/libnuma/). Note that it only works
|
||||
well right now on machines with a small number of CPUs.
|
||||
|
||||
Note that on systems with memoryless nodes (where a node has CPUs but no
|
||||
memory) the numa_hit, numa_miss and numa_foreign statistics can be skewed
|
||||
heavily. In the current kernel implementation, if a process prefers a
|
||||
memoryless node (i.e. because it is running on one of its local CPU), the
|
||||
implementation actually treats one of the nearest nodes with memory as the
|
||||
preferred node. As a result, such allocation will not increase the numa_foreign
|
||||
counter on the memoryless node, and will skew the numa_hit, numa_miss and
|
||||
numa_foreign statistics of the nearest node.
|
||||
|
@ -156,11 +156,11 @@ the labels provided by the BIOS won't match the real ones.
|
||||
ECC memory
|
||||
----------
|
||||
|
||||
As mentioned on the previous section, ECC memory has extra bits to be
|
||||
used for error correction. So, on 64 bit systems, a memory module
|
||||
has 64 bits of *data width*, and 74 bits of *total width*. So, there are
|
||||
8 bits extra bits to be used for the error detection and correction
|
||||
mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_.
|
||||
As mentioned in the previous section, ECC memory has extra bits to be
|
||||
used for error correction. In the above example, a memory module has
|
||||
64 bits of *data width*, and 72 bits of *total width*. The extra 8
|
||||
bits which are used for the error detection and correction mechanisms
|
||||
are referred to as the *syndrome*\ [#f1]_\ [#f2]_.
|
||||
|
||||
So, when the cpu requests the memory controller to write a word with
|
||||
*data width*, the memory controller calculates the *syndrome* in real time,
|
||||
@ -212,7 +212,7 @@ EDAC - Error Detection And Correction
|
||||
purposes.
|
||||
|
||||
When the subsystem was pushed upstream for the first time, on
|
||||
Kernel 2.6.16, for the first time, it was renamed to ``EDAC``.
|
||||
Kernel 2.6.16, it was renamed to ``EDAC``.
|
||||
|
||||
Purpose
|
||||
-------
|
||||
@ -351,15 +351,17 @@ controllers. The following example will assume 2 channels:
|
||||
+------------+-----------+-----------+
|
||||
| | ``ch0`` | ``ch1`` |
|
||||
+============+===========+===========+
|
||||
| ``csrow0`` | DIMM_A0 | DIMM_B0 |
|
||||
| | rank0 | rank0 |
|
||||
+------------+ - | - |
|
||||
| |**DIMM_A0**|**DIMM_B0**|
|
||||
+------------+-----------+-----------+
|
||||
| ``csrow0`` | rank0 | rank0 |
|
||||
+------------+-----------+-----------+
|
||||
| ``csrow1`` | rank1 | rank1 |
|
||||
+------------+-----------+-----------+
|
||||
| ``csrow2`` | DIMM_A1 | DIMM_B1 |
|
||||
| | rank0 | rank0 |
|
||||
+------------+ - | - |
|
||||
| ``csrow3`` | rank1 | rank1 |
|
||||
| |**DIMM_A1**|**DIMM_B1**|
|
||||
+------------+-----------+-----------+
|
||||
| ``csrow2`` | rank0 | rank0 |
|
||||
+------------+-----------+-----------+
|
||||
| ``csrow3`` | rank1 | rank1 |
|
||||
+------------+-----------+-----------+
|
||||
|
||||
In the above example, there are 4 physical slots on the motherboard
|
||||
|
@ -102,6 +102,30 @@ See the ``type_of_loader`` and ``ext_loader_ver`` fields in
|
||||
:doc:`/x86/boot` for additional information.
|
||||
|
||||
|
||||
bpf_stats_enabled
|
||||
=================
|
||||
|
||||
Controls whether the kernel should collect statistics on BPF programs
|
||||
(total time spent running, number of times run...). Enabling
|
||||
statistics causes a slight reduction in performance on each program
|
||||
run. The statistics can be seen using ``bpftool``.
|
||||
|
||||
= ===================================
|
||||
0 Don't collect statistics (default).
|
||||
1 Collect statistics.
|
||||
= ===================================
|
||||
|
||||
|
||||
cad_pid
|
||||
=======
|
||||
|
||||
This is the pid which will be signalled on reboot (notably, by
|
||||
Ctrl-Alt-Delete). Writing a value to this file which doesn't
|
||||
correspond to a running process will result in ``-ESRCH``.
|
||||
|
||||
See also `ctrl-alt-del`_.
|
||||
|
||||
|
||||
cap_last_cap
|
||||
============
|
||||
|
||||
@ -241,6 +265,40 @@ domain names are in general different. For a detailed discussion
|
||||
see the ``hostname(1)`` man page.
|
||||
|
||||
|
||||
firmware_config
|
||||
===============
|
||||
|
||||
See :doc:`/driver-api/firmware/fallback-mechanisms`.
|
||||
|
||||
The entries in this directory allow the firmware loader helper
|
||||
fallback to be controlled:
|
||||
|
||||
* ``force_sysfs_fallback``, when set to 1, forces the use of the
|
||||
fallback;
|
||||
* ``ignore_sysfs_fallback``, when set to 1, ignores any fallback.
|
||||
|
||||
|
||||
ftrace_dump_on_oops
|
||||
===================
|
||||
|
||||
Determines whether ``ftrace_dump()`` should be called on an oops (or
|
||||
kernel panic). This will output the contents of the ftrace buffers to
|
||||
the console. This is very useful for capturing traces that lead to
|
||||
crashes and outputting them to a serial console.
|
||||
|
||||
= ===================================================
|
||||
0 Disabled (default).
|
||||
1 Dump buffers of all CPUs.
|
||||
2 Dump the buffer of the CPU that triggered the oops.
|
||||
= ===================================================
|
||||
|
||||
|
||||
ftrace_enabled, stack_tracer_enabled
|
||||
====================================
|
||||
|
||||
See :doc:`/trace/ftrace`.
|
||||
|
||||
|
||||
hardlockup_all_cpu_backtrace
|
||||
============================
|
||||
|
||||
@ -344,6 +402,25 @@ Controls whether the panic kmsg data should be reported to Hyper-V.
|
||||
= =========================================================
|
||||
|
||||
|
||||
ignore-unaligned-usertrap
|
||||
=========================
|
||||
|
||||
On architectures where unaligned accesses cause traps, and where this
|
||||
feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN``;
|
||||
currently, ``arc`` and ``ia64``), controls whether all unaligned traps
|
||||
are logged.
|
||||
|
||||
= =============================================================
|
||||
0 Log all unaligned accesses.
|
||||
1 Only warn the first time a process traps. This is the default
|
||||
setting.
|
||||
= =============================================================
|
||||
|
||||
See also `unaligned-trap`_ and `unaligned-dump-stack`_. On ``ia64``,
|
||||
this allows system administrators to override the
|
||||
``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded.
|
||||
|
||||
|
||||
kexec_load_disabled
|
||||
===================
|
||||
|
||||
@ -459,6 +536,15 @@ Notes:
|
||||
successful IPC object allocation. If an IPC object allocation syscall
|
||||
fails, it is undefined if the value remains unmodified or is reset to -1.
|
||||
|
||||
|
||||
ngroups_max
|
||||
===========
|
||||
|
||||
Maximum number of supplementary groups, _i.e._ the maximum size which
|
||||
``setgroups`` will accept. Exports ``NGROUPS_MAX`` from the kernel.
|
||||
|
||||
|
||||
|
||||
nmi_watchdog
|
||||
============
|
||||
|
||||
@ -877,7 +963,7 @@ this sysctl interface anymore.
|
||||
pty
|
||||
===
|
||||
|
||||
See Documentation/filesystems/devpts.txt.
|
||||
See Documentation/filesystems/devpts.rst.
|
||||
|
||||
|
||||
randomize_va_space
|
||||
@ -1173,6 +1259,65 @@ If a value outside of this range is written to ``threads-max`` an
|
||||
``EINVAL`` error occurs.
|
||||
|
||||
|
||||
traceoff_on_warning
|
||||
===================
|
||||
|
||||
When set, disables tracing (see :doc:`/trace/ftrace`) when a
|
||||
``WARN()`` is hit.
|
||||
|
||||
|
||||
tracepoint_printk
|
||||
=================
|
||||
|
||||
When tracepoints are sent to printk() (enabled by the ``tp_printk``
|
||||
boot parameter), this entry provides runtime control::
|
||||
|
||||
echo 0 > /proc/sys/kernel/tracepoint_printk
|
||||
|
||||
will stop tracepoints from being sent to printk(), and::
|
||||
|
||||
echo 1 > /proc/sys/kernel/tracepoint_printk
|
||||
|
||||
will send them to printk() again.
|
||||
|
||||
This only works if the kernel was booted with ``tp_printk`` enabled.
|
||||
|
||||
See :doc:`/admin-guide/kernel-parameters` and
|
||||
:doc:`/trace/boottime-trace`.
|
||||
|
||||
|
||||
.. _unaligned-dump-stack:
|
||||
|
||||
unaligned-dump-stack (ia64)
|
||||
===========================
|
||||
|
||||
When logging unaligned accesses, controls whether the stack is
|
||||
dumped.
|
||||
|
||||
= ===================================================
|
||||
0 Do not dump the stack. This is the default setting.
|
||||
1 Dump the stack.
|
||||
= ===================================================
|
||||
|
||||
See also `ignore-unaligned-usertrap`_.
|
||||
|
||||
|
||||
unaligned-trap
|
||||
==============
|
||||
|
||||
On architectures where unaligned accesses cause traps, and where this
|
||||
feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW``; currently,
|
||||
``arc`` and ``parisc``), controls whether unaligned traps are caught
|
||||
and emulated (instead of failing).
|
||||
|
||||
= ========================================================
|
||||
0 Do not emulate unaligned accesses.
|
||||
1 Emulate unaligned accesses. This is the default setting.
|
||||
= ========================================================
|
||||
|
||||
See also `ignore-unaligned-usertrap`_.
|
||||
|
||||
|
||||
unknown_nmi_panic
|
||||
=================
|
||||
|
||||
@ -1184,6 +1329,16 @@ NMI switch that most IA32 servers have fires unknown NMI up, for
|
||||
example. If a system hangs up, try pressing the NMI switch.
|
||||
|
||||
|
||||
unprivileged_bpf_disabled
|
||||
=========================
|
||||
|
||||
Writing 1 to this entry will disable unprivileged calls to ``bpf()``;
|
||||
once disabled, calling ``bpf()`` without ``CAP_SYS_ADMIN`` will return
|
||||
``-EPERM``.
|
||||
|
||||
Once set, this can't be cleared.
|
||||
|
||||
|
||||
watchdog
|
||||
========
|
||||
|
||||
|
@ -24,13 +24,13 @@ optional external memory-mapped interface.
|
||||
Version 1 of the Activity Monitors architecture implements a counter group
|
||||
of four fixed and architecturally defined 64-bit event counters.
|
||||
|
||||
- CPU cycle counter: increments at the frequency of the CPU.
|
||||
- Constant counter: increments at the fixed frequency of the system
|
||||
clock.
|
||||
- Instructions retired: increments with every architecturally executed
|
||||
instruction.
|
||||
- Memory stall cycles: counts instruction dispatch stall cycles caused by
|
||||
misses in the last level cache within the clock domain.
|
||||
- CPU cycle counter: increments at the frequency of the CPU.
|
||||
- Constant counter: increments at the fixed frequency of the system
|
||||
clock.
|
||||
- Instructions retired: increments with every architecturally executed
|
||||
instruction.
|
||||
- Memory stall cycles: counts instruction dispatch stall cycles caused by
|
||||
misses in the last level cache within the clock domain.
|
||||
|
||||
When in WFI or WFE these counters do not increment.
|
||||
|
||||
@ -59,11 +59,11 @@ counters, only the presence of the extension.
|
||||
Firmware (code running at higher exception levels, e.g. arm-tf) support is
|
||||
needed to:
|
||||
|
||||
- Enable access for lower exception levels (EL2 and EL1) to the AMU
|
||||
registers.
|
||||
- Enable the counters. If not enabled these will read as 0.
|
||||
- Save/restore the counters before/after the CPU is being put/brought up
|
||||
from the 'off' power state.
|
||||
- Enable access for lower exception levels (EL2 and EL1) to the AMU
|
||||
registers.
|
||||
- Enable the counters. If not enabled these will read as 0.
|
||||
- Save/restore the counters before/after the CPU is being put/brought up
|
||||
from the 'off' power state.
|
||||
|
||||
When using kernels that have this feature enabled but boot with broken
|
||||
firmware the user may experience panics or lockups when accessing the
|
||||
@ -81,10 +81,10 @@ are not trapped in EL2/EL3.
|
||||
The fixed counters of AMUv1 are accessible though the following system
|
||||
register definitions:
|
||||
|
||||
- SYS_AMEVCNTR0_CORE_EL0
|
||||
- SYS_AMEVCNTR0_CONST_EL0
|
||||
- SYS_AMEVCNTR0_INST_RET_EL0
|
||||
- SYS_AMEVCNTR0_MEM_STALL_EL0
|
||||
- SYS_AMEVCNTR0_CORE_EL0
|
||||
- SYS_AMEVCNTR0_CONST_EL0
|
||||
- SYS_AMEVCNTR0_INST_RET_EL0
|
||||
- SYS_AMEVCNTR0_MEM_STALL_EL0
|
||||
|
||||
Auxiliary platform specific counters can be accessed using
|
||||
SYS_AMEVCNTR1_EL0(n), where n is a value between 0 and 15.
|
||||
@ -97,9 +97,9 @@ Userspace access
|
||||
|
||||
Currently, access from userspace to the AMU registers is disabled due to:
|
||||
|
||||
- Security reasons: they might expose information about code executed in
|
||||
secure mode.
|
||||
- Purpose: AMU counters are intended for system management use.
|
||||
- Security reasons: they might expose information about code executed in
|
||||
secure mode.
|
||||
- Purpose: AMU counters are intended for system management use.
|
||||
|
||||
Also, the presence of the feature is not visible to userspace.
|
||||
|
||||
@ -110,8 +110,8 @@ Virtualization
|
||||
Currently, access from userspace (EL0) and kernelspace (EL1) on the KVM
|
||||
guest side is disabled due to:
|
||||
|
||||
- Security reasons: they might expose information about code executed
|
||||
by other guests or the host.
|
||||
- Security reasons: they might expose information about code executed
|
||||
by other guests or the host.
|
||||
|
||||
Any attempt to access the AMU registers will result in an UNDEFINED
|
||||
exception being injected into the guest.
|
||||
|
@ -173,8 +173,10 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
- Caches, MMUs
|
||||
|
||||
The MMU must be off.
|
||||
|
||||
The instruction cache may be on or off, and must not hold any stale
|
||||
entries corresponding to the loaded kernel image.
|
||||
|
||||
The address range corresponding to the loaded kernel image must be
|
||||
cleaned to the PoC. In the presence of a system cache or other
|
||||
coherent masters with caches enabled, this will typically require
|
||||
@ -239,6 +241,7 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
- The DT or ACPI tables must describe a GICv2 interrupt controller.
|
||||
|
||||
For CPUs with pointer authentication functionality:
|
||||
|
||||
- If EL3 is present:
|
||||
|
||||
- SCR_EL3.APK (bit 16) must be initialised to 0b1
|
||||
@ -250,18 +253,22 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
- HCR_EL2.API (bit 41) must be initialised to 0b1
|
||||
|
||||
For CPUs with Activity Monitors Unit v1 (AMUv1) extension present:
|
||||
|
||||
- If EL3 is present:
|
||||
CPTR_EL3.TAM (bit 30) must be initialised to 0b0
|
||||
CPTR_EL2.TAM (bit 30) must be initialised to 0b0
|
||||
AMCNTENSET0_EL0 must be initialised to 0b1111
|
||||
AMCNTENSET1_EL0 must be initialised to a platform specific value
|
||||
having 0b1 set for the corresponding bit for each of the auxiliary
|
||||
counters present.
|
||||
|
||||
- CPTR_EL3.TAM (bit 30) must be initialised to 0b0
|
||||
- CPTR_EL2.TAM (bit 30) must be initialised to 0b0
|
||||
- AMCNTENSET0_EL0 must be initialised to 0b1111
|
||||
- AMCNTENSET1_EL0 must be initialised to a platform specific value
|
||||
having 0b1 set for the corresponding bit for each of the auxiliary
|
||||
counters present.
|
||||
|
||||
- If the kernel is entered at EL1:
|
||||
AMCNTENSET0_EL0 must be initialised to 0b1111
|
||||
AMCNTENSET1_EL0 must be initialised to a platform specific value
|
||||
having 0b1 set for the corresponding bit for each of the auxiliary
|
||||
counters present.
|
||||
|
||||
- AMCNTENSET0_EL0 must be initialised to 0b1111
|
||||
- AMCNTENSET1_EL0 must be initialised to a platform specific value
|
||||
having 0b1 set for the corresponding bit for each of the auxiliary
|
||||
counters present.
|
||||
|
||||
The requirements described above for CPU mode, caches, MMUs, architected
|
||||
timers, coherency and system registers apply to all CPUs. All CPUs must
|
||||
@ -305,7 +312,8 @@ following manner:
|
||||
Documentation/devicetree/bindings/arm/psci.yaml.
|
||||
|
||||
- Secondary CPU general-purpose register settings
|
||||
x0 = 0 (reserved for future use)
|
||||
x1 = 0 (reserved for future use)
|
||||
x2 = 0 (reserved for future use)
|
||||
x3 = 0 (reserved for future use)
|
||||
|
||||
- x0 = 0 (reserved for future use)
|
||||
- x1 = 0 (reserved for future use)
|
||||
- x2 = 0 (reserved for future use)
|
||||
- x3 = 0 (reserved for future use)
|
||||
|
@ -388,44 +388,6 @@ if major == 1 and minor < 6:
|
||||
# author, documentclass [howto, manual, or own class]).
|
||||
# Sorted in alphabetical order
|
||||
latex_documents = [
|
||||
('admin-guide/index', 'linux-user.tex', 'Linux Kernel User Documentation',
|
||||
'The kernel development community', 'manual'),
|
||||
('core-api/index', 'core-api.tex', 'The kernel core API manual',
|
||||
'The kernel development community', 'manual'),
|
||||
('crypto/index', 'crypto-api.tex', 'Linux Kernel Crypto API manual',
|
||||
'The kernel development community', 'manual'),
|
||||
('dev-tools/index', 'dev-tools.tex', 'Development tools for the Kernel',
|
||||
'The kernel development community', 'manual'),
|
||||
('doc-guide/index', 'kernel-doc-guide.tex', 'Linux Kernel Documentation Guide',
|
||||
'The kernel development community', 'manual'),
|
||||
('driver-api/index', 'driver-api.tex', 'The kernel driver API manual',
|
||||
'The kernel development community', 'manual'),
|
||||
('filesystems/index', 'filesystems.tex', 'Linux Filesystems API',
|
||||
'The kernel development community', 'manual'),
|
||||
('admin-guide/ext4', 'ext4-admin-guide.tex', 'ext4 Administration Guide',
|
||||
'ext4 Community', 'manual'),
|
||||
('filesystems/ext4/index', 'ext4-data-structures.tex',
|
||||
'ext4 Data Structures and Algorithms', 'ext4 Community', 'manual'),
|
||||
('gpu/index', 'gpu.tex', 'Linux GPU Driver Developer\'s Guide',
|
||||
'The kernel development community', 'manual'),
|
||||
('input/index', 'linux-input.tex', 'The Linux input driver subsystem',
|
||||
'The kernel development community', 'manual'),
|
||||
('kernel-hacking/index', 'kernel-hacking.tex', 'Unreliable Guide To Hacking The Linux Kernel',
|
||||
'The kernel development community', 'manual'),
|
||||
('media/index', 'media.tex', 'Linux Media Subsystem Documentation',
|
||||
'The kernel development community', 'manual'),
|
||||
('networking/index', 'networking.tex', 'Linux Networking Documentation',
|
||||
'The kernel development community', 'manual'),
|
||||
('process/index', 'development-process.tex', 'Linux Kernel Development Documentation',
|
||||
'The kernel development community', 'manual'),
|
||||
('security/index', 'security.tex', 'The kernel security subsystem manual',
|
||||
'The kernel development community', 'manual'),
|
||||
('sh/index', 'sh.tex', 'SuperH architecture implementation manual',
|
||||
'The kernel development community', 'manual'),
|
||||
('sound/index', 'sound.tex', 'Linux Sound Subsystem Documentation',
|
||||
'The kernel development community', 'manual'),
|
||||
('userspace-api/index', 'userspace-api.tex', 'The Linux kernel user-space API guide',
|
||||
'The kernel development community', 'manual'),
|
||||
]
|
||||
|
||||
# Add all other index files from Documentation/ subdirectories
|
||||
|
@ -18,6 +18,7 @@ it.
|
||||
|
||||
kernel-api
|
||||
workqueue
|
||||
printk-basics
|
||||
printk-formats
|
||||
symbol-namespaces
|
||||
|
||||
@ -30,10 +31,12 @@ Library functionality that is used throughout the kernel.
|
||||
:maxdepth: 1
|
||||
|
||||
kobject
|
||||
kref
|
||||
assoc_array
|
||||
xarray
|
||||
idr
|
||||
circular-buffers
|
||||
rbtree
|
||||
generic-radix-tree
|
||||
packing
|
||||
timekeeping
|
||||
@ -50,6 +53,7 @@ How Linux keeps everything from happening at the same time. See
|
||||
|
||||
atomic_ops
|
||||
refcount-vs-atomic
|
||||
irq/index
|
||||
local_ops
|
||||
padata
|
||||
../RCU/index
|
||||
@ -78,6 +82,10 @@ more memory-management documentation in :doc:`/vm/index`.
|
||||
:maxdepth: 1
|
||||
|
||||
memory-allocation
|
||||
dma-api
|
||||
dma-api-howto
|
||||
dma-attributes
|
||||
dma-isa-lpc
|
||||
mm-api
|
||||
genalloc
|
||||
pin_user_pages
|
||||
@ -92,6 +100,7 @@ Interfaces for kernel debugging
|
||||
|
||||
debug-objects
|
||||
tracepoint
|
||||
debugging-via-ohci1394
|
||||
|
||||
Everything else
|
||||
===============
|
||||
|
11
Documentation/core-api/irq/index.rst
Normal file
11
Documentation/core-api/irq/index.rst
Normal file
@ -0,0 +1,11 @@
|
||||
====
|
||||
IRQs
|
||||
====
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
concepts
|
||||
irq-affinity
|
||||
irq-domain
|
||||
irqflags-tracing
|
@ -263,7 +263,8 @@ needs to:
|
||||
Hierarchy irq_domain is in no way x86 specific, and is heavily used to
|
||||
support other architectures, such as ARM, ARM64 etc.
|
||||
|
||||
=== Debugging ===
|
||||
Debugging
|
||||
=========
|
||||
|
||||
Most of the internals of the IRQ subsystem are exposed in debugfs by
|
||||
turning CONFIG_GENERIC_IRQ_DEBUGFS on.
|
@ -80,11 +80,11 @@ what is the pointer to the containing structure? You must avoid tricks
|
||||
(such as assuming that the kobject is at the beginning of the structure)
|
||||
and, instead, use the container_of() macro, found in ``<linux/kernel.h>``::
|
||||
|
||||
container_of(pointer, type, member)
|
||||
container_of(ptr, type, member)
|
||||
|
||||
where:
|
||||
|
||||
* ``pointer`` is the pointer to the embedded kobject,
|
||||
* ``ptr`` is the pointer to the embedded kobject,
|
||||
* ``type`` is the type of the containing structure, and
|
||||
* ``member`` is the name of the structure field to which ``pointer`` points.
|
||||
|
||||
@ -140,7 +140,7 @@ the name of the kobject, call kobject_rename()::
|
||||
|
||||
int kobject_rename(struct kobject *kobj, const char *new_name);
|
||||
|
||||
kobject_rename does not perform any locking or have a solid notion of
|
||||
kobject_rename() does not perform any locking or have a solid notion of
|
||||
what names are valid so the caller must provide their own sanity checking
|
||||
and serialization.
|
||||
|
||||
@ -210,7 +210,7 @@ statically and will warn the developer of this improper usage.
|
||||
If all that you want to use a kobject for is to provide a reference counter
|
||||
for your structure, please use the struct kref instead; a kobject would be
|
||||
overkill. For more information on how to use struct kref, please see the
|
||||
file Documentation/kref.txt in the Linux kernel source tree.
|
||||
file Documentation/core-api/kref.rst in the Linux kernel source tree.
|
||||
|
||||
|
||||
Creating "simple" kobjects
|
||||
@ -222,17 +222,17 @@ ksets, show and store functions, and other details. This is the one
|
||||
exception where a single kobject should be created. To create such an
|
||||
entry, use the function::
|
||||
|
||||
struct kobject *kobject_create_and_add(char *name, struct kobject *parent);
|
||||
struct kobject *kobject_create_and_add(const char *name, struct kobject *parent);
|
||||
|
||||
This function will create a kobject and place it in sysfs in the location
|
||||
underneath the specified parent kobject. To create simple attributes
|
||||
associated with this kobject, use::
|
||||
|
||||
int sysfs_create_file(struct kobject *kobj, struct attribute *attr);
|
||||
int sysfs_create_file(struct kobject *kobj, const struct attribute *attr);
|
||||
|
||||
or::
|
||||
|
||||
int sysfs_create_group(struct kobject *kobj, struct attribute_group *grp);
|
||||
int sysfs_create_group(struct kobject *kobj, const struct attribute_group *grp);
|
||||
|
||||
Both types of attributes used here, with a kobject that has been created
|
||||
with the kobject_create_and_add(), can be of type kobj_attribute, so no
|
||||
@ -300,8 +300,10 @@ kobj_type::
|
||||
void (*release)(struct kobject *kobj);
|
||||
const struct sysfs_ops *sysfs_ops;
|
||||
struct attribute **default_attrs;
|
||||
const struct attribute_group **default_groups;
|
||||
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
|
||||
const void *(*namespace)(struct kobject *kobj);
|
||||
void (*get_ownership)(struct kobject *kobj, kuid_t *uid, kgid_t *gid);
|
||||
};
|
||||
|
||||
This structure is used to describe a particular type of kobject (or, more
|
||||
@ -352,12 +354,12 @@ created and never declared statically or on the stack. To create a new
|
||||
kset use::
|
||||
|
||||
struct kset *kset_create_and_add(const char *name,
|
||||
struct kset_uevent_ops *u,
|
||||
struct kobject *parent);
|
||||
const struct kset_uevent_ops *uevent_ops,
|
||||
struct kobject *parent_kobj);
|
||||
|
||||
When you are finished with the kset, call::
|
||||
|
||||
void kset_unregister(struct kset *kset);
|
||||
void kset_unregister(struct kset *k);
|
||||
|
||||
to destroy it. This removes the kset from sysfs and decrements its reference
|
||||
count. When the reference count goes to zero, the kset will be released.
|
||||
@ -371,9 +373,9 @@ If a kset wishes to control the uevent operations of the kobjects
|
||||
associated with it, it can use the struct kset_uevent_ops to handle it::
|
||||
|
||||
struct kset_uevent_ops {
|
||||
int (*filter)(struct kset *kset, struct kobject *kobj);
|
||||
const char *(*name)(struct kset *kset, struct kobject *kobj);
|
||||
int (*uevent)(struct kset *kset, struct kobject *kobj,
|
||||
int (* const filter)(struct kset *kset, struct kobject *kobj);
|
||||
const char *(* const name)(struct kset *kset, struct kobject *kobj);
|
||||
int (* const uevent)(struct kset *kset, struct kobject *kobj,
|
||||
struct kobj_uevent_env *env);
|
||||
};
|
||||
|
||||
|
115
Documentation/core-api/printk-basics.rst
Normal file
115
Documentation/core-api/printk-basics.rst
Normal file
@ -0,0 +1,115 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===========================
|
||||
Message logging with printk
|
||||
===========================
|
||||
|
||||
printk() is one of the most widely known functions in the Linux kernel. It's the
|
||||
standard tool we have for printing messages and usually the most basic way of
|
||||
tracing and debugging. If you're familiar with printf(3) you can tell printk()
|
||||
is based on it, although it has some functional differences:
|
||||
|
||||
- printk() messages can specify a log level.
|
||||
|
||||
- the format string, while largely compatible with C99, doesn't follow the
|
||||
exact same specification. It has some extensions and a few limitations
|
||||
(no ``%n`` or floating point conversion specifiers). See :ref:`How to get
|
||||
printk format specifiers right <printk-specifiers>`.
|
||||
|
||||
All printk() messages are printed to the kernel log buffer, which is a ring
|
||||
buffer exported to userspace through /dev/kmsg. The usual way to read it is
|
||||
using ``dmesg``.
|
||||
|
||||
printk() is typically used like this::
|
||||
|
||||
printk(KERN_INFO "Message: %s\n", arg);
|
||||
|
||||
where ``KERN_INFO`` is the log level (note that it's concatenated to the format
|
||||
string, the log level is not a separate argument). The available log levels are:
|
||||
|
||||
+----------------+--------+-----------------------------------------------+
|
||||
| Name | String | Alias function |
|
||||
+================+========+===============================================+
|
||||
| KERN_EMERG | "0" | pr_emerg() |
|
||||
+----------------+--------+-----------------------------------------------+
|
||||
| KERN_ALERT | "1" | pr_alert() |
|
||||
+----------------+--------+-----------------------------------------------+
|
||||
| KERN_CRIT | "2" | pr_crit() |
|
||||
+----------------+--------+-----------------------------------------------+
|
||||
| KERN_ERR | "3" | pr_err() |
|
||||
+----------------+--------+-----------------------------------------------+
|
||||
| KERN_WARNING | "4" | pr_warn() |
|
||||
+----------------+--------+-----------------------------------------------+
|
||||
| KERN_NOTICE | "5" | pr_notice() |
|
||||
+----------------+--------+-----------------------------------------------+
|
||||
| KERN_INFO | "6" | pr_info() |
|
||||
+----------------+--------+-----------------------------------------------+
|
||||
| KERN_DEBUG | "7" | pr_debug() and pr_devel() if DEBUG is defined |
|
||||
+----------------+--------+-----------------------------------------------+
|
||||
| KERN_DEFAULT | "" | |
|
||||
+----------------+--------+-----------------------------------------------+
|
||||
| KERN_CONT | "c" | pr_cont() |
|
||||
+----------------+--------+-----------------------------------------------+
|
||||
|
||||
|
||||
The log level specifies the importance of a message. The kernel decides whether
|
||||
to show the message immediately (printing it to the current console) depending
|
||||
on its log level and the current *console_loglevel* (a kernel variable). If the
|
||||
message priority is higher (lower log level value) than the *console_loglevel*
|
||||
the message will be printed to the console.
|
||||
|
||||
If the log level is omitted, the message is printed with ``KERN_DEFAULT``
|
||||
level.
|
||||
|
||||
You can check the current *console_loglevel* with::
|
||||
|
||||
$ cat /proc/sys/kernel/printk
|
||||
4 4 1 7
|
||||
|
||||
The result shows the *current*, *default*, *minimum* and *boot-time-default* log
|
||||
levels.
|
||||
|
||||
To change the current console_loglevel simply write the the desired level to
|
||||
``/proc/sys/kernel/printk``. For example, to print all messages to the console::
|
||||
|
||||
# echo 8 > /proc/sys/kernel/printk
|
||||
|
||||
Another way, using ``dmesg``::
|
||||
|
||||
# dmesg -n 5
|
||||
|
||||
sets the console_loglevel to print KERN_WARNING (4) or more severe messages to
|
||||
console. See ``dmesg(1)`` for more information.
|
||||
|
||||
As an alternative to printk() you can use the ``pr_*()`` aliases for
|
||||
logging. This family of macros embed the log level in the macro names. For
|
||||
example::
|
||||
|
||||
pr_info("Info message no. %d\n", msg_num);
|
||||
|
||||
prints a ``KERN_INFO`` message.
|
||||
|
||||
Besides being more concise than the equivalent printk() calls, they can use a
|
||||
common definition for the format string through the pr_fmt() macro. For
|
||||
instance, defining this at the top of a source file (before any ``#include``
|
||||
directive)::
|
||||
|
||||
#define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__
|
||||
|
||||
would prefix every pr_*() message in that file with the module and function name
|
||||
that originated the message.
|
||||
|
||||
For debugging purposes there are also two conditionally-compiled macros:
|
||||
pr_debug() and pr_devel(), which are compiled-out unless ``DEBUG`` (or
|
||||
also ``CONFIG_DYNAMIC_DEBUG`` in the case of pr_debug()) is defined.
|
||||
|
||||
|
||||
Function reference
|
||||
==================
|
||||
|
||||
.. kernel-doc:: kernel/printk/printk.c
|
||||
:functions: printk
|
||||
|
||||
.. kernel-doc:: include/linux/printk.h
|
||||
:functions: pr_emerg pr_alert pr_crit pr_err pr_warn pr_notice pr_info
|
||||
pr_fmt pr_debug pr_devel pr_cont
|
@ -2,6 +2,8 @@
|
||||
How to get printk format specifiers right
|
||||
=========================================
|
||||
|
||||
.. _printk-specifiers:
|
||||
|
||||
:Author: Randy Dunlap <rdunlap@infradead.org>
|
||||
:Author: Andrew Murray <amurray@mpc-data.co.uk>
|
||||
|
||||
|
@ -6,7 +6,7 @@ Documentation subsystem maintainer entry profile
|
||||
The documentation "subsystem" is the central coordinating point for the
|
||||
kernel's documentation and associated infrastructure. It covers the
|
||||
hierarchy under Documentation/ (with the exception of
|
||||
Documentation/device-tree), various utilities under scripts/ and, at least
|
||||
Documentation/devicetree), various utilities under scripts/ and, at least
|
||||
some of the time, LICENSES/.
|
||||
|
||||
It's worth noting, though, that the boundaries of this subsystem are rather
|
||||
|
@ -11,7 +11,7 @@ course not limited to GPU use cases.
|
||||
The three main components of this are: (1) dma-buf, representing a
|
||||
sg_table and exposed to userspace as a file descriptor to allow passing
|
||||
between devices, (2) fence, which provides a mechanism to signal when
|
||||
one device as finished access, and (3) reservation, which manages the
|
||||
one device has finished access, and (3) reservation, which manages the
|
||||
shared or exclusive fence(s) associated with the buffer.
|
||||
|
||||
Shared DMA Buffers
|
||||
@ -31,7 +31,7 @@ The exporter
|
||||
- implements and manages operations in :c:type:`struct dma_buf_ops
|
||||
<dma_buf_ops>` for the buffer,
|
||||
- allows other users to share the buffer by using dma_buf sharing APIs,
|
||||
- manages the details of buffer allocation, wrapped int a :c:type:`struct
|
||||
- manages the details of buffer allocation, wrapped in a :c:type:`struct
|
||||
dma_buf <dma_buf>`,
|
||||
- decides about the actual backing storage where this allocation happens,
|
||||
- and takes care of any migration of scatterlist - for all (shared) users of
|
||||
|
@ -50,10 +50,10 @@ Attributes
|
||||
|
||||
Attributes of devices can be exported by a device driver through sysfs.
|
||||
|
||||
Please see Documentation/filesystems/sysfs.txt for more information
|
||||
Please see Documentation/filesystems/sysfs.rst for more information
|
||||
on how sysfs works.
|
||||
|
||||
As explained in Documentation/kobject.txt, device attributes must be
|
||||
As explained in Documentation/core-api/kobject.rst, device attributes must be
|
||||
created before the KOBJ_ADD uevent is generated. The only way to realize
|
||||
that is by defining an attribute group.
|
||||
|
||||
|
@ -121,4 +121,4 @@ device-specific data or tunable interfaces.
|
||||
|
||||
More information about the sysfs directory layout can be found in
|
||||
the other documents in this directory and in the file
|
||||
Documentation/filesystems/sysfs.txt.
|
||||
Documentation/filesystems/sysfs.rst.
|
||||
|
@ -39,6 +39,7 @@ available subsections can be seen below.
|
||||
spi
|
||||
i2c
|
||||
ipmb
|
||||
ipmi
|
||||
i3c/index
|
||||
interconnect
|
||||
devfreq
|
||||
|
@ -278,8 +278,8 @@ by a region device with a dynamically assigned id (REGION0 - REGION5).
|
||||
be contiguous in DPA-space.
|
||||
|
||||
This bus is provided by the kernel under the device
|
||||
/sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
|
||||
the nfit_test.ko module is loaded. This not only test LIBNVDIMM but the
|
||||
/sys/devices/platform/nfit_test.0 when the nfit_test.ko module from
|
||||
tools/testing/nvdimm is loaded. This not only test LIBNVDIMM but the
|
||||
acpi_nfit.ko driver as well.
|
||||
|
||||
|
||||
|
@ -1,3 +1,6 @@
|
||||
================
|
||||
CPU Idle Cooling
|
||||
================
|
||||
|
||||
Situation:
|
||||
----------
|
||||
|
@ -8,6 +8,7 @@ Thermal
|
||||
:maxdepth: 1
|
||||
|
||||
cpu-cooling-api
|
||||
cpu-idle-cooling
|
||||
sysfs-api
|
||||
power_allocator
|
||||
|
||||
|
@ -23,7 +23,7 @@
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | ok |
|
||||
|
@ -22,9 +22,9 @@
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
| um: | TODO |
|
||||
|
@ -11,7 +11,7 @@
|
||||
| arm: | ok |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| csky: | TODO |
|
||||
| csky: | ok |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
| ia64: | TODO |
|
||||
|
@ -11,7 +11,7 @@
|
||||
| arm: | TODO |
|
||||
| arm64: | TODO |
|
||||
| c6x: | TODO |
|
||||
| csky: | TODO |
|
||||
| csky: | ok |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
| ia64: | TODO |
|
||||
|
@ -11,7 +11,7 @@
|
||||
| arm: | ok |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| csky: | TODO |
|
||||
| csky: | ok |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
| ia64: | ok |
|
||||
@ -23,7 +23,7 @@
|
||||
| openrisc: | TODO |
|
||||
| parisc: | ok |
|
||||
| powerpc: | ok |
|
||||
| riscv: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
| sh: | ok |
|
||||
| sparc: | ok |
|
||||
|
@ -11,7 +11,7 @@
|
||||
| arm: | ok |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| csky: | TODO |
|
||||
| csky: | ok |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
| ia64: | ok |
|
||||
|
@ -11,7 +11,7 @@
|
||||
| arm: | ok |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| csky: | TODO |
|
||||
| csky: | ok |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
| ia64: | TODO |
|
||||
|
@ -11,7 +11,7 @@
|
||||
| arm: | ok |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| csky: | TODO |
|
||||
| csky: | ok |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
| ia64: | TODO |
|
||||
|
@ -16,7 +16,7 @@
|
||||
| hexagon: | TODO |
|
||||
| ia64: | TODO |
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| microblaze: | ok |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
|
@ -11,7 +11,7 @@
|
||||
| arm: | ok |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| csky: | TODO |
|
||||
| csky: | ok |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | ok |
|
||||
| ia64: | TODO |
|
||||
|
@ -11,7 +11,7 @@
|
||||
| arm: | ok |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| csky: | TODO |
|
||||
| csky: | ok |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | ok |
|
||||
| ia64: | TODO |
|
||||
@ -21,7 +21,7 @@
|
||||
| nds32: | ok |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| parisc: | ok |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | ok |
|
||||
|
@ -11,7 +11,7 @@
|
||||
| arm: | ok |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| csky: | TODO |
|
||||
| csky: | ok |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
| ia64: | TODO |
|
||||
@ -23,7 +23,7 @@
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -11,7 +11,7 @@
|
||||
| arm: | ok |
|
||||
| arm64: | ok |
|
||||
| c6x: | TODO |
|
||||
| csky: | TODO |
|
||||
| csky: | ok |
|
||||
| h8300: | TODO |
|
||||
| hexagon: | TODO |
|
||||
| ia64: | TODO |
|
||||
@ -23,7 +23,7 @@
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -23,7 +23,7 @@
|
||||
| openrisc: | TODO |
|
||||
| parisc: | ok |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| riscv: | ok |
|
||||
| s390: | ok |
|
||||
| sh: | TODO |
|
||||
| sparc: | TODO |
|
||||
|
@ -22,7 +22,7 @@
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
| parisc: | TODO |
|
||||
| powerpc: | TODO |
|
||||
| powerpc: | ok |
|
||||
| riscv: | TODO |
|
||||
| s390: | TODO |
|
||||
| sh: | TODO |
|
||||
|
@ -17,7 +17,7 @@
|
||||
| ia64: | TODO |
|
||||
| m68k: | TODO |
|
||||
| microblaze: | TODO |
|
||||
| mips: | TODO |
|
||||
| mips: | ok |
|
||||
| nds32: | TODO |
|
||||
| nios2: | TODO |
|
||||
| openrisc: | TODO |
|
||||
|
@ -192,4 +192,4 @@ For more information on the Plan 9 Operating System check out
|
||||
http://plan9.bell-labs.com/plan9
|
||||
|
||||
For information on Plan 9 from User Space (Plan 9 applications and libraries
|
||||
ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
|
||||
ported to Linux/BSD/OSX/etc) check out https://9fans.github.io/plan9port/
|
||||
|
@ -1,3 +1,10 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=================
|
||||
Automount Support
|
||||
=================
|
||||
|
||||
|
||||
Support is available for filesystems that wish to do automounting
|
||||
support (such as kAFS which can be found in fs/afs/ and NFS in
|
||||
fs/nfs/). This facility includes allowing in-kernel mounts to be
|
||||
@ -5,13 +12,12 @@ performed and mountpoint degradation to be requested. The latter can
|
||||
also be requested by userspace.
|
||||
|
||||
|
||||
======================
|
||||
IN-KERNEL AUTOMOUNTING
|
||||
In-Kernel Automounting
|
||||
======================
|
||||
|
||||
See section "Mount Traps" of Documentation/filesystems/autofs.rst
|
||||
|
||||
Then from userspace, you can just do something like:
|
||||
Then from userspace, you can just do something like::
|
||||
|
||||
[root@andromeda root]# mount -t afs \#root.afs. /afs
|
||||
[root@andromeda root]# ls /afs
|
||||
@ -21,7 +27,7 @@ Then from userspace, you can just do something like:
|
||||
[root@andromeda root]# ls /afs/cambridge/afsdoc/
|
||||
ChangeLog html LICENSE pdf RELNOTES-1.2.2
|
||||
|
||||
And then if you look in the mountpoint catalogue, you'll see something like:
|
||||
And then if you look in the mountpoint catalogue, you'll see something like::
|
||||
|
||||
[root@andromeda root]# cat /proc/mounts
|
||||
...
|
||||
@ -30,8 +36,7 @@ And then if you look in the mountpoint catalogue, you'll see something like:
|
||||
#afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
|
||||
|
||||
|
||||
===========================
|
||||
AUTOMATIC MOUNTPOINT EXPIRY
|
||||
Automatic Mountpoint Expiry
|
||||
===========================
|
||||
|
||||
Automatic expiration of mountpoints is easy, provided you've mounted the
|
||||
@ -43,7 +48,8 @@ To do expiration, you need to follow these steps:
|
||||
hung.
|
||||
|
||||
(2) When a new mountpoint is created in the ->d_automount method, add
|
||||
the mnt to the list using mnt_set_expiry()
|
||||
the mnt to the list using mnt_set_expiry()::
|
||||
|
||||
mnt_set_expiry(newmnt, &afs_vfsmounts);
|
||||
|
||||
(3) When you want mountpoints to be expired, call mark_mounts_for_expiry()
|
||||
@ -70,8 +76,7 @@ and the copies of those that are on an expiration list will be added to the
|
||||
same expiration list.
|
||||
|
||||
|
||||
=======================
|
||||
USERSPACE DRIVEN EXPIRY
|
||||
Userspace Driven Expiry
|
||||
=======================
|
||||
|
||||
As an alternative, it is possible for userspace to request expiry of any
|
@ -1,6 +1,8 @@
|
||||
==========================
|
||||
FS-CACHE CACHE BACKEND API
|
||||
==========================
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==========================
|
||||
FS-Cache Cache backend API
|
||||
==========================
|
||||
|
||||
The FS-Cache system provides an API by which actual caches can be supplied to
|
||||
FS-Cache for it to then serve out to network filesystems and other interested
|
||||
@ -9,15 +11,14 @@ parties.
|
||||
This API is declared in <linux/fscache-cache.h>.
|
||||
|
||||
|
||||
====================================
|
||||
INITIALISING AND REGISTERING A CACHE
|
||||
Initialising and Registering a Cache
|
||||
====================================
|
||||
|
||||
To start off, a cache definition must be initialised and registered for each
|
||||
cache the backend wants to make available. For instance, CacheFS does this in
|
||||
the fill_super() operation on mounting.
|
||||
|
||||
The cache definition (struct fscache_cache) should be initialised by calling:
|
||||
The cache definition (struct fscache_cache) should be initialised by calling::
|
||||
|
||||
void fscache_init_cache(struct fscache_cache *cache,
|
||||
struct fscache_cache_ops *ops,
|
||||
@ -26,17 +27,17 @@ The cache definition (struct fscache_cache) should be initialised by calling:
|
||||
|
||||
Where:
|
||||
|
||||
(*) "cache" is a pointer to the cache definition;
|
||||
* "cache" is a pointer to the cache definition;
|
||||
|
||||
(*) "ops" is a pointer to the table of operations that the backend supports on
|
||||
* "ops" is a pointer to the table of operations that the backend supports on
|
||||
this cache; and
|
||||
|
||||
(*) "idfmt" is a format and printf-style arguments for constructing a label
|
||||
* "idfmt" is a format and printf-style arguments for constructing a label
|
||||
for the cache.
|
||||
|
||||
|
||||
The cache should then be registered with FS-Cache by passing a pointer to the
|
||||
previously initialised cache definition to:
|
||||
previously initialised cache definition to::
|
||||
|
||||
int fscache_add_cache(struct fscache_cache *cache,
|
||||
struct fscache_object *fsdef,
|
||||
@ -44,12 +45,12 @@ previously initialised cache definition to:
|
||||
|
||||
Two extra arguments should also be supplied:
|
||||
|
||||
(*) "fsdef" which should point to the object representation for the FS-Cache
|
||||
* "fsdef" which should point to the object representation for the FS-Cache
|
||||
master index in this cache. Netfs primary index entries will be created
|
||||
here. FS-Cache keeps the caller's reference to the index object if
|
||||
successful and will release it upon withdrawal of the cache.
|
||||
|
||||
(*) "tagname" which, if given, should be a text string naming this cache. If
|
||||
* "tagname" which, if given, should be a text string naming this cache. If
|
||||
this is NULL, the identifier will be used instead. For CacheFS, the
|
||||
identifier is set to name the underlying block device and the tag can be
|
||||
supplied by mount.
|
||||
@ -58,20 +59,18 @@ This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag
|
||||
is already in use. 0 will be returned on success.
|
||||
|
||||
|
||||
=====================
|
||||
UNREGISTERING A CACHE
|
||||
Unregistering a Cache
|
||||
=====================
|
||||
|
||||
A cache can be withdrawn from the system by calling this function with a
|
||||
pointer to the cache definition:
|
||||
pointer to the cache definition::
|
||||
|
||||
void fscache_withdraw_cache(struct fscache_cache *cache);
|
||||
|
||||
In CacheFS's case, this is called by put_super().
|
||||
|
||||
|
||||
========
|
||||
SECURITY
|
||||
Security
|
||||
========
|
||||
|
||||
The cache methods are executed one of two contexts:
|
||||
@ -89,8 +88,7 @@ be masqueraded for the duration of the cache driver's access to the cache.
|
||||
This is left to the cache to handle; FS-Cache makes no effort in this regard.
|
||||
|
||||
|
||||
===================================
|
||||
CONTROL AND STATISTICS PRESENTATION
|
||||
Control and Statistics Presentation
|
||||
===================================
|
||||
|
||||
The cache may present data to the outside world through FS-Cache's interfaces
|
||||
@ -101,11 +99,10 @@ is enabled. This is accessible through the kobject struct fscache_cache::kobj
|
||||
and is for use by the cache as it sees fit.
|
||||
|
||||
|
||||
========================
|
||||
RELEVANT DATA STRUCTURES
|
||||
Relevant Data Structures
|
||||
========================
|
||||
|
||||
(*) Index/Data file FS-Cache representation cookie:
|
||||
* Index/Data file FS-Cache representation cookie::
|
||||
|
||||
struct fscache_cookie {
|
||||
struct fscache_object_def *def;
|
||||
@ -121,7 +118,7 @@ RELEVANT DATA STRUCTURES
|
||||
cache operations.
|
||||
|
||||
|
||||
(*) In-cache object representation:
|
||||
* In-cache object representation::
|
||||
|
||||
struct fscache_object {
|
||||
int debug_id;
|
||||
@ -150,7 +147,7 @@ RELEVANT DATA STRUCTURES
|
||||
initialised by calling fscache_object_init(object).
|
||||
|
||||
|
||||
(*) FS-Cache operation record:
|
||||
* FS-Cache operation record::
|
||||
|
||||
struct fscache_operation {
|
||||
atomic_t usage;
|
||||
@ -173,7 +170,7 @@ RELEVANT DATA STRUCTURES
|
||||
an operation needs more processing time, it should be enqueued again.
|
||||
|
||||
|
||||
(*) FS-Cache retrieval operation record:
|
||||
* FS-Cache retrieval operation record::
|
||||
|
||||
struct fscache_retrieval {
|
||||
struct fscache_operation op;
|
||||
@ -198,7 +195,7 @@ RELEVANT DATA STRUCTURES
|
||||
it sees fit.
|
||||
|
||||
|
||||
(*) FS-Cache storage operation record:
|
||||
* FS-Cache storage operation record::
|
||||
|
||||
struct fscache_storage {
|
||||
struct fscache_operation op;
|
||||
@ -212,16 +209,17 @@ RELEVANT DATA STRUCTURES
|
||||
storage.
|
||||
|
||||
|
||||
================
|
||||
CACHE OPERATIONS
|
||||
Cache Operations
|
||||
================
|
||||
|
||||
The cache backend provides FS-Cache with a table of operations that can be
|
||||
performed on the denizens of the cache. These are held in a structure of type:
|
||||
|
||||
struct fscache_cache_ops
|
||||
::
|
||||
|
||||
(*) Name of cache provider [mandatory]:
|
||||
struct fscache_cache_ops
|
||||
|
||||
* Name of cache provider [mandatory]::
|
||||
|
||||
const char *name
|
||||
|
||||
@ -229,7 +227,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
the backend.
|
||||
|
||||
|
||||
(*) Allocate a new object [mandatory]:
|
||||
* Allocate a new object [mandatory]::
|
||||
|
||||
struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
|
||||
struct fscache_cookie *cookie)
|
||||
@ -244,7 +242,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
form once lookup is complete or aborted.
|
||||
|
||||
|
||||
(*) Look up and create object [mandatory]:
|
||||
* Look up and create object [mandatory]::
|
||||
|
||||
void (*lookup_object)(struct fscache_object *object)
|
||||
|
||||
@ -263,7 +261,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
to abort the lookup of that object.
|
||||
|
||||
|
||||
(*) Release lookup data [mandatory]:
|
||||
* Release lookup data [mandatory]::
|
||||
|
||||
void (*lookup_complete)(struct fscache_object *object)
|
||||
|
||||
@ -271,7 +269,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
using to perform a lookup.
|
||||
|
||||
|
||||
(*) Increment object refcount [mandatory]:
|
||||
* Increment object refcount [mandatory]::
|
||||
|
||||
struct fscache_object *(*grab_object)(struct fscache_object *object)
|
||||
|
||||
@ -280,7 +278,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
It should return the object pointer if successful.
|
||||
|
||||
|
||||
(*) Lock/Unlock object [mandatory]:
|
||||
* Lock/Unlock object [mandatory]::
|
||||
|
||||
void (*lock_object)(struct fscache_object *object)
|
||||
void (*unlock_object)(struct fscache_object *object)
|
||||
@ -289,7 +287,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
to schedule with the lock held, so a spinlock isn't sufficient.
|
||||
|
||||
|
||||
(*) Pin/Unpin object [optional]:
|
||||
* Pin/Unpin object [optional]::
|
||||
|
||||
int (*pin_object)(struct fscache_object *object)
|
||||
void (*unpin_object)(struct fscache_object *object)
|
||||
@ -299,7 +297,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
enough space in the cache to permit this.
|
||||
|
||||
|
||||
(*) Check coherency state of an object [mandatory]:
|
||||
* Check coherency state of an object [mandatory]::
|
||||
|
||||
int (*check_consistency)(struct fscache_object *object)
|
||||
|
||||
@ -308,7 +306,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS
|
||||
may also be returned.
|
||||
|
||||
(*) Update object [mandatory]:
|
||||
* Update object [mandatory]::
|
||||
|
||||
int (*update_object)(struct fscache_object *object)
|
||||
|
||||
@ -317,7 +315,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
obtained by calling object->cookie->def->get_aux()/get_attr().
|
||||
|
||||
|
||||
(*) Invalidate data object [mandatory]:
|
||||
* Invalidate data object [mandatory]::
|
||||
|
||||
int (*invalidate_object)(struct fscache_operation *op)
|
||||
|
||||
@ -329,7 +327,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
fscache_op_complete() must be called on op before returning.
|
||||
|
||||
|
||||
(*) Discard object [mandatory]:
|
||||
* Discard object [mandatory]::
|
||||
|
||||
void (*drop_object)(struct fscache_object *object)
|
||||
|
||||
@ -341,7 +339,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
caller. The caller will invoke the put_object() method as appropriate.
|
||||
|
||||
|
||||
(*) Release object reference [mandatory]:
|
||||
* Release object reference [mandatory]::
|
||||
|
||||
void (*put_object)(struct fscache_object *object)
|
||||
|
||||
@ -349,7 +347,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
be freed when all the references to it are released.
|
||||
|
||||
|
||||
(*) Synchronise a cache [mandatory]:
|
||||
* Synchronise a cache [mandatory]::
|
||||
|
||||
void (*sync)(struct fscache_cache *cache)
|
||||
|
||||
@ -357,7 +355,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
device.
|
||||
|
||||
|
||||
(*) Dissociate a cache [mandatory]:
|
||||
* Dissociate a cache [mandatory]::
|
||||
|
||||
void (*dissociate_pages)(struct fscache_cache *cache)
|
||||
|
||||
@ -365,7 +363,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
cache withdrawal.
|
||||
|
||||
|
||||
(*) Notification that the attributes on a netfs file changed [mandatory]:
|
||||
* Notification that the attributes on a netfs file changed [mandatory]::
|
||||
|
||||
int (*attr_changed)(struct fscache_object *object);
|
||||
|
||||
@ -386,7 +384,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
execution of this operation.
|
||||
|
||||
|
||||
(*) Reserve cache space for an object's data [optional]:
|
||||
* Reserve cache space for an object's data [optional]::
|
||||
|
||||
int (*reserve_space)(struct fscache_object *object, loff_t size);
|
||||
|
||||
@ -404,7 +402,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
size if larger than that already.
|
||||
|
||||
|
||||
(*) Request page be read from cache [mandatory]:
|
||||
* Request page be read from cache [mandatory]::
|
||||
|
||||
int (*read_or_alloc_page)(struct fscache_retrieval *op,
|
||||
struct page *page,
|
||||
@ -446,7 +444,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
with. This will complete the operation when all pages are dealt with.
|
||||
|
||||
|
||||
(*) Request pages be read from cache [mandatory]:
|
||||
* Request pages be read from cache [mandatory]::
|
||||
|
||||
int (*read_or_alloc_pages)(struct fscache_retrieval *op,
|
||||
struct list_head *pages,
|
||||
@ -457,7 +455,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
of pages instead of one page. Any pages on which a read operation is
|
||||
started must be added to the page cache for the specified mapping and also
|
||||
to the LRU. Such pages must also be removed from the pages list and
|
||||
*nr_pages decremented per page.
|
||||
``*nr_pages`` decremented per page.
|
||||
|
||||
If there was an error such as -ENOMEM, then that should be returned; else
|
||||
if one or more pages couldn't be read or allocated, then -ENOBUFS should
|
||||
@ -466,7 +464,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
returned.
|
||||
|
||||
|
||||
(*) Request page be allocated in the cache [mandatory]:
|
||||
* Request page be allocated in the cache [mandatory]::
|
||||
|
||||
int (*allocate_page)(struct fscache_retrieval *op,
|
||||
struct page *page,
|
||||
@ -482,7 +480,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
allocated, then the netfs page should be marked and 0 returned.
|
||||
|
||||
|
||||
(*) Request pages be allocated in the cache [mandatory]:
|
||||
* Request pages be allocated in the cache [mandatory]::
|
||||
|
||||
int (*allocate_pages)(struct fscache_retrieval *op,
|
||||
struct list_head *pages,
|
||||
@ -493,7 +491,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
nr_pages should be treated as for the read_or_alloc_pages() method.
|
||||
|
||||
|
||||
(*) Request page be written to cache [mandatory]:
|
||||
* Request page be written to cache [mandatory]::
|
||||
|
||||
int (*write_page)(struct fscache_storage *op,
|
||||
struct page *page);
|
||||
@ -514,7 +512,7 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
appropriately.
|
||||
|
||||
|
||||
(*) Discard retained per-page metadata [mandatory]:
|
||||
* Discard retained per-page metadata [mandatory]::
|
||||
|
||||
void (*uncache_page)(struct fscache_object *object, struct page *page)
|
||||
|
||||
@ -523,13 +521,12 @@ performed on the denizens of the cache. These are held in a structure of type:
|
||||
maintains for this page.
|
||||
|
||||
|
||||
==================
|
||||
FS-CACHE UTILITIES
|
||||
FS-Cache Utilities
|
||||
==================
|
||||
|
||||
FS-Cache provides some utilities that a cache backend may make use of:
|
||||
|
||||
(*) Note occurrence of an I/O error in a cache:
|
||||
* Note occurrence of an I/O error in a cache::
|
||||
|
||||
void fscache_io_error(struct fscache_cache *cache)
|
||||
|
||||
@ -541,7 +538,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
This does not actually withdraw the cache. That must be done separately.
|
||||
|
||||
|
||||
(*) Invoke the retrieval I/O completion function:
|
||||
* Invoke the retrieval I/O completion function::
|
||||
|
||||
void fscache_end_io(struct fscache_retrieval *op, struct page *page,
|
||||
int error);
|
||||
@ -550,8 +547,8 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
error value should be 0 if successful and an error otherwise.
|
||||
|
||||
|
||||
(*) Record that one or more pages being retrieved or allocated have been dealt
|
||||
with:
|
||||
* Record that one or more pages being retrieved or allocated have been dealt
|
||||
with::
|
||||
|
||||
void fscache_retrieval_complete(struct fscache_retrieval *op,
|
||||
int n_pages);
|
||||
@ -562,7 +559,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
completed.
|
||||
|
||||
|
||||
(*) Record operation completion:
|
||||
* Record operation completion::
|
||||
|
||||
void fscache_op_complete(struct fscache_operation *op);
|
||||
|
||||
@ -571,7 +568,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
one or more pending operations to start running.
|
||||
|
||||
|
||||
(*) Set highest store limit:
|
||||
* Set highest store limit::
|
||||
|
||||
void fscache_set_store_limit(struct fscache_object *object,
|
||||
loff_t i_size);
|
||||
@ -581,7 +578,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
rejected by fscache_read_alloc_page() and co with -ENOBUFS.
|
||||
|
||||
|
||||
(*) Mark pages as being cached:
|
||||
* Mark pages as being cached::
|
||||
|
||||
void fscache_mark_pages_cached(struct fscache_retrieval *op,
|
||||
struct pagevec *pagevec);
|
||||
@ -590,7 +587,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
the netfs must call fscache_uncache_page() to unmark the pages.
|
||||
|
||||
|
||||
(*) Perform coherency check on an object:
|
||||
* Perform coherency check on an object::
|
||||
|
||||
enum fscache_checkaux fscache_check_aux(struct fscache_object *object,
|
||||
const void *data,
|
||||
@ -603,29 +600,26 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
|
||||
One of three values will be returned:
|
||||
|
||||
(*) FSCACHE_CHECKAUX_OKAY
|
||||
|
||||
FSCACHE_CHECKAUX_OKAY
|
||||
The coherency data indicates the object is valid as is.
|
||||
|
||||
(*) FSCACHE_CHECKAUX_NEEDS_UPDATE
|
||||
|
||||
FSCACHE_CHECKAUX_NEEDS_UPDATE
|
||||
The coherency data needs updating, but otherwise the object is
|
||||
valid.
|
||||
|
||||
(*) FSCACHE_CHECKAUX_OBSOLETE
|
||||
|
||||
FSCACHE_CHECKAUX_OBSOLETE
|
||||
The coherency data indicates that the object is obsolete and should
|
||||
be discarded.
|
||||
|
||||
|
||||
(*) Initialise a freshly allocated object:
|
||||
* Initialise a freshly allocated object::
|
||||
|
||||
void fscache_object_init(struct fscache_object *object);
|
||||
|
||||
This initialises all the fields in an object representation.
|
||||
|
||||
|
||||
(*) Indicate the destruction of an object:
|
||||
* Indicate the destruction of an object::
|
||||
|
||||
void fscache_object_destroyed(struct fscache_cache *cache);
|
||||
|
||||
@ -635,7 +629,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
all the objects.
|
||||
|
||||
|
||||
(*) Indicate negative lookup on an object:
|
||||
* Indicate negative lookup on an object::
|
||||
|
||||
void fscache_object_lookup_negative(struct fscache_object *object);
|
||||
|
||||
@ -650,7 +644,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
significant - all subsequent calls are ignored.
|
||||
|
||||
|
||||
(*) Indicate an object has been obtained:
|
||||
* Indicate an object has been obtained::
|
||||
|
||||
void fscache_obtained_object(struct fscache_object *object);
|
||||
|
||||
@ -667,7 +661,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
(2) that writes may now proceed against this object.
|
||||
|
||||
|
||||
(*) Indicate that object lookup failed:
|
||||
* Indicate that object lookup failed::
|
||||
|
||||
void fscache_object_lookup_error(struct fscache_object *object);
|
||||
|
||||
@ -676,7 +670,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
as possible.
|
||||
|
||||
|
||||
(*) Indicate that a stale object was found and discarded:
|
||||
* Indicate that a stale object was found and discarded::
|
||||
|
||||
void fscache_object_retrying_stale(struct fscache_object *object);
|
||||
|
||||
@ -685,7 +679,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
discarded from the cache and the lookup will be performed again.
|
||||
|
||||
|
||||
(*) Indicate that the caching backend killed an object:
|
||||
* Indicate that the caching backend killed an object::
|
||||
|
||||
void fscache_object_mark_killed(struct fscache_object *object,
|
||||
enum fscache_why_object_killed why);
|
||||
@ -693,13 +687,20 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
This is called to indicate that the cache backend preemptively killed an
|
||||
object. The why parameter should be set to indicate the reason:
|
||||
|
||||
FSCACHE_OBJECT_IS_STALE - the object was stale and needs discarding.
|
||||
FSCACHE_OBJECT_NO_SPACE - there was insufficient cache space
|
||||
FSCACHE_OBJECT_WAS_RETIRED - the object was retired when relinquished.
|
||||
FSCACHE_OBJECT_WAS_CULLED - the object was culled to make space.
|
||||
FSCACHE_OBJECT_IS_STALE
|
||||
- the object was stale and needs discarding.
|
||||
|
||||
FSCACHE_OBJECT_NO_SPACE
|
||||
- there was insufficient cache space
|
||||
|
||||
FSCACHE_OBJECT_WAS_RETIRED
|
||||
- the object was retired when relinquished.
|
||||
|
||||
FSCACHE_OBJECT_WAS_CULLED
|
||||
- the object was culled to make space.
|
||||
|
||||
|
||||
(*) Get and release references on a retrieval record:
|
||||
* Get and release references on a retrieval record::
|
||||
|
||||
void fscache_get_retrieval(struct fscache_retrieval *op);
|
||||
void fscache_put_retrieval(struct fscache_retrieval *op);
|
||||
@ -708,7 +709,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
asynchronous data retrieval and block allocation.
|
||||
|
||||
|
||||
(*) Enqueue a retrieval record for processing.
|
||||
* Enqueue a retrieval record for processing::
|
||||
|
||||
void fscache_enqueue_retrieval(struct fscache_retrieval *op);
|
||||
|
||||
@ -718,7 +719,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
|
||||
within the callback function.
|
||||
|
||||
|
||||
(*) List of object state names:
|
||||
* List of object state names::
|
||||
|
||||
const char *fscache_object_states[];
|
||||
|
@ -1,8 +1,10 @@
|
||||
===============================================
|
||||
CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
|
||||
===============================================
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
Contents:
|
||||
===============================================
|
||||
CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
|
||||
===============================================
|
||||
|
||||
.. Contents:
|
||||
|
||||
(*) Overview.
|
||||
|
||||
@ -27,8 +29,8 @@ Contents:
|
||||
(*) Debugging.
|
||||
|
||||
|
||||
========
|
||||
OVERVIEW
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
CacheFiles is a caching backend that's meant to use as a cache a directory on
|
||||
@ -58,8 +60,8 @@ spare space and automatically contract when the set of data requires more
|
||||
space.
|
||||
|
||||
|
||||
============
|
||||
REQUIREMENTS
|
||||
|
||||
Requirements
|
||||
============
|
||||
|
||||
The use of CacheFiles and its daemon requires the following features to be
|
||||
@ -79,84 +81,70 @@ It is strongly recommended that the "dir_index" option is enabled on Ext3
|
||||
filesystems being used as a cache.
|
||||
|
||||
|
||||
=============
|
||||
CONFIGURATION
|
||||
Configuration
|
||||
=============
|
||||
|
||||
The cache is configured by a script in /etc/cachefilesd.conf. These commands
|
||||
set up cache ready for use. The following script commands are available:
|
||||
|
||||
(*) brun <N>%
|
||||
(*) bcull <N>%
|
||||
(*) bstop <N>%
|
||||
(*) frun <N>%
|
||||
(*) fcull <N>%
|
||||
(*) fstop <N>%
|
||||
|
||||
brun <N>%, bcull <N>%, bstop <N>%, frun <N>%, fcull <N>%, fstop <N>%
|
||||
Configure the culling limits. Optional. See the section on culling
|
||||
The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
|
||||
|
||||
The commands beginning with a 'b' are file space (block) limits, those
|
||||
beginning with an 'f' are file count limits.
|
||||
|
||||
(*) dir <path>
|
||||
|
||||
dir <path>
|
||||
Specify the directory containing the root of the cache. Mandatory.
|
||||
|
||||
(*) tag <name>
|
||||
|
||||
tag <name>
|
||||
Specify a tag to FS-Cache to use in distinguishing multiple caches.
|
||||
Optional. The default is "CacheFiles".
|
||||
|
||||
(*) debug <mask>
|
||||
|
||||
debug <mask>
|
||||
Specify a numeric bitmask to control debugging in the kernel module.
|
||||
Optional. The default is zero (all off). The following values can be
|
||||
OR'd into the mask to collect various information:
|
||||
|
||||
== =================================================
|
||||
1 Turn on trace of function entry (_enter() macros)
|
||||
2 Turn on trace of function exit (_leave() macros)
|
||||
4 Turn on trace of internal debug points (_debug())
|
||||
== =================================================
|
||||
|
||||
This mask can also be set through sysfs, eg:
|
||||
This mask can also be set through sysfs, eg::
|
||||
|
||||
echo 5 >/sys/modules/cachefiles/parameters/debug
|
||||
|
||||
|
||||
==================
|
||||
STARTING THE CACHE
|
||||
Starting the Cache
|
||||
==================
|
||||
|
||||
The cache is started by running the daemon. The daemon opens the cache device,
|
||||
configures the cache and tells it to begin caching. At that point the cache
|
||||
binds to fscache and the cache becomes live.
|
||||
|
||||
The daemon is run as follows:
|
||||
The daemon is run as follows::
|
||||
|
||||
/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
|
||||
|
||||
The flags are:
|
||||
|
||||
(*) -d
|
||||
|
||||
``-d``
|
||||
Increase the debugging level. This can be specified multiple times and
|
||||
is cumulative with itself.
|
||||
|
||||
(*) -s
|
||||
|
||||
``-s``
|
||||
Send messages to stderr instead of syslog.
|
||||
|
||||
(*) -n
|
||||
|
||||
``-n``
|
||||
Don't daemonise and go into background.
|
||||
|
||||
(*) -f <configfile>
|
||||
|
||||
``-f <configfile>``
|
||||
Use an alternative configuration file rather than the default one.
|
||||
|
||||
|
||||
===============
|
||||
THINGS TO AVOID
|
||||
Things to Avoid
|
||||
===============
|
||||
|
||||
Do not mount other things within the cache as this will cause problems. The
|
||||
@ -179,8 +167,7 @@ Do not chmod files in the cache. The module creates things with minimal
|
||||
permissions to prevent random users being able to access them directly.
|
||||
|
||||
|
||||
=============
|
||||
CACHE CULLING
|
||||
Cache Culling
|
||||
=============
|
||||
|
||||
The cache may need culling occasionally to make space. This involves
|
||||
@ -192,27 +179,21 @@ Cache culling is done on the basis of the percentage of blocks and the
|
||||
percentage of files available in the underlying filesystem. There are six
|
||||
"limits":
|
||||
|
||||
(*) brun
|
||||
(*) frun
|
||||
|
||||
brun, frun
|
||||
If the amount of free space and the number of available files in the cache
|
||||
rises above both these limits, then culling is turned off.
|
||||
|
||||
(*) bcull
|
||||
(*) fcull
|
||||
|
||||
bcull, fcull
|
||||
If the amount of available space or the number of available files in the
|
||||
cache falls below either of these limits, then culling is started.
|
||||
|
||||
(*) bstop
|
||||
(*) fstop
|
||||
|
||||
bstop, fstop
|
||||
If the amount of available space or the number of available files in the
|
||||
cache falls below either of these limits, then no further allocation of
|
||||
disk space or files is permitted until culling has raised things above
|
||||
these limits again.
|
||||
|
||||
These must be configured thusly:
|
||||
These must be configured thusly::
|
||||
|
||||
0 <= bstop < bcull < brun < 100
|
||||
0 <= fstop < fcull < frun < 100
|
||||
@ -226,16 +207,14 @@ started as soon as space is made in the table. Objects will be skipped if
|
||||
their atimes have changed or if the kernel module says it is still using them.
|
||||
|
||||
|
||||
===============
|
||||
CACHE STRUCTURE
|
||||
Cache Structure
|
||||
===============
|
||||
|
||||
The CacheFiles module will create two directories in the directory it was
|
||||
given:
|
||||
|
||||
(*) cache/
|
||||
|
||||
(*) graveyard/
|
||||
* cache/
|
||||
* graveyard/
|
||||
|
||||
The active cache objects all reside in the first directory. The CacheFiles
|
||||
kernel module moves any retired or culled objects that it can't simply unlink
|
||||
@ -261,10 +240,10 @@ If an object has children, then it will be represented as a directory.
|
||||
Immediately in the representative directory are a collection of directories
|
||||
named for hash values of the child object keys with an '@' prepended. Into
|
||||
this directory, if possible, will be placed the representations of the child
|
||||
objects:
|
||||
objects::
|
||||
|
||||
INDEX INDEX INDEX DATA FILES
|
||||
========= ========== ================================= ================
|
||||
/INDEX /INDEX /INDEX /DATA FILES
|
||||
/=========/==========/=================================/================
|
||||
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
|
||||
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
|
||||
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
|
||||
@ -275,7 +254,7 @@ If the key is so long that it exceeds NAME_MAX with the decorations added on to
|
||||
it, then it will be cut into pieces, the first few of which will be used to
|
||||
make a nest of directories, and the last one of which will be the objects
|
||||
inside the last directory. The names of the intermediate directories will have
|
||||
'+' prepended:
|
||||
'+' prepended::
|
||||
|
||||
J1223/@23/+xy...z/+kl...m/Epqr
|
||||
|
||||
@ -288,11 +267,13 @@ To handle this, CacheFiles will use a suitably printable filename directly and
|
||||
"base-64" encode ones that aren't directly suitable. The two versions of
|
||||
object filenames indicate the encoding:
|
||||
|
||||
=============== =============== ===============
|
||||
OBJECT TYPE PRINTABLE ENCODED
|
||||
=============== =============== ===============
|
||||
Index "I..." "J..."
|
||||
Data "D..." "E..."
|
||||
Special "S..." "T..."
|
||||
=============== =============== ===============
|
||||
|
||||
Intermediate directories are always "@" or "+" as appropriate.
|
||||
|
||||
@ -307,8 +288,7 @@ Note that CacheFiles will erase from the cache any file it doesn't recognise or
|
||||
any file of an incorrect type (such as a FIFO file or a device file).
|
||||
|
||||
|
||||
==========================
|
||||
SECURITY MODEL AND SELINUX
|
||||
Security Model and SELinux
|
||||
==========================
|
||||
|
||||
CacheFiles is implemented to deal properly with the LSM security features of
|
||||
@ -331,26 +311,26 @@ When the CacheFiles module is asked to bind to its cache, it:
|
||||
|
||||
(1) Finds the security label attached to the root cache directory and uses
|
||||
that as the security label with which it will create files. By default,
|
||||
this is:
|
||||
this is::
|
||||
|
||||
cachefiles_var_t
|
||||
|
||||
(2) Finds the security label of the process which issued the bind request
|
||||
(presumed to be the cachefilesd daemon), which by default will be:
|
||||
(presumed to be the cachefilesd daemon), which by default will be::
|
||||
|
||||
cachefilesd_t
|
||||
|
||||
and asks LSM to supply a security ID as which it should act given the
|
||||
daemon's label. By default, this will be:
|
||||
daemon's label. By default, this will be::
|
||||
|
||||
cachefiles_kernel_t
|
||||
|
||||
SELinux transitions the daemon's security ID to the module's security ID
|
||||
based on a rule of this form in the policy.
|
||||
based on a rule of this form in the policy::
|
||||
|
||||
type_transition <daemon's-ID> kernel_t : process <module's-ID>;
|
||||
|
||||
For instance:
|
||||
For instance::
|
||||
|
||||
type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
|
||||
|
||||
@ -370,7 +350,7 @@ There are policy source files available in:
|
||||
|
||||
http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
|
||||
|
||||
and later versions. In that tarball, see the files:
|
||||
and later versions. In that tarball, see the files::
|
||||
|
||||
cachefilesd.te
|
||||
cachefilesd.fc
|
||||
@ -379,7 +359,7 @@ and later versions. In that tarball, see the files:
|
||||
They are built and installed directly by the RPM.
|
||||
|
||||
If a non-RPM based system is being used, then copy the above files to their own
|
||||
directory and run:
|
||||
directory and run::
|
||||
|
||||
make -f /usr/share/selinux/devel/Makefile
|
||||
semodule -i cachefilesd.pp
|
||||
@ -394,7 +374,7 @@ an auxiliary policy must be installed to label the alternate location of the
|
||||
cache.
|
||||
|
||||
For instructions on how to add an auxiliary policy to enable the cache to be
|
||||
located elsewhere when SELinux is in enforcing mode, please see:
|
||||
located elsewhere when SELinux is in enforcing mode, please see::
|
||||
|
||||
/usr/share/doc/cachefilesd-*/move-cache.txt
|
||||
|
||||
@ -402,8 +382,7 @@ When the cachefilesd rpm is installed; alternatively, the document can be found
|
||||
in the sources.
|
||||
|
||||
|
||||
==================
|
||||
A NOTE ON SECURITY
|
||||
A Note on Security
|
||||
==================
|
||||
|
||||
CacheFiles makes use of the split security in the task_struct. It allocates
|
||||
@ -445,17 +424,18 @@ for CacheFiles to run in a context of a specific security label, or to create
|
||||
files and directories with another security label.
|
||||
|
||||
|
||||
=======================
|
||||
STATISTICAL INFORMATION
|
||||
Statistical Information
|
||||
=======================
|
||||
|
||||
If FS-Cache is compiled with the following option enabled:
|
||||
If FS-Cache is compiled with the following option enabled::
|
||||
|
||||
CONFIG_CACHEFILES_HISTOGRAM=y
|
||||
|
||||
then it will gather certain statistics and display them through a proc file.
|
||||
|
||||
(*) /proc/fs/cachefiles/histogram
|
||||
/proc/fs/cachefiles/histogram
|
||||
|
||||
::
|
||||
|
||||
cat /proc/fs/cachefiles/histogram
|
||||
JIFS SECS LOOKUPS MKDIRS CREATES
|
||||
@ -465,36 +445,39 @@ then it will gather certain statistics and display them through a proc file.
|
||||
between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
|
||||
columns are as follows:
|
||||
|
||||
======= =======================================================
|
||||
COLUMN TIME MEASUREMENT
|
||||
======= =======================================================
|
||||
LOOKUPS Length of time to perform a lookup on the backing fs
|
||||
MKDIRS Length of time to perform a mkdir on the backing fs
|
||||
CREATES Length of time to perform a create on the backing fs
|
||||
======= =======================================================
|
||||
|
||||
Each row shows the number of events that took a particular range of times.
|
||||
Each step is 1 jiffy in size. The JIFS column indicates the particular
|
||||
jiffy range covered, and the SECS field the equivalent number of seconds.
|
||||
|
||||
|
||||
=========
|
||||
DEBUGGING
|
||||
Debugging
|
||||
=========
|
||||
|
||||
If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime
|
||||
debugging enabled by adjusting the value in:
|
||||
debugging enabled by adjusting the value in::
|
||||
|
||||
/sys/module/cachefiles/parameters/debug
|
||||
|
||||
This is a bitmask of debugging streams to enable:
|
||||
|
||||
======= ======= =============================== =======================
|
||||
BIT VALUE STREAM POINT
|
||||
======= ======= =============================== =======================
|
||||
0 1 General Function entry trace
|
||||
1 2 Function exit trace
|
||||
2 4 General
|
||||
======= ======= =============================== =======================
|
||||
|
||||
The appropriate set of values should be OR'd together and the result written to
|
||||
the control file. For example:
|
||||
the control file. For example::
|
||||
|
||||
echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug
|
||||
|
565
Documentation/filesystems/caching/fscache.rst
Normal file
565
Documentation/filesystems/caching/fscache.rst
Normal file
@ -0,0 +1,565 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==========================
|
||||
General Filesystem Caching
|
||||
==========================
|
||||
|
||||
Overview
|
||||
========
|
||||
|
||||
This facility is a general purpose cache for network filesystems, though it
|
||||
could be used for caching other things such as ISO9660 filesystems too.
|
||||
|
||||
FS-Cache mediates between cache backends (such as CacheFS) and network
|
||||
filesystems::
|
||||
|
||||
+---------+
|
||||
| | +--------------+
|
||||
| NFS |--+ | |
|
||||
| | | +-->| CacheFS |
|
||||
+---------+ | +----------+ | | /dev/hda5 |
|
||||
| | | | +--------------+
|
||||
+---------+ +-->| | |
|
||||
| | | |--+
|
||||
| AFS |----->| FS-Cache |
|
||||
| | | |--+
|
||||
+---------+ +-->| | |
|
||||
| | | | +--------------+
|
||||
+---------+ | +----------+ | | |
|
||||
| | | +-->| CacheFiles |
|
||||
| ISOFS |--+ | /var/cache |
|
||||
| | +--------------+
|
||||
+---------+
|
||||
|
||||
Or to look at it another way, FS-Cache is a module that provides a caching
|
||||
facility to a network filesystem such that the cache is transparent to the
|
||||
user::
|
||||
|
||||
+---------+
|
||||
| |
|
||||
| Server |
|
||||
| |
|
||||
+---------+
|
||||
| NETWORK
|
||||
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
|
||||
| +----------+
|
||||
V | |
|
||||
+---------+ | |
|
||||
| | | |
|
||||
| NFS |----->| FS-Cache |
|
||||
| | | |--+
|
||||
+---------+ | | | +--------------+ +--------------+
|
||||
| | | | | | | |
|
||||
V +----------+ +-->| CacheFiles |-->| Ext3 |
|
||||
+---------+ | /var/cache | | /dev/sda6 |
|
||||
| | +--------------+ +--------------+
|
||||
| VFS | ^ ^
|
||||
| | | |
|
||||
+---------+ +--------------+ |
|
||||
| KERNEL SPACE | |
|
||||
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
|
||||
| USER SPACE | |
|
||||
V | |
|
||||
+---------+ +--------------+
|
||||
| | | |
|
||||
| Process | | cachefilesd |
|
||||
| | | |
|
||||
+---------+ +--------------+
|
||||
|
||||
|
||||
FS-Cache does not follow the idea of completely loading every netfs file
|
||||
opened in its entirety into a cache before permitting it to be accessed and
|
||||
then serving the pages out of that cache rather than the netfs inode because:
|
||||
|
||||
(1) It must be practical to operate without a cache.
|
||||
|
||||
(2) The size of any accessible file must not be limited to the size of the
|
||||
cache.
|
||||
|
||||
(3) The combined size of all opened files (this includes mapped libraries)
|
||||
must not be limited to the size of the cache.
|
||||
|
||||
(4) The user should not be forced to download an entire file just to do a
|
||||
one-off access of a small portion of it (such as might be done with the
|
||||
"file" program).
|
||||
|
||||
It instead serves the cache out in PAGE_SIZE chunks as and when requested by
|
||||
the netfs('s) using it.
|
||||
|
||||
|
||||
FS-Cache provides the following facilities:
|
||||
|
||||
(1) More than one cache can be used at once. Caches can be selected
|
||||
explicitly by use of tags.
|
||||
|
||||
(2) Caches can be added / removed at any time.
|
||||
|
||||
(3) The netfs is provided with an interface that allows either party to
|
||||
withdraw caching facilities from a file (required for (2)).
|
||||
|
||||
(4) The interface to the netfs returns as few errors as possible, preferring
|
||||
rather to let the netfs remain oblivious.
|
||||
|
||||
(5) Cookies are used to represent indices, files and other objects to the
|
||||
netfs. The simplest cookie is just a NULL pointer - indicating nothing
|
||||
cached there.
|
||||
|
||||
(6) The netfs is allowed to propose - dynamically - any index hierarchy it
|
||||
desires, though it must be aware that the index search function is
|
||||
recursive, stack space is limited, and indices can only be children of
|
||||
indices.
|
||||
|
||||
(7) Data I/O is done direct to and from the netfs's pages. The netfs
|
||||
indicates that page A is at index B of the data-file represented by cookie
|
||||
C, and that it should be read or written. The cache backend may or may
|
||||
not start I/O on that page, but if it does, a netfs callback will be
|
||||
invoked to indicate completion. The I/O may be either synchronous or
|
||||
asynchronous.
|
||||
|
||||
(8) Cookies can be "retired" upon release. At this point FS-Cache will mark
|
||||
them as obsolete and the index hierarchy rooted at that point will get
|
||||
recycled.
|
||||
|
||||
(9) The netfs provides a "match" function for index searches. In addition to
|
||||
saying whether a match was made or not, this can also specify that an
|
||||
entry should be updated or deleted.
|
||||
|
||||
(10) As much as possible is done asynchronously.
|
||||
|
||||
|
||||
FS-Cache maintains a virtual indexing tree in which all indices, files, objects
|
||||
and pages are kept. Bits of this tree may actually reside in one or more
|
||||
caches::
|
||||
|
||||
FSDEF
|
||||
|
|
||||
+------------------------------------+
|
||||
| |
|
||||
NFS AFS
|
||||
| |
|
||||
+--------------------------+ +-----------+
|
||||
| | | |
|
||||
homedir mirror afs.org redhat.com
|
||||
| | |
|
||||
+------------+ +---------------+ +----------+
|
||||
| | | | | |
|
||||
00001 00002 00007 00125 vol00001 vol00002
|
||||
| | | | |
|
||||
+---+---+ +-----+ +---+ +------+------+ +-----+----+
|
||||
| | | | | | | | | | | | |
|
||||
PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
|
||||
| |
|
||||
PG0 +-------+
|
||||
| |
|
||||
00001 00003
|
||||
|
|
||||
+---+---+
|
||||
| | |
|
||||
PG0 PG1 PG2
|
||||
|
||||
In the example above, you can see two netfs's being backed: NFS and AFS. These
|
||||
have different index hierarchies:
|
||||
|
||||
* The NFS primary index contains per-server indices. Each server index is
|
||||
indexed by NFS file handles to get data file objects. Each data file
|
||||
objects can have an array of pages, but may also have further child
|
||||
objects, such as extended attributes and directory entries. Extended
|
||||
attribute objects themselves have page-array contents.
|
||||
|
||||
* The AFS primary index contains per-cell indices. Each cell index contains
|
||||
per-logical-volume indices. Each of volume index contains up to three
|
||||
indices for the read-write, read-only and backup mirrors of those volumes.
|
||||
Each of these contains vnode data file objects, each of which contains an
|
||||
array of pages.
|
||||
|
||||
The very top index is the FS-Cache master index in which individual netfs's
|
||||
have entries.
|
||||
|
||||
Any index object may reside in more than one cache, provided it only has index
|
||||
children. Any index with non-index object children will be assumed to only
|
||||
reside in one cache.
|
||||
|
||||
|
||||
The netfs API to FS-Cache can be found in:
|
||||
|
||||
Documentation/filesystems/caching/netfs-api.rst
|
||||
|
||||
The cache backend API to FS-Cache can be found in:
|
||||
|
||||
Documentation/filesystems/caching/backend-api.rst
|
||||
|
||||
A description of the internal representations and object state machine can be
|
||||
found in:
|
||||
|
||||
Documentation/filesystems/caching/object.rst
|
||||
|
||||
|
||||
Statistical Information
|
||||
=======================
|
||||
|
||||
If FS-Cache is compiled with the following options enabled::
|
||||
|
||||
CONFIG_FSCACHE_STATS=y
|
||||
CONFIG_FSCACHE_HISTOGRAM=y
|
||||
|
||||
then it will gather certain statistics and display them through a number of
|
||||
proc files.
|
||||
|
||||
/proc/fs/fscache/stats
|
||||
----------------------
|
||||
|
||||
This shows counts of a number of events that can happen in FS-Cache:
|
||||
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|CLASS |EVENT |MEANING |
|
||||
+==============+=======+=======================================================+
|
||||
|Cookies |idx=N |Number of index cookies allocated |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |dat=N |Number of data storage cookies allocated |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |spc=N |Number of special cookies allocated |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|Objects |alc=N |Number of objects allocated |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |nal=N |Number of object allocation failures |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |avl=N |Number of objects that reached the available state |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ded=N |Number of objects that reached the dead state |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|ChkAux |non=N |Number of objects that didn't have a coherency check |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ok=N |Number of objects that passed a coherency check |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |upd=N |Number of objects that needed a coherency data update |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |obs=N |Number of objects that were declared obsolete |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|Pages |mrk=N |Number of pages marked as being cached |
|
||||
| |unc=N |Number of uncache page requests seen |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|Acquire |n=N |Number of acquire cookie requests seen |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |nul=N |Number of acq reqs given a NULL parent |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |noc=N |Number of acq reqs rejected due to no cache available |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ok=N |Number of acq reqs succeeded |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |nbf=N |Number of acq reqs rejected due to error |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |oom=N |Number of acq reqs failed on ENOMEM |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|Lookups |n=N |Number of lookup calls made on cache backends |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |neg=N |Number of negative lookups made |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |pos=N |Number of positive lookups made |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |crt=N |Number of objects created by lookup |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |tmo=N |Number of lookups timed out and requeued |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|Updates |n=N |Number of update cookie requests seen |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |nul=N |Number of upd reqs given a NULL parent |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |run=N |Number of upd reqs granted CPU time |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|Relinqs |n=N |Number of relinquish cookie requests seen |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |nul=N |Number of rlq reqs given a NULL parent |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |wcr=N |Number of rlq reqs waited on completion of creation |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|AttrChg |n=N |Number of attribute changed requests seen |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ok=N |Number of attr changed requests queued |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |nbf=N |Number of attr changed rejected -ENOBUFS |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |oom=N |Number of attr changed failed -ENOMEM |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |run=N |Number of attr changed ops given CPU time |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|Allocs |n=N |Number of allocation requests seen |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ok=N |Number of successful alloc reqs |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |wt=N |Number of alloc reqs that waited on lookup completion |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |nbf=N |Number of alloc reqs rejected -ENOBUFS |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |int=N |Number of alloc reqs aborted -ERESTARTSYS |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ops=N |Number of alloc reqs submitted |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |owt=N |Number of alloc reqs waited for CPU time |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |abt=N |Number of alloc reqs aborted due to object death |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|Retrvls |n=N |Number of retrieval (read) requests seen |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ok=N |Number of successful retr reqs |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |wt=N |Number of retr reqs that waited on lookup completion |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |nod=N |Number of retr reqs returned -ENODATA |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |nbf=N |Number of retr reqs rejected -ENOBUFS |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |int=N |Number of retr reqs aborted -ERESTARTSYS |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |oom=N |Number of retr reqs failed -ENOMEM |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ops=N |Number of retr reqs submitted |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |owt=N |Number of retr reqs waited for CPU time |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |abt=N |Number of retr reqs aborted due to object death |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|Stores |n=N |Number of storage (write) requests seen |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ok=N |Number of successful store reqs |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |agn=N |Number of store reqs on a page already pending storage |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |nbf=N |Number of store reqs rejected -ENOBUFS |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |oom=N |Number of store reqs failed -ENOMEM |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ops=N |Number of store reqs submitted |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |run=N |Number of store reqs granted CPU time |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |pgs=N |Number of pages given store req processing time |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |rxd=N |Number of store reqs deleted from tracking tree |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |olm=N |Number of store reqs over store limit |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|VmScan |nos=N |Number of release reqs against pages with no |
|
||||
| | |pending store |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |gon=N |Number of release reqs against pages stored by |
|
||||
| | |time lock granted |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |bsy=N |Number of release reqs ignored due to in-progress store|
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |can=N |Number of page stores cancelled due to release req |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|Ops |pend=N |Number of times async ops added to pending queues |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |run=N |Number of times async ops given CPU time |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |enq=N |Number of times async ops queued for processing |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |can=N |Number of async ops cancelled |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |rej=N |Number of async ops rejected due to object |
|
||||
| | |lookup/create failure |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ini=N |Number of async ops initialised |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |dfr=N |Number of async ops queued for deferred release |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |rel=N |Number of async ops released |
|
||||
| | |(should equal ini=N when idle) |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |gc=N |Number of deferred-release async ops garbage collected |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|CacheOp |alo=N |Number of in-progress alloc_object() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |luo=N |Number of in-progress lookup_object() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |luc=N |Number of in-progress lookup_complete() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |gro=N |Number of in-progress grab_object() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |upo=N |Number of in-progress update_object() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |dro=N |Number of in-progress drop_object() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |pto=N |Number of in-progress put_object() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |syn=N |Number of in-progress sync_cache() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |atc=N |Number of in-progress attr_changed() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |rap=N |Number of in-progress read_or_alloc_page() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ras=N |Number of in-progress read_or_alloc_pages() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |alp=N |Number of in-progress allocate_page() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |als=N |Number of in-progress allocate_pages() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |wrp=N |Number of in-progress write_page() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |ucp=N |Number of in-progress uncache_page() cache ops |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |dsp=N |Number of in-progress dissociate_pages() cache ops |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|CacheEv |nsp=N |Number of object lookups/creations rejected due to |
|
||||
| | |lack of space |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |stl=N |Number of stale objects deleted |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |rtr=N |Number of objects retired when relinquished |
|
||||
+ +-------+-------------------------------------------------------+
|
||||
| |cul=N |Number of objects culled |
|
||||
+--------------+-------+-------------------------------------------------------+
|
||||
|
||||
|
||||
|
||||
/proc/fs/fscache/histogram
|
||||
--------------------------
|
||||
|
||||
::
|
||||
|
||||
cat /proc/fs/fscache/histogram
|
||||
JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
|
||||
===== ===== ========= ========= ========= ========= =========
|
||||
|
||||
This shows the breakdown of the number of times each amount of time
|
||||
between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
|
||||
columns are as follows:
|
||||
|
||||
========= =======================================================
|
||||
COLUMN TIME MEASUREMENT
|
||||
========= =======================================================
|
||||
OBJ INST Length of time to instantiate an object
|
||||
OP RUNS Length of time a call to process an operation took
|
||||
OBJ RUNS Length of time a call to process an object event took
|
||||
RETRV DLY Time between an requesting a read and lookup completing
|
||||
RETRIEVLS Time between beginning and end of a retrieval
|
||||
========= =======================================================
|
||||
|
||||
Each row shows the number of events that took a particular range of times.
|
||||
Each step is 1 jiffy in size. The JIFS column indicates the particular
|
||||
jiffy range covered, and the SECS field the equivalent number of seconds.
|
||||
|
||||
|
||||
|
||||
Object List
|
||||
===========
|
||||
|
||||
If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
|
||||
list of all the objects currently allocated and allow them to be viewed
|
||||
through::
|
||||
|
||||
/proc/fs/fscache/objects
|
||||
|
||||
This will look something like::
|
||||
|
||||
[root@andromeda ~]# head /proc/fs/fscache/objects
|
||||
OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
|
||||
======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
|
||||
17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
|
||||
1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
|
||||
|
||||
where the first set of columns before the '|' describe the object:
|
||||
|
||||
======= ===============================================================
|
||||
COLUMN DESCRIPTION
|
||||
======= ===============================================================
|
||||
OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
|
||||
PARENT Debugging ID of parent object
|
||||
STAT Object state
|
||||
CHLDN Number of child objects of this object
|
||||
OPS Number of outstanding operations on this object
|
||||
OOP Number of outstanding child object management operations
|
||||
IPR
|
||||
EX Number of outstanding exclusive operations
|
||||
READS Number of outstanding read operations
|
||||
EM Object's event mask
|
||||
EV Events raised on this object
|
||||
F Object flags
|
||||
S Object work item busy state mask (1:pending 2:running)
|
||||
======= ===============================================================
|
||||
|
||||
and the second set of columns describe the object's cookie, if present:
|
||||
|
||||
================ ======================================================
|
||||
COLUMN DESCRIPTION
|
||||
================ ======================================================
|
||||
NETFS_COOKIE_DEF Name of netfs cookie definition
|
||||
TY Cookie type (IX - index, DT - data, hex - special)
|
||||
FL Cookie flags
|
||||
NETFS_DATA Netfs private data stored in the cookie
|
||||
OBJECT_KEY Object key } 1 column, with separating comma
|
||||
AUX_DATA Object aux data } presence may be configured
|
||||
================ ======================================================
|
||||
|
||||
The data shown may be filtered by attaching the a key to an appropriate keyring
|
||||
before viewing the file. Something like::
|
||||
|
||||
keyctl add user fscache:objlist <restrictions> @s
|
||||
|
||||
where <restrictions> are a selection of the following letters:
|
||||
|
||||
== =========================================================
|
||||
K Show hexdump of object key (don't show if not given)
|
||||
A Show hexdump of object aux data (don't show if not given)
|
||||
== =========================================================
|
||||
|
||||
and the following paired letters:
|
||||
|
||||
== =========================================================
|
||||
C Show objects that have a cookie
|
||||
c Show objects that don't have a cookie
|
||||
B Show objects that are busy
|
||||
b Show objects that aren't busy
|
||||
W Show objects that have pending writes
|
||||
w Show objects that don't have pending writes
|
||||
R Show objects that have outstanding reads
|
||||
r Show objects that don't have outstanding reads
|
||||
S Show objects that have work queued
|
||||
s Show objects that don't have work queued
|
||||
== =========================================================
|
||||
|
||||
If neither side of a letter pair is given, then both are implied. For example:
|
||||
|
||||
keyctl add user fscache:objlist KB @s
|
||||
|
||||
shows objects that are busy, and lists their object keys, but does not dump
|
||||
their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
|
||||
not implied.
|
||||
|
||||
By default all objects and all fields will be shown.
|
||||
|
||||
|
||||
Debugging
|
||||
=========
|
||||
|
||||
If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
|
||||
debugging enabled by adjusting the value in::
|
||||
|
||||
/sys/module/fscache/parameters/debug
|
||||
|
||||
This is a bitmask of debugging streams to enable:
|
||||
|
||||
======= ======= =============================== =======================
|
||||
BIT VALUE STREAM POINT
|
||||
======= ======= =============================== =======================
|
||||
0 1 Cache management Function entry trace
|
||||
1 2 Function exit trace
|
||||
2 4 General
|
||||
3 8 Cookie management Function entry trace
|
||||
4 16 Function exit trace
|
||||
5 32 General
|
||||
6 64 Page handling Function entry trace
|
||||
7 128 Function exit trace
|
||||
8 256 General
|
||||
9 512 Operation management Function entry trace
|
||||
10 1024 Function exit trace
|
||||
11 2048 General
|
||||
======= ======= =============================== =======================
|
||||
|
||||
The appropriate set of values should be OR'd together and the result written to
|
||||
the control file. For example::
|
||||
|
||||
echo $((1|8|64)) >/sys/module/fscache/parameters/debug
|
||||
|
||||
will turn on all function entry debugging.
|
@ -1,448 +0,0 @@
|
||||
==========================
|
||||
General Filesystem Caching
|
||||
==========================
|
||||
|
||||
========
|
||||
OVERVIEW
|
||||
========
|
||||
|
||||
This facility is a general purpose cache for network filesystems, though it
|
||||
could be used for caching other things such as ISO9660 filesystems too.
|
||||
|
||||
FS-Cache mediates between cache backends (such as CacheFS) and network
|
||||
filesystems:
|
||||
|
||||
+---------+
|
||||
| | +--------------+
|
||||
| NFS |--+ | |
|
||||
| | | +-->| CacheFS |
|
||||
+---------+ | +----------+ | | /dev/hda5 |
|
||||
| | | | +--------------+
|
||||
+---------+ +-->| | |
|
||||
| | | |--+
|
||||
| AFS |----->| FS-Cache |
|
||||
| | | |--+
|
||||
+---------+ +-->| | |
|
||||
| | | | +--------------+
|
||||
+---------+ | +----------+ | | |
|
||||
| | | +-->| CacheFiles |
|
||||
| ISOFS |--+ | /var/cache |
|
||||
| | +--------------+
|
||||
+---------+
|
||||
|
||||
Or to look at it another way, FS-Cache is a module that provides a caching
|
||||
facility to a network filesystem such that the cache is transparent to the
|
||||
user:
|
||||
|
||||
+---------+
|
||||
| |
|
||||
| Server |
|
||||
| |
|
||||
+---------+
|
||||
| NETWORK
|
||||
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
|
||||
| +----------+
|
||||
V | |
|
||||
+---------+ | |
|
||||
| | | |
|
||||
| NFS |----->| FS-Cache |
|
||||
| | | |--+
|
||||
+---------+ | | | +--------------+ +--------------+
|
||||
| | | | | | | |
|
||||
V +----------+ +-->| CacheFiles |-->| Ext3 |
|
||||
+---------+ | /var/cache | | /dev/sda6 |
|
||||
| | +--------------+ +--------------+
|
||||
| VFS | ^ ^
|
||||
| | | |
|
||||
+---------+ +--------------+ |
|
||||
| KERNEL SPACE | |
|
||||
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
|
||||
| USER SPACE | |
|
||||
V | |
|
||||
+---------+ +--------------+
|
||||
| | | |
|
||||
| Process | | cachefilesd |
|
||||
| | | |
|
||||
+---------+ +--------------+
|
||||
|
||||
|
||||
FS-Cache does not follow the idea of completely loading every netfs file
|
||||
opened in its entirety into a cache before permitting it to be accessed and
|
||||
then serving the pages out of that cache rather than the netfs inode because:
|
||||
|
||||
(1) It must be practical to operate without a cache.
|
||||
|
||||
(2) The size of any accessible file must not be limited to the size of the
|
||||
cache.
|
||||
|
||||
(3) The combined size of all opened files (this includes mapped libraries)
|
||||
must not be limited to the size of the cache.
|
||||
|
||||
(4) The user should not be forced to download an entire file just to do a
|
||||
one-off access of a small portion of it (such as might be done with the
|
||||
"file" program).
|
||||
|
||||
It instead serves the cache out in PAGE_SIZE chunks as and when requested by
|
||||
the netfs('s) using it.
|
||||
|
||||
|
||||
FS-Cache provides the following facilities:
|
||||
|
||||
(1) More than one cache can be used at once. Caches can be selected
|
||||
explicitly by use of tags.
|
||||
|
||||
(2) Caches can be added / removed at any time.
|
||||
|
||||
(3) The netfs is provided with an interface that allows either party to
|
||||
withdraw caching facilities from a file (required for (2)).
|
||||
|
||||
(4) The interface to the netfs returns as few errors as possible, preferring
|
||||
rather to let the netfs remain oblivious.
|
||||
|
||||
(5) Cookies are used to represent indices, files and other objects to the
|
||||
netfs. The simplest cookie is just a NULL pointer - indicating nothing
|
||||
cached there.
|
||||
|
||||
(6) The netfs is allowed to propose - dynamically - any index hierarchy it
|
||||
desires, though it must be aware that the index search function is
|
||||
recursive, stack space is limited, and indices can only be children of
|
||||
indices.
|
||||
|
||||
(7) Data I/O is done direct to and from the netfs's pages. The netfs
|
||||
indicates that page A is at index B of the data-file represented by cookie
|
||||
C, and that it should be read or written. The cache backend may or may
|
||||
not start I/O on that page, but if it does, a netfs callback will be
|
||||
invoked to indicate completion. The I/O may be either synchronous or
|
||||
asynchronous.
|
||||
|
||||
(8) Cookies can be "retired" upon release. At this point FS-Cache will mark
|
||||
them as obsolete and the index hierarchy rooted at that point will get
|
||||
recycled.
|
||||
|
||||
(9) The netfs provides a "match" function for index searches. In addition to
|
||||
saying whether a match was made or not, this can also specify that an
|
||||
entry should be updated or deleted.
|
||||
|
||||
(10) As much as possible is done asynchronously.
|
||||
|
||||
|
||||
FS-Cache maintains a virtual indexing tree in which all indices, files, objects
|
||||
and pages are kept. Bits of this tree may actually reside in one or more
|
||||
caches.
|
||||
|
||||
FSDEF
|
||||
|
|
||||
+------------------------------------+
|
||||
| |
|
||||
NFS AFS
|
||||
| |
|
||||
+--------------------------+ +-----------+
|
||||
| | | |
|
||||
homedir mirror afs.org redhat.com
|
||||
| | |
|
||||
+------------+ +---------------+ +----------+
|
||||
| | | | | |
|
||||
00001 00002 00007 00125 vol00001 vol00002
|
||||
| | | | |
|
||||
+---+---+ +-----+ +---+ +------+------+ +-----+----+
|
||||
| | | | | | | | | | | | |
|
||||
PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
|
||||
| |
|
||||
PG0 +-------+
|
||||
| |
|
||||
00001 00003
|
||||
|
|
||||
+---+---+
|
||||
| | |
|
||||
PG0 PG1 PG2
|
||||
|
||||
In the example above, you can see two netfs's being backed: NFS and AFS. These
|
||||
have different index hierarchies:
|
||||
|
||||
(*) The NFS primary index contains per-server indices. Each server index is
|
||||
indexed by NFS file handles to get data file objects. Each data file
|
||||
objects can have an array of pages, but may also have further child
|
||||
objects, such as extended attributes and directory entries. Extended
|
||||
attribute objects themselves have page-array contents.
|
||||
|
||||
(*) The AFS primary index contains per-cell indices. Each cell index contains
|
||||
per-logical-volume indices. Each of volume index contains up to three
|
||||
indices for the read-write, read-only and backup mirrors of those volumes.
|
||||
Each of these contains vnode data file objects, each of which contains an
|
||||
array of pages.
|
||||
|
||||
The very top index is the FS-Cache master index in which individual netfs's
|
||||
have entries.
|
||||
|
||||
Any index object may reside in more than one cache, provided it only has index
|
||||
children. Any index with non-index object children will be assumed to only
|
||||
reside in one cache.
|
||||
|
||||
|
||||
The netfs API to FS-Cache can be found in:
|
||||
|
||||
Documentation/filesystems/caching/netfs-api.txt
|
||||
|
||||
The cache backend API to FS-Cache can be found in:
|
||||
|
||||
Documentation/filesystems/caching/backend-api.txt
|
||||
|
||||
A description of the internal representations and object state machine can be
|
||||
found in:
|
||||
|
||||
Documentation/filesystems/caching/object.txt
|
||||
|
||||
|
||||
=======================
|
||||
STATISTICAL INFORMATION
|
||||
=======================
|
||||
|
||||
If FS-Cache is compiled with the following options enabled:
|
||||
|
||||
CONFIG_FSCACHE_STATS=y
|
||||
CONFIG_FSCACHE_HISTOGRAM=y
|
||||
|
||||
then it will gather certain statistics and display them through a number of
|
||||
proc files.
|
||||
|
||||
(*) /proc/fs/fscache/stats
|
||||
|
||||
This shows counts of a number of events that can happen in FS-Cache:
|
||||
|
||||
CLASS EVENT MEANING
|
||||
======= ======= =======================================================
|
||||
Cookies idx=N Number of index cookies allocated
|
||||
dat=N Number of data storage cookies allocated
|
||||
spc=N Number of special cookies allocated
|
||||
Objects alc=N Number of objects allocated
|
||||
nal=N Number of object allocation failures
|
||||
avl=N Number of objects that reached the available state
|
||||
ded=N Number of objects that reached the dead state
|
||||
ChkAux non=N Number of objects that didn't have a coherency check
|
||||
ok=N Number of objects that passed a coherency check
|
||||
upd=N Number of objects that needed a coherency data update
|
||||
obs=N Number of objects that were declared obsolete
|
||||
Pages mrk=N Number of pages marked as being cached
|
||||
unc=N Number of uncache page requests seen
|
||||
Acquire n=N Number of acquire cookie requests seen
|
||||
nul=N Number of acq reqs given a NULL parent
|
||||
noc=N Number of acq reqs rejected due to no cache available
|
||||
ok=N Number of acq reqs succeeded
|
||||
nbf=N Number of acq reqs rejected due to error
|
||||
oom=N Number of acq reqs failed on ENOMEM
|
||||
Lookups n=N Number of lookup calls made on cache backends
|
||||
neg=N Number of negative lookups made
|
||||
pos=N Number of positive lookups made
|
||||
crt=N Number of objects created by lookup
|
||||
tmo=N Number of lookups timed out and requeued
|
||||
Updates n=N Number of update cookie requests seen
|
||||
nul=N Number of upd reqs given a NULL parent
|
||||
run=N Number of upd reqs granted CPU time
|
||||
Relinqs n=N Number of relinquish cookie requests seen
|
||||
nul=N Number of rlq reqs given a NULL parent
|
||||
wcr=N Number of rlq reqs waited on completion of creation
|
||||
AttrChg n=N Number of attribute changed requests seen
|
||||
ok=N Number of attr changed requests queued
|
||||
nbf=N Number of attr changed rejected -ENOBUFS
|
||||
oom=N Number of attr changed failed -ENOMEM
|
||||
run=N Number of attr changed ops given CPU time
|
||||
Allocs n=N Number of allocation requests seen
|
||||
ok=N Number of successful alloc reqs
|
||||
wt=N Number of alloc reqs that waited on lookup completion
|
||||
nbf=N Number of alloc reqs rejected -ENOBUFS
|
||||
int=N Number of alloc reqs aborted -ERESTARTSYS
|
||||
ops=N Number of alloc reqs submitted
|
||||
owt=N Number of alloc reqs waited for CPU time
|
||||
abt=N Number of alloc reqs aborted due to object death
|
||||
Retrvls n=N Number of retrieval (read) requests seen
|
||||
ok=N Number of successful retr reqs
|
||||
wt=N Number of retr reqs that waited on lookup completion
|
||||
nod=N Number of retr reqs returned -ENODATA
|
||||
nbf=N Number of retr reqs rejected -ENOBUFS
|
||||
int=N Number of retr reqs aborted -ERESTARTSYS
|
||||
oom=N Number of retr reqs failed -ENOMEM
|
||||
ops=N Number of retr reqs submitted
|
||||
owt=N Number of retr reqs waited for CPU time
|
||||
abt=N Number of retr reqs aborted due to object death
|
||||
Stores n=N Number of storage (write) requests seen
|
||||
ok=N Number of successful store reqs
|
||||
agn=N Number of store reqs on a page already pending storage
|
||||
nbf=N Number of store reqs rejected -ENOBUFS
|
||||
oom=N Number of store reqs failed -ENOMEM
|
||||
ops=N Number of store reqs submitted
|
||||
run=N Number of store reqs granted CPU time
|
||||
pgs=N Number of pages given store req processing time
|
||||
rxd=N Number of store reqs deleted from tracking tree
|
||||
olm=N Number of store reqs over store limit
|
||||
VmScan nos=N Number of release reqs against pages with no pending store
|
||||
gon=N Number of release reqs against pages stored by time lock granted
|
||||
bsy=N Number of release reqs ignored due to in-progress store
|
||||
can=N Number of page stores cancelled due to release req
|
||||
Ops pend=N Number of times async ops added to pending queues
|
||||
run=N Number of times async ops given CPU time
|
||||
enq=N Number of times async ops queued for processing
|
||||
can=N Number of async ops cancelled
|
||||
rej=N Number of async ops rejected due to object lookup/create failure
|
||||
ini=N Number of async ops initialised
|
||||
dfr=N Number of async ops queued for deferred release
|
||||
rel=N Number of async ops released (should equal ini=N when idle)
|
||||
gc=N Number of deferred-release async ops garbage collected
|
||||
CacheOp alo=N Number of in-progress alloc_object() cache ops
|
||||
luo=N Number of in-progress lookup_object() cache ops
|
||||
luc=N Number of in-progress lookup_complete() cache ops
|
||||
gro=N Number of in-progress grab_object() cache ops
|
||||
upo=N Number of in-progress update_object() cache ops
|
||||
dro=N Number of in-progress drop_object() cache ops
|
||||
pto=N Number of in-progress put_object() cache ops
|
||||
syn=N Number of in-progress sync_cache() cache ops
|
||||
atc=N Number of in-progress attr_changed() cache ops
|
||||
rap=N Number of in-progress read_or_alloc_page() cache ops
|
||||
ras=N Number of in-progress read_or_alloc_pages() cache ops
|
||||
alp=N Number of in-progress allocate_page() cache ops
|
||||
als=N Number of in-progress allocate_pages() cache ops
|
||||
wrp=N Number of in-progress write_page() cache ops
|
||||
ucp=N Number of in-progress uncache_page() cache ops
|
||||
dsp=N Number of in-progress dissociate_pages() cache ops
|
||||
CacheEv nsp=N Number of object lookups/creations rejected due to lack of space
|
||||
stl=N Number of stale objects deleted
|
||||
rtr=N Number of objects retired when relinquished
|
||||
cul=N Number of objects culled
|
||||
|
||||
|
||||
(*) /proc/fs/fscache/histogram
|
||||
|
||||
cat /proc/fs/fscache/histogram
|
||||
JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
|
||||
===== ===== ========= ========= ========= ========= =========
|
||||
|
||||
This shows the breakdown of the number of times each amount of time
|
||||
between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
|
||||
columns are as follows:
|
||||
|
||||
COLUMN TIME MEASUREMENT
|
||||
======= =======================================================
|
||||
OBJ INST Length of time to instantiate an object
|
||||
OP RUNS Length of time a call to process an operation took
|
||||
OBJ RUNS Length of time a call to process an object event took
|
||||
RETRV DLY Time between an requesting a read and lookup completing
|
||||
RETRIEVLS Time between beginning and end of a retrieval
|
||||
|
||||
Each row shows the number of events that took a particular range of times.
|
||||
Each step is 1 jiffy in size. The JIFS column indicates the particular
|
||||
jiffy range covered, and the SECS field the equivalent number of seconds.
|
||||
|
||||
|
||||
===========
|
||||
OBJECT LIST
|
||||
===========
|
||||
|
||||
If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
|
||||
list of all the objects currently allocated and allow them to be viewed
|
||||
through:
|
||||
|
||||
/proc/fs/fscache/objects
|
||||
|
||||
This will look something like:
|
||||
|
||||
[root@andromeda ~]# head /proc/fs/fscache/objects
|
||||
OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
|
||||
======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
|
||||
17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
|
||||
1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
|
||||
|
||||
where the first set of columns before the '|' describe the object:
|
||||
|
||||
COLUMN DESCRIPTION
|
||||
======= ===============================================================
|
||||
OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
|
||||
PARENT Debugging ID of parent object
|
||||
STAT Object state
|
||||
CHLDN Number of child objects of this object
|
||||
OPS Number of outstanding operations on this object
|
||||
OOP Number of outstanding child object management operations
|
||||
IPR
|
||||
EX Number of outstanding exclusive operations
|
||||
READS Number of outstanding read operations
|
||||
EM Object's event mask
|
||||
EV Events raised on this object
|
||||
F Object flags
|
||||
S Object work item busy state mask (1:pending 2:running)
|
||||
|
||||
and the second set of columns describe the object's cookie, if present:
|
||||
|
||||
COLUMN DESCRIPTION
|
||||
=============== =======================================================
|
||||
NETFS_COOKIE_DEF Name of netfs cookie definition
|
||||
TY Cookie type (IX - index, DT - data, hex - special)
|
||||
FL Cookie flags
|
||||
NETFS_DATA Netfs private data stored in the cookie
|
||||
OBJECT_KEY Object key } 1 column, with separating comma
|
||||
AUX_DATA Object aux data } presence may be configured
|
||||
|
||||
The data shown may be filtered by attaching the a key to an appropriate keyring
|
||||
before viewing the file. Something like:
|
||||
|
||||
keyctl add user fscache:objlist <restrictions> @s
|
||||
|
||||
where <restrictions> are a selection of the following letters:
|
||||
|
||||
K Show hexdump of object key (don't show if not given)
|
||||
A Show hexdump of object aux data (don't show if not given)
|
||||
|
||||
and the following paired letters:
|
||||
|
||||
C Show objects that have a cookie
|
||||
c Show objects that don't have a cookie
|
||||
B Show objects that are busy
|
||||
b Show objects that aren't busy
|
||||
W Show objects that have pending writes
|
||||
w Show objects that don't have pending writes
|
||||
R Show objects that have outstanding reads
|
||||
r Show objects that don't have outstanding reads
|
||||
S Show objects that have work queued
|
||||
s Show objects that don't have work queued
|
||||
|
||||
If neither side of a letter pair is given, then both are implied. For example:
|
||||
|
||||
keyctl add user fscache:objlist KB @s
|
||||
|
||||
shows objects that are busy, and lists their object keys, but does not dump
|
||||
their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
|
||||
not implied.
|
||||
|
||||
By default all objects and all fields will be shown.
|
||||
|
||||
|
||||
=========
|
||||
DEBUGGING
|
||||
=========
|
||||
|
||||
If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
|
||||
debugging enabled by adjusting the value in:
|
||||
|
||||
/sys/module/fscache/parameters/debug
|
||||
|
||||
This is a bitmask of debugging streams to enable:
|
||||
|
||||
BIT VALUE STREAM POINT
|
||||
======= ======= =============================== =======================
|
||||
0 1 Cache management Function entry trace
|
||||
1 2 Function exit trace
|
||||
2 4 General
|
||||
3 8 Cookie management Function entry trace
|
||||
4 16 Function exit trace
|
||||
5 32 General
|
||||
6 64 Page handling Function entry trace
|
||||
7 128 Function exit trace
|
||||
8 256 General
|
||||
9 512 Operation management Function entry trace
|
||||
10 1024 Function exit trace
|
||||
11 2048 General
|
||||
|
||||
The appropriate set of values should be OR'd together and the result written to
|
||||
the control file. For example:
|
||||
|
||||
echo $((1|8|64)) >/sys/module/fscache/parameters/debug
|
||||
|
||||
will turn on all function entry debugging.
|
14
Documentation/filesystems/caching/index.rst
Normal file
14
Documentation/filesystems/caching/index.rst
Normal file
@ -0,0 +1,14 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
Filesystem Caching
|
||||
==================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
fscache
|
||||
object
|
||||
backend-api
|
||||
cachefiles
|
||||
netfs-api
|
||||
operations
|
@ -1,6 +1,8 @@
|
||||
===============================
|
||||
FS-CACHE NETWORK FILESYSTEM API
|
||||
===============================
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===============================
|
||||
FS-Cache Network Filesystem API
|
||||
===============================
|
||||
|
||||
There's an API by which a network filesystem can make use of the FS-Cache
|
||||
facilities. This is based around a number of principles:
|
||||
@ -19,7 +21,7 @@ facilities. This is based around a number of principles:
|
||||
|
||||
This API is declared in <linux/fscache.h>.
|
||||
|
||||
This document contains the following sections:
|
||||
.. This document contains the following sections:
|
||||
|
||||
(1) Network filesystem definition
|
||||
(2) Index definition
|
||||
@ -41,12 +43,11 @@ This document contains the following sections:
|
||||
(18) FS-Cache specific page flags.
|
||||
|
||||
|
||||
=============================
|
||||
NETWORK FILESYSTEM DEFINITION
|
||||
Network Filesystem Definition
|
||||
=============================
|
||||
|
||||
FS-Cache needs a description of the network filesystem. This is specified
|
||||
using a record of the following structure:
|
||||
using a record of the following structure::
|
||||
|
||||
struct fscache_netfs {
|
||||
uint32_t version;
|
||||
@ -71,7 +72,7 @@ The fields are:
|
||||
another parameter passed into the registration function.
|
||||
|
||||
For example, kAFS (linux/fs/afs/) uses the following definitions to describe
|
||||
itself:
|
||||
itself::
|
||||
|
||||
struct fscache_netfs afs_cache_netfs = {
|
||||
.version = 0,
|
||||
@ -79,8 +80,7 @@ itself:
|
||||
};
|
||||
|
||||
|
||||
================
|
||||
INDEX DEFINITION
|
||||
Index Definition
|
||||
================
|
||||
|
||||
Indices are used for two purposes:
|
||||
@ -114,11 +114,10 @@ There are some limits on indices:
|
||||
function is recursive. Too many layers will run the kernel out of stack.
|
||||
|
||||
|
||||
=================
|
||||
OBJECT DEFINITION
|
||||
Object Definition
|
||||
=================
|
||||
|
||||
To define an object, a structure of the following type should be filled out:
|
||||
To define an object, a structure of the following type should be filled out::
|
||||
|
||||
struct fscache_cookie_def
|
||||
{
|
||||
@ -149,16 +148,13 @@ This has the following fields:
|
||||
|
||||
This is one of the following values:
|
||||
|
||||
(*) FSCACHE_COOKIE_TYPE_INDEX
|
||||
|
||||
FSCACHE_COOKIE_TYPE_INDEX
|
||||
This defines an index, which is a special FS-Cache type.
|
||||
|
||||
(*) FSCACHE_COOKIE_TYPE_DATAFILE
|
||||
|
||||
FSCACHE_COOKIE_TYPE_DATAFILE
|
||||
This defines an ordinary data file.
|
||||
|
||||
(*) Any other value between 2 and 255
|
||||
|
||||
Any other value between 2 and 255
|
||||
This defines an extraordinary object such as an XATTR.
|
||||
|
||||
(2) The name of the object type (NUL terminated unless all 16 chars are used)
|
||||
@ -192,9 +188,14 @@ This has the following fields:
|
||||
|
||||
If present, the function should return one of the following values:
|
||||
|
||||
(*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is
|
||||
(*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update
|
||||
(*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted
|
||||
FSCACHE_CHECKAUX_OKAY
|
||||
- the entry is okay as is
|
||||
|
||||
FSCACHE_CHECKAUX_NEEDS_UPDATE
|
||||
- the entry requires update
|
||||
|
||||
FSCACHE_CHECKAUX_OBSOLETE
|
||||
- the entry should be deleted
|
||||
|
||||
This function can also be used to extract data from the auxiliary data in
|
||||
the cache and copy it into the netfs's structures.
|
||||
@ -236,32 +237,30 @@ This has the following fields:
|
||||
This function is not required for indices as they're not permitted data.
|
||||
|
||||
|
||||
===================================
|
||||
NETWORK FILESYSTEM (UN)REGISTRATION
|
||||
Network Filesystem (Un)registration
|
||||
===================================
|
||||
|
||||
The first step is to declare the network filesystem to the cache. This also
|
||||
involves specifying the layout of the primary index (for AFS, this would be the
|
||||
"cell" level).
|
||||
|
||||
The registration function is:
|
||||
The registration function is::
|
||||
|
||||
int fscache_register_netfs(struct fscache_netfs *netfs);
|
||||
|
||||
It just takes a pointer to the netfs definition. It returns 0 or an error as
|
||||
appropriate.
|
||||
|
||||
For kAFS, registration is done as follows:
|
||||
For kAFS, registration is done as follows::
|
||||
|
||||
ret = fscache_register_netfs(&afs_cache_netfs);
|
||||
|
||||
The last step is, of course, unregistration:
|
||||
The last step is, of course, unregistration::
|
||||
|
||||
void fscache_unregister_netfs(struct fscache_netfs *netfs);
|
||||
|
||||
|
||||
================
|
||||
CACHE TAG LOOKUP
|
||||
Cache Tag Lookup
|
||||
================
|
||||
|
||||
FS-Cache permits the use of more than one cache. To permit particular index
|
||||
@ -270,7 +269,7 @@ representation tags. This step is optional; it can be left entirely up to
|
||||
FS-Cache as to which cache should be used. The problem with doing that is that
|
||||
FS-Cache will always pick the first cache that was registered.
|
||||
|
||||
To get the representation for a named tag:
|
||||
To get the representation for a named tag::
|
||||
|
||||
struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name);
|
||||
|
||||
@ -278,7 +277,7 @@ This takes a text string as the name and returns a representation of a tag. It
|
||||
will never return an error. It may return a dummy tag, however, if it runs out
|
||||
of memory; this will inhibit caching with this tag.
|
||||
|
||||
Any representation so obtained must be released by passing it to this function:
|
||||
Any representation so obtained must be released by passing it to this function::
|
||||
|
||||
void fscache_release_cache_tag(struct fscache_cache_tag *tag);
|
||||
|
||||
@ -286,13 +285,12 @@ The tag will be retrieved by FS-Cache when it calls the object definition
|
||||
operation select_cache().
|
||||
|
||||
|
||||
==================
|
||||
INDEX REGISTRATION
|
||||
Index Registration
|
||||
==================
|
||||
|
||||
The third step is to inform FS-Cache about part of an index hierarchy that can
|
||||
be used to locate files. This is done by requesting a cookie for each index in
|
||||
the path to the file:
|
||||
the path to the file::
|
||||
|
||||
struct fscache_cookie *
|
||||
fscache_acquire_cookie(struct fscache_cookie *parent,
|
||||
@ -339,7 +337,7 @@ must be enabled to do anything with it. A disabled cookie can be enabled by
|
||||
calling fscache_enable_cookie() (see below).
|
||||
|
||||
For example, with AFS, a cell would be added to the primary index. This index
|
||||
entry would have a dependent inode containing volume mappings within this cell:
|
||||
entry would have a dependent inode containing volume mappings within this cell::
|
||||
|
||||
cell->cache =
|
||||
fscache_acquire_cookie(afs_cache_netfs.primary_index,
|
||||
@ -349,7 +347,7 @@ entry would have a dependent inode containing volume mappings within this cell:
|
||||
cell, 0, true);
|
||||
|
||||
And then a particular volume could be added to that index by ID, creating
|
||||
another index for vnodes (AFS inode equivalents):
|
||||
another index for vnodes (AFS inode equivalents)::
|
||||
|
||||
volume->cache =
|
||||
fscache_acquire_cookie(volume->cell->cache,
|
||||
@ -359,13 +357,12 @@ another index for vnodes (AFS inode equivalents):
|
||||
volume, 0, true);
|
||||
|
||||
|
||||
======================
|
||||
DATA FILE REGISTRATION
|
||||
Data File Registration
|
||||
======================
|
||||
|
||||
The fourth step is to request a data file be created in the cache. This is
|
||||
identical to index cookie acquisition. The only difference is that the type in
|
||||
the object definition should be something other than index type.
|
||||
the object definition should be something other than index type::
|
||||
|
||||
vnode->cache =
|
||||
fscache_acquire_cookie(volume->cache,
|
||||
@ -375,15 +372,14 @@ the object definition should be something other than index type.
|
||||
vnode, vnode->status.size, true);
|
||||
|
||||
|
||||
=================================
|
||||
MISCELLANEOUS OBJECT REGISTRATION
|
||||
Miscellaneous Object Registration
|
||||
=================================
|
||||
|
||||
An optional step is to request an object of miscellaneous type be created in
|
||||
the cache. This is almost identical to index cookie acquisition. The only
|
||||
difference is that the type in the object definition should be something other
|
||||
than index type. While the parent object could be an index, it's more likely
|
||||
it would be some other type of object such as a data file.
|
||||
it would be some other type of object such as a data file::
|
||||
|
||||
xattr->cache =
|
||||
fscache_acquire_cookie(vnode->cache,
|
||||
@ -396,13 +392,12 @@ Miscellaneous objects might be used to store extended attributes or directory
|
||||
entries for example.
|
||||
|
||||
|
||||
==========================
|
||||
SETTING THE DATA FILE SIZE
|
||||
Setting the Data File Size
|
||||
==========================
|
||||
|
||||
The fifth step is to set the physical attributes of the file, such as its size.
|
||||
This doesn't automatically reserve any space in the cache, but permits the
|
||||
cache to adjust its metadata for data tracking appropriately:
|
||||
cache to adjust its metadata for data tracking appropriately::
|
||||
|
||||
int fscache_attr_changed(struct fscache_cookie *cookie);
|
||||
|
||||
@ -417,8 +412,7 @@ some point in the future, and as such, it may happen after the function returns
|
||||
to the caller. The attribute adjustment excludes read and write operations.
|
||||
|
||||
|
||||
=====================
|
||||
PAGE ALLOC/READ/WRITE
|
||||
Page alloc/read/write
|
||||
=====================
|
||||
|
||||
And the sixth step is to store and retrieve pages in the cache. There are
|
||||
@ -441,7 +435,7 @@ PAGE READ
|
||||
|
||||
Firstly, the netfs should ask FS-Cache to examine the caches and read the
|
||||
contents cached for a particular page of a particular file if present, or else
|
||||
allocate space to store the contents if not:
|
||||
allocate space to store the contents if not::
|
||||
|
||||
typedef
|
||||
void (*fscache_rw_complete_t)(struct page *page,
|
||||
@ -474,14 +468,14 @@ Else if there's a copy of the page resident in the cache:
|
||||
|
||||
(4) When the read is complete, end_io_func() will be invoked with:
|
||||
|
||||
(*) The netfs data supplied when the cookie was created.
|
||||
* The netfs data supplied when the cookie was created.
|
||||
|
||||
(*) The page descriptor.
|
||||
* The page descriptor.
|
||||
|
||||
(*) The context argument passed to the above function. This will be
|
||||
* The context argument passed to the above function. This will be
|
||||
maintained with the get_context/put_context functions mentioned above.
|
||||
|
||||
(*) An argument that's 0 on success or negative for an error code.
|
||||
* An argument that's 0 on success or negative for an error code.
|
||||
|
||||
If an error occurs, it should be assumed that the page contains no usable
|
||||
data. fscache_readpages_cancel() may need to be called.
|
||||
@ -504,11 +498,11 @@ This function may also return -ENOMEM or -EINTR, in which case it won't have
|
||||
read any data from the cache.
|
||||
|
||||
|
||||
PAGE ALLOCATE
|
||||
Page Allocate
|
||||
-------------
|
||||
|
||||
Alternatively, if there's not expected to be any data in the cache for a page
|
||||
because the file has been extended, a block can simply be allocated instead:
|
||||
because the file has been extended, a block can simply be allocated instead::
|
||||
|
||||
int fscache_alloc_page(struct fscache_cookie *cookie,
|
||||
struct page *page,
|
||||
@ -523,12 +517,12 @@ The mark_pages_cached() cookie operation will be called on the page if
|
||||
successful.
|
||||
|
||||
|
||||
PAGE WRITE
|
||||
Page Write
|
||||
----------
|
||||
|
||||
Secondly, if the netfs changes the contents of the page (either due to an
|
||||
initial download or if a user performs a write), then the page should be
|
||||
written back to the cache:
|
||||
written back to the cache::
|
||||
|
||||
int fscache_write_page(struct fscache_cookie *cookie,
|
||||
struct page *page,
|
||||
@ -566,11 +560,11 @@ place if unforeseen circumstances arose (such as a disk error).
|
||||
Writing takes place asynchronously.
|
||||
|
||||
|
||||
MULTIPLE PAGE READ
|
||||
Multiple Page Read
|
||||
------------------
|
||||
|
||||
A facility is provided to read several pages at once, as requested by the
|
||||
readpages() address space operation:
|
||||
readpages() address space operation::
|
||||
|
||||
int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
|
||||
struct address_space *mapping,
|
||||
@ -598,7 +592,7 @@ This works in a similar way to fscache_read_or_alloc_page(), except:
|
||||
be returned.
|
||||
|
||||
Otherwise, if all pages had reads dispatched, then 0 will be returned, the
|
||||
list will be empty and *nr_pages will be 0.
|
||||
list will be empty and ``*nr_pages`` will be 0.
|
||||
|
||||
(4) end_io_func will be called once for each page being read as the reads
|
||||
complete. It will be called in process context if error != 0, but it may
|
||||
@ -609,13 +603,13 @@ some of the pages being read and some being allocated. Those pages will have
|
||||
been marked appropriately and will need uncaching.
|
||||
|
||||
|
||||
CANCELLATION OF UNREAD PAGES
|
||||
Cancellation of Unread Pages
|
||||
----------------------------
|
||||
|
||||
If one or more pages are passed to fscache_read_or_alloc_pages() but not then
|
||||
read from the cache and also not read from the underlying filesystem then
|
||||
those pages will need to have any marks and reservations removed. This can be
|
||||
done by calling:
|
||||
done by calling::
|
||||
|
||||
void fscache_readpages_cancel(struct fscache_cookie *cookie,
|
||||
struct list_head *pages);
|
||||
@ -625,11 +619,10 @@ fscache_read_or_alloc_pages(). Every page in the pages list will be examined
|
||||
and any that have PG_fscache set will be uncached.
|
||||
|
||||
|
||||
==============
|
||||
PAGE UNCACHING
|
||||
Page Uncaching
|
||||
==============
|
||||
|
||||
To uncache a page, this function should be called:
|
||||
To uncache a page, this function should be called::
|
||||
|
||||
void fscache_uncache_page(struct fscache_cookie *cookie,
|
||||
struct page *page);
|
||||
@ -644,12 +637,12 @@ data file must be retired (see the relinquish cookie function below).
|
||||
|
||||
Furthermore, note that this does not cancel the asynchronous read or write
|
||||
operation started by the read/alloc and write functions, so the page
|
||||
invalidation functions must use:
|
||||
invalidation functions must use::
|
||||
|
||||
bool fscache_check_page_write(struct fscache_cookie *cookie,
|
||||
struct page *page);
|
||||
|
||||
to see if a page is being written to the cache, and:
|
||||
to see if a page is being written to the cache, and::
|
||||
|
||||
void fscache_wait_on_page_write(struct fscache_cookie *cookie,
|
||||
struct page *page);
|
||||
@ -660,7 +653,7 @@ to wait for it to finish if it is.
|
||||
When releasepage() is being implemented, a special FS-Cache function exists to
|
||||
manage the heuristics of coping with vmscan trying to eject pages, which may
|
||||
conflict with the cache trying to write pages to the cache (which may itself
|
||||
need to allocate memory):
|
||||
need to allocate memory)::
|
||||
|
||||
bool fscache_maybe_release_page(struct fscache_cookie *cookie,
|
||||
struct page *page,
|
||||
@ -676,12 +669,12 @@ storage request to complete, or it may attempt to cancel the storage request -
|
||||
in which case the page will not be stored in the cache this time.
|
||||
|
||||
|
||||
BULK INODE PAGE UNCACHE
|
||||
Bulk Image Page Uncache
|
||||
-----------------------
|
||||
|
||||
A convenience routine is provided to perform an uncache on all the pages
|
||||
attached to an inode. This assumes that the pages on the inode correspond on a
|
||||
1:1 basis with the pages in the cache.
|
||||
1:1 basis with the pages in the cache::
|
||||
|
||||
void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie,
|
||||
struct inode *inode);
|
||||
@ -692,12 +685,11 @@ written to the cache and for the cache to finish with the page generally. No
|
||||
error is returned.
|
||||
|
||||
|
||||
===============================
|
||||
INDEX AND DATA FILE CONSISTENCY
|
||||
Index and Data File consistency
|
||||
===============================
|
||||
|
||||
To find out whether auxiliary data for an object is up to data within the
|
||||
cache, the following function can be called:
|
||||
cache, the following function can be called::
|
||||
|
||||
int fscache_check_consistency(struct fscache_cookie *cookie,
|
||||
const void *aux_data);
|
||||
@ -708,7 +700,7 @@ data buffer first. It returns 0 if it is and -ESTALE if it isn't; it may also
|
||||
return -ENOMEM and -ERESTARTSYS.
|
||||
|
||||
To request an update of the index data for an index or other object, the
|
||||
following function should be called:
|
||||
following function should be called::
|
||||
|
||||
void fscache_update_cookie(struct fscache_cookie *cookie,
|
||||
const void *aux_data);
|
||||
@ -721,8 +713,7 @@ Note that partial updates may happen automatically at other times, such as when
|
||||
data blocks are added to a data file object.
|
||||
|
||||
|
||||
=================
|
||||
COOKIE ENABLEMENT
|
||||
Cookie Enablement
|
||||
=================
|
||||
|
||||
Cookies exist in one of two states: enabled and disabled. If a cookie is
|
||||
@ -731,7 +722,7 @@ invalidate its state; allocate, read or write backing pages - though it is
|
||||
still possible to uncache pages and relinquish the cookie.
|
||||
|
||||
The initial enablement state is set by fscache_acquire_cookie(), but the cookie
|
||||
can be enabled or disabled later. To disable a cookie, call:
|
||||
can be enabled or disabled later. To disable a cookie, call::
|
||||
|
||||
void fscache_disable_cookie(struct fscache_cookie *cookie,
|
||||
const void *aux_data,
|
||||
@ -746,7 +737,7 @@ All possible failures are handled internally. The caller should consider
|
||||
calling fscache_uncache_all_inode_pages() afterwards to make sure all page
|
||||
markings are cleared up.
|
||||
|
||||
Cookies can be enabled or reenabled with:
|
||||
Cookies can be enabled or reenabled with::
|
||||
|
||||
void fscache_enable_cookie(struct fscache_cookie *cookie,
|
||||
const void *aux_data,
|
||||
@ -771,13 +762,12 @@ In both cases, the cookie's auxiliary data buffer is updated from aux_data if
|
||||
that is non-NULL inside the enablement lock before proceeding.
|
||||
|
||||
|
||||
===============================
|
||||
MISCELLANEOUS COOKIE OPERATIONS
|
||||
Miscellaneous Cookie operations
|
||||
===============================
|
||||
|
||||
There are a number of operations that can be used to control cookies:
|
||||
|
||||
(*) Cookie pinning:
|
||||
* Cookie pinning::
|
||||
|
||||
int fscache_pin_cookie(struct fscache_cookie *cookie);
|
||||
void fscache_unpin_cookie(struct fscache_cookie *cookie);
|
||||
@ -790,7 +780,7 @@ There are a number of operations that can be used to control cookies:
|
||||
-ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
|
||||
-EIO if there's any other problem.
|
||||
|
||||
(*) Data space reservation:
|
||||
* Data space reservation::
|
||||
|
||||
int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
|
||||
|
||||
@ -809,11 +799,10 @@ There are a number of operations that can be used to control cookies:
|
||||
make space if it's not in use.
|
||||
|
||||
|
||||
=====================
|
||||
COOKIE UNREGISTRATION
|
||||
Cookie Unregistration
|
||||
=====================
|
||||
|
||||
To get rid of a cookie, this function should be called.
|
||||
To get rid of a cookie, this function should be called::
|
||||
|
||||
void fscache_relinquish_cookie(struct fscache_cookie *cookie,
|
||||
const void *aux_data,
|
||||
@ -835,16 +824,14 @@ the cookies for "child" indices, objects and pages have been relinquished
|
||||
first.
|
||||
|
||||
|
||||
==================
|
||||
INDEX INVALIDATION
|
||||
Index Invalidation
|
||||
==================
|
||||
|
||||
There is no direct way to invalidate an index subtree. To do this, the caller
|
||||
should relinquish and retire the cookie they have, and then acquire a new one.
|
||||
|
||||
|
||||
======================
|
||||
DATA FILE INVALIDATION
|
||||
Data File Invalidation
|
||||
======================
|
||||
|
||||
Sometimes it will be necessary to invalidate an object that contains data.
|
||||
@ -853,7 +840,7 @@ change - at which point the netfs has to throw away all the state it had for an
|
||||
inode and reload from the server.
|
||||
|
||||
To indicate that a cache object should be invalidated, the following function
|
||||
can be called:
|
||||
can be called::
|
||||
|
||||
void fscache_invalidate(struct fscache_cookie *cookie);
|
||||
|
||||
@ -868,13 +855,12 @@ auxiliary data update operation as it is very likely these will have changed.
|
||||
|
||||
Using the following function, the netfs can wait for the invalidation operation
|
||||
to have reached a point at which it can start submitting ordinary operations
|
||||
once again:
|
||||
once again::
|
||||
|
||||
void fscache_wait_on_invalidate(struct fscache_cookie *cookie);
|
||||
|
||||
|
||||
===========================
|
||||
FS-CACHE SPECIFIC PAGE FLAG
|
||||
FS-cache Specific Page Flag
|
||||
===========================
|
||||
|
||||
FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is
|
||||
@ -898,7 +884,7 @@ was given under certain circumstances.
|
||||
This bit does not overlap with such as PG_private. This means that FS-Cache
|
||||
can be used with a filesystem that uses the block buffering code.
|
||||
|
||||
There are a number of operations defined on this flag:
|
||||
There are a number of operations defined on this flag::
|
||||
|
||||
int PageFsCache(struct page *page);
|
||||
void SetPageFsCache(struct page *page)
|
@ -1,10 +1,12 @@
|
||||
====================================================
|
||||
IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
|
||||
====================================================
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
====================================================
|
||||
In-Kernel Cache Object Representation and Management
|
||||
====================================================
|
||||
|
||||
By: David Howells <dhowells@redhat.com>
|
||||
|
||||
Contents:
|
||||
.. Contents:
|
||||
|
||||
(*) Representation
|
||||
|
||||
@ -18,8 +20,7 @@ Contents:
|
||||
(*) The set of events.
|
||||
|
||||
|
||||
==============
|
||||
REPRESENTATION
|
||||
Representation
|
||||
==============
|
||||
|
||||
FS-Cache maintains an in-kernel representation of each object that a netfs is
|
||||
@ -38,7 +39,7 @@ or even by no objects (it may not be cached).
|
||||
|
||||
Furthermore, both cookies and objects are hierarchical. The two hierarchies
|
||||
correspond, but the cookies tree is a superset of the union of the object trees
|
||||
of multiple caches:
|
||||
of multiple caches::
|
||||
|
||||
NETFS INDEX TREE : CACHE 1 : CACHE 2
|
||||
: :
|
||||
@ -89,8 +90,7 @@ pointers to the cookies. The cookies themselves and any objects attached to
|
||||
those cookies are hidden from it.
|
||||
|
||||
|
||||
===============================
|
||||
OBJECT MANAGEMENT STATE MACHINE
|
||||
Object Management State Machine
|
||||
===============================
|
||||
|
||||
Within FS-Cache, each active object is managed by its own individual state
|
||||
@ -124,7 +124,7 @@ is not masked, the object will be queued for processing (by calling
|
||||
fscache_enqueue_object()).
|
||||
|
||||
|
||||
PROVISION OF CPU TIME
|
||||
Provision of CPU Time
|
||||
---------------------
|
||||
|
||||
The work to be done by the various states was given CPU time by the threads of
|
||||
@ -141,7 +141,7 @@ because:
|
||||
workqueues don't necessarily have the right numbers of threads.
|
||||
|
||||
|
||||
LOCKING SIMPLIFICATION
|
||||
Locking Simplification
|
||||
----------------------
|
||||
|
||||
Because only one worker thread may be operating on any particular object's
|
||||
@ -151,8 +151,7 @@ from the cache backend's representation (fscache_object) - which may be
|
||||
requested from either end.
|
||||
|
||||
|
||||
=================
|
||||
THE SET OF STATES
|
||||
The Set of States
|
||||
=================
|
||||
|
||||
The object state machine has a set of states that it can be in. There are
|
||||
@ -275,19 +274,17 @@ memory and potentially deletes stuff from disk:
|
||||
this state.
|
||||
|
||||
|
||||
THE SET OF EVENTS
|
||||
The Set of Events
|
||||
-----------------
|
||||
|
||||
There are a number of events that can be raised to an object state machine:
|
||||
|
||||
(*) FSCACHE_OBJECT_EV_UPDATE
|
||||
|
||||
FSCACHE_OBJECT_EV_UPDATE
|
||||
The netfs requested that an object be updated. The state machine will ask
|
||||
the cache backend to update the object, and the cache backend will ask the
|
||||
netfs for details of the change through its cookie definition ops.
|
||||
|
||||
(*) FSCACHE_OBJECT_EV_CLEARED
|
||||
|
||||
FSCACHE_OBJECT_EV_CLEARED
|
||||
This is signalled in two circumstances:
|
||||
|
||||
(a) when an object's last child object is dropped and
|
||||
@ -296,20 +293,16 @@ There are a number of events that can be raised to an object state machine:
|
||||
|
||||
This is used to proceed from the dying state.
|
||||
|
||||
(*) FSCACHE_OBJECT_EV_ERROR
|
||||
|
||||
FSCACHE_OBJECT_EV_ERROR
|
||||
This is signalled when an I/O error occurs during the processing of some
|
||||
object.
|
||||
|
||||
(*) FSCACHE_OBJECT_EV_RELEASE
|
||||
(*) FSCACHE_OBJECT_EV_RETIRE
|
||||
|
||||
FSCACHE_OBJECT_EV_RELEASE, FSCACHE_OBJECT_EV_RETIRE
|
||||
These are signalled when the netfs relinquishes a cookie it was using.
|
||||
The event selected depends on whether the netfs asks for the backing
|
||||
object to be retired (deleted) or retained.
|
||||
|
||||
(*) FSCACHE_OBJECT_EV_WITHDRAW
|
||||
|
||||
FSCACHE_OBJECT_EV_WITHDRAW
|
||||
This is signalled when the cache backend wants to withdraw an object.
|
||||
This means that the object will have to be detached from the netfs's
|
||||
cookie.
|
@ -1,10 +1,12 @@
|
||||
================================
|
||||
ASYNCHRONOUS OPERATIONS HANDLING
|
||||
================================
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
================================
|
||||
Asynchronous Operations Handling
|
||||
================================
|
||||
|
||||
By: David Howells <dhowells@redhat.com>
|
||||
|
||||
Contents:
|
||||
.. Contents:
|
||||
|
||||
(*) Overview.
|
||||
|
||||
@ -17,8 +19,7 @@ Contents:
|
||||
(*) Asynchronous callback.
|
||||
|
||||
|
||||
========
|
||||
OVERVIEW
|
||||
Overview
|
||||
========
|
||||
|
||||
FS-Cache has an asynchronous operations handling facility that it uses for its
|
||||
@ -33,11 +34,10 @@ backend for completion.
|
||||
To make use of this facility, <linux/fscache-cache.h> should be #included.
|
||||
|
||||
|
||||
===============================
|
||||
OPERATION RECORD INITIALISATION
|
||||
Operation Record Initialisation
|
||||
===============================
|
||||
|
||||
An operation is recorded in an fscache_operation struct:
|
||||
An operation is recorded in an fscache_operation struct::
|
||||
|
||||
struct fscache_operation {
|
||||
union {
|
||||
@ -50,7 +50,7 @@ An operation is recorded in an fscache_operation struct:
|
||||
};
|
||||
|
||||
Someone wanting to issue an operation should allocate something with this
|
||||
struct embedded in it. They should initialise it by calling:
|
||||
struct embedded in it. They should initialise it by calling::
|
||||
|
||||
void fscache_operation_init(struct fscache_operation *op,
|
||||
fscache_operation_release_t release);
|
||||
@ -67,8 +67,7 @@ FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the
|
||||
operation and waited for afterwards.
|
||||
|
||||
|
||||
==========
|
||||
PARAMETERS
|
||||
Parameters
|
||||
==========
|
||||
|
||||
There are a number of parameters that can be set in the operation record's flag
|
||||
@ -87,7 +86,7 @@ operations:
|
||||
|
||||
If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
|
||||
before submitting the operation, and the operating thread must wait for it
|
||||
to be cleared before proceeding:
|
||||
to be cleared before proceeding::
|
||||
|
||||
wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
|
||||
TASK_UNINTERRUPTIBLE);
|
||||
@ -101,7 +100,7 @@ operations:
|
||||
page to a netfs page after the backing fs has read the page in.
|
||||
|
||||
If this option is used, op->fast_work and op->processor must be
|
||||
initialised before submitting the operation:
|
||||
initialised before submitting the operation::
|
||||
|
||||
INIT_WORK(&op->fast_work, do_some_work);
|
||||
|
||||
@ -114,7 +113,7 @@ operations:
|
||||
pages that have just been fetched from a remote server.
|
||||
|
||||
If this option is used, op->slow_work and op->processor must be
|
||||
initialised before submitting the operation:
|
||||
initialised before submitting the operation::
|
||||
|
||||
fscache_operation_init_slow(op, processor)
|
||||
|
||||
@ -132,8 +131,7 @@ Furthermore, operations may be one of two types:
|
||||
operations running at the same time.
|
||||
|
||||
|
||||
=========
|
||||
PROCEDURE
|
||||
Procedure
|
||||
=========
|
||||
|
||||
Operations are used through the following procedure:
|
||||
@ -143,7 +141,7 @@ Operations are used through the following procedure:
|
||||
generic op embedded within.
|
||||
|
||||
(2) The submitting thread must then submit the operation for processing using
|
||||
one of the following two functions:
|
||||
one of the following two functions::
|
||||
|
||||
int fscache_submit_op(struct fscache_object *object,
|
||||
struct fscache_operation *op);
|
||||
@ -164,7 +162,7 @@ Operations are used through the following procedure:
|
||||
operation of conflicting exclusivity is in progress on the object.
|
||||
|
||||
If the operation is asynchronous, the manager will retain a reference to
|
||||
it, so the caller should put their reference to it by passing it to:
|
||||
it, so the caller should put their reference to it by passing it to::
|
||||
|
||||
void fscache_put_operation(struct fscache_operation *op);
|
||||
|
||||
@ -179,12 +177,12 @@ Operations are used through the following procedure:
|
||||
(4) The operation holds an effective lock upon the object, preventing other
|
||||
exclusive ops conflicting until it is released. The operation can be
|
||||
enqueued for further immediate asynchronous processing by adjusting the
|
||||
CPU time provisioning option if necessary, eg:
|
||||
CPU time provisioning option if necessary, eg::
|
||||
|
||||
op->flags &= ~FSCACHE_OP_TYPE;
|
||||
op->flags |= ~FSCACHE_OP_FAST;
|
||||
|
||||
and calling:
|
||||
and calling::
|
||||
|
||||
void fscache_enqueue_operation(struct fscache_operation *op)
|
||||
|
||||
@ -192,13 +190,12 @@ Operations are used through the following procedure:
|
||||
pools.
|
||||
|
||||
|
||||
=====================
|
||||
ASYNCHRONOUS CALLBACK
|
||||
Asynchronous Callback
|
||||
=====================
|
||||
|
||||
When used in asynchronous mode, the worker thread pool will invoke the
|
||||
processor method with a pointer to the operation. This should then get at the
|
||||
container struct by using container_of():
|
||||
container struct by using container_of()::
|
||||
|
||||
static void fscache_write_op(struct fscache_operation *_op)
|
||||
{
|
@ -1,7 +1,11 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===========================================
|
||||
Mounting root file system via SMB (cifs.ko)
|
||||
===========================================
|
||||
|
||||
Written 2019 by Paulo Alcantara <palcantara@suse.de>
|
||||
|
||||
Written 2019 by Aurelien Aptel <aaptel@suse.com>
|
||||
|
||||
The CONFIG_CIFS_ROOT option enables experimental root file system
|
||||
@ -32,7 +36,7 @@ Server configuration
|
||||
====================
|
||||
|
||||
To enable SMB1+UNIX extensions you will need to set these global
|
||||
settings in Samba smb.conf:
|
||||
settings in Samba smb.conf::
|
||||
|
||||
[global]
|
||||
server min protocol = NT1
|
||||
@ -41,12 +45,16 @@ settings in Samba smb.conf:
|
||||
Kernel command line
|
||||
===================
|
||||
|
||||
root=/dev/cifs
|
||||
::
|
||||
|
||||
root=/dev/cifs
|
||||
|
||||
This is just a virtual device that basically tells the kernel to mount
|
||||
the root file system via SMB protocol.
|
||||
|
||||
cifsroot=//<server-ip>/<share>[,options]
|
||||
::
|
||||
|
||||
cifsroot=//<server-ip>/<share>[,options]
|
||||
|
||||
Enables the kernel to mount the root file system via SMB that are
|
||||
located in the <server-ip> and <share> specified in this option.
|
||||
@ -65,33 +73,33 @@ options
|
||||
Examples
|
||||
========
|
||||
|
||||
Export root file system as a Samba share in smb.conf file.
|
||||
Export root file system as a Samba share in smb.conf file::
|
||||
|
||||
...
|
||||
[linux]
|
||||
path = /path/to/rootfs
|
||||
read only = no
|
||||
guest ok = yes
|
||||
force user = root
|
||||
force group = root
|
||||
browseable = yes
|
||||
writeable = yes
|
||||
admin users = root
|
||||
public = yes
|
||||
create mask = 0777
|
||||
directory mask = 0777
|
||||
...
|
||||
...
|
||||
[linux]
|
||||
path = /path/to/rootfs
|
||||
read only = no
|
||||
guest ok = yes
|
||||
force user = root
|
||||
force group = root
|
||||
browseable = yes
|
||||
writeable = yes
|
||||
admin users = root
|
||||
public = yes
|
||||
create mask = 0777
|
||||
directory mask = 0777
|
||||
...
|
||||
|
||||
Restart smb service.
|
||||
Restart smb service::
|
||||
|
||||
# systemctl restart smb
|
||||
# systemctl restart smb
|
||||
|
||||
Test it under QEMU on a kernel built with CONFIG_CIFS_ROOT and
|
||||
CONFIG_IP_PNP options enabled.
|
||||
CONFIG_IP_PNP options enabled::
|
||||
|
||||
# qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \
|
||||
-kernel /path/to/linux/arch/x86/boot/bzImage -nographic \
|
||||
-append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3"
|
||||
# qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \
|
||||
-kernel /path/to/linux/arch/x86/boot/bzImage -nographic \
|
||||
-append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3"
|
||||
|
||||
|
||||
1: https://wiki.samba.org/index.php/UNIX_Extensions
|
1670
Documentation/filesystems/coda.rst
Normal file
1670
Documentation/filesystems/coda.rst
Normal file
File diff suppressed because it is too large
Load Diff
File diff suppressed because it is too large
Load Diff
@ -1,5 +1,6 @@
|
||||
|
||||
configfs - Userspace-driven kernel object configuration.
|
||||
=======================================================
|
||||
Configfs - Userspace-driven Kernel Object Configuration
|
||||
=======================================================
|
||||
|
||||
Joel Becker <joel.becker@oracle.com>
|
||||
|
||||
@ -9,7 +10,8 @@ Copyright (c) 2005 Oracle Corporation,
|
||||
Joel Becker <joel.becker@oracle.com>
|
||||
|
||||
|
||||
[What is configfs?]
|
||||
What is configfs?
|
||||
=================
|
||||
|
||||
configfs is a ram-based filesystem that provides the converse of
|
||||
sysfs's functionality. Where sysfs is a filesystem-based view of
|
||||
@ -35,10 +37,11 @@ kernel modules backing the items must respond to this.
|
||||
Both sysfs and configfs can and should exist together on the same
|
||||
system. One is not a replacement for the other.
|
||||
|
||||
[Using configfs]
|
||||
Using configfs
|
||||
==============
|
||||
|
||||
configfs can be compiled as a module or into the kernel. You can access
|
||||
it by doing
|
||||
it by doing::
|
||||
|
||||
mount -t configfs none /config
|
||||
|
||||
@ -56,28 +59,29 @@ values. Don't mix more than one attribute in one attribute file.
|
||||
There are two types of configfs attributes:
|
||||
|
||||
* Normal attributes, which similar to sysfs attributes, are small ASCII text
|
||||
files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably
|
||||
only one value per file should be used, and the same caveats from sysfs apply.
|
||||
Configfs expects write(2) to store the entire buffer at once. When writing to
|
||||
normal configfs attributes, userspace processes should first read the entire
|
||||
file, modify the portions they wish to change, and then write the entire
|
||||
buffer back.
|
||||
files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably
|
||||
only one value per file should be used, and the same caveats from sysfs apply.
|
||||
Configfs expects write(2) to store the entire buffer at once. When writing to
|
||||
normal configfs attributes, userspace processes should first read the entire
|
||||
file, modify the portions they wish to change, and then write the entire
|
||||
buffer back.
|
||||
|
||||
* Binary attributes, which are somewhat similar to sysfs binary attributes,
|
||||
but with a few slight changes to semantics. The PAGE_SIZE limitation does not
|
||||
apply, but the whole binary item must fit in single kernel vmalloc'ed buffer.
|
||||
The write(2) calls from user space are buffered, and the attributes'
|
||||
write_bin_attribute method will be invoked on the final close, therefore it is
|
||||
imperative for user-space to check the return code of close(2) in order to
|
||||
verify that the operation finished successfully.
|
||||
To avoid a malicious user OOMing the kernel, there's a per-binary attribute
|
||||
maximum buffer value.
|
||||
but with a few slight changes to semantics. The PAGE_SIZE limitation does not
|
||||
apply, but the whole binary item must fit in single kernel vmalloc'ed buffer.
|
||||
The write(2) calls from user space are buffered, and the attributes'
|
||||
write_bin_attribute method will be invoked on the final close, therefore it is
|
||||
imperative for user-space to check the return code of close(2) in order to
|
||||
verify that the operation finished successfully.
|
||||
To avoid a malicious user OOMing the kernel, there's a per-binary attribute
|
||||
maximum buffer value.
|
||||
|
||||
When an item needs to be destroyed, remove it with rmdir(2). An
|
||||
item cannot be destroyed if any other item has a link to it (via
|
||||
symlink(2)). Links can be removed via unlink(2).
|
||||
|
||||
[Configuring FakeNBD: an Example]
|
||||
Configuring FakeNBD: an Example
|
||||
===============================
|
||||
|
||||
Imagine there's a Network Block Device (NBD) driver that allows you to
|
||||
access remote block devices. Call it FakeNBD. FakeNBD uses configfs
|
||||
@ -86,14 +90,14 @@ sysadmins use to configure FakeNBD, but somehow that program has to tell
|
||||
the driver about it. Here's where configfs comes in.
|
||||
|
||||
When the FakeNBD driver is loaded, it registers itself with configfs.
|
||||
readdir(3) sees this just fine:
|
||||
readdir(3) sees this just fine::
|
||||
|
||||
# ls /config
|
||||
fakenbd
|
||||
|
||||
A fakenbd connection can be created with mkdir(2). The name is
|
||||
arbitrary, but likely the tool will make some use of the name. Perhaps
|
||||
it is a uuid or a disk name:
|
||||
it is a uuid or a disk name::
|
||||
|
||||
# mkdir /config/fakenbd/disk1
|
||||
# ls /config/fakenbd/disk1
|
||||
@ -102,7 +106,7 @@ it is a uuid or a disk name:
|
||||
The target attribute contains the IP address of the server FakeNBD will
|
||||
connect to. The device attribute is the device on the server.
|
||||
Predictably, the rw attribute determines whether the connection is
|
||||
read-only or read-write.
|
||||
read-only or read-write::
|
||||
|
||||
# echo 10.0.0.1 > /config/fakenbd/disk1/target
|
||||
# echo /dev/sda1 > /config/fakenbd/disk1/device
|
||||
@ -111,7 +115,8 @@ read-only or read-write.
|
||||
That's it. That's all there is. Now the device is configured, via the
|
||||
shell no less.
|
||||
|
||||
[Coding With configfs]
|
||||
Coding With configfs
|
||||
====================
|
||||
|
||||
Every object in configfs is a config_item. A config_item reflects an
|
||||
object in the subsystem. It has attributes that match values on that
|
||||
@ -130,7 +135,10 @@ appears as a directory at the top of the configfs filesystem. A
|
||||
subsystem is also a config_group, and can do everything a config_group
|
||||
can.
|
||||
|
||||
[struct config_item]
|
||||
struct config_item
|
||||
==================
|
||||
|
||||
::
|
||||
|
||||
struct config_item {
|
||||
char *ci_name;
|
||||
@ -168,7 +176,10 @@ By itself, a config_item cannot do much more than appear in configfs.
|
||||
Usually a subsystem wants the item to display and/or store attributes,
|
||||
among other things. For that, it needs a type.
|
||||
|
||||
[struct config_item_type]
|
||||
struct config_item_type
|
||||
=======================
|
||||
|
||||
::
|
||||
|
||||
struct configfs_item_operations {
|
||||
void (*release)(struct config_item *);
|
||||
@ -192,7 +203,10 @@ allocated dynamically will need to provide the ct_item_ops->release()
|
||||
method. This method is called when the config_item's reference count
|
||||
reaches zero.
|
||||
|
||||
[struct configfs_attribute]
|
||||
struct configfs_attribute
|
||||
=========================
|
||||
|
||||
::
|
||||
|
||||
struct configfs_attribute {
|
||||
char *ca_name;
|
||||
@ -214,7 +228,10 @@ be called whenever userspace asks for a read(2) on the attribute. If an
|
||||
attribute is writable and provides a ->store method, that method will be
|
||||
be called whenever userspace asks for a write(2) on the attribute.
|
||||
|
||||
[struct configfs_bin_attribute]
|
||||
struct configfs_bin_attribute
|
||||
=============================
|
||||
|
||||
::
|
||||
|
||||
struct configfs_bin_attribute {
|
||||
struct configfs_attribute cb_attr;
|
||||
@ -240,11 +257,12 @@ will happen for write(2). The reads/writes are bufferred so only a
|
||||
single read/write will occur; the attributes' need not concern itself
|
||||
with it.
|
||||
|
||||
[struct config_group]
|
||||
struct config_group
|
||||
===================
|
||||
|
||||
A config_item cannot live in a vacuum. The only way one can be created
|
||||
is via mkdir(2) on a config_group. This will trigger creation of a
|
||||
child item.
|
||||
child item::
|
||||
|
||||
struct config_group {
|
||||
struct config_item cg_item;
|
||||
@ -264,7 +282,7 @@ The config_group structure contains a config_item. Properly configuring
|
||||
that item means that a group can behave as an item in its own right.
|
||||
However, it can do more: it can create child items or groups. This is
|
||||
accomplished via the group operations specified on the group's
|
||||
config_item_type.
|
||||
config_item_type::
|
||||
|
||||
struct configfs_group_operations {
|
||||
struct config_item *(*make_item)(struct config_group *group,
|
||||
@ -279,7 +297,8 @@ config_item_type.
|
||||
};
|
||||
|
||||
A group creates child items by providing the
|
||||
ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new
|
||||
ct_group_ops->make_item() method. If provided, this method is called from
|
||||
mkdir(2) in the group's directory. The subsystem allocates a new
|
||||
config_item (or more likely, its container structure), initializes it,
|
||||
and returns it to configfs. Configfs will then populate the filesystem
|
||||
tree to reflect the new item.
|
||||
@ -296,13 +315,14 @@ upon item allocation. If a subsystem has no work to do, it may omit
|
||||
the ct_group_ops->drop_item() method, and configfs will call
|
||||
config_item_put() on the item on behalf of the subsystem.
|
||||
|
||||
IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2)
|
||||
is called, configfs WILL remove the item from the filesystem tree
|
||||
(assuming that it has no children to keep it busy). The subsystem is
|
||||
responsible for responding to this. If the subsystem has references to
|
||||
the item in other threads, the memory is safe. It may take some time
|
||||
for the item to actually disappear from the subsystem's usage. But it
|
||||
is gone from configfs.
|
||||
Important:
|
||||
drop_item() is void, and as such cannot fail. When rmdir(2)
|
||||
is called, configfs WILL remove the item from the filesystem tree
|
||||
(assuming that it has no children to keep it busy). The subsystem is
|
||||
responsible for responding to this. If the subsystem has references to
|
||||
the item in other threads, the memory is safe. It may take some time
|
||||
for the item to actually disappear from the subsystem's usage. But it
|
||||
is gone from configfs.
|
||||
|
||||
When drop_item() is called, the item's linkage has already been torn
|
||||
down. It no longer has a reference on its parent and has no place in
|
||||
@ -319,10 +339,11 @@ is implemented in the configfs rmdir(2) code. ->drop_item() will not be
|
||||
called, as the item has not been dropped. rmdir(2) will fail, as the
|
||||
directory is not empty.
|
||||
|
||||
[struct configfs_subsystem]
|
||||
struct configfs_subsystem
|
||||
=========================
|
||||
|
||||
A subsystem must register itself, usually at module_init time. This
|
||||
tells configfs to make the subsystem appear in the file tree.
|
||||
tells configfs to make the subsystem appear in the file tree::
|
||||
|
||||
struct configfs_subsystem {
|
||||
struct config_group su_group;
|
||||
@ -332,17 +353,19 @@ tells configfs to make the subsystem appear in the file tree.
|
||||
int configfs_register_subsystem(struct configfs_subsystem *subsys);
|
||||
void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
|
||||
|
||||
A subsystem consists of a toplevel config_group and a mutex.
|
||||
A subsystem consists of a toplevel config_group and a mutex.
|
||||
The group is where child config_items are created. For a subsystem,
|
||||
this group is usually defined statically. Before calling
|
||||
configfs_register_subsystem(), the subsystem must have initialized the
|
||||
group via the usual group _init() functions, and it must also have
|
||||
initialized the mutex.
|
||||
When the register call returns, the subsystem is live, and it
|
||||
|
||||
When the register call returns, the subsystem is live, and it
|
||||
will be visible via configfs. At that point, mkdir(2) can be called and
|
||||
the subsystem must be ready for it.
|
||||
|
||||
[An Example]
|
||||
An Example
|
||||
==========
|
||||
|
||||
The best example of these basic concepts is the simple_children
|
||||
subsystem/group and the simple_child item in
|
||||
@ -350,7 +373,8 @@ samples/configfs/configfs_sample.c. It shows a trivial object displaying
|
||||
and storing an attribute, and a simple group creating and destroying
|
||||
these children.
|
||||
|
||||
[Hierarchy Navigation and the Subsystem Mutex]
|
||||
Hierarchy Navigation and the Subsystem Mutex
|
||||
============================================
|
||||
|
||||
There is an extra bonus that configfs provides. The config_groups and
|
||||
config_items are arranged in a hierarchy due to the fact that they
|
||||
@ -375,7 +399,8 @@ be in its parent's cg_children list for the same duration. This allows
|
||||
a subsystem to trust ci_parent and cg_children while they hold the
|
||||
mutex.
|
||||
|
||||
[Item Aggregation Via symlink(2)]
|
||||
Item Aggregation Via symlink(2)
|
||||
===============================
|
||||
|
||||
configfs provides a simple group via the group->item parent/child
|
||||
relationship. Often, however, a larger environment requires aggregation
|
||||
@ -403,7 +428,8 @@ A config_item cannot be removed while it links to any other item, nor
|
||||
can it be removed while an item links to it. Dangling symlinks are not
|
||||
allowed in configfs.
|
||||
|
||||
[Automatically Created Subgroups]
|
||||
Automatically Created Subgroups
|
||||
===============================
|
||||
|
||||
A new config_group may want to have two types of child config_items.
|
||||
While this could be codified by magic names in ->make_item(), it is much
|
||||
@ -433,7 +459,8 @@ As a consequence of this, default groups cannot be removed directly via
|
||||
rmdir(2). They also are not considered when rmdir(2) on the parent
|
||||
group is checking for children.
|
||||
|
||||
[Dependent Subsystems]
|
||||
Dependent Subsystems
|
||||
====================
|
||||
|
||||
Sometimes other drivers depend on particular configfs items. For
|
||||
example, ocfs2 mounts depend on a heartbeat region item. If that
|
||||
@ -460,9 +487,11 @@ succeeds, then heartbeat knows the region is safe to give to ocfs2.
|
||||
If it fails, it was being torn down anyway, and heartbeat can gracefully
|
||||
pass up an error.
|
||||
|
||||
[Committable Items]
|
||||
Committable Items
|
||||
=================
|
||||
|
||||
NOTE: Committable items are currently unimplemented.
|
||||
Note:
|
||||
Committable items are currently unimplemented.
|
||||
|
||||
Some config_items cannot have a valid initial state. That is, no
|
||||
default values can be specified for the item's attributes such that the
|
||||
@ -504,5 +533,3 @@ As rmdir(2) does not work in the "live" directory, an item must be
|
||||
shutdown, or "uncommitted". Again, this is done via rename(2), this
|
||||
time from the "live" directory back to the "pending" one. The subsystem
|
||||
is notified by the ct_group_ops->uncommit_object() method.
|
||||
|
||||
|
@ -74,7 +74,7 @@ are zeroed out and converted to written extents before being returned to avoid
|
||||
exposure of uninitialized data through mmap.
|
||||
|
||||
These filesystems may be used for inspiration:
|
||||
- ext2: see Documentation/filesystems/ext2.txt
|
||||
- ext2: see Documentation/filesystems/ext2.rst
|
||||
- ext4: see Documentation/filesystems/ext4/
|
||||
- xfs: see Documentation/admin-guide/xfs.rst
|
||||
|
||||
|
@ -166,16 +166,17 @@ file::
|
||||
};
|
||||
|
||||
struct debugfs_regset32 {
|
||||
struct debugfs_reg32 *regs;
|
||||
const struct debugfs_reg32 *regs;
|
||||
int nregs;
|
||||
void __iomem *base;
|
||||
struct device *dev; /* Optional device for Runtime PM */
|
||||
};
|
||||
|
||||
debugfs_create_regset32(const char *name, umode_t mode,
|
||||
struct dentry *parent,
|
||||
struct debugfs_regset32 *regset);
|
||||
|
||||
void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs,
|
||||
void debugfs_print_regs32(struct seq_file *s, const struct debugfs_reg32 *regs,
|
||||
int nregs, void __iomem *base, char *prefix);
|
||||
|
||||
The "base" argument may be 0, but you may want to build the reg32 array
|
||||
|
36
Documentation/filesystems/devpts.rst
Normal file
36
Documentation/filesystems/devpts.rst
Normal file
@ -0,0 +1,36 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=====================
|
||||
The Devpts Filesystem
|
||||
=====================
|
||||
|
||||
Each mount of the devpts filesystem is now distinct such that ptys
|
||||
and their indicies allocated in one mount are independent from ptys
|
||||
and their indicies in all other mounts.
|
||||
|
||||
All mounts of the devpts filesystem now create a ``/dev/pts/ptmx`` node
|
||||
with permissions ``0000``.
|
||||
|
||||
To retain backwards compatibility the a ptmx device node (aka any node
|
||||
created with ``mknod name c 5 2``) when opened will look for an instance
|
||||
of devpts under the name ``pts`` in the same directory as the ptmx device
|
||||
node.
|
||||
|
||||
As an option instead of placing a ``/dev/ptmx`` device node at ``/dev/ptmx``
|
||||
it is possible to place a symlink to ``/dev/pts/ptmx`` at ``/dev/ptmx`` or
|
||||
to bind mount ``/dev/ptx/ptmx`` to ``/dev/ptmx``. If you opt for using
|
||||
the devpts filesystem in this manner devpts should be mounted with
|
||||
the ``ptmxmode=0666``, or ``chmod 0666 /dev/pts/ptmx`` should be called.
|
||||
|
||||
Total count of pty pairs in all instances is limited by sysctls::
|
||||
|
||||
kernel.pty.max = 4096 - global limit
|
||||
kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace
|
||||
kernel.pty.nr - current count of ptys
|
||||
|
||||
Per-instance limit could be set by adding mount option ``max=<count>``.
|
||||
|
||||
This feature was added in kernel 3.4 together with
|
||||
``sysctl kernel.pty.reserve``.
|
||||
|
||||
In kernels older than 3.4 sysctl ``kernel.pty.max`` works as per-instance limit.
|
@ -1,26 +0,0 @@
|
||||
Each mount of the devpts filesystem is now distinct such that ptys
|
||||
and their indicies allocated in one mount are independent from ptys
|
||||
and their indicies in all other mounts.
|
||||
|
||||
All mounts of the devpts filesystem now create a /dev/pts/ptmx node
|
||||
with permissions 0000.
|
||||
|
||||
To retain backwards compatibility the a ptmx device node (aka any node
|
||||
created with "mknod name c 5 2") when opened will look for an instance
|
||||
of devpts under the name "pts" in the same directory as the ptmx device
|
||||
node.
|
||||
|
||||
As an option instead of placing a /dev/ptmx device node at /dev/ptmx
|
||||
it is possible to place a symlink to /dev/pts/ptmx at /dev/ptmx or
|
||||
to bind mount /dev/ptx/ptmx to /dev/ptmx. If you opt for using
|
||||
the devpts filesystem in this manner devpts should be mounted with
|
||||
the ptmxmode=0666, or chmod 0666 /dev/pts/ptmx should be called.
|
||||
|
||||
Total count of pty pairs in all instances is limited by sysctls:
|
||||
kernel.pty.max = 4096 - global limit
|
||||
kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace
|
||||
kernel.pty.nr - current count of ptys
|
||||
|
||||
Per-instance limit could be set by adding mount option "max=<count>".
|
||||
This feature was added in kernel 3.4 together with sysctl kernel.pty.reserve.
|
||||
In kernels older than 3.4 sysctl kernel.pty.max works as per-instance limit.
|
@ -1,5 +1,8 @@
|
||||
Linux Directory Notification
|
||||
============================
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
============================
|
||||
Linux Directory Notification
|
||||
============================
|
||||
|
||||
Stephen Rothwell <sfr@canb.auug.org.au>
|
||||
|
||||
@ -12,6 +15,7 @@ being delivered using signals.
|
||||
The application decides which "events" it wants to be notified about.
|
||||
The currently defined events are:
|
||||
|
||||
========= =====================================================
|
||||
DN_ACCESS A file in the directory was accessed (read)
|
||||
DN_MODIFY A file in the directory was modified (write,truncate)
|
||||
DN_CREATE A file was created in the directory
|
||||
@ -19,6 +23,7 @@ The currently defined events are:
|
||||
DN_RENAME A file in the directory was renamed
|
||||
DN_ATTRIB A file in the directory had its attributes
|
||||
changed (chmod,chown)
|
||||
========= =====================================================
|
||||
|
||||
Usually, the application must reregister after each notification, but
|
||||
if DN_MULTISHOT is or'ed with the event mask, then the registration will
|
||||
@ -36,7 +41,7 @@ especially important if DN_MULTISHOT is specified. Note that SIGRTMIN
|
||||
is often blocked, so it is better to use (at least) SIGRTMIN + 1.
|
||||
|
||||
Implementation expectations (features and bugs :-))
|
||||
---------------------------
|
||||
---------------------------------------------------
|
||||
|
||||
The notification should work for any local access to files even if the
|
||||
actual file system is on a remote server. This implies that remote
|
||||
@ -67,4 +72,4 @@ See tools/testing/selftests/filesystems/dnotify_test.c for an example.
|
||||
NOTE
|
||||
----
|
||||
Beginning with Linux 2.6.13, dnotify has been replaced by inotify.
|
||||
See Documentation/filesystems/inotify.txt for more information on it.
|
||||
See Documentation/filesystems/inotify.rst for more information on it.
|
@ -24,3 +24,20 @@ files that are not well-known standardized variables are created
|
||||
as immutable files. This doesn't prevent removal - "chattr -i" will work -
|
||||
but it does prevent this kind of failure from being accomplished
|
||||
accidentally.
|
||||
|
||||
.. warning ::
|
||||
When a content of an UEFI variable in /sys/firmware/efi/efivars is
|
||||
displayed, for example using "hexdump", pay attention that the first
|
||||
4 bytes of the output represent the UEFI variable attributes,
|
||||
in little-endian format.
|
||||
|
||||
Practically the output of each efivar is composed of:
|
||||
|
||||
+-----------------------------------+
|
||||
|4_bytes_of_attributes + efivar_data|
|
||||
+-----------------------------------+
|
||||
|
||||
*See also:*
|
||||
|
||||
- Documentation/admin-guide/acpi/ssdt-overlays.rst
|
||||
- Documentation/ABI/stable/sysfs-firmware-efi-vars
|
||||
|
@ -1,3 +1,5 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
============
|
||||
Fiemap Ioctl
|
||||
============
|
||||
@ -10,9 +12,9 @@ returns a list of extents.
|
||||
Request Basics
|
||||
--------------
|
||||
|
||||
A fiemap request is encoded within struct fiemap:
|
||||
A fiemap request is encoded within struct fiemap::
|
||||
|
||||
struct fiemap {
|
||||
struct fiemap {
|
||||
__u64 fm_start; /* logical offset (inclusive) at
|
||||
* which to start mapping (in) */
|
||||
__u64 fm_length; /* logical length of mapping which
|
||||
@ -23,7 +25,7 @@ struct fiemap {
|
||||
__u32 fm_extent_count; /* size of fm_extents array (in) */
|
||||
__u32 fm_reserved;
|
||||
struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
|
||||
};
|
||||
};
|
||||
|
||||
|
||||
fm_start, and fm_length specify the logical range within the file
|
||||
@ -51,12 +53,12 @@ nothing to prevent the file from changing between calls to FIEMAP.
|
||||
|
||||
The following flags can be set in fm_flags:
|
||||
|
||||
* FIEMAP_FLAG_SYNC
|
||||
If this flag is set, the kernel will sync the file before mapping extents.
|
||||
FIEMAP_FLAG_SYNC
|
||||
If this flag is set, the kernel will sync the file before mapping extents.
|
||||
|
||||
* FIEMAP_FLAG_XATTR
|
||||
If this flag is set, the extents returned will describe the inodes
|
||||
extended attribute lookup tree, instead of its data tree.
|
||||
FIEMAP_FLAG_XATTR
|
||||
If this flag is set, the extents returned will describe the inodes
|
||||
extended attribute lookup tree, instead of its data tree.
|
||||
|
||||
|
||||
Extent Mapping
|
||||
@ -75,18 +77,18 @@ complete the requested range and will not have the FIEMAP_EXTENT_LAST
|
||||
flag set (see the next section on extent flags).
|
||||
|
||||
Each extent is described by a single fiemap_extent structure as
|
||||
returned in fm_extents.
|
||||
returned in fm_extents::
|
||||
|
||||
struct fiemap_extent {
|
||||
__u64 fe_logical; /* logical offset in bytes for the start of
|
||||
* the extent */
|
||||
__u64 fe_physical; /* physical offset in bytes for the start
|
||||
* of the extent */
|
||||
__u64 fe_length; /* length in bytes for the extent */
|
||||
__u64 fe_reserved64[2];
|
||||
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
|
||||
__u32 fe_reserved[3];
|
||||
};
|
||||
struct fiemap_extent {
|
||||
__u64 fe_logical; /* logical offset in bytes for the start of
|
||||
* the extent */
|
||||
__u64 fe_physical; /* physical offset in bytes for the start
|
||||
* of the extent */
|
||||
__u64 fe_length; /* length in bytes for the extent */
|
||||
__u64 fe_reserved64[2];
|
||||
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
|
||||
__u32 fe_reserved[3];
|
||||
};
|
||||
|
||||
All offsets and lengths are in bytes and mirror those on disk. It is valid
|
||||
for an extents logical offset to start before the request or its logical
|
||||
@ -114,26 +116,27 @@ worry about all present and future flags which might imply unaligned
|
||||
data. Note that the opposite is not true - it would be valid for
|
||||
FIEMAP_EXTENT_NOT_ALIGNED to appear alone.
|
||||
|
||||
* FIEMAP_EXTENT_LAST
|
||||
This is generally the last extent in the file. A mapping attempt past
|
||||
this extent may return nothing. Some implementations set this flag to
|
||||
indicate this extent is the last one in the range queried by the user
|
||||
(via fiemap->fm_length).
|
||||
FIEMAP_EXTENT_LAST
|
||||
This is generally the last extent in the file. A mapping attempt past
|
||||
this extent may return nothing. Some implementations set this flag to
|
||||
indicate this extent is the last one in the range queried by the user
|
||||
(via fiemap->fm_length).
|
||||
|
||||
* FIEMAP_EXTENT_UNKNOWN
|
||||
The location of this extent is currently unknown. This may indicate
|
||||
the data is stored on an inaccessible volume or that no storage has
|
||||
been allocated for the file yet.
|
||||
FIEMAP_EXTENT_UNKNOWN
|
||||
The location of this extent is currently unknown. This may indicate
|
||||
the data is stored on an inaccessible volume or that no storage has
|
||||
been allocated for the file yet.
|
||||
|
||||
* FIEMAP_EXTENT_DELALLOC
|
||||
- This will also set FIEMAP_EXTENT_UNKNOWN.
|
||||
Delayed allocation - while there is data for this extent, its
|
||||
physical location has not been allocated yet.
|
||||
FIEMAP_EXTENT_DELALLOC
|
||||
This will also set FIEMAP_EXTENT_UNKNOWN.
|
||||
|
||||
* FIEMAP_EXTENT_ENCODED
|
||||
This extent does not consist of plain filesystem blocks but is
|
||||
encoded (e.g. encrypted or compressed). Reading the data in this
|
||||
extent via I/O to the block device will have undefined results.
|
||||
Delayed allocation - while there is data for this extent, its
|
||||
physical location has not been allocated yet.
|
||||
|
||||
FIEMAP_EXTENT_ENCODED
|
||||
This extent does not consist of plain filesystem blocks but is
|
||||
encoded (e.g. encrypted or compressed). Reading the data in this
|
||||
extent via I/O to the block device will have undefined results.
|
||||
|
||||
Note that it is *always* undefined to try to update the data
|
||||
in-place by writing to the indicated location without the
|
||||
@ -145,32 +148,32 @@ unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is
|
||||
clear; user applications must not try reading or writing to the
|
||||
filesystem via the block device under any other circumstances.
|
||||
|
||||
* FIEMAP_EXTENT_DATA_ENCRYPTED
|
||||
- This will also set FIEMAP_EXTENT_ENCODED
|
||||
The data in this extent has been encrypted by the file system.
|
||||
FIEMAP_EXTENT_DATA_ENCRYPTED
|
||||
This will also set FIEMAP_EXTENT_ENCODED
|
||||
The data in this extent has been encrypted by the file system.
|
||||
|
||||
* FIEMAP_EXTENT_NOT_ALIGNED
|
||||
Extent offsets and length are not guaranteed to be block aligned.
|
||||
FIEMAP_EXTENT_NOT_ALIGNED
|
||||
Extent offsets and length are not guaranteed to be block aligned.
|
||||
|
||||
* FIEMAP_EXTENT_DATA_INLINE
|
||||
FIEMAP_EXTENT_DATA_INLINE
|
||||
This will also set FIEMAP_EXTENT_NOT_ALIGNED
|
||||
Data is located within a meta data block.
|
||||
Data is located within a meta data block.
|
||||
|
||||
* FIEMAP_EXTENT_DATA_TAIL
|
||||
FIEMAP_EXTENT_DATA_TAIL
|
||||
This will also set FIEMAP_EXTENT_NOT_ALIGNED
|
||||
Data is packed into a block with data from other files.
|
||||
Data is packed into a block with data from other files.
|
||||
|
||||
* FIEMAP_EXTENT_UNWRITTEN
|
||||
Unwritten extent - the extent is allocated but its data has not been
|
||||
initialized. This indicates the extent's data will be all zero if read
|
||||
through the filesystem but the contents are undefined if read directly from
|
||||
the device.
|
||||
FIEMAP_EXTENT_UNWRITTEN
|
||||
Unwritten extent - the extent is allocated but its data has not been
|
||||
initialized. This indicates the extent's data will be all zero if read
|
||||
through the filesystem but the contents are undefined if read directly from
|
||||
the device.
|
||||
|
||||
* FIEMAP_EXTENT_MERGED
|
||||
This will be set when a file does not support extents, i.e., it uses a block
|
||||
based addressing scheme. Since returning an extent for each block back to
|
||||
userspace would be highly inefficient, the kernel will try to merge most
|
||||
adjacent blocks into 'extents'.
|
||||
FIEMAP_EXTENT_MERGED
|
||||
This will be set when a file does not support extents, i.e., it uses a block
|
||||
based addressing scheme. Since returning an extent for each block back to
|
||||
userspace would be highly inefficient, the kernel will try to merge most
|
||||
adjacent blocks into 'extents'.
|
||||
|
||||
|
||||
VFS -> File System Implementation
|
||||
@ -179,23 +182,23 @@ VFS -> File System Implementation
|
||||
File systems wishing to support fiemap must implement a ->fiemap callback on
|
||||
their inode_operations structure. The fs ->fiemap call is responsible for
|
||||
defining its set of supported fiemap flags, and calling a helper function on
|
||||
each discovered extent:
|
||||
each discovered extent::
|
||||
|
||||
struct inode_operations {
|
||||
struct inode_operations {
|
||||
...
|
||||
|
||||
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
|
||||
u64 len);
|
||||
|
||||
->fiemap is passed struct fiemap_extent_info which describes the
|
||||
fiemap request:
|
||||
fiemap request::
|
||||
|
||||
struct fiemap_extent_info {
|
||||
struct fiemap_extent_info {
|
||||
unsigned int fi_flags; /* Flags as passed from user */
|
||||
unsigned int fi_extents_mapped; /* Number of mapped extents */
|
||||
unsigned int fi_extents_max; /* Size of fiemap_extent array */
|
||||
struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */
|
||||
};
|
||||
};
|
||||
|
||||
It is intended that the file system should not need to access any of this
|
||||
structure directly. Filesystem handlers should be tolerant to signals and return
|
||||
@ -203,9 +206,9 @@ EINTR once fatal signal received.
|
||||
|
||||
|
||||
Flag checking should be done at the beginning of the ->fiemap callback via the
|
||||
fiemap_check_flags() helper:
|
||||
fiemap_check_flags() helper::
|
||||
|
||||
int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
|
||||
int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
|
||||
|
||||
The struct fieinfo should be passed in as received from ioctl_fiemap(). The
|
||||
set of fiemap flags which the fs understands should be passed via fs_flags. If
|
||||
@ -216,10 +219,10 @@ ioctl_fiemap().
|
||||
|
||||
|
||||
For each extent in the request range, the file system should call
|
||||
the helper function, fiemap_fill_next_extent():
|
||||
the helper function, fiemap_fill_next_extent()::
|
||||
|
||||
int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
|
||||
u64 phys, u64 len, u32 flags, u32 dev);
|
||||
int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
|
||||
u64 phys, u64 len, u32 flags, u32 dev);
|
||||
|
||||
fiemap_fill_next_extent() will use the passed values to populate the
|
||||
next free extent in the fm_extents array. 'General' extent flags will
|
@ -1,5 +1,8 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===================================
|
||||
File management in the Linux kernel
|
||||
-----------------------------------
|
||||
===================================
|
||||
|
||||
This document describes how locking for files (struct file)
|
||||
and file descriptor table (struct files) works.
|
||||
@ -34,7 +37,7 @@ appear atomic. Here are the locking rules for
|
||||
the fdtable structure -
|
||||
|
||||
1. All references to the fdtable must be done through
|
||||
the files_fdtable() macro :
|
||||
the files_fdtable() macro::
|
||||
|
||||
struct fdtable *fdt;
|
||||
|
||||
@ -61,7 +64,8 @@ the fdtable structure -
|
||||
4. To look up the file structure given an fd, a reader
|
||||
must use either fcheck() or fcheck_files() APIs. These
|
||||
take care of barrier requirements due to lock-free lookup.
|
||||
An example :
|
||||
|
||||
An example::
|
||||
|
||||
struct file *file;
|
||||
|
||||
@ -77,7 +81,7 @@ the fdtable structure -
|
||||
of the fd (fget()/fget_light()) are lock-free, it is possible
|
||||
that look-up may race with the last put() operation on the
|
||||
file structure. This is avoided using atomic_long_inc_not_zero()
|
||||
on ->f_count :
|
||||
on ->f_count::
|
||||
|
||||
rcu_read_lock();
|
||||
file = fcheck_files(files, fd);
|
||||
@ -106,7 +110,8 @@ the fdtable structure -
|
||||
holding files->file_lock. If ->file_lock is dropped, then
|
||||
another thread expand the files thereby creating a new
|
||||
fdtable and making the earlier fdtable pointer stale.
|
||||
For example :
|
||||
|
||||
For example::
|
||||
|
||||
spin_lock(&files->file_lock);
|
||||
fd = locate_fd(files, file, start);
|
@ -1,3 +1,9 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==============
|
||||
Fuse I/O Modes
|
||||
==============
|
||||
|
||||
Fuse supports the following I/O modes:
|
||||
|
||||
- direct-io
|
@ -24,6 +24,22 @@ algorithms work.
|
||||
splice
|
||||
locking
|
||||
directory-locking
|
||||
devpts
|
||||
dnotify
|
||||
fiemap
|
||||
files
|
||||
locks
|
||||
mandatory-locking
|
||||
mount_api
|
||||
quota
|
||||
seq_file
|
||||
sharedsubtree
|
||||
sysfs-pci
|
||||
sysfs-tagging
|
||||
|
||||
automount-support
|
||||
|
||||
caching/index
|
||||
|
||||
porting
|
||||
|
||||
@ -57,7 +73,10 @@ Documentation for filesystem implementations.
|
||||
befs
|
||||
bfs
|
||||
btrfs
|
||||
cifs/cifsroot
|
||||
ceph
|
||||
coda
|
||||
configfs
|
||||
cramfs
|
||||
debugfs
|
||||
dlmfs
|
||||
@ -73,6 +92,7 @@ Documentation for filesystem implementations.
|
||||
hfsplus
|
||||
hpfs
|
||||
fuse
|
||||
fuse-io
|
||||
inotify
|
||||
isofs
|
||||
nilfs2
|
||||
@ -88,6 +108,7 @@ Documentation for filesystem implementations.
|
||||
ramfs-rootfs-initramfs
|
||||
relay
|
||||
romfs
|
||||
spufs/index
|
||||
squashfs
|
||||
sysfs
|
||||
sysv-fs
|
||||
@ -97,4 +118,6 @@ Documentation for filesystem implementations.
|
||||
udf
|
||||
virtiofs
|
||||
vfat
|
||||
xfs-delayed-logging-design
|
||||
xfs-self-describing-metadata
|
||||
zonefs
|
||||
|
@ -1,4 +1,8 @@
|
||||
File Locking Release Notes
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==========================
|
||||
File Locking Release Notes
|
||||
==========================
|
||||
|
||||
Andy Walker <andy@lysaker.kvaerner.no>
|
||||
|
||||
@ -6,7 +10,7 @@
|
||||
|
||||
|
||||
1. What's New?
|
||||
--------------
|
||||
==============
|
||||
|
||||
1.1 Broken Flock Emulation
|
||||
--------------------------
|
||||
@ -25,7 +29,7 @@ anyway (see the file "Documentation/process/changes.rst".)
|
||||
---------------------------
|
||||
|
||||
1.2.1 Typical Problems - Sendmail
|
||||
---------------------------------
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
Because sendmail was unable to use the old flock() emulation, many sendmail
|
||||
installations use fcntl() instead of flock(). This is true of Slackware 3.0
|
||||
for example. This gave rise to some other subtle problems if sendmail was
|
||||
@ -37,7 +41,7 @@ to lock solid with deadlocked processes.
|
||||
|
||||
|
||||
1.2.2 The Solution
|
||||
------------------
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
The solution I have chosen, after much experimentation and discussion,
|
||||
is to make flock() and fcntl() locks oblivious to each other. Both can
|
||||
exists, and neither will have any effect on the other.
|
||||
@ -54,7 +58,7 @@ fcntl(), with all the problems that implies.
|
||||
---------------------------------------
|
||||
|
||||
Mandatory locking, as described in
|
||||
'Documentation/filesystems/mandatory-locking.txt' was prior to this release a
|
||||
'Documentation/filesystems/mandatory-locking.rst' was prior to this release a
|
||||
general configuration option that was valid for all mounted filesystems. This
|
||||
had a number of inherent dangers, not the least of which was the ability to
|
||||
freeze an NFS server by asking it to read a file for which a mandatory lock
|
@ -1,8 +1,13 @@
|
||||
Mandatory File Locking For The Linux Operating System
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=====================================================
|
||||
Mandatory File Locking For The Linux Operating System
|
||||
=====================================================
|
||||
|
||||
Andy Walker <andy@lysaker.kvaerner.no>
|
||||
|
||||
15 April 1996
|
||||
|
||||
(Updated September 2007)
|
||||
|
||||
0. Why you should avoid mandatory locking
|
||||
@ -53,15 +58,17 @@ possible on existing user code. The scheme is based on marking individual files
|
||||
as candidates for mandatory locking, and using the existing fcntl()/lockf()
|
||||
interface for applying locks just as if they were normal, advisory locks.
|
||||
|
||||
Note 1: In saying "file" in the paragraphs above I am actually not telling
|
||||
the whole truth. System V locking is based on fcntl(). The granularity of
|
||||
fcntl() is such that it allows the locking of byte ranges in files, in addition
|
||||
to entire files, so the mandatory locking rules also have byte level
|
||||
granularity.
|
||||
.. Note::
|
||||
|
||||
Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite
|
||||
borrowing the fcntl() locking scheme from System V. The mandatory locking
|
||||
scheme is defined by the System V Interface Definition (SVID) Version 3.
|
||||
1. In saying "file" in the paragraphs above I am actually not telling
|
||||
the whole truth. System V locking is based on fcntl(). The granularity of
|
||||
fcntl() is such that it allows the locking of byte ranges in files, in
|
||||
addition to entire files, so the mandatory locking rules also have byte
|
||||
level granularity.
|
||||
|
||||
2. POSIX.1 does not specify any scheme for mandatory locking, despite
|
||||
borrowing the fcntl() locking scheme from System V. The mandatory locking
|
||||
scheme is defined by the System V Interface Definition (SVID) Version 3.
|
||||
|
||||
2. Marking a file for mandatory locking
|
||||
---------------------------------------
|
@ -1,8 +1,10 @@
|
||||
====================
|
||||
FILESYSTEM MOUNT API
|
||||
====================
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
CONTENTS
|
||||
====================
|
||||
fILESYSTEM Mount API
|
||||
====================
|
||||
|
||||
.. CONTENTS
|
||||
|
||||
(1) Overview.
|
||||
|
||||
@ -21,8 +23,7 @@ CONTENTS
|
||||
(8) Parameter helper functions.
|
||||
|
||||
|
||||
========
|
||||
OVERVIEW
|
||||
Overview
|
||||
========
|
||||
|
||||
The creation of new mounts is now to be done in a multistep process:
|
||||
@ -43,7 +44,7 @@ The creation of new mounts is now to be done in a multistep process:
|
||||
|
||||
(7) Destroy the context.
|
||||
|
||||
To support this, the file_system_type struct gains two new fields:
|
||||
To support this, the file_system_type struct gains two new fields::
|
||||
|
||||
int (*init_fs_context)(struct fs_context *fc);
|
||||
const struct fs_parameter_description *parameters;
|
||||
@ -57,12 +58,11 @@ Note that security initialisation is done *after* the filesystem is called so
|
||||
that the namespaces may be adjusted first.
|
||||
|
||||
|
||||
======================
|
||||
THE FILESYSTEM CONTEXT
|
||||
The Filesystem context
|
||||
======================
|
||||
|
||||
The creation and reconfiguration of a superblock is governed by a filesystem
|
||||
context. This is represented by the fs_context structure:
|
||||
context. This is represented by the fs_context structure::
|
||||
|
||||
struct fs_context {
|
||||
const struct fs_context_operations *ops;
|
||||
@ -86,78 +86,106 @@ context. This is represented by the fs_context structure:
|
||||
|
||||
The fs_context fields are as follows:
|
||||
|
||||
(*) const struct fs_context_operations *ops
|
||||
* ::
|
||||
|
||||
const struct fs_context_operations *ops
|
||||
|
||||
These are operations that can be done on a filesystem context (see
|
||||
below). This must be set by the ->init_fs_context() file_system_type
|
||||
operation.
|
||||
|
||||
(*) struct file_system_type *fs_type
|
||||
* ::
|
||||
|
||||
struct file_system_type *fs_type
|
||||
|
||||
A pointer to the file_system_type of the filesystem that is being
|
||||
constructed or reconfigured. This retains a reference on the type owner.
|
||||
|
||||
(*) void *fs_private
|
||||
* ::
|
||||
|
||||
void *fs_private
|
||||
|
||||
A pointer to the file system's private data. This is where the filesystem
|
||||
will need to store any options it parses.
|
||||
|
||||
(*) struct dentry *root
|
||||
* ::
|
||||
|
||||
struct dentry *root
|
||||
|
||||
A pointer to the root of the mountable tree (and indirectly, the
|
||||
superblock thereof). This is filled in by the ->get_tree() op. If this
|
||||
is set, an active reference on root->d_sb must also be held.
|
||||
|
||||
(*) struct user_namespace *user_ns
|
||||
(*) struct net *net_ns
|
||||
* ::
|
||||
|
||||
struct user_namespace *user_ns
|
||||
struct net *net_ns
|
||||
|
||||
There are a subset of the namespaces in use by the invoking process. They
|
||||
retain references on each namespace. The subscribed namespaces may be
|
||||
replaced by the filesystem to reflect other sources, such as the parent
|
||||
mount superblock on an automount.
|
||||
|
||||
(*) const struct cred *cred
|
||||
* ::
|
||||
|
||||
const struct cred *cred
|
||||
|
||||
The mounter's credentials. This retains a reference on the credentials.
|
||||
|
||||
(*) char *source
|
||||
* ::
|
||||
|
||||
char *source
|
||||
|
||||
This specifies the source. It may be a block device (e.g. /dev/sda1) or
|
||||
something more exotic, such as the "host:/path" that NFS desires.
|
||||
|
||||
(*) char *subtype
|
||||
* ::
|
||||
|
||||
char *subtype
|
||||
|
||||
This is a string to be added to the type displayed in /proc/mounts to
|
||||
qualify it (used by FUSE). This is available for the filesystem to set if
|
||||
desired.
|
||||
|
||||
(*) void *security
|
||||
* ::
|
||||
|
||||
void *security
|
||||
|
||||
A place for the LSMs to hang their security data for the superblock. The
|
||||
relevant security operations are described below.
|
||||
|
||||
(*) void *s_fs_info
|
||||
* ::
|
||||
|
||||
void *s_fs_info
|
||||
|
||||
The proposed s_fs_info for a new superblock, set in the superblock by
|
||||
sget_fc(). This can be used to distinguish superblocks.
|
||||
|
||||
(*) unsigned int sb_flags
|
||||
(*) unsigned int sb_flags_mask
|
||||
* ::
|
||||
|
||||
unsigned int sb_flags
|
||||
unsigned int sb_flags_mask
|
||||
|
||||
Which bits SB_* flags are to be set/cleared in super_block::s_flags.
|
||||
|
||||
(*) unsigned int s_iflags
|
||||
* ::
|
||||
|
||||
unsigned int s_iflags
|
||||
|
||||
These will be bitwise-OR'd with s->s_iflags when a superblock is created.
|
||||
|
||||
(*) enum fs_context_purpose
|
||||
* ::
|
||||
|
||||
enum fs_context_purpose
|
||||
|
||||
This indicates the purpose for which the context is intended. The
|
||||
available values are:
|
||||
|
||||
FS_CONTEXT_FOR_MOUNT, -- New superblock for explicit mount
|
||||
FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount
|
||||
FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount
|
||||
========================== ======================================
|
||||
FS_CONTEXT_FOR_MOUNT, New superblock for explicit mount
|
||||
FS_CONTEXT_FOR_SUBMOUNT New automatic submount of extant mount
|
||||
FS_CONTEXT_FOR_RECONFIGURE Change an existing mount
|
||||
========================== ======================================
|
||||
|
||||
The mount context is created by calling vfs_new_fs_context() or
|
||||
vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
|
||||
@ -176,11 +204,10 @@ mount context. For instance, NFS might pin the appropriate protocol version
|
||||
module.
|
||||
|
||||
|
||||
=================================
|
||||
THE FILESYSTEM CONTEXT OPERATIONS
|
||||
The Filesystem Context Operations
|
||||
=================================
|
||||
|
||||
The filesystem context points to a table of operations:
|
||||
The filesystem context points to a table of operations::
|
||||
|
||||
struct fs_context_operations {
|
||||
void (*free)(struct fs_context *fc);
|
||||
@ -195,24 +222,32 @@ The filesystem context points to a table of operations:
|
||||
These operations are invoked by the various stages of the mount procedure to
|
||||
manage the filesystem context. They are as follows:
|
||||
|
||||
(*) void (*free)(struct fs_context *fc);
|
||||
* ::
|
||||
|
||||
void (*free)(struct fs_context *fc);
|
||||
|
||||
Called to clean up the filesystem-specific part of the filesystem context
|
||||
when the context is destroyed. It should be aware that parts of the
|
||||
context may have been removed and NULL'd out by ->get_tree().
|
||||
|
||||
(*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
|
||||
* ::
|
||||
|
||||
int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
|
||||
|
||||
Called when a filesystem context has been duplicated to duplicate the
|
||||
filesystem-private data. An error may be returned to indicate failure to
|
||||
do this.
|
||||
|
||||
[!] Note that even if this fails, put_fs_context() will be called
|
||||
.. Warning::
|
||||
|
||||
Note that even if this fails, put_fs_context() will be called
|
||||
immediately thereafter, so ->dup() *must* make the
|
||||
filesystem-private data safe for ->free().
|
||||
|
||||
(*) int (*parse_param)(struct fs_context *fc,
|
||||
struct struct fs_parameter *param);
|
||||
* ::
|
||||
|
||||
int (*parse_param)(struct fs_context *fc,
|
||||
struct struct fs_parameter *param);
|
||||
|
||||
Called when a parameter is being added to the filesystem context. param
|
||||
points to the key name and maybe a value object. VFS-specific options
|
||||
@ -224,7 +259,9 @@ manage the filesystem context. They are as follows:
|
||||
|
||||
If successful, 0 should be returned or a negative error code otherwise.
|
||||
|
||||
(*) int (*parse_monolithic)(struct fs_context *fc, void *data);
|
||||
* ::
|
||||
|
||||
int (*parse_monolithic)(struct fs_context *fc, void *data);
|
||||
|
||||
Called when the mount(2) system call is invoked to pass the entire data
|
||||
page in one go. If this is expected to be just a list of "key[=val]"
|
||||
@ -236,7 +273,9 @@ manage the filesystem context. They are as follows:
|
||||
finds it's the standard key-val list then it may pass it off to
|
||||
generic_parse_monolithic().
|
||||
|
||||
(*) int (*get_tree)(struct fs_context *fc);
|
||||
* ::
|
||||
|
||||
int (*get_tree)(struct fs_context *fc);
|
||||
|
||||
Called to get or create the mountable root and superblock, using the
|
||||
information stored in the filesystem context (reconfiguration goes via a
|
||||
@ -249,7 +288,9 @@ manage the filesystem context. They are as follows:
|
||||
The phase on a userspace-driven context will be set to only allow this to
|
||||
be called once on any particular context.
|
||||
|
||||
(*) int (*reconfigure)(struct fs_context *fc);
|
||||
* ::
|
||||
|
||||
int (*reconfigure)(struct fs_context *fc);
|
||||
|
||||
Called to effect reconfiguration of a superblock using information stored
|
||||
in the filesystem context. It may detach any resources it desires from
|
||||
@ -259,19 +300,20 @@ manage the filesystem context. They are as follows:
|
||||
On success it should return 0. In the case of an error, it should return
|
||||
a negative error code.
|
||||
|
||||
[NOTE] reconfigure is intended as a replacement for remount_fs.
|
||||
.. Note:: reconfigure is intended as a replacement for remount_fs.
|
||||
|
||||
|
||||
===========================
|
||||
FILESYSTEM CONTEXT SECURITY
|
||||
Filesystem context Security
|
||||
===========================
|
||||
|
||||
The filesystem context contains a security pointer that the LSMs can use for
|
||||
building up a security context for the superblock to be mounted. There are a
|
||||
number of operations used by the new mount code for this purpose:
|
||||
|
||||
(*) int security_fs_context_alloc(struct fs_context *fc,
|
||||
struct dentry *reference);
|
||||
* ::
|
||||
|
||||
int security_fs_context_alloc(struct fs_context *fc,
|
||||
struct dentry *reference);
|
||||
|
||||
Called to initialise fc->security (which is preset to NULL) and allocate
|
||||
any resources needed. It should return 0 on success or a negative error
|
||||
@ -283,22 +325,28 @@ number of operations used by the new mount code for this purpose:
|
||||
non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case
|
||||
it indicates the automount point.
|
||||
|
||||
(*) int security_fs_context_dup(struct fs_context *fc,
|
||||
struct fs_context *src_fc);
|
||||
* ::
|
||||
|
||||
int security_fs_context_dup(struct fs_context *fc,
|
||||
struct fs_context *src_fc);
|
||||
|
||||
Called to initialise fc->security (which is preset to NULL) and allocate
|
||||
any resources needed. The original filesystem context is pointed to by
|
||||
src_fc and may be used for reference. It should return 0 on success or a
|
||||
negative error code on failure.
|
||||
|
||||
(*) void security_fs_context_free(struct fs_context *fc);
|
||||
* ::
|
||||
|
||||
void security_fs_context_free(struct fs_context *fc);
|
||||
|
||||
Called to clean up anything attached to fc->security. Note that the
|
||||
contents may have been transferred to a superblock and the pointer cleared
|
||||
during get_tree.
|
||||
|
||||
(*) int security_fs_context_parse_param(struct fs_context *fc,
|
||||
struct fs_parameter *param);
|
||||
* ::
|
||||
|
||||
int security_fs_context_parse_param(struct fs_context *fc,
|
||||
struct fs_parameter *param);
|
||||
|
||||
Called for each mount parameter, including the source. The arguments are
|
||||
as for the ->parse_param() method. It should return 0 to indicate that
|
||||
@ -310,7 +358,9 @@ number of operations used by the new mount code for this purpose:
|
||||
(provided the value pointer is NULL'd out). If it is stolen, 1 must be
|
||||
returned to prevent it being passed to the filesystem.
|
||||
|
||||
(*) int security_fs_context_validate(struct fs_context *fc);
|
||||
* ::
|
||||
|
||||
int security_fs_context_validate(struct fs_context *fc);
|
||||
|
||||
Called after all the options have been parsed to validate the collection
|
||||
as a whole and to do any necessary allocation so that
|
||||
@ -320,36 +370,43 @@ number of operations used by the new mount code for this purpose:
|
||||
In the case of reconfiguration, the target superblock will be accessible
|
||||
via fc->root.
|
||||
|
||||
(*) int security_sb_get_tree(struct fs_context *fc);
|
||||
* ::
|
||||
|
||||
int security_sb_get_tree(struct fs_context *fc);
|
||||
|
||||
Called during the mount procedure to verify that the specified superblock
|
||||
is allowed to be mounted and to transfer the security data there. It
|
||||
should return 0 or a negative error code.
|
||||
|
||||
(*) void security_sb_reconfigure(struct fs_context *fc);
|
||||
* ::
|
||||
|
||||
void security_sb_reconfigure(struct fs_context *fc);
|
||||
|
||||
Called to apply any reconfiguration to an LSM's context. It must not
|
||||
fail. Error checking and resource allocation must be done in advance by
|
||||
the parameter parsing and validation hooks.
|
||||
|
||||
(*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
|
||||
unsigned int mnt_flags);
|
||||
* ::
|
||||
|
||||
int security_sb_mountpoint(struct fs_context *fc,
|
||||
struct path *mountpoint,
|
||||
unsigned int mnt_flags);
|
||||
|
||||
Called during the mount procedure to verify that the root dentry attached
|
||||
to the context is permitted to be attached to the specified mountpoint.
|
||||
It should return 0 on success or a negative error code on failure.
|
||||
|
||||
|
||||
==========================
|
||||
VFS FILESYSTEM CONTEXT API
|
||||
VFS Filesystem context API
|
||||
==========================
|
||||
|
||||
There are four operations for creating a filesystem context and one for
|
||||
destroying a context:
|
||||
|
||||
(*) struct fs_context *fs_context_for_mount(
|
||||
struct file_system_type *fs_type,
|
||||
unsigned int sb_flags);
|
||||
* ::
|
||||
|
||||
struct fs_context *fs_context_for_mount(struct file_system_type *fs_type,
|
||||
unsigned int sb_flags);
|
||||
|
||||
Allocate a filesystem context for the purpose of setting up a new mount,
|
||||
whether that be with a new superblock or sharing an existing one. This
|
||||
@ -359,7 +416,9 @@ destroying a context:
|
||||
fs_type specifies the filesystem type that will manage the context and
|
||||
sb_flags presets the superblock flags stored therein.
|
||||
|
||||
(*) struct fs_context *fs_context_for_reconfigure(
|
||||
* ::
|
||||
|
||||
struct fs_context *fs_context_for_reconfigure(
|
||||
struct dentry *dentry,
|
||||
unsigned int sb_flags,
|
||||
unsigned int sb_flags_mask);
|
||||
@ -369,7 +428,9 @@ destroying a context:
|
||||
configured. sb_flags and sb_flags_mask indicate which superblock flags
|
||||
need changing and to what.
|
||||
|
||||
(*) struct fs_context *fs_context_for_submount(
|
||||
* ::
|
||||
|
||||
struct fs_context *fs_context_for_submount(
|
||||
struct file_system_type *fs_type,
|
||||
struct dentry *reference);
|
||||
|
||||
@ -382,7 +443,9 @@ destroying a context:
|
||||
Note that it's not a requirement that the reference dentry be of the same
|
||||
filesystem type as fs_type.
|
||||
|
||||
(*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
|
||||
* ::
|
||||
|
||||
struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
|
||||
|
||||
Duplicate a filesystem context, copying any options noted and duplicating
|
||||
or additionally referencing any resources held therein. This is available
|
||||
@ -392,14 +455,18 @@ destroying a context:
|
||||
|
||||
The purpose in the new context is inherited from the old one.
|
||||
|
||||
(*) void put_fs_context(struct fs_context *fc);
|
||||
* ::
|
||||
|
||||
void put_fs_context(struct fs_context *fc);
|
||||
|
||||
Destroy a filesystem context, releasing any resources it holds. This
|
||||
calls the ->free() operation. This is intended to be called by anyone who
|
||||
created a filesystem context.
|
||||
|
||||
[!] filesystem contexts are not refcounted, so this causes unconditional
|
||||
destruction.
|
||||
.. Warning::
|
||||
|
||||
filesystem contexts are not refcounted, so this causes unconditional
|
||||
destruction.
|
||||
|
||||
In all the above operations, apart from the put op, the return is a mount
|
||||
context pointer or a negative error code.
|
||||
@ -407,8 +474,10 @@ context pointer or a negative error code.
|
||||
For the remaining operations, if an error occurs, a negative error code will be
|
||||
returned.
|
||||
|
||||
(*) int vfs_parse_fs_param(struct fs_context *fc,
|
||||
struct fs_parameter *param);
|
||||
* ::
|
||||
|
||||
int vfs_parse_fs_param(struct fs_context *fc,
|
||||
struct fs_parameter *param);
|
||||
|
||||
Supply a single mount parameter to the filesystem context. This include
|
||||
the specification of the source/device which is specified as the "source"
|
||||
@ -423,53 +492,64 @@ returned.
|
||||
|
||||
The parameter value is typed and can be one of:
|
||||
|
||||
fs_value_is_flag, Parameter not given a value.
|
||||
fs_value_is_string, Value is a string
|
||||
fs_value_is_blob, Value is a binary blob
|
||||
fs_value_is_filename, Value is a filename* + dirfd
|
||||
fs_value_is_file, Value is an open file (file*)
|
||||
==================== =============================
|
||||
fs_value_is_flag Parameter not given a value
|
||||
fs_value_is_string Value is a string
|
||||
fs_value_is_blob Value is a binary blob
|
||||
fs_value_is_filename Value is a filename* + dirfd
|
||||
fs_value_is_file Value is an open file (file*)
|
||||
==================== =============================
|
||||
|
||||
If there is a value, that value is stored in a union in the struct in one
|
||||
of param->{string,blob,name,file}. Note that the function may steal and
|
||||
clear the pointer, but then becomes responsible for disposing of the
|
||||
object.
|
||||
|
||||
(*) int vfs_parse_fs_string(struct fs_context *fc, const char *key,
|
||||
const char *value, size_t v_size);
|
||||
* ::
|
||||
|
||||
int vfs_parse_fs_string(struct fs_context *fc, const char *key,
|
||||
const char *value, size_t v_size);
|
||||
|
||||
A wrapper around vfs_parse_fs_param() that copies the value string it is
|
||||
passed.
|
||||
|
||||
(*) int generic_parse_monolithic(struct fs_context *fc, void *data);
|
||||
* ::
|
||||
|
||||
int generic_parse_monolithic(struct fs_context *fc, void *data);
|
||||
|
||||
Parse a sys_mount() data page, assuming the form to be a text list
|
||||
consisting of key[=val] options separated by commas. Each item in the
|
||||
list is passed to vfs_mount_option(). This is the default when the
|
||||
->parse_monolithic() method is NULL.
|
||||
|
||||
(*) int vfs_get_tree(struct fs_context *fc);
|
||||
* ::
|
||||
|
||||
int vfs_get_tree(struct fs_context *fc);
|
||||
|
||||
Get or create the mountable root and superblock, using the parameters in
|
||||
the filesystem context to select/configure the superblock. This invokes
|
||||
the ->get_tree() method.
|
||||
|
||||
(*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
|
||||
* ::
|
||||
|
||||
struct vfsmount *vfs_create_mount(struct fs_context *fc);
|
||||
|
||||
Create a mount given the parameters in the specified filesystem context.
|
||||
Note that this does not attach the mount to anything.
|
||||
|
||||
|
||||
===========================
|
||||
SUPERBLOCK CREATION HELPERS
|
||||
Superblock Creation Helpers
|
||||
===========================
|
||||
|
||||
A number of VFS helpers are available for use by filesystems for the creation
|
||||
or looking up of superblocks.
|
||||
|
||||
(*) struct super_block *
|
||||
sget_fc(struct fs_context *fc,
|
||||
int (*test)(struct super_block *sb, struct fs_context *fc),
|
||||
int (*set)(struct super_block *sb, struct fs_context *fc));
|
||||
* ::
|
||||
|
||||
struct super_block *
|
||||
sget_fc(struct fs_context *fc,
|
||||
int (*test)(struct super_block *sb, struct fs_context *fc),
|
||||
int (*set)(struct super_block *sb, struct fs_context *fc));
|
||||
|
||||
This is the core routine. If test is non-NULL, it searches for an
|
||||
existing superblock matching the criteria held in the fs_context, using
|
||||
@ -482,10 +562,12 @@ or looking up of superblocks.
|
||||
|
||||
The following helpers all wrap sget_fc():
|
||||
|
||||
(*) int vfs_get_super(struct fs_context *fc,
|
||||
enum vfs_get_super_keying keying,
|
||||
int (*fill_super)(struct super_block *sb,
|
||||
struct fs_context *fc))
|
||||
* ::
|
||||
|
||||
int vfs_get_super(struct fs_context *fc,
|
||||
enum vfs_get_super_keying keying,
|
||||
int (*fill_super)(struct super_block *sb,
|
||||
struct fs_context *fc))
|
||||
|
||||
This creates/looks up a deviceless superblock. The keying indicates how
|
||||
many superblocks of this type may exist and in what manner they may be
|
||||
@ -515,14 +597,14 @@ PARAMETER DESCRIPTION
|
||||
=====================
|
||||
|
||||
Parameters are described using structures defined in linux/fs_parser.h.
|
||||
There's a core description struct that links everything together:
|
||||
There's a core description struct that links everything together::
|
||||
|
||||
struct fs_parameter_description {
|
||||
const struct fs_parameter_spec *specs;
|
||||
const struct fs_parameter_enum *enums;
|
||||
};
|
||||
|
||||
For example:
|
||||
For example::
|
||||
|
||||
enum {
|
||||
Opt_autocell,
|
||||
@ -539,10 +621,12 @@ For example:
|
||||
|
||||
The members are as follows:
|
||||
|
||||
(1) const struct fs_parameter_specification *specs;
|
||||
(1) ::
|
||||
|
||||
const struct fs_parameter_specification *specs;
|
||||
|
||||
Table of parameter specifications, terminated with a null entry, where the
|
||||
entries are of type:
|
||||
entries are of type::
|
||||
|
||||
struct fs_parameter_spec {
|
||||
const char *name;
|
||||
@ -558,6 +642,7 @@ The members are as follows:
|
||||
|
||||
The 'type' field indicates the desired value type and must be one of:
|
||||
|
||||
======================= ======================= =====================
|
||||
TYPE NAME EXPECTED VALUE RESULT IN
|
||||
======================= ======================= =====================
|
||||
fs_param_is_flag No value n/a
|
||||
@ -573,19 +658,23 @@ The members are as follows:
|
||||
fs_param_is_blockdev Blockdev path * Needs lookup
|
||||
fs_param_is_path Path * Needs lookup
|
||||
fs_param_is_fd File descriptor result->int_32
|
||||
======================= ======================= =====================
|
||||
|
||||
Note that if the value is of fs_param_is_bool type, fs_parse() will try
|
||||
to match any string value against "0", "1", "no", "yes", "false", "true".
|
||||
|
||||
Each parameter can also be qualified with 'flags':
|
||||
|
||||
======================= ================================================
|
||||
fs_param_v_optional The value is optional
|
||||
fs_param_neg_with_no result->negated set if key is prefixed with "no"
|
||||
fs_param_neg_with_empty result->negated set if value is ""
|
||||
fs_param_deprecated The parameter is deprecated.
|
||||
======================= ================================================
|
||||
|
||||
These are wrapped with a number of convenience wrappers:
|
||||
|
||||
======================= ===============================================
|
||||
MACRO SPECIFIES
|
||||
======================= ===============================================
|
||||
fsparam_flag() fs_param_is_flag
|
||||
@ -602,9 +691,10 @@ The members are as follows:
|
||||
fsparam_bdev() fs_param_is_blockdev
|
||||
fsparam_path() fs_param_is_path
|
||||
fsparam_fd() fs_param_is_fd
|
||||
======================= ===============================================
|
||||
|
||||
all of which take two arguments, name string and option number - for
|
||||
example:
|
||||
example::
|
||||
|
||||
static const struct fs_parameter_spec afs_param_specs[] = {
|
||||
fsparam_flag ("autocell", Opt_autocell),
|
||||
@ -618,10 +708,12 @@ The members are as follows:
|
||||
of arguments to specify the type and the flags for anything that doesn't
|
||||
match one of the above macros.
|
||||
|
||||
(2) const struct fs_parameter_enum *enums;
|
||||
(2) ::
|
||||
|
||||
const struct fs_parameter_enum *enums;
|
||||
|
||||
Table of enum value names to integer mappings, terminated with a null
|
||||
entry. This is of type:
|
||||
entry. This is of type::
|
||||
|
||||
struct fs_parameter_enum {
|
||||
u8 opt;
|
||||
@ -630,7 +722,7 @@ The members are as follows:
|
||||
};
|
||||
|
||||
Where the array is an unsorted list of { parameter ID, name }-keyed
|
||||
elements that indicate the value to map to, e.g.:
|
||||
elements that indicate the value to map to, e.g.::
|
||||
|
||||
static const struct fs_parameter_enum afs_param_enums[] = {
|
||||
{ Opt_bar, "x", 1},
|
||||
@ -648,18 +740,19 @@ CONFIG_VALIDATE_FS_PARSER=y) and will allow the description to be queried from
|
||||
userspace using the fsinfo() syscall.
|
||||
|
||||
|
||||
==========================
|
||||
PARAMETER HELPER FUNCTIONS
|
||||
Parameter Helper Functions
|
||||
==========================
|
||||
|
||||
A number of helper functions are provided to help a filesystem or an LSM
|
||||
process the parameters it is given.
|
||||
|
||||
(*) int lookup_constant(const struct constant_table tbl[],
|
||||
const char *name, int not_found);
|
||||
* ::
|
||||
|
||||
int lookup_constant(const struct constant_table tbl[],
|
||||
const char *name, int not_found);
|
||||
|
||||
Look up a constant by name in a table of name -> integer mappings. The
|
||||
table is an array of elements of the following type:
|
||||
table is an array of elements of the following type::
|
||||
|
||||
struct constant_table {
|
||||
const char *name;
|
||||
@ -669,9 +762,11 @@ process the parameters it is given.
|
||||
If a match is found, the corresponding value is returned. If a match
|
||||
isn't found, the not_found value is returned instead.
|
||||
|
||||
(*) bool validate_constant_table(const struct constant_table *tbl,
|
||||
size_t tbl_size,
|
||||
int low, int high, int special);
|
||||
* ::
|
||||
|
||||
bool validate_constant_table(const struct constant_table *tbl,
|
||||
size_t tbl_size,
|
||||
int low, int high, int special);
|
||||
|
||||
Validate a constant table. Checks that all the elements are appropriately
|
||||
ordered, that there are no duplicates and that the values are between low
|
||||
@ -682,16 +777,20 @@ process the parameters it is given.
|
||||
If all is good, true is returned. If the table is invalid, errors are
|
||||
logged to dmesg and false is returned.
|
||||
|
||||
(*) bool fs_validate_description(const struct fs_parameter_description *desc);
|
||||
* ::
|
||||
|
||||
bool fs_validate_description(const struct fs_parameter_description *desc);
|
||||
|
||||
This performs some validation checks on a parameter description. It
|
||||
returns true if the description is good and false if it is not. It will
|
||||
log errors to dmesg if validation fails.
|
||||
|
||||
(*) int fs_parse(struct fs_context *fc,
|
||||
const struct fs_parameter_description *desc,
|
||||
struct fs_parameter *param,
|
||||
struct fs_parse_result *result);
|
||||
* ::
|
||||
|
||||
int fs_parse(struct fs_context *fc,
|
||||
const struct fs_parameter_description *desc,
|
||||
struct fs_parameter *param,
|
||||
struct fs_parse_result *result);
|
||||
|
||||
This is the main interpreter of parameters. It uses the parameter
|
||||
description to look up a parameter by key name and to convert that to an
|
||||
@ -711,14 +810,16 @@ process the parameters it is given.
|
||||
parameter is matched, but the value is erroneous, -EINVAL will be
|
||||
returned; otherwise the parameter's option number will be returned.
|
||||
|
||||
(*) int fs_lookup_param(struct fs_context *fc,
|
||||
struct fs_parameter *value,
|
||||
bool want_bdev,
|
||||
struct path *_path);
|
||||
* ::
|
||||
|
||||
int fs_lookup_param(struct fs_context *fc,
|
||||
struct fs_parameter *value,
|
||||
bool want_bdev,
|
||||
struct path *_path);
|
||||
|
||||
This takes a parameter that carries a string or filename type and attempts
|
||||
to do a path lookup on it. If the parameter expects a blockdev, a check
|
||||
is made that the inode actually represents one.
|
||||
|
||||
Returns 0 if successful and *_path will be set; returns a negative error
|
||||
code if not.
|
||||
Returns 0 if successful and ``*_path`` will be set; returns a negative
|
||||
error code if not.
|
@ -119,9 +119,7 @@ it comes to that question::
|
||||
|
||||
/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
|
||||
|
||||
Create an /etc/pvfs2tab file::
|
||||
|
||||
Localhost is fine for your pvfs2tab file:
|
||||
Create an /etc/pvfs2tab file (localhost is fine)::
|
||||
|
||||
echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
|
||||
/etc/pvfs2tab
|
||||
|
@ -1871,7 +1871,7 @@ unbindable mount is unbindable
|
||||
|
||||
For more information on mount propagation see:
|
||||
|
||||
Documentation/filesystems/sharedsubtree.txt
|
||||
Documentation/filesystems/sharedsubtree.rst
|
||||
|
||||
|
||||
3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
|
||||
|
@ -1,4 +1,6 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===============
|
||||
Quota subsystem
|
||||
===============
|
||||
|
||||
@ -39,6 +41,7 @@ Currently, the interface supports only one message type QUOTA_NL_C_WARNING.
|
||||
This command is used to send a notification about any of the above mentioned
|
||||
events. Each message has six attributes. These are (type of the argument is
|
||||
in parentheses):
|
||||
|
||||
QUOTA_NL_A_QTYPE (u32)
|
||||
- type of quota being exceeded (one of USRQUOTA, GRPQUOTA)
|
||||
QUOTA_NL_A_EXCESS_ID (u64)
|
||||
@ -48,20 +51,34 @@ in parentheses):
|
||||
- UID of a user who caused the event
|
||||
QUOTA_NL_A_WARNING (u32)
|
||||
- what kind of limit is exceeded:
|
||||
QUOTA_NL_IHARDWARN - inode hardlimit
|
||||
QUOTA_NL_ISOFTLONGWARN - inode softlimit is exceeded longer
|
||||
than given grace period
|
||||
QUOTA_NL_ISOFTWARN - inode softlimit
|
||||
QUOTA_NL_BHARDWARN - space (block) hardlimit
|
||||
QUOTA_NL_BSOFTLONGWARN - space (block) softlimit is exceeded
|
||||
longer than given grace period.
|
||||
QUOTA_NL_BSOFTWARN - space (block) softlimit
|
||||
|
||||
QUOTA_NL_IHARDWARN
|
||||
inode hardlimit
|
||||
QUOTA_NL_ISOFTLONGWARN
|
||||
inode softlimit is exceeded longer
|
||||
than given grace period
|
||||
QUOTA_NL_ISOFTWARN
|
||||
inode softlimit
|
||||
QUOTA_NL_BHARDWARN
|
||||
space (block) hardlimit
|
||||
QUOTA_NL_BSOFTLONGWARN
|
||||
space (block) softlimit is exceeded
|
||||
longer than given grace period.
|
||||
QUOTA_NL_BSOFTWARN
|
||||
space (block) softlimit
|
||||
|
||||
- four warnings are also defined for the event when user stops
|
||||
exceeding some limit:
|
||||
QUOTA_NL_IHARDBELOW - inode hardlimit
|
||||
QUOTA_NL_ISOFTBELOW - inode softlimit
|
||||
QUOTA_NL_BHARDBELOW - space (block) hardlimit
|
||||
QUOTA_NL_BSOFTBELOW - space (block) softlimit
|
||||
|
||||
QUOTA_NL_IHARDBELOW
|
||||
inode hardlimit
|
||||
QUOTA_NL_ISOFTBELOW
|
||||
inode softlimit
|
||||
QUOTA_NL_BHARDBELOW
|
||||
space (block) hardlimit
|
||||
QUOTA_NL_BSOFTBELOW
|
||||
space (block) softlimit
|
||||
|
||||
QUOTA_NL_A_DEV_MAJOR (u32)
|
||||
- major number of a device with the affected filesystem
|
||||
QUOTA_NL_A_DEV_MINOR (u32)
|
@ -71,7 +71,7 @@ be allowed write access to a ramfs mount.
|
||||
|
||||
A ramfs derivative called tmpfs was created to add size limits, and the ability
|
||||
to write the data to swap space. Normal users can be allowed write access to
|
||||
tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information.
|
||||
tmpfs mounts. See Documentation/filesystems/tmpfs.rst for more information.
|
||||
|
||||
What is rootfs?
|
||||
---------------
|
||||
|
@ -1,6 +1,11 @@
|
||||
The seq_file interface
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
======================
|
||||
The seq_file Interface
|
||||
======================
|
||||
|
||||
Copyright 2003 Jonathan Corbet <corbet@lwn.net>
|
||||
|
||||
This file is originally from the LWN.net Driver Porting series at
|
||||
http://lwn.net/Articles/driver-porting/
|
||||
|
||||
@ -43,7 +48,7 @@ loadable module which creates a file called /proc/sequence. The file, when
|
||||
read, simply produces a set of increasing integer values, one per line. The
|
||||
sequence will continue until the user loses patience and finds something
|
||||
better to do. The file is seekable, in that one can do something like the
|
||||
following:
|
||||
following::
|
||||
|
||||
dd if=/proc/sequence of=out1 count=1
|
||||
dd if=/proc/sequence skip=1 of=out2 count=1
|
||||
@ -55,16 +60,18 @@ wanting to see the full source for this module can find it at
|
||||
http://lwn.net/Articles/22359/).
|
||||
|
||||
Deprecated create_proc_entry
|
||||
============================
|
||||
|
||||
Note that the above article uses create_proc_entry which was removed in
|
||||
kernel 3.10. Current versions require the following update
|
||||
kernel 3.10. Current versions require the following update::
|
||||
|
||||
- entry = create_proc_entry("sequence", 0, NULL);
|
||||
- if (entry)
|
||||
- entry->proc_fops = &ct_file_ops;
|
||||
+ entry = proc_create("sequence", 0, NULL, &ct_file_ops);
|
||||
- entry = create_proc_entry("sequence", 0, NULL);
|
||||
- if (entry)
|
||||
- entry->proc_fops = &ct_file_ops;
|
||||
+ entry = proc_create("sequence", 0, NULL, &ct_file_ops);
|
||||
|
||||
The iterator interface
|
||||
======================
|
||||
|
||||
Modules implementing a virtual file with seq_file must implement an
|
||||
iterator object that allows stepping through the data of interest
|
||||
@ -99,7 +106,7 @@ position. The pos passed to start() will always be either zero, or
|
||||
the most recent pos used in the previous session.
|
||||
|
||||
For our simple sequence example,
|
||||
the start() function looks like:
|
||||
the start() function looks like::
|
||||
|
||||
static void *ct_seq_start(struct seq_file *s, loff_t *pos)
|
||||
{
|
||||
@ -129,7 +136,7 @@ move the iterator forward to the next position in the sequence. The
|
||||
example module can simply increment the position by one; more useful
|
||||
modules will do what is needed to step through some data structure. The
|
||||
next() function returns a new iterator, or NULL if the sequence is
|
||||
complete. Here's the example version:
|
||||
complete. Here's the example version::
|
||||
|
||||
static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos)
|
||||
{
|
||||
@ -141,10 +148,10 @@ complete. Here's the example version:
|
||||
The stop() function closes a session; its job, of course, is to clean
|
||||
up. If dynamic memory is allocated for the iterator, stop() is the
|
||||
place to free it; if a lock was taken by start(), stop() must release
|
||||
that lock. The value that *pos was set to by the last next() call
|
||||
that lock. The value that ``*pos`` was set to by the last next() call
|
||||
before stop() is remembered, and used for the first start() call of
|
||||
the next session unless lseek() has been called on the file; in that
|
||||
case next start() will be asked to start at position zero.
|
||||
case next start() will be asked to start at position zero::
|
||||
|
||||
static void ct_seq_stop(struct seq_file *s, void *v)
|
||||
{
|
||||
@ -152,7 +159,7 @@ case next start() will be asked to start at position zero.
|
||||
}
|
||||
|
||||
Finally, the show() function should format the object currently pointed to
|
||||
by the iterator for output. The example module's show() function is:
|
||||
by the iterator for output. The example module's show() function is::
|
||||
|
||||
static int ct_seq_show(struct seq_file *s, void *v)
|
||||
{
|
||||
@ -169,7 +176,7 @@ generated output before returning SEQ_SKIP, that output will be dropped.
|
||||
|
||||
We will look at seq_printf() in a moment. But first, the definition of the
|
||||
seq_file iterator is finished by creating a seq_operations structure with
|
||||
the four functions we have just defined:
|
||||
the four functions we have just defined::
|
||||
|
||||
static const struct seq_operations ct_seq_ops = {
|
||||
.start = ct_seq_start,
|
||||
@ -194,6 +201,7 @@ other locks while the iterator is active.
|
||||
|
||||
|
||||
Formatted output
|
||||
================
|
||||
|
||||
The seq_file code manages positioning within the output created by the
|
||||
iterator and getting it into the user's buffer. But, for that to work, that
|
||||
@ -203,7 +211,7 @@ been defined which make this task easy.
|
||||
Most code will simply use seq_printf(), which works pretty much like
|
||||
printk(), but which requires the seq_file pointer as an argument.
|
||||
|
||||
For straight character output, the following functions may be used:
|
||||
For straight character output, the following functions may be used::
|
||||
|
||||
seq_putc(struct seq_file *m, char c);
|
||||
seq_puts(struct seq_file *m, const char *s);
|
||||
@ -213,7 +221,7 @@ The first two output a single character and a string, just like one would
|
||||
expect. seq_escape() is like seq_puts(), except that any character in s
|
||||
which is in the string esc will be represented in octal form in the output.
|
||||
|
||||
There are also a pair of functions for printing filenames:
|
||||
There are also a pair of functions for printing filenames::
|
||||
|
||||
int seq_path(struct seq_file *m, const struct path *path,
|
||||
const char *esc);
|
||||
@ -226,8 +234,10 @@ the path relative to the current process's filesystem root. If a different
|
||||
root is desired, it can be used with seq_path_root(). If it turns out that
|
||||
path cannot be reached from root, seq_path_root() returns SEQ_SKIP.
|
||||
|
||||
A function producing complicated output may want to check
|
||||
A function producing complicated output may want to check::
|
||||
|
||||
bool seq_has_overflowed(struct seq_file *m);
|
||||
|
||||
and avoid further seq_<output> calls if true is returned.
|
||||
|
||||
A true return from seq_has_overflowed means that the seq_file buffer will
|
||||
@ -236,6 +246,7 @@ buffer and retry printing.
|
||||
|
||||
|
||||
Making it all work
|
||||
==================
|
||||
|
||||
So far, we have a nice set of functions which can produce output within the
|
||||
seq_file system, but we have not yet turned them into a file that a user
|
||||
@ -244,7 +255,7 @@ creation of a set of file_operations which implement the operations on that
|
||||
file. The seq_file interface provides a set of canned operations which do
|
||||
most of the work. The virtual file author still must implement the open()
|
||||
method, however, to hook everything up. The open function is often a single
|
||||
line, as in the example module:
|
||||
line, as in the example module::
|
||||
|
||||
static int ct_open(struct inode *inode, struct file *file)
|
||||
{
|
||||
@ -263,7 +274,7 @@ by the iterator functions.
|
||||
There is also a wrapper function to seq_open() called seq_open_private(). It
|
||||
kmallocs a zero filled block of memory and stores a pointer to it in the
|
||||
private field of the seq_file structure, returning 0 on success. The
|
||||
block size is specified in a third parameter to the function, e.g.:
|
||||
block size is specified in a third parameter to the function, e.g.::
|
||||
|
||||
static int ct_open(struct inode *inode, struct file *file)
|
||||
{
|
||||
@ -273,7 +284,7 @@ block size is specified in a third parameter to the function, e.g.:
|
||||
|
||||
There is also a variant function, __seq_open_private(), which is functionally
|
||||
identical except that, if successful, it returns the pointer to the allocated
|
||||
memory block, allowing further initialisation e.g.:
|
||||
memory block, allowing further initialisation e.g.::
|
||||
|
||||
static int ct_open(struct inode *inode, struct file *file)
|
||||
{
|
||||
@ -295,7 +306,7 @@ frees the memory allocated in the corresponding open.
|
||||
|
||||
The other operations of interest - read(), llseek(), and release() - are
|
||||
all implemented by the seq_file code itself. So a virtual file's
|
||||
file_operations structure will look like:
|
||||
file_operations structure will look like::
|
||||
|
||||
static const struct file_operations ct_file_ops = {
|
||||
.owner = THIS_MODULE,
|
||||
@ -309,7 +320,7 @@ There is also a seq_release_private() which passes the contents of the
|
||||
seq_file private field to kfree() before releasing the structure.
|
||||
|
||||
The final step is the creation of the /proc file itself. In the example
|
||||
code, that is done in the initialization code in the usual way:
|
||||
code, that is done in the initialization code in the usual way::
|
||||
|
||||
static int ct_init(void)
|
||||
{
|
||||
@ -325,9 +336,10 @@ And that is pretty much it.
|
||||
|
||||
|
||||
seq_list
|
||||
========
|
||||
|
||||
If your file will be iterating through a linked list, you may find these
|
||||
routines useful:
|
||||
routines useful::
|
||||
|
||||
struct list_head *seq_list_start(struct list_head *head,
|
||||
loff_t pos);
|
||||
@ -338,15 +350,16 @@ routines useful:
|
||||
|
||||
These helpers will interpret pos as a position within the list and iterate
|
||||
accordingly. Your start() and next() functions need only invoke the
|
||||
seq_list_* helpers with a pointer to the appropriate list_head structure.
|
||||
``seq_list_*`` helpers with a pointer to the appropriate list_head structure.
|
||||
|
||||
|
||||
The extra-simple version
|
||||
========================
|
||||
|
||||
For extremely simple virtual files, there is an even easier interface. A
|
||||
module can define only the show() function, which should create all the
|
||||
output that the virtual file will contain. The file's open() method then
|
||||
calls:
|
||||
calls::
|
||||
|
||||
int single_open(struct file *file,
|
||||
int (*show)(struct seq_file *m, void *p),
|
@ -1,7 +1,10 @@
|
||||
Shared Subtrees
|
||||
---------------
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
Contents:
|
||||
===============
|
||||
Shared Subtrees
|
||||
===============
|
||||
|
||||
.. Contents:
|
||||
1) Overview
|
||||
2) Features
|
||||
3) Setting mount states
|
||||
@ -41,31 +44,38 @@ replicas continue to be exactly same.
|
||||
|
||||
Here is an example:
|
||||
|
||||
Let's say /mnt has a mount that is shared.
|
||||
mount --make-shared /mnt
|
||||
Let's say /mnt has a mount that is shared::
|
||||
|
||||
mount --make-shared /mnt
|
||||
|
||||
Note: mount(8) command now supports the --make-shared flag,
|
||||
so the sample 'smount' program is no longer needed and has been
|
||||
removed.
|
||||
|
||||
# mount --bind /mnt /tmp
|
||||
::
|
||||
|
||||
# mount --bind /mnt /tmp
|
||||
|
||||
The above command replicates the mount at /mnt to the mountpoint /tmp
|
||||
and the contents of both the mounts remain identical.
|
||||
|
||||
#ls /mnt
|
||||
a b c
|
||||
::
|
||||
|
||||
#ls /tmp
|
||||
a b c
|
||||
#ls /mnt
|
||||
a b c
|
||||
|
||||
Now let's say we mount a device at /tmp/a
|
||||
# mount /dev/sd0 /tmp/a
|
||||
#ls /tmp
|
||||
a b c
|
||||
|
||||
#ls /tmp/a
|
||||
t1 t2 t3
|
||||
Now let's say we mount a device at /tmp/a::
|
||||
|
||||
#ls /mnt/a
|
||||
t1 t2 t3
|
||||
# mount /dev/sd0 /tmp/a
|
||||
|
||||
#ls /tmp/a
|
||||
t1 t2 t3
|
||||
|
||||
#ls /mnt/a
|
||||
t1 t2 t3
|
||||
|
||||
Note that the mount has propagated to the mount at /mnt as well.
|
||||
|
||||
@ -123,14 +133,15 @@ replicas continue to be exactly same.
|
||||
|
||||
2d) A unbindable mount is a unbindable private mount
|
||||
|
||||
let's say we have a mount at /mnt and we make it unbindable
|
||||
let's say we have a mount at /mnt and we make it unbindable::
|
||||
|
||||
# mount --make-unbindable /mnt
|
||||
# mount --make-unbindable /mnt
|
||||
|
||||
Let's try to bind mount this mount somewhere else.
|
||||
# mount --bind /mnt /tmp
|
||||
mount: wrong fs type, bad option, bad superblock on /mnt,
|
||||
or too many mounted file systems
|
||||
Let's try to bind mount this mount somewhere else::
|
||||
|
||||
# mount --bind /mnt /tmp
|
||||
mount: wrong fs type, bad option, bad superblock on /mnt,
|
||||
or too many mounted file systems
|
||||
|
||||
Binding a unbindable mount is a invalid operation.
|
||||
|
||||
@ -138,12 +149,12 @@ replicas continue to be exactly same.
|
||||
3) Setting mount states
|
||||
|
||||
The mount command (util-linux package) can be used to set mount
|
||||
states:
|
||||
states::
|
||||
|
||||
mount --make-shared mountpoint
|
||||
mount --make-slave mountpoint
|
||||
mount --make-private mountpoint
|
||||
mount --make-unbindable mountpoint
|
||||
mount --make-shared mountpoint
|
||||
mount --make-slave mountpoint
|
||||
mount --make-private mountpoint
|
||||
mount --make-unbindable mountpoint
|
||||
|
||||
|
||||
4) Use cases
|
||||
@ -154,9 +165,10 @@ replicas continue to be exactly same.
|
||||
|
||||
Solution:
|
||||
|
||||
The system administrator can make the mount at /cdrom shared
|
||||
mount --bind /cdrom /cdrom
|
||||
mount --make-shared /cdrom
|
||||
The system administrator can make the mount at /cdrom shared::
|
||||
|
||||
mount --bind /cdrom /cdrom
|
||||
mount --make-shared /cdrom
|
||||
|
||||
Now any process that clones off a new namespace will have a
|
||||
mount at /cdrom which is a replica of the same mount in the
|
||||
@ -172,14 +184,14 @@ replicas continue to be exactly same.
|
||||
Solution:
|
||||
|
||||
To begin with, the administrator can mark the entire mount tree
|
||||
as shareable.
|
||||
as shareable::
|
||||
|
||||
mount --make-rshared /
|
||||
mount --make-rshared /
|
||||
|
||||
A new process can clone off a new namespace. And mark some part
|
||||
of its namespace as slave
|
||||
of its namespace as slave::
|
||||
|
||||
mount --make-rslave /myprivatetree
|
||||
mount --make-rslave /myprivatetree
|
||||
|
||||
Hence forth any mounts within the /myprivatetree done by the
|
||||
process will not show up in any other namespace. However mounts
|
||||
@ -206,13 +218,13 @@ replicas continue to be exactly same.
|
||||
versions of the file depending on the path used to access that
|
||||
file.
|
||||
|
||||
An example is:
|
||||
An example is::
|
||||
|
||||
mount --make-shared /
|
||||
mount --rbind / /view/v1
|
||||
mount --rbind / /view/v2
|
||||
mount --rbind / /view/v3
|
||||
mount --rbind / /view/v4
|
||||
mount --make-shared /
|
||||
mount --rbind / /view/v1
|
||||
mount --rbind / /view/v2
|
||||
mount --rbind / /view/v3
|
||||
mount --rbind / /view/v4
|
||||
|
||||
and if /usr has a versioning filesystem mounted, then that
|
||||
mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
|
||||
@ -224,8 +236,8 @@ replicas continue to be exactly same.
|
||||
filesystem is being requested and return the corresponding
|
||||
inode.
|
||||
|
||||
5) Detailed semantics:
|
||||
-------------------
|
||||
5) Detailed semantics
|
||||
---------------------
|
||||
The section below explains the detailed semantics of
|
||||
bind, rbind, move, mount, umount and clone-namespace operations.
|
||||
|
||||
@ -235,6 +247,7 @@ replicas continue to be exactly same.
|
||||
5a) Mount states
|
||||
|
||||
A given mount can be in one of the following states
|
||||
|
||||
1) shared
|
||||
2) slave
|
||||
3) shared and slave
|
||||
@ -252,7 +265,8 @@ replicas continue to be exactly same.
|
||||
A 'shared mount' is defined as a vfsmount that belongs to a
|
||||
'peer group'.
|
||||
|
||||
For example:
|
||||
For example::
|
||||
|
||||
mount --make-shared /mnt
|
||||
mount --bind /mnt /tmp
|
||||
|
||||
@ -270,7 +284,7 @@ replicas continue to be exactly same.
|
||||
A slave mount as the name implies has a master mount from which
|
||||
mount/unmount events are received. Events do not propagate from
|
||||
the slave mount to the master. Only a shared mount can be made
|
||||
a slave by executing the following command
|
||||
a slave by executing the following command::
|
||||
|
||||
mount --make-slave mount
|
||||
|
||||
@ -290,8 +304,10 @@ replicas continue to be exactly same.
|
||||
peer group.
|
||||
|
||||
Only a slave vfsmount can be made as 'shared and slave' by
|
||||
either executing the following command
|
||||
either executing the following command::
|
||||
|
||||
mount --make-shared mount
|
||||
|
||||
or by moving the slave vfsmount under a shared vfsmount.
|
||||
|
||||
(4) Private mount
|
||||
@ -307,30 +323,32 @@ replicas continue to be exactly same.
|
||||
|
||||
|
||||
State diagram:
|
||||
|
||||
The state diagram below explains the state transition of a mount,
|
||||
in response to various commands.
|
||||
------------------------------------------------------------------------
|
||||
| |make-shared | make-slave | make-private |make-unbindab|
|
||||
--------------|------------|--------------|--------------|-------------|
|
||||
|shared |shared |*slave/private| private | unbindable |
|
||||
| | | | | |
|
||||
|-------------|------------|--------------|--------------|-------------|
|
||||
|slave |shared | **slave | private | unbindable |
|
||||
| |and slave | | | |
|
||||
|-------------|------------|--------------|--------------|-------------|
|
||||
|shared |shared | slave | private | unbindable |
|
||||
|and slave |and slave | | | |
|
||||
|-------------|------------|--------------|--------------|-------------|
|
||||
|private |shared | **private | private | unbindable |
|
||||
|-------------|------------|--------------|--------------|-------------|
|
||||
|unbindable |shared |**unbindable | private | unbindable |
|
||||
------------------------------------------------------------------------
|
||||
in response to various commands::
|
||||
|
||||
* if the shared mount is the only mount in its peer group, making it
|
||||
slave, makes it private automatically. Note that there is no master to
|
||||
which it can be slaved to.
|
||||
-----------------------------------------------------------------------
|
||||
| |make-shared | make-slave | make-private |make-unbindab|
|
||||
--------------|------------|--------------|--------------|-------------|
|
||||
|shared |shared |*slave/private| private | unbindable |
|
||||
| | | | | |
|
||||
|-------------|------------|--------------|--------------|-------------|
|
||||
|slave |shared | **slave | private | unbindable |
|
||||
| |and slave | | | |
|
||||
|-------------|------------|--------------|--------------|-------------|
|
||||
|shared |shared | slave | private | unbindable |
|
||||
|and slave |and slave | | | |
|
||||
|-------------|------------|--------------|--------------|-------------|
|
||||
|private |shared | **private | private | unbindable |
|
||||
|-------------|------------|--------------|--------------|-------------|
|
||||
|unbindable |shared |**unbindable | private | unbindable |
|
||||
------------------------------------------------------------------------
|
||||
|
||||
** slaving a non-shared mount has no effect on the mount.
|
||||
* if the shared mount is the only mount in its peer group, making it
|
||||
slave, makes it private automatically. Note that there is no master to
|
||||
which it can be slaved to.
|
||||
|
||||
** slaving a non-shared mount has no effect on the mount.
|
||||
|
||||
Apart from the commands listed below, the 'move' operation also changes
|
||||
the state of a mount depending on type of the destination mount. Its
|
||||
@ -338,31 +356,32 @@ replicas continue to be exactly same.
|
||||
|
||||
5b) Bind semantics
|
||||
|
||||
Consider the following command
|
||||
Consider the following command::
|
||||
|
||||
mount --bind A/a B/b
|
||||
mount --bind A/a B/b
|
||||
|
||||
where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
|
||||
is the destination mount and 'b' is the dentry in the destination mount.
|
||||
|
||||
The outcome depends on the type of mount of 'A' and 'B'. The table
|
||||
below contains quick reference.
|
||||
---------------------------------------------------------------------------
|
||||
| BIND MOUNT OPERATION |
|
||||
|**************************************************************************
|
||||
|source(A)->| shared | private | slave | unbindable |
|
||||
| dest(B) | | | | |
|
||||
| | | | | | |
|
||||
| v | | | | |
|
||||
|**************************************************************************
|
||||
| shared | shared | shared | shared & slave | invalid |
|
||||
| | | | | |
|
||||
|non-shared| shared | private | slave | invalid |
|
||||
***************************************************************************
|
||||
below contains quick reference::
|
||||
|
||||
--------------------------------------------------------------------------
|
||||
| BIND MOUNT OPERATION |
|
||||
|************************************************************************|
|
||||
|source(A)->| shared | private | slave | unbindable |
|
||||
| dest(B) | | | | |
|
||||
| | | | | | |
|
||||
| v | | | | |
|
||||
|************************************************************************|
|
||||
| shared | shared | shared | shared & slave | invalid |
|
||||
| | | | | |
|
||||
|non-shared| shared | private | slave | invalid |
|
||||
**************************************************************************
|
||||
|
||||
Details:
|
||||
|
||||
1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
|
||||
1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
|
||||
which is clone of 'A', is created. Its root dentry is 'a' . 'C' is
|
||||
mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
|
||||
are created and mounted at the dentry 'b' on all mounts where 'B'
|
||||
@ -371,7 +390,7 @@ replicas continue to be exactly same.
|
||||
'B'. And finally the peer-group of 'C' is merged with the peer group
|
||||
of 'A'.
|
||||
|
||||
2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
|
||||
2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
|
||||
which is clone of 'A', is created. Its root dentry is 'a'. 'C' is
|
||||
mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
|
||||
are created and mounted at the dentry 'b' on all mounts where 'B'
|
||||
@ -379,7 +398,7 @@ replicas continue to be exactly same.
|
||||
'C', 'C1', .., 'Cn' with exactly the same configuration as the
|
||||
propagation tree for 'B'.
|
||||
|
||||
3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
|
||||
3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
|
||||
mount 'C' which is clone of 'A', is created. Its root dentry is 'a' .
|
||||
'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
|
||||
'C3' ... are created and mounted at the dentry 'b' on all mounts where
|
||||
@ -389,19 +408,19 @@ replicas continue to be exactly same.
|
||||
is made the slave of mount 'Z'. In other words, mount 'C' is in the
|
||||
state 'slave and shared'.
|
||||
|
||||
4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
|
||||
4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
|
||||
invalid operation.
|
||||
|
||||
5. 'A' is a private mount and 'B' is a non-shared(private or slave or
|
||||
5. 'A' is a private mount and 'B' is a non-shared(private or slave or
|
||||
unbindable) mount. A new mount 'C' which is clone of 'A', is created.
|
||||
Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
|
||||
|
||||
6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
|
||||
6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
|
||||
which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
|
||||
mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
|
||||
peer-group of 'A'.
|
||||
|
||||
7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
|
||||
7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
|
||||
new mount 'C' which is a clone of 'A' is created. Its root dentry is
|
||||
'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
|
||||
slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
|
||||
@ -409,7 +428,7 @@ replicas continue to be exactly same.
|
||||
mount/unmount on 'A' do not propagate anywhere else. Similarly
|
||||
mount/unmount on 'C' do not propagate anywhere else.
|
||||
|
||||
8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
|
||||
8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
|
||||
invalid operation. A unbindable mount cannot be bind mounted.
|
||||
|
||||
5c) Rbind semantics
|
||||
@ -422,7 +441,9 @@ replicas continue to be exactly same.
|
||||
then the subtree under the unbindable mount is pruned in the new
|
||||
location.
|
||||
|
||||
eg: let's say we have the following mount tree.
|
||||
eg:
|
||||
|
||||
let's say we have the following mount tree::
|
||||
|
||||
A
|
||||
/ \
|
||||
@ -430,12 +451,12 @@ replicas continue to be exactly same.
|
||||
/ \ / \
|
||||
D E F G
|
||||
|
||||
Let's say all the mount except the mount C in the tree are
|
||||
of a type other than unbindable.
|
||||
Let's say all the mount except the mount C in the tree are
|
||||
of a type other than unbindable.
|
||||
|
||||
If this tree is rbound to say Z
|
||||
If this tree is rbound to say Z
|
||||
|
||||
We will have the following tree at the new location.
|
||||
We will have the following tree at the new location::
|
||||
|
||||
Z
|
||||
|
|
||||
@ -457,24 +478,26 @@ replicas continue to be exactly same.
|
||||
the dentry in the destination mount.
|
||||
|
||||
The outcome depends on the type of the mount of 'A' and 'B'. The table
|
||||
below is a quick reference.
|
||||
---------------------------------------------------------------------------
|
||||
| MOVE MOUNT OPERATION |
|
||||
|**************************************************************************
|
||||
| source(A)->| shared | private | slave | unbindable |
|
||||
| dest(B) | | | | |
|
||||
| | | | | | |
|
||||
| v | | | | |
|
||||
|**************************************************************************
|
||||
| shared | shared | shared |shared and slave| invalid |
|
||||
| | | | | |
|
||||
|non-shared| shared | private | slave | unbindable |
|
||||
***************************************************************************
|
||||
NOTE: moving a mount residing under a shared mount is invalid.
|
||||
below is a quick reference::
|
||||
|
||||
---------------------------------------------------------------------------
|
||||
| MOVE MOUNT OPERATION |
|
||||
|**************************************************************************
|
||||
| source(A)->| shared | private | slave | unbindable |
|
||||
| dest(B) | | | | |
|
||||
| | | | | | |
|
||||
| v | | | | |
|
||||
|**************************************************************************
|
||||
| shared | shared | shared |shared and slave| invalid |
|
||||
| | | | | |
|
||||
|non-shared| shared | private | slave | unbindable |
|
||||
***************************************************************************
|
||||
|
||||
.. Note:: moving a mount residing under a shared mount is invalid.
|
||||
|
||||
Details follow:
|
||||
|
||||
1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
|
||||
1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
|
||||
mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
|
||||
are created and mounted at dentry 'b' on all mounts that receive
|
||||
propagation from mount 'B'. A new propagation tree is created in the
|
||||
@ -483,7 +506,7 @@ replicas continue to be exactly same.
|
||||
propagation tree is appended to the already existing propagation tree
|
||||
of 'A'.
|
||||
|
||||
2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
|
||||
2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
|
||||
mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An'
|
||||
are created and mounted at dentry 'b' on all mounts that receive
|
||||
propagation from mount 'B'. The mount 'A' becomes a shared mount and a
|
||||
@ -491,7 +514,7 @@ replicas continue to be exactly same.
|
||||
'B'. This new propagation tree contains all the new mounts 'A1',
|
||||
'A2'... 'An'.
|
||||
|
||||
3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
|
||||
3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
|
||||
mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
|
||||
'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
|
||||
receive propagation from mount 'B'. A new propagation tree is created
|
||||
@ -501,32 +524,32 @@ replicas continue to be exactly same.
|
||||
'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
|
||||
becomes 'shared'.
|
||||
|
||||
4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
|
||||
4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
|
||||
is invalid. Because mounting anything on the shared mount 'B' can
|
||||
create new mounts that get mounted on the mounts that receive
|
||||
propagation from 'B'. And since the mount 'A' is unbindable, cloning
|
||||
it to mount at other mountpoints is not possible.
|
||||
|
||||
5. 'A' is a private mount and 'B' is a non-shared(private or slave or
|
||||
5. 'A' is a private mount and 'B' is a non-shared(private or slave or
|
||||
unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
|
||||
|
||||
6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
|
||||
6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
|
||||
is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
|
||||
shared mount.
|
||||
|
||||
7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
|
||||
7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
|
||||
The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
|
||||
continues to be a slave mount of mount 'Z'.
|
||||
|
||||
8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
|
||||
8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
|
||||
'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
|
||||
unbindable mount.
|
||||
|
||||
5e) Mount semantics
|
||||
|
||||
Consider the following command
|
||||
Consider the following command::
|
||||
|
||||
mount device B/b
|
||||
mount device B/b
|
||||
|
||||
'B' is the destination mount and 'b' is the dentry in the destination
|
||||
mount.
|
||||
@ -537,9 +560,9 @@ replicas continue to be exactly same.
|
||||
|
||||
5f) Unmount semantics
|
||||
|
||||
Consider the following command
|
||||
Consider the following command::
|
||||
|
||||
umount A
|
||||
umount A
|
||||
|
||||
where 'A' is a mount mounted on mount 'B' at dentry 'b'.
|
||||
|
||||
@ -592,10 +615,12 @@ replicas continue to be exactly same.
|
||||
|
||||
A. What is the result of the following command sequence?
|
||||
|
||||
mount --bind /mnt /mnt
|
||||
mount --make-shared /mnt
|
||||
mount --bind /mnt /tmp
|
||||
mount --move /tmp /mnt/1
|
||||
::
|
||||
|
||||
mount --bind /mnt /mnt
|
||||
mount --make-shared /mnt
|
||||
mount --bind /mnt /tmp
|
||||
mount --move /tmp /mnt/1
|
||||
|
||||
what should be the contents of /mnt /mnt/1 /mnt/1/1 should be?
|
||||
Should they all be identical? or should /mnt and /mnt/1 be
|
||||
@ -604,23 +629,27 @@ replicas continue to be exactly same.
|
||||
|
||||
B. What is the result of the following command sequence?
|
||||
|
||||
mount --make-rshared /
|
||||
mkdir -p /v/1
|
||||
mount --rbind / /v/1
|
||||
::
|
||||
|
||||
mount --make-rshared /
|
||||
mkdir -p /v/1
|
||||
mount --rbind / /v/1
|
||||
|
||||
what should be the content of /v/1/v/1 be?
|
||||
|
||||
|
||||
C. What is the result of the following command sequence?
|
||||
|
||||
mount --bind /mnt /mnt
|
||||
mount --make-shared /mnt
|
||||
mkdir -p /mnt/1/2/3 /mnt/1/test
|
||||
mount --bind /mnt/1 /tmp
|
||||
mount --make-slave /mnt
|
||||
mount --make-shared /mnt
|
||||
mount --bind /mnt/1/2 /tmp1
|
||||
mount --make-slave /mnt
|
||||
::
|
||||
|
||||
mount --bind /mnt /mnt
|
||||
mount --make-shared /mnt
|
||||
mkdir -p /mnt/1/2/3 /mnt/1/test
|
||||
mount --bind /mnt/1 /tmp
|
||||
mount --make-slave /mnt
|
||||
mount --make-shared /mnt
|
||||
mount --bind /mnt/1/2 /tmp1
|
||||
mount --make-slave /mnt
|
||||
|
||||
At this point we have the first mount at /tmp and
|
||||
its root dentry is 1. Let's call this mount 'A'
|
||||
@ -668,7 +697,8 @@ replicas continue to be exactly same.
|
||||
|
||||
step 1:
|
||||
let's say the root tree has just two directories with
|
||||
one vfsmount.
|
||||
one vfsmount::
|
||||
|
||||
root
|
||||
/ \
|
||||
tmp usr
|
||||
@ -676,14 +706,17 @@ replicas continue to be exactly same.
|
||||
And we want to replicate the tree at multiple
|
||||
mountpoints under /root/tmp
|
||||
|
||||
step2:
|
||||
mount --make-shared /root
|
||||
step 2:
|
||||
::
|
||||
|
||||
mkdir -p /tmp/m1
|
||||
|
||||
mount --rbind /root /tmp/m1
|
||||
mount --make-shared /root
|
||||
|
||||
the new tree now looks like this:
|
||||
mkdir -p /tmp/m1
|
||||
|
||||
mount --rbind /root /tmp/m1
|
||||
|
||||
the new tree now looks like this::
|
||||
|
||||
root
|
||||
/ \
|
||||
@ -697,11 +730,13 @@ replicas continue to be exactly same.
|
||||
|
||||
it has two vfsmounts
|
||||
|
||||
step3:
|
||||
step 3:
|
||||
::
|
||||
|
||||
mkdir -p /tmp/m2
|
||||
mount --rbind /root /tmp/m2
|
||||
|
||||
the new tree now looks like this:
|
||||
the new tree now looks like this::
|
||||
|
||||
root
|
||||
/ \
|
||||
@ -724,6 +759,7 @@ replicas continue to be exactly same.
|
||||
it has 6 vfsmounts
|
||||
|
||||
step 4:
|
||||
::
|
||||
mkdir -p /tmp/m3
|
||||
mount --rbind /root /tmp/m3
|
||||
|
||||
@ -740,7 +776,8 @@ replicas continue to be exactly same.
|
||||
|
||||
step 1:
|
||||
let's say the root tree has just two directories with
|
||||
one vfsmount.
|
||||
one vfsmount::
|
||||
|
||||
root
|
||||
/ \
|
||||
tmp usr
|
||||
@ -748,17 +785,20 @@ replicas continue to be exactly same.
|
||||
How do we set up the same tree at multiple locations under
|
||||
/root/tmp
|
||||
|
||||
step2:
|
||||
mount --bind /root/tmp /root/tmp
|
||||
step 2:
|
||||
::
|
||||
|
||||
mount --make-rshared /root
|
||||
mount --make-unbindable /root/tmp
|
||||
|
||||
mkdir -p /tmp/m1
|
||||
mount --bind /root/tmp /root/tmp
|
||||
|
||||
mount --rbind /root /tmp/m1
|
||||
mount --make-rshared /root
|
||||
mount --make-unbindable /root/tmp
|
||||
|
||||
the new tree now looks like this:
|
||||
mkdir -p /tmp/m1
|
||||
|
||||
mount --rbind /root /tmp/m1
|
||||
|
||||
the new tree now looks like this::
|
||||
|
||||
root
|
||||
/ \
|
||||
@ -768,11 +808,13 @@ replicas continue to be exactly same.
|
||||
/ \
|
||||
tmp usr
|
||||
|
||||
step3:
|
||||
step 3:
|
||||
::
|
||||
|
||||
mkdir -p /tmp/m2
|
||||
mount --rbind /root /tmp/m2
|
||||
|
||||
the new tree now looks like this:
|
||||
the new tree now looks like this::
|
||||
|
||||
root
|
||||
/ \
|
||||
@ -782,12 +824,13 @@ replicas continue to be exactly same.
|
||||
/ \ / \
|
||||
tmp usr tmp usr
|
||||
|
||||
step4:
|
||||
step 4:
|
||||
::
|
||||
|
||||
mkdir -p /tmp/m3
|
||||
mount --rbind /root /tmp/m3
|
||||
|
||||
the new tree now looks like this:
|
||||
the new tree now looks like this::
|
||||
|
||||
root
|
||||
/ \
|
||||
@ -801,25 +844,31 @@ replicas continue to be exactly same.
|
||||
|
||||
8A) Datastructure
|
||||
|
||||
4 new fields are introduced to struct vfsmount
|
||||
->mnt_share
|
||||
->mnt_slave_list
|
||||
->mnt_slave
|
||||
->mnt_master
|
||||
4 new fields are introduced to struct vfsmount:
|
||||
|
||||
->mnt_share links together all the mount to/from which this vfsmount
|
||||
* ->mnt_share
|
||||
* ->mnt_slave_list
|
||||
* ->mnt_slave
|
||||
* ->mnt_master
|
||||
|
||||
->mnt_share
|
||||
links together all the mount to/from which this vfsmount
|
||||
send/receives propagation events.
|
||||
|
||||
->mnt_slave_list links all the mounts to which this vfsmount propagates
|
||||
->mnt_slave_list
|
||||
links all the mounts to which this vfsmount propagates
|
||||
to.
|
||||
|
||||
->mnt_slave links together all the slaves that its master vfsmount
|
||||
->mnt_slave
|
||||
links together all the slaves that its master vfsmount
|
||||
propagates to.
|
||||
|
||||
->mnt_master points to the master vfsmount from which this vfsmount
|
||||
->mnt_master
|
||||
points to the master vfsmount from which this vfsmount
|
||||
receives propagation.
|
||||
|
||||
->mnt_flags takes two more flags to indicate the propagation status of
|
||||
->mnt_flags
|
||||
takes two more flags to indicate the propagation status of
|
||||
the vfsmount. MNT_SHARE indicates that the vfsmount is a shared
|
||||
vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be
|
||||
replicated.
|
||||
@ -842,7 +891,7 @@ replicas continue to be exactly same.
|
||||
|
||||
A example propagation tree looks as shown in the figure below.
|
||||
[ NOTE: Though it looks like a forest, if we consider all the shared
|
||||
mounts as a conceptual entity called 'pnode', it becomes a tree]
|
||||
mounts as a conceptual entity called 'pnode', it becomes a tree]::
|
||||
|
||||
|
||||
A <--> B <--> C <---> D
|
||||
@ -864,14 +913,19 @@ replicas continue to be exactly same.
|
||||
A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
|
||||
|
||||
E's ->mnt_share links with ->mnt_share of K
|
||||
'E', 'K', 'F', 'G' have their ->mnt_master point to struct
|
||||
vfsmount of 'A'
|
||||
|
||||
'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
|
||||
|
||||
'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
|
||||
|
||||
K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
|
||||
|
||||
C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
|
||||
|
||||
J and K's ->mnt_master points to struct vfsmount of C
|
||||
|
||||
and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
|
||||
|
||||
'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
|
||||
|
||||
|
||||
@ -903,6 +957,7 @@ replicas continue to be exactly same.
|
||||
Prepare phase:
|
||||
|
||||
for each mount in the source tree:
|
||||
|
||||
a) Create the necessary number of mount trees to
|
||||
be attached to each of the mounts that receive
|
||||
propagation from the destination mount.
|
||||
@ -929,11 +984,12 @@ replicas continue to be exactly same.
|
||||
Abort phase
|
||||
delete all the newly created trees.
|
||||
|
||||
NOTE: all the propagation related functionality resides in the file
|
||||
pnode.c
|
||||
.. Note::
|
||||
all the propagation related functionality resides in the file pnode.c
|
||||
|
||||
|
||||
------------------------------------------------------------------------
|
||||
|
||||
version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com)
|
||||
|
||||
version 0.2 (Incorporated comments from Al Viro)
|
13
Documentation/filesystems/spufs/index.rst
Normal file
13
Documentation/filesystems/spufs/index.rst
Normal file
@ -0,0 +1,13 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==============
|
||||
SPU Filesystem
|
||||
==============
|
||||
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
spufs
|
||||
spu_create
|
||||
spu_run
|
131
Documentation/filesystems/spufs/spu_create.rst
Normal file
131
Documentation/filesystems/spufs/spu_create.rst
Normal file
@ -0,0 +1,131 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
==========
|
||||
spu_create
|
||||
==========
|
||||
|
||||
Name
|
||||
====
|
||||
spu_create - create a new spu context
|
||||
|
||||
|
||||
Synopsis
|
||||
========
|
||||
|
||||
::
|
||||
|
||||
#include <sys/types.h>
|
||||
#include <sys/spu.h>
|
||||
|
||||
int spu_create(const char *pathname, int flags, mode_t mode);
|
||||
|
||||
Description
|
||||
===========
|
||||
The spu_create system call is used on PowerPC machines that implement
|
||||
the Cell Broadband Engine Architecture in order to access Synergistic
|
||||
Processor Units (SPUs). It creates a new logical context for an SPU in
|
||||
pathname and returns a handle to associated with it. pathname must
|
||||
point to a non-existing directory in the mount point of the SPU file
|
||||
system (spufs). When spu_create is successful, a directory gets cre-
|
||||
ated on pathname and it is populated with files.
|
||||
|
||||
The returned file handle can only be passed to spu_run(2) or closed,
|
||||
other operations are not defined on it. When it is closed, all associ-
|
||||
ated directory entries in spufs are removed. When the last file handle
|
||||
pointing either inside of the context directory or to this file
|
||||
descriptor is closed, the logical SPU context is destroyed.
|
||||
|
||||
The parameter flags can be zero or any bitwise or'd combination of the
|
||||
following constants:
|
||||
|
||||
SPU_RAWIO
|
||||
Allow mapping of some of the hardware registers of the SPU into
|
||||
user space. This flag requires the CAP_SYS_RAWIO capability, see
|
||||
capabilities(7).
|
||||
|
||||
The mode parameter specifies the permissions used for creating the new
|
||||
directory in spufs. mode is modified with the user's umask(2) value
|
||||
and then used for both the directory and the files contained in it. The
|
||||
file permissions mask out some more bits of mode because they typically
|
||||
support only read or write access. See stat(2) for a full list of the
|
||||
possible mode values.
|
||||
|
||||
|
||||
Return Value
|
||||
============
|
||||
spu_create returns a new file descriptor. It may return -1 to indicate
|
||||
an error condition and set errno to one of the error codes listed
|
||||
below.
|
||||
|
||||
|
||||
Errors
|
||||
======
|
||||
EACCES
|
||||
The current user does not have write access on the spufs mount
|
||||
point.
|
||||
|
||||
EEXIST An SPU context already exists at the given path name.
|
||||
|
||||
EFAULT pathname is not a valid string pointer in the current address
|
||||
space.
|
||||
|
||||
EINVAL pathname is not a directory in the spufs mount point.
|
||||
|
||||
ELOOP Too many symlinks were found while resolving pathname.
|
||||
|
||||
EMFILE The process has reached its maximum open file limit.
|
||||
|
||||
ENAMETOOLONG
|
||||
pathname was too long.
|
||||
|
||||
ENFILE The system has reached the global open file limit.
|
||||
|
||||
ENOENT Part of pathname could not be resolved.
|
||||
|
||||
ENOMEM The kernel could not allocate all resources required.
|
||||
|
||||
ENOSPC There are not enough SPU resources available to create a new
|
||||
context or the user specific limit for the number of SPU con-
|
||||
texts has been reached.
|
||||
|
||||
ENOSYS the functionality is not provided by the current system, because
|
||||
either the hardware does not provide SPUs or the spufs module is
|
||||
not loaded.
|
||||
|
||||
ENOTDIR
|
||||
A part of pathname is not a directory.
|
||||
|
||||
|
||||
|
||||
Notes
|
||||
=====
|
||||
spu_create is meant to be used from libraries that implement a more
|
||||
abstract interface to SPUs, not to be used from regular applications.
|
||||
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
|
||||
ommended libraries.
|
||||
|
||||
|
||||
Files
|
||||
=====
|
||||
pathname must point to a location beneath the mount point of spufs. By
|
||||
convention, it gets mounted in /spu.
|
||||
|
||||
|
||||
Conforming to
|
||||
=============
|
||||
This call is Linux specific and only implemented by the ppc64 architec-
|
||||
ture. Programs using this system call are not portable.
|
||||
|
||||
|
||||
Bugs
|
||||
====
|
||||
The code does not yet fully implement all features lined out here.
|
||||
|
||||
|
||||
Author
|
||||
======
|
||||
Arnd Bergmann <arndb@de.ibm.com>
|
||||
|
||||
See Also
|
||||
========
|
||||
capabilities(7), close(2), spu_run(2), spufs(7)
|
138
Documentation/filesystems/spufs/spu_run.rst
Normal file
138
Documentation/filesystems/spufs/spu_run.rst
Normal file
@ -0,0 +1,138 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=======
|
||||
spu_run
|
||||
=======
|
||||
|
||||
|
||||
Name
|
||||
====
|
||||
spu_run - execute an spu context
|
||||
|
||||
|
||||
Synopsis
|
||||
========
|
||||
|
||||
::
|
||||
|
||||
#include <sys/spu.h>
|
||||
|
||||
int spu_run(int fd, unsigned int *npc, unsigned int *event);
|
||||
|
||||
Description
|
||||
===========
|
||||
The spu_run system call is used on PowerPC machines that implement the
|
||||
Cell Broadband Engine Architecture in order to access Synergistic Pro-
|
||||
cessor Units (SPUs). It uses the fd that was returned from spu_cre-
|
||||
ate(2) to address a specific SPU context. When the context gets sched-
|
||||
uled to a physical SPU, it starts execution at the instruction pointer
|
||||
passed in npc.
|
||||
|
||||
Execution of SPU code happens synchronously, meaning that spu_run does
|
||||
not return while the SPU is still running. If there is a need to exe-
|
||||
cute SPU code in parallel with other code on either the main CPU or
|
||||
other SPUs, you need to create a new thread of execution first, e.g.
|
||||
using the pthread_create(3) call.
|
||||
|
||||
When spu_run returns, the current value of the SPU instruction pointer
|
||||
is written back to npc, so you can call spu_run again without updating
|
||||
the pointers.
|
||||
|
||||
event can be a NULL pointer or point to an extended status code that
|
||||
gets filled when spu_run returns. It can be one of the following con-
|
||||
stants:
|
||||
|
||||
SPE_EVENT_DMA_ALIGNMENT
|
||||
A DMA alignment error
|
||||
|
||||
SPE_EVENT_SPE_DATA_SEGMENT
|
||||
A DMA segmentation error
|
||||
|
||||
SPE_EVENT_SPE_DATA_STORAGE
|
||||
A DMA storage error
|
||||
|
||||
If NULL is passed as the event argument, these errors will result in a
|
||||
signal delivered to the calling process.
|
||||
|
||||
Return Value
|
||||
============
|
||||
spu_run returns the value of the spu_status register or -1 to indicate
|
||||
an error and set errno to one of the error codes listed below. The
|
||||
spu_status register value contains a bit mask of status codes and
|
||||
optionally a 14 bit code returned from the stop-and-signal instruction
|
||||
on the SPU. The bit masks for the status codes are:
|
||||
|
||||
0x02
|
||||
SPU was stopped by stop-and-signal.
|
||||
|
||||
0x04
|
||||
SPU was stopped by halt.
|
||||
|
||||
0x08
|
||||
SPU is waiting for a channel.
|
||||
|
||||
0x10
|
||||
SPU is in single-step mode.
|
||||
|
||||
0x20
|
||||
SPU has tried to execute an invalid instruction.
|
||||
|
||||
0x40
|
||||
SPU has tried to access an invalid channel.
|
||||
|
||||
0x3fff0000
|
||||
The bits masked with this value contain the code returned from
|
||||
stop-and-signal.
|
||||
|
||||
There are always one or more of the lower eight bits set or an error
|
||||
code is returned from spu_run.
|
||||
|
||||
Errors
|
||||
======
|
||||
EAGAIN or EWOULDBLOCK
|
||||
fd is in non-blocking mode and spu_run would block.
|
||||
|
||||
EBADF fd is not a valid file descriptor.
|
||||
|
||||
EFAULT npc is not a valid pointer or status is neither NULL nor a valid
|
||||
pointer.
|
||||
|
||||
EINTR A signal occurred while spu_run was in progress. The npc value
|
||||
has been updated to the new program counter value if necessary.
|
||||
|
||||
EINVAL fd is not a file descriptor returned from spu_create(2).
|
||||
|
||||
ENOMEM Insufficient memory was available to handle a page fault result-
|
||||
ing from an MFC direct memory access.
|
||||
|
||||
ENOSYS the functionality is not provided by the current system, because
|
||||
either the hardware does not provide SPUs or the spufs module is
|
||||
not loaded.
|
||||
|
||||
|
||||
Notes
|
||||
=====
|
||||
spu_run is meant to be used from libraries that implement a more
|
||||
abstract interface to SPUs, not to be used from regular applications.
|
||||
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
|
||||
ommended libraries.
|
||||
|
||||
|
||||
Conforming to
|
||||
=============
|
||||
This call is Linux specific and only implemented by the ppc64 architec-
|
||||
ture. Programs using this system call are not portable.
|
||||
|
||||
|
||||
Bugs
|
||||
====
|
||||
The code does not yet fully implement all features lined out here.
|
||||
|
||||
|
||||
Author
|
||||
======
|
||||
Arnd Bergmann <arndb@de.ibm.com>
|
||||
|
||||
See Also
|
||||
========
|
||||
capabilities(7), close(2), spu_create(2), spufs(7)
|
@ -1,12 +1,18 @@
|
||||
SPUFS(2) Linux Programmer's Manual SPUFS(2)
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=====
|
||||
spufs
|
||||
=====
|
||||
|
||||
Name
|
||||
====
|
||||
|
||||
NAME
|
||||
spufs - the SPU file system
|
||||
|
||||
|
||||
DESCRIPTION
|
||||
Description
|
||||
===========
|
||||
|
||||
The SPU file system is used on PowerPC machines that implement the Cell
|
||||
Broadband Engine Architecture in order to access Synergistic Processor
|
||||
Units (SPUs).
|
||||
@ -21,7 +27,9 @@ DESCRIPTION
|
||||
ally add or remove files.
|
||||
|
||||
|
||||
MOUNT OPTIONS
|
||||
Mount Options
|
||||
=============
|
||||
|
||||
uid=<uid>
|
||||
set the user owning the mount point, the default is 0 (root).
|
||||
|
||||
@ -29,7 +37,9 @@ MOUNT OPTIONS
|
||||
set the group owning the mount point, the default is 0 (root).
|
||||
|
||||
|
||||
FILES
|
||||
Files
|
||||
=====
|
||||
|
||||
The files in spufs mostly follow the standard behavior for regular sys-
|
||||
tem calls like read(2) or write(2), but often support only a subset of
|
||||
the operations supported on regular file systems. This list details the
|
||||
@ -125,14 +135,12 @@ FILES
|
||||
space is available for writing.
|
||||
|
||||
|
||||
/mbox_stat
|
||||
/ibox_stat
|
||||
/wbox_stat
|
||||
/mbox_stat, /ibox_stat, /wbox_stat
|
||||
Read-only files that contain the length of the current queue, i.e. how
|
||||
many words can be read from mbox or ibox or how many words can be
|
||||
written to wbox without blocking. The files can be read only in 4-byte
|
||||
units and return a big-endian binary integer number. The possible
|
||||
operations on an open *box_stat file are:
|
||||
operations on an open ``*box_stat`` file are:
|
||||
|
||||
read(2)
|
||||
If a count smaller than four is requested, read returns -1 and
|
||||
@ -143,12 +151,7 @@ FILES
|
||||
in EAGAIN.
|
||||
|
||||
|
||||
/npc
|
||||
/decr
|
||||
/decr_status
|
||||
/spu_tag_mask
|
||||
/event_mask
|
||||
/srr0
|
||||
/npc, /decr, /decr_status, /spu_tag_mask, /event_mask, /srr0
|
||||
Internal registers of the SPU. The representation is an ASCII string
|
||||
with the numeric value of the next instruction to be executed. These
|
||||
can be used in read/write mode for debugging, but normal operation of
|
||||
@ -157,17 +160,14 @@ FILES
|
||||
|
||||
The contents of these files are:
|
||||
|
||||
=================== ===================================
|
||||
npc Next Program Counter
|
||||
|
||||
decr SPU Decrementer
|
||||
|
||||
decr_status Decrementer Status
|
||||
|
||||
spu_tag_mask MFC tag mask for SPU DMA
|
||||
|
||||
event_mask Event mask for SPU interrupts
|
||||
|
||||
srr0 Interrupt Return address register
|
||||
=================== ===================================
|
||||
|
||||
|
||||
The possible operations on an open npc, decr, decr_status,
|
||||
@ -206,8 +206,7 @@ FILES
|
||||
from the data buffer, updating the value of the fpcr register.
|
||||
|
||||
|
||||
/signal1
|
||||
/signal2
|
||||
/signal1, /signal2
|
||||
The two signal notification channels of an SPU. These are read-write
|
||||
files that operate on a 32 bit word. Writing to one of these files
|
||||
triggers an interrupt on the SPU. The value written to the signal
|
||||
@ -233,8 +232,7 @@ FILES
|
||||
file.
|
||||
|
||||
|
||||
/signal1_type
|
||||
/signal2_type
|
||||
/signal1_type, /signal2_type
|
||||
These two files change the behavior of the signal1 and signal2 notifi-
|
||||
cation files. The contain a numerical ASCII string which is read as
|
||||
either "1" or "0". In mode 0 (overwrite), the hardware replaces the
|
||||
@ -259,263 +257,17 @@ FILES
|
||||
the previous setting.
|
||||
|
||||
|
||||
EXAMPLES
|
||||
Examples
|
||||
========
|
||||
/etc/fstab entry
|
||||
none /spu spufs gid=spu 0 0
|
||||
|
||||
|
||||
AUTHORS
|
||||
Authors
|
||||
=======
|
||||
Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
|
||||
Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
|
||||
|
||||
SEE ALSO
|
||||
See Also
|
||||
========
|
||||
capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
|
||||
|
||||
|
||||
|
||||
Linux 2005-09-28 SPUFS(2)
|
||||
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2)
|
||||
|
||||
|
||||
|
||||
NAME
|
||||
spu_run - execute an spu context
|
||||
|
||||
|
||||
SYNOPSIS
|
||||
#include <sys/spu.h>
|
||||
|
||||
int spu_run(int fd, unsigned int *npc, unsigned int *event);
|
||||
|
||||
DESCRIPTION
|
||||
The spu_run system call is used on PowerPC machines that implement the
|
||||
Cell Broadband Engine Architecture in order to access Synergistic Pro-
|
||||
cessor Units (SPUs). It uses the fd that was returned from spu_cre-
|
||||
ate(2) to address a specific SPU context. When the context gets sched-
|
||||
uled to a physical SPU, it starts execution at the instruction pointer
|
||||
passed in npc.
|
||||
|
||||
Execution of SPU code happens synchronously, meaning that spu_run does
|
||||
not return while the SPU is still running. If there is a need to exe-
|
||||
cute SPU code in parallel with other code on either the main CPU or
|
||||
other SPUs, you need to create a new thread of execution first, e.g.
|
||||
using the pthread_create(3) call.
|
||||
|
||||
When spu_run returns, the current value of the SPU instruction pointer
|
||||
is written back to npc, so you can call spu_run again without updating
|
||||
the pointers.
|
||||
|
||||
event can be a NULL pointer or point to an extended status code that
|
||||
gets filled when spu_run returns. It can be one of the following con-
|
||||
stants:
|
||||
|
||||
SPE_EVENT_DMA_ALIGNMENT
|
||||
A DMA alignment error
|
||||
|
||||
SPE_EVENT_SPE_DATA_SEGMENT
|
||||
A DMA segmentation error
|
||||
|
||||
SPE_EVENT_SPE_DATA_STORAGE
|
||||
A DMA storage error
|
||||
|
||||
If NULL is passed as the event argument, these errors will result in a
|
||||
signal delivered to the calling process.
|
||||
|
||||
RETURN VALUE
|
||||
spu_run returns the value of the spu_status register or -1 to indicate
|
||||
an error and set errno to one of the error codes listed below. The
|
||||
spu_status register value contains a bit mask of status codes and
|
||||
optionally a 14 bit code returned from the stop-and-signal instruction
|
||||
on the SPU. The bit masks for the status codes are:
|
||||
|
||||
0x02 SPU was stopped by stop-and-signal.
|
||||
|
||||
0x04 SPU was stopped by halt.
|
||||
|
||||
0x08 SPU is waiting for a channel.
|
||||
|
||||
0x10 SPU is in single-step mode.
|
||||
|
||||
0x20 SPU has tried to execute an invalid instruction.
|
||||
|
||||
0x40 SPU has tried to access an invalid channel.
|
||||
|
||||
0x3fff0000
|
||||
The bits masked with this value contain the code returned from
|
||||
stop-and-signal.
|
||||
|
||||
There are always one or more of the lower eight bits set or an error
|
||||
code is returned from spu_run.
|
||||
|
||||
ERRORS
|
||||
EAGAIN or EWOULDBLOCK
|
||||
fd is in non-blocking mode and spu_run would block.
|
||||
|
||||
EBADF fd is not a valid file descriptor.
|
||||
|
||||
EFAULT npc is not a valid pointer or status is neither NULL nor a valid
|
||||
pointer.
|
||||
|
||||
EINTR A signal occurred while spu_run was in progress. The npc value
|
||||
has been updated to the new program counter value if necessary.
|
||||
|
||||
EINVAL fd is not a file descriptor returned from spu_create(2).
|
||||
|
||||
ENOMEM Insufficient memory was available to handle a page fault result-
|
||||
ing from an MFC direct memory access.
|
||||
|
||||
ENOSYS the functionality is not provided by the current system, because
|
||||
either the hardware does not provide SPUs or the spufs module is
|
||||
not loaded.
|
||||
|
||||
|
||||
NOTES
|
||||
spu_run is meant to be used from libraries that implement a more
|
||||
abstract interface to SPUs, not to be used from regular applications.
|
||||
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
|
||||
ommended libraries.
|
||||
|
||||
|
||||
CONFORMING TO
|
||||
This call is Linux specific and only implemented by the ppc64 architec-
|
||||
ture. Programs using this system call are not portable.
|
||||
|
||||
|
||||
BUGS
|
||||
The code does not yet fully implement all features lined out here.
|
||||
|
||||
|
||||
AUTHOR
|
||||
Arnd Bergmann <arndb@de.ibm.com>
|
||||
|
||||
SEE ALSO
|
||||
capabilities(7), close(2), spu_create(2), spufs(7)
|
||||
|
||||
|
||||
|
||||
Linux 2005-09-28 SPU_RUN(2)
|
||||
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2)
|
||||
|
||||
|
||||
|
||||
NAME
|
||||
spu_create - create a new spu context
|
||||
|
||||
|
||||
SYNOPSIS
|
||||
#include <sys/types.h>
|
||||
#include <sys/spu.h>
|
||||
|
||||
int spu_create(const char *pathname, int flags, mode_t mode);
|
||||
|
||||
DESCRIPTION
|
||||
The spu_create system call is used on PowerPC machines that implement
|
||||
the Cell Broadband Engine Architecture in order to access Synergistic
|
||||
Processor Units (SPUs). It creates a new logical context for an SPU in
|
||||
pathname and returns a handle to associated with it. pathname must
|
||||
point to a non-existing directory in the mount point of the SPU file
|
||||
system (spufs). When spu_create is successful, a directory gets cre-
|
||||
ated on pathname and it is populated with files.
|
||||
|
||||
The returned file handle can only be passed to spu_run(2) or closed,
|
||||
other operations are not defined on it. When it is closed, all associ-
|
||||
ated directory entries in spufs are removed. When the last file handle
|
||||
pointing either inside of the context directory or to this file
|
||||
descriptor is closed, the logical SPU context is destroyed.
|
||||
|
||||
The parameter flags can be zero or any bitwise or'd combination of the
|
||||
following constants:
|
||||
|
||||
SPU_RAWIO
|
||||
Allow mapping of some of the hardware registers of the SPU into
|
||||
user space. This flag requires the CAP_SYS_RAWIO capability, see
|
||||
capabilities(7).
|
||||
|
||||
The mode parameter specifies the permissions used for creating the new
|
||||
directory in spufs. mode is modified with the user's umask(2) value
|
||||
and then used for both the directory and the files contained in it. The
|
||||
file permissions mask out some more bits of mode because they typically
|
||||
support only read or write access. See stat(2) for a full list of the
|
||||
possible mode values.
|
||||
|
||||
|
||||
RETURN VALUE
|
||||
spu_create returns a new file descriptor. It may return -1 to indicate
|
||||
an error condition and set errno to one of the error codes listed
|
||||
below.
|
||||
|
||||
|
||||
ERRORS
|
||||
EACCES
|
||||
The current user does not have write access on the spufs mount
|
||||
point.
|
||||
|
||||
EEXIST An SPU context already exists at the given path name.
|
||||
|
||||
EFAULT pathname is not a valid string pointer in the current address
|
||||
space.
|
||||
|
||||
EINVAL pathname is not a directory in the spufs mount point.
|
||||
|
||||
ELOOP Too many symlinks were found while resolving pathname.
|
||||
|
||||
EMFILE The process has reached its maximum open file limit.
|
||||
|
||||
ENAMETOOLONG
|
||||
pathname was too long.
|
||||
|
||||
ENFILE The system has reached the global open file limit.
|
||||
|
||||
ENOENT Part of pathname could not be resolved.
|
||||
|
||||
ENOMEM The kernel could not allocate all resources required.
|
||||
|
||||
ENOSPC There are not enough SPU resources available to create a new
|
||||
context or the user specific limit for the number of SPU con-
|
||||
texts has been reached.
|
||||
|
||||
ENOSYS the functionality is not provided by the current system, because
|
||||
either the hardware does not provide SPUs or the spufs module is
|
||||
not loaded.
|
||||
|
||||
ENOTDIR
|
||||
A part of pathname is not a directory.
|
||||
|
||||
|
||||
|
||||
NOTES
|
||||
spu_create is meant to be used from libraries that implement a more
|
||||
abstract interface to SPUs, not to be used from regular applications.
|
||||
See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
|
||||
ommended libraries.
|
||||
|
||||
|
||||
FILES
|
||||
pathname must point to a location beneath the mount point of spufs. By
|
||||
convention, it gets mounted in /spu.
|
||||
|
||||
|
||||
CONFORMING TO
|
||||
This call is Linux specific and only implemented by the ppc64 architec-
|
||||
ture. Programs using this system call are not portable.
|
||||
|
||||
|
||||
BUGS
|
||||
The code does not yet fully implement all features lined out here.
|
||||
|
||||
|
||||
AUTHOR
|
||||
Arnd Bergmann <arndb@de.ibm.com>
|
||||
|
||||
SEE ALSO
|
||||
capabilities(7), close(2), spu_run(2), spufs(7)
|
||||
|
||||
|
||||
|
||||
Linux 2005-09-28 SPU_CREATE(2)
|
@ -1,8 +1,11 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
============================================
|
||||
Accessing PCI device resources through sysfs
|
||||
--------------------------------------------
|
||||
============================================
|
||||
|
||||
sysfs, usually mounted at /sys, provides access to PCI resources on platforms
|
||||
that support it. For example, a given bus might look like this:
|
||||
that support it. For example, a given bus might look like this::
|
||||
|
||||
/sys/devices/pci0000:17
|
||||
|-- 0000:17:00.0
|
||||
@ -30,8 +33,9 @@ This bus contains a single function device in slot 0. The domain and bus
|
||||
numbers are reproduced for convenience. Under the device directory are several
|
||||
files, each with their own function.
|
||||
|
||||
=================== =====================================================
|
||||
file function
|
||||
---- --------
|
||||
=================== =====================================================
|
||||
class PCI class (ascii, ro)
|
||||
config PCI config space (binary, rw)
|
||||
device PCI device (ascii, ro)
|
||||
@ -40,13 +44,16 @@ files, each with their own function.
|
||||
local_cpus nearby CPU mask (cpumask, ro)
|
||||
remove remove device from kernel's list (ascii, wo)
|
||||
resource PCI resource host addresses (ascii, ro)
|
||||
resource0..N PCI resource N, if present (binary, mmap, rw[1])
|
||||
resource0..N PCI resource N, if present (binary, mmap, rw\ [1]_)
|
||||
resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap)
|
||||
revision PCI revision (ascii, ro)
|
||||
rom PCI ROM resource, if present (binary, ro)
|
||||
subsystem_device PCI subsystem device (ascii, ro)
|
||||
subsystem_vendor PCI subsystem vendor (ascii, ro)
|
||||
vendor PCI vendor (ascii, ro)
|
||||
=================== =====================================================
|
||||
|
||||
::
|
||||
|
||||
ro - read only file
|
||||
rw - file is readable and writable
|
||||
@ -56,7 +63,7 @@ files, each with their own function.
|
||||
binary - file contains binary data
|
||||
cpumask - file contains a cpumask type
|
||||
|
||||
[1] rw for RESOURCE_IO (I/O port) regions only
|
||||
.. [1] rw for RESOURCE_IO (I/O port) regions only
|
||||
|
||||
The read only files are informational, writes to them will be ignored, with
|
||||
the exception of the 'rom' file. Writable files can be used to perform
|
||||
@ -67,11 +74,11 @@ don't support mmapping of certain resources, so be sure to check the return
|
||||
value from any attempted mmap. The most notable of these are I/O port
|
||||
resources, which also provide read/write access.
|
||||
|
||||
The 'enable' file provides a counter that indicates how many times the device
|
||||
The 'enable' file provides a counter that indicates how many times the device
|
||||
has been enabled. If the 'enable' file currently returns '4', and a '1' is
|
||||
echoed into it, it will then return '5'. Echoing a '0' into it will decrease
|
||||
the count. Even when it returns to 0, though, some of the initialisation
|
||||
may not be reversed.
|
||||
may not be reversed.
|
||||
|
||||
The 'rom' file is special in that it provides read-only access to the device's
|
||||
ROM file, if available. It's disabled by default, however, so applications
|
||||
@ -93,7 +100,7 @@ Accessing legacy resources through sysfs
|
||||
|
||||
Legacy I/O port and ISA memory resources are also provided in sysfs if the
|
||||
underlying platform supports them. They're located in the PCI class hierarchy,
|
||||
e.g.
|
||||
e.g.::
|
||||
|
||||
/sys/class/pci_bus/0000:17/
|
||||
|-- bridge -> ../../../devices/pci0000:17
|
Some files were not shown because too many files have changed in this diff Show More
Loading…
Reference in New Issue
Block a user